Your monthly cloud bill has arrived, and… to be frank, you expected bigger savings. You analyze the bill and the usage logs and find the following:
- Your dev team is constantly creating new VMs, using them for a relatively short time, and then terminating them.
- There are VMs that are heavily used for several hours each day and then lie dormant until the next cycle; you find out that these VMs run some batch jobs for your analytics team.
- Your research team ran some calculations that required GPUs and lots of CPU power. The calculations ran for half a month, and the team is now waiting for new data so it can run them again next month.
You consider reserved instances (RIs), but the savings would be negligible. RIs require you to commit for at least a year, and these VMs will not be used that long, so you would end up paying for resources you never use. You complain about this on some online forums, and somebody suggests that you use ‘spot instances’ for these VMs. Wait, what?
What are spot instances?
When cloud providers design their infrastructure, they build in more hardware resources than they need or plan to sell. The reason is simple: you cannot expect hardware to work 100% of the time, so to achieve near-100% availability they need backup resources. Most of the time those resources sit unused, which is bad from a revenue standpoint. Similarly, hardware resources that have been sold to customers are not always used in full, which leaves some resources free at times. And ‘free’ in this context means that there’s a possibility of generating revenue.
Spot instances, as AWS calls them (‘spot VMs’ on Microsoft Azure and ‘preemptible instances’ on Google Cloud), are VMs that run on these surplus resources. Essentially, they are the same as on-demand instances, with one exception: the cloud provider’s compute engine may decide to terminate spot instances when the surplus resources are needed for other tasks. Customers cannot rely on spot instances being available, nor on the SLAs that apply to on-demand instances; in return, cloud providers sell them at discounts of up to 90%.
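To get a feel for what such a discount means for a VM that only runs a few hours a day, here is a rough back-of-the-envelope sketch. The hourly price, the 70% discount, and the four hours of daily usage are made-up numbers for illustration, not real provider prices.

```python
def monthly_cost(hourly_price: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of running one VM for the given hours per day over a billing period."""
    return hourly_price * hours_per_day * days


# Hypothetical numbers, chosen only for illustration.
ON_DEMAND_PRICE = 0.40  # USD per hour (assumed)
SPOT_DISCOUNT = 0.70    # 70% off (assumed); discounts can go up to 90%

spot_price = ON_DEMAND_PRICE * (1 - SPOT_DISCOUNT)

# A batch VM that is busy for four hours a day.
on_demand_bill = monthly_cost(ON_DEMAND_PRICE, hours_per_day=4)
spot_bill = monthly_cost(spot_price, hours_per_day=4)

print(f"on-demand: ${on_demand_bill:.2f}, spot: ${spot_bill:.2f}")
```

The actual discount varies by provider, region, and instance type, so plug in the numbers from your own bill.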
When can you run spot instances?
Since they run on surplus resources, you can run them only when there are surplus resources in the system. Surplus resources can become available at any time, but they are not guaranteed to exist, so you may be unable to start your spot instances because there is not enough surplus capacity. Surplus resources usually appear at night, on weekends, and on holidays, but this depends on data-centre capacity in the availability zone. Different instance types may also be in different demand, so if you choose a less common instance type for your spot instance, you’ll probably have a better chance of finding surplus resources when you need them.
Amazon Web Services and Microsoft Azure let a customer set a maximum price per hour for a requested resource. The cloud provider calculates a dynamic price based on the long-term usage of that resource, and this price fluctuates by the hour. When this price is lower than your maximum price limit and there are enough resources, your spot instance will run.
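The maximum-price rule can be sketched in a few lines. This is only a toy model of the rule as described above, not any provider's real pricing logic; the function name, the prices, and the choice to treat a price equal to the limit as acceptable are all assumptions.

```python
def spot_instance_can_run(current_price: float, max_price: float,
                          capacity_available: bool) -> bool:
    """A spot instance runs only if there is surplus capacity AND the
    current price does not exceed the customer's maximum price."""
    # Assumption: a price exactly equal to the limit still counts as acceptable.
    return capacity_available and current_price <= max_price


# Hypothetical hourly prices for one instance type (USD per hour).
hourly_prices = [0.10, 0.14, 0.19, 0.12]
MAX_PRICE = 0.15  # the customer's limit (assumed)

running = [spot_instance_can_run(price, MAX_PRICE, capacity_available=True)
           for price in hourly_prices]
print(running)
```

In the third hour the price climbs above the limit, so the instance would not run (or would be terminated) during that hour.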
When and how will compute engine terminate a spot instance?
You can terminate your spot instance yourself, of course. Be aware, however, that most providers will charge you for the full hour of usage.
As said earlier, the compute engine itself will terminate your spot instance when one of the following conditions is met:
- The compute engine requires a resource that your spot instance is using.
- In the case of AWS and Azure, the current resource price rises above your maximum price limit.
In both cases, the compute engine will notify your spot instance that it is scheduled to be terminated. AWS will give you a two-minute warning to shut down your calculations and prepare your spot instance for the shutdown. Microsoft Azure and Google Cloud will give you a 30-second warning, after which your instance will be terminated.
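Those warning windows are short, so it is worth checking whether your shutdown routine actually fits into them. Here is a small sketch that uses the warning times quoted above; the function name, the provider keys, and the five-second safety margin are assumptions made for illustration.

```python
# Warning windows in seconds, as quoted in the text above.
WARNING_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


def fits_in_warning_window(provider: str, shutdown_seconds: float,
                           safety_margin: float = 5.0) -> bool:
    """Return True if saving state and shutting down fits into the
    provider's termination warning, leaving a small safety margin."""
    return shutdown_seconds + safety_margin <= WARNING_SECONDS[provider]


# A job that needs about a minute to save its state (hypothetical).
for provider in WARNING_SECONDS:
    print(provider, fits_in_warning_window(provider, shutdown_seconds=60))
```

If the answer is no, the job should save its state periodically while it runs instead of relying on the warning alone.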
So, can we use spot instances for any kind of job?
Uhm… no. Or, more precisely, you can, but you shouldn’t. Spot instances are volatile, so the apps you plan to run on them have to be able to survive termination: they must be able to save their state when they receive the termination signal and later resume their calculations from the saved state. Mission-critical applications, and those that cannot be architected to save and resume, should not be run on spot instances.
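What "save and resume" looks like depends on the app, but the basic pattern is small. The sketch below assumes the termination notice reaches the process as a SIGTERM (for example, forwarded by an agent that watches for the provider's notice); the checkpoint path and the job itself, summing a list of numbers, are hypothetical.

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint location; a real job would use durable storage.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "spot_job_checkpoint.json")

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop does the rest."""
    global stop_requested
    stop_requested = True


def load_state(path=CHECKPOINT):
    """Resume from the checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temp file and rename, so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_job(items, path=CHECKPOINT):
    """Process items one by one, checkpointing after each of them."""
    state = load_state(path)
    for i in range(state["next_item"], len(items)):
        if stop_requested:
            break  # termination notice received: state is already saved
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_state(state, path)
    return state


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, request_stop)
    print(run_job(list(range(10))))
```

Because the state is saved after every item, a restarted instance simply picks up at `next_item` and loses at most the item that was in progress.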
What can be run on spot instances?
Spot instances are best for batch jobs and stateless apps.
Batch jobs are usually not time-critical (you don’t really care whether a batch job takes two hours or three) and can be saved and then resumed if need be. Combine this with the fact that batch jobs usually run after hours, and they are almost perfect candidates for spot instances.
Stateless apps, like image manipulation and stateless web services, can also be run on spot instances. Since you probably need these apps to be available full-time, the best approach is to combine on-demand (or reserved) instances with spot instances. For instance, you can have a ‘core’ of on-demand instances that provides the availability you need and use spot instances for scaling out when needed. If spot instances are, for some reason, unavailable at that moment, you can scale out using a group of on-demand machines instead.
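The ‘core plus spot, with on-demand fallback’ idea boils down to a simple allocation rule. This is a minimal sketch of that rule only; the function and key names are made up, and a real setup would express this through the provider's autoscaling configuration rather than hand-written code.

```python
def plan_scale_out(desired: int, core_on_demand: int, spot_available: int) -> dict:
    """Split the desired instance count into the always-on core,
    spot instances, and on-demand fallback for whatever spot cannot cover."""
    extra = max(0, desired - core_on_demand)  # capacity needed beyond the core
    spot = min(extra, spot_available)         # take as much spot as we can get
    fallback = extra - spot                   # cover the rest with on-demand
    return {
        "core_on_demand": core_on_demand,
        "spot": spot,
        "fallback_on_demand": fallback,
    }


# Plenty of spot capacity: scale out entirely on spot.
print(plan_scale_out(desired=10, core_on_demand=4, spot_available=6))
# Spot is scarce: the shortfall goes to on-demand machines.
print(plan_scale_out(desired=10, core_on_demand=4, spot_available=2))
```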
Testing and development machines, as well as CI/CD workflows, are also good candidates for spot instances as they usually can be terminated without consequences.
Big data and analytics workloads are also perfect candidates for spot instances, as they are usually run occasionally or periodically.
High-performance computing workloads are likewise typically run on a case-by-case basis, so they can benefit from the savings that spot instances provide.
What do you recommend?
First, analyze your apps and workloads and determine whether any jobs or apps are good candidates for running on spot instances. If you’re still in the design phase for a new app, try to design it so that it can use spot instances at least partially, since the savings can be huge.
Second, check the documentation and best practices for your specific cloud provider. For instance, AWS recommends using different instance types and emphasizes that older-generation virtual machine types can be a better choice, as their prices fluctuate less than those of newer instance types. Google recommends using non-standard machine types, as there is a better chance of finding surplus machines of such types than of standard types.
Third, use a combination of on-demand groups and spot instance groups, which can provide both the required availability and scaling under load at minimal cost.