Spot GPUs cut AI training costs by up to 90 percent because the cloud sells spare accelerator capacity at a steep discount to on-demand, in exchange for the right to reclaim the instance when it needs the hardware back. Training jobs tolerate this well: with frequent checkpointing, a reclaimed run resumes from its last saved state instead of starting over, so you capture most of the discount while losing only minutes of work per interruption. The catch is engineering, not economics. You pay in fault tolerance, and the payoff is the cheapest GPU-hour you can buy.
This article is part of our series on AI, GPU, and ML costs. For the full picture, start with the complete guide to AI and GPU cost optimization. Spot capacity sits in the Cut step of our See, Cut, Lock, Run method: lower the rate per GPU-hour on the part of the workload that can absorb interruption.
Spot pricing reflects spare capacity, and GPU demand is spiky, so the gap between on-demand and spot is often wider on accelerators than on ordinary CPU instances. That is exactly why the lever matters most here: the discount is biggest on the most expensive compute you rent.
How spot GPU pricing works
Each major cloud sells interruptible GPU capacity under its own name: Spot Instances on AWS, Spot Virtual Machines on Azure, and Preemptible or Spot VMs on Google Cloud, with comparable preemptible options on OCI. The mechanics rhyme. You request a GPU instance at a discounted rate, you keep it while spare capacity exists, and when the provider needs it back you receive a short termination notice, typically on the order of seconds to a couple of minutes, before the instance is reclaimed. The discount fluctuates with supply and demand, and the deepest discounts usually sit on older or less contended GPU families and regions. Because these terms and the exact notice windows change, verify the current spot behavior and discount ranges against each provider's live documentation before you design around a specific number.
Checkpointing is what makes the discount safe
The single practice that turns spot from risky to routine is checkpointing: writing model weights, optimizer state, and training step to durable storage at a regular cadence so any interruption costs at most one checkpoint interval of recomputation. Save too rarely and a reclaim near the end of an epoch throws away expensive progress; save too often and checkpoint I/O eats into throughput. The sweet spot balances the cost of a lost interval against the overhead of writing state, and for long runs it usually lands at a few minutes. Pair checkpointing with automation that detects the termination notice, flushes a final checkpoint, and relaunches the job on fresh spot capacity, and a multi-day training run can ride through dozens of interruptions without manual intervention. This is the same discipline described in how to reduce GPU costs for AI training, the sibling article worth reading alongside this one.
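As a rough illustration of that practice, the sketch below pairs an atomic save with a resume-on-start loop. Everything in it is an assumption made for the example: the state is a small JSON dictionary standing in for real model weights and optimizer state (a real job would serialize those with its framework's own tools), and the function names are hypothetical. The `young_interval_s` helper uses Young's first-order approximation for the checkpoint interval, which is a standard rule of thumb and not something this article prescribes.

```python
import json
import math
import os
import tempfile


def young_interval_s(write_cost_s, mean_time_between_reclaims_s):
    """Young's approximation: interval = sqrt(2 * write cost * mean time between reclaims)."""
    return math.sqrt(2.0 * write_cost_s * mean_time_between_reclaims_s)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a reclaim mid-write
    never leaves a truncated checkpoint at `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, every_n_steps):
    """Resume from the last checkpoint and save every `every_n_steps` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % every_n_steps == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Because `train` always starts from whatever `load_checkpoint` returns, relaunching the same command after a reclaim is enough to resume. In practice `path` would point at durable storage that outlives the instance rather than its local disk.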
| Workload | Spot fit | Why |
|---|---|---|
| Long pretraining runs | Strong | Checkpointable, days long, discount compounds |
| Hyperparameter sweeps | Strong | Many independent jobs, each disposable |
| Fine-tuning | Good | Short and checkpointable, easy to resume |
| Latency-bound inference | Weak | Reclaim breaks the SLA; use on-demand or reserved |
| Single tight deadline | Weak | Interruptions add wall-clock risk |
The real trade-off: cost against wall-clock certainty
Frequent checkpointing keeps the compute lost to each interruption small, but spot makes the finish time less predictable. If your capacity is reclaimed and equivalent spot is briefly unavailable, the job waits. For research and batch training that flexibility is nearly free; for a run that must finish by a board demo on Friday, the uncertainty has a real cost. The mature pattern is a blend: run the bulk of fault-tolerant training on spot to capture the discount, and keep a smaller on-demand or reserved floor for the phases that cannot slip. That blend is also how you reason about steady baseline demand, covered in reserved and committed GPU capacity explained.
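To make the blend concrete, here is a small sketch of the effective rate per GPU-hour when part of the work runs on spot and the rest on an on-demand floor. The function name and every input are illustrative assumptions, not real prices or a recommended split.

```python
def blended_rate(on_demand_rate, spot_discount, spot_share):
    """Effective rate per GPU-hour for a spot and on-demand mix.

    on_demand_rate: on-demand price per GPU-hour (any currency)
    spot_discount:  fractional discount on spot, e.g. 0.7 for 70 percent
    spot_share:     fraction of GPU-hours that run on spot, 0.0 to 1.0
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return spot_share * spot_rate + (1.0 - spot_share) * on_demand_rate


# Placeholder inputs: 80 percent of GPU-hours on spot at a 70 percent discount.
rate = blended_rate(on_demand_rate=10.0, spot_discount=0.7, spot_share=0.8)
```

The floor raises the blended rate, so it pays to keep the floor no larger than the deadline-bound phases actually need.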
Paying on-demand rates for training that could run on spot?
Our cost audit identifies which of your training and sweep workloads are safely interruptible, builds the checkpointing and auto-resume pattern, and moves them to spot while sizing an on-demand floor for the deadline-bound work. On the performance model, you pay only from realized savings. No savings, no fee.
Book a cloud cost audit →
Getting spot capacity reliably
The practical worry with spot is not the discount but availability: can you actually get the GPUs when you want them? The tactics that raise your fill rate are diversification and flexibility. Request across multiple GPU instance types that fit your job rather than pinning to one, spread across availability zones, and let your orchestration fall back gracefully when a given pool is contended. Kubernetes and managed training services support this through node pools and capacity strategies that mix spot with an on-demand backstop, so a job keeps running even when spot is briefly scarce. Treat a single hardcoded instance type in one zone as the main reason spot feels unreliable, because it usually is.
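The diversification idea can be sketched independently of any provider. The example below builds an ordered list of candidate pools (every instance type in every zone on spot, then an on-demand backstop) and takes the first one that can be filled. It is a toy model under stated assumptions: `Pool`, `candidate_pools`, and `acquire` are hypothetical names, and `try_acquire` stands in for whatever provisioning call your orchestrator actually makes.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Pool:
    instance_type: str
    zone: str
    market: str  # "spot" or "on-demand"


def candidate_pools(instance_types: Sequence[str], zones: Sequence[str],
                    on_demand_backstop: bool = True) -> List[Pool]:
    """Every type in every zone on spot first, then an on-demand backstop."""
    pools = [Pool(t, z, "spot") for t, z in product(instance_types, zones)]
    if on_demand_backstop:
        # Assumption: the first listed type is the preferred on-demand fallback.
        pools += [Pool(instance_types[0], z, "on-demand") for z in zones]
    return pools


def acquire(pools: Sequence[Pool],
            try_acquire: Callable[[Pool], bool]) -> Optional[Pool]:
    """Return the first pool that can be filled, or None if all are contended."""
    for pool in pools:
        if try_acquire(pool):
            return pool
    return None
```

With two instance types and two zones, this yields four spot pools to try before any on-demand capacity is touched, compared with a single pool when the type and zone are hardcoded.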
Worked example
Consider a seven-day pretraining run on eight accelerators. That is 7 × 24 × 8 = 1,344 GPU-hours, and at on-demand rates the bill is that figure times the hourly on-demand price. Move that run to spot at, say, a 70 to 90 percent discount depending on family and region, add checkpointing every few minutes, and the same GPU-hours cost a fraction of the on-demand total. The interruptions you absorb cost only the recomputation since the last checkpoint, a rounding error against the discount. The saving is not hypothetical; it is the on-demand price minus the spot price, captured on every GPU-hour the job runs, which is why this is the single biggest rate lever on training compute.
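The arithmetic behind this example can be sketched in a few lines. The hourly rate, discount, interruption count, and checkpoint interval below are placeholders chosen for illustration, not quoted prices or measured reclaim rates, and the model assumes each interruption loses half a checkpoint interval on average across all GPUs in the job.

```python
def run_cost(days, gpus, on_demand_rate, spot_discount,
             interruptions, checkpoint_interval_min):
    """Compare on-demand and spot cost for one training run.

    Assumption: an interruption lands, on average, halfway through a
    checkpoint interval, so it wastes interval / 2 on every GPU.
    """
    gpu_hours = days * 24 * gpus
    on_demand_total = gpu_hours * on_demand_rate
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    lost_gpu_hours = interruptions * (checkpoint_interval_min / 60.0) / 2.0 * gpus
    spot_total = (gpu_hours + lost_gpu_hours) * spot_rate
    return {
        "gpu_hours": gpu_hours,
        "lost_gpu_hours": lost_gpu_hours,
        "on_demand_total": on_demand_total,
        "spot_total": spot_total,
        "saving": on_demand_total - spot_total,
    }


# Placeholder inputs: 10.0 per GPU-hour, 70 percent discount,
# 30 interruptions over the week, checkpoints every 5 minutes.
result = run_cost(days=7, gpus=8, on_demand_rate=10.0, spot_discount=0.7,
                  interruptions=30, checkpoint_interval_min=5)
```

With these placeholder inputs the recomputation comes to about 10 GPU-hours against 1,344 for the run itself, which is the sense in which interruptions are small next to the discount.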
The AI and GPU Cost Control Guide includes the spot-with-checkpointing reference pattern and the capacity-diversification config we deploy on engagements. It is the downloadable companion to this article.
The short version
Spot GPUs are the cheapest accelerator-hours available, discounted up to 90 percent because the cloud can reclaim them. Checkpointing makes that reclaim cheap to absorb, and capacity diversification makes spot reliable enough to depend on. Run fault-tolerant training and sweeps on spot, keep a smaller on-demand or reserved floor for deadline-bound and latency-bound work, and verify current spot behavior against each provider's documentation before standardizing. For the inference side of the bill, see inference cost optimization for large language models. When you want your training fleet moved to spot safely, that is exactly what our FinOps implementation service delivers.