GPU scheduling in Kubernetes is the practice of placing GPU workloads onto accelerator nodes so that as few of those nodes as possible are running and each one stays busy. Because a single GPU node can cost many times as much as a CPU node, every idle accelerator is a large and visible waste. The default behavior treats a GPU as a whole-number resource that one pod claims entirely, which is simple but leaves most clusters paying for silicon that sits at low utilization. Getting GPU cost under control means changing how GPUs are requested, shared, sourced and scaled.
This article sits in our Kubernetes and container cost cluster. For the full map, start with the complete guide to Kubernetes cost optimization, the pillar this piece links up to. GPU work also leans heavily on cheap, interruptible capacity, which is covered in how to use spot instances for Kubernetes workloads.
Why the default GPU request wastes money
By default the device plugin model exposes a GPU as an integer resource, so a pod that asks for one GPU gets one whole device even if it uses a fraction of it. Inference services, notebooks and small training jobs frequently use a sliver of a card, yet each holds an entire accelerator for its whole lifetime. Multiply that across a team and you are running ten GPU nodes to do the work of two. The first question on any GPU bill is not how fast the cards are but how much of each card is actually busy, and that number is usually low.
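The arithmetic behind that claim can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the per-pod peak fractions, the 20% headroom and the function names are all hypothetical, and it assumes pods could share cards perfectly, which real sharing modes only approximate.

```python
import math


def fleet_utilization(peak_fractions):
    """Average share of a card each pod actually uses when every pod holds a whole card."""
    if not peak_fractions:
        return 0.0
    return sum(peak_fractions) / len(peak_fractions)


def cards_needed(peak_fractions, headroom=0.2):
    """Lower bound on cards if pods could share, keeping `headroom` of each card free."""
    usable = 1.0 - headroom
    return math.ceil(sum(peak_fractions) / usable)


# Hypothetical team: ten pods, each peaking at 15% of a card.
pods = [0.15] * 10

print("cards held today:", len(pods))
print("average utilization:", round(fleet_utilization(pods), 2))
print("cards needed with sharing:", cards_needed(pods))
```

With these made-up inputs, ten whole-card requests shrink to two shared cards. Feed in measured peak utilization per pod to get a first estimate for a real fleet.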
Share a GPU across pods
Several mechanisms let more than one pod share a physical GPU. Time-slicing lets multiple pods be placed on one card, which then rotates between them. It suits bursty or low-duty inference and development workloads that never need the whole device at once, but it gives no hard isolation between tenants. Multi-Instance GPU (MIG), available on some newer data-center cards, partitions one physical GPU into several isolated slices with dedicated memory, giving harder guarantees than time-slicing. Both approaches turn one expensive card into capacity for several tenants, which is the single biggest GPU saving available to most teams. Verify which sharing modes your card generation and driver support, since they vary by hardware and change over time.
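As an illustration of what enabling time-slicing involves, the sketch below builds a sharing configuration and works out how many pods it lets the cluster place. It assumes the NVIDIA device plugin and the time-slicing configuration shape that plugin documents (`sharing.timeSlicing.resources` with a `replicas` count), and the helper names are hypothetical. Check the field names against the plugin version you run before using it.

```python
import json


def time_slicing_config(replicas):
    """Build a time-slicing config dict (assumed NVIDIA device plugin format)."""
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas per card")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                # Each physical GPU is advertised as `replicas` schedulable GPUs.
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }


def schedulable_slots(nodes, gpus_per_node, replicas):
    """How many one-GPU pods the scheduler can place after time-slicing."""
    return nodes * gpus_per_node * replicas


config = time_slicing_config(replicas=4)
print(json.dumps(config, indent=2))
print("slots:", schedulable_slots(nodes=2, gpus_per_node=1, replicas=4))
```

Pods that share a time-sliced card also share its memory, so the replica count should be chosen so that the combined memory of the co-located pods still fits on the device.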
Paying for GPUs that sit half-idle?
Our cost audit measures real GPU utilization, finds the cards running at a fraction of capacity, and puts sharing, spot sourcing and scale-to-zero in place so you run the same models on fewer accelerators. On the performance model, you pay only from realized savings. No savings, no fee.
Book a cloud cost audit →
Source GPUs from spot and interruptible pools
GPU capacity is available at a steep discount on spot and preemptible pools, which is decisive when the workload tolerates interruption. Batch training that checkpoints, hyperparameter sweeps and most offline inference fit well, because when a node is reclaimed the job simply resumes from its last checkpoint on replacement capacity. Latency-sensitive serving usually does not, so the pattern is to split the cluster: serve from a small on-demand or committed pool and run everything elastic on spot GPUs. The discount on accelerators is large enough that this split often saves more than any tuning of the workload itself.
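A rough way to size that split is to price the elastic share at the spot rate and add an allowance for work redone after interruptions. The sketch below does that. Every number in it (the hourly price, the 60% discount, the 10% rework overhead, the 70% elastic share) is a placeholder rather than a quoted provider rate, and the function name is hypothetical.

```python
def split_cost(gpu_hours, elastic_share, on_demand_price, spot_discount, rework_overhead):
    """Compare an all-on-demand bill with an on-demand plus spot split.

    elastic_share   -- fraction of GPU hours that tolerate interruption (0..1)
    spot_discount   -- fractional discount of spot vs on-demand (0..1)
    rework_overhead -- extra elastic hours spent redoing work lost to interruptions
    """
    baseline = gpu_hours * on_demand_price
    serving = gpu_hours * (1 - elastic_share) * on_demand_price
    spot_price = on_demand_price * (1 - spot_discount)
    elastic = gpu_hours * elastic_share * (1 + rework_overhead) * spot_price
    split = serving + elastic
    return {
        "baseline": baseline,
        "split": split,
        "saving_pct": (baseline - split) / baseline * 100 if baseline else 0.0,
    }


# Placeholder inputs only: substitute your own measured hours and quoted prices.
result = split_cost(
    gpu_hours=1000,
    elastic_share=0.7,
    on_demand_price=3.0,
    spot_discount=0.6,
    rework_overhead=0.1,
)
print({key: round(value, 2) for key, value in result.items()})
```

The rework overhead is the term to watch: infrequent checkpoints raise it and erode the discount, so checkpoint interval belongs in the same conversation as the spot price.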
Pack GPU pods and isolate them with taints
GPU nodes should be reserved for GPU work. Taint them so that ordinary CPU pods cannot land there and waste expensive capacity, and add the matching toleration only to workloads that genuinely need an accelerator. Then pack the GPU pods densely, the same discipline covered in bin packing across nodes, so partly used GPU nodes fill before new ones start. Matching node shape matters here too, which is the subject of rightsizing node pools and instance types, because a GPU node with too little CPU or memory will strand the accelerator behind a bottleneck.
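The taint and toleration pairing can be sketched as follows. The taint key and value, the pod names and the image are hypothetical choices for illustration, and `tolerates` is a simplified stand-in for the scheduler's matching rules rather than its actual implementation. The manifests are plain dictionaries printed as JSON, which `kubectl apply -f` also accepts.

```python
import json

# Hypothetical taint assumed to be applied to every GPU node.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def gpu_pod(name, image, gpus=1):
    """Pod manifest that requests GPUs and tolerates the GPU node taint."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
            "tolerations": [{
                "key": GPU_TAINT["key"],
                "operator": "Equal",
                "value": GPU_TAINT["value"],
                "effect": GPU_TAINT["effect"],
            }],
        },
    }


def tolerates(pod, taint):
    """Simplified check: does any toleration on the pod match the taint?"""
    for tol in pod["spec"].get("tolerations", []):
        key_ok = tol.get("key") == taint["key"]
        effect_ok = tol.get("effect") in (None, "", taint["effect"])
        if tol.get("operator", "Equal") == "Exists":
            value_ok = True
        else:
            value_ok = tol.get("value") == taint["value"]
        if key_ok and effect_ok and value_ok:
            return True
    return False


trainer = gpu_pod("trainer", "example.invalid/trainer:latest")  # placeholder image
web = {"spec": {"containers": [{"name": "web", "image": "example.invalid/web:latest"}]}}

print(json.dumps(trainer, indent=2))
print("trainer allowed on GPU node:", tolerates(trainer, GPU_TAINT))
print("web allowed on GPU node:", tolerates(web, GPU_TAINT))
```

A toleration only permits a pod to land on a tainted node. It does not steer the pod there, so the GPU resource request (or a node selector) is what actually directs the workload to the accelerator pool.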
Scale GPU pools to zero when idle
The most reliable GPU saving is to run no GPU nodes at all when there is no GPU work. Configure the GPU node pool to scale to zero so that an empty training queue costs nothing, and let the autoscaler add cards only when a job lands. For interactive workloads like notebooks, an idle timeout that releases the GPU after a period of inactivity prevents a forgotten session from holding a card overnight. These two settings, scale-to-zero and idle reclaim, remove the most common form of GPU waste, which is capacity left running with nothing to do.
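Both settings reduce to simple rules, sketched below. This is an illustration of the logic only, not how any particular cluster autoscaler or notebook platform implements it. The function names, the 30-minute timeout and the session names are hypothetical, and the node count is a lower bound that ignores fragmentation across nodes.

```python
import math
from datetime import datetime, timedelta


def desired_gpu_nodes(pending_gpu_requests, gpus_per_node, max_nodes):
    """Nodes to run for the pending GPU requests; zero when the queue is empty."""
    if not pending_gpu_requests:
        return 0
    needed = math.ceil(sum(pending_gpu_requests) / gpus_per_node)
    return min(needed, max_nodes)


def idle_sessions(last_activity, now, timeout):
    """Names of interactive sessions whose last activity is older than the timeout."""
    return sorted(name for name, seen in last_activity.items() if now - seen >= timeout)


now = datetime(2026, 5, 1, 9, 0)
sessions = {
    "notebook-a": now - timedelta(minutes=5),  # active recently, keeps its GPU
    "notebook-b": now - timedelta(hours=14),   # forgotten overnight, gets reclaimed
}

print(desired_gpu_nodes([], gpus_per_node=8, max_nodes=4))
print(desired_gpu_nodes([1, 1, 4], gpus_per_node=4, max_nodes=10))
print(idle_sessions(sessions, now, timeout=timedelta(minutes=30)))
```

The timeout is a trade-off between reclaiming cards quickly and interrupting someone who stepped away briefly, so it is worth setting per workload type rather than cluster-wide.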
| Lever | Best for | Watch out for |
|---|---|---|
| Time-slicing | Bursty inference, dev | No hard isolation |
| Multi-Instance GPU | Multi-tenant serving | Card support varies |
| Spot GPUs | Checkpointed training | Interruption handling |
| Taints plus packing | Mixed clusters | Stranded CPU or memory |
| Scale to zero | Batch and notebooks | Cold start latency |
GPU sharing modes, card generations and spot availability above reflect the major providers as of May 2026. Verify the current accelerator types, driver features and pricing in your provider's documentation before committing, as this area changes quickly.
The Kubernetes Cost Optimization Handbook includes the GPU utilization worksheet and the spot-versus-on-demand split model behind this article. It is the downloadable companion.
The short version
GPU scheduling decides GPU cost. Stop handing a whole card to a pod that needs a slice, share cards through time-slicing or Multi-Instance GPU, source elastic capacity from spot pools, taint and pack GPU nodes so they stay busy, and scale the pool to zero when the queue is empty. Done together these usually cut a GPU bill more than any model-level change. When you want it measured and implemented for you, that is what our rightsizing and waste elimination service delivers.