How-to · Kubernetes · Updated May 2026

Rightsizing Kubernetes Requests and Limits

Rightsizing Kubernetes requests and limits is where most container bills are won or lost. This is the method we use to set pod requests from real usage so nodes pack tightly and cost falls, without triggering throttling or OOMKills.

Rightsizing Kubernetes requests and limits means setting each container's CPU and memory reservations to what its workload actually uses, rather than the round numbers a developer typed at creation. Requests drive scheduling and bin-packing, so inflated requests strand capacity on every node and force the cluster to run more nodes than it needs. Get the requests right and the same workloads fit on fewer, smaller nodes, which is where the saving lands.

This article is part of our complete guide to cloud rightsizing and waste elimination, the pillar page it links up to. Container rightsizing is the Kubernetes-specific version of the Cut step in our See, Cut, Lock, Run method. It follows the same logic as rightsizing compute at the VM level, except that the unit you size is the pod, not the instance.

Requests are the cost lever, not limits

The scheduler reserves capacity based on requests. A pod requesting 2 CPU and 4 GiB but using 0.2 CPU and 800 MiB still blocks that full reservation on its node. Cut requests to true usage plus headroom and the cluster bin-packs onto fewer nodes. Limits protect neighbors; requests control the bill.
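To make the arithmetic concrete, here is a minimal sketch of how requests alone decide how many pods fit on a node. The node size (16 CPU, 64 GiB allocatable), the replica count of 40, and the rightsized request of 300m CPU and 1 GiB are hypothetical values chosen for illustration; only the 2 CPU and 4 GiB starting request comes from the example above.

```python
import math

def pods_per_node(node_cpu_m, node_mem_mib, req_cpu_m, req_mem_mib):
    """Pods that fit on one node when only requests are counted.

    Units are millicores and MiB so the division stays in integers.
    """
    return min(node_cpu_m // req_cpu_m, node_mem_mib // req_mem_mib)

def nodes_needed(pod_count, per_node):
    """Nodes required to place pod_count identical pods."""
    return math.ceil(pod_count / per_node)

# Hypothetical node: 16 CPU and 64 GiB allocatable.
NODE_CPU_M, NODE_MEM_MIB = 16_000, 64 * 1024
PODS = 40  # hypothetical replica count

# Before: the padded request from the example (2 CPU, 4 GiB).
before = pods_per_node(NODE_CPU_M, NODE_MEM_MIB, 2_000, 4 * 1024)
# After: usage of 0.2 CPU / 800 MiB plus assumed headroom (300m, 1 GiB).
after = pods_per_node(NODE_CPU_M, NODE_MEM_MIB, 300, 1024)

nodes_before = nodes_needed(PODS, before)
nodes_after = nodes_needed(PODS, after)
print(before, after, nodes_before, nodes_after)
```

With these assumed numbers the same 40 pods go from needing five nodes to fitting on one, which is the effect the request change is meant to produce. Real clusters pack less perfectly because workloads are mixed and system pods take their share.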

Understand what requests and limits actually do

A request is a guarantee: the scheduler will only place a pod on a node whose unreserved CPU and memory cover it, and the reservation counts against the node for the pod's whole life. A limit is a ceiling: CPU use above the limit is throttled, and a container that exceeds its memory limit is OOMKilled. The two are set independently, and the gap between them is where most teams go wrong. Set requests too high and you waste capacity. Set memory limits too low and pods die under load. Set CPU limits aggressively and latency spikes from throttling even though the node has spare cores.

The cost mistake is almost always inflated requests. Developers copy a manifest, pad the numbers for safety, and never revisit. Multiply that padding across hundreds of pods and you are paying for a cluster two or three times larger than the work requires.

Step 1: Read actual usage over a real window

You cannot rightsize on a guess. Pull at least two weeks of per-container CPU and memory usage from your metrics stack so weekly cycles and batch jobs show up. Prometheus is the common source for that history; metrics-server only serves current values, so it cannot cover a two-week window on its own. Look at the working set for memory and the busy-period rate for CPU. The Vertical Pod Autoscaler in recommendation mode is a useful starting point: run it with updateMode set to "Off" so it only emits target requests rather than evicting pods, and treat its numbers as candidates to review, not commands.
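As a rough sketch of what pulling that window can look like, the helper below builds two PromQL queries for the 95th percentile of CPU rate and memory working set over 14 days. It assumes the cAdvisor metrics container_cpu_usage_seconds_total and container_memory_working_set_bytes are scraped into Prometheus with namespace and container labels; the function names and the "shop" and "api" values are hypothetical.

```python
def cpu_p95_query(namespace, container, window="14d", step="5m"):
    """PromQL for the p95 of the per-container CPU rate over the window.

    Uses a subquery so the rate is sampled every `step` across `window`.
    """
    selector = f'namespace="{namespace}",container="{container}"'
    return (
        "quantile_over_time(0.95, "
        f"rate(container_cpu_usage_seconds_total{{{selector}}}[{step}])"
        f"[{window}:{step}])"
    )

def memory_p95_query(namespace, container, window="14d"):
    """PromQL for the p95 of the memory working set over the window."""
    selector = f'namespace="{namespace}",container="{container}"'
    return (
        "quantile_over_time(0.95, "
        f"container_memory_working_set_bytes{{{selector}}}[{window}])"
    )

# Hypothetical namespace and container names.
cpu_q = cpu_p95_query("shop", "api")
mem_q = memory_p95_query("shop", "api")
print(cpu_q)
print(mem_q)
```

The CPU result is in cores and the memory result in bytes, so convert before comparing them with the millicore and MiB values in a manifest. If your label names differ, adjust the selector.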

Step 2: Size requests to the percentile, with headroom

Set CPU requests near the workload's typical busy-period usage, not its rare peak, because CPU is compressible and short spikes are absorbed by spare cores on the node. Set memory requests against a high percentile of the working set, the 95th or higher, because memory is not compressible and a pod that exceeds available memory is killed. A practical target is requests landing around the p95 of real usage with a modest headroom margin on memory. This is the same percentile discipline we apply to VMs in rightsizing compute, adapted to the pod.
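The percentile rule can be sketched in a few lines. This is an illustration under assumptions, not a sizing tool: it uses a nearest-rank percentile, a hypothetical 15% memory headroom, and made-up usage samples in millicores and MiB.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample with at least pct% of samples at or below it.
    """
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # integer ceiling
    return ordered[rank - 1]

def recommend_requests(cpu_millicores, mem_mib, mem_headroom_pct=15):
    """Candidate requests: CPU at p95, memory at p95 plus headroom."""
    cpu_request = percentile(cpu_millicores, 95)
    mem_p95 = percentile(mem_mib, 95)
    mem_request = -(-mem_p95 * (100 + mem_headroom_pct) // 100)  # round up
    return {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"}

# Made-up samples: 20 CPU readings (10m..200m), 20 memory readings (700..890 MiB).
cpu_samples = list(range(10, 210, 10))
mem_samples = list(range(700, 900, 10))

rec = recommend_requests(cpu_samples, mem_samples)
print(rec)
```

Headroom is applied to memory only, matching the reasoning above that CPU spikes are absorbed by spare cores while memory overruns kill the pod. Treat the output as a candidate to review alongside the VPA recommendation.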

Step 3: Set limits to protect, not to constrain

Treat CPU and memory limits differently. For memory, set a limit close to the request so a leaking container is killed and restarted rather than taking the node down with it, since a memory-starved node evicts unrelated pods. For CPU, many teams now leave the limit unset or generous, because a tight CPU limit throttles a pod even when the node has idle cores, hurting latency for no cost benefit. The cost saving comes from the request; the CPU limit mostly buys you throttling problems. Decide this per workload class and write it into the defaults.
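One way to write that policy into defaults is a small helper that builds the resources block of a container spec. The sketch below is an assumption about how a team might encode it: the function name, the 120% memory limit ratio, and the sample values are hypothetical. The memory limit is derived from the request, and the CPU limit is omitted unless explicitly passed.

```python
def container_resources(cpu_request_m, mem_request_mib,
                        mem_limit_ratio_pct=120, cpu_limit_m=None):
    """Build a Kubernetes `resources` mapping for one container.

    Memory limit sits close to the request (ratio is a hypothetical default).
    CPU limit is left unset unless the workload class requires one.
    """
    mem_limit = -(-mem_request_mib * mem_limit_ratio_pct // 100)  # round up
    resources = {
        "requests": {
            "cpu": f"{cpu_request_m}m",
            "memory": f"{mem_request_mib}Mi",
        },
        "limits": {"memory": f"{mem_limit}Mi"},
    }
    if cpu_limit_m is not None:
        resources["limits"]["cpu"] = f"{cpu_limit_m}m"
    return resources

# Latency-sensitive service: no CPU limit.
api = container_resources(300, 1000)
# Hypothetical batch class where a CPU ceiling is wanted.
batch = container_resources(500, 2000, cpu_limit_m=2000)
print(api)
print(batch)
```

Keeping the choice in one function per workload class means the decision is made once and reviewed in one place rather than in every manifest.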

Step 4: Pair pod rightsizing with node and autoscaler choices

Right-sized pods only save money if the cluster actually gives back the freed nodes. Make sure the Cluster Autoscaler or Karpenter is allowed to scale node groups down, and pick node shapes that match your pod profile so bin-packing leaves little stranded headroom. Combine this with the Horizontal Pod Autoscaler for replica count so the cluster grows on demand rather than running padded all day. The broader pattern of scaling without risking outages is in autoscaling done right.

Symptom | Likely cause | Fix
Nodes 30% utilized but cluster full | Inflated requests strand capacity | Cut requests to p95 usage, let nodes scale in
Pods OOMKilled under load | Memory request or limit too low | Raise memory request and limit to working-set p95 plus headroom
Latency spikes, CPU idle | CPU limit throttling | Relax or remove the CPU limit
Requests never updated | No review loop | VPA in recommendation mode, monthly review

Behaviors above reflect Kubernetes scheduling and autoscaler defaults as of May 2026. Verify your distribution's specifics, since managed offerings on AWS, Azure, GCP and OCI differ in autoscaler integration.

Stop the inflation at the source

Rightsizing once leaves you fighting the same battle next quarter unless you stop oversized requests from being created in the first place. Two cluster-native guardrails help. A LimitRange in each namespace sets default requests and limits for pods that declare none, and caps the maximum a single pod can request, so a copied manifest cannot quietly reserve a whole node. A ResourceQuota caps the total requests a namespace can consume, which forces teams to live within an allocation rather than scaling requests without limit. Pair those with the VPA running in recommendation mode so the gap between requested and used stays visible on a dashboard rather than discovered at audit time. Governance like this is the Lock step of our method, the same principle covered more broadly in the rightsizing and waste pillar: cut once, then put a guardrail in place so the saving does not erode.
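The two guardrails can be sketched as manifests built in code and emitted as JSON, which kubectl accepts alongside YAML. Every value here is a placeholder: the object names, the "team-a" namespace, and all CPU and memory figures are hypothetical, and you should verify how a max interacts with containers that declare no limit on your cluster version before applying anything like this.

```python
import json

def limit_range(namespace):
    """LimitRange with default requests, a default memory limit, and a cap."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            # Applied when a container declares no requests (placeholder values).
            "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
            # Applied when a container declares no memory limit.
            "default": {"memory": "256Mi"},
            # Upper bound per container (placeholder values).
            "max": {"cpu": "2", "memory": "4Gi"},
        }]},
    }

def resource_quota(namespace):
    """ResourceQuota capping total requests in the namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "namespace-requests", "namespace": namespace},
        "spec": {"hard": {"requests.cpu": "20", "requests.memory": "64Gi"}},
    }

lr = limit_range("team-a")
rq = resource_quota("team-a")
print(json.dumps(lr, indent=2))
print(json.dumps(rq, indent=2))
```

Generating these per namespace from one function keeps the defaults consistent, and the quota values become the allocation each team is asked to live within.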

The short version

Read two weeks of real per-container usage, set CPU requests near busy-period usage and memory requests at the working-set p95 with headroom, keep memory limits close to requests but leave CPU limits loose, and make sure the node autoscaler is free to scale in. Rightsizing GPU pods follows a different logic, covered in rightsizing GPU and accelerated instances. When you want it run across every cluster at once, that is exactly what our rightsizing and waste elimination service delivers.
