Token Economics: Understanding LLM API Pricing

LLM API pricing is built on the token, the unit a model uses to read and generate text. A token is typically a few characters long, and a common rule of thumb is that one token is about three quarters of an English word. Providers charge one price per million input tokens (the text you send to the model, including the prompt and any context) and a separate, usually higher, price per million output tokens (the text the model generates back). Your bill for any feature is therefore its input and output token volumes, each multiplied by its own rate, summed across every call. Once you see the bill this way, the levers to reduce it become obvious.
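
As a rough illustration of the three-quarters rule of thumb, the sketch below estimates a token count from a word count. The function names are hypothetical, and the result is only an approximation: real counts depend on each provider's tokenizer and on the language of the text.

```python
import math

# Rule of thumb from the text: one token is about 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count for English text with the given word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Approximate token count by splitting the text on whitespace."""
    return estimate_tokens_from_words(len(text.split()))


article_tokens = estimate_tokens_from_words(750)
sentence_tokens = estimate_tokens("the quick brown fox jumps over")
```

An estimate like this is good enough for early budgeting. For billing reconciliation, use the token counts the provider reports for each call.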

This article is part of our AI, GPU and ML cluster. For the full picture, start with the complete guide to AI and GPU cost optimization, the pillar this piece links up to. In our See, Cut, Lock, Run method, token economics is the See step for managed AI: you cannot optimize an LLM feature until you can attribute its token volume to a workload and a team.

The pricing in one line

Cost equals input tokens times the input rate, plus output tokens times the output rate, summed over every call. Output is usually the pricier side, so the length of the model's answers often matters more than the length of your prompts.
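
The formula above can be sketched in a few lines of Python. The rates here are placeholder values chosen for easy arithmetic, not any provider's real prices, and the names Rates, call_cost and total_cost are our own for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Prices in dollars per million tokens (placeholders, not real quotes)."""
    input_per_million: float
    output_per_million: float


def call_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Cost of one call: input tokens at the input rate plus output at the output rate."""
    return (input_tokens * rates.input_per_million
            + output_tokens * rates.output_per_million) / 1_000_000


def total_cost(calls, rates: Rates) -> float:
    """Sum the cost over an iterable of (input_tokens, output_tokens) pairs."""
    return sum(call_cost(i, o, rates) for i, o in calls)


# Assumed rates: output priced at four times input, purely for illustration.
EXAMPLE_RATES = Rates(input_per_million=1.0, output_per_million=4.0)

long_prompt_short_reply = call_cost(4000, 200, EXAMPLE_RATES)
short_prompt_long_reply = call_cost(500, 2000, EXAMPLE_RATES)
both_calls = total_cost([(4000, 200), (500, 2000)], EXAMPLE_RATES)
```

Under these assumed rates, the call with the short prompt and long reply costs more than the call with the long prompt and short reply, even though it moves fewer tokens overall.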

Input tokens versus output tokens

The two sides of a call are priced differently, and the gap drives where you should focus. Input tokens cover everything you feed the model: the system prompt, any retrieved context, examples, conversation history, and the user's message. Output tokens cover only what the model writes back. Output is typically billed at a meaningfully higher rate per token than input, because generation is the expensive part computationally. The practical consequence is that a feature which returns long, verbose answers can cost more than one with a long prompt but a short reply, and capping output length is one of the cleanest cost wins available. The deeper economics of where a hosted call wins or loses against running your own model are covered in managed AI services versus self-hosted.

Context windows and why they get expensive

The context window is the maximum number of tokens a model can consider in one call, and larger windows tempt teams into stuffing more in: full documents, long histories, many retrieved passages. Every one of those tokens is billed as input on every call. A chatbot that resends the entire conversation each turn pays for the whole history again on every turn, so the cumulative input cost of a conversation grows roughly with the square of the number of turns rather than linearly. Retrieval-augmented features that pull in large context to improve answers inflate input in the same way, paying for every retrieved passage on every call. Trimming context to what the model actually needs, summarizing history rather than resending it, and retrieving fewer, more relevant passages all cut input tokens directly. Storing and searching that retrieved context efficiently is its own cost topic, covered in how to optimize vector database costs.
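
To make the growth concrete, here is a minimal sketch that counts the input tokens billed across a whole conversation when every turn resends the full history. The fixed message sizes and the optional max_history_tokens cap, which stands in for trimming or summarizing, are simplifying assumptions for illustration.

```python
from typing import Optional


def cumulative_input_tokens(turns: int, user_tokens: int, reply_tokens: int,
                            max_history_tokens: Optional[int] = None) -> int:
    """Total input tokens billed over a conversation.

    Each call sends the carried history plus the new user message. When
    max_history_tokens is set, the carried history is capped at that size.
    """
    total = 0
    history = 0
    for _ in range(turns):
        carried = history if max_history_tokens is None else min(history, max_history_tokens)
        total += carried + user_tokens
        # After the call, the new message and the reply join the history.
        history += user_tokens + reply_tokens
    return total


ten_turns = cumulative_input_tokens(10, user_tokens=100, reply_tokens=300)
twenty_turns = cumulative_input_tokens(20, user_tokens=100, reply_tokens=300)
ten_turns_capped = cumulative_input_tokens(10, user_tokens=100, reply_tokens=300,
                                           max_history_tokens=1200)
```

With these assumed sizes, doubling the conversation from ten turns to twenty roughly quadruples the input tokens billed, while capping the carried history brings the growth back to linear once the cap is reached.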

Lever	What it cuts	When to use
Cap output length	Output tokens	Always; set a sensible max
Trim and summarize context	Input tokens	Chat, RAG, long histories
Prompt caching	Repeated input tokens	Stable system prompts, shared context
Smaller model for the task	Both rates	Simple classification, routing
Batch API	Per-token rate	Non-urgent, high-volume jobs

Caching, batching, and model choice

Three structural levers go beyond trimming text. Prompt caching lets you reuse a stable chunk of input, a long system prompt or a fixed context, at a reduced rate on repeat calls, which is powerful when the same preamble is sent thousands of times. A batch interface, where the provider processes non-urgent requests within a window rather than instantly, is commonly priced at a discount to the synchronous endpoint, the same batch-versus-real-time trade-off explained in batch versus real-time inference. And model choice is the biggest lever of all: routing simple tasks to a smaller, cheaper model and reserving the flagship for genuinely hard requests can cut the rate by an order of magnitude. The general principle of paying for capability only where it earns its keep echoes the cost of fine-tuning versus prompting.
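
The sketch below shows how the three levers enter the cost arithmetic. Everything numeric is an assumption: the model tiers, their prices, the cached-input multiplier and the batch multiplier are placeholders, and whether caching and batch discounts can be combined varies by provider, so check the live pricing pages before relying on figures like these.

```python
# Hypothetical model tiers with placeholder prices in dollars per million tokens.
MODEL_RATES = {
    "small": {"input": 0.2, "output": 0.8},
    "flagship": {"input": 2.0, "output": 8.0},
}

# Task types we treat as simple enough for the smaller model (an assumption).
SIMPLE_TASKS = {"classification", "routing"}


def pick_model(task_type: str) -> str:
    """Route simple tasks to the small model and everything else to the flagship."""
    return "small" if task_type in SIMPLE_TASKS else "flagship"


def call_cost(model: str, input_tokens: int, output_tokens: int,
              cached_input_tokens: int = 0, cached_multiplier: float = 1.0,
              batch_multiplier: float = 1.0) -> float:
    """Cost of one call with optional cached-input and batch discounts."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    rates = MODEL_RATES[model]
    fresh_tokens = input_tokens - cached_input_tokens
    # Cached tokens are billed at a fraction of the normal input rate.
    input_cost = (fresh_tokens + cached_input_tokens * cached_multiplier) * rates["input"]
    output_cost = output_tokens * rates["output"]
    return batch_multiplier * (input_cost + output_cost) / 1_000_000


baseline = call_cost("flagship", 3000, 500)
with_cache = call_cost("flagship", 3000, 500, cached_input_tokens=2000, cached_multiplier=0.25)
with_batch = call_cost("flagship", 3000, 500, batch_multiplier=0.5)
routed = call_cost(pick_model("classification"), 3000, 500)
```

With these placeholder numbers, caching trims only the input side, batching scales the whole call, and routing to the smaller tier changes both rates at once, which is why it tends to be the largest lever.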

Token bill climbing faster than usage?

Our cost audit instruments your LLM features, attributes token volume to each one, and pulls the levers that fit: output caps, context trimming, caching, batching, and model routing. On the performance model you pay only from realized savings. No savings, no fee.

Putting a unit cost on every feature

The discipline that keeps a token bill under control is measuring cost per unit of value, not just total spend. Work out the tokens, and therefore the dollars, per chat session, per document processed, or per customer served, and you can see which features are economical and which are not. That unit cost feeds your AI infrastructure spend forecast and lets you decide, with numbers, whether a feature should ship, change, or run on a different model. Attributing those tokens back to the teams that generate them is the allocation discipline in how to allocate AI and ML costs by team.
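
A minimal sketch of that unit-cost calculation follows. It assumes you already log, per feature, the input tokens, output tokens and number of units served; the record layout, the feature names and the rates are all hypothetical.

```python
from collections import defaultdict

# Placeholder rates in dollars per million tokens, not real provider prices.
INPUT_RATE = 1.0
OUTPUT_RATE = 4.0


def cost_per_unit(records):
    """Return dollars per unit of value (session, document, customer) for each feature."""
    cost = defaultdict(float)
    units = defaultdict(int)
    for record in records:
        dollars = (record["input_tokens"] * INPUT_RATE
                   + record["output_tokens"] * OUTPUT_RATE) / 1_000_000
        cost[record["feature"]] += dollars
        units[record["feature"]] += record["units"]
    # Skip features with no units served to avoid dividing by zero.
    return {feature: cost[feature] / units[feature]
            for feature in cost if units[feature] > 0}


# Hypothetical usage log: two chat sessions and one batch of ten documents.
usage_log = [
    {"feature": "chat", "input_tokens": 12000, "output_tokens": 2000, "units": 1},
    {"feature": "chat", "input_tokens": 8000, "output_tokens": 1000, "units": 1},
    {"feature": "doc_summary", "input_tokens": 50000, "output_tokens": 5000, "units": 10},
]

unit_costs = cost_per_unit(usage_log)
```

Adding a team field to each record and grouping on it in the same way would extend this from per-feature unit cost to per-team allocation.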

Go deeper · free guide

The AI and GPU Cost Control Guide includes our per-feature token cost worksheet and the model-routing pattern we deploy on engagements. It is the downloadable companion to this article.

The short version

LLM API pricing charges per input and output token, with output usually the pricier side, and context windows quietly multiply input cost on every call. Cap output length, trim and summarize context, cache stable prompts, batch non-urgent work, and route simple tasks to smaller models, then track cost per unit of value so every feature has a number. Token rates and model lineups change often, so verify current per-token pricing against each provider's live documentation before you standardize. When you want your AI features instrumented and the token waste engineered out, that is what our FinOps implementation service delivers.