You've spent the day talking to an AI. It probably read the same system prompt, the same product manual, and the same tool definitions dozens of times. Does that mean you burned tokens dozens of times? Here's one of the most important cost mechanisms in how modern language models are served: repeated input does not need to be recomputed. Configure it right and your bill can drop by up to 90%.
Open a webpage. The first time it's slow. The second time it loads in a second. The reason is simple: the browser stored the images, scripts, and styles in a local cache, and the next time it doesn't need to ask the server for them.
LLM caching follows the same logic, with one important difference. While the browser cache stores files, the LLM cache stores the "compute state" the model produced after reading the first part of your input. More precisely: many providers cache the intermediate key and value matrices from the attention layers. That's why the industry calls it KV cache.
This mechanism lets the model, after having read a large block of fixed content, skip most of the work the next time it sees the same beginning.
When the model receives your question, it goes through two clearly different phases.
The first one is called prefill. Think of it as "reading everything you sent." The model scans the system prompt, the conversation history, attached documents, and tool definitions, and computes the relationships among all those tokens. The longer the input, the more expensive this step.
The second one is called decode. This is "writing out." Building on the state computed during prefill, the model generates the response one token at a time.
The key insight: in applications with long input and short response, almost all the cost happens before the model says a word. Typical cases: customer support bots, RAG, codebase analysis, agents with many tools. In these scenarios, the fixed block can be thousands to hundreds of thousands of tokens, while the user's actual new question is two lines.
Prompt caching attacks exactly that first half. The first request reads the fixed content and stores it in cache. The second request, if it shares the same beginning, reuses that state and only processes what changed.
Important: caching does not make the model "remember the answer" and does not save you output tokens. It only saves the cost of re-reading the repeated input.
This is the most counterintuitive part of prompt caching:
The system checks whether the prefix is exactly equal byte by byte, not whether the meaning is similar.
Compare these two requests:
You are a cooking assistant. Tell me what to have for dinner.
You are a cooking assistant. Tell me what to have for lunch.
The first half is identical. The model can reuse the block "You are a cooking assistant."
Now compare these two:
What should I have for dinner? You are a cooking assistant.
What should I have for lunch? You are a cooking assistant.
The first word is already different. Even though the assistant role is exactly the same, the cache will almost never hit.
The same goes for JSON. Changing field order, line breaks, or whitespace produces, from the model's point of view, a different input. The system doesn't understand "roughly the same": it only understands "exactly equal."
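You can see this in a few lines of Python. The field names here are made up; the point is that the same data serialized in a different key order produces different bytes, while a fixed serialization keeps the prefix identical:

```python
import json

config = {"model": "gpt-4o-mini", "tone": "friendly", "max_steps": 3}
reordered = {"tone": "friendly", "max_steps": 3, "model": "gpt-4o-mini"}

# Python dicts keep insertion order, so the same data serialized naively
# produces two different strings, and therefore two different cache prefixes.
print(json.dumps(config) == json.dumps(reordered))   # False

def canonical(data: dict) -> str:
    # Fixed key order and fixed separators: identical bytes every time.
    return json.dumps(data, sort_keys=True, separators=(",", ":"))

print(canonical(config) == canonical(reordered))      # True
```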
That gives us the most important rule of this whole article:
What doesn't change goes at the beginning. What changes goes at the end.
If you're building an LLM application, this is the recommended order when assembling your prompt:

1. System prompt and global rules
2. Tool definitions
3. Reference documents (manual, knowledge base, code)
4. Conversation history
5. The user's new question and any per-request data
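Here's a sketch of what that looks like in code. The names (SYSTEM_PROMPT, PRODUCT_MANUAL, build_messages) are placeholders, not any particular SDK's API:

```python
# A sketch of assembling a request so the stable prefix stays byte-identical
# across calls. All names and content here are placeholders.
SYSTEM_PROMPT = "You are a support assistant for the Acme product line."  # never changes
PRODUCT_MANUAL = "...full product manual pasted here..."                  # rarely changes

def build_messages(history: list[dict], user_question: str, user_name: str) -> list[dict]:
    return [
        # Fixed content first: identical bytes on every request.
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + PRODUCT_MANUAL},
        # Conversation history comes after the fixed block, so earlier
        # turns keep their cached prefix as the conversation grows.
        *history,
        # Per-request data (user profile, the new question) goes last.
        {"role": "user", "content": f"[customer: {user_name}]\n{user_question}"},
    ]
```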
A classic mistake seen in production: someone writes on the first line of the system prompt:
It's now 3:30 PM on May 4, 2026.
A minute later it becomes 3:31 PM. The very first tokens change, the prefix no longer matches, and the thousands of tokens that came after lose their cache hit.
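Two ways out of the trap, assuming the model doesn't need minute-level precision inside the cached block:

```python
from datetime import datetime, timezone

# Option 1: round the timestamp so the prefix only changes once a day.
today_line = datetime.now(timezone.utc).strftime("Today is %B %d, %Y.")

# Option 2: keep the precise time, but append it to the end of the user
# message instead of the start of the system prompt.
def user_message(question: str) -> str:
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return f"{question}\n\n(current time: {now})"
```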
The underlying idea is the same across providers. The differences lie in how you activate it, the minimum threshold, and how long it lasts. This table summarizes the state as of May 2026:
| Provider | How to enable | Minimum threshold | Savings on hit | Retention time |
|---|---|---|---|---|
| OpenAI | Automatic | 1024 tokens, then in 128-token increments | ~10% of base price on GPT-5; slightly more on older models | 5–10 min idle; 24 h optional on some models |
| Claude | Manual: add cache_control on reusable blocks | 1024 or 2048 tokens depending on model | Read ~10% of input price; write slightly more expensive | 5 min default; 1 h optional |
| Gemini | Implicit automatic; explicit manual | Flash: 1024 · Pro: 4096 | Implicit: automatic savings. Explicit: price varies by platform | Explicit: 1 h default, configurable TTL |
| DeepSeek | Automatic | Internal prefix rules, no manual setup | Hit price notably lower than miss; billed separately | Disk cache, best-effort, hours to days |
OpenAI is the most hands-off. You don't have to touch your code: just make sure your prompt is over the threshold and keeps the same prefix. The field to watch in the response is cached_tokens.
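A minimal check with the official Python SDK. The model name and prompt contents are placeholders, and the fixed block has to clear the 1024-token threshold for cached_tokens to be nonzero:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FIXED_PREFIX = "You are a support assistant.\n" + "...product manual...\n" * 200

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": FIXED_PREFIX},   # stable prefix, over the threshold
        {"role": "user", "content": "How do I reset the device?"},
    ],
)

# On a second request with the same prefix, cached_tokens should be close
# to the size of the fixed block; 0 means the cache is not hitting.
details = resp.usage.prompt_tokens_details
print("prompt tokens:", resp.usage.prompt_tokens)
print("cached tokens:", details.cached_tokens)
```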
In Claude Code, you don't need to worry about caching: the tool handles it for you. But if you call the Claude API directly, you have to add cache_control on tool, system, or message blocks to mark where the reusable content ends. The system searches backward from that point for the longest reusable prefix. You get more control, but it requires careful design. The fields to watch are cache_creation_input_tokens and cache_read_input_tokens.
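A sketch of what that marking looks like with the Anthropic Python SDK. The model name and manual text are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

MANUAL = "...thousands of tokens of product manual..."

resp = client.messages.create(
    model="claude-sonnet-4-5",   # substitute whichever model you actually use
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a support assistant."},
        {
            "type": "text",
            "text": MANUAL,
            # Marks the end of the reusable block; everything up to this
            # point becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)

print("written to cache:", resp.usage.cache_creation_input_tokens)
print("read from cache: ", resp.usage.cache_read_input_tokens)
```

The first call pays the slightly higher write price and reports cache_creation_input_tokens; later calls with the same prefix should report most of it under cache_read_input_tokens.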
Gemini offers two modes. Implicit cache works like OpenAI's: automatic. Explicit cache is different: you create a cache object with a large block of material and reference it in subsequent calls. Ideal for "one document, many questions," but storage has a cost.
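Roughly what the explicit mode looks like with the google-genai SDK. Model name, TTL, and the manual text are placeholders, and the cached block must clear the model's minimum threshold:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

# Create a cache object once for the large fixed document...
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about this manual.",
        contents=["...the full manual text..."],
        ttl="3600s",
    ),
)

# ...then reference it from every subsequent question.
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How do I reset the device?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.text)
```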
DeepSeek stands out for using disk cache enabled by default. No interface change needed. The response includes prompt_cache_hit_tokens and prompt_cache_miss_tokens. It's best-effort: it doesn't guarantee a hit every time, and building the cache takes a moment the first time.
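A minimal call that reads those two fields. The endpoint and field names follow DeepSeek's documentation; the prompt content is a placeholder:

```python
import os
import requests

resp = requests.post(
    "https://api.deepseek.com/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
    json={
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": "...long fixed prefix..."},
            {"role": "user", "content": "How do I reset the device?"},
        ],
    },
    timeout=60,
)

# The usage block reports how much of the prompt was served from disk cache.
usage = resp.json()["usage"]
print("hit tokens: ", usage["prompt_cache_hit_tokens"])
print("miss tokens:", usage["prompt_cache_miss_tokens"])
```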
Imagine you're building a customer support bot. Each request includes 5,000 fixed tokens between system prompt and product manual. The average user question is 200 tokens. The average response is 300 tokens. You process 10,000 requests per day.
Without caching, the input tokens are:
(5,000 + 200) × 10,000 = 52 million tokens per day
At a normal input price of $2.50 per million tokens and output of $15 per million, that's about $130 a day for input plus $45 for the 3 million output tokens: roughly $175 per day in total.
With caching enabled, suppose 4,500 of the 5,000 fixed tokens hit the cache, and the cached price is approximately 10% of normal, $0.25 per million:
Cache hits: 4,500 × 10,000 = 45 million tokens, about $11 at the cached price
Full-price input: (500 + 200) × 10,000 = 7 million, $17.50
Output: 300 × 10,000 = 3 million, $45, same as before
That adds up to roughly $74 per day instead of $175. Daily savings: roughly $100, more than half the bill, without touching response quality.
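If you want to play with the numbers, here's the same arithmetic as a few lines of Python. The prices are the illustrative figures from the example, not any provider's price list:

```python
# Reproducing the arithmetic above with illustrative prices.
REQUESTS_PER_DAY = 10_000
FIXED, QUESTION, ANSWER = 5_000, 200, 300               # tokens per request
PRICE_IN, PRICE_OUT, PRICE_CACHED = 2.50, 15.00, 0.25   # $ per million tokens

def daily_cost(cached_tokens: int) -> float:
    fresh_input = FIXED + QUESTION - cached_tokens
    per_request = (
        cached_tokens * PRICE_CACHED
        + fresh_input * PRICE_IN
        + ANSWER * PRICE_OUT
    ) / 1_000_000
    return per_request * REQUESTS_PER_DAY

print(f"without caching: ${daily_cost(0):,.2f}/day")      # ~ $175
print(f"with caching:    ${daily_cost(4_500):,.2f}/day")  # ~ $74
```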
The underlying rule doesn't change: the longer the input, the more it repeats, and the shorter the response, the more caching is worth. If your model writes long content from a small input, the relative savings are much smaller, because the output still costs the same.
Prompt caching looks at exact characters. A different punctuation mark breaks the match. Semantic cache plays a different game: it compares meanings.
Imagine user A asks:
What is your return policy?
And user B asks:
I bought something by mistake, how do I return it?
Semantic cache evaluates whether the two questions are about the same topic. If similarity is high enough, it returns the previous answer without calling the model again.
This is implemented at the application layer, not in the model. The usual approach: store historical questions and answers in a vector database; when a new question comes in, look up the most similar ones; if the similarity exceeds a threshold, return the cached answer.
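A minimal sketch of that approach. Here embed() stands in for whatever embedding model you use, and the 0.92 threshold is an arbitrary example, not a recommendation:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: call your embedding model here.
    raise NotImplementedError

cache: list[tuple[np.ndarray, str]] = []   # (question embedding, stored answer)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, call_llm) -> str:
    q = embed(question)
    for vec, cached_answer in cache:
        if cosine(q, vec) >= 0.92:         # high threshold: similar is not equal
            return cached_answer
    fresh = call_llm(question)
    cache.append((q, fresh))
    return fresh
```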
The two caches combine well: semantic cache can skip the model call entirely for questions it has already seen, while prompt caching cuts the cost of every call that does reach the model.
But semantic cache has a serious risk. Similar is not equal. In contexts like returns, contracts, healthcare, or finance, a false match destroys user trust. The threshold should be very conservative, and for critical questions it's always better to let the model answer from scratch.
Any variable that changes every request breaks the cache if placed at the start. It's a silent failure: the bill grows and nobody knows why.
The same configuration with a different field order is, to the model, a different input. Make sure to always serialize with fixed order and the same indentation.
Something like "You're serving John, he likes short answers" at the beginning means each user generates their own unique prefix. Better: keep global rules at the top and put user profile at the end.
Each provider exposes them with a different name: OpenAI uses cached_tokens, Claude uses cache_read_input_tokens, Gemini delivers it inside usage_metadata, DeepSeek uses prompt_cache_hit_tokens. Without watching these fields you don't know if your cache is working.
Prompt caching does not make the model smarter or responses more stable. It only reuses prefix computation. Output is still generated on the spot, and temperature, sampling, and context still influence it the same way.
This is the part most people skip. Without verification, you don't know if the savings you think you have are real: log the cache fields from every response, compute your hit rate over a full day of traffic, and check that the numbers line up with the size of your fixed prefix.
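One way to do it: accumulate the cache fields from every response and track the hit rate over time. The field names below follow OpenAI's usage format; swap in your provider's equivalents:

```python
# A sketch of the verification step: record usage from each response and
# compute an overall cache hit rate.
class CacheStats:
    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, usage) -> None:
        self.prompt_tokens += usage.prompt_tokens
        self.cached_tokens += usage.prompt_tokens_details.cached_tokens

    @property
    def hit_rate(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

stats = CacheStats()
# After each call: stats.record(resp.usage)
# Alert if stats.hit_rate drops well below the share of your fixed prefix.
```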
What doesn't change goes at the beginning. What changes goes at the end.
If you do this one thing right, LLM caching can save you up to 90% of your input bill. It does not make the model smarter, it does not change response quality, and it's not magic. It just stops the model from redoing work it already did.
In real applications (agents, RAG, customer support, assistants) that saving is often the difference between a profitable product and one that doesn't scale. Worth the time it takes to do well.
If you're building an agent with Claude Code, you don't need to configure caching manually: the tool handles it for you. But knowing how it works helps you structure prompts and projects so the savings are real.
See the complete Claude Code guide