Practical tutorial · May 2026

LLM caching: how to cut your AI bill by 90% with prompt caching

You've spent the day talking to an AI. It probably read the same system prompt, the same product manual, and the same tool definitions dozens of times. Does that mean you burned tokens dozens of times? Here's one of the most important cost mechanisms in modern language models: repeated input does not need to be recomputed. Configure it right and your bill can drop by up to 90%.

✓ OpenAI, Claude, Gemini, DeepSeek comparison ✓ Real-world savings calculation ✓ 5 production mistakes to avoid

The cache you've already used without knowing

Open a webpage. The first time it's slow. The second time it loads in a second. The reason is simple: the browser stored the images, scripts, and styles in a local cache, and the next time it doesn't need to ask the server for them.

LLM caching follows the same logic, with one important difference. While browser cache stores files, LLM cache stores the "compute state" the model produced after reading the first part of your input. More precisely: many providers cache the intermediate key and value matrices from the attention layers. That's why the industry calls it KV cache.

This mechanism lets the model, after having read a large block of fixed content, skip most of the work the next time it sees the same beginning.

Prefill and decode: where your money actually goes

When the model receives your question, it goes through two clearly different phases.

The first one is called prefill. Think of it as "reading everything you sent." The model scans the system prompt, the conversation history, attached documents, and tool definitions, and computes the relationships among all those tokens. The longer the input, the more expensive this step.

The second one is called decode. This is "writing out." Building on the state computed during prefill, the model generates the response one token at a time.

The key insight: in applications with long input and short response, almost all the cost happens before the model says a word. Typical cases: customer support bots, RAG, codebase analysis, agents with many tools. In these scenarios, the fixed block can be thousands to hundreds of thousands of tokens, while the user's actual new question is two lines.

Prompt caching attacks exactly that first half. The first request reads the fixed content and stores it in cache. The second request, if it shares the same beginning, reuses that state and only processes what changed.

Important: caching does not make the model "remember the answer" and does not save you output tokens. It only saves the cost of re-reading the repeated input.

Why a single space can break the whole cache

This is the most counterintuitive part of prompt caching:

The system checks whether the prefix is exactly equal byte by byte, not whether the meaning is similar.

Compare these two requests:

You are a cooking assistant. Tell me what to have for dinner.
You are a cooking assistant. Tell me what to have for lunch.

The first half is identical. The model has the chance to reuse the block "You are a cooking assistant."

Now compare these two:

What should I have for dinner? You are a cooking assistant.
What should I have for lunch? You are a cooking assistant.

The prompts now diverge within the first few words. Even though the assistant role text is exactly the same, it sits after the point of divergence, so the cache will almost never hit.

The same goes for JSON. Changing field order, line breaks, or whitespace produces, from the model's point of view, a different input. The system doesn't understand "roughly the same": it only understands "exactly equal."
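You can see the exact-match behavior for yourself. A tiny sketch using Python's os.path.commonprefix, applied to the example prompts above:

```python
import os.path

def shared_prefix(a: str, b: str) -> str:
    """Return the exact character-for-character common prefix of two prompts.

    This mirrors how prompt caching matches: byte-exact from the start,
    with no semantic comparison at all.
    """
    return os.path.commonprefix([a, b])

p1 = "You are a cooking assistant. Tell me what to have for dinner."
p2 = "You are a cooking assistant. Tell me what to have for lunch."

# The cacheable region ends exactly where the two prompts diverge.
print(len(shared_prefix(p1, p2)), "shared characters")
```

With the reordered versions (question first, role last), the shared prefix shrinks to a handful of words, far below any provider's caching threshold.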

That gives us the most important rule of this whole article:

What doesn't change goes at the beginning. What changes goes at the end.

How to order your prompt to maximize caching

If you're building an LLM application, this is the recommended order when assembling your prompt:

  1. System instructions. Role, boundaries, response format, style. These are usually stable.
  2. Tool definitions. What tools the agent can call and their parameters. These blocks are usually long and included in every request, so they're perfect to cache.
  3. Knowledge base and long documents. Product manuals, contracts, code snippets, retrieved materials. They cache well only if they are truly identical between requests.
  4. Conversation history. What already happened doesn't change. The longer the conversation, the more valuable caching becomes.
  5. Whatever changes only this time. The user's question, current time, user ID, A/B test parameters, temporary permissions.
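As a sketch, the ordering above might look like this in code. Function and parameter names are illustrative, not any provider's API:

```python
def build_prompt(system_rules: str,
                 tool_defs: str,
                 knowledge: str,
                 history: list[str],
                 user_question: str,
                 request_meta: str) -> str:
    """Assemble a prompt so the stable prefix comes first and
    per-request data (timestamps, IDs, the new question) comes last."""
    stable = [system_rules, tool_defs, knowledge]   # rarely changes -> cacheable
    growing = history                               # append-only -> old prefix still matches
    volatile = [request_meta, user_question]        # changes every request -> keep at the end
    return "\n\n".join(stable + growing + volatile)
```

Two consecutive requests built this way differ only in the tail, so everything before the volatile block stays eligible for a cache hit.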

A classic mistake seen in production: someone writes on the first line of the system prompt:

It's now 3:30 PM on May 4, 2026.

A minute later it becomes 3:31 PM. The whole prefix changes. The thousands of tokens that came after lose their cache hit.

How to enable caching with each provider: OpenAI, Claude, Gemini, DeepSeek

The underlying idea is the same across providers. The differences lie in how you activate it, the minimum threshold, and how long it lasts. This table summarizes the state as of May 2026:

| Provider | How to enable | Minimum threshold | Savings on hit | Retention time |
|----------|---------------|-------------------|----------------|----------------|
| OpenAI | Automatic | 1,024 tokens, growing in 128-token increments | ~10% of base price on GPT-5; slightly more on older models | 5–10 min idle; 24 h optional on some models |
| Claude | Manual: add cache_control on reusable blocks | 1,024 or 2,048 tokens depending on model | Reads at ~10% of input price; writes slightly more expensive | 5 min default; 1 h optional |
| Gemini | Implicit automatic; explicit manual | Flash: 1,024 · Pro: 4,096 | Implicit: automatic savings. Explicit: price varies by platform | Explicit: 1 h default, configurable TTL |
| DeepSeek | Automatic | Internal prefix rules, no manual setup | Hit price notably lower than miss; billed separately | Disk cache, best-effort, hours to days |

OpenAI is the most hands-off. You don't have to touch your code: just make sure your prompt is over the threshold and keeps the same prefix. The field to watch in the response is cached_tokens.
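If you log the usage object from a response, a small helper can turn cached_tokens into a hit ratio. This sketch treats the usage payload as a plain dict; the example numbers are made up:

```python
def cached_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, given the `usage`
    object of an OpenAI Chat Completions response as a dict.
    `prompt_tokens_details.cached_tokens` is 0 on a miss."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# A usage payload shaped like the API's response (numbers invented):
example = {
    "prompt_tokens": 5200,
    "completion_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 4608},
}
print(f"{cached_ratio(example):.0%} of the prompt came from cache")
```

Tracking this ratio per endpoint is the fastest way to notice a prefix that silently stopped matching.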

In Claude Code, you don't need to worry about caching: the tool handles it for you. But if you call the Claude API directly, you have to add cache_control on tool, system, or message blocks to mark where the reusable content ends. The system searches backward from that point for the longest reusable prefix. You get more control, but it requires careful design. The fields to watch are cache_creation_input_tokens and cache_read_input_tokens.
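A minimal sketch of what that marking looks like when you build the request yourself. The helper below constructs the system parameter for the Messages API; the stable text is a placeholder:

```python
def cached_system_block(stable_text: str) -> list[dict]:
    """Build a `system` parameter for the Anthropic Messages API with the
    stable block marked for caching. `cache_control` on the last block
    tells the API to cache everything up to and including that block."""
    return [
        {
            "type": "text",
            "text": stable_text,
            "cache_control": {"type": "ephemeral"},
        }
    ]

# Passed as system=cached_system_block(LONG_MANUAL) to
# anthropic.Anthropic().messages.create(...). The first response then
# reports usage.cache_creation_input_tokens (the write); subsequent
# responses report usage.cache_read_input_tokens (the hits).
```

The same cache_control marker can also go on tool definitions or message blocks, depending on where your stable prefix ends.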

Gemini offers two modes. Implicit cache works like OpenAI's: automatic. Explicit cache is different: you create a cache object with a large block of material and reference it in subsequent calls. Ideal for "one document, many questions," but storage has a cost.

DeepSeek stands out for using disk cache enabled by default. No interface change needed. The response includes prompt_cache_hit_tokens and prompt_cache_miss_tokens. It's best-effort: it doesn't guarantee a hit every time, and building the cache takes a moment the first time.
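Since DeepSeek bills hits and misses separately, a quick hit-rate helper over the usage payload (again treated as a plain dict) can feed your dashboards:

```python
def deepseek_hit_rate(usage: dict) -> float:
    """Cache hit rate from a DeepSeek response's `usage` object, which
    reports prompt_cache_hit_tokens and prompt_cache_miss_tokens."""
    hit = usage.get("prompt_cache_hit_tokens", 0)
    miss = usage.get("prompt_cache_miss_tokens", 0)
    total = hit + miss
    return hit / total if total else 0.0
```

Because the cache is best-effort, expect the rate to start near zero on a cold start and climb as the disk cache warms up.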

How much you can really save: step-by-step calculation

Imagine you're building a customer support bot. Each request includes 5,000 fixed tokens between system prompt and product manual. The average user question is 200 tokens. The average response is 300 tokens. You process 10,000 requests per day.

Without caching, the input tokens are:

(5,000 + 200) × 10,000 = 52 million tokens per day

At a normal input price of $2.50 per million tokens and output of $15 per million:

Input: 52 million × $2.50 = $130
Output: 300 × 10,000 = 3 million tokens × $15 = $45
Total: $175 per day

With caching enabled, suppose 4,500 of the 5,000 fixed tokens hit the cache. Cache price is approximately 10% of normal, so $0.25 per million:

Cache hits: 4,500 × 10,000 = 45 million tokens × $0.25 = $11.25
Full-price input: (500 + 200) × 10,000 = 7 million tokens × $2.50 = $17.50
Output: 300 × 10,000 = 3 million tokens × $15 = $45
Total: about $74 per day

Daily savings: roughly $100, more than half the bill, without touching response quality.

The underlying rule doesn't change: the longer the input, the more it repeats, and the shorter the response, the more caching is worth. If your model writes long content from a small input, the relative savings are much smaller, because the output still costs the same.
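The arithmetic above can be packaged into a small calculator. The 10% cache discount is an assumption; check your provider's actual cached-read price:

```python
def daily_input_cost(fixed_tokens: int, cached_tokens: int,
                     question_tokens: int, requests: int,
                     price_per_m: float, cache_discount: float = 0.10):
    """Daily input cost in dollars, without and with caching.

    cache_discount: cached-read price as a fraction of the normal input
    price (~10% for several providers as of this writing; varies)."""
    without = (fixed_tokens + question_tokens) * requests / 1e6 * price_per_m
    hits = cached_tokens * requests / 1e6 * price_per_m * cache_discount
    misses = (fixed_tokens - cached_tokens + question_tokens) * requests / 1e6 * price_per_m
    return without, hits + misses

# The support-bot scenario from the text:
without, with_cache = daily_input_cost(5000, 4500, 200, 10_000, 2.50)
print(f"${without:.2f} -> ${with_cache:.2f} per day on input")
```

Plug in your own token counts and prices; output cost is unchanged either way, so it can be added separately.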

Which use cases are most suitable

Caching shines when the input is long and repeated and the response is short: customer support bots that re-read the same manual on every request, RAG pipelines where retrieved documents dominate the prompt, codebase analysis across many questions, and agents whose long tool definitions travel with every step. It pays off far less when the model writes long content from a short prompt, because output tokens are never cached.

Semantic cache: the riskier cousin

Prompt caching looks at exact characters. A different punctuation mark breaks the match. Semantic cache plays a different game: it compares meanings.

Imagine user A asks:

What is your return policy?

And user B asks:

I bought something by mistake, how do I return it?

Semantic cache evaluates whether the two questions are about the same topic. If similarity is high enough, it returns the previous answer without calling the model again.

This is implemented at the application layer, not in the model. The usual approach: store historical questions and answers in a vector database; when a new question comes in, look up the most similar ones; if the similarity exceeds a threshold, return the cached answer.
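A minimal sketch of that approach, with a toy bag-of-words similarity standing in for a real embedding model. In production you would call an embedding API and use a vector database instead:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" -- a placeholder for a real
    # embedding model, used here only to keep the sketch runnable.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a stored answer when a new question is similar enough.
    Keep the threshold conservative: similar is not equal."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, question: str):
        q = embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # miss -> caller falls through to the model

    def put(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))
```

On a miss, the caller invokes the model as usual and stores the fresh answer with put() for future lookups.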

The two caches combine well: check the semantic cache first at the application layer; on a miss, call the model, where prompt caching still cuts the cost of re-reading the fixed prefix; then store the new answer for future lookups.

But semantic cache has a serious risk. Similar is not equal. In contexts like returns, contracts, healthcare, or finance, a false match destroys user trust. The threshold should be very conservative, and for critical questions it's always better to let the model answer from scratch.

The 5 most common production mistakes

  1. Putting timestamps, IDs, or random values at the beginning of the prompt.

    Any variable that changes every request breaks the cache if placed at the start. It's a silent failure: the bill grows and nobody knows why.

  2. Serializing JSON inconsistently.

    The same configuration with a different field order is, to the model, a different input. Make sure to always serialize with fixed order and the same indentation.

  3. Putting user personalization at the top of the system prompt.

    Something like "You're serving John, he likes short answers" at the beginning means each user generates their own unique prefix. Better: keep global rules at the top and put user profile at the end.

  4. Looking only at total tokens and not at cache fields.

    Each provider exposes them with a different name: OpenAI uses cached_tokens, Claude uses cache_read_input_tokens, Gemini delivers it inside usage_metadata, DeepSeek uses prompt_cache_hit_tokens. Without watching these fields you don't know if your cache is working.

  5. Thinking the cache changes response quality.

    Prompt caching does not make the model smarter or responses more stable. It only reuses prefix computation. Output is still generated on the spot, and temperature, sampling, and context still influence it the same way.
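Mistake 2 has a one-line fix: serialize deterministically. A sketch:

```python
import json

def stable_json(obj) -> str:
    """Deterministic serialization: the same data always produces the
    same bytes, regardless of insertion order, so a JSON block embedded
    in the prompt keeps the prefix byte-identical."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

a = {"model": "x", "tools": ["search", "calc"]}
b = {"tools": ["search", "calc"], "model": "x"}
assert stable_json(a) == stable_json(b)  # field order no longer matters
```

Run every JSON block destined for the prompt through one serializer like this, and one whole class of silent cache misses disappears.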

How to verify your cache is working

This is the part most people skip. Without verification, you don't know if the savings you think you have are real:

  1. Always read the cache fields in the API response. They are the truth. If hits are zero after several similar requests, something in your prompt is silently changing.
  2. Diff between two consecutive prompts. Before complaining the cache doesn't work, do an exact diff between two supposedly identical requests. Most likely there's a hidden timestamp, ID, or whitespace that changes.
  3. Measure latency, not just cost. When the cache hits, the first token usually arrives much faster. If your TTFT (time to first token) p50 drops, that's a sign the cache is working.
  4. Watch hit/miss rate in production. If your hit rate is below 50% in apps with stable prefixes, you almost certainly have a bug in how the prompt is built.
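Step 2 is easy to automate. A small diff helper built on Python's difflib:

```python
import difflib

def prompt_diff(p1: str, p2: str) -> list[str]:
    """Exact line-level diff of two supposedly identical prompts.
    Any output at all means something is breaking the cache prefix --
    the usual suspects are hidden timestamps, IDs, or whitespace."""
    return list(difflib.unified_diff(p1.splitlines(), p2.splitlines(),
                                     lineterm="", n=0))

# Example: a timestamp sneaking into an otherwise stable prompt.
for line in prompt_diff("rules\ntime: 15:30\nquestion",
                        "rules\ntime: 15:31\nquestion"):
    print(line)
```

An empty list means the two prompts are line-for-line identical; anything else pinpoints exactly where the prefix diverges.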

TL;DR: one rule, big savings

What doesn't change goes at the beginning. What changes goes at the end.

If you do this one thing right, LLM caching has the chance to save you up to 90% of your input bill. It does not make you smarter, it does not change response quality, and it's not magic. It just stops the model from redoing work it already did.

In real applications (agents, RAG, customer support, assistants), those savings are often the difference between a profitable product and one that doesn't scale. Worth the time it takes to do well.

Official resources

Learn Claude Code to build agents with caching

If you're building an agent with Claude Code, you don't need to configure caching manually: the tool handles it for you. But knowing how it works helps you structure prompts and projects so the savings are real.

See the complete Claude Code guide