# LLM Caching: How to Cut Your AI Bill by 90% with Prompt Caching

> Clean Markdown version of https://claudecodeguia.com/en/llm-caching/, optimized for AI agent consumption. Last updated: May 2026.

You've spent the day talking to an AI. It probably read the same system prompt, the same product manual, and the same tool definitions dozens of times. Does that mean you burned tokens dozens of times? Here's the most important mechanism in modern language models: **repeated input does not need to be recomputed**. Configure it right and your bill can drop by up to 90%.

## The cache you've already used without knowing

Open a webpage. The first time it's slow. The second time it loads in a second. The reason: the browser stored images, scripts, and styles in a local cache.

LLM caching follows the same logic, with one difference. While browser cache stores *files*, LLM cache stores the **"compute state"** the model produced after reading the first part of your input. More precisely: many providers cache the intermediate `key` and `value` matrices from the attention layers. That's why it's called **KV cache**.

This mechanism lets the model, after having read a large block of fixed content, skip most of the work the next time it sees the same beginning.
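A toy sketch of the idea (not a real model, just the shape of the saving): the projections for the fixed prefix are computed once and reused, so a follow-up request only pays for the new tokens.

```python
# Toy illustration of KV caching, not a real transformer: random projections
# stand in for one attention layer's key/value computation.
import numpy as np

d = 8
Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)

def kv(token_embeddings):
    """Key/value matrices the attention layer needs for these tokens."""
    return token_embeddings @ Wk, token_embeddings @ Wv

prefix = np.random.randn(1000, d)   # long fixed block: system prompt, tools, manual
suffix = np.random.randn(5, d)      # the short new question

# Request 1: compute and store K/V for the fixed prefix.
K_cached, V_cached = kv(prefix)

# Request 2 with the same prefix: only the 5 new tokens get projected;
# the 1,000 cached rows are reused as-is.
K_new, V_new = kv(suffix)
K = np.vstack([K_cached, K_new])
V = np.vstack([V_cached, V_new])
```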

## Prefill and decode: where your money actually goes

When the model receives your question, it goes through two phases:

**Prefill** = "reading everything you sent." The model scans system prompt, history, documents, tool definitions, and computes relationships among tokens. **The longer the input, the more expensive this step.**

**Decode** = "writing out." Building on the state from prefill, the model generates the response token by token.

Key insight: in applications with **long input and short response**, almost all the cost happens before the model says a word. Typical cases: customer support bots, RAG, codebase analysis, agents with many tools. The fixed block can be thousands to hundreds of thousands of tokens, while the actual new question is two lines.

Prompt caching attacks that first half. The first request reads fixed content and stores it in cache. The second, if it shares the same beginning, reuses that state and only processes what changed.

**Important**: caching does not make the model "remember the answer" and does not save output tokens. It only saves the cost of re-reading repeated input.

## Why a single space can break the whole cache

The most counterintuitive part:

> The system checks whether **the prefix is exactly equal byte by byte**, not whether the meaning is similar.

```
You are a cooking assistant. Tell me what to have for dinner.
You are a cooking assistant. Tell me what to have for lunch.
```

First half identical → cache OK.

```
What should I have for dinner? You are a cooking assistant.
What should I have for lunch? You are a cooking assistant.
```

First word already different → cache almost never hits.

Same with JSON: changing field order, line breaks, or whitespace produces a different input from the model's perspective.

**Golden rule**: What doesn't change goes at the beginning. What changes goes at the end.
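A minimal sketch of keeping embedded JSON byte-stable between requests (the config keys are illustrative, not from any real product):

```python
# Serialize embedded JSON the same way every time so the prefix stays
# byte-identical between requests. The config keys are placeholders.
import json

config = {"language": "en", "format": "markdown", "temperature": 0.2}

# Fixed key order + fixed indentation -> identical bytes -> the cache can hit.
stable_block = json.dumps(config, sort_keys=True, indent=2)
```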

## How to order your prompt to maximize caching

1. **System instructions** — Role, boundaries, response format. Stable.
2. **Tool definitions** — Usually long, included in every request. Perfect to cache.
3. **Knowledge base and long documents** — Manuals, contracts, code snippets. Cache well only if truly identical between requests.
4. **Conversation history** — What already happened doesn't change.
5. **Whatever changes only this time** — User's question, current time, user ID, A/B parameters.

**Classic mistake**: putting `It's now 3:30 PM on May 4, 2026.` on the first line. A minute later it's 3:31 PM — whole prefix changes.
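A minimal sketch of assembling a request in that order; every string below is a placeholder, not a real product prompt:

```python
# Stable material first, volatile material last. All content is illustrative.
SYSTEM_RULES = "You are ExampleCo's support assistant. Answer briefly."       # 1. stable
TOOL_DEFS    = "<tool / JSON-schema definitions, identical every request>"    # 2. stable
KNOWLEDGE    = "<product manual, identical every request>"                    # 3. stable

def build_messages(history, user_question, now):
    return [
        {"role": "system",
         "content": f"{SYSTEM_RULES}\n\n{TOOL_DEFS}\n\n{KNOWLEDGE}"},
        *history,                                   # 4. grows, but earlier turns never change
        {"role": "user",                            # 5. everything volatile goes last
         "content": f"{user_question}\n\n(Current time: {now})"},
    ]
```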

## How to enable caching with each provider

| Provider | How to enable | Min threshold | Price on cache hit | Retention |
|---|---|---|---|---|
| **OpenAI** | Automatic | 1024 tokens, then increments of 128 | ~10% of base input price on GPT-5 | 5–10 min idle; 24 h optional |
| **Claude** | Manual: `cache_control` on reusable blocks | 1024 or 2048 tokens depending on model | Reads ~10% of input price | 5 min default; 1 h optional |
| **Gemini** | Implicit automatic; explicit manual | Flash: 1024 · Pro: 4096 | Implicit: discounted automatically · Explicit: varies | Explicit: 1 h default, configurable TTL |
| **DeepSeek** | Automatic | Internal prefix rules | Hit price notably lower than miss | Disk cache, best-effort, hours to days |

**OpenAI** is the most hands-off. No code changes: prompt over the threshold + same prefix → automatic caching. Watch `cached_tokens` in the response's usage block.
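A sketch with the OpenAI Python SDK; the model name and prompt strings are placeholders, and the usage field paths are the documented ones for Chat Completions:

```python
# Check the cache fields OpenAI returns. The fixed system prompt must exceed
# the ~1024-token threshold for caching to kick in; strings here are placeholders.
from openai import OpenAI

client = OpenAI()
LONG_FIXED_PREFIX = "<system rules + tool definitions + manual, >1024 tokens>"

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": LONG_FIXED_PREFIX},
        {"role": "user", "content": "How do returns work?"},
    ],
)

print("prompt tokens:", resp.usage.prompt_tokens)
print("cached tokens:", resp.usage.prompt_tokens_details.cached_tokens)  # > 0 on a hit
```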

In **Claude Code**, there's no need to worry: the tool handles caching for you. If you call the Claude API directly, add `cache_control` to tool, system, or message blocks. Watch `cache_creation_input_tokens` and `cache_read_input_tokens`.
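A sketch with the Anthropic Python SDK; the model id and content are placeholders:

```python
# Mark the reusable system block with cache_control, then watch the usage fields.
import anthropic

client = anthropic.Anthropic()
LONG_FIXED_PREFIX = "<system rules + manual, above the model's minimum cacheable size>"

resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # example model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_FIXED_PREFIX,
            "cache_control": {"type": "ephemeral"},   # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "How do returns work?"}],
)

print(resp.usage.cache_creation_input_tokens)   # > 0 when the cache is written
print(resp.usage.cache_read_input_tokens)       # > 0 on subsequent hits
```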

**Gemini** offers two modes. *Implicit* works like OpenAI. *Explicit*: create a cache object holding the large material and reference it in each request. Ideal for "one document, many questions," but cache storage is billed separately.
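A sketch of the explicit mode with the `google-genai` SDK; the model name, TTL, and document are placeholders, and exact config class names may differ across SDK versions:

```python
# Create an explicit cache for a large document, then reference it per question.
from google import genai
from google.genai import types

client = genai.Client()
LONG_DOCUMENT = "<contract / manual, above the model's minimum cacheable size>"

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(contents=[LONG_DOCUMENT], ttl="3600s"),
)

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize section 3 of the document.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.usage_metadata.cached_content_token_count)
```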

**DeepSeek** uses disk cache enabled by default. Response includes `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens`. Best-effort.
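DeepSeek's API is OpenAI-compatible, so a sketch with the same SDK pointed at its base URL works; field names follow its KV-cache docs, and the prompt content is a placeholder:

```python
# Read DeepSeek's cache counters from the usage block.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "<long, stable prefix>"},
        {"role": "user", "content": "How do returns work?"},
    ],
)

usage = resp.usage.model_dump()   # provider-specific fields sit alongside the standard ones
print(usage.get("prompt_cache_hit_tokens"), usage.get("prompt_cache_miss_tokens"))
```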

## How much you can really save

Customer support bot. Each request: 5,000 fixed tokens + 200-token question + 300-token response. 10,000 requests/day.

**Without caching**:

```
(5,000 + 200) × 10,000 = 52 million tokens/day
Input: ~$130   Output: ~$45   Total: $175/day
```

**With caching** (4,500 of 5,000 fixed tokens hit):

```
Cache hits: 4,500 × 10,000 = 45 million (~$11.25)
Full-price input: (500 + 200) × 10,000 = 7 million (~$17.50)
Output: 300 × 10,000 = 3 million (~$45)
Total: ~$73.75/day
```

**Savings: ~$100/day**. More than half the bill gone without touching response quality.
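The same arithmetic as a short script, using the illustrative prices behind the numbers above ($2.50/M input, $0.25/M cached input, $15/M output; not any provider's official price list):

```python
# Reproduces the worked example above with illustrative per-token prices.
REQUESTS = 10_000
FIXED, QUESTION, ANSWER = 5_000, 200, 300
P_IN, P_CACHED, P_OUT = 2.50e-6, 0.25e-6, 15.00e-6   # dollars per token

without = (FIXED + QUESTION) * REQUESTS * P_IN + ANSWER * REQUESTS * P_OUT
hit = 4_500
with_cache = ((hit * P_CACHED + (FIXED - hit + QUESTION) * P_IN) * REQUESTS
              + ANSWER * REQUESTS * P_OUT)

print(f"without caching: ${without:,.2f}/day")      # ~$175.00
print(f"with caching:    ${with_cache:,.2f}/day")   # ~$73.75
```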

Rule: longer input + more repetition + shorter response = **more cache value**.

## Ideal use cases

- **Customer support and assistants**: stable system prompt and manual, only question changes.
- **RAG with caveat**: if retrieved documents change completely each time, cache barely hits. Trick: stable base materials at the beginning, temporary retrieval at the end.
- **Agents with multiple tools**: tool definitions are thousands of tokens at start, repeated in every step.
- **Multi-turn conversations**: history accumulates without changing.
- **Low-frequency scripts**: skip it. If running every half hour, default cache (5–10 min) already expired.

## Semantic cache: the riskier cousin

Prompt caching looks at exact characters. **Semantic cache** compares *meanings*.

User A: "What is your return policy?"  
User B: "I bought something by mistake, how do I return it?"

If the similarity is high, the system returns the previous answer **without calling the model again**.

It's implemented at the application layer. The usual approach: store historical Q&A pairs in a vector database; embed each new question and search for similar ones; if similarity is above a threshold, return the cached answer.
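A minimal sketch of that loop; `embed()` stands in for whichever embedding model you use, and the threshold is deliberately conservative:

```python
# Application-level semantic cache: reuse an old answer only when the new question
# is very close in embedding space. In production the list would be a vector DB.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cache = []   # list of (embedding, answer) pairs

def lookup(question, embed, threshold=0.93):
    if not cache:
        return None
    q = embed(question)
    emb, answer = max(cache, key=lambda pair: cosine(q, pair[0]))
    return answer if cosine(q, emb) >= threshold else None   # None -> call the model

def store(question, answer, embed):
    cache.append((embed(question), answer))
```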

Combining with prompt caching:
- **Prompt caching** = "this call costs less"
- **Semantic cache** = "this call may not be needed at all"

**Serious risk**: similar is not equal. In returns, contracts, healthcare, or finance, a false match destroys trust. Threshold should be very conservative, and for critical questions always let the model answer.

## The 5 most common production mistakes

1. **Timestamps, IDs, or randoms at the beginning of the prompt.** Any variable that changes breaks the cache if at the start. Silent failure: bill grows, nobody knows why.

2. **Inconsistent JSON serialization.** Same config with different field order = different input to the model. Always serialize with fixed order and same indentation.

3. **User personalization at the top of the system prompt.** "You're serving John, he likes short answers" at the beginning = each user generates unique prefix. Better: global rules at top, user profile at end.

4. **Looking only at total tokens, not cache fields.** OpenAI: `cached_tokens`. Claude: `cache_read_input_tokens`. Gemini: inside `usage_metadata`. DeepSeek: `prompt_cache_hit_tokens`. Without watching, you don't know if it works.

5. **Thinking the cache changes quality.** It only reuses the prefix computation. Output is still generated on the spot; temperature and sampling still apply.

## How to verify your cache is working

1. **Always read cache fields in the API response** — they are the truth.
2. **Diff between two consecutive prompts** — find the hidden character that changes (see the sketch after this list).
3. **Measure latency, not just cost** — TTFT (time to first token) drops when cache hits.
4. **Watch hit/miss rate in production** — if <50% in apps with stable prefix, there's a bug.
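A quick sketch of that diff step, assuming you've logged the two rendered prompts to files (the paths are placeholders):

```python
# Spot the exact line where two "identical" prompts diverge and break the prefix.
import difflib

prompt_a = open("request_1_prompt.txt", encoding="utf-8").read()
prompt_b = open("request_2_prompt.txt", encoding="utf-8").read()

for line in difflib.unified_diff(
    prompt_a.splitlines(), prompt_b.splitlines(),
    fromfile="request_1", tofile="request_2", lineterm="",
):
    print(line)
```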

## TL;DR

> **What doesn't change goes at the beginning. What changes goes at the end.**

If you do this one thing right, LLM caching can save you up to 90% of your input bill. It doesn't make the model smarter, doesn't change quality, and isn't magic. It just stops the model from redoing work it already did.

In real applications (agents, RAG, customer support), that savings is often the difference between a profitable product and one that doesn't scale.

## Official resources

- [OpenAI Prompt Caching documentation](https://platform.openai.com/docs/guides/prompt-caching)
- [Claude (Anthropic) Prompt Caching documentation](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
- [Gemini API caching (Google)](https://ai.google.dev/gemini-api/docs/caching)
- [DeepSeek KV Cache Guide](https://api-docs.deepseek.com/guides/kv_cache)

---

**Web version (HTML)**: https://claudecodeguia.com/en/llm-caching/  
**Spanish version**: https://claudecodeguia.com/cache-llm/  
**Main site**: https://claudecodeguia.com/en/
