I’m building an app that monitors my token consumption in Claude Code. A few days ago, looking at the raw numbers, I found this:
cacheReadInputTokens: 4,241,579,174
inputTokens: 1,293,019
Over 4.2 billion tokens read from cache. About 1.3 million "fresh" tokens. That's a 99.97% cache hit rate.
My first reaction was to think something was broken. Nobody gets a 99% cache hit rate. Not Redis. Not Cloudflare. Not your mom when she tells you she already knows what you're going to ask for lunch.
But it turns out it’s not broken. This is exactly how it works. And the reason is as elegant as it is counterintuitive.
What Gets Cached Isn’t Text
This is where most explanations fall short. When you read “prompt caching” you think of something like Redis: store the question, store the answer, if someone asks the same question you return the same answer.
No way.
What gets cached are KV tensors — the Key and Value matrices that the transformer calculates during the prefill phase. In plain language: when an LLM receives your prompt, the first thing it does is convert all that text into internal numerical representations (the embeddings) and multiply them by weight matrices to get the “keys” (K) and “values” (V) that the attention mechanism needs to generate the response.
That calculation is expensive as hell. In a 200,000-token prompt (normal in Claude Code, where conversation history accumulates), we’re talking about billions of matrix multiplication operations. It’s the part that consumes the most GPU, takes the longest, costs the most.
And here’s the kicker: between one message of yours and the next, 99% of that prompt doesn’t change. The system prompt is identical. The previous conversation history is identical. The files it read are the same. The only new thing is your latest message.
Why recalculate what you already calculated 30 seconds ago?
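To make that concrete, here is a toy single-projection sketch (my illustration, not Anthropic's implementation — a real transformer has many layers, and K/V at deeper layers depend on the whole preceding context through attention). The point it shows still holds in the real thing: for a deterministic model, an identical prefix produces identical K/V tensors, which is exactly what makes them safe to reuse.

```python
# Toy prefill: each token embedding is projected into a Key row and a Value row.
# Tiny 2-dimensional weights, chosen arbitrarily for illustration.

def matmul(vec, mat):
    """Multiply a 1 x d vector by a d x d matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

W_K = [[0.1, 0.2], [0.3, 0.4]]   # toy key projection
W_V = [[0.5, 0.0], [0.0, 0.5]]   # toy value projection

def prefill(embeddings):
    """Compute the K and V rows for every token in the prompt."""
    return [(matmul(e, W_K), matmul(e, W_V)) for e in embeddings]

prompt_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # yesterday's prompt
prompt_b = prompt_a + [[2.0, 2.0]]                # same prefix + one new token

kv_a = prefill(prompt_a)
kv_b = prefill(prompt_b)

# The first three KV pairs are bit-for-bit identical:
# only the new token's K and V actually need computing.
assert kv_b[:3] == kv_a
```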
How Matching Works
Caching isn’t enough. You need to know when the cache is useful. And here Anthropic uses an elegant trick: cumulative prefix hashing.
Each block of the prompt (system, tools, messages) generates a hash. But not an individual hash: a cumulative hash. The hash of block 3 includes the content of blocks 1, 2, and 3. If anything changes in a previous block, the hash of all subsequent blocks changes too.
When a new request arrives, the system searches backward from the point marked with cache_control, comparing hashes block by block, until it finds the longest matching prefix. Everything that matches → read from cache. Only the new stuff → gets calculated.
It’s like a movie you’ve watched 40 times. You don’t need to watch the whole thing to know what happens. You only need to watch from the point where it differs from what you remember.
Pay attention to this: the system only checks up to 20 blocks backward. Beyond that, it stops searching. This is a practical decision to avoid spending more time searching the cache than calculating the tensors directly.
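A minimal sketch of the scheme (my reconstruction of the idea described above, not Anthropic's code): cumulative hashes mean a single comparison at position i vouches for every block before it, so the backward search can stop at the first match.

```python
import hashlib

def cumulative_hashes(blocks):
    """Hash each block together with everything before it, so a change in
    block i invalidates the hashes of all blocks >= i."""
    running, out = b"", []
    for block in blocks:
        running += block.encode() + b"\x00"   # separator avoids ambiguous joins
        out.append(hashlib.sha256(running).hexdigest())
    return out

def longest_cached_prefix(cached_hashes, new_hashes, max_lookback=20):
    """Walk backward comparing cumulative hashes. Because a match at position
    i implies all earlier blocks match too, the first hit found is the longest
    prefix. Mirrors the 20-block lookback limit described above: if nothing
    matches within the window, treat it as a miss."""
    limit = min(len(cached_hashes), len(new_hashes))
    start = max(0, limit - max_lookback)
    for i in range(limit, start, -1):
        if cached_hashes[i - 1] == new_hashes[i - 1]:
            return i
    return 0

session_1 = ["system prompt", "tools", "user msg 1"]
session_2 = session_1 + ["assistant reply 1", "user msg 2"]

cached = cumulative_hashes(session_1)
incoming = cumulative_hashes(session_2)
assert longest_cached_prefix(cached, incoming) == 3   # first 3 blocks hit

# Change the system prompt and every cumulative hash downstream changes:
edited = cumulative_hashes(["different system", "tools", "user msg 1"])
assert longest_cached_prefix(edited, incoming) == 0   # full miss
```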
Why Claude Code Has 99% Cache Hit
Now that you know how matching works, 99% stops being mysterious. Look at what happens in a typical Claude Code session:
Message 1 (first of the session):
System prompt (8K tokens) + Tools (2K tokens) + Your message (500 tokens)
= 10,500 tokens → EVERYTHING gets calculated, EVERYTHING gets written to cache
Message 2:
System prompt (8K) + Tools (2K) + Message 1 (500) + Response 1 (3K) + Your message 2 (500)
= 14,000 tokens
→ First 10,500 → CACHE HIT (we already calculated these before)
→ 3,500 new ones → get calculated and added to cache
Cache hit: 75%
Message 10:
System prompt + Tools + 9 messages + 9 responses + Your message 10
= ~150,000 tokens
→ First ~149,500 → CACHE HIT
→ ~500 new ones → get calculated
Cache hit: 99.7%
See it? The conversation history only grows. Each new message is a tiny fraction of the accumulated total, so the cache hit ratio climbs inexorably toward 99%.
It's not magic. It's arithmetic: the new tokens per message stay roughly constant, while the accumulated history underneath them grows with every turn. A near-constant numerator over a growing denominator — the miss rate can only shrink.
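The convergence is easy to simulate. This sketch reuses the token figures from the worked example above (10K of system prompt + tools, 500-token messages, 3K replies) and ignores the file reads and tool outputs that inflate real Claude Code histories — which is why real sessions reach 99%+ even faster than this toy does.

```python
def session_hit_rates(base=10_000, user=500, reply=3_000, turns=10):
    """Per-message cache hit rate as conversation history accumulates.
    base = system prompt + tools, in tokens."""
    rates = []
    cached = 0                       # tokens already in cache (previous prompt)
    for turn in range(1, turns + 1):
        prompt = base + (turn - 1) * (user + reply) + user
        rates.append(cached / prompt)
        cached = prompt              # this whole prompt is now cached
    return rates

rates = session_hit_rates()
# Turn 1 is a full miss; by turn 2 the rate is already 75%, and from there
# it only climbs.
assert rates[0] == 0.0
assert rates[1] == 10_500 / 14_000   # 75%, as in the worked example
assert rates[-1] > 0.9
```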
Where Those Tensors Live
This is where things get beautiful. Because caching KV tensors isn’t like caching strings in Redis. We’re talking about gigs of numerical data that have to be available with microsecond latency.
Anthropic uses a two-level system:
Level 1: VRAM (5-minute TTL)
The tensors live directly in the GPU memory that will serve the next request. Zero copy, zero network latency. The cache hit is almost instantaneous because the data is already where it’s needed.
TTL: 5 minutes. If nobody makes a request in 5 minutes, they get evicted. This is the cache you use with the standard API. Cache write price: 1.25x the normal input price.
Level 2: GPU Node SSD (1-hour TTL)
If you pay for extended cache write (2x the input price), the tensors don’t get evicted after 5 minutes. Instead, when they leave VRAM due to memory pressure, they get offloaded to the GPU node’s local SSD.
When a cache hit comes in, they reload from SSD to VRAM. Slower than level 1, but orders of magnitude faster than recalculating the tensors from scratch.
The interesting thing about this: no network involved. It’s not a remote Redis. It’s not S3. It’s an SSD physically attached to the server that has the GPU. The architecture is designed to minimize data movement.
Request → In VRAM? → Yes → Instant cache hit
→ No → In local SSD? → Yes → Load to VRAM → Cache hit (~ms)
→ No → Calculate KV tensors → Cache miss
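That decision flow can be sketched as code. Everything here is illustrative — the class, the dict-based storage, and the TTL-only eviction are my simplifications (the article notes that real VRAM eviction is also driven by memory pressure, which this toy doesn't model). Only the two TTLs and the extended-write behavior come from the text above.

```python
import time

class TwoLevelKVCache:
    """Toy model of the VRAM + local-SSD lookup described above."""
    VRAM_TTL = 5 * 60     # seconds: standard 5-minute cache
    SSD_TTL = 60 * 60     # seconds: extended 1-hour cache

    def __init__(self):
        self.vram = {}    # prefix_hash -> (tensors, stored_at)
        self.ssd = {}

    def store(self, prefix_hash, tensors, extended=False, now=None):
        now = time.monotonic() if now is None else now
        self.vram[prefix_hash] = (tensors, now)
        if extended:                          # 2x "1-hour cache write"
            self.ssd[prefix_hash] = (tensors, now)

    def lookup(self, prefix_hash, now=None):
        now = time.monotonic() if now is None else now
        entry = self.vram.get(prefix_hash)
        if entry and now - entry[1] < self.VRAM_TTL:
            return "vram_hit"                 # data already on the GPU
        entry = self.ssd.get(prefix_hash)
        if entry and now - entry[1] < self.SSD_TTL:
            self.vram[prefix_hash] = (entry[0], now)   # reload SSD -> VRAM
            return "ssd_hit"
        return "miss"                         # recompute the KV tensors

cache = TwoLevelKVCache()
cache.store("prefix-abc", "kv-tensors", extended=True, now=0.0)
assert cache.lookup("prefix-abc", now=10.0) == "vram_hit"     # within 5 min
assert cache.lookup("prefix-abc", now=600.0) == "ssd_hit"     # VRAM TTL expired
assert cache.lookup("prefix-abc", now=7200.0) == "miss"       # SSD TTL expired
```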
Since February 2026, isolation is per workspace (it used to be per organization). This means your dev team’s tensors don’t mix with the marketing team’s, even if they’re in the same Anthropic organization.
The Numbers
If you’re evaluating whether this matters for your use case, here are the hard facts:
| Concept | Value |
|---|---|
| Cache read | 0.1x input price (90% discount) |
| 5-min cache write | 1.25x input price |
| 1-hour cache write | 2x input price |
| Latency reduction | ~85% on long prompts |
| Minimum cacheable | 1,024 tokens per checkpoint |
With Sonnet, input costs $3.00/M tokens. A cache read costs $0.30/M. In a Claude Code session with 200K tokens of history, the difference between recalculating and reading from cache is the difference between $0.60 and $0.06 per message.
Multiply that by the hundreds of messages you can exchange in a long session and you understand why Anthropic invested in building this: without prompt caching, long conversations with huge context would be economically unviable.
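The per-message arithmetic, using the multipliers from the table above and Sonnet's $3.00/M input price:

```python
# Cost of one message over a 200K-token history, per pricing mode.
INPUT_PRICE = 3.00 / 1_000_000        # $ per input token (Sonnet)
MULTIPLIER = {
    "no_cache": 1.0,                  # recalculate everything
    "cache_read": 0.1,                # 90% discount
    "write_5min": 1.25,               # standard cache write
    "write_1h": 2.0,                  # extended cache write
}

def cost(tokens, mode):
    return tokens * INPUT_PRICE * MULTIPLIER[mode]

history = 200_000
assert round(cost(history, "no_cache"), 2) == 0.60    # $0.60 per message
assert round(cost(history, "cache_read"), 2) == 0.06  # $0.06 per message
```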
My Real Data
Back to my numbers from the beginning. In my Claude Code usage over a month:
cacheReadInputTokens: 4,241,579,174 (4.2 billion — read from cache)
cacheCreationInputTokens: 196,596,243 (197 million — written to cache)
inputTokens: 1,293,019 (1.3 million — calculated without cache)
outputTokens: 2,517,666 (2.5 million — generated by the model)
Global cache hit rate: 95.5%. And within individual long sessions, it easily exceeds 99%.
Notice the asymmetry: I’ve read 4.2 billion tokens from cache, but the model has only generated 2.5 million tokens of output. The cache-read to actual-work ratio is 1,685:1. For every token the model produces, it reuses 1,685 tokens of previous context.
That also means cacheReadInputTokens isn’t a good productivity metric. It doesn’t measure how much you’ve “used” the model. It measures how much history the model has re-read. It’s like measuring your productivity by how many times you’ve opened the same file in your editor.
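Both ratios quoted above fall straight out of the raw counters:

```python
# My month of Claude Code usage, from the raw counters above.
cache_read = 4_241_579_174
cache_write = 196_596_243
fresh_input = 1_293_019
output = 2_517_666

total_input = cache_read + cache_write + fresh_input
hit_rate = cache_read / total_input
reuse_ratio = cache_read / output

assert round(hit_rate * 100, 1) == 95.5   # global cache hit rate
assert round(reuse_ratio) == 1685         # cached tokens re-read per output token
```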
What Anthropic Doesn’t Tell You
There are things that aren’t public:
- User→GPU affinity: How do they guarantee your next request lands on the same node that has your cache? Probably sticky routing per session, but they don’t confirm.
- SSD type: NVMe? CXL-attached? The KV tensors for a 200K token prompt take up several GB. SSD speed matters a lot.
- PagedAttention: vLLM (the most popular open source serving engine) uses a technique called PagedAttention that manages KV tensors like virtual memory pages. Does Anthropic use something similar, or do they have something proprietary? Unknown.
- Cluster topology: How many GPUs, how they’re interconnected, whether they use InfiniBand or Ethernet. Nothing public.
The Analogy That Sums It All Up
Think of prompt caching like a surgeon’s working memory during an operation.
The surgeon (the model) has to process all the patient information (the prompt) to decide each move (the output). Without cache, they’d have to re-read the complete medical history before each cut. With cache, they remember everything they already read and only need to process new information — the latest lab results, the tissue’s response to the previous cut.
What gets saved aren’t the patient’s documents (the text). They’re the intermediate conclusions the surgeon already extracted from those documents (the KV tensors). They don’t need to re-read the lab work. They already know what it says. They just need to integrate the new stuff with what they already know.
The 99% cache hit simply reflects that in a conversation with an LLM, the amount of “what we already know” grows much faster than the amount of “new stuff to process.”
And that, in plain language, is what makes it possible for you to have 200K token context conversations without each message costing you an arm and a leg.
Related: If you’re interested in what happens when the app monitoring those tokens is based on data invented by the AI itself, read Silent failure: when your AI makes stuff up and the tests say everything’s fine. And if you want to see how I manage API secrets without 1Password asking for Touch ID every 30 seconds, authorization fatigue and a 40-line cache.