A few weeks ago, I published an article explaining why 99% of what you send to Claude is already cached. KV tensors, VRAM, local SSDs — the full internal machinery. But I left out the part that hurts the most: the bill.

Because prompt caching seems like a sweet deal until you look closely at the numbers. And then you realize that you’re paying to save.

The cost paradox

Let’s crunch the numbers. With Claude Sonnet:

Concept         Price per million tokens
Normal input    $3.00
Cache write     $3.75 (1.25x)
Cache read      $0.30 (0.10x)

Pay attention here: writing to cache costs 25% more than just processing the input without cache. You’re paying extra for the privilege of making it cheaper the next time.

It’s like paying for a Costco membership. The annual fee hurts. But if you shop enough, it pays off.

The problem is that “enough” depends on how many times you’re going to read from that cache before it expires.

When you lose money

Let’s say you send a 100K token prompt with cache_control. First request:

100,000 tokens × $3.75/M = $0.375  (cache write)

If you had sent that without caching:

100,000 tokens × $3.00/M = $0.300  (normal input)

You paid $0.075 extra. That’s 25% more. You lost money.

Now, second request with the same 100K token prefix:

100,000 tokens × $0.30/M = $0.030  (cache read)

Compared to:

100,000 tokens × $3.00/M = $0.300  (normal input without cache)

You saved $0.27. In just two requests, you’ve recouped the $0.075 extra write cost and come out ahead.

The breakeven point is at just 0.28 reads: the $0.75/M write premium divided by the $2.70/M you save on each read. (If you prefer to count the entire $3.75 write as the investment, reads pay it back after 1.4 of them.) In plain language: if you're going to reuse that prefix even once in the next 5 minutes, caching is worth it.
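The arithmetic fits in a few lines. A minimal sketch with the Sonnet prices from the table, showing both ways of framing the breakeven:

```python
# Breakeven sketch for Claude Sonnet prompt caching (prices from the table above).
NORMAL_IN = 3.00  # $/M tokens, normal input
WRITE = 3.75      # $/M tokens, cache write (1.25x)
READ = 0.30       # $/M tokens, cache read (0.10x)

saved_per_read = NORMAL_IN - READ  # $2.70/M saved on every cached read

# Reads needed to recover just the write *premium* (the extra $0.75/M):
premium_breakeven = (WRITE - NORMAL_IN) / saved_per_read

# Reads needed to recover the *entire* write cost:
full_write_breakeven = WRITE / saved_per_read

print(round(premium_breakeven, 2))     # 0.28 -> a single reuse already wins
print(round(full_write_breakeven, 2))  # 1.39 -> where the "1.4 reads" figure comes from
```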

Why it’s a no-brainer with Claude Code

In a Claude Code session, every message you send includes the system prompt, tools, and the entire conversation history. Message after message. Without caching, you’d pay $3.00/M for the same 150K tokens of context every time you type “change this button’s color.”

Nobody can afford that.

With prompt caching, you pay the write cost once, then read back at $0.30/M for the rest of the conversation. In a session with 50 messages and 150K of accumulated context, the difference is massive:

Without cache:  50 × 150,000 × $3.00/M = $22.50
With cache:     1 × 150,000 × $3.75/M + 49 × 150,000 × $0.30/M = $2.77

From $22.50 to $2.77. That’s an 88% savings. This is why Anthropic enables caching by default in Claude Code. Not doing so would be financially unsustainable.
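The session arithmetic above generalizes to any message count and context size. A rough sketch that makes the same simplification as the figures above (a fixed context block, rather than one that grows message by message):

```python
def session_cost(messages: int, context_tokens: int, cached: bool,
                 normal: float = 3.00, write: float = 3.75,
                 read: float = 0.30) -> float:
    """Cost in USD of resending the same context on every message.

    Simplification: the context is treated as one fixed block; in a real
    session it accumulates, so actual numbers land somewhere in between.
    """
    m = context_tokens / 1_000_000  # price units are $/M tokens
    if not cached:
        return messages * m * normal
    # One cache write up front, then cache reads for every later message.
    return m * write + (messages - 1) * m * read

print(round(session_cost(50, 150_000, cached=False), 2))  # 22.5
print(round(session_cost(50, 150_000, cached=True), 2))   # 2.77
```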

The counterintuitive part: cache reads make your token counter explode

Here’s the part that confuses everyone.

When you open your usage dashboard and see cacheReadInputTokens: 4,241,579,174, your brain says, “I’ve consumed four billion tokens.” And technically, that’s true: your account has processed those tokens. But not in the way you think.

A cache read doesn't recalculate the KV tensors. It pulls them from memory. It's vastly cheaper for Anthropic than processing a normal input, which is why they can afford to give you a 90% discount.

But the number looks huge compared to everything else. My real data for one month:

cacheReadInputTokens:    4,241,579,174  (95.5%)
cacheCreationInputTokens:  196,596,243  (4.4%)
inputTokens:                 1,293,019  (0.03%)
outputTokens:                2,517,666  (0.06%)

95.5% of the tokens flowing through my account are cache reads. If Anthropic decided to charge cache reads at normal input prices, my bill would be more than six times higher.

This has a practical consequence: when comparing your usage with others, cacheReadInputTokens is the most useless metric ever. Someone doing 200 short sessions and someone doing 10 long sessions can have radically different cache reads with the same actual cost.
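A more useful comparison is to price each category instead of counting tokens. A sketch using the monthly numbers above (the $15/M output price is my assumption for Sonnet; the input-side prices come from the table):

```python
PRICES = {  # $/M tokens; output price is an assumption, the rest are from the table
    "cacheReadInputTokens": 0.30,
    "cacheCreationInputTokens": 3.75,
    "inputTokens": 3.00,
    "outputTokens": 15.00,
}
usage = {
    "cacheReadInputTokens": 4_241_579_174,
    "cacheCreationInputTokens": 196_596_243,
    "inputTokens": 1_293_019,
    "outputTokens": 2_517_666,
}

cost = {k: n / 1e6 * PRICES[k] for k, n in usage.items()}
total = sum(cost.values())

# Largest spend first: token share and dollar share tell very different stories.
for k, dollars in sorted(cost.items(), key=lambda kv: -kv[1]):
    print(f"{k:26s} ${dollars:>8.2f}  ({dollars / total:.0%} of spend)")
```

Cache reads dominate the token count but come to only about 62% of the spend here; cache writes, a sliver of the tokens, are over a third of it. That asymmetry is exactly why raw token counts mislead.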

The three levers you control

If you use the API directly (not Claude Code, where Anthropic handles caching for you), there are three things you can do to optimize:

1. Structure your prompt from most to least stable

The cache works by prefix. If you change something at the start, it invalidates everything that follows. Here’s the golden rule:

[System prompt — never changes]
[Tool definitions — rarely changes]
[Reference documents — occasionally changes]
[Conversation history — grows with each message]
[User's latest message — always new]

If you put the user's message before the reference documents, you invalidate the reference cache on every turn: a clunky setup that costs you money on every request.

2. Use cache breakpoints wisely

Anthropic gives you up to 4 breakpoints (cache_control) per request. The temptation is to set one for every block. But extra breakpoints only pay off if the prefix each one marks is genuinely stable: a breakpoint that sits below content that changes triggers a fresh cache write on every request.

My recommendation: one breakpoint at the end of the system prompt and another at the end of the conversation history. Two anchor points, not four.
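Concretely, the two-anchor layout looks like this as a raw Messages API payload. A sketch only: the model id and every string are placeholders, and the cache_control placement is the part that matters.

```python
# Two cache anchors: one closing the stable system prompt, one closing
# the conversation history. All text content is placeholder.
payload = {
    "model": "claude-sonnet-4-5",  # placeholder id; substitute your model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. <long stable instructions>",
            "cache_control": {"type": "ephemeral"},  # anchor 1: end of system prompt
        }
    ],
    "messages": [
        {"role": "user", "content": "Earlier question..."},
        {"role": "assistant", "content": "Earlier answer..."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Latest message...",
                    "cache_control": {"type": "ephemeral"},  # anchor 2: end of history
                }
            ],
        },
    ],
}
```

On each new turn, anchor 2 moves to the newest message; everything above it is an unchanged prefix, so the previous turn's cache entry still matches.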

3. Respect the minimum cacheable size

The cache only kicks in if a block has at least 1,024 tokens (2,048 for Haiku). If your system prompt is 500 tokens, it won't be cached. You can check this in the API response: if both cache_creation_input_tokens and cache_read_input_tokens are 0, the block didn't meet the minimum.
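A small helper (my own, not part of any SDK) can classify what happened from the usage block that comes back with each response, using the field names as they appear in the response JSON:

```python
def cache_status(usage: dict) -> str:
    """Classify a Messages API response's usage block."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "cache write (first request, or the prefix changed)"
    return "no caching (below the minimum size, or no cache_control set)"

# A 500-token system prompt never reaches the 1,024-token minimum:
print(cache_status({"input_tokens": 500,
                    "cache_read_input_tokens": 0,
                    "cache_creation_input_tokens": 0}))
```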

What you can’t control (and shouldn’t worry about)

The cache TTL is 5 minutes in VRAM (standard tier). If your user takes 6 minutes to respond, there’s a cache miss, and everything is recalculated. You can’t change this unless you pay for extended cache (1-hour TTL, but at 2x the input price).

For interactive conversations — someone chatting with your bot — 5 minutes is usually enough. For batch processes with long pauses between requests, consider extended cache if your prompts are long. Do the math: does the 2x write cost pay off with enough reads?
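That math is short. A sketch assuming Sonnet prices, a 1-hour write at 2x the base input price, and pauses long enough that the standard 5-minute cache always misses (so the alternative is paying normal input every time):

```python
# $/M token prices: normal input, 1-hour cache write (2x base), cache read.
NORMAL, WRITE_1H, READ = 3.00, 6.00, 0.30

def extended_cache_saves(reads: int) -> float:
    """$/M tokens saved vs. normal input, for 1 write + `reads` cached reads."""
    without_cache = (1 + reads) * NORMAL      # every request pays full input price
    with_1h_cache = WRITE_1H + reads * READ   # one 2x write, then cheap reads
    return without_cache - with_1h_cache

print(round(extended_cache_saves(1), 2))  # -0.3: one read is not enough
print(round(extended_cache_saves(2), 2))  # 2.4: pays off from the second read
```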

The psychology of it

Here’s a side effect nobody talks about: the cache incentivizes you to make more requests, not fewer. Because you know that every additional request costs a fraction of the first one.

It’s the same behavior as unlimited mobile data plans. When you switched from paying per MB to unlimited, you started using more data. Not because you needed more, but because the marginal cost was zero.

With prompt caching, the marginal cost of an additional request (if the context is already cached) is almost just the output cost. And that changes your behavior. You start iterating faster. Asking smaller questions. Using the model like a rubber duck on steroids.

Is that a bad thing? It depends. If you have a spending cap or a quota like Claude Max, it can be a trap: the cost per request goes down, but the number of requests goes up. And quotas measure total usage, not marginal cost.

In summary

Prompt caching is one of those optimizations that works exactly the opposite of what your intuition expects:

  1. You pay more up front (cache writes cost 1.25x).
  2. You pay much less afterward (cache reads cost 0.10x).
  3. The breakeven point is ridiculously low (0.28 reads; a single reuse already puts you ahead).
  4. The usage numbers are deceptively huge (over 95% of your tokens are cheap cache reads).
  5. The order of your prompt matters more than you’d think (stable prefix = more cache hits).

If you're using Anthropic's API and not using prompt caching, you're wasting money. If you already use it but don't understand your bill, now you know why: caching charges you a 25% premium up front to save you 90% afterward. And once you run the numbers, it's the best deal you'll get.


Related: If you want to understand the internal machinery (KV tensors, VRAM, prefix hashing), read Why 99% of What You Send to Claude is Already Cached. And if you’re interested in how I estimate my quota without API access, check out Tokamak: Estimating Claude’s Quota Without the API.