Prompt Caching Primer: The Easiest 30% Cost Reduction You're Leaving on the Table

Oleg Balakirev
Mar 20, 2026 · 6 min read

Shared system prompts and repeated prefixes get rebilled on every call. Prompt caching removes most of that waste with little or no code change.

Prompt caching is the lowest-effort, highest-return optimization available to most agentic teams. It requires no changes to your agent logic, no new infrastructure, and typically delivers a 25–40% cost reduction from day one.

And yet most teams aren't using it.

What prompt caching is

When you make an LLM call, you're billed for every input token — including your system prompt, any shared context, and the instruction prefix that's identical across every call. If you're running 1,000 agent iterations per day with a 2,000-token system prompt, you're paying for those 2,000 tokens 1,000 times. Prompt caching stores the KV state for static prompt segments and reuses it across calls — so you pay for the cache write once, and cache reads at a fraction of the full input price.

Anthropic cache pricing

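Cache writes: 1.25x the standard input token price for the default 5-minute cache. Cache reads: ~10% of the standard input token price. For a 2,000-token system prompt running 1,000 times/day, caching reduces that segment from $6.00/day to ~$0.60/day (figures that imply an input price of $3 per million tokens), a 90% reduction on that segment alone. That figure assumes calls arrive often enough to keep the cache warm: an entry expires after 5 minutes without a hit, and each expiry means paying for another write, so sparse traffic sees a smaller saving.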
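
The arithmetic behind those figures can be sketched in a few lines. This is a rough model, not a billing calculator: the function name and parameters are made up for illustration, the $3 per million tokens price is the one implied by the $6.00/day figure, and the multipliers assume Anthropic's default 5-minute cache (writes at 1.25x input price, reads at 0.1x).

```python
def daily_segment_cost(
    segment_tokens: int,
    calls_per_day: int,
    input_price_per_mtok: float,
    cache_writes_per_day: int = 0,
    write_multiplier: float = 1.25,  # assumed: 5-minute cache write premium
    read_multiplier: float = 0.10,   # assumed: cache read discount
) -> float:
    """Daily cost in dollars of one static prompt segment.

    cache_writes_per_day == 0 means caching is off: every call pays full price.
    """
    per_call = segment_tokens / 1_000_000 * input_price_per_mtok
    if cache_writes_per_day == 0:
        return per_call * calls_per_day
    # Each write is a call that missed the cache; every other call is a read.
    writes = min(cache_writes_per_day, calls_per_day)
    reads = calls_per_day - writes
    return per_call * (writes * write_multiplier + reads * read_multiplier)


uncached = daily_segment_cost(2_000, 1_000, 3.00)
warm = daily_segment_cost(2_000, 1_000, 3.00, cache_writes_per_day=1)
sparse = daily_segment_cost(2_000, 1_000, 3.00, cache_writes_per_day=200)
print(f"uncached: ${uncached:.2f}  warm: ${warm:.2f}  sparse: ${sparse:.2f}")
```

The "sparse" case models traffic where the cache expires 200 times a day. The segment still costs far less than uncached, but the saving drops from roughly 90% to roughly two thirds.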

What you can and can't cache

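  • System prompts: Almost always static. The best candidate for caching.
  • Tool definitions: If you're using function calling with large tool schemas, those can be cached. They sit at the very front of the cached prefix, so any change to them invalidates everything cached after them.
  • Shared instruction prefixes: Any text that appears at the start of every call in a given workflow.
  • RAG retrieved chunks: Cacheable only when the same chunks appear in the same order behind an identical prefix, because the cache matches prefixes exactly. Works best for high-frequency knowledge bases.
  • The newest user message: Not worth caching (it's dynamic by definition). Earlier turns of a conversation form a stable prefix and can be cached.
  • The current reasoning step: Not worth caching (it changes every iteration), though steps already completed can be cached as part of the prefix.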
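
Because the cache matches on an exact prefix, the practical rule is to put everything static first and everything dynamic last. Below is a minimal sketch of that layout as a plain request payload. The helper name `build_request` and the sample tool are hypothetical, and `model` and `max_tokens` are omitted for brevity. It assumes Anthropic's documented prefix order (tools, then system, then messages) and that a `cache_control` marker may be placed on a tool definition as well as on a system block.

```python
from typing import Any

CACHE_MARK = {"type": "ephemeral"}


def build_request(
    tools: list[dict[str, Any]],
    system_prompt: str,
    user_message: str,
) -> dict[str, Any]:
    """Arrange a request so every static part precedes the dynamic part."""
    # Copy each tool so the caller's list is not mutated, then mark the last
    # one: the marker covers the whole tool list up to that point.
    marked_tools = [dict(tool) for tool in tools]
    if marked_tools:
        marked_tools[-1]["cache_control"] = CACHE_MARK
    # A second marker on the system prompt caches tools + system together.
    system = [{"type": "text", "text": system_prompt, "cache_control": CACHE_MARK}]
    # The user message goes last and carries no marker: it changes every call.
    messages = [{"role": "user", "content": user_message}]
    return {"tools": marked_tools, "system": system, "messages": messages}


req = build_request(
    tools=[{
        "name": "search_docs",  # hypothetical tool
        "description": "Search the product documentation.",
        "input_schema": {"type": "object", "properties": {}},
    }],
    system_prompt="You are a support agent.",
    user_message="Where is my order?",
)
```

Two markers are used here so that the tool list can stay cached even if the system prompt is edited. A single marker on the system block would also cache both, as one unit.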

How to implement it

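With Anthropic's API, prompt caching is enabled by adding a cache_control parameter to the content blocks you want cached. The marker caches the entire prompt prefix up to and including that block, so static content must come first, and prompts shorter than the model's minimum cacheable length (1,024 tokens on many models, higher on some) are processed without caching. TokenAxe's proxy handles this automatically: it identifies cacheable segments in your prompts and injects the cache_control parameter without any code changes on your end.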

python
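# Direct API usage with caching
# MODEL, SYSTEM_PROMPT, and user_message are defined by your application
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = anthropic_client.messages.create(
    model=MODEL,
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # cache the prefix up to and including this block
        }
    ],
    messages=[
        {"role": "user", "content": user_message}  # not cached (dynamic)
    ],
)

# With TokenAxe proxy: automatic, no changes needed
import tokenaxe
client = tokenaxe.wrap(anthropic_client)  # caching handled automatically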

Start with your system prompt. It takes 10 minutes to implement and typically pays back in reduced bills within the same billing cycle.
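
Once caching is on, it is worth confirming that calls actually hit the cache. Anthropic's responses report `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens` in the usage data. The helper below is a small sketch that turns those three counters into a single ratio. The function name is made up and the sample numbers are illustrative, not measured. It takes a plain dict, so with the Python SDK you would convert the usage object to a dict first.

```python
def cache_read_share(usage: dict[str, int]) -> float:
    """Fraction of a call's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)  # tokens after the last cache marker
    total = read + written + uncached
    return read / total if total else 0.0


# Illustrative usage payloads: the first call writes the cache, the second reads it.
first_call = {"input_tokens": 50, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0}
second_call = {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}

print(cache_read_share(first_call))
print(round(cache_read_share(second_call), 3))
```

A share that stays near zero across repeated calls usually means the prefix is changing between calls, the cache is expiring between calls, or the prompt is below the minimum cacheable length.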
