TokenAxe — Stop Tokenmaxxing. Start Saving.

Oleg Balakirev

Apr 2, 20269 min read

Not every task needs your most powerful model. A routing layer that matches task complexity to model capability can cut costs 60–70% with zero performance loss.

The default behavior of most agentic systems is to route every call to the same flagship model. It's easy, it's consistent, and it's extremely expensive. Model routing fixes that.

The insight is simple: different subtasks within an agentic workflow have radically different complexity requirements. Formatting output, querying structured data, writing simple summaries, and performing basic classification don't need GPT-4o. They need a model that's fast, cheap, and good enough — which describes a large portion of the current model landscape.

The routing decision matrix

We've developed a routing matrix across thousands of agentic task samples. The key variables are: reasoning depth required, output creativity required, context length, and latency tolerance. Tasks that score low on the first two and have short contexts are almost always safe to route to a smaller model.

Classification tasks → Route to claude-haiku-4-5, GPT-3.5-turbo, or Mistral-7B. Cost reduction: 85–95%.
Summarization → claude-haiku-4-5 or GPT-4o-mini. Quality parity on most document types. Cost reduction: 75–85%.
Simple code generation → GPT-4o-mini handles the majority of CRUD, boilerplate, and templating tasks. Cost reduction: 70%.
Multi-step reasoning, novel problem solving → Keep on GPT-4o or Claude Sonnet/Opus. Don't route away.
Long-context synthesis (>50k tokens) → Claude Sonnet/Opus for cost efficiency on long context vs GPT-4o.

How to implement routing without breaking things

The biggest risk in routing is silent quality degradation. An agent that routes a task to a smaller model and produces a subtly wrong output may not surface as an error — it surfaces as a downstream failure that's hard to trace.

The safest approach is confidence-based routing with fallback. Route to a smaller model, evaluate the output against a simple rubric (does it meet format requirements? does it satisfy the task spec?), and fall back to the full model on failure. The fallback rate tells you how much headroom you have to expand routing.

python

# TokenAxe routing config
routing_policy = {
    "classification": "claude-haiku-4-5",
    "summarization":  "gpt-4o-mini",
    "code_simple":    "gpt-4o-mini",
    "code_complex":   "gpt-4o",
    "reasoning":      "claude-sonnet-4-6",
    "fallback_on_confidence_below": 0.75,
    "log_routing_decisions": True,
}

Real-world results

Across teams using TokenAxe's routing layer, the average cost reduction from routing alone is 62% — with an average fallback rate of 8%. That means 92% of routed calls succeed on the smaller model, and 8% fall back to the full model. The net quality impact, measured by task success rate, is typically indistinguishable from full-model routing.

“Routing is not about using worse models. It's about using the right model for the right job. Most jobs don't need a PhD.”
— Oleg Balakirev

Ready to stop the loop?

TokenAxe gives you real-time visibility and automatic optimization. Free to start.

Get started free

Intelligent Model Routing: Use GPT-4o Where It Matters, Save Everywhere Else

The routing decision matrix

How to implement routing without breaking things

Real-world results

Ready to stop the loop?

Related articles

Context Pruning Without Quality Loss: A Practical Guide

Prompt Caching Primer: The Easiest 30% Cost Reduction You're Leaving on the Table