The Problem With Paying Per Token on Commodity Tasks
Most LLM cost conversations focus on prompt compression or model selection. The conversation that actually moved the needle for us was simpler: why are we paying per token for tasks a local GPU can handle for free?
Drafting, classification, summarization, light extraction. These are not tasks that require GPT-4o or Claude 3.5 Sonnet. They require a capable model and low latency. Once we framed it that way, the architecture became obvious.
What We Built
We run a local GPU serving Qwen 2.5 14B through an OpenAI-compatible endpoint. Every call that hits that endpoint looks identical to a call hitting OpenAI or OpenRouter from the application's perspective. The client code does not know or care which backend is responding.
The routing logic is straightforward:
- If the requested model name maps to our local instance, send the call there first.
- If the local endpoint returns an error or does not respond within 60 seconds, fall back automatically to OpenRouter.
- If the call explicitly requests a premium model (GPT-4o, Claude, etc.), skip local entirely and route straight to OpenRouter.
That 60-second timeout is a real number we tuned. Long enough that a briefly loaded GPU finishes rather than bouncing to paid infrastructure. Short enough that a genuinely down instance does not stall the user.
The Routing Is Per-Call and Opt-In by Model Name
This is the part that made adoption inside our stack painless. We did not rewrite call sites. We changed an environment variable.
Before:
MODEL = "gpt-4o-mini"
After:
MODEL = "local/qwen2.5-14b"
The client library reads the model name, checks a routing table, and decides where the HTTP request goes. Reverting is one env change back. Migrating a new call is one env change forward. There is no deployment, no code review for the routing itself, no risk of breaking adjacent calls.
The routing table looks roughly like this:
ROUTING_TABLE = {
"local/qwen2.5-14b": {
"primary": "http://local-gpu:8000/v1",
"fallback": "https://openrouter.ai/api/v1",
"fallback_model": "qwen/qwen-2.5-14b-instruct",
"timeout": 60,
},
"gpt-4o": {
"primary": "https://openrouter.ai/api/v1",
"fallback": None,
"timeout": 120,
},
}
The wrapper that handles the actual dispatch is about 40 lines of Python. It catches httpx.TimeoutException and any non-2xx response from the primary, then retries against the fallback with the mapped model name.
async def routed_completion(model: str, messages: list, **kwargs):
config = ROUTING_TABLE.get(model)
if not config:
raise ValueError(f"Unknown model: {model}")
try:
response = await call_openai_compatible(
base_url=config["primary"],
model=model,
messages=messages,
timeout=config["timeout"],
**kwargs,
)
return response
except (TimeoutError, APIError):
if not config["fallback"]:
raise
return await call_openai_compatible(
base_url=config["fallback"],
model=config["fallback_model"],
messages=messages,
timeout=120,
**kwargs,
)
Nothing exotic. The value is in the pattern, not the code complexity.
Why Qwen 2.5 14B Specifically
Qwen 2.5 14B hits a practical sweet spot. At 14 billion parameters, it fits comfortably on a single consumer or prosumer GPU with quantization (we run Q4_K_M via llama.cpp, which lands around 9GB VRAM). Instruction-following quality on classification and drafting tasks is close enough to GPT-4o-mini that we have not had a task regress after migration.
The OpenAI-compatible endpoint comes from llama.cpp's server mode, started with:
./llama-server \
--model qwen2.5-14b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 8192 \
--n-gpu-layers 99
That --n-gpu-layers 99 offloads all layers to GPU. The --ctx-size 8192 is tuned to our longest classification prompts with room to spare.
What OpenRouter Is Actually For
OpenRouter serves two roles in this setup. First, it is the fallback when the local GPU is unavailable. Second, it is the explicit path for tasks that genuinely need a frontier model, things like complex multi-step reasoning, long-context synthesis over 8K tokens, or customer-facing outputs where we want the extra quality margin.
Because the routing is per-call, we can make that judgment at the call site without touching infrastructure. A classification job gets local/qwen2.5-14b. A contract analysis job gets anthropic/claude-3.5-sonnet. Both go through the same wrapper.
OpenRouter's unified API surface is what makes this clean. We are not maintaining separate SDK integrations for Anthropic, OpenAI, and our local instance. One base URL, one API key, one response schema for the fallback and premium paths.
The Cost Picture
Local handles the bulk of volume at zero marginal cost. The GPU is already running, already paid for. Every token generated locally is free at the margin.
OpenRouter only gets traffic when the GPU is down (rare, maybe a few hours a month during maintenance) or when a call explicitly requests a premium model. That second category is intentional spend, not waste. The first category is small enough that the fallback cost is negligible.
We have not published specific dollar figures because the savings depend entirely on your volume and what you were paying before. But the structure is sound: if you have a GPU and a meaningful volume of commodity LLM calls, the marginal cost of those calls drops to near-zero. The math is not complicated.
Operational Considerations
A few things worth knowing before you copy this pattern:
This Is the Stack We Run at Savage Digital Solutions
At Savage Digital Solutions (savagesolutions.io), this routing pattern is live across our AI-assisted workflows. It is not a prototype. The 60-second timeout, the model-name-based opt-in, the OpenRouter fallback, all of it is in production. We built it because paying per token for classification felt like leaving money on the table once we had the GPU anyway.
If you are already running local inference for any reason, adding this routing layer costs you an afternoon and pays back immediately.
