Local-First LLM Routing: How We Cut Generation Costs to Near-Zero with Qwen 2.5 14B and OpenRouter Fallback

The Problem With Paying Per Token on Commodity Tasks

Most LLM cost conversations focus on prompt compression or model selection. The conversation that actually moved the needle for us was simpler: why are we paying per token for tasks a local GPU can handle for free?

Drafting, classification, summarization, light extraction. These are not tasks that require GPT-4o or Claude 3.5 Sonnet. They require a capable model and low latency. Once we framed it that way, the architecture became obvious.

What We Built

We run a local GPU serving Qwen 2.5 14B through an OpenAI-compatible endpoint. Every call that hits that endpoint looks identical to a call hitting OpenAI or OpenRouter from the application's perspective. The client code does not know or care which backend is responding.

The routing logic is straightforward:

If the requested model name maps to our local instance, send the call there first.
If the local endpoint returns an error or does not respond within 60 seconds, fall back automatically to OpenRouter.
If the call explicitly requests a premium model (GPT-4o, Claude, etc.), skip local entirely and route straight to OpenRouter.

That 60-second timeout is a real number we tuned. Long enough that a briefly loaded GPU finishes rather than bouncing to paid infrastructure. Short enough that a genuinely down instance does not stall the user.
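
One detail worth sketching: HTTP client timeouts are often applied per phase (connect, read, write) rather than to the whole call, so a hard wall-clock cap usually needs something extra. Below is a minimal sketch of that idea using asyncio.wait_for. The names with_deadline, primary, and fallback are hypothetical stand-ins for illustration, not our production code.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-ins: `primary` calls the local endpoint,
# `fallback` calls OpenRouter. Both return the completion text.
Backend = Callable[[], Awaitable[str]]


async def with_deadline(primary: Backend, fallback: Backend,
                        deadline_s: float = 60.0) -> str:
    """Try `primary` under a wall-clock deadline, else use `fallback`."""
    try:
        # wait_for cancels the primary call once the deadline passes.
        return await asyncio.wait_for(primary(), timeout=deadline_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Slow or unreachable local instance: hand the call to the paid path.
        return await fallback()
```

With deadline_s=60.0 this mirrors the rule above: a briefly loaded GPU gets time to finish, and a dead one costs the caller at most the deadline.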

The Routing Is Per-Call and Opt-In by Model Name

This is the part that made adoption inside our stack painless. We did not rewrite call sites. We changed an environment variable.

Before:

MODEL = "gpt-4o-mini"

After:

MODEL = "local/qwen2.5-14b"

The client library reads the model name, checks a routing table, and decides where the HTTP request goes. Reverting is one env change back. Migrating a new call is one env change forward. There is no deployment, no code review for the routing itself, no risk of breaking adjacent calls.
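
As a rough sketch of how a call site might pick up its model name, assuming one environment variable per call site: the variable name CLASSIFIER_MODEL, the helper names, and the "local/" prefix check are illustrative assumptions, not a fixed API.

```python
import os
from typing import Mapping

# Assumption: the pre-migration default from the example above.
DEFAULT_MODEL = "gpt-4o-mini"


def resolve_model(var: str, env: Mapping[str, str] = os.environ) -> str:
    """Read a call site's model name from the environment at call time."""
    return env.get(var, DEFAULT_MODEL)


def is_local_first(model: str) -> bool:
    """Assumed convention: names under the "local/" prefix try the GPU first."""
    return model.startswith("local/")
```

A call site would then do something like model = resolve_model("CLASSIFIER_MODEL"), and flipping that variable is the whole migration.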

The routing table looks roughly like this:

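ROUTING_TABLE = {
    # Local-first route: try the GPU box, then the hosted copy of the model.
    "local/qwen2.5-14b": {
        "primary": "http://local-gpu:8000/v1",
        "fallback": "https://openrouter.ai/api/v1",
        "fallback_model": "qwen/qwen-2.5-14b-instruct",  # model ID on the fallback side
        "timeout": 60,  # seconds
    },
    # Premium route: straight to OpenRouter, no fallback.
    # The key is sent as the model name, so it uses OpenRouter's
    # provider-prefixed ID.
    "openai/gpt-4o": {
        "primary": "https://openrouter.ai/api/v1",
        "fallback": None,
        "fallback_model": None,
        "timeout": 120,  # seconds
    },
}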
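
Because a table like this is edited by hand, a small startup check can catch a half-configured route before it takes traffic. The following is a sketch under our own assumptions: validate_routing_table and the sample entries are hypothetical, and the rules encode only what the table above implies (a fallback URL needs a fallback model, and timeouts must be positive).

```python
def validate_routing_table(table: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the table is sane."""
    problems = []
    for name, cfg in table.items():
        if not cfg.get("primary"):
            problems.append(f"{name}: missing primary URL")
        # A fallback URL without a mapped model name would fail at dispatch time.
        if cfg.get("fallback") and not cfg.get("fallback_model"):
            problems.append(f"{name}: fallback set but no fallback_model")
        if cfg.get("timeout", 0) <= 0:
            problems.append(f"{name}: timeout must be a positive number of seconds")
    return problems


# Illustrative entries only; the second one is deliberately broken.
SAMPLE_TABLE = {
    "local/qwen2.5-14b": {
        "primary": "http://local-gpu:8000/v1",
        "fallback": "https://openrouter.ai/api/v1",
        "fallback_model": "qwen/qwen-2.5-14b-instruct",
        "timeout": 60,
    },
    "local/broken-example": {
        "primary": "http://local-gpu:8000/v1",
        "fallback": "https://openrouter.ai/api/v1",
        "timeout": 60,
    },
}

for problem in validate_routing_table(SAMPLE_TABLE):
    print(problem)
```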

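The wrapper that handles the actual dispatch is about 40 lines of Python. It catches httpx.TimeoutException, connection failures, and any non-2xx response from the primary, then retries against the fallback with the mapped model name.

import httpx

async def routed_completion(model: str, messages: list, **kwargs):
    config = ROUTING_TABLE.get(model)
    if not config:
        raise ValueError(f"Unknown model: {model}")

    try:
        # call_openai_compatible is our thin HTTP helper (not shown). It is
        # expected to call raise_for_status(), so a non-2xx response surfaces
        # as httpx.HTTPStatusError.
        return await call_openai_compatible(
            base_url=config["primary"],
            model=model,
            messages=messages,
            timeout=config["timeout"],
            **kwargs,
        )
    except httpx.HTTPError:
        # httpx.HTTPError is the base class for timeouts, connection errors,
        # and HTTPStatusError, so one clause covers every fallback trigger.
        if not config.get("fallback"):
            raise  # premium routes have no fallback; surface the error
        return await call_openai_compatible(
            base_url=config["fallback"],
            model=config["fallback_model"],
            messages=messages,
            timeout=120,  # fixed timeout for the paid fallback
            **kwargs,
        )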

Nothing exotic. The value is in the pattern, not the code complexity.

Why Qwen 2.5 14B Specifically

Qwen 2.5 14B hits a practical sweet spot. At 14 billion parameters, it fits comfortably on a single consumer or prosumer GPU with quantization (we run Q4_K_M via llama.cpp, where the weights land around 9GB of VRAM, with the KV cache on top of that). Instruction-following quality on classification and drafting tasks is close enough to GPT-4o-mini that we have not had a task regress after migration.

The OpenAI-compatible endpoint comes from llama.cpp's server mode, started with:

./llama-server \
  --model qwen2.5-14b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --n-gpu-layers 99

That --n-gpu-layers 99 offloads all layers to GPU: 99 is simply larger than the model's layer count, so nothing stays on the CPU. The --ctx-size 8192 is tuned to our longest classification prompts with room to spare.
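
The context size also sets how much VRAM the KV cache needs on top of the weights. Here is a back-of-envelope sketch of that arithmetic. The architecture numbers (48 layers, 8 KV heads, head dimension 128) are our reading of the Qwen 2.5 14B model card and should be checked against your GGUF metadata, and the 2 bytes per element assumes a 16-bit cache. Compute buffers are not included.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Assumed Qwen 2.5 14B shape; verify against the model card or GGUF metadata.
estimate = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_tokens=8192)
print(f"{estimate / 2**30:.2f} GiB")  # prints "1.50 GiB"
```

Under those assumptions an 8192-token context adds roughly 1.5 GiB, which is why doubling --ctx-size is not free on a card that is already close to full.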

What OpenRouter Is Actually For

OpenRouter serves two roles in this setup. First, it is the fallback when the local GPU is unavailable. Second, it is the explicit path for tasks that genuinely need a frontier model, things like complex multi-step reasoning, long-context synthesis over 8K tokens, or customer-facing outputs where we want the extra quality margin.

Because the routing is per-call, we can make that judgment at the call site without touching infrastructure. A classification job gets local/qwen2.5-14b. A contract analysis job gets anthropic/claude-3.5-sonnet. Both go through the same wrapper.

OpenRouter's unified API surface is what makes this clean. We are not maintaining separate SDK integrations for Anthropic, OpenAI, and our local instance. One base URL, one API key, one response schema for the fallback and premium paths.

The Cost Picture

Local handles the bulk of volume at zero marginal cost. The GPU is already running, already paid for. Every token generated locally is free at the margin.

OpenRouter only gets traffic when the GPU is down (rare, maybe a few hours a month during maintenance) or when a call explicitly requests a premium model. That second category is intentional spend, not waste. The first category is small enough that the fallback cost is negligible.

We have not published specific dollar figures because the savings depend entirely on your volume and what you were paying before. But the structure is sound: if you have a GPU and a meaningful volume of commodity LLM calls, the marginal cost of those calls drops to near-zero. The math is not complicated.
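
To make that math concrete, here is a small sketch you can plug your own numbers into. Every figure in it is hypothetical (500 million tokens a month, a blended $0.50 per million tokens, a 2% fallback share); none of them are our numbers, and the function name is illustrative.

```python
def monthly_api_spend(tokens_per_month: float, usd_per_million_tokens: float,
                      paid_share: float = 1.0) -> float:
    """Per-token API spend for the share of traffic that reaches a paid backend."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens * paid_share


# Hypothetical inputs for illustration only.
before = monthly_api_spend(500_000_000, 0.50)                   # all traffic paid
after = monthly_api_spend(500_000_000, 0.50, paid_share=0.02)   # only fallbacks paid
print(f"before: ${before:.2f}, after: ${after:.2f}")
```

The interesting variable is paid_share: the lower your fallback rate, the closer the commodity bill gets to zero. Calls that explicitly request a premium model are budgeted separately.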

Operational Considerations

A few things worth knowing before you copy this pattern:

Cold start latency. llama.cpp loads the model into VRAM on startup, not per-request. Keep the server running continuously. A systemd service or Docker restart policy handles this.
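Concurrency limits. A single llama.cpp instance processes one request at a time by default. For higher concurrency, raise llama-server's --parallel slot count (the slots draw on the same context budget, so size --ctx-size accordingly), run multiple instances behind a simple round-robin proxy, or use a server built for batched decoding.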
Monitoring the fallback rate. Log every fallback. If OpenRouter starts getting 20% of your local-intended traffic, something is wrong with the GPU instance, not the routing logic.
Model drift. Local Qwen 2.5 14B and the hosted version on OpenRouter may differ slightly in behavior. Test your prompts against both before treating them as interchangeable fallbacks.
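
For the fallback-rate point, a counter like the following is enough to start with. This is a minimal in-process sketch: the class name and the 20% threshold (taken from the rule of thumb above) are assumptions, and in practice you would likely export these counts to whatever metrics system you already run.

```python
from dataclasses import dataclass


@dataclass
class FallbackStats:
    """Counts local-intended calls and how many ended up on the fallback."""
    local_ok: int = 0
    fell_back: int = 0

    def record(self, used_fallback: bool) -> None:
        if used_fallback:
            self.fell_back += 1
        else:
            self.local_ok += 1

    @property
    def rate(self) -> float:
        total = self.local_ok + self.fell_back
        return self.fell_back / total if total else 0.0

    def unhealthy(self, threshold: float = 0.20) -> bool:
        # A sustained rate at or above the threshold points at the GPU box.
        return self.rate >= threshold
```

Call record() once per local-intended request from inside the dispatch wrapper, and alert when unhealthy() stays true over a window rather than on a single spike.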

This Is the Stack We Run at Savage Digital Solutions

At Savage Digital Solutions (savagesolutions.io), this routing pattern is live across our AI-assisted workflows. It is not a prototype. The 60-second timeout, the model-name-based opt-in, the OpenRouter fallback, all of it is in production. We built it because paying per token for classification felt like leaving money on the table once we had the GPU anyway.

If you are already running local inference for any reason, adding this routing layer costs you an afternoon and pays back immediately.

Key Takeaways

Route commodity LLM calls (drafting, classification) to a local GPU running Qwen 2.5 14B via an OpenAI-compatible endpoint. Local inference at zero marginal cost is the goal.
Use a 60-second timeout on the local endpoint before falling back to OpenRouter. This number matters: too short and you pay for fallbacks the GPU would have served, too long and you stall users.
Make routing opt-in by model name. Migrating a call is one environment variable change. Reverting is the same.
OpenRouter serves two distinct roles: fallback for local failures, and explicit path for premium models. Keep those roles separate in your routing table.
Run Qwen 2.5 14B with llama.cpp in server mode, Q4_K_M quantization, all layers on GPU. This fits on a single consumer GPU and handles most commodity tasks without quality regression.
Log every fallback. A rising fallback rate signals infrastructure problems, not routing problems.
The pattern generalizes to any OpenAI-compatible local server. The specific models and timeout are tunable. The structure is not.
