Routing & Tier-Schema
The stack exposes models through a six-tier slug-naming convention (ADR 0008 v2 nach Cut 2.14). Pick the tier that matches your intent.
Tier overview
| Tier | Form | Use when | Example |
|---|---|---|---|
| T0 | <model>-local / <model>-cloud | "I need force-on-prem" / "I need force-cloud" — no cross-domain fallback | llama-4-scout-local, llama-4-scout-cloud |
| T1 | <model> (unprefixed) | "give me the best you have, fall back if needed" | llama-4-scout |
| T2 | <model>-fp4 / -fp8 / -bf16 | "I need this exact quantization" (cost/quality trade-off) | llama-4-scout-fp8 |
| T3 | <model>-cloud-1 / -cloud-2 / -cloud-3 | "I want this specific cloud-slot only" | llama-4-scout-cloud-2 |
| T4 | <model>-spark1 / -station / -spark2 | operator-only, hardware-direct (not in /v1/info) | llama-4-scout-spark1 |
| T5 | <vendor>-<family>-latest | "current best from this vendor + family, my code stays unchanged across version bumps" | anthropic-claude-opus-latest |
T1 is the recommended default for casual use. T0 gives strong on-prem / cloud guarantees (e.g. data-residency policies). T2/T3 give explicit quant or provider-slot control. T4 exists for operator debug. T5 is for clients that want to track a vendor's latest version without code edits.
T0 — Force-Backend (Cut 2.14)
T0 aliases declare a hard isolation boundary:
<model>-local— chains primary local backend → secondary local backend. Never falls back to cloud. If both local backends fail, returns 5xx. Use case: GDPR / data-residency requirement that prompts must not leave the on-prem infrastructure.<model>-cloud— chains cloud-1 (Ollama Cloud, preferred) → cloud-2 (OpenRouter) → cloud-3 (NVIDIA NIM). Never falls back to local. Use case: operator policy mandating third-party infrastructure, or caller wants higher-quality cloud model and is willing to accept the external network dependency.
T0 differs from T1 in semantics: T1 is "best effort, will use whatever's up". T0 is "this domain only, fail rather than cross domains".
T5 — Vendor-pinned floating aliases (Cut 2.14)
T5 lets you say "anthropic-claude-opus-latest" instead of pinning your
client to claude-opus-4.7 and having to edit it on every release. The
operator periodically bumps the pin in litellm/config.yaml; your code
stays unchanged.
Each T5 alias resolves to a single concrete provider model at any given
moment — see /v1/info backends[0].notes for the current pin and date.
Initial pin-targets (2026-05-03):
| T5 Alias | Currently pins to |
|---|---|
anthropic-claude-opus-latest | claude-opus-4.7 |
google-gemini-pro-latest | gemini-3.1-pro-preview |
google-gemini-flash-latest | gemini-2.5-flash |
google-nano-banana-latest | gemini-3.1-flash-image-preview-20260226 |
openai-gpt-latest | gpt-5.5-pro |
For deterministic version-pinning use the unprefixed T1 alias directly.
T5 Cross-Vendor Fallback Cascade (Cut 2.23d)
If the primary provider for a T5 alias is out-of-quota or returns an error, the router automatically tries the other two flagship providers before emitting model_quota_exhausted:
openai-gpt-latest → [anthropic-claude-opus-latest, google-gemini-pro-latest]
anthropic-claude-opus-latest → [openai-gpt-latest, google-gemini-pro-latest]
google-gemini-pro-latest → [anthropic-claude-opus-latest, openai-gpt-latest]
Cross-vendor fallback is a best-effort 429-avoidance — the caller gets some answer when the primary is blocked. If you need strict vendor pinning (e.g. for compliance reasons), pass the underlying concrete-version alias instead of the *-latest form.
max_tokens-Default-Cap (Cut 2.23d): T5-cloud-aliases (*-latest, flagship) without an explicit max_tokens get silently capped at 4096. This stops upstream providers from reserving the model maximum (e.g. GPT 65536) against tight daily-credit budgets. Set max_tokens explicitly in your request to override.
What "primary" means in T1
For each T1 alias, /v1/info lists backends[] in the order the stack
will try them. Position 0 is the primary — that's what the
Spass-Resolved-Backend response-header reflects when the call succeeds
on first try. Position 1+ is the fallback chain.
For llama-4-scout (Stand 2026-05-03):
backends[0] local-fp8 (primary — Station, redundant 4× A100)
backends[1] local-fp4 (fallback — Spark1, single GB10)
backends[2] cloud-1 (fallback — currently mis-configured, see /v1/info notes)
backends[3] cloud-2 (fallback — open-router-like)
backends[4] cloud-3 (deprecated)
The order is stability-priority (Station-first per ADR 0008): the more redundant local backend gets primary, single-GPU is the local fallback, clouds are last.
Decision-tree: which tier should I send?
Do you have a strict on-prem / DSGVO requirement?
└── yes → T0 -local
Do you have a strict third-party / cloud-only requirement?
└── yes → T0 -cloud
Do you need cost-zero local serving (best effort)?
├── yes, no preference on quant → T1 unprefixed (Spass-Resolved-Reason: primary-up)
├── yes, want FP8 only → T2 -fp8
└── yes, want FP4 only → T2 -fp4
Do you need a specific cloud (cost-tracking, vendor-pin)?
└── pick T3 -cloud-N
Do you want "vendor's latest" without re-pinning your client code?
└── T5 <vendor>-<family>-latest
Are you an operator debugging a specific node?
└── T4 -spark1 / -station (only via direct litellm:4000, not exposed in /v1/info)
Default for casual integrators?
└── T1 unprefixed
What Spass-Resolved-Reason tells you
The header explains why the resolved-backend is what it is:
| Reason | Meaning | Action for caller |
|---|---|---|
primary-up | T1 alias, primary backend served (the happy path) | nothing — expected |
primary-down-fallback | T1 alias, primary failed, fallback served | log + maybe retry later — primary is being repaired |
quant-pin-explicit | T2 quant-tier pick respected | nothing — caller asked for this |
cloud-pin-explicit | T3 cloud-tier pick respected | nothing — caller asked for this |
hardware-pin-explicit | T4 hardware-direct (operator) | operator-context only |
force-backend-explicit | T0 force-tier pick respected (Cut 2.14) | nothing — caller asked for strict on-prem or strict cloud |
unknown | Streaming response (we can't read the body) OR model not in catalog | accept the uncertainty |
Deprecated suffixes (removed in Cut 2.14)
The pre-ADR-0008-v2 backward-compat suffixes have been removed. They
return 400 model_not_in_allowlist now.
| Removed slug | Migrate to |
|---|---|
<model>-ollama | <model>-cloud-1 |
<model>-openrouter | <model>-cloud-2 |
<model>-openrouter-free | <model>-cloud-2-free |
<model>-nim | <model>-cloud-3 |
<model>-local is kept, but reclassified from "backward-compat
alias" to "formal T0 force-tier" with new semantics (chains primary →
secondary local, no cloud-fallback).
See also
- Response Headers — full header reference
- Models — capabilities + constraints per alias
- Quickstart — first call walkthrough