Routing & Tier-Schema

The stack exposes models through a six-tier slug-naming convention (ADR 0008 v2, as of Cut 2.14). Pick the tier that matches your intent.

Tier overview

Tier	Form	Use when	Example
T0	`<model>-local` / `<model>-cloud`	"I need force-on-prem" / "I need force-cloud" — no cross-domain fallback	`llama-4-scout-local`, `llama-4-scout-cloud`
T1	`<model>` (unprefixed)	"give me the best you have, fall back if needed"	`llama-4-scout`
T2	`<model>-fp4` / `-fp8` / `-bf16`	"I need this exact quantization" (cost/quality trade-off)	`llama-4-scout-fp8`
T3	`<model>-cloud-1` / `-cloud-2` / `-cloud-3`	"I want this specific cloud-slot only"	`llama-4-scout-cloud-2`
T4	`<model>-spark1` / `-station` / `-spark2`	operator-only, hardware-direct (not in `/v1/info`)	`llama-4-scout-spark1`
T5	`<vendor>-<family>-latest`	"current best from this vendor + family, my code stays unchanged across version bumps"	`anthropic-claude-opus-latest`

T1 is the recommended default for casual use. T0 gives strong on-prem / cloud guarantees (e.g. data-residency policies). T2/T3 give explicit quant or provider-slot control. T4 exists for operator debug. T5 is for clients that want to track a vendor's latest version without code edits.
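As a rough illustration of the naming convention, the sketch below guesses a slug's tier from its suffix alone. The function name `classify_tier` and the regular expressions are assumptions derived from the "Form" column of the table; the real router may resolve slugs through a catalog lookup instead, and variants not listed in the table are not handled.

```python
import re

# Suffix patterns derived from the "Form" column of the tier table.
# Hypothetical helper: not part of the stack's API.
_TIER_PATTERNS = [
    ("T5", re.compile(r"^[^-]+-.+-latest$")),          # <vendor>-<family>-latest
    ("T4", re.compile(r"-(spark1|spark2|station)$")),  # hardware-direct
    ("T3", re.compile(r"-cloud-[123]$")),              # specific cloud slot
    ("T2", re.compile(r"-(fp4|fp8|bf16)$")),           # explicit quantization
    ("T0", re.compile(r"-(local|cloud)$")),            # force-backend
]


def classify_tier(slug: str) -> str:
    """Return the tier label implied by the slug's shape, defaulting to T1."""
    for tier, pattern in _TIER_PATTERNS:
        if pattern.search(slug):
            return tier
    return "T1"  # unprefixed model name
```

A slug that matches none of the suffixes is treated as T1, which mirrors the "unprefixed" form in the table.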

T0 — Force-Backend (Cut 2.14)

T0 aliases declare a hard isolation boundary:

<model>-local: chains primary local backend → secondary local backend. Never falls back to cloud. If both local backends fail, returns 5xx. Use case: a GDPR / data-residency requirement that prompts must not leave the on-prem infrastructure.
<model>-cloud: chains cloud-1 (Ollama Cloud, preferred) → cloud-2 (OpenRouter) → cloud-3 (NVIDIA NIM). Never falls back to local. Use case: an operator policy mandating third-party infrastructure, or a caller who wants a higher-quality cloud model and is willing to accept the external network dependency.

T0 differs from T1 in semantics: T1 is "best effort, will use whatever's up". T0 is "this domain only, fail rather than cross domains".
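The "fail rather than cross domains" rule can be sketched as a small simulation. Everything here is illustrative: the backend labels, the `call_backend` callable, and both exception classes are hypothetical stand-ins, not the router's actual implementation.

```python
from typing import Callable

# Chain order follows the T0 description; the labels are illustrative.
T0_CHAINS = {
    "-local": ["local-primary", "local-secondary"],
    "-cloud": ["cloud-1", "cloud-2", "cloud-3"],
}


class BackendDown(Exception):
    """Hypothetical error raised when a single backend fails."""


class DomainExhausted(Exception):
    """Hypothetical stand-in for the error returned when a whole chain fails."""


def serve_t0(slug: str, call_backend: Callable[[str], str]) -> str:
    """Try each backend of the slug's domain in order; never leave the domain."""
    for suffix, chain in T0_CHAINS.items():
        if slug.endswith(suffix):
            break
    else:
        raise ValueError(f"{slug!r} is not a T0 alias")
    for backend in chain:
        try:
            return call_backend(backend)
        except BackendDown:
            continue  # next backend, still inside the same domain
    raise DomainExhausted(f"all {suffix[1:]} backends failed for {slug!r}")
```

Note that the loop only ever iterates over one chain, so a `-local` request cannot reach a cloud backend even when every local backend is down.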

T5 — Vendor-pinned floating aliases (Cut 2.14)

T5 lets you say "anthropic-claude-opus-latest" instead of pinning your client to claude-opus-4.7 and having to edit it on every release. The operator periodically bumps the pin in litellm/config.yaml; your code stays unchanged.

Each T5 alias resolves to a single concrete provider model at any given moment — see /v1/info backends[0].notes for the current pin and date.

Initial pin-targets (2026-05-03):

T5 Alias	Currently pins to
`anthropic-claude-opus-latest`	`claude-opus-4.7`
`google-gemini-pro-latest`	`gemini-3.1-pro-preview`
`google-gemini-flash-latest`	`gemini-2.5-flash`
`google-nano-banana-latest`	`gemini-3.1-flash-image-preview-20260226`
`openai-gpt-latest`	`gpt-5.5-pro`

For deterministic version pinning, use the concrete-version (unprefixed T1) alias directly, e.g. claude-opus-4.7.

T5 Cross-Vendor Fallback Cascade (Cut 2.23d)

If the primary provider for a T5 alias is out-of-quota or returns an error, the router automatically tries the other two flagship providers before emitting model_quota_exhausted:

openai-gpt-latest         → [anthropic-claude-opus-latest, google-gemini-pro-latest]
anthropic-claude-opus-latest → [openai-gpt-latest, google-gemini-pro-latest]
google-gemini-pro-latest  → [anthropic-claude-opus-latest, openai-gpt-latest]

Cross-vendor fallback is best-effort 429 avoidance: the caller still gets an answer when the primary is blocked, possibly from a different vendor. If you need strict vendor pinning (e.g. for compliance reasons), pass the underlying concrete-version alias instead of the *-latest form.
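The cascade can be pictured with the following minimal sketch. The fallback table is copied from the listing above; `call_provider`, `ProviderError`, and `ModelQuotaExhausted` are hypothetical names introduced for illustration, and aliases without a listed cascade are assumed here to have no fallbacks.

```python
from typing import Callable, Dict, List, Tuple

# Fallback order copied from the cascade listing.
CROSS_VENDOR_FALLBACKS: Dict[str, List[str]] = {
    "openai-gpt-latest": ["anthropic-claude-opus-latest", "google-gemini-pro-latest"],
    "anthropic-claude-opus-latest": ["openai-gpt-latest", "google-gemini-pro-latest"],
    "google-gemini-pro-latest": ["anthropic-claude-opus-latest", "openai-gpt-latest"],
}


class ProviderError(Exception):
    """Hypothetical: quota exhausted or another provider-side failure."""


class ModelQuotaExhausted(Exception):
    """Stand-in for the model_quota_exhausted error."""


def complete_with_cascade(
    alias: str, call_provider: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (alias that answered, answer), trying the primary first."""
    candidates = [alias] + CROSS_VENDOR_FALLBACKS.get(alias, [])
    for candidate in candidates:
        try:
            return candidate, call_provider(candidate)
        except ProviderError:
            continue  # try the next flagship provider
    raise ModelQuotaExhausted(f"no provider could serve {alias!r}")
```

Returning the alias that actually answered makes it visible to the caller when a different vendor served the request.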

Default max_tokens cap (Cut 2.23d): T5 cloud aliases (*-latest, flagship) that arrive without an explicit max_tokens are silently capped at 4096. This stops upstream providers from reserving the model maximum (e.g. 65536 for GPT) against tight daily-credit budgets. Set max_tokens explicitly in your request to override.
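A sketch of the capping rule, under two assumptions that the text does not confirm: every `*-latest` model name is treated as in scope, and an explicit `null` is treated like a missing `max_tokens`. The function name `apply_default_cap` is hypothetical.

```python
T5_DEFAULT_MAX_TOKENS = 4096  # cap value stated for Cut 2.23d


def apply_default_cap(payload: dict) -> dict:
    """Return a copy of the request payload with the T5 default cap applied."""
    model = payload.get("model", "")
    # Assumption: any "*-latest" model name counts as a T5 alias.
    if model.endswith("-latest") and payload.get("max_tokens") is None:
        return {**payload, "max_tokens": T5_DEFAULT_MAX_TOKENS}
    return dict(payload)  # explicit max_tokens or non-T5 model: unchanged
```

A caller that needs longer completions would therefore send something like `{"model": "openai-gpt-latest", "max_tokens": 16000, ...}` rather than relying on the default.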

What "primary" means in T1

For each T1 alias, /v1/info lists backends[] in the order the stack will try them. Position 0 is the primary — that's what the Spass-Resolved-Backend response-header reflects when the call succeeds on first try. Position 1+ is the fallback chain.

For llama-4-scout (as of 2026-05-03):

backends[0]  local-fp8   (primary — Station, redundant 4× A100)
backends[1]  local-fp4   (fallback — Spark1, single GB10)
backends[2]  cloud-1     (fallback — currently mis-configured, see /v1/info notes)
backends[3]  cloud-2     (fallback — open-router-like)
backends[4]  cloud-3     (deprecated)

The order is stability-priority (Station-first per ADR 0008): the more redundant local backend gets primary, single-GPU is the local fallback, clouds are last.
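To see how far down the chain a call landed, a client could compare the Spass-Resolved-Backend header with the backends[] order. This is only a sketch: it assumes the header value matches a backend label exactly and that the chain is available as a plain list of labels, neither of which is confirmed here. The function name `fallback_depth` is hypothetical.

```python
from typing import Mapping, Sequence


def fallback_depth(backends: Sequence[str], headers: Mapping[str, str]) -> int:
    """Return the position of the serving backend in the chain (0 = primary)."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    resolved = lowered["spass-resolved-backend"]  # KeyError if header is absent
    return list(backends).index(resolved)         # ValueError if not in chain


# Chain for llama-4-scout as listed above.
LLAMA_4_SCOUT_CHAIN = ["local-fp8", "local-fp4", "cloud-1", "cloud-2", "cloud-3"]
```

A result of 0 means the primary answered; anything higher means at least one backend ahead of it was skipped.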

Decision-tree: which tier should I send?

Do you have a strict on-prem / GDPR requirement?
└── yes  → T0 -local

Do you have a strict third-party / cloud-only requirement?
└── yes  → T0 -cloud

Do you need cost-zero local serving (best effort)?
├── yes, no preference on quant   → T1 unprefixed (Spass-Resolved-Reason: primary-up)
├── yes, want FP8 only             → T2 -fp8
└── yes, want FP4 only             → T2 -fp4

Do you need a specific cloud (cost-tracking, vendor-pin)?
└── pick T3 -cloud-N

Do you want "vendor's latest" without re-pinning your client code?
└── T5 <vendor>-<family>-latest

Are you an operator debugging a specific node?
└── T4 -spark1 / -station (only via direct litellm:4000, not exposed in /v1/info)

Default for casual integrators?
└── T1 unprefixed
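The same decision tree, written as a function for clients that build slugs programmatically. The `Intent` dataclass and `choose_slug` are hypothetical names; the checks run in the order of the questions above, and the accepted suffix values come from the tier table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    """Hypothetical container for the caller's requirements."""
    strict_on_prem: bool = False
    strict_cloud: bool = False
    quant: Optional[str] = None          # "fp4", "fp8" or "bf16"
    cloud_slot: Optional[int] = None     # 1, 2 or 3
    vendor_family: Optional[str] = None  # e.g. "anthropic-claude-opus"
    operator_node: Optional[str] = None  # "spark1", "station" or "spark2"


def choose_slug(model: str, intent: Intent) -> str:
    """Walk the decision tree top to bottom and return the slug to send."""
    if intent.strict_on_prem:
        return f"{model}-local"                    # T0
    if intent.strict_cloud:
        return f"{model}-cloud"                    # T0
    if intent.quant is not None:
        if intent.quant not in ("fp4", "fp8", "bf16"):
            raise ValueError(f"unknown quantization {intent.quant!r}")
        return f"{model}-{intent.quant}"           # T2
    if intent.cloud_slot is not None:
        if intent.cloud_slot not in (1, 2, 3):
            raise ValueError(f"unknown cloud slot {intent.cloud_slot!r}")
        return f"{model}-cloud-{intent.cloud_slot}"  # T3
    if intent.vendor_family is not None:
        return f"{intent.vendor_family}-latest"    # T5, model name is not used
    if intent.operator_node is not None:
        if intent.operator_node not in ("spark1", "station", "spark2"):
            raise ValueError(f"unknown node {intent.operator_node!r}")
        return f"{model}-{intent.operator_node}"   # T4, operator-only
    return model                                   # T1 default
```

With no requirements set, `choose_slug("llama-4-scout", Intent())` falls through to the unprefixed T1 slug.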

What `Spass-Resolved-Reason` tells you

The header explains why the resolved-backend is what it is:

Reason	Meaning	Action for caller
`primary-up`	T1 alias, primary backend served (the happy path)	nothing — expected
`primary-down-fallback`	T1 alias, primary failed, fallback served	log + maybe retry later — primary is being repaired
`quant-pin-explicit`	T2 quant-tier pick respected	nothing — caller asked for this
`cloud-pin-explicit`	T3 cloud-tier pick respected	nothing — caller asked for this
`hardware-pin-explicit`	T4 hardware-direct (operator)	operator-context only
`force-backend-explicit`	T0 force-tier pick respected (Cut 2.14)	nothing — caller asked for strict on-prem or strict cloud
`unknown`	Streaming response (we can't read the body) OR model not in catalog	accept the uncertainty
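A client might translate the header into a handling decision along these lines. The short action labels and the function name `caller_action` are invented for this sketch, and treating a missing or unrecognised reason like `unknown` is an assumption.

```python
from typing import Mapping

# Reason values from the table; action labels are hypothetical shorthands.
REASON_ACTIONS = {
    "primary-up": "none",
    "primary-down-fallback": "log",
    "quant-pin-explicit": "none",
    "cloud-pin-explicit": "none",
    "hardware-pin-explicit": "operator-only",
    "force-backend-explicit": "none",
    "unknown": "accept",
}


def caller_action(headers: Mapping[str, str]) -> str:
    """Map the Spass-Resolved-Reason header to a coarse caller action."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    reason = lowered.get("spass-resolved-reason", "unknown")
    return REASON_ACTIONS.get(reason, "accept")
```

Only `primary-down-fallback` asks the caller to do anything beyond carrying on, so that is the one value worth wiring into logging.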

Deprecated suffixes (removed in Cut 2.14)

The pre-ADR-0008-v2 backward-compat suffixes have been removed. They now return 400 model_not_in_allowlist.

Removed slug	Migrate to
`<model>-ollama`	`<model>-cloud-1`
`<model>-openrouter`	`<model>-cloud-2`
`<model>-openrouter-free`	`<model>-cloud-2-free`
`<model>-nim`	`<model>-cloud-3`

<model>-local is kept, but reclassified from "backward-compat alias" to "formal T0 force-tier" with new semantics (chains primary → secondary local, no cloud-fallback).
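For clients with hard-coded legacy slugs, the migration table can be applied mechanically. The helper name `migrate_slug` is hypothetical; the suffix mapping is taken directly from the table above, and slugs without a removed suffix (including `<model>-local`) are returned unchanged.

```python
# Removed suffix -> replacement suffix, from the migration table.
_SUFFIX_RENAMES = [
    ("-openrouter-free", "-cloud-2-free"),
    ("-openrouter", "-cloud-2"),
    ("-ollama", "-cloud-1"),
    ("-nim", "-cloud-3"),
]


def migrate_slug(slug: str) -> str:
    """Rewrite a removed pre-Cut-2.14 suffix to its replacement, if present."""
    for old, new in _SUFFIX_RENAMES:
        if slug.endswith(old):
            return slug[: -len(old)] + new
    return slug  # nothing to migrate
```

Running this once over stored configuration avoids the 400 model_not_in_allowlist response that the removed slugs now produce.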