DGX LLM Chat Gateway

Routing & Tier-Schema

The stack exposes models through a six-tier slug-naming convention (ADR 0008 v2, after Cut 2.14). Pick the tier that matches your intent.

Tier overview

| Tier | Form | Use when | Example |
| --- | --- | --- | --- |
| T0 | <model>-local / <model>-cloud | "I need force-on-prem" / "I need force-cloud" (no cross-domain fallback) | llama-4-scout-local, llama-4-scout-cloud |
| T1 | <model> (unprefixed) | "give me the best you have, fall back if needed" | llama-4-scout |
| T2 | <model>-fp4 / -fp8 / -bf16 | "I need this exact quantization" (cost/quality trade-off) | llama-4-scout-fp8 |
| T3 | <model>-cloud-1 / -cloud-2 / -cloud-3 | "I want this specific cloud-slot only" | llama-4-scout-cloud-2 |
| T4 | <model>-spark1 / -station / -spark2 | operator-only, hardware-direct (not in /v1/info) | llama-4-scout-spark1 |
| T5 | <vendor>-<family>-latest | "current best from this vendor + family, my code stays unchanged across version bumps" | anthropic-claude-opus-latest |

T1 is the recommended default for casual use. T0 gives strong on-prem / cloud guarantees (e.g. data-residency policies). T2/T3 give explicit quant or provider-slot control. T4 exists for operator debug. T5 is for clients that want to track a vendor's latest version without code edits.
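The naming convention in the tier table can be recognised mechanically from the slug alone. The following is a minimal sketch, not the gateway's own routing code: `classify_tier` and its regular expressions are hypothetical and cover only the forms listed in the table.

```python
import re

# Patterns derived from the tier table. Each one is anchored at the end of
# the slug, so at most one of them can match a given slug.
_TIER_PATTERNS = [
    ("T5", re.compile(r"^[a-z0-9]+-.+-latest$")),
    ("T3", re.compile(r"-cloud-[1-3]$")),
    ("T0", re.compile(r"-(local|cloud)$")),
    ("T2", re.compile(r"-(fp4|fp8|bf16)$")),
    ("T4", re.compile(r"-(spark1|spark2|station)$")),
]


def classify_tier(slug: str) -> str:
    """Return the tier label for a slug; unprefixed slugs fall through to T1."""
    for tier, pattern in _TIER_PATTERNS:
        if pattern.search(slug):
            return tier
    return "T1"


for example in ("llama-4-scout", "llama-4-scout-local", "llama-4-scout-cloud-2"):
    print(example, classify_tier(example))
```

A client could use a helper like this to log which tier it is requesting, or to reject T4 slugs outside an operator context.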

T0 — Force-Backend (Cut 2.14)

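T0 aliases declare a hard isolation boundary: a -local alias is never served from cloud, and a -cloud alias is never served from on-prem hardware.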

T0 differs from T1 in semantics: T1 is "best effort, will use whatever's up". T0 is "this domain only, fail rather than cross domains".
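The difference between the two semantics can be illustrated with a small resolver. This is a sketch under stated assumptions: `Backend`, `resolve`, and `NoBackendInDomain` are hypothetical names, and the real router may track health and domains differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str    # e.g. "local-fp8" or "cloud-2"
    domain: str  # "local" or "cloud"
    up: bool     # whether the backend is currently healthy


class NoBackendInDomain(RuntimeError):
    """Raised when no healthy backend exists inside the allowed domain."""


def resolve(chain, force_domain=None):
    """Return the first healthy backend in chain order.

    With force_domain=None this behaves like T1 (use whatever is up).
    With force_domain set it behaves like T0 (fail rather than cross domains).
    """
    for backend in chain:
        if force_domain is not None and backend.domain != force_domain:
            continue
        if backend.up:
            return backend
    raise NoBackendInDomain(force_domain or "any")
```

With both local backends down and one cloud backend up, `resolve(chain)` returns the cloud backend, while `resolve(chain, force_domain="local")` raises instead of leaving the on-prem domain.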

T5 — Vendor-pinned floating aliases (Cut 2.14)

T5 lets you say "anthropic-claude-opus-latest" instead of pinning your client to claude-opus-4.7 and having to edit it on every release. The operator periodically bumps the pin in litellm/config.yaml; your code stays unchanged.

Each T5 alias resolves to a single concrete provider model at any given moment — see /v1/info backends[0].notes for the current pin and date.

Initial pin-targets (2026-05-03):

| T5 Alias | Currently pins to |
| --- | --- |
| anthropic-claude-opus-latest | claude-opus-4.7 |
| google-gemini-pro-latest | gemini-3.1-pro-preview |
| google-gemini-flash-latest | gemini-2.5-flash |
| google-nano-banana-latest | gemini-3.1-flash-image-preview-20260226 |
| openai-gpt-latest | gpt-5.5-pro |
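Because the pin changes over time, a client that wants to record which concrete model answered can read the note from /v1/info. The sketch below assumes the entry for one alias has already been fetched and parsed into a dict; only the `backends[0].notes` path comes from this document, and the surrounding payload shape, the helper name `current_pin_note`, and the sample data are hypothetical.

```python
def current_pin_note(alias_entry: dict) -> str:
    """Return backends[0].notes for one alias entry taken from /v1/info."""
    backends = alias_entry.get("backends") or []
    if not backends:
        raise LookupError("alias entry lists no backends")
    return backends[0].get("notes", "")


# Hypothetical sample entry; the real /v1/info payload may be shaped differently.
sample_entry = {"backends": [{"notes": "pinned to claude-opus-4.7 (2026-05-03)"}]}
print(current_pin_note(sample_entry))
```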

For deterministic version pinning, use the unprefixed T1 alias of the concrete version directly (e.g. claude-opus-4.7).

T5 Cross-Vendor Fallback Cascade (Cut 2.23d)

If the primary provider for a T5 alias is out-of-quota or returns an error, the router automatically tries the other two flagship providers before emitting model_quota_exhausted:

openai-gpt-latest         → [anthropic-claude-opus-latest, google-gemini-pro-latest]
anthropic-claude-opus-latest → [openai-gpt-latest, google-gemini-pro-latest]
google-gemini-pro-latest  → [anthropic-claude-opus-latest, openai-gpt-latest]

Cross-vendor fallback is a best-effort 429-avoidance — the caller gets some answer when the primary is blocked. If you need strict vendor pinning (e.g. for compliance reasons), pass the underlying concrete-version alias instead of the *-latest form.
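The cascade can be pictured as an ordered retry over the mapping above. This is an illustrative sketch rather than the router's implementation: `call_with_cascade`, `ModelQuotaExhausted`, and the `call` callback are hypothetical, and the sketch treats any exception from the callback as "out-of-quota or error".

```python
FALLBACKS = {
    "openai-gpt-latest": ["anthropic-claude-opus-latest", "google-gemini-pro-latest"],
    "anthropic-claude-opus-latest": ["openai-gpt-latest", "google-gemini-pro-latest"],
    "google-gemini-pro-latest": ["anthropic-claude-opus-latest", "openai-gpt-latest"],
}


class ModelQuotaExhausted(RuntimeError):
    """Stands in for the model_quota_exhausted error of the gateway."""


def call_with_cascade(alias, call):
    """Try the alias first, then its fallbacks in the listed order.

    `call(alias)` must return a response or raise on failure.
    Returns a (served_alias, response) tuple.
    """
    tried = []
    for candidate in [alias, *FALLBACKS.get(alias, [])]:
        try:
            return candidate, call(candidate)
        except Exception:
            tried.append(candidate)
    raise ModelQuotaExhausted(f"model_quota_exhausted after trying {tried}")
```

Returning the served alias alongside the response makes the vendor switch visible to the caller, which matters when the answer may come from a different provider than the one requested.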

max_tokens default cap (Cut 2.23d): T5 cloud aliases (*-latest, flagship) sent without an explicit max_tokens are silently capped at 4096. This stops upstream providers from reserving the model maximum (e.g. 65536 tokens for GPT) against tight daily-credit budgets. Set max_tokens explicitly in your request to override.
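The capping rule amounts to filling in a default only when the caller left the field out. A minimal sketch, assuming the request body is a plain dict and that the `-latest` suffix is enough to identify the affected aliases; `apply_default_cap` is a hypothetical helper, not the gateway's code.

```python
DEFAULT_MAX_TOKENS = 4096  # cap stated for Cut 2.23d


def apply_default_cap(alias: str, body: dict) -> dict:
    """Return a copy of the request body with max_tokens defaulted for *-latest aliases."""
    if alias.endswith("-latest") and body.get("max_tokens") is None:
        return {**body, "max_tokens": DEFAULT_MAX_TOKENS}
    return dict(body)


print(apply_default_cap("openai-gpt-latest", {"messages": []}))
print(apply_default_cap("openai-gpt-latest", {"messages": [], "max_tokens": 16000}))
```

An explicit value passes through unchanged, so a client that needs long outputs only has to set max_tokens itself.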

What "primary" means in T1

For each T1 alias, /v1/info lists backends[] in the order the stack will try them. Position 0 is the primary — that's what the Spass-Resolved-Backend response-header reflects when the call succeeds on first try. Position 1+ is the fallback chain.

For llama-4-scout (as of 2026-05-03):

backends[0]  local-fp8   (primary — Station, redundant 4× A100)
backends[1]  local-fp4   (fallback — Spark1, single GB10)
backends[2]  cloud-1     (fallback — currently mis-configured, see /v1/info notes)
backends[3]  cloud-2     (fallback — open-router-like)
backends[4]  cloud-3     (deprecated)

The order is stability-priority (Station-first per ADR 0008): the more redundant local backend gets primary, single-GPU is the local fallback, clouds are last.
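The position of the serving backend in this list is what separates the happy path from a fallback. The sketch below derives the reason string from that position; the chain is copied from the listing above, while `t1_reason` is a hypothetical helper and the rule it encodes is inferred from this document, not taken from the router source.

```python
# Try-order for llama-4-scout as listed above (position 0 is the primary).
LLAMA_4_SCOUT_CHAIN = ["local-fp8", "local-fp4", "cloud-1", "cloud-2", "cloud-3"]


def t1_reason(chain, served):
    """Map the serving backend to the reason a T1 caller would expect to see."""
    position = chain.index(served)  # raises ValueError if served is not in the chain
    return "primary-up" if position == 0 else "primary-down-fallback"


print(t1_reason(LLAMA_4_SCOUT_CHAIN, "local-fp8"))
print(t1_reason(LLAMA_4_SCOUT_CHAIN, "local-fp4"))
```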

Decision-tree: which tier should I send?

Do you have a strict on-prem / GDPR (DSGVO) requirement?
└── yes  → T0 -local

Do you have a strict third-party / cloud-only requirement?
└── yes  → T0 -cloud

Do you need cost-zero local serving (best effort)?
├── yes, no preference on quant   → T1 unprefixed (Spass-Resolved-Reason: primary-up)
├── yes, want FP8 only             → T2 -fp8
└── yes, want FP4 only             → T2 -fp4

Do you need a specific cloud (cost-tracking, vendor-pin)?
└── pick T3 -cloud-N

Do you want "vendor's latest" without re-pinning your client code?
└── T5 <vendor>-<family>-latest

Are you an operator debugging a specific node?
└── T4 -spark1 / -station (only via direct litellm:4000, not exposed in /v1/info)

Default for casual integrators?
└── T1 unprefixed
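The T0 to T3 branches of the tree above can be written as a small slug builder. This is a sketch only: `choose_slug` and its keyword arguments are hypothetical, and T4 (operator-only) and T5 (vendor-based naming) are left out on purpose.

```python
def choose_slug(model, *, on_prem_only=False, cloud_only=False, quant=None, cloud_slot=None):
    """Build a model slug by following the decision tree from top to bottom."""
    if on_prem_only and cloud_only:
        raise ValueError("on_prem_only and cloud_only are mutually exclusive")
    if on_prem_only:
        return f"{model}-local"          # T0, strict on-prem
    if cloud_only:
        return f"{model}-cloud"          # T0, strict cloud
    if quant is not None:
        if quant not in ("fp4", "fp8", "bf16"):
            raise ValueError(f"unknown quantization: {quant}")
        return f"{model}-{quant}"        # T2, explicit quantization
    if cloud_slot is not None:
        if cloud_slot not in (1, 2, 3):
            raise ValueError(f"unknown cloud slot: {cloud_slot}")
        return f"{model}-cloud-{cloud_slot}"  # T3, explicit cloud slot
    return model                          # T1, recommended default


print(choose_slug("llama-4-scout"))
print(choose_slug("llama-4-scout", on_prem_only=True))
```

The strict requirements are checked first so that a data-residency constraint always wins over a quantization or cloud-slot preference.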

What Spass-Resolved-Reason tells you

The header explains why the resolved-backend is what it is:

| Reason | Meaning | Action for caller |
| --- | --- | --- |
| primary-up | T1 alias, primary backend served (the happy path) | nothing; expected |
| primary-down-fallback | T1 alias, primary failed, fallback served | log and maybe retry later; primary is being repaired |
| quant-pin-explicit | T2 quant-tier pick respected | nothing; caller asked for this |
| cloud-pin-explicit | T3 cloud-tier pick respected | nothing; caller asked for this |
| hardware-pin-explicit | T4 hardware-direct (operator) | operator-context only |
| force-backend-explicit | T0 force-tier pick respected (Cut 2.14) | nothing; caller asked for strict on-prem or strict cloud |
| unknown | Streaming response (we can't read the body) OR model not in catalog | accept the uncertainty |
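On the client side, the table above reduces to one case that deserves attention and several that do not. A minimal sketch, assuming the response headers are available as a dict; `handle_reason` and its return values are hypothetical. Header names are lower-cased before lookup because HTTP header names are case-insensitive.

```python
import logging

logger = logging.getLogger("gateway-client")

# Reasons for which the table asks nothing of the caller.
_NO_ACTION = {
    "primary-up",
    "quant-pin-explicit",
    "cloud-pin-explicit",
    "hardware-pin-explicit",
    "force-backend-explicit",
}


def handle_reason(headers: dict) -> str:
    """Return "ok", "retry-later", or "unknown" based on Spass-Resolved-Reason."""
    normalized = {key.lower(): value for key, value in headers.items()}
    reason = normalized.get("spass-resolved-reason", "unknown")
    if reason == "primary-down-fallback":
        backend = normalized.get("spass-resolved-backend", "?")
        logger.warning("served by fallback backend %s", backend)
        return "retry-later"
    if reason in _NO_ACTION:
        return "ok"
    return "unknown"
```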

Deprecated suffixes (removed in Cut 2.14)

The pre-ADR-0008-v2 backward-compat suffixes have been removed. They return 400 model_not_in_allowlist now.

| Removed slug | Migrate to |
| --- | --- |
| <model>-ollama | <model>-cloud-1 |
| <model>-openrouter | <model>-cloud-2 |
| <model>-openrouter-free | <model>-cloud-2-free |
| <model>-nim | <model>-cloud-3 |

<model>-local is kept, but reclassified from "backward-compat alias" to "formal T0 force-tier" with new semantics (chains primary → secondary local, no cloud-fallback).
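Clients with many hard-coded slugs can apply the migration table mechanically. The helper below is a hypothetical sketch, not a tool shipped with the gateway; it rewrites only the four removed suffixes and leaves every other slug, including <model>-local, untouched.

```python
# Removed suffix -> replacement suffix, copied from the migration table.
_MIGRATIONS = {
    "-openrouter-free": "-cloud-2-free",
    "-openrouter": "-cloud-2",
    "-ollama": "-cloud-1",
    "-nim": "-cloud-3",
}


def migrate_slug(slug: str) -> str:
    """Rewrite a removed suffix to its replacement; return other slugs unchanged."""
    for old, new in _MIGRATIONS.items():
        if slug.endswith(old):
            return slug[: -len(old)] + new
    return slug


for old_slug in ("llama-4-scout-ollama", "llama-4-scout-openrouter-free", "llama-4-scout-local"):
    print(old_slug, "->", migrate_slug(old_slug))
```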

See also