DGX LLM Chat Gateway
Hybrid local + cloud LLM gateway running on a single NVIDIA DGX Spark (GB10) plus an x86 worker-station with 4× A100. Forwards chat completions to local vLLM with Llama-4-Scout (NVFP4) on Spark1 and Llama-4-Scout (FP8) on Station, falling back to OpenRouter for cloud-backed reasoning models. KV-cache reuse via LMCache, response cache via LiteLLM/Redis.
Phase 2 (Cuts 2.1 – 2.13) adds the /a1 agent system: rig-core powered agents with file-based YAML configs, multi-turn sessions, 23 server-side tools (compute, web, RAG, memory, image-gen, model discovery), tenant-scoped RAG indices with full CRUD via tools (B2 confirm + D4 pre-exec-diff), C4-hybrid session compaction with Z-chained lineage and auto-trigger, and a 3-level per-tenant configuration cascade (process → yaml → DB) covering compaction defaults, image-gen settings, model allow/blacklists, and cost markup.
This site is the client-facing reference for the gateway. Operators looking for installation, system maintenance, or troubleshooting should read docs/INSTALL.md in the repo instead.
Cut 2.23c (ADR 0016): The SPASS-User-Id header is mandatory on every user-scoped endpoint (/c1/*, /v1/chat/completions, /v1/embeddings, /v1/memory/*, /v1/system-prompts/*, /v1/tools/execute, /a1/agents/*). A user_id passed in the request body or query string is rejected with HTTP 400. tenant_admin tokens may omit the header on read-only endpoints (tenant-wide reads). See authentication.md for details.
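As a minimal sketch of the rule above, the following Python builds (but does not send) a chat-completion request that carries the user in a header rather than in the body. The base URL and the helper name `build_chat_request` are placeholders invented for illustration, and the header name is assumed to be `SPASS-User-Id`; check authentication.md for the authoritative details.

```python
import json
import urllib.request

# Placeholder host for illustration only; substitute your gateway's address.
BASE_URL = "https://gateway.example.invalid"


def build_chat_request(token: str, user_id: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    # The gateway rejects user_id in the body with HTTP 400, so fail early.
    if "user_id" in payload:
        raise ValueError("pass the user via the SPASS-User-Id header, not the body")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # Bearer token auth
            "Content-Type": "application/json",
            "SPASS-User-Id": user_id,  # assumed header name
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The early `ValueError` is a client-side convenience, not part of the gateway contract.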
Three API surfaces
| | /v1/* (OpenAI passthrough) | /c1/* (cache-augmented) | /a1/* (agent system) |
|---|---|---|---|
| Contract | OpenAI Chat-Completions verbatim | Custom domain schema | YAML-defined agents |
| State | Stateless (/v1/images/* adds image-store) | Per-conversation SQLite | Per-session SQLite + RAG indices + lineage chain |
| History | Client sends full messages | Server prepends from DB | Server replays per session, auto-compact above 80 % context |
| Tool-calling | Caller-supplied tools[] + 14 stack tools auto-injected (configurable allowlist) | Caller-supplied tools[] + stack tools | tools: [...] in agent.yaml — 23 stack tools available (incl. 9 RAG-CRUD with B2 confirm + D4 preview) |
| RAG | via /v1/embeddings + caller-side store | via rag_query tool | passive (rag_index:) or active (rag_query) + full CRUD via 9 dedicated tools |
| Compaction | — | — | manual POST /sessions/<sid>/compact + auto-trigger + Z-chained lineage (Cut 2.12) |
| Multi-tenant | Yes | Yes | Yes (sessions, RAG, images, configuration tenant-scoped) |
| Per-tenant config | shared 3-level cascade (/v1/tenant/config, ADR 0013, Cut 2.13) | shared cascade | shared cascade — agent.yaml + tenant defaults + DB overrides |
| SDKs | Every OpenAI SDK works | Custom client | Direct HTTP / OpenAI-compat for sub-pieces |
| Streaming | Yes (SSE) | Yes (SSE) | Cut 2.5+ uses non-streaming agent.chat |
| Tools / Vision / JSON-mode | Yes | Yes | inherited via /v1 self-call (ADR 0011) |
Pick /v1/* for a drop-in OpenAI replacement: existing OpenAI SDKs work unchanged, and /v1/images adds image generation. Pick /c1/* for your own app when you want server-side history persistence and ergonomic provider selection. Pick /a1/* when you want pre-configured agents with tools, RAG, and long conversations that survive context-window limits via Z-chained compaction.
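The per-tenant configuration cascade (process → yaml → DB) in the table above can be pictured as a layered lookup. The sketch below assumes that later levels override earlier ones, so a DB value wins over a YAML value, which wins over a process default. The key names and values are hypothetical; only the three level names come from this documentation.

```python
from collections import ChainMap

# Hypothetical keys and values, chosen only to show the precedence order.
process_defaults = {"compaction_threshold": 0.80, "cost_markup": 0.0}
yaml_config = {"cost_markup": 0.10}
db_overrides = {"compaction_threshold": 0.70}


def resolve_config(process: dict, yaml_level: dict, db: dict) -> dict:
    """Merge the three levels into one effective configuration.

    ChainMap searches its maps left to right, so the DB level is
    consulted first, then YAML, then the process defaults.
    """
    return dict(ChainMap(db, yaml_level, process))


effective = resolve_config(process_defaults, yaml_config, db_overrides)
print(effective)
```

Here `compaction_threshold` resolves to the DB value and `cost_markup` to the YAML value, because no higher level overrides it.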
Where to start
- Quickstart: your first request in under a minute.
- Authentication: Bearer token, multi-tenant scoping, rate limits.
- Models & constraints: what each alias supports, behaviour quirks per model.
- Embeddings: /v1/embeddings (nomic-embed-text-v1.5).
- Routing & Tier-Schema: slug → backend mapping (T1/T2/T3/T4 per ADR 0008).
- Server-side tools: datetime, calculator, web/wikipedia, memory (incl. memory_describe_scope discovery), image_gen, list_models, rag_query, plus 9 RAG-CRUD tools (B2 confirm + D4 preview).
- Agents (/a1): agent system architecture, YAML config, multi-turn sessions, C4-hybrid compaction with Z-chained lineage and auto-trigger, per-session config.
- RAG (/a1/rag): index CRUD, document-level CRUD, query, agent-integration.
- Per-tenant configuration: 3-level cascade (process → yaml → DB) via /v1/tenant/config (ADR 0013, Cut 2.13).
- Response headers: every SPASS-* header explained.
- Error catalog: every stable error code, with remediation.
- Examples: runnable curl recipes (tools, vision, conversations, image-gen, streaming).
- Billing & Usage: per-tenant cost-aggregation via /v1/billing/usage (Cut 2.40c, hour/day buckets).
- Changelog: what's new per cut.
Discovery endpoints (no auth required)
- GET /healthz: process is alive
- GET /readyz: service availability (see below)
- GET /openapi.json: raw OpenAPI 3.x spec (now covers /v1, /c1, /a1, embeddings, RAG)
- GET /openapi: Swagger UI
- GET /redoc: ReDoc renderer
- GET /errors: full machine-readable error catalog
- GET /api/version: build SHA + crate version
- GET /playground: interactive playground for live experiments
- GET /playground/stt: Speech-to-Text live testbed
- GET /docs/*: this documentation site
Service availability (GET /readyz)
/readyz is a public, no-auth availability probe. It returns an overall
traffic-light plus which capabilities (if any) are currently limited — without
exposing any internal infrastructure detail.
- status: overall health, one of ok (everything healthy), degraded (a non-essential capability is limited but requests are still served), or critical (a core capability is impaired).
- ready: true whenever the service can accept requests. This stays true even under degraded/critical as long as the request can be served (the platform transparently falls back across compute capacity).
- degraded_reasons: a generic, human-readable list of what is currently limited, e.g. compute capacity, the search service, or the speech service. No hostnames, no counts, no internal component names.
Use status for a quick green/yellow/red signal and degraded_reasons to tell
a user which capability to expect delays on. The response also carries a
checked_at (RFC3339) timestamp so you can detect a stale or partial probe.
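A client-side reading of the probe might look like the sketch below. It assumes the response is a JSON object with the four fields named above and that degraded_reasons is a list of strings. The sample payload, the 60-second staleness limit, and the function names are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def traffic_light(payload: dict) -> str:
    """Map the documented status values to green/yellow/red."""
    colours = {"ok": "green", "degraded": "yellow", "critical": "red"}
    # An unknown or missing status is treated as red (a conservative assumption).
    return colours.get(payload.get("status"), "red")


def is_stale(payload: dict, now: datetime,
             max_age: timedelta = timedelta(seconds=60)) -> bool:
    """True when checked_at is missing or older than max_age (hypothetical limit)."""
    raw = payload.get("checked_at")
    if not raw:
        return True
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    checked = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return now - checked > max_age


# Made-up sample payload, not captured from a real gateway.
sample = {
    "status": "degraded",
    "ready": True,
    "degraded_reasons": ["search service"],
    "checked_at": "2025-01-01T12:00:00Z",
}
now = datetime(2025, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
print(traffic_light(sample), sample["ready"], is_stale(sample, now))
```

Keeping the colour mapping separate from the ready flag mirrors the distinction above: a degraded or critical status does not by itself mean requests are refused.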