DGX LLM Chat Gateway

DGX LLM Chat Gateway

Hybrid local + cloud LLM gateway running on a single NVIDIA DGX Spark (GB10) plus an x86 worker-station with 4× A100. Forwards chat completions to local vLLM with Llama-4-Scout (NVFP4) on Spark1 and Llama-4-Scout (FP8) on Station, falling back to OpenRouter for cloud-backed reasoning models. KV-cache reuse via LMCache, response cache via LiteLLM/Redis.

Phase 2 (Cuts 2.1 – 2.13) adds the /a1 agent system: rig-core powered agents with file-based YAML configs, multi-turn sessions, 23 server-side tools (compute, web, RAG, memory, image-gen, model discovery), tenant-scoped RAG indices with full CRUD via tools (B2 confirm + D4 pre-exec-diff), C4-hybrid session compaction with Z-chained lineage and auto-trigger, and a 3-level per-tenant configuration cascade (process → yaml → DB) covering compaction defaults, image-gen settings, model allow/blacklists, and cost markup.

This site is the client-facing reference for the gateway. Operators looking for installation, system maintenance, or troubleshooting should read docs/INSTALL.md in the repo instead.

Cut 2.23c (ADR 0016) — SPASS-User-Id-Header is mandatory on every user-scoped endpoint (/c1/*, /v1/chat/completions, /v1/embeddings, /v1/memory/*, /v1/system-prompts/*, /v1/tools/execute, /a1/agents/*). Body and query user_id are rejected with HTTP 400. tenant_admin-tokens may omit the header for read-only endpoints (tenant-wide reads). See authentication.md for details.

Three API surfaces

/v1/* (OpenAI passthrough)/c1/* (cache-augmented)/a1/* (agent system)
ContractOpenAI Chat-Completions verbatimCustom domain schemaYAML-defined agents
StateStateless (/v1/images/* adds image-store)Per-conversation SQLitePer-session SQLite + RAG indices + lineage chain
HistoryClient sends full messagesServer prepends from DBServer replays per session, auto-compact above 80 % context
Tool-callingCaller-supplied tools[] + 14 stack tools auto-injected (configurable allowlist)Caller-supplied tools[] + stack toolstools: [...] in agent.yaml — 23 stack tools available (incl. 9 RAG-CRUD with B2 confirm + D4 preview)
RAGvia /v1/embeddings + caller-side storevia rag_query toolpassive (rag_index:) or active (rag_query) + full CRUD via 9 dedicated tools
Compactionmanual POST /sessions/<sid>/compact + auto-trigger + Z-chained lineage (Cut 2.12)
Multi-tenantYesYesYes (sessions, RAG, images, configuration tenant-scoped)
Per-tenant configshared 3-level cascade (/v1/tenant/config, ADR 0013, Cut 2.13)shared cascadeshared cascade — agent.yaml + tenant defaults + DB overrides
SDKsEvery OpenAI SDK worksCustom clientDirect HTTP / OpenAI-compat for sub-pieces
StreamingYes (SSE)Yes (SSE)Cut 2.5+ uses non-streaming agent.chat
Tools / Vision / JSON-modeYesYesinherited via /v1 self-call (ADR 0011)

Pick /v1/* for a drop-in OpenAI replacement (drop-in for the existing OpenAI SDK + /v1/images for image generation). Pick /c1/* for an own app that wants history persistence + ergonomic provider selection. Pick /a1/* when you want pre-configured agents with tools, RAG, and long conversations that survive context-window limits via Z-chained compaction.

Where to start

Discovery endpoints (no auth required)

Service availability (GET /readyz)

/readyz is a public, no-auth availability probe. It returns an overall traffic-light plus which capabilities (if any) are currently limited — without exposing any internal infrastructure detail.

Use status for a quick green/yellow/red signal and degraded_reasons to tell a user which capability to expect delays on. The response also carries a checked_at (RFC3339) timestamp so you can detect a stale or partial probe.