DGX LLM Chat Gateway

Hybrid LLM gateway
on a single GB10.

OpenAI-compatible chat completions backed by a local Llama-4-Scout (NVFP4) on NVIDIA DGX Spark, with Ollama Cloud and OpenRouter fallbacks. KV-cache reuse via LMCache. One bearer token, three parallel APIs (/v1 stateless, /c1 with conversation persistence, /a1 agent system with multi-turn sessions, RAG, and C4-hybrid compaction), 23 server-side tools, per-tenant configuration cascade.

/docs/
Markdown documentation site. Quickstart, authentication, error catalog, models & constraints, runnable examples.
/playground
Interactive sandbox. Model cards with capabilities, drag-and-drop image upload, tool-call builder, live streaming.
/openapi
Themed Swagger UI — interactive API reference with "Try it out" buttons.
/redoc
Themed ReDoc — calmer, read-friendly spec rendering.
/openapi.json
Raw OpenAPI 3.x schema for code generators (openapi-generator, quicktype).
/errors
Machine-readable catalog of every stable error code with type, status, remediation.
/v1/info
Model catalog with capabilities (modalities, tools, context-window) — used by playground, /a1 agents, list_models tool.
/metrics
Prometheus exposition (request counts, latency histograms, cost EUR/USD per request, sub-call counters).
/healthz
Liveness probe (no auth). Returns {"status":"ok"}.
/readyz
Readiness probe with per-backend latency (litellm + vllm). No auth.

Three parallel APIs

Aspect/v1/* — OpenAI passthrough/c1/* — cache-augmented/a1/* — agent system
ContractVerbatim OpenAI Chat-CompletionsCustom domain schemaYAML-defined agents
StateStatelessSQLite-persisted conversationsSQLite multi-turn sessions + RAG indices + image-store
HistoryClient sends full messagesServer prepends from DBServer replays per session, auto-compact above 80 % context
SDKsEvery OpenAI SDK works as-isCustom clientDirect HTTP / OpenAI-compat for sub-pieces
Streaming, tools, vision, JSON-modeYesYesInherited via /v1 self-call (ADR 0011)
Backend selectorvia model slugvia model slug or provider enumpinned in agent.yaml
ToolsCaller-supplied tools[] + 14 stack tools auto-injected (configurable)Caller-supplied tools[] + stack toolstools: [...] in agent.yaml — 23 stack tools available (incl. 9 RAG-CRUD)
RAGvia /v1/embeddings + caller-side storevia rag_query toolpassive (rag_index:) or active (rag_query) + full CRUD via 9 dedicated tools (B2 confirm + D4 preview)
CompactionPOST /sessions/<sid>/compact + auto-trigger + Z-chained lineage (Cut 2.12)
Multi-tenantper Bearerper Bearerper Bearer (sessions, RAG, images, configuration tenant-scoped)
Per-tenant config3-level cascade (process → yaml → DB) via GET/PUT/DELETE /v1/tenant/config (ADR 0013, Cut 2.13)
CRUD endpoints/v1/images/*GET/DELETE /c1/conversations/{id}agents, sessions, RAG indices, RAG documents, lineage
Best forDrop-in OpenAI replacementOwn apps wanting history + provider switchPre-configured agents with tools, RAG, and long conversations