DGX LLM Chat Gateway

Hybrid LLM gateway on a single GB10.

OpenAI-compatible chat completions backed by a local Llama-4-Scout (NVFP4) on NVIDIA DGX Spark, with Ollama Cloud and OpenRouter fallbacks. KV-cache reuse via LMCache. One bearer token, two parallel APIs (/v1 stateless, /c1 with conversation persistence), thirteen models exposed.
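Because /v1 is a verbatim OpenAI passthrough, any OpenAI-style HTTP client works. A minimal sketch using only the Python standard library; the base URL, token, and model slug below are illustrative assumptions, not values from this document:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the gateway listens
TOKEN = "sk-example"                # placeholder for the gateway's single bearer token

def chat_request(model: str, messages: list, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for the /v1 passthrough."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        method="POST",
    )

# "llama-4-scout" is an assumed slug; see /docs/ for the real model list.
req = chat_request("llama-4-scout", [{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it to a running gateway.
```

The same request shape works with any official OpenAI SDK by pointing its base URL at the gateway and passing the bearer token as the API key.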

/docs/
Markdown documentation site. Quickstart, authentication, error catalog, models & constraints, runnable examples.
/playground
Interactive sandbox. Model cards with capabilities, drag-and-drop image upload, tool-call builder, live streaming.
/openapi
Themed Swagger UI — interactive API reference with "Try it out" buttons.
/redoc
Themed ReDoc — calmer, read-friendly spec rendering.
/openapi.json
Raw OpenAPI 3.x schema for code generators (openapi-generator, quicktype).
/errors
Machine-readable catalog of every stable error code with type, status, remediation.
/healthz
Liveness probe (no auth). Returns {"status":"ok"}.
/readyz
Readiness probe with per-backend latency (litellm + vllm). No auth.
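Since both probes are unauthenticated, a deploy script can gate on /readyz with a plain HTTP GET. A sketch, assuming the gateway's default host and port:

```python
import json
import urllib.error
import urllib.request

def is_ready(base_url="http://localhost:8000"):
    """Poll the unauthenticated /readyz probe.

    Returns the parsed JSON body (including per-backend latency) on success,
    or None if the gateway is unreachable.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
```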

Two parallel APIs

| Aspect | /v1/* — OpenAI passthrough | /c1/* — cache-augmented |
|---|---|---|
| Contract | Verbatim OpenAI Chat Completions | Custom domain schema |
| State | Stateless | SQLite-persisted conversations |
| History | Client sends full messages | Server prepends from DB |
| SDKs | Every OpenAI SDK works as-is | Custom client |
| Streaming, tools, vision, JSON mode | Yes | Yes |
| Backend selector | Via model slug | Via model slug or provider enum |
| Conversation CRUD | — | GET / DELETE /c1/conversations/{id} |
| Best for | Drop-in OpenAI replacement | Own apps wanting history + an ergonomic provider switch |
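The /c1 conversation CRUD boils down to GET and DELETE on the same resource path. A minimal request builder, again with assumed base URL and placeholder token:

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the gateway listens
TOKEN = "sk-example"                # placeholder for the gateway's single bearer token

def conversation_request(conv_id: str, method: str = "GET") -> urllib.request.Request:
    """Build a /c1 conversation CRUD request: GET fetches history, DELETE removes it."""
    return urllib.request.Request(
        f"{BASE_URL}/c1/conversations/{conv_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        method=method,
    )

# "abc123" is a hypothetical conversation id returned by an earlier /c1 chat call.
req = conversation_request("abc123", method="DELETE")
# urllib.request.urlopen(req) would send it to a running gateway.
```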