DGX LLM Chat Gateway

Hybrid local + cloud LLM gateway running on a single NVIDIA DGX Spark (GB10). Forwards chat completions to a local vLLM with Llama-4-Scout (NVFP4), falling back to Ollama Cloud or OpenRouter when local is unavailable. KV-cache reuse via LMCache, response cache via LiteLLM/Redis.

This site is the client-facing reference for the gateway. Operators looking for installation, system maintenance, or troubleshooting should read docs/INSTALL.md in the repo instead.

Two parallel APIs

	`/v1/*` (OpenAI passthrough)	`/c1/*` (cache-augmented)
Contract	OpenAI Chat-Completions verbatim	Custom domain schema
State	Stateless	Conversation persistence (SQLite)
History	Client sends full `messages`	Server prepends from DB
LMCache hit-rate	Depends on client	Maximised via stable prefixes
SDKs	Every OpenAI SDK works as-is	Custom client
Streaming	Yes (SSE)	Yes (SSE)
Tools / Vision / JSON-mode	Yes	Yes

Pick /v1/* if you want a drop-in OpenAI replacement. Pick /c1/* if you're building your own app and want history persistence + ergonomic provider selection in one call.

Where to start

Quickstart — your first request in under a minute.
Authentication — Bearer token, rate limits, conversation isolation.
Models & constraints — what each alias supports, behaviour quirks per model.
Error catalog — every stable error code, with remediation.
Examples — runnable curl recipes for tools, vision, conversations, image generation, and streaming.

Discovery endpoints (no auth required)

GET /healthz — process is alive (returns {"status":"ok"})
GET /readyz — readiness with per-backend status (litellm, vllm)
GET /openapi.json — raw OpenAPI 3.x spec
GET /openapi — Swagger UI (interactive API reference)
GET /redoc — ReDoc (alternative spec renderer)
GET /errors — full machine-readable error catalog
GET /playground — interactive playground for live experiments
GET /docs/* — this documentation site