DGX LLM Chat Gateway
Hybrid local + cloud LLM gateway running on a single NVIDIA DGX Spark (GB10). Forwards chat completions to a local vLLM instance serving Llama-4-Scout (NVFP4), falling back to Ollama Cloud or OpenRouter when the local backend is unavailable. KV-cache reuse via LMCache; response cache via LiteLLM/Redis.
This site is the client-facing reference for the gateway. Operators looking
for installation, system maintenance, or troubleshooting should read
docs/INSTALL.md in the repo instead.
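A minimal first request against the passthrough API might look like the sketch below. The host, port, token, and model alias are placeholders, not gateway defaults — substitute the values your operator gives you (see Models & constraints for the real aliases).

```shell
# Placeholder values -- replace with your deployment's host and token.
BASE=http://gateway.example:8000
PAYLOAD='{
  "model": "llama-4-scout",
  "messages": [{"role": "user", "content": "Hello"}]
}'

# Standard OpenAI Chat-Completions request against the /v1 passthrough.
curl -s "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || true  # tolerate an unreachable gateway in offline runs
```

Because `/v1/*` mirrors the OpenAI contract verbatim, any OpenAI SDK pointed at `$BASE/v1` should work unchanged.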
Two parallel APIs
| | /v1/* (OpenAI passthrough) | /c1/* (cache-augmented) |
|---|---|---|
| Contract | OpenAI Chat-Completions verbatim | Custom domain schema |
| State | Stateless | Conversation persistence (SQLite) |
| History | Client sends full messages | Server prepends from DB |
| LMCache hit-rate | Depends on client | Maximised via stable prefixes |
| SDKs | Every OpenAI SDK works as-is | Custom client |
| Streaming | Yes (SSE) | Yes (SSE) |
| Tools / Vision / JSON-mode | Yes | Yes |
Pick /v1/* if you want a drop-in OpenAI replacement. Pick /c1/* if you're
building your own app and want history persistence + ergonomic provider
selection in one call.
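Both APIs stream over SSE; on the passthrough side that is the stock OpenAI wire format, so existing SSE parsers need no changes. A hedged sketch (host, token, and model alias are placeholders):

```shell
BASE=http://gateway.example:8000
PAYLOAD='{
  "model": "llama-4-scout",
  "stream": true,
  "messages": [{"role": "user", "content": "Count to three"}]
}'

# -N disables curl's output buffering so SSE chunks print as they arrive.
curl -sN "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || true  # tolerate an unreachable gateway in offline runs
```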
Where to start
- Quickstart — your first request in under a minute.
- Authentication — Bearer token, rate limits, conversation isolation.
- Models & constraints — what each alias supports, behaviour quirks per model.
- Error catalog — every stable error code, with remediation.
- Examples — runnable `curl` recipes for tools, vision, conversations, image generation, and streaming.
Discovery endpoints (no auth required)
- `GET /healthz` — process is alive (returns `{"status":"ok"}`)
- `GET /readyz` — readiness with per-backend status (litellm, vllm)
- `GET /openapi.json` — raw OpenAPI 3.x spec
- `GET /openapi` — Swagger UI (interactive API reference)
- `GET /redoc` — ReDoc (alternative spec renderer)
- `GET /errors` — full machine-readable error catalog
- `GET /playground` — interactive playground for live experiments
- `GET /docs/*` — this documentation site
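Since none of these require a token, a quick loop is enough to probe a deployment. `BASE` below is a placeholder for wherever your gateway runs:

```shell
BASE=http://gateway.example:8000

# Hit each unauthenticated discovery endpoint and print the raw response.
for path in /healthz /readyz /openapi.json /errors; do
  echo "== $path"
  curl -s "$BASE$path"
  echo
done
```

`/readyz` is the one to poll in automation: unlike `/healthz`, it reports per-backend status, so it distinguishes "process up" from "vLLM actually serving".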