DGX LLM Chat Gateway

Hybrid local + cloud LLM gateway running on a single NVIDIA DGX Spark (GB10). Forwards chat completions to a local vLLM serving Llama-4-Scout (NVFP4), falling back to Ollama Cloud or OpenRouter when the local backend is unavailable. KV-cache reuse via LMCache; response caching via LiteLLM/Redis.
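The fallback order above (local vLLM first, then the cloud providers) can be sketched as a simple priority walk. This is an illustrative sketch, not the gateway's actual implementation; the backend names and the health map are assumptions.

```python
from typing import Dict, List

# Priority order described above: local vLLM first, cloud providers after.
# These identifiers are illustrative, not the gateway's real config keys.
BACKENDS: List[str] = ["local-vllm", "ollama-cloud", "openrouter"]

def pick_backend(is_up: Dict[str, bool], order: List[str] = BACKENDS) -> str:
    """Return the first healthy backend in priority order."""
    for name in order:
        if is_up.get(name, False):
            return name
    raise RuntimeError("no LLM backend available")
```

For example, `pick_backend({"local-vllm": False, "ollama-cloud": True})` returns `"ollama-cloud"`: the request is forwarded to the first cloud fallback only because local vLLM reported unhealthy.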

This site is the client-facing reference for the gateway. Operators looking for installation, system maintenance, or troubleshooting should read docs/INSTALL.md in the repo instead.

Two parallel APIs

| | /v1/* (OpenAI passthrough) | /c1/* (cache-augmented) |
|---|---|---|
| Contract | OpenAI Chat Completions verbatim | Custom domain schema |
| State | Stateless | Conversation persistence (SQLite) |
| History | Client sends full messages | Server prepends from DB |
| LMCache hit-rate | Depends on client | Maximised via stable prefixes |
| SDKs | Every OpenAI SDK works as-is | Custom client |
| Streaming | Yes (SSE) | Yes (SSE) |
| Tools / Vision / JSON-mode | Yes | Yes |

Pick /v1/* if you want a drop-in OpenAI replacement. Pick /c1/* if you're building your own app and want history persistence + ergonomic provider selection in one call.
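Because /v1/* is an OpenAI passthrough, a standard-library client needs nothing beyond the usual Chat Completions request shape. A minimal sketch follows; the gateway address, model name, and API key are placeholders for your deployment, not values from this documentation.

```python
import json
import urllib.request

GATEWAY = "http://localhost:8000"  # assumed gateway address; replace with yours

def build_chat_request(messages, model="llama-4-scout", stream=False):
    """Build the OpenAI-style request body; /v1/* accepts it verbatim.
    The model name here is a placeholder, not necessarily the served one."""
    return {"model": model, "messages": messages, "stream": stream}

def chat(messages, api_key="sk-..."):
    """POST to the standard Chat Completions path and return the parsed JSON."""
    body = json.dumps(build_chat_request(messages)).encode()
    req = urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Hello"}]))
```

Since the contract is verbatim OpenAI, any official OpenAI SDK pointed at the gateway's base URL should work the same way; the stdlib version above just makes the wire format explicit.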

Where to start

Discovery endpoints (no auth required)