Hybrid LLM gateway
on a single GB10.

OpenAI-compatible chat completions backed by a local Llama-4-Scout (NVFP4) on NVIDIA DGX Spark, with Ollama Cloud and OpenRouter fallbacks. KV-cache reuse via LMCache. One bearer token, two parallel APIs (/v1 stateless, /c1 with conversation persistence), thirteen models exposed.

Open Playground → Quickstart API reference

/docs/

Markdown documentation site. Quickstart, authentication, error catalog, models & constraints, runnable examples.

/playground

Interactive sandbox. Model cards with capabilities, drag-and-drop image upload, tool-call builder, live streaming.

/openapi

Themed Swagger UI — interactive API reference with "Try it out" buttons.

/redoc

Themed ReDoc — calmer, read-friendly spec rendering.

/openapi.json

Raw OpenAPI 3.x schema for code generators (openapi-generator, quicktype).

/errors

Machine-readable catalog of every stable error code with type, status, remediation.

/healthz

Liveness probe (no auth). Returns {"status":"ok"}.

/readyz

Readiness probe with per-backend latency (litellm + vllm). No auth.

Two parallel APIs

Aspect	`/v1/*` — OpenAI passthrough	`/c1/*` — cache-augmented
Contract	Verbatim OpenAI Chat-Completions	Custom domain schema
State	Stateless	SQLite-persisted conversations
History	Client sends full `messages`	Server prepends from DB
SDKs	Every OpenAI SDK works as-is	Custom client
Streaming, tools, vision, JSON-mode	Yes	Yes
Backend selector	via `model` slug	via `model` slug or `provider` enum
Conversation CRUD	—	`GET / DELETE /c1/conversations/{id}`
Best for	Drop-in OpenAI replacement	Own apps wanting history + ergonomic provider switch

Hybrid LLM gatewayon a single GB10.

Two parallel APIs

Hybrid LLM gateway
on a single GB10.