Hybrid LLM gateway on a single GB10

OpenAI-compatible chat completions backed by a local Llama-4-Scout (NVFP4) on NVIDIA DGX Spark, with Ollama Cloud and OpenRouter fallbacks. KV-cache reuse via LMCache. One bearer token, two parallel APIs (`/v1` stateless, `/c1` with conversation persistence), thirteen models exposed.
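Because `/v1` is a verbatim OpenAI passthrough, any OpenAI-style client works against it. The sketch below builds such a request with only the standard library; the host, port, token, and model slug are placeholders for illustration, not values from this deployment.

```python
import json
import urllib.request

# Assumed gateway address and credentials -- substitute your own.
BASE = "http://localhost:8000"
TOKEN = "YOUR_BEARER_TOKEN"

payload = {
    "model": "llama-4-scout",  # hypothetical slug; /playground lists the real ones
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",  # the single bearer token
        "Content-Type": "application/json",
    },
    method="POST",
)
# With a live gateway, send it and read the reply:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

The same request shape works unchanged with any official OpenAI SDK by pointing its base URL at the gateway.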
- `/docs/`: Markdown documentation site. Quickstart, authentication, error catalog, models & constraints, runnable examples.
- `/playground`: Interactive sandbox. Model cards with capabilities, drag-and-drop image upload, tool-call builder, live streaming.
- `/openapi`: Themed Swagger UI, an interactive API reference with "Try it out" buttons.
- `/redoc`: Themed ReDoc, a calmer, read-friendly rendering of the spec.
- `/openapi.json`: Raw OpenAPI 3.x schema for code generators (openapi-generator, quicktype).
- `/errors`: Machine-readable catalog of every stable error code with type, status, and remediation.
- `/healthz`: Liveness probe (no auth). Returns `{"status":"ok"}`.
- `/readyz`: Readiness probe with per-backend latency (litellm + vllm). No auth.
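A deployment script polling `/healthz` only needs to check the documented `{"status":"ok"}` body. The helper below is an illustrative sketch of that check; only the response shape comes from the docs above.

```python
import json

def is_live(body: str) -> bool:
    """Return True when a /healthz response body reports the gateway healthy."""
    try:
        return json.loads(body).get("status") == "ok"
    except json.JSONDecodeError:
        # e.g. an HTML error page from a reverse proxy in front of the gateway
        return False

print(is_live('{"status":"ok"}'))   # True
print(is_live("<html>502</html>"))  # False
```

`/readyz` additionally reports per-backend latency, so a readiness check would inspect those fields rather than a single status string.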
Two parallel APIs
| Aspect | /v1/* — OpenAI passthrough | /c1/* — cache-augmented |
|---|---|---|
| Contract | Verbatim OpenAI Chat-Completions | Custom domain schema |
| State | Stateless | SQLite-persisted conversations |
| History | Client sends full messages | Server prepends from DB |
| SDKs | Every OpenAI SDK works as-is | Custom client |
| Streaming, tools, vision, JSON-mode | Yes | Yes |
| Backend selector | via model slug | via model slug or provider enum |
| Conversation CRUD | — | `GET` / `DELETE` `/c1/conversations/{id}` |
| Best for | Drop-in OpenAI replacement | Your own apps that want history plus an ergonomic provider switch |
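The `/c1` contract is a custom domain schema that is not reproduced here, so the sketch below is only a guess at what a persisted-conversation request might carry: every field name (`conversation_id`, `provider`, `message`) is an assumption, not the real contract. Only the conversation CRUD path comes from the table above.

```python
# Hypothetical /c1 request body -- field names are assumptions for illustration.
payload = {
    "conversation_id": "abc123",  # server prepends stored history from SQLite
    "provider": "local",          # provider enum (assumed value), or a model slug
    "message": {"role": "user", "content": "Continue where we left off."},
}

# Conversation CRUD per the table: GET / DELETE /c1/conversations/{id}
conversation_url = "/c1/conversations/abc123"
```

The key contrast with `/v1`: the client sends only the newest message, and the server supplies the history.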