Hybrid LLM gateway
on a single GB10.
OpenAI-compatible chat completions backed by a local Llama-4-Scout (NVFP4)
on NVIDIA DGX Spark, with Ollama Cloud and OpenRouter
fallbacks. KV-cache reuse via LMCache. One bearer token, three
parallel APIs (/v1 stateless, /c1 with conversation
persistence, /a1 agent system with multi-turn sessions, RAG, and
C4-hybrid compaction), 23 server-side tools, per-tenant configuration cascade.
/docs/
Markdown documentation site. Quickstart, authentication, error catalog, models & constraints, runnable examples.
/playground
Interactive sandbox. Model cards with capabilities, drag-and-drop image upload, tool-call builder, live streaming.
/openapi
Themed Swagger UI — interactive API reference with "Try it out" buttons.
/redoc
Themed ReDoc — calmer, read-friendly spec rendering.
/openapi.json
Raw OpenAPI 3.x schema for code generators (openapi-generator, quicktype).
/errors
Machine-readable catalog of every stable error code with type, status, remediation.
/v1/info
Model catalog with capabilities (modalities, tools, context-window) — used by playground, /a1 agents, list_models tool.
/metrics
Prometheus exposition (request counts, latency histograms, cost EUR/USD per request, sub-call counters).
/healthz
Liveness probe (no auth). Returns
{"status":"ok"}.
/readyz
Readiness probe with per-backend latency (litellm + vllm). No auth.
Three parallel APIs
| Aspect | /v1/* — OpenAI passthrough | /c1/* — cache-augmented | /a1/* — agent system |
|---|---|---|---|
| Contract | Verbatim OpenAI Chat-Completions | Custom domain schema | YAML-defined agents |
| State | Stateless | SQLite-persisted conversations | SQLite multi-turn sessions + RAG indices + image-store |
| History | Client sends full messages | Server prepends from DB | Server replays per session, auto-compact above 80 % context |
| SDKs | Every OpenAI SDK works as-is | Custom client | Direct HTTP / OpenAI-compat for sub-pieces |
| Streaming, tools, vision, JSON-mode | Yes | Yes | Inherited via /v1 self-call (ADR 0011) |
| Backend selector | via model slug | via model slug or provider enum | pinned in agent.yaml |
| Tools | Caller-supplied tools[] + 14 stack tools auto-injected (configurable) | Caller-supplied tools[] + stack tools | tools: [...] in agent.yaml — 23 stack tools available (incl. 9 RAG-CRUD) |
| RAG | via /v1/embeddings + caller-side store | via rag_query tool | passive (rag_index:) or active (rag_query) + full CRUD via 9 dedicated tools (B2 confirm + D4 preview) |
| Compaction | — | — | POST /sessions/<sid>/compact + auto-trigger + Z-chained lineage (Cut 2.12) |
| Multi-tenant | per Bearer | per Bearer | per Bearer (sessions, RAG, images, configuration tenant-scoped) |
| Per-tenant config | 3-level cascade (process → yaml → DB) via GET/PUT/DELETE /v1/tenant/config (ADR 0013, Cut 2.13) | ||
| CRUD endpoints | /v1/images/* | GET/DELETE /c1/conversations/{id} | agents, sessions, RAG indices, RAG documents, lineage |
| Best for | Drop-in OpenAI replacement | Own apps wanting history + provider switch | Pre-configured agents with tools, RAG, and long conversations |