{"openapi":"3.1.0","info":{"title":"DGX LLM Chat Gateway","description":"\n# DGX LLM Stack — Chat Gateway\n\nHybrid local + cloud LLM gateway running on an NVIDIA DGX Spark (GB10).\nForwards chat completions to a local **vLLM** with **Llama 4 Scout (NVFP4)**,\nfalling back to **Ollama Cloud** or **OpenRouter** when the local backend is\nunavailable. KV-cache reuse via **LMCache**, response cache via **LiteLLM/Redis**.\n\nAuthentication for every endpoint (except the public ones: `/`, `/healthz`,\n`/readyz`, `/errors`, `/openapi`, `/openapi.json`, `/redoc`, `/docs/*`,\n`/playground`): `Authorization: Bearer <RUST_API_BEARER>`.\n\n## Error format — stable, machine-friendly\n\nAll error responses follow the OpenAI-style envelope, extended with stable\nmachine codes:\n\n```json\n{ \"error\": {\n    \"type\":    \"invalid_request_error\",\n    \"code\":    \"image_url_not_supported\",\n    \"message\": \"Cloud providers don't fetch URLs; encode as base64 data URI.\",\n    \"param\":   \"messages[0].content[1].image_url.url\"\n} }\n```\n\nThe full code catalog is exposed at `GET /errors` (public, no auth) and\nrendered as human-readable prose at `/docs/errors`.\n\n## Two parallel APIs — comparison matrix\n\n| Aspect | `/v1/*` (OpenAI passthrough) | `/c1/*` (cache-augmented) |\n|---|---|---|\n| **Contract** | Verbatim OpenAI Chat-Completions | Custom domain schema |\n| **State** | Stateless | Conversation persistence (SQLite) |\n| **History prepend** | Client must send full `messages` | Service loads + prepends from DB |\n| **LMCache hit-rate** | depends on client behavior | maximized via stable prefixes |\n| **SDK compatibility** | every OpenAI SDK works as-is | requires custom client |\n| **Streaming** | yes (SSE, OpenAI-shaped) | yes (SSE, OpenAI-shaped) |\n| **Multimodal images** | client builds OpenAI content-array | `images: [...]` field, service builds array |\n| **Function calling** | client passes `tools` | service passes through `tools` |\n| **JSON mode** | `response_format` | 
`response_format` |\n| **Backend selector** | via `model` slug | via `model` slug **or** `provider` enum |\n| **Conversation CRUD** | n/a | `GET / DELETE /c1/conversations/{id}` |\n| **Best for** | drop-in replacement for the OpenAI base URL | your own apps that want history & an ergonomic provider switch |\n\n## Backend slugs (LiteLLM aliases)\n\n| Slug | Routes to |\n|---|---|\n| `llama-4-scout` | local vLLM → Ollama Cloud → OpenRouter (fallback chain) |\n| `llama-4-scout-local` | local vLLM only |\n| `llama-4-scout-ollama` | Ollama Cloud only |\n| `llama-4-scout-openrouter` | OpenRouter only |\n\nThe `/c1/chat` endpoint additionally accepts a `provider` enum (`auto`,\n`local`, `ollama`, `openrouter`) which expands to the matching slug. If both\n`model` and `provider` are set, `model` wins.\n","contact":{"name":"dietmar","email":"dietmar@scharf.am"},"license":{"name":""},"version":"0.1.0"},"paths":{"/c1/chat":{"post":{"tags":["c1 — cache-augmented"],"summary":"Cache-augmented chat with conversation persistence.","description":"History (including assistant replies) is loaded from SQLite and prepended.\nThe assistant's reply is automatically captured from the SSE stream (or\nunary JSON) and persisted at the end (audit #1) — unless `ephemeral=true`.","operationId":"chat","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/ChatRequest"}}},"required":true},"responses":{"200":{"description":"OpenAI-compatible chat completion (SSE if stream=true, JSON otherwise). Headers include x-conversation-id."},"400":{"description":"Bad request."},"401":{"description":"Missing or invalid bearer token."},"502":{"description":"Upstream error."}},"security":[{"BearerAuth":[]}]}},"/c1/conversations":{"get":{"tags":["c1 — cache-augmented"],"summary":"Lists conversations with pagination (audit #11).","operationId":"list_conversations","parameters":[{"name":"user_id","in":"query","description":"Optional user-scope filter. 
Without it, all conversations are returned.","required":false,"schema":{"type":["string","null"]}},{"name":"limit","in":"query","description":"Pagination limit (default 50, max 500).","required":false,"schema":{"type":"integer","format":"int64"}},{"name":"offset","in":"query","description":"Pagination offset.","required":false,"schema":{"type":"integer","format":"int64"}}],"responses":{"200":{"description":"Paginated array of conversation summaries."},"401":{"description":"Missing or invalid bearer token."}},"security":[{"BearerAuth":[]}]}},"/c1/conversations/{id}":{"get":{"tags":["c1 — cache-augmented"],"summary":"Returns the full message history for a conversation.","operationId":"get_conversation","parameters":[{"name":"id","in":"path","description":"Conversation ID","required":true,"schema":{"type":"string"}},{"name":"user_id","in":"query","description":"Optional user-scope filter","required":false,"schema":{"type":"string"}}],"responses":{"200":{"description":"Conversation history."},"401":{"description":"Missing or invalid bearer token."}},"security":[{"BearerAuth":[]}]},"delete":{"tags":["c1 — cache-augmented"],"summary":"Deletes all messages of a conversation.","operationId":"delete_conversation","parameters":[{"name":"id","in":"path","description":"Conversation ID","required":true,"schema":{"type":"string"}},{"name":"user_id","in":"query","description":"Optional user-scope filter","required":false,"schema":{"type":"string"}}],"responses":{"200":{"description":"JSON with `deleted` count."},"401":{"description":"Missing or invalid bearer token."}},"security":[{"BearerAuth":[]}]}},"/errors":{"get":{"tags":["docs"],"summary":"Returns the full error catalog. 
Public, no auth.","operationId":"list_errors","responses":{"200":{"description":"Full error catalog with stable codes, types, statuses, and remediation hints.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ErrorCatalog"}}}}}}},"/healthz":{"get":{"tags":["health"],"summary":"Liveness probe.","operationId":"healthz","responses":{"200":{"description":"Process is alive."}}}},"/readyz":{"get":{"tags":["health"],"summary":"Readiness probe with per-backend details.","operationId":"readyz","responses":{"200":{"description":"Readiness report with per-backend status."}}}},"/v1/chat/completions":{"post":{"tags":["v1 — OpenAI passthrough"],"summary":"OpenAI-compatible chat-completions passthrough.","operationId":"chat_completions","requestBody":{"description":"OpenAI chat-completions request body (verbatim passthrough)","content":{"application/json":{"schema":{"type":"object"},"example":{"messages":[{"content":"hi","role":"user"}],"model":"llama-4-scout","stream":false}}},"required":true},"responses":{"200":{"description":"OpenAI-style chat completion (SSE or JSON)."},"400":{"description":"Missing/invalid `messages` field, or `model` not in allowlist."},"401":{"description":"Missing or invalid bearer token."},"502":{"description":"Upstream error."}},"security":[{"BearerAuth":[]}]}},"/v1/info":{"get":{"tags":["v1 — OpenAI passthrough"],"summary":"Capability catalog of all exposed models.","description":"Contains, for every public alias (`llama-4-scout`, `gemma-4-31b`, …), the\nstatically known properties plus the active backend routing list.","operationId":"model_info","responses":{"200":{"description":"Catalog of all exposed model aliases with capabilities and backends.","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ModelCatalog"}}}},"401":{"description":"Missing or invalid bearer token."}},"security":[{"BearerAuth":[]}]}},"/v1/models":{"get":{"tags":["v1 — OpenAI passthrough"],"summary":"Lists models exposed by the 
LiteLLM router.","description":"Cached in `AppState::models_cache` with a 60 s TTL — the upstream\ncontent only changes when `litellm/config.yaml` is reloaded, which is\nrare. The cache keeps p50 latency under 1 ms instead of paying the\nLiteLLM round-trip (~ 2-3 ms) on every call.","operationId":"models","responses":{"200":{"description":"OpenAI-style model list."},"401":{"description":"Missing or invalid bearer token."}},"security":[{"BearerAuth":[]}]}}},"components":{"schemas":{"BackendInfo":{"type":"object","required":["provider","litellm_alias","status"],"properties":{"litellm_alias":{"type":"string"},"notes":{"type":["string","null"]},"pricing":{"type":["string","null"],"description":"Pricing hint (`\"free\"`, `\"$0.06/M in, $0.33/M out\"`, `null` for local)."},"provider":{"type":"string"},"status":{"$ref":"#/components/schemas/BackendStatus"}}},"BackendStatus":{"type":"string","enum":["available","deprecated","unconfigured","planned"]},"ChatRequest":{"type":"object","description":"Request body for `POST /c1/chat`.","required":["message","tools","tool_choice","response_format"],"properties":{"conversation_id":{"type":["string","null"],"description":"Conversation key. If omitted, a new UUID is generated and returned in\nthe `x-conversation-id` response header.","example":"demo1"},"ephemeral":{"type":"boolean","description":"When `true`, neither user message nor assistant reply is persisted."},"images":{"type":"array","items":{"$ref":"#/components/schemas/ImageRef"},"description":"Image attachments for the current turn. Used only when `message` is a\nplain string — service rebuilds the user message as an OpenAI\ncontent-parts array. Ignored if `message` is already a parts array."},"keep_last_turns":{"type":["integer","null"],"format":"int32","description":"Optional bound on conversation history sent to the upstream model.\nPersistence is unaffected — the full history is always written and\nreadable via `GET /c1/conversations/:id`. 
Only the *current request's*\npayload to LiteLLM/vLLM is trimmed.\n\nSemantics: the system message (if present) is always kept, plus up to the\nlast `2 * keep_last_turns` user/assistant messages. `None` means\nkeep everything (the current default behaviour). `0` is treated as\n\"system + current user only\" — useful for memory-less single-turn\ncompletions on a persisted thread.","example":5,"minimum":0},"max_tokens":{"type":["integer","null"],"format":"int32","description":"Output-token cap.","example":512,"minimum":0},"message":{"type":"object","description":"Current user message. Either a plain string (text-only) or an\nOpenAI-style content-parts array for inline multimodal input."},"model":{"type":["string","null"],"description":"Explicit LiteLLM model alias. If set, takes precedence over `provider`.","example":"llama-4-scout"},"provider":{"oneOf":[{"type":"null"},{"$ref":"#/components/schemas/Provider","description":"Convenience selector resolved to `<default_model>-<provider>`. Ignored\nwhen `model` is set."}]},"response_format":{"type":"object","description":"e.g. `{\"type\": \"json_object\"}` for JSON mode."},"stream":{"type":"boolean","description":"Defaults to `true`. When `false` → unary JSON response instead of SSE."},"system_prompt":{"type":["string","null"],"description":"Stored on the very first turn; ignored on subsequent turns of the same\nconversation.","example":"You answer concisely in German."},"temperature":{"type":["number","null"],"format":"float","description":"Sampling temperature 0.0–2.0.","example":0.7},"tool_choice":{"type":"object","description":"`\"auto\"` | `\"none\"` | `\"required\"` | function-object."},"tools":{"type":"object","description":"OpenAI-compatible `tools` array (function-calling)."},"user_id":{"type":["string","null"],"description":"Optional user-scope key (audit #12). Conversations are isolated per\nuser_id. 
If omitted, conversations are global to all users with the\nbearer token.","example":"alice"}}},"ConversationSummary":{"type":"object","description":"Summary entry returned by `GET /c1/conversations` (audit #11).","required":["conversation_id","message_count"],"properties":{"conversation_id":{"type":"string","example":"demo1"},"last_updated":{"type":["string","null"],"description":"RFC 3339 timestamp of the most recent message."},"message_count":{"type":"integer","format":"int64"},"user_id":{"type":["string","null"],"description":"Optional user-scope (audit #12).","example":"alice"}}},"ErrorBody":{"type":"object","description":"Wire-format of an error response. `param` and `code` are optional only\nfor backwards compatibility — new responses always set them.","required":["type","code","message"],"properties":{"code":{"type":"string"},"message":{"type":"string"},"param":{"type":["string","null"]},"type":{"$ref":"#/components/schemas/ErrorType"}}},"ErrorCatalog":{"type":"object","required":["version","count","entries"],"properties":{"count":{"type":"integer","description":"Total number of codes in this catalog.","minimum":0},"entries":{"type":"array","items":{"$ref":"#/components/schemas/ErrorCatalogEntry"}},"version":{"type":"integer","format":"int32","description":"Stable wire-format version of this catalog. Bump on breaking\nchanges (renamed code, removed field). Adding new codes is\nnon-breaking and does not bump.","minimum":0}}},"ErrorCatalogEntry":{"type":"object","description":"Catalog entry with all detail fields. Rendered in the `/errors` JSON\nand the `/docs/errors` HTML.","required":["code","type","http_status","title","description","remediation"],"properties":{"code":{"type":"string"},"description":{"type":"string","description":"Long-form detail text, rendered as Markdown at `/docs/errors`."},"http_status":{"type":"integer","format":"int32","minimum":0},"remediation":{"type":"string","description":"What the client can do about it."},"title":{"type":"string"},"type":{"$ref":"#/components/schemas/ErrorType"},"typical_param":{"type":["string","null"],"description":"If relevant: the field that is typically at fault, in\nJSON-Pointer style. `None` for codes without a field reference."}}},"ErrorCode":{"type":"string","description":"Stable, machine-readable error codes. Strings exposed in JSON.","enum":["missing_authorization","invalid_authorization","rate_limit_exceeded","body_too_large","invalid_json","missing_field","invalid_field","model_not_in_allowlist","max_tokens_below_minimum","image_url_not_supported","image_decode_error","conversation_not_found","route_not_found","upstream_error","upstream_timeout","upstream_unavailable","internal_error","storage_error"]},"ErrorEnvelope":{"type":"object","required":["error"],"properties":{"error":{"$ref":"#/components/schemas/ErrorBody"}}},"ErrorType":{"type":"string","description":"Top-level error categories. They mirror HTTP status semantics.\n\nAdd a `PermissionDenied` (403) variant when the API gains an authorization\nsurface beyond bearer auth (e.g. 
cross-user conversation access on `/c1`).","enum":["invalid_request_error","authentication_error","not_found","rate_limit_exceeded","upstream_error","internal_error"]},"ImageRef":{"type":"object","required":["url"],"properties":{"detail":{"type":["string","null"],"example":"high"},"url":{"type":"string","example":"https://example.com/cat.jpg"}}},"MessageBody":{"oneOf":[{"type":"string"},{"type":"array","items":{}}],"description":"Either a plain text string or an OpenAI-style content-parts array\n(`[{\"type\":\"text\",...},{\"type\":\"image_url\",...}]`). Audit #9."},"Modalities":{"type":"object","required":["text_in","text_out","image_in","audio_in","video_in","image_out"],"properties":{"audio_in":{"type":"boolean"},"image_in":{"type":"boolean"},"image_out":{"type":"boolean"},"text_in":{"type":"boolean"},"text_out":{"type":"boolean"},"video_in":{"type":"boolean"}}},"ModelCatalog":{"type":"object","required":["models","notes"],"properties":{"models":{"type":"array","items":{"$ref":"#/components/schemas/ModelInfo"}},"notes":{"type":"string"}}},"ModelConstraints":{"type":"object","description":"Behaviour constraints that the rust-api enforces before forwarding to the\nupstream provider. These reflect real-world quirks of LiteLLM and the\nunderlying providers — see `docs/INSTALL.md` Troubleshooting section for\nthe original observations and `error_catalog::ImageUrlNotSupported` /\n`MaxTokensBelowMinimum` for the corresponding error codes.","required":["accepts_image_url"],"properties":{"accepts_image_url":{"type":"boolean","description":"`false` ⇒ requests carrying `image_url.url` with an http(s)://-URL get\nrejected with HTTP 400 and `code=image_url_not_supported`. 
The client\nmust inline the image as a base64 `data:image/...;base64,...`-URI.\nVerified for **all** cloud providers (Anthropic / Google /\nOpenAI / xAI / OpenRouter) and the local vLLM Llama-4-Scout backend."},"min_max_tokens":{"type":["integer","null"],"format":"int32","description":"If set, requests with `max_tokens < min` are floored to this\nvalue. The applied adjustment is reported back in the response header\n`x-rust-api-applied: max_tokens_floored=<N>`. Reasoning models need\nat least 200 to leave room for hidden reasoning tokens before any\nvisible content; OpenAI-via-OpenRouter rejects values below 16.","minimum":0},"typical_response_seconds":{"type":["integer","null"],"format":"int32","description":"Order-of-magnitude hint for client-side timeouts. Cold (uncached)\ninference on the typical-length payload from `/docs/examples`.","minimum":0}}},"ModelInfo":{"type":"object","required":["alias","description","family","context_window","modalities","tools","json_mode","reasoning_mode","license","backends","constraints","notes"],"properties":{"alias":{"type":"string","description":"Public alias as routed via LiteLLM (`llama-4-scout`, `mistral-small-4`, …)."},"backends":{"type":"array","items":{"$ref":"#/components/schemas/BackendInfo"},"description":"Routing chain (in the order LiteLLM tries them)."},"constraints":{"$ref":"#/components/schemas/ModelConstraints","description":"Behaviour constraints — used by the request-validation middleware to\nfail fast (`image_url_not_supported`) or auto-floor (`max_tokens`)\nbefore the upstream returns an opaque error. Clients can read this to\ngenerate predictable wrappers."},"context_window":{"type":"integer","format":"int32","description":"Effective context window in tokens (max_input + max_output combined).","minimum":0},"description":{"type":"string","description":"Short human-readable description."},"family":{"type":"string","description":"Vendor/family, e.g. 
\"Meta Llama 4\", \"Google Gemma 4\"."},"json_mode":{"type":"boolean","description":"JSON-mode / structured output supported."},"license":{"type":"string","description":"License (Apache-2.0, Llama-4-Community, Gemma-Terms, …)."},"modalities":{"$ref":"#/components/schemas/Modalities","description":"Modalities supported."},"notes":{"type":"string","description":"Free-form notes (caveats, performance hints, etc.)."},"reasoning_mode":{"type":"boolean","description":"Configurable reasoning/thinking mode (e.g. Mistral Small 4, Gemma 4)."},"tools":{"type":"boolean","description":"Native function/tool calling supported."}}},"Provider":{"type":"string","enum":["auto","local","ollama","openrouter"]},"ReadyReport":{"type":"object","description":"Detailed readiness probe (audit #17). Probes each backend with a 1s\ntimeout in parallel.","required":["ready","litellm","vllm_local"],"properties":{"litellm":{"$ref":"#/components/schemas/BackendStatus"},"ready":{"type":"boolean","description":"`true` if at least LiteLLM is reachable (gateway can serve any request)."},"vllm_local":{"$ref":"#/components/schemas/BackendStatus"}}},"StoredMessage":{"type":"object","description":"Persisted message of a conversation. `content` is either a plain string\n(text-only turn) or an OpenAI-style content-part array (multimodal turn\nwith text + `image_url` blocks).","required":["role","content"],"properties":{"content":{"type":"object","description":"Either a string or a content-parts array. OpenAI-compatible."},"role":{"type":"string","description":"One of `system`, `user`, `assistant`.","example":"user"}}}},"securitySchemes":{"BearerAuth":{"type":"http","scheme":"bearer","bearerFormat":"opaque"}}},"tags":[{"name":"v1 — OpenAI passthrough","description":"Stateless OpenAI-compatible endpoints. Use for drop-in SDK compatibility."},{"name":"c1 — cache-augmented","description":"Domain-specific endpoints with conversation persistence and provider selector. 
Use for your own apps."},{"name":"health","description":"Liveness and readiness probes (no auth required)."},{"name":"docs","description":"Self-documentation endpoints: error catalog, capability discovery (no auth required)."}]}