DGX LLM Chat Gateway

Models & constraints

Each model alias is described in /v1/info with its modalities, capabilities, licensing, backend routing chain, and behaviour constraints — the latter being client-relevant quirks that the rust-api enforces before forwarding to the upstream.

curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | {alias, family, context_window, tools, constraints}'

Model summary

| Alias                 | Family                         | Context | Tools | Vision            | Reasoning    | min_max_tokens | Typical (s) |
|-----------------------|--------------------------------|---------|-------|-------------------|--------------|----------------|-------------|
| llama-4-scout         | Meta Llama 4                   | 10 M    |       |                   |              |                | 8           |
| mistral-small-4       | Mistral Small 4                | 256 K   |       |                   | configurable |                | 5           |
| qwen3-vl-30b-thinking | Qwen3-VL                       | 131 K   |       | ✓                 | ✓            |                | 12          |
| qwen3-vl-30b-instruct | Qwen3-VL                       | 131 K   |       | ✓                 |              |                | 5           |
| gemma-4-31b           | Google Gemma 4                 | 256 K   |       |                   |              |                | 5           |
| flagship              | composite                      | 1.05 M  |       |                   | ✓            | 200            | 8           |
| gpt-5.5-pro           | OpenAI GPT-5.5                 | 1.05 M  |       |                   | ✓            | 200            | 8           |
| claude-opus-4.7       | Anthropic Claude 4.7           | 1 M     |       |                   |              |                | 5           |
| gemini-3.1-pro        | Google Gemini 3.1              | 1.05 M  |       | ✓ + audio + video | ✓            | 200            | 5           |
| grok-4.20             | xAI Grok 4                     | 2 M     |       |                   |              |                | 3           |
| nano-banana           | Google Gemini Image            | 65 K    |       | ✓ in / ✓ out      |              |                | 15          |
| gpt-image             | OpenAI GPT-Image               | 272 K   |       | ✓ in / ✓ out      |              | 16             | 150         |
| image-gen             | composite (banana → gpt-image) | 65 K    |       | ✓ in / ✓ out      |              |                | 15          |

The min_max_tokens column is the floor that rust-api silently applies when your request specifies a smaller value. Reasoning models need ≥ 200 to leave room for hidden reasoning tokens before any visible content; OpenAI-via-OpenRouter refuses values below 16. Floored values are reported in the response header x-rust-api-applied: max_tokens_floored=N.
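
To watch the floor being applied, send a reasoning model a deliberately small max_tokens and inspect the headers. A sketch, assuming the standard /v1/chat/completions route of the OpenAI-style API this gateway exposes:

curl -si https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.5-pro", "max_tokens": 50, "messages": [{"role": "user", "content": "Hi"}]}' \
  | grep -i '^x-rust-api-applied'
# expected: x-rust-api-applied: max_tokens_floored=200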

constraints schema

Every entry in /v1/info exposes a constraints object:

{
  "min_max_tokens": 200,
  "accepts_image_url": false,
  "typical_response_seconds": 8
}
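
To read the constraints for a single alias, filter the same /v1/info payload:

curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | select(.alias == "gpt-image") | .constraints'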

Per-model notes

Local: llama-4-scout

The local backend runs vLLM 26.01-py3 with LMCache active. Cold first-token latency is dominated by queue time plus the (often warm) prefix scan; a typical short Q&A takes ~8 s end-to-end through the tunnel, while repeated requests with the same prefix collapse to 30-40 ms via LMCache.
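
A rough way to observe the cache effect is to fire the same request twice and compare wall-clock times. A sketch, assuming the standard /v1/chat/completions route (absolute numbers vary with queue depth):

for i in 1 2; do
  curl -s -o /dev/null -w "run $i: %{time_total}s\n" \
    https://dgx-spark-4236.spass.fun/v1/chat/completions \
    -H "Authorization: Bearer $BEARER" \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-4-scout", "messages": [{"role": "user", "content": "Say hi."}]}'
done
# run 1 pays the cold queue + prefix scan; run 2 should hit the LMCache prefix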

Fallback chain: local vLLM → Ollama Cloud → OpenRouter. If the local container is down (/readyz shows vllm: unreachable), LiteLLM transparently routes the next call to Ollama Cloud or OpenRouter without changing the model field.
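
A quick health probe before sending traffic (a sketch: /readyz is quoted above, but its exact payload shape isn't specified here):

curl -s https://dgx-spark-4236.spass.fun/readyz
# look for "vllm: unreachable" while the local container is down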

Reasoning: gpt-5.5-pro, gemini-3.1-pro, flagship

These models run a hidden "thinking" pass before emitting any visible content. With a small max_tokens you get an empty content field and finish_reason: length. The rust-api silently floors max_tokens to 200; if you need long answers, set max_tokens explicitly (e.g. 1500 for an essay-length reply).
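
For illustration, the abridged shape of a truncated reply in the OpenAI-style response format (other fields omitted):

{
  "choices": [{
    "message": { "role": "assistant", "content": "" },
    "finish_reason": "length"
  }]
}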

gemini-3.1-pro is the only flagship in this set that accepts audio and video input via OpenRouter. See /v1/info for the modalities field.

Image generation: nano-banana, gpt-image, image-gen

nano-banana (Google Gemini 3.1 Flash Image) is fast (~15 s) and cheap. gpt-image (OpenAI GPT-5.4 + GPT-Image-2) is much slower (100-180 s) and ~5× more expensive but renders text-in-image and follows complex instructions better.

The composite alias image-gen runs nano-banana first and falls back to gpt-image only on hard failures, giving you the best of both.

Output is delivered as base64-encoded image data inside an OpenAI-style chat completion — the bytes live in choices[0].message.images[0].image_url.url as a data:image/...;base64,... string. Set client timeout ≥ 240 s when calling gpt-image directly.
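
Putting that together, one way to save the generated image to disk. A sketch, assuming the standard /v1/chat/completions route; --max-time 240 covers a slow gpt-image fallback:

curl -s --max-time 240 https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"model": "image-gen", "messages": [{"role": "user", "content": "A watercolor fox"}]}' \
  | jq -r '.choices[0].message.images[0].image_url.url' \
  | sed -E 's|^data:image/[a-z]+;base64,||' \
  | base64 -d > out.png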

Vision input — base64 only

All cloud providers refuse to fetch image URLs server-side, and the local vLLM won't fetch them either. Inline your images as base64 data URIs:

B64=$(base64 -w 0 photo.jpg)
curl -s https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d "{
  \"model\": \"qwen3-vl-30b-instruct\",
  \"messages\": [{
    \"role\": \"user\",
    \"content\": [
      {\"type\": \"text\", \"text\": \"What is in this image?\"},
      {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$B64\"}}
    ]
  }]
}"

Sending an https://... URL gets you HTTP 400 with code: image_url_not_supported and a param pointing at the offending field.

Backend status

The backends array in /v1/info lists each routing destination with status:
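
For illustration only, a plausible entry; every field name here except status is an assumption, not confirmed by this page:

{
  "name": "openrouter",
  "status": "available"
}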

Only backends whose status is available are reachable at runtime; the rest are catalog metadata, listed for transparency.