DGX LLM Chat Gateway

Models & constraints

Each model alias is described in /v1/info with its modalities, capabilities, licensing, backend routing chain, and behaviour constraints — the latter being client-relevant quirks that the rust-api enforces before forwarding to the upstream.

curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | {alias, family, context_window, tools, constraints}'

Model summary

| Alias                 | Family                         | Context | Tools | Vision            | Reasoning    | min_max_tokens | Typical (s) |
|-----------------------|--------------------------------|---------|-------|-------------------|--------------|----------------|-------------|
| llama-4-scout         | Meta Llama 4                   | 10 M    |       |                   |              |                | 8           |
| mistral-small-4       | Mistral Small 4                | 256 K   |       |                   | configurable |                | 5           |
| qwen3-vl-30b-thinking | Qwen3-VL                       | 131 K   |       | ✓                 | ✓            |                | 12          |
| qwen3-vl-30b-instruct | Qwen3-VL                       | 131 K   |       | ✓                 |              |                | 5           |
| gemma-4-31b           | Google Gemma 4                 | 256 K   |       |                   |              |                | 5           |
| flagship              | composite                      | 1.05 M  |       |                   | ✓            | 200            | 8           |
| gpt-5.5-pro           | OpenAI GPT-5.5                 | 1.05 M  |       |                   | ✓            | 200            | 8           |
| claude-opus-4.7       | Anthropic Claude 4.7           | 1 M     |       |                   |              |                | 5           |
| gemini-3.1-pro        | Google Gemini 3.1              | 1.05 M  |       | ✓ + audio + video | ✓            | 200            | 5           |
| grok-4.20             | xAI Grok 4                     | 2 M     |       |                   |              |                | 3           |
| nano-banana           | Google Gemini Image            | 65 K    |       | ✓ in / ✓ out      |              |                | 15          |
| gpt-image             | OpenAI GPT-Image               | 272 K   |       | ✓ in / ✓ out      |              | 16             | 150         |
| image-gen             | composite (banana → gpt-image) | 65 K    |       | ✓ in / ✓ out      |              |                | 15          |

The min_max_tokens column is the floor that rust-api silently applies when your request specifies a smaller value. Reasoning models need ≥ 200 to leave room for hidden reasoning tokens before any visible content; OpenAI-via-OpenRouter refuses values below 16. Floored values are reported in the response header x-rust-api-applied: max_tokens_floored=N.
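
To watch the floor being applied, send a reasoning model a deliberately small max_tokens and inspect the headers. A sketch, assuming the standard /v1/chat/completions route of the OpenAI-style API this gateway exposes:

curl -si https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.5-pro", "max_tokens": 50, "messages": [{"role": "user", "content": "Hi"}]}' \
  | grep -i '^x-rust-api-applied'
# expected: x-rust-api-applied: max_tokens_floored=200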

constraints schema

Every entry in /v1/info exposes a constraints object:

{
  "min_max_tokens": 200,
  "accepts_image_url": false,
  "typical_response_seconds": 8
}
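
To read the constraints for a single alias, filter the same /v1/info payload:

curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | select(.alias == "gpt-image") | .constraints'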

Per-model notes

Local: llama-4-scout

The local backend runs vLLM 26.01-py3 with LMCache active. Cold first-token latency is dominated by queue time plus the (often warm) prefix scan; a typical short Q&A takes ~8 s end-to-end through the tunnel, while repeated requests with the same prefix collapse to 30-40 ms via LMCache.
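
A rough way to observe the cache effect is to fire the same request twice and compare wall-clock times. A sketch, assuming the standard /v1/chat/completions route (absolute numbers vary with queue depth):

for i in 1 2; do
  curl -s -o /dev/null -w "run $i: %{time_total}s\n" \
    https://dgx-spark-4236.spass.fun/v1/chat/completions \
    -H "Authorization: Bearer $BEARER" \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-4-scout", "messages": [{"role": "user", "content": "Say hi."}]}'
done
# run 1 pays the cold queue + prefix scan; run 2 should hit the LMCache prefix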

Fallback chain: local vLLM → Ollama Cloud → OpenRouter. If the local container is down (/readyz shows vllm: unreachable), LiteLLM transparently routes the next call to Ollama Cloud or OpenRouter without changing the model field.
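
A quick health probe before sending traffic (a sketch: /readyz is quoted above, but its exact payload shape isn't specified here):

curl -s https://dgx-spark-4236.spass.fun/readyz
# look for "vllm: unreachable" while the local container is down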

Reasoning: gpt-5.5-pro, gemini-3.1-pro, flagship

These models run a hidden "thinking" pass before emitting any visible content. With a small max_tokens you get an empty content field and finish_reason: length. The rust-api silently floors max_tokens to 200; if you need long answers, set max_tokens explicitly (e.g. 1500 for an essay-length reply).
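
For illustration, the abridged shape of a truncated reply in the OpenAI-style response format (other fields omitted):

{
  "choices": [{
    "message": { "role": "assistant", "content": "" },
    "finish_reason": "length"
  }]
}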

gemini-3.1-pro is the only flagship in this set that accepts audio and video input via OpenRouter. See /v1/info for the modalities field.

Image generation: nano-banana, gpt-image, image-gen

nano-banana (Google Gemini 3.1 Flash Image) is fast (~15 s) and cheap. gpt-image (OpenAI GPT-5.4 + GPT-Image-2) is much slower (100-180 s) and ~5× more expensive but renders text-in-image and follows complex instructions better.

The composite alias image-gen runs nano-banana first and falls back to gpt-image only on hard failures, giving you the best of both.

Output is delivered as base64-encoded image data inside an OpenAI-style chat completion — the bytes live in choices[0].message.images[0].image_url.url as a data:image/...;base64,... string. Set client timeout ≥ 240 s when calling gpt-image directly.
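
Putting that together, one way to save the generated image to disk. A sketch, assuming the standard /v1/chat/completions route; --max-time 240 covers a slow gpt-image fallback:

curl -s --max-time 240 https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"model": "image-gen", "messages": [{"role": "user", "content": "A watercolor fox"}]}' \
  | jq -r '.choices[0].message.images[0].image_url.url' \
  | sed -E 's|^data:image/[a-z]+;base64,||' \
  | base64 -d > out.png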

Vision input — base64 only

All cloud providers refuse to fetch image URLs server-side, and the local vLLM won't fetch them either. Inline your images as base64 data URIs:

B64=$(base64 -w 0 photo.jpg)
curl -s https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d "{
  \"model\": \"qwen3-vl-30b-instruct\",
  \"messages\": [{
    \"role\": \"user\",
    \"content\": [
      {\"type\": \"text\", \"text\": \"What is in this image?\"},
      {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$B64\"}}
    ]
  }]
}"

Sending an https://... URL gets you HTTP 400 with code: image_url_not_supported and a param pointing at the offending field.

Backend status

The backends array in /v1/info lists each routing destination with status:
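
For illustration only, a plausible entry; every field name here except status is an assumption, not confirmed by this page:

{
  "name": "openrouter",
  "status": "available"
}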

Only backends whose status is available are reachable at runtime; the rest are catalog metadata, listed for transparency.