Models & constraints
Each model alias is described in /v1/info with its modalities, capabilities,
licensing, backend routing chain, and behaviour constraints — the latter
being client-relevant quirks that the rust-api enforces before forwarding to
the upstream.
curl -s -H "Authorization: Bearer $BEARER" \
https://dgx-spark-4236.spass.fun/v1/info \
| jq '.models[] | {alias, family, context_window, tools, constraints}'
Model summary
| Alias | Family | Context | Tools | Vision | Reasoning | min_max_tokens | Typical latency (s) |
|---|---|---|---|---|---|---|---|
| llama-4-scout | Meta Llama 4 | 10 M | ✓ | ✓ | — | — | 8 |
| mistral-small-4 | Mistral Small 4 | 256 K | ✓ | ✓ | configurable | — | 5 |
| qwen3-vl-30b-thinking | Qwen3-VL | 131 K | ✓ | ✓ | ✓ | — | 12 |
| qwen3-vl-30b-instruct | Qwen3-VL | 131 K | ✓ | ✓ | — | — | 5 |
| gemma-4-31b | Google Gemma 4 | 256 K | ✓ | ✓ | — | — | 5 |
| flagship | composite | 1.05 M | ✓ | ✓ | ✓ | 200 | 8 |
| gpt-5.5-pro | OpenAI GPT-5.5 | 1.05 M | ✓ | ✓ | ✓ | 200 | 8 |
| claude-opus-4.7 | Anthropic Claude 4.7 | 1 M | ✓ | ✓ | ✓ | — | 5 |
| gemini-3.1-pro | Google Gemini 3.1 | 1.05 M | ✓ | ✓ + audio + video | ✓ | 200 | 5 |
| grok-4.20 | xAI Grok 4 | 2 M | ✓ | ✓ | ✓ | — | 3 |
| nano-banana | Google Gemini Image | 65 K | — | ✓ in / ✓ out | — | — | 15 |
| gpt-image | OpenAI GPT-Image | 272 K | — | ✓ in / ✓ out | ✓ | 16 | 150 |
| image-gen | composite (nano-banana → gpt-image) | 65 K | — | ✓ in / ✓ out | — | — | 15 |
The min_max_tokens column is the floor that rust-api silently applies when
your request specifies a smaller value. Reasoning models need ≥ 200 to leave
room for hidden reasoning tokens before any visible content; OpenAI-via-
OpenRouter refuses values below 16. Floored values are reported in the
response header x-rust-api-applied: max_tokens_floored=N.
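To see the floor being applied, send a deliberately small max_tokens and inspect the headers. A sketch: it assumes the usual OpenAI-style /v1/chat/completions path, which is not spelled out in the examples above.
curl -si -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.5-pro", "max_tokens": 50,
       "messages": [{"role": "user", "content": "hi"}]}' \
  | grep -i x-rust-api-applied
# expected (per the table: gpt-5.5-pro floors at 200):
# x-rust-api-applied: max_tokens_floored=200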
constraints schema
Every entry in /v1/info exposes a constraints object:
{
"min_max_tokens": 200,
"accepts_image_url": false,
"typical_response_seconds": 8
}
- min_max_tokens — silent auto-floor for max_tokens; null means no floor.
- accepts_image_url — when false (currently the case for every model), an image_url.url carrying an http(s):// URL is rejected with HTTP 400 and code: image_url_not_supported. Inline the image as a base64 data URI instead.
- typical_response_seconds — rough cold-inference latency hint for client-side timeout configuration. Cache hits are sub-second and not reflected here. (A sample query follows this list.)
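One way to consume these fields client-side. The 4× safety margin on the timeout is an arbitrary illustration, not a gateway recommendation:
curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | {alias,
        floor: .constraints.min_max_tokens,
        timeout_hint_s: (.constraints.typical_response_seconds * 4)}'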
Per-model notes
Local: llama-4-scout
The local backend runs vLLM 26.01-py3 with LMCache active. Cold first-token
latency is dominated by the queue plus the (often warm) prefix scan; a typical
short Q&A takes 8 s end-to-end through the tunnel, while repeated requests with
the same prefix collapse to 30-40 ms via LMCache.
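To watch the cache effect yourself (a sketch: the /v1/chat/completions path follows the OpenAI convention assumed throughout, and actual timings depend on queue depth and tunnel latency):
time curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-4-scout",
       "messages": [{"role": "user", "content": "What is LMCache?"}]}' \
  > /dev/null
# first run: ~8 s cold; an identical rerun hits the LMCache prefix and returns in 30-40 ms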
Fallback chain: local vLLM → Ollama Cloud → OpenRouter. If the local
container is down (/readyz shows vllm: unreachable), LiteLLM transparently
routes the next call to Ollama Cloud or OpenRouter without changing the
model field.
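A quick health probe. Only the vllm field is implied above; the rest of the payload, and whether /readyz requires the bearer token, are assumptions:
curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/readyz | jq .
# vllm: "unreachable" here means the next call is served by Ollama Cloud or OpenRouter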
Reasoning: gpt-5.5-pro, gemini-3.1-pro, flagship
These hide a "thinking" pass before producing any visible content. With small
max_tokens you get an empty content field and finish_reason: length.
The rust-api silently floors to 200; if you need long answers, set
max_tokens explicitly (e.g. 1500 for an essay-length reply).
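A sketch of detecting the failure mode (endpoint path per the OpenAI-style convention assumed above):
curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.5-pro", "max_tokens": 1500,
       "messages": [{"role": "user", "content": "Write a short essay on prefix caching."}]}' \
  | jq '{finish: .choices[0].finish_reason, text: .choices[0].message.content}'
# finish == "length" with empty text means the reasoning pass consumed the budget; raise max_tokens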
gemini-3.1-pro is the only flagship in this set that accepts audio and
video input via OpenRouter. See /v1/info for the modalities field.
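To check what an alias accepts before sending it audio or video (this assumes only that modalities is a per-model field, as stated above; its exact shape is whatever /v1/info returns):
curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.models[] | select(.alias == "gemini-3.1-pro") | .modalities'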
Image generation: nano-banana, gpt-image, image-gen
nano-banana (Google Gemini 3.1 Flash Image) is fast (~15 s) and cheap.
gpt-image (OpenAI GPT-5.4 + GPT-Image-2) is much slower (100-180 s) and
~5× more expensive but renders text-in-image and follows complex
instructions better.
The composite alias image-gen runs nano-banana first and falls back to
gpt-image only on hard failures, giving you the best of both.
Output is delivered as base64-encoded image data inside an OpenAI-style
chat completion — the bytes live in choices[0].message.images[0].image_url.url
as a data:image/...;base64,... string. Set client timeout ≥ 240 s when
calling gpt-image directly.
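An end-to-end sketch that decodes the returned image to a file. The .png extension is a guess; check the data URI's mime type, and note macOS spells the decode flag -D:
curl -s --max-time 240 -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "image-gen",
       "messages": [{"role": "user", "content": "a watercolor fox"}]}' \
  | jq -r '.choices[0].message.images[0].image_url.url' \
  | cut -d, -f2 | base64 -d > out.png
# cut -d, -f2 drops the data:image/...;base64, prefix, leaving raw base64 for decoding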
Vision input — base64 only
All cloud providers refuse to fetch image URLs server-side, and the local vLLM
doesn't fetch them either. Inline images as base64 data URIs:
B64=$(base64 -w 0 photo.jpg)
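# -w 0 disables line wrapping (GNU coreutils); on macOS use: base64 -i photo.jpg | tr -d '\n'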
curl ... -d "{
\"model\": \"qwen3-vl-30b-instruct\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"text\", \"text\": \"What is in this image?\"},
{\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$B64\"}}
]
}]
}"
Sending an https://... URL gets you HTTP 400 with code: image_url_not_supported
and a param pointing at the offending field.
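The error body looks roughly like this (a sketch: only code and param are documented above; the envelope shape and the exact param path are illustrative):
{
  "error": {
    "code": "image_url_not_supported",
    "param": "messages[0].content[1].image_url.url",
    "message": "..."
  }
}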
Backend status
The backends array in /v1/info lists each routing destination with status:
- available — actively used by LiteLLM.
- deprecated — known to be retired (e.g. NVIDIA NIM endpoints).
- unconfigured — supported in principle, but missing an API key or config.
- planned — in scope but not wired up yet.
Only available backends are reachable at runtime; the rest are catalog
metadata for transparency.
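To list only what is reachable right now (this assumes nothing beyond the documented status field):
curl -s -H "Authorization: Bearer $BEARER" \
  https://dgx-spark-4236.spass.fun/v1/info \
  | jq '.backends[] | select(.status == "available")'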