/v1/embeddings — OpenAI-compatible embeddings
OpenAI-spec embeddings endpoint backed by nomic-embed-text-v1.5 (dim=768, 8192-token context, Apache-2.0). Requests are routed via LiteLLM to a dedicated vLLM pooling runner on Station GPU 3 (the display card), with a profile-gated Spark1 fallback.
Cut 2.23c (ADR 0016): the SPASS-User-Id header is mandatory. Add -H "SPASS-User-Id: $USER_ID" to every embeddings request alongside the bearer. Body and query user_id are rejected with HTTP 400 invalid_field.
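As a minimal sketch of the header rule above, the following Python helper assembles the headers and JSON body for an embeddings call and keeps the user id out of the body. The function name `build_embeddings_request` and its parameters are hypothetical and only illustrate the rule; they are not part of the API.

```python
import json


def build_embeddings_request(token: str, user_id: str, inputs, model: str = "nomic-embed"):
    """Return (headers, body_bytes) for POST /v1/embeddings.

    Hypothetical helper: the user id travels only in the SPASS-User-Id
    header, never as a user_id field in the body or query string.
    """
    if not user_id:
        raise ValueError("SPASS-User-Id is mandatory")
    headers = {
        "Authorization": f"Bearer {token}",
        "SPASS-User-Id": user_id,
        "Content-Type": "application/json",
    }
    # Deliberately no user_id key: the text above says it is rejected with HTTP 400.
    body = {"model": model, "input": inputs}
    return headers, json.dumps(body).encode("utf-8")
```

The returned pair can be handed to any HTTP client; failing early on an empty user id avoids a round trip that would be rejected anyway.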
Surface
POST /v1/embeddings
Authorization: Bearer <token>
SPASS-User-Id: <user-id>
Content-Type: application/json
{
"model": "nomic-embed",
"input": "Hello world" // string OR array of strings
}
{
"object": "list",
"data": [{
"object": "embedding",
"index": 0,
"embedding": [0.123, -0.456, ...] // 768 floats
}],
"model": "nomic-embed-text-v1.5",
"usage": {"prompt_tokens": 2, "total_tokens": 2}
}
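A client will usually want the vectors back in input order and may want to confirm the 768-float dimension shown above. The sketch below does both on an already-decoded response; `extract_vectors` is a hypothetical helper name, and sorting by `index` is a defensive choice rather than something the endpoint is stated to require.

```python
EXPECTED_DIM = 768  # nomic-embed-text-v1.5 output size, per the text above


def extract_vectors(response: dict, expected_dim: int = EXPECTED_DIM):
    """Return the embedding vectors ordered by their index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(f"vector {i} has {len(vec)} dims, expected {expected_dim}")
    return vectors
```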
Allowed model slugs
| Slug | Tier | Routing |
|---|---|---|
| nomic-embed | T1 | LiteLLM fallback chain: Station → Spark1 |
| nomic-embed-local | T2 | Station only (operator-direct) |
Layer-1 caps (rust-api defense)
The /v1/embeddings handler enforces:
| Cap | Value | Error code |
|---|---|---|
| array elements | 256 | embedding_input_too_large |
| per-string bytes | 8192 (~2k tokens) | embedding_input_too_large |
These caps match the underlying vLLM pooling runner's --max-model-len=2048. Callers should chunk longer documents before embedding (typical RAG chunk size: 512–1024 tokens).
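The caps above can be checked on the client before sending, which avoids an embedding_input_too_large round trip. The following is a minimal sketch: `batch_inputs` is a hypothetical helper, and it assumes the byte cap is measured on the UTF-8 encoding and that a string of exactly 8192 bytes is still accepted.

```python
MAX_ITEMS = 256   # array-element cap from the table above
MAX_BYTES = 8192  # per-string byte cap from the table above


def batch_inputs(texts, max_items: int = MAX_ITEMS, max_bytes: int = MAX_BYTES):
    """Split texts into request-sized batches, rejecting oversized strings.

    Assumption: the byte cap applies to the UTF-8 encoded length.
    """
    texts = list(texts)
    for i, text in enumerate(texts):
        size = len(text.encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"input {i} is {size} bytes, cap is {max_bytes}; chunk it first")
    # Each slice holds at most max_items strings, so each fits in one request.
    return [texts[i:i + max_items] for i in range(0, len(texts), max_items)]
```

Note that non-ASCII text reaches the byte cap sooner than its character count suggests, so the length is measured after encoding rather than with `len(text)`.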
SPASS-* response headers
All standard cost/audit/correlation headers apply. For local backends:
spass-resolved-model: nomic-embed
spass-resolved-backend: local-embed
spass-cost-source: zero
spass-cost-eur: 0.00
spass-cost-usd: 0.00
spass-cost-available: true
spass-cost-exchange-rate: 0.8546
spass-cost-exchange-rate-source: ecb-2026-05-03
spass-request-id: <uuid>
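For logging or cost accounting, a client can pull these headers out of the response. The sketch below works on any mapping of header names to string values; `read_cost_headers` and the keys of the returned dict are hypothetical names chosen for illustration, and the lookup is case-insensitive because HTTP header names are.

```python
def read_cost_headers(headers) -> dict:
    """Pick the cost-related spass-* headers out of a response header mapping."""
    lowered = {str(k).lower(): str(v).strip() for k, v in dict(headers).items()}

    def as_float(name):
        # Missing headers map to None instead of raising.
        value = lowered.get(name)
        return float(value) if value is not None else None

    return {
        "model": lowered.get("spass-resolved-model"),
        "backend": lowered.get("spass-resolved-backend"),
        "cost_source": lowered.get("spass-cost-source"),
        "cost_available": lowered.get("spass-cost-available") == "true",
        "cost_eur": as_float("spass-cost-eur"),
        "cost_usd": as_float("spass-cost-usd"),
        "request_id": lowered.get("spass-request-id"),
    }
```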
Quirk workaround — encoding_format
vLLM 0.19's strict pydantic validation rejects encoding_format: null (the literal must be "float", "base64", "bytes", or "bytes_only"). The /v1/embeddings handler auto-injects encoding_format: "float" when the caller omits it; requests that set the field explicitly are passed through unchanged.
This is why internal RAG self-calls go through /v1/embeddings rather than calling LiteLLM directly: LiteLLM does not strip nullable params, so the auto-inject is needed.
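The injection rule described above can be modelled in a few lines. This is an illustration in Python, not the handler's actual source, and `inject_encoding_format` is a hypothetical name. It assumes that "explicitly sent" includes an explicit null, which would then be forwarded unchanged and rejected by the backend.

```python
def inject_encoding_format(body: dict) -> dict:
    """Return a copy of the request body with encoding_format defaulted.

    Only an absent key is filled in; any value the caller supplied,
    including None, is left untouched (assumption stated in the lead-in).
    """
    out = dict(body)
    if "encoding_format" not in out:
        out["encoding_format"] = "float"
    return out
```

Working on a copy keeps the caller's original body unmodified, which makes the behaviour easy to test in isolation.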
Quick test
# USER_ID must be set: the SPASS-User-Id header is mandatory (Cut 2.23c).
# cut -f2- keeps the whole value even if the token itself contains '='.
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2-)"
curl -s -X POST http://localhost:3000/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" \
  -H 'Content-Type: application/json' \
  -d '{"model":"nomic-embed","input":["Hello world","Berlin ist die Hauptstadt"]}' \
  | jq '{model, count: (.data|length), dim: (.data[0].embedding|length), usage}'
Use in /a1 RAG
The same nomic-embed model is the embedding backbone for /a1/rag/indices/... (Cut 2.4 + 2.4.b + 2.9). See /docs/rag for the index management surface and /docs/agents for using indices in agent configs.