DGX LLM Chat Gateway

/v1/embeddings — OpenAI-compatible embeddings

OpenAI-spec embeddings endpoint backed by nomic-embed-text-v1.5 (dim=768, 8192-token context, Apache-2.0). Requests are routed via LiteLLM to a dedicated vLLM pooling runner on Station GPU 3 (the display card), with a profile-gated Spark1 fallback.

Cut 2.23c (ADR 0016): the SPASS-User-Id header is mandatory. Add -H "SPASS-User-Id: $USER_ID" to every embeddings request alongside the bearer token. A user_id passed in the body or the query string is rejected with HTTP 400 invalid_field.

Surface

POST /v1/embeddings
Authorization: Bearer <token>
SPASS-User-Id: <user-id>
Content-Type: application/json

{
  "model": "nomic-embed",
  "input": "Hello world"            // string OR array of strings
}

Response:

{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.123, -0.456, ...]   // 768 floats
  }],
  "model": "nomic-embed-text-v1.5",
  "usage": {"prompt_tokens": 2, "total_tokens": 2}
}
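
The response shape above can be consumed with a few lines of client code. The following is a minimal Python sketch, not part of the gateway: the function name `extract_vectors` and the constant `EXPECTED_DIM` are hypothetical, and the dimension check simply reuses the 768 figure stated on this page.

```python
import json

EXPECTED_DIM = 768  # dimension of nomic-embed-text-v1.5 as stated on this page


def extract_vectors(response_text):
    """Return the embeddings from a /v1/embeddings response, ordered by index."""
    payload = json.loads(response_text)
    # Sort by "index" so the vectors line up with the order of the input array.
    items = sorted(payload["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for position, vector in enumerate(vectors):
        if len(vector) != EXPECTED_DIM:
            raise ValueError(
                f"embedding {position} has {len(vector)} dims, expected {EXPECTED_DIM}"
            )
    return vectors
```

Sorting by index is a defensive choice for array inputs; the sketch does not assume the server returns items in input order.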

Allowed model slugs

Slug               Tier  Routing
nomic-embed        T1    LiteLLM fallback chain: Station → Spark1
nomic-embed-local  T2    Station only (operator-direct)

Layer-1 caps (rust-api defense)

The /v1/embeddings handler enforces:

Cap               Value              Error code
array elements    256                embedding_input_too_large
per-string bytes  8192 (~2k tokens)  embedding_input_too_large

These caps match the underlying vLLM pooling runner's --max-model-len=2048, which is lower than the model's native 8192-token context. Callers should chunk longer documents before embedding (typical RAG chunk size: 512–1024 tokens).
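
A client can stay under both caps by splitting long text and batching the resulting strings before calling the endpoint. The sketch below is one possible approach in Python, not the gateway's own logic. The helper names are hypothetical, the split is byte-based on whitespace (it does not count tokens and it collapses runs of whitespace), and it assumes the caps are inclusive, meaning exactly 256 elements or 8192 bytes is still accepted.

```python
MAX_ARRAY_ELEMENTS = 256  # "array elements" cap from the table above
MAX_STRING_BYTES = 8192   # "per-string bytes" cap from the table above


def split_by_bytes(text, max_bytes=MAX_STRING_BYTES):
    """Split text on whitespace into pieces of at most max_bytes UTF-8 bytes."""
    pieces, current = [], ""
    for word in text.split():
        if len(word.encode("utf-8")) > max_bytes:
            raise ValueError("a single word exceeds the per-string byte cap")
        candidate = word if not current else current + " " + word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def batched(strings, size=MAX_ARRAY_ELEMENTS):
    """Yield consecutive slices of at most `size` strings, one per request."""
    for start in range(0, len(strings), size):
        yield strings[start:start + size]
```

Measuring UTF-8 bytes rather than characters matters for non-ASCII text: a string of 5000 umlauts is 10000 bytes and would exceed the cap even though it has fewer than 8192 characters.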

SPASS-* response headers

All standard cost/audit/correlation headers apply. For local backends:

spass-resolved-model: nomic-embed
spass-resolved-backend: local-embed
spass-cost-source: zero
spass-cost-eur: 0.00
spass-cost-usd: 0.00
spass-cost-available: true
spass-cost-exchange-rate: 0.8546
spass-cost-exchange-rate-source: ecb-2026-05-03
spass-request-id: <uuid>
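
For clients that log or aggregate cost data, the headers listed above can be read into a small record. This is a hedged Python sketch: `parse_cost_headers` is a hypothetical helper, it only covers the headers shown in this section, and it assumes spass-cost-available carries the literal string "true" when costs are present.

```python
from decimal import Decimal


def parse_cost_headers(headers):
    """Extract the SPASS cost/audit headers shown above from a header mapping."""
    # HTTP header names are case-insensitive, so normalise to lower case first.
    lowered = {name.lower(): value for name, value in headers.items()}

    def money(name):
        value = lowered.get(name)
        return Decimal(value) if value is not None else None

    return {
        "model": lowered.get("spass-resolved-model"),
        "backend": lowered.get("spass-resolved-backend"),
        "cost_source": lowered.get("spass-cost-source"),
        "cost_available": lowered.get("spass-cost-available") == "true",
        "cost_eur": money("spass-cost-eur"),
        "cost_usd": money("spass-cost-usd"),
        "request_id": lowered.get("spass-request-id"),
    }
```

Decimal is used instead of float so that summed costs do not pick up binary rounding error.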

Quirk workaround — encoding_format

vLLM 0.19 validates requests with strict pydantic models and rejects encoding_format: null (the value must be one of the literals "float", "base64", "bytes", or "bytes_only"). The /v1/embeddings handler therefore injects encoding_format: "float" when the caller omits the field. Requests that set the field explicitly are passed through unchanged.

This is why internal RAG self-calls go through /v1/embeddings rather than calling LiteLLM directly: LiteLLM does not strip nullable params, so the auto-inject is needed.
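
The injection rule described above fits in a few lines. The following Python sketch only illustrates the behavior. The real handler lives in rust-api and its code is not shown on this page, so the function name and the copy-instead-of-mutate choice are assumptions.

```python
def with_default_encoding_format(body):
    """Return a request body that always carries an encoding_format field."""
    if "encoding_format" in body:
        # The caller set the field explicitly: pass it through unchanged.
        return body
    patched = dict(body)  # shallow copy so the caller's dict is not mutated
    patched["encoding_format"] = "float"
    return patched
```

The sketch treats only a missing key as "omitted". How the handler treats an explicit null from the caller is not specified here, so that case is deliberately left as pass-through.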

Quick test

# -f2- keeps the whole value even if the token itself contains '='
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2-)"
# USER_ID must already hold your SPASS user id (mandatory header, ADR 0016)
curl -s -X POST http://localhost:3000/v1/embeddings \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" \
  -H 'Content-Type: application/json' \
  -d '{"model":"nomic-embed","input":["Hello world","Berlin ist die Hauptstadt"]}' \
  | jq '{model, count: (.data|length), dim: (.data[0].embedding|length), usage}'
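
For callers that prefer a script over curl, the same request can be assembled with the Python standard library. This is a sketch under assumptions: `build_embeddings_request` is a hypothetical helper, the base URL is the localhost address used in the curl call, and the SPASS-User-Id header is included because this page declares it mandatory.

```python
import json
import urllib.request


def build_embeddings_request(base_url, token, user_id, inputs, model="nomic-embed"):
    """Build (but do not send) a POST /v1/embeddings request."""
    body = {"model": model, "input": inputs}  # no user_id here: it belongs in the header
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "SPASS-User-Id": user_id,
        },
    )


# Sending it requires a running gateway, for example:
# with urllib.request.urlopen(build_embeddings_request(
#         "http://localhost:3000", token, user_id, ["Hello world"])) as resp:
#     payload = json.load(resp)
```

Keeping the build step separate from the send step makes the header set easy to inspect in a unit test without network access.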

Use in /a1 RAG

The same nomic-embed model is the embedding backbone for /a1/rag/indices/... (Cut 2.4 + 2.4.b + 2.9). See /docs/rag for the index management surface and /docs/agents for using indices in agent configs.