DGX LLM Chat Gateway

Streaming (SSE)

Both /v1/chat/completions and /c1/chat stream tokens via Server-Sent Events when you set "stream": true.

Cut 2.23c (ADR 0016): the SPASS-User-Id header is mandatory on user-scoped endpoints. Add -H "SPASS-User-Id: $USER_ID" to every curl call below, in addition to the bearer token. A user_id passed in the body or the query string is rejected with HTTP 400 invalid_field.
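For Python clients, the same header requirement can be captured in a small helper. This is a minimal sketch: the function name build_headers and the example values are hypothetical; only the header names come from this document.

```python
def build_headers(bearer: str, user_id: str) -> dict[str, str]:
    """Return request headers for a user-scoped endpoint.

    The user id travels only in the SPASS-User-Id header; per Cut 2.23c
    it must not be sent in the body or the query string.
    """
    if not user_id:
        raise ValueError("user_id must be non-empty")
    return {
        "Authorization": f"Bearer {bearer}",
        "SPASS-User-Id": user_id,
        "Content-Type": "application/json",
    }


# Hypothetical values, for illustration only.
headers = build_headers("example-token", "user-123")
print(headers["SPASS-User-Id"])
```

The resulting dict can be passed as headers= to the httpx calls shown further down.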

The stream-mode tool loop (Cut 2.21+) now runs server-side: /c1/chat and /v1/chat/completions both run tools internally and emit a synth-SSE stream that includes an event: spass.cost trailer (Cut 2.21b) before [DONE].

The wire format is the OpenAI streaming shape:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}

data: [DONE]

Each event is a single line prefixed with data: and terminated by a blank line. The final event payload [DONE] is the sentinel for stream end.
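The framing rules above can be exercised offline on the sample events. The sketch below is illustrative only: collect_content is a hypothetical helper name and the id "chatcmpl-1" is a placeholder. It handles plain data: events and does not deal with the named events described later in this document.

```python
import json


def collect_content(lines):
    """Concatenate delta.content from SSE lines until the [DONE] sentinel."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank separator lines and any non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel: the stream is complete
        chunk = json.loads(payload)
        text = chunk["choices"][0]["delta"].get("content")
        if text:
            parts.append(text)
    return "".join(parts)


sample = [
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}',
    "",
    "data: [DONE]",
    "",
]
print(collect_content(sample))  # prints: Hello world
```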

curl streaming demo

curl -N -s https://dgx.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "messages": [{"role": "user", "content": "Count slowly to ten."}],
    "stream": true,
    "max_tokens": 200
  }'

-N disables curl's response buffering — without it you'd see the full output in one chunk at the end.

Python SSE consumer

import json
import httpx

with httpx.stream(
    "POST",
    "https://dgx.spass.fun/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {BEARER}",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama-4-scout",
        "messages": [{"role": "user", "content": "Tell me a one-paragraph story."}],
        "stream": True,
        "max_tokens": 400,
    },
    timeout=120.0,
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices")
        if not choices:
            continue  # payload of a named event such as spass.cost
        delta = choices[0]["delta"]
        if (text := delta.get("content")):
            print(text, end="", flush=True)
print()

Tool calls in streaming

When the model decides to call a tool, the delta switches to tool_calls chunks. Concatenate tool_calls[*].function.arguments across chunks to reconstruct the final argument JSON:

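import json

# `lines` is the line iterator from the consumer above, e.g. r.iter_lines().
buf = {"name": None, "args": ""}
for line in lines:
    if not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    choices = json.loads(data).get("choices")
    if not choices:
        continue  # payload of a named event, not a completion chunk
    for tc in choices[0]["delta"].get("tool_calls") or []:
        fn = tc.get("function") or {}
        if fn.get("name"):
            buf["name"] = fn["name"]
        if fn.get("arguments"):
            buf["args"] += fn["arguments"]

# Once the stream ends, parse the accumulated argument JSON.
call_args = json.loads(buf["args"]) if buf["args"] else {}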
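A single buffer is enough for one tool call. If a response carries several calls, the fragments have to be kept apart. The sketch below assumes the gateway follows the OpenAI convention of tagging each tool_calls fragment with an index field, which this document does not confirm; accumulate_tool_calls and the get_weather tool are hypothetical names used for illustration.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed tool_calls fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        for tc in delta.get("tool_calls") or []:
            # Assumption: fragments of the same call share an "index" value.
            slot = calls.setdefault(tc.get("index", 0), {"name": None, "args": ""})
            fn = tc.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]
            if fn.get("arguments"):
                slot["args"] += fn["arguments"]
    # Parse each argument string only after all fragments have arrived.
    return {
        i: {"name": c["name"], "arguments": json.loads(c["args"] or "{}")}
        for i, c in calls.items()
    }


# Two fragments of one hypothetical call, followed by an unrelated delta.
deltas = [
    {"tool_calls": [{"index": 0, "function": {"name": "get_weather", "arguments": '{"ci'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": 'ty": "Berlin"}'}}]},
    {"content": None},
]
print(accumulate_tool_calls(deltas))
```

Parsing is deferred until the end because an individual fragment is usually not valid JSON on its own.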

Named SSE events (Cut 2.32, CR-0001)

In addition to the OpenAI-spec data: … lines, the server has emitted two named SSE events since Cut 2.32, ahead of the final content + [DONE], whenever the tool loop had to take a synthesis path:

event: spass.cost
data: {"prompt_tokens":1234,"completion_tokens":567,"total_tokens":1801}

event: spass.tool-cap          ← MAX_TOOL_ITERATIONS=10 reached
data: {"code":"tool_loop_max_iterations","iterations":10,"synth_called":true}

event: spass.tool-stripped     ← Cut 2.36b: Llama hallucination stripped from the content (CR-0007 stream fix)
data: {"code":"hallucinated_tool_stripped_all_unknown","tool_names":["translate_text"]}

event: spass.tool-feedback-recovery   ← Cut 2.39: multi-turn feedback restored a clean response
data: {"code":"tool_feedback_recovery_after_all_unknown","retries_used":1,"stripped_names":["translate_text"]}

data: {"choices":[{"delta":{"content":"... narrative answer ..."}}]}

data: [DONE]

On anti-loop detection, event: spass.tool-anti-loop is emitted instead of spass.tool-cap. The payload shape is identical. Callers that do not parse named events still see only the data: stream; the named events are a purely diagnostic layer and do not break any existing SSE consumer pattern. See errors.md for the corresponding dgx_code values and examples-tools.md for the tool-loop semantics.
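A client that wants the diagnostics has to pair each event: line with the data: line that follows it. The parser below is a minimal sketch of standard SSE field handling, not the gateway's reference client; parse_sse is a hypothetical name and the sample is a shortened version of the stream shown above.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs; event_name is None for plain data: events."""
    event, data = None, []
    for line in lines:
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)


sample = [
    "event: spass.cost",
    'data: {"prompt_tokens":1234,"completion_tokens":567,"total_tokens":1801}',
    "",
    'data: {"choices":[{"delta":{"content":"Hi"}}]}',
    "",
    "data: [DONE]",
    "",
]

diagnostics, text = {}, ""
for name, payload in parse_sse(sample):
    if payload == "[DONE]":
        break
    if name is not None:
        diagnostics[name] = json.loads(payload)  # named diagnostic event
    else:
        text += json.loads(payload)["choices"][0]["delta"].get("content") or ""

print(text, diagnostics["spass.cost"]["total_tokens"])  # prints: Hi 1801
```

Routing on the event name keeps the diagnostic payloads away from code that expects the OpenAI chunk shape.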

Streaming on /c1/chat

The same SSE shape is emitted by /c1. The conversation id is not in the SSE body — it's in the response headers (x-conversation-id). Capture it before consuming the stream:

with httpx.stream("POST", ".../c1/chat", json=payload, headers=hdrs) as r:
    conv_id = r.headers.get("x-conversation-id")
    for line in r.iter_lines():
        ...

What does not stream

Backend signaling

The x-rust-api-applied header is only set on the final HTTP response. Because an SSE stream is a single response with a chunked body, you see the header once, at the start of the stream.

If something fails mid-stream (e.g. upstream connection drop), the error arrives outside the OpenAI envelope as a normal HTTP 502 with the standard error body — clients should handle both an SSE event stream and a non-SSE error JSON.
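One way to handle both cases is to inspect the status code and Content-Type before reading the body. The sketch below assumes the gateway labels streams with the standard SSE media type text/event-stream, which this document does not state; classify_response is a hypothetical helper, and the shape of the error body is deliberately not parsed here.

```python
def is_event_stream(content_type: str) -> bool:
    """True when the Content-Type header announces an SSE body."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/event-stream"


def classify_response(status_code: int, content_type: str) -> str:
    """Return "stream" for a successful SSE response, otherwise "error"."""
    if 200 <= status_code < 300 and is_event_stream(content_type):
        return "stream"
    return "error"


print(classify_response(200, "text/event-stream; charset=utf-8"))
print(classify_response(502, "application/json"))
```

With httpx, the inputs would be r.status_code and r.headers.get("content-type", ""). On "error", read the whole body and decode it as JSON instead of iterating over lines.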