DGX LLM Chat Gateway

Streaming (SSE)

Both /v1/chat/completions and /c1/chat stream tokens via Server-Sent Events when you set "stream": true.

The wire format is the OpenAI streaming shape:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}

data: [DONE]

Each event is a single line prefixed with data: and terminated by a blank line. The final event carries the literal payload [DONE], the end-of-stream sentinel.
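
Parsing this framing is mechanical. A minimal helper, sketched here (sse_payload is our name, not part of any client library):

def sse_payload(line: str) -> str | None:
    # Blank separators and ":"-prefixed keep-alive comments carry no payload.
    if line.startswith("data: "):
        return line[len("data: "):]
    return None

A returned payload of [DONE] marks stream end; anything else is a JSON chunk.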

curl streaming demo

curl -N -s https://dgx-spark-4236.spass.fun/v1/chat/completions \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "messages": [{"role": "user", "content": "Count slowly to ten."}],
    "stream": true,
    "max_tokens": 200
  }'

-N disables curl's output buffering, so tokens print as they arrive; without it you'd see the full output in one chunk at the end. -s silences the progress meter.

Python SSE consumer

import json
import os

import httpx

BEARER = os.environ["BEARER"]  # same token as in the curl demo

with httpx.stream(
    "POST",
    "https://dgx-spark-4236.spass.fun/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {BEARER}",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama-4-scout",
        "messages": [{"role": "user", "content": "Tell me a one-paragraph story."}],
        "stream": True,
        "max_tokens": 400,
    },
    timeout=120.0,
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if (text := delta.get("content")):
            print(text, end="", flush=True)
print()
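
To keep the full completion alongside the live output, accumulate the deltas and join them at the end. In the OpenAI chunk shape the final content chunk also carries choices[0].finish_reason ("stop", "length", or "tool_calls"); assuming the gateway mirrors this, it tells you whether a tool-call turn follows:

parts = []
reason = None
for line in r.iter_lines():  # same framing as above
    if not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    choice = chunk["choices"][0]
    reason = choice.get("finish_reason") or reason
    if (text := choice["delta"].get("content")):
        parts.append(text)
full_text = "".join(parts)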

Tool calls in streaming

When the model decides to call a tool, the delta switches to tool_calls chunks. Concatenate tool_calls[*].function.arguments across chunks to reconstruct the final argument JSON:

buf = {"id": None, "name": None, "args": ""}
for line in r.iter_lines():  # same stream loop as above
    if not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    for tc in delta.get("tool_calls") or []:
        fn = tc.get("function", {})
        if tc.get("id"):
            buf["id"] = tc["id"]  # sent once, in the first tool_calls chunk
        if fn.get("name"):
            buf["name"] = fn["name"]
        if fn.get("arguments"):
            buf["args"] += fn["arguments"]

# Once the stream ends:
call_args = json.loads(buf["args"])
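
From here the standard OpenAI tool loop applies, assuming the gateway follows it: run the function locally, then send the result back as a "tool" message on the next request. get_weather is a hypothetical handler, not a gateway builtin:

def get_weather(args: dict) -> dict:
    # Hypothetical local implementation of the tool.
    return {"temp_c": 21, "sky": "clear"}

result = get_weather(call_args)
# messages is the running conversation list from the original request body.
messages += [
    {"role": "assistant", "tool_calls": [{
        "id": buf["id"],
        "type": "function",
        "function": {"name": buf["name"], "arguments": buf["args"]},
    }]},
    {"role": "tool", "tool_call_id": buf["id"], "content": json.dumps(result)},
]
# POST /v1/chat/completions again with the extended messages list.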

Streaming on /c1/chat

The same SSE shape is emitted by /c1/chat. The conversation id is not in the SSE body; it arrives in the response headers (x-conversation-id). Capture it before consuming the stream:

with httpx.stream("POST", ".../c1/chat", json=payload, headers=hdrs) as r:
    conv_id = r.headers.get("x-conversation-id")  # available before any body bytes
    for line in r.iter_lines():
        ...  # consume SSE exactly as above
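
To continue the conversation, send the captured id back on the next turn. How the gateway expects it (request header versus body field) isn't shown in this section; the header form below is an assumption:

# Assumption: the gateway accepts the id back via the same header name.
hdrs["x-conversation-id"] = conv_id
with httpx.stream("POST", ".../c1/chat", json=next_payload, headers=hdrs) as r:
    ...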

What does not stream

Two things arrive outside the event stream itself: backend-signaling headers and error bodies.

Backend signaling

The x-rust-api-applied header is set on the HTTP response itself, not on individual events. Since SSE is a single response with a chunked body, you'll see the header once, at the start of the stream.

If something fails mid-stream (e.g. an upstream connection drop), the error arrives outside the OpenAI envelope as a plain HTTP 502 with the standard error body. Clients should therefore be prepared for both an SSE event stream and a non-SSE error JSON.
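
A defensive consumer branches on the response's content type before parsing anything, and reads the signaling headers up front. A sketch:

with httpx.stream("POST", url, json=payload, headers=hdrs, timeout=120.0) as r:
    applied = r.headers.get("x-rust-api-applied")  # present once, at stream start
    if "text/event-stream" not in r.headers.get("content-type", ""):
        # Non-SSE path: e.g. a 502 carrying the standard error JSON.
        r.read()  # load the body; .json() needs it on a streamed response
        raise RuntimeError(f"gateway error {r.status_code}: {r.json()}")
    for line in r.iter_lines():
        ...  # SSE path: parse data: lines as shown earlier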