DGX LLM Chat Gateway

Conversation persistence (/c1)

/c1/chat differs from /v1/chat/completions in three ways:

  1. Server keeps history. Pass conversation_id and the server prepends prior turns automatically.
  2. Different request schema. A single message field (string or content-parts array) replaces the full messages array. tools, tool_choice, and response_format are required (use [], "auto", and {"type": "text"} as the no-op defaults).
  3. provider selector. Switch backend without changing model slug: "provider": "ollama" resolves to <model>-ollama.
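
The required no-op defaults from item 2 are easy to forget. A minimal sketch of a shell helper that fills them in, assuming a POSIX shell; c1_body and json_escape are hypothetical names, not part of the gateway, and the escaping covers only backslashes and double quotes (prefer jq for arbitrary input):

```shell
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Newlines and control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a /c1/chat body with the documented no-op defaults.
# $1 = model, $2 = message, $3 = conversation_id (optional)
c1_body() (
  model=$(json_escape "$1")
  message=$(json_escape "$2")
  conv=""
  if [ -n "${3:-}" ]; then
    conv="\"conversation_id\": \"$(json_escape "$3")\", "
  fi
  printf '{"model": "%s", %s"message": "%s", "tools": [], "tool_choice": "auto", "response_format": {"type": "text"}, "stream": false}\n' \
    "$model" "$conv" "$message"
)
```

The output can be passed straight to curl, for example with -d "$(c1_body llama-4-scout "What is my favourite colour?" "$CONV")".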

Two-turn memory

# Turn 1 — let the server allocate a fresh conversation:
RESP1=$(curl -is -H "Authorization: Bearer $BEARER" \
                -H "Content-Type: application/json" \
                -d '{
                  "model": "llama-4-scout",
                  "message": "My favourite colour is teal. Remember that.",
                  "tools": [],
                  "tool_choice": "auto",
                  "response_format": {"type": "text"},
                  "stream": false,
                  "max_tokens": 60
                }' \
                https://dgx.spass.fun/c1/chat)

# Pull the conversation id from the response header
CONV=$(echo "$RESP1" | tr -d '\r' | grep -i "^x-conversation-id:" | awk '{print $2}')
echo "conversation: $CONV"

# Turn 2 — recall on the same conversation
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d "{
    \"model\": \"llama-4-scout\",
    \"conversation_id\": \"$CONV\",
    \"message\": \"What is my favourite colour?\",
    \"tools\": [],
    \"tool_choice\": \"auto\",
    \"response_format\": {\"type\": \"text\"},
    \"stream\": false,
    \"max_tokens\": 40
  }" \
  https://dgx.spass.fun/c1/chat \
  | jq -r '.choices[0].message.content'

The model answers teal — the server reloaded turn 1 from SQLite and prepended it to the prompt automatically. LMCache catches the prefix on the upstream side, so the second turn returns in tens of milliseconds with a cache hit.
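
The header extraction from turn 1 can be wrapped in a small function so that it is reusable and can be checked offline. A sketch, assuming the response was captured with curl -i as above; conv_id_from_response is a hypothetical name:

```shell
# Hypothetical helper: read a raw `curl -i` response on stdin and print the
# value of the x-conversation-id header (matched case-insensitively).
# Exits non-zero when the header is absent.
conv_id_from_response() {
  tr -d '\r' | awk '
    tolower($0) ~ /^x-conversation-id:/ {
      sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading blanks
      sub(/[ \t]+$/, "")         # drop trailing blanks
      print; found = 1; exit
    }
    END { exit(found ? 0 : 1) }
  '
}
```

Usage: CONV=$(printf '%s\n' "$RESP1" | conv_id_from_response) || echo "no conversation id in response" >&2. Unlike the awk '{print $2}' one-liner, this also copes with a header that has no space after the colon.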

Listing and deleting conversations

# Paginated list (newest first)
curl -s -H "Authorization: Bearer $BEARER" \
  "https://dgx.spass.fun/c1/conversations?limit=10&offset=0" \
  | jq '.items[] | {id, message_count, model, last_used_at}'

# Pull one conversation in full
curl -s -H "Authorization: Bearer $BEARER" \
  "https://dgx.spass.fun/c1/conversations/$CONV" \
  | jq

# Delete (soft — both user and assistant turns)
curl -X DELETE -H "Authorization: Bearer $BEARER" \
  "https://dgx.spass.fun/c1/conversations/$CONV"

Provider switch via provider enum

If you want to test the same model across backends:

# Force Ollama Cloud (skip local vLLM, skip OpenRouter)
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "provider": "ollama",
    "message": "What is 2+2?",
    "tools": [], "tool_choice": "auto",
    "response_format": {"type": "text"},
    "stream": false, "max_tokens": 40
  }' \
  https://dgx.spass.fun/c1/chat

provider resolves to <model>-<provider> (e.g. llama-4-scout-ollama). If model is already a provider-specific slug and provider is also set, model wins: the explicit slug beats the convenience shortcut.
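
The mapping can be mirrored client-side, for example to log which backend slug a request is expected to hit. A sketch of the documented <model>-<provider> rule only; resolve_slug is a hypothetical helper, and the authoritative resolution (including precedence) happens on the server:

```shell
# Hypothetical helper: mirror the documented mapping <model>-<provider>.
# $1 = model slug, $2 = provider (optional)
resolve_slug() {
  if [ -n "${2:-}" ]; then
    printf '%s-%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}
```

For example, resolve_slug llama-4-scout ollama prints llama-4-scout-ollama, and resolve_slug llama-4-scout prints the model slug unchanged.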

System prompt hybrid schema (Cut 2.33, CR-0003)

Recommendation for the GoCreate-style "global tenant prompt" use case: do NOT send a system_prompt_ref field with every request. Instead, set the prompt once via POST /v1/system-prompts/tenant. The auto-inject pipeline then applies it to all subsequent calls without any frontend action. The three body fields described here are intended primarily for per-conversation overrides when the global prompt does not fit.

Cut 2.33 extends POST /c1/chat with three mutually exclusive system prompt fields (only one may be set per request; otherwise HTTP 400 invalid_field):

Field                     Behavior
system_prompt             Inline text. Persisted on the first turn as a role=system row.
system_prompt_ref         Reference to a named per-tenant prompt (POST /v1/system-prompts/<name> at scope level). The server resolves it and persists the text inline.
additional_system_prompt  Append-style, following the per-run instructions pattern of the OpenAI Assistants API. Persisted with the prefix "Zusätzliche Hinweise:\n…". Tenant default prompts are still injected per request.

curl -s -H "Authorization: Bearer $BEARER" -H "SPASS-User-Id: $USER_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "system_prompt": "You answer concisely in German.",
    "message": "Explain GPUs in one sentence.",
    "tools": [], "tool_choice": "auto",
    "response_format": {"type": "text"}
  }' \
  https://dgx.spass.fun/c1/chat
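
Because setting more than one of the three system prompt fields is rejected with HTTP 400 invalid_field, a client can check the rule before sending. A sketch, assuming an empty string means "not set"; check_system_prompt_fields is a hypothetical helper and does not replace the server-side validation:

```shell
# Hypothetical pre-flight check: at most one of the three system prompt
# fields may be non-empty.
# $1 = system_prompt, $2 = system_prompt_ref, $3 = additional_system_prompt
check_system_prompt_fields() {
  _count=0
  for _value in "${1:-}" "${2:-}" "${3:-}"; do
    if [ -n "$_value" ]; then
      _count=$((_count + 1))
    fi
  done
  if [ "$_count" -gt 1 ]; then
    echo "only one system prompt field may be set (got $_count)" >&2
    return 1
  fi
  return 0
}
```

Calling it with zero or one non-empty argument succeeds; two or more print a message to stderr and return 1.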

Multi-turn behavior: if a system prompt field is set again on a later turn although the conversation already has a role=system row, the field is ignored and the server emits the response header spass-system-prompt-ignored: already_present as an explicit signal (rather than ignoring it silently).

Storage: role=system rows appear in the messages[] array returned by GET /c1/conversations/{id}. A frontend can filter with m.role !== "system" to show a pure user view, or deliberately render the system row as a settings icon.

Late injection: POST /c1/conversations/{id}/system-prompt with the same body shape (exactly one of the three fields) appends an additional role=system row. It does NOT replace the existing row, which keeps the audit trail intact.

Native chat summary (Cut 2.33, CR-0002; extended in Cut 2.46, CR-0013)

Cut 2.33 generates an LLM-based topic title plus a one- to two-sentence summary after the first assistant turn (fire-and-forget, roughly 500 ms to 1 s lag). Cut 2.46 (CR-0013) extends this in three ways:

# Auto-trigger after the first turn: meta materializes about 700 ms later.
# Accept-Language controls the language (here: de).
curl -s "https://dgx.spass.fun/c1/conversations/$CONV" \
  -H "Authorization: Bearer $BEARER" -H "SPASS-User-Id: $USER_ID" \
  -H "Accept-Language: de" \
  | jq '.meta'
# → { "title": "Hauptstadt von Deutschland",
#     "summary": "Die Hauptstadt von Deutschland ist Berlin.",
#     "model": "llama-4-scout",
#     "updated_at": "2026-06-21T..Z" }

# Force-regenerate (e.g. after a longer chat); the header requests an English title and summary.
curl -s -X POST "https://dgx.spass.fun/c1/conversations/$CONV/summary?refresh=true" \
  -H "Authorization: Bearer $BEARER" -H "SPASS-User-Id: $USER_ID" \
  -H "Accept-Language: en"
# → { "title": "...", "summary": "...", "cached": false, ... }
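
Since the summary is generated fire-and-forget, a client that needs .meta right after the first turn has to poll for it. A generic retry sketch, assuming a POSIX shell; retry_until is a hypothetical helper, and the commented usage assumes that .meta.title is null or absent until the summary exists, which this document does not state:

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command
retry_until() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt follows.
    if [ "$_i" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _i=$((_i + 1))
  done
  return 1
}

# Usage (assumption: .meta.title stays null or absent until the summary exists):
#   meta_ready() {
#     curl -s -H "Authorization: Bearer $BEARER" -H "SPASS-User-Id: $USER_ID" \
#       "https://dgx.spass.fun/c1/conversations/$CONV" | jq -e '.meta.title' >/dev/null
#   }
#   retry_until 5 1 meta_ready
```

jq -e exits non-zero when its last output is null or false, which makes it a convenient readiness check here.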

ephemeral mode

{"ephemeral": true} runs a turn through /c1 for convenience but persists nothing: neither the user message nor the assistant reply lands in SQLite. This is useful for one-off tools that benefit from the provider selector but should not pollute history.

User isolation (Header-only, ADR 0016 / Cut 2.23c)

The SPASS-User-Id header scopes conversations to that user. Different end-users sharing the same bearer token cannot read or mutate each other's conversations. Requesting somebody else's conversation_id returns 404 with code: conversation_not_found (rather than 403, to avoid leaking existence).

Body and query user_id are forbidden (HTTP 400 invalid_field with param: body.user_id or query.user_id). Use the header.

# Alice's conversation
curl -H "SPASS-User-Id: alice" -d '{"message": "...", ...}' .../c1/chat

# Bob trying to access Alice's conversation_id → 404 (existence-leak protected)
curl -H "SPASS-User-Id: bob" -d '{"conversation_id": "{conversation_id}", "message": "...", ...}' .../c1/chat

# Old shape — DO NOT use, returns 400 invalid_field, param=body.user_id
curl -H "SPASS-User-Id: alice" -d '{"user_id": "alice", "message": "..."}' .../c1/chat
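
Clients migrating from the old shape can catch a stray user_id before it reaches the gateway. A rough sketch that uses a plain text match rather than a JSON parser, so it can misfire on message text that itself contains a quoted "user_id" key; body_has_user_id is a hypothetical helper:

```shell
# Hypothetical pre-flight check: succeed (exit 0) if the JSON body text
# contains a "user_id" key. Plain text match, not a real JSON parse.
body_has_user_id() {
  printf '%s' "$1" | grep -q '"user_id"[[:space:]]*:'
}
```

A caller might refuse to send when the check succeeds, for example: if body_has_user_id "$BODY"; then echo "move user_id to the SPASS-User-Id header" >&2; fi. Keys such as conversation_id do not match because the pattern requires the opening quote directly before user_id.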