DGX LLM Chat Gateway

Conversation persistence (/c1)

/c1/chat differs from /v1/chat/completions in three ways:

  1. Server keeps history. Pass conversation_id and the server prepends prior turns automatically.
  2. Different request schema. A single message field (string or content-parts array) replaces the full messages array. tools, tool_choice, and response_format are required; pass [], "auto", and {"type": "text"} as the no-op defaults.
  3. provider selector. Switch backends without changing the model slug: "provider": "ollama" resolves to <model>-ollama.

Two-turn memory

# Turn 1 — let the server allocate a fresh conversation:
RESP1=$(curl -is -H "Authorization: Bearer $BEARER" \
                -H "Content-Type: application/json" \
                -d '{
                  "model": "llama-4-scout",
                  "message": "My favourite colour is teal. Remember that.",
                  "tools": [],
                  "tool_choice": "auto",
                  "response_format": {"type": "text"},
                  "stream": false,
                  "max_tokens": 60
                }' \
                https://dgx-spark-4236.spass.fun/c1/chat)

# Pull the conversation id from the response header
CONV=$(echo "$RESP1" | tr -d '\r' | grep -i "^x-conversation-id:" | awk '{print $2}')
echo "conversation: $CONV"
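The same tr/grep/awk pipeline can be sanity-checked offline against a canned response; the conversation id below is invented for illustration and is not a real id from the gateway:

```shell
# Canned curl -is style output: status line, headers, blank line, body.
RESP1=$(printf 'HTTP/2 200\r\nx-conversation-id: conv_0123abcd\r\ncontent-type: application/json\r\n\r\n{"choices":[]}\n')

# Same extraction as above: strip CRs, find the header, take the value.
CONV=$(printf '%s\n' "$RESP1" | tr -d '\r' | grep -i "^x-conversation-id:" | awk '{print $2}')
echo "conversation: $CONV"   # conversation: conv_0123abcd
```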

# Turn 2 — recall on the same conversation
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d "{
    \"model\": \"llama-4-scout\",
    \"conversation_id\": \"$CONV\",
    \"message\": \"What is my favourite colour?\",
    \"tools\": [],
    \"tool_choice\": \"auto\",
    \"response_format\": {\"type\": \"text\"},
    \"stream\": false,
    \"max_tokens\": 40
  }" \
  https://dgx-spark-4236.spass.fun/c1/chat \
  | jq -r '.choices[0].message.content'

The model answers teal: the server reloaded turn 1 from SQLite and prepended it to the prompt automatically. LMCache caches the shared prefix on the upstream side, so the second turn returns in tens of milliseconds with a cache hit.
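Conceptually, the upstream request for turn 2 looks something like the reconstructed payload below. This is illustrative, not the gateway's exact wire format, and the assistant turn shown is a made-up reply:

```json
{
  "model": "llama-4-scout",
  "messages": [
    {"role": "user", "content": "My favourite colour is teal. Remember that."},
    {"role": "assistant", "content": "Got it, your favourite colour is teal."},
    {"role": "user", "content": "What is my favourite colour?"}
  ],
  "stream": false,
  "max_tokens": 40
}
```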

Listing and deleting conversations

# Paginated list (newest first)
curl -s -H "Authorization: Bearer $BEARER" \
  "https://dgx-spark-4236.spass.fun/c1/conversations?limit=10&offset=0" \
  | jq '.items[] | {id, message_count, model, last_used_at}'

# Pull one conversation in full
curl -s -H "Authorization: Bearer $BEARER" \
  "https://dgx-spark-4236.spass.fun/c1/conversations/$CONV" \
  | jq

# Delete (soft delete; covers both user and assistant turns)
curl -X DELETE -H "Authorization: Bearer $BEARER" \
  "https://dgx-spark-4236.spass.fun/c1/conversations/$CONV"

Provider switch via provider enum

If you want to test the same model across backends:

# Force Ollama Cloud (skip local vLLM, skip OpenRouter)
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "provider": "ollama",
    "message": "Was ist 2+2?",
    "tools": [], "tool_choice": "auto",
    "response_format": {"type": "text"},
    "stream": false, "max_tokens": 40
  }' \
  https://dgx-spark-4236.spass.fun/c1/chat

provider resolves to <model>-<provider> (e.g. llama-4-scout-ollama). If both model and provider are set, model wins: an explicit slug beats the convenience shortcut.
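The resolution rule can be sketched as a small shell helper. The suffix-matching logic here is an assumption about how the gateway resolves slugs, not confirmed behaviour:

```shell
# Guess at the rule: append "-<provider>" unless the model slug
# already ends with it (an explicit slug wins over the shortcut).
resolve_slug() {
  local model="$1" provider="$2"
  if [ -n "$provider" ] && [ "${model%-$provider}" = "$model" ]; then
    echo "${model}-${provider}"
  else
    echo "$model"
  fi
}

resolve_slug llama-4-scout ollama          # -> llama-4-scout-ollama
resolve_slug llama-4-scout-ollama ollama   # -> llama-4-scout-ollama (model wins)
```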

system_prompt (only on first turn)

Set once when creating a fresh conversation; ignored on follow-ups:

curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "system_prompt": "You answer concisely in German.",
    "message": "Explain GPUs in one sentence.",
    "tools": [], "tool_choice": "auto",
    "response_format": {"type": "text"},
    "stream": false, "max_tokens": 80
  }' \
  https://dgx-spark-4236.spass.fun/c1/chat

Subsequent turns on the same conversation_id keep the original system prompt.
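A follow-up turn therefore carries only the conversation id; including system_prompt again would be ignored. The `<conv-id>` placeholder below stands in for the id returned on turn 1:

```json
{
  "model": "llama-4-scout",
  "conversation_id": "<conv-id>",
  "message": "Now in two sentences, please.",
  "tools": [], "tool_choice": "auto",
  "response_format": {"type": "text"},
  "stream": false, "max_tokens": 80
}
```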

ephemeral mode

{"ephemeral": true} runs a turn through /c1 for convenience but persists nothing: neither the user message nor the assistant reply lands in SQLite. Useful for one-off tools that benefit from the provider selector but don't want to pollute history.
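For example, a one-off payload with the flag set (same required fields as any other /c1/chat call; note there is no conversation_id to create or reuse):

```json
{
  "model": "llama-4-scout",
  "ephemeral": true,
  "message": "Translate 'Guten Morgen' to English.",
  "tools": [], "tool_choice": "auto",
  "response_format": {"type": "text"},
  "stream": false, "max_tokens": 40
}
```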

User isolation

Optional user_id field scopes conversations to that user. Different user_ids sharing the same bearer token cannot read or mutate each other's conversations. Requesting somebody else's conversation_id returns 404 with code: conversation_not_found (rather than 403, to avoid leaking existence).

# Alice's conversation
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d '{"user_id": "alice", "model": "llama-4-scout", "message": "...",
       "tools": [], "tool_choice": "auto", "response_format": {"type": "text"},
       "stream": false, "max_tokens": 40}' \
  https://dgx-spark-4236.spass.fun/c1/chat

# Bob trying to access Alice's conversation_id → 404 conversation_not_found
curl -s -H "Authorization: Bearer $BEARER" -H "Content-Type: application/json" \
  -d '{"user_id": "bob", "conversation_id": "<alice-id>", "model": "llama-4-scout", "message": "...",
       "tools": [], "tool_choice": "auto", "response_format": {"type": "text"},
       "stream": false, "max_tokens": 40}' \
  https://dgx-spark-4236.spass.fun/c1/chat