DGX LLM Chat Gateway

/a1 — Agent System (Phase 2)

The /a1/agents/... surface is a rig-core powered agent system on top of /v1/chat/completions. Agents are configured via YAML files, can use stack-side tools, retrieve from per-tenant RAG indices, and persist multi-turn sessions.

Architecture (ADR 0011)

Caller → POST /a1/agents/<name>/chat
       │   auth-middleware re-resolves Identity
       │   build rig::Agent from agents/<name>.yaml
       │   rig sub-call …
       └─▶ POST http://localhost:3000/v1/chat/completions  (self-call)
              ├─ SPASS-Augment: server-tools=off, system-prompt=off, memory=off
              ├─ SPASS-Caller-Depth: <incremented>
              ├─ SPASS-Parent-Request-Id: <outer /a1 request-id>
              └─ Bearer: <forwarded caller-bearer>
                  → full /v1 pipeline (Cost-V2, audit, Layer-1, leak-checks)

Path Y = "rig speaks to our own /v1" — every Stack-invariant (Cost-Pipeline V2, Audit, Layer-1 defense, ADR-0009 leak-sanitization, SPASS-Augment) applies transparently to every rig sub-call.
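The self-call headers from the diagram above can be sketched as a small helper. This is only an illustration, not the gateway's actual code: the function name build_subcall_headers is hypothetical, and it assumes the forwarded bearer travels in a standard Authorization header, as in the curl examples on this page.

```python
def build_subcall_headers(caller_bearer: str, incoming_depth: int,
                          outer_request_id: str) -> dict[str, str]:
    """Headers for the rig self-call to /v1/chat/completions (hypothetical helper)."""
    return {
        # Forward the caller's own bearer so /v1 applies that caller's limits.
        "Authorization": f"Bearer {caller_bearer}",
        # Value copied from the architecture diagram: switch off stack-side augmentation.
        "SPASS-Augment": "server-tools=off, system-prompt=off, memory=off",
        # Incremented before the sub-call is issued.
        "SPASS-Caller-Depth": str(incoming_depth + 1),
        # Lets the audit trail link the sub-call to the outer /a1 request.
        "SPASS-Parent-Request-Id": outer_request_id,
    }


headers = build_subcall_headers("secret-token", 0, "req-123")
print(headers["SPASS-Caller-Depth"])
```

Because the sub-call is an ordinary /v1 request, nothing else is needed for the cost, audit and leak-check stages to apply.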

Endpoints

Method  Path                                       Description
GET     /a1/agents                                 List all configured agents (operator-authored YAML)
GET     /a1/agents/<name>                          Full agent config (system_prompt, model, tools, rag_index)
POST    /a1/agents/<name>/chat                     Single-shot prompt → completion
POST    /a1/agents/<name>/sessions                 Create persistent multi-turn session
GET     /a1/agents/<name>/sessions                 List the caller's sessions for that agent
GET     /a1/agents/<name>/sessions/<sid>           Session metadata + msg-count
DELETE  /a1/agents/<name>/sessions/<sid>           Drop session + all messages
GET     /a1/agents/<name>/sessions/<sid>/messages  Full conversation history
POST    /a1/agents/<name>/sessions/<sid>/messages  Append turn (uses persisted history); 308 → successor if archived
POST    /a1/agents/<name>/sessions/<sid>/compact   Compact older turns into a summary, archive source, create successor (Cut 2.12)
GET     /a1/agents/<name>/sessions/<sid>/lineage   Forward + backward lineage chain + summary records (Cut 2.12)

Agent YAML config

Agent configs are operator-authored and live in data/agents/*.yaml on the host. They are loaded once at process boot, so config changes require a restart (hot-reload is on the backlog).

name: berlin-rag                 # URL-safe [a-zA-Z0-9_-]+
model: claude-opus-4.7           # must be in ALLOWED_MODELS + per-token allowlist
description: Berlin RAG persona  # shown in /a1/agents listing
system_prompt: |
  Du beantwortest Fragen zu Berlin auf Deutsch...
max_tokens: 512                  # optional
temperature: 0.3                 # optional
tools:                           # optional — stack tools available to agent
  - calculator
  - current_datetime
  - rag_query                    # active-RAG tool, see /docs/rag
rag_index: berlin-info           # optional — passive RAG context-injection
rag_top_k: 3                     # optional, default 5
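A minimal sketch of how such a config could be checked after parsing, assuming the YAML has already been loaded into a plain dict. The function normalize_agent_config is hypothetical; only the name pattern and the rag_top_k default of 5 come from the comments in the example above, and treating a missing tools list as empty is an assumption.

```python
import re

# URL-safe agent names, as stated in the YAML comment above.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def normalize_agent_config(raw: dict) -> dict:
    """Validate the agent name and fill in documented defaults (hypothetical helper)."""
    name = raw.get("name", "")
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"agent name {name!r} is not URL-safe")
    config = dict(raw)                 # do not mutate the caller's dict
    config.setdefault("tools", [])     # assumption: no tools when the key is absent
    config.setdefault("rag_top_k", 5)  # documented default
    return config


config = normalize_agent_config({"name": "berlin-rag", "model": "claude-opus-4.7"})
print(config["rag_top_k"])
```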

Bundled agents (out-of-the-box examples)

Agent           Model                  Tools                                                              RAG             Use-case
concise-de      llama-4-scout (local)  -                                                                  -               Bare LLM, terse German
berlin-tour     llama-4-scout (local)  -                                                                  -               Berlin-tour persona
berlin-rag      llama-4-scout (local)  -                                                                  berlin-info     Passive RAG (rig dynamic_context)
researcher      claude-opus-4.7        calculator, current_datetime, wikipedia_search, wikipedia_summary  -               Tool-using research assistant
researcher-rag  claude-opus-4.7        calculator, current_datetime, wikipedia_search                     matrix-rag-idx  Tools + RAG combined
librarian       claude-opus-4.7        rag_query                                                          -               Active-RAG via tool-call

Tool-calling note: local Llama-4-Scout-FP4 (--tool-call-parser=pythonic) currently emits tool-calls as plain-text JSON, which our /v1 tool-loop cannot parse. Agents that need tools should pin model: claude-opus-4.7 for now. See backlog for llama4_json parser switch.

Multi-turn sessions

Sessions persist in a tenant-scoped SQLite at data/sqlite/agent_sessions/<tenant>.sqlite. Each session is agent-pinned at creation — switching the agent mid-session returns 400 session_agent_mismatch.

Sessions are user-scoped (Cut 2.23c, ADR 0016). All session endpoints require the SPASS-User-Id header, so different end-users sharing the same bearer cannot see or mutate each other's sessions. Pre-Cut-2.23c sessions (DEFAULT user_id '') are effectively invisible to all end-users; an operator can still read them via SQL.

Per-session compaction config (Cut 2.12 + 2.13) — optional body fields on create. Missing fields fall through to the per-tenant cascade default (process → yaml → DB), so an operator can change the tenant-wide default without touching every session-create call.

Field                     Type             Default source         Notes
compact_strategy          auto|manual|off  tenant cascade (auto)  auto enables the auto-trigger gate (see below); manual keeps only the explicit POST /compact; off disables compaction entirely
compact_keep_last_n       0..200           tenant cascade (10)    live messages preserved verbatim when compaction runs; everything older becomes the summary
compact_observation_mask  bool             tenant cascade (true)  true instructs the summary-model to drop tool-noise / acknowledgements; false passes raw turns
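
The fall-through for missing fields can be sketched as a layered lookup. This is a sketch under assumptions, not the gateway's implementation: the function resolve_setting is hypothetical, and it assumes that later cascade levels win (DB over yaml over process) and that an explicit session-create field beats all of them.

```python
from typing import Any

# Process-level defaults taken from the table above.
PROCESS_DEFAULTS = {
    "compact_strategy": "auto",
    "compact_keep_last_n": 10,
    "compact_observation_mask": True,
}


def resolve_setting(key: str, session_body: dict, db: dict, yaml_defaults: dict) -> Any:
    """Return the first value found, most specific layer first (assumed precedence)."""
    for layer in (session_body, db, yaml_defaults, PROCESS_DEFAULTS):
        if layer.get(key) is not None:
            return layer[key]
    raise KeyError(key)


# The tenant's yaml sets keep_last_n, the session-create body sets nothing.
print(resolve_setting("compact_keep_last_n", {}, {}, {"compact_keep_last_n": 20}))
```

With this shape, changing the tenant-wide default only means writing one yaml or DB entry; session-create calls that omit the field pick it up automatically.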
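
# Read the bearer from the stack's .env file
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2)"
# Placeholder end-user id; all session endpoints require SPASS-User-Id
USER_ID="alice"

# Create a session and keep its id
SID=$(curl -s -X POST http://localhost:3000/a1/agents/concise-de/sessions \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" | jq -r .id)

# First turn: state a fact
curl -s -X POST http://localhost:3000/a1/agents/concise-de/sessions/$SID/messages \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" \
  -H 'Content-Type: application/json' \
  -d '{"message": "Mein Lieblingssport ist Tennis."}' | jq

# Second turn: the agent answers from the persisted history
curl -s -X POST http://localhost:3000/a1/agents/concise-de/sessions/$SID/messages \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" \
  -H 'Content-Type: application/json' \
  -d '{"message": "Welcher Sport ist mein Liebling?"}' | jq
# → "Dein Lieblingssport ist Tennis."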

Compaction (Cut 2.12) — Z-chained lineage + C4-hybrid

When a session grows past the model's effective context-budget, or the operator triggers POST /a1/agents/<n>/sessions/<sid>/compact manually:

  1. The most-recent compact_keep_last_n (default 10) messages are preserved verbatim.
  2. Everything older is summarised by the tenant's effective compact_summary_model, resolved through the 3-level cascade process → yaml → DB. The default is llama-4-scout, chosen for data-protection reasons; a tenant can override it via tokens.yaml::tenants[].defaults.compact_summary_model or at runtime via PUT /v1/tenant/config. See Per-tenant config and ADR 0013.
  3. The summarised rows are soft-deleted (compacted_at = NOW); a session_summaries row is persisted.
  4. A new successor session is created with parent_session_id → source.
  5. The summary is bootstrapped into the successor as a single synthetic assistant-role message ("[compaction summary from session X] …"); the kept-last-N messages are copied verbatim.
  6. The source session is archived: archived_at = NOW, successor_session_id = new.
  7. Subsequent chat attempts on the archived session id return HTTP 308 with Location: /a1/agents/<n>/sessions/<successor>/messages. Because 308 preserves method and body, a client that follows redirects lands on the successor with the chat body intact (curl follows redirects only with -L).

# NAME, SID, TOKEN and USER_ID are assumed to be set in the shell.
# Compact a session manually
curl -s -X POST "http://localhost:3000/a1/agents/$NAME/sessions/$SID/compact" \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" | jq
# → { source_session_id, successor_session_id, summary_id, summary_text, ... }

# Inspect the lineage chain
curl -s "http://localhost:3000/a1/agents/$NAME/sessions/$SID/lineage" \
  -H "Authorization: Bearer $TOKEN" \
  -H "SPASS-User-Id: $USER_ID" | jq
# → { backward: [...], forward: [...], summaries: [...] }
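
The compaction steps listed above can be modelled in memory as follows. This is a simplified sketch, not the stored schema: the Session class and the compact function are hypothetical, the summarise callable stands in for the summary model, and soft-deletes and the session_summaries row are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Message = tuple[str, str]  # (role, text)


@dataclass
class Session:
    id: str
    messages: list[Message] = field(default_factory=list)
    parent_session_id: Optional[str] = None
    successor_session_id: Optional[str] = None
    archived: bool = False


def compact(source: Session, keep_last_n: int,
            summarise: Callable[[list[Message]], str], new_id: str) -> Session:
    """Summarise older turns, create a successor, archive the source."""
    split = max(len(source.messages) - keep_last_n, 0)
    older, kept = source.messages[:split], source.messages[split:]
    summary = summarise(older)                      # stand-in for the summary model
    successor = Session(id=new_id, parent_session_id=source.id)
    # The summary becomes one synthetic assistant message, then the kept turns follow.
    successor.messages.append(
        ("assistant", f"[compaction summary from session {source.id}] {summary}"))
    successor.messages.extend(kept)
    source.archived = True
    source.successor_session_id = new_id
    return successor


old = Session("s1", [("user", f"turn {i}") for i in range(5)])
new = compact(old, keep_last_n=2, summarise=lambda m: f"{len(m)} turns", new_id="s2")
print(new.messages[0])
```

Chaining parent_session_id and successor_session_id in both directions is what makes the lineage endpoint able to walk backward and forward from any session in the chain.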

observation_mask controls the summary-model's instruction template. With true (default), the model is told to drop greetings, acknowledgements, raw tool-output, and error-traces while preserving named entities, numbers, decisions, and unresolved questions. With false, raw turns are passed verbatim — useful when tool-output is itself the load-bearing context (e.g. RAG-retrieval pipelines).

Failure modes:

Auto-trigger

When compact_strategy is auto (the default), POST .../sessions/<sid>/messages checks three conditions before the chat-loop runs:

Condition                                        Source
compact_strategy == "auto"                       session row
live-message count > compact_keep_last_n         messages table where compacted_at IS NULL
estimated tokens > 80 % of model context_window  char-based heuristic (~4 chars/token) over live-messages × catalog-lookup of model

If all three hold, the handler compacts the session FIRST, then runs the chat against the brand-new successor. The response carries:

Auto-trigger failures are logged but do NOT bubble up: the chat continues against the original, still-uncompacted session, and the upstream may surface its own context-overflow error if the context really overflows. This is intentional, since a transient summary-model issue should not break user-facing chat. Switch to compact_strategy: manual to disable the auto-trigger entirely while keeping the explicit POST /compact endpoint available.

Token-estimate caveat: the gate uses a conservative chars / 4 heuristic, not real tokens. It tends to over-count for English-heavy prose (real ratio ≈ 4.3) and under-count for very token-dense content (long URLs, code, structured-output). The 80 % threshold leaves enough headroom that mis-estimates of ±20 % do not bust the upstream.
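
The three-condition gate and the chars / 4 estimate can be sketched like this. The function names are hypothetical and messages are reduced to plain strings; the real handler reads the strategy from the session row and the context window from the model catalog.

```python
def estimate_tokens(live_messages: list[str]) -> int:
    """Char-based heuristic: roughly 4 characters per token."""
    return sum(len(text) for text in live_messages) // 4


def should_auto_compact(strategy: str, live_messages: list[str],
                        keep_last_n: int, context_window: int) -> bool:
    """True only if all three auto-trigger conditions hold."""
    return (
        strategy == "auto"
        and len(live_messages) > keep_last_n
        and estimate_tokens(live_messages) > 0.8 * context_window
    )


# 12 messages of 400 chars each: 4800 chars, about 1200 estimated tokens.
history = ["x" * 400] * 12
print(should_auto_compact("auto", history, keep_last_n=10, context_window=1000))
```

In this example the estimate of 1200 tokens exceeds 80 % of a 1000-token window, so the gate fires; with a 2000-token window the threshold would be 1600 and the chat would run uncompacted.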

Recursion-protection

Every /a1 call propagates SPASS-Caller-Depth (default 0). The handler increments before issuing the rig self-call. Depth ≥ 3 hard-fails with recursion_depth_exceeded — defensive guard against any future flow that could form an /a1 → /v1 → /a1 chain.
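
A minimal sketch of the depth guard, assuming the limit is checked against the incoming header value before the increment (the text above does not say on which side of the increment the check sits). The function and exception names are hypothetical; the header name, the default of 0, the limit of 3 and the error code come from the paragraph above.

```python
MAX_CALLER_DEPTH = 3


class RecursionDepthExceeded(Exception):
    """Raised with the error code recursion_depth_exceeded."""


def next_caller_depth(headers: dict[str, str]) -> int:
    """Return the depth to send on the self-call, or fail if the chain is too deep."""
    depth = int(headers.get("SPASS-Caller-Depth", "0"))  # missing header means depth 0
    if depth >= MAX_CALLER_DEPTH:
        raise RecursionDepthExceeded("recursion_depth_exceeded")
    return depth + 1


print(next_caller_depth({}))
```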

Audit-trail correlation

Every /v1 sub-call carries SPASS-Parent-Request-Id set to the outer /a1 request_id, and audit.jsonl records it as fields.parent_request_id. To reconstruct a call-tree:

jq -c "select(.fields.parent_request_id == \"<outer-rid>\")" \
  /home/dietmar/dgx-llm/data/audit/audit.jsonl.YYYY-MM-DD
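
The same filter can be written in Python when the audit lines need further processing. This is a sketch: the function sub_calls is hypothetical, and the sample records only assume the fields.parent_request_id path used in the jq filter above; every other key in them is invented for illustration.

```python
import json
from typing import Iterable


def sub_calls(lines: Iterable[str], outer_request_id: str) -> list[dict]:
    """Return all audit records whose parent is the given outer /a1 request."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        if record.get("fields", {}).get("parent_request_id") == outer_request_id:
            matches.append(record)
    return matches


# Illustrative records, not real audit output.
sample = [
    '{"fields": {"request_id": "v1-a", "parent_request_id": "a1-outer"}}',
    '{"fields": {"request_id": "v1-b", "parent_request_id": "other"}}',
    '{"fields": {"request_id": "a1-outer"}}',
]
print(len(sub_calls(sample, "a1-outer")))
```

To run it against a real log, pass an open file handle for the day's audit.jsonl file as the lines argument.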

Cost aggregation

The /a1-outer response carries summed SPASS-Cost-{Eur,Usd,Source,...} headers across all sub-calls (Cut 2.7). A debug header SPASS-Cost-Sub-Calls reports the count.

HTTP/1.1 200 OK
spass-cost-eur: 0.02
spass-cost-usd: 0.02
spass-cost-source: upstream
spass-cost-sub-calls: 1

See /docs/response-headers for the full Cost-V2 spec.
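
The summing of cost headers across sub-calls can be illustrated as follows. This is a sketch under assumptions: aggregate_costs is a hypothetical name, each sub-call is represented by a dict of its lowercase cost headers, and the rule for combining spass-cost-source values is not covered because the text above does not describe it.

```python
from decimal import Decimal


def aggregate_costs(sub_calls: list[dict[str, str]]) -> dict[str, str]:
    """Sum per-sub-call cost headers into the outer /a1 response headers."""
    # Decimal avoids binary floating-point drift when adding small currency amounts.
    eur = sum((Decimal(c["spass-cost-eur"]) for c in sub_calls), Decimal("0"))
    usd = sum((Decimal(c["spass-cost-usd"]) for c in sub_calls), Decimal("0"))
    return {
        "spass-cost-eur": str(eur),
        "spass-cost-usd": str(usd),
        "spass-cost-sub-calls": str(len(sub_calls)),
    }


calls = [
    {"spass-cost-eur": "0.01", "spass-cost-usd": "0.01"},
    {"spass-cost-eur": "0.02", "spass-cost-usd": "0.03"},
]
print(aggregate_costs(calls))
```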