DGX LLM Chat Gateway

/a1/rag — Retrieval-Augmented Generation

Tenant-scoped RAG indices on top of the local nomic-embed-text-v1.5 model (dim=768, ADR 0012). Indices persist per tenant in SQLite; survival across restarts has been verified. Two consumption patterns are supported:

  1. Passive RAG: agent.yaml rag_index: foo injects the top-k chunks before every prompt (rig dynamic_context).
  2. Active RAG: the rag_query stack-tool lets the LLM decide when and what to retrieve, and can chain multiple indices per turn (the common agentic-RAG pattern).

Both patterns reuse the same tenant-scoped IndexRegistry.

Cut 2.23c (ADR 0016): the SPASS-User-Id-Header is mandatory on every authenticated /a1/* endpoint (including /a1/rag/* operations and agent-sessions). A user_id in the body or query string is rejected with HTTP 400. RAG indices are tenant-scoped, not user-scoped, but the audit trail still requires the header.
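The Cut 2.23c header rule can be illustrated with a minimal sketch. It is not the gateway's code: the literal header name (`spass-user-id`), the function name `check_user_id`, and the status returned for a missing header are assumptions made for illustration. Only the HTTP 400 for a body or query user_id comes from the text.

```python
from typing import Mapping, Optional, Tuple

# Assumption: the literal header name; the text only calls it "SPASS-User-Id-Header".
USER_ID_HEADER = "spass-user-id"


def check_user_id(headers: Mapping[str, str],
                  query: Mapping[str, str],
                  body: Mapping[str, object]) -> Tuple[int, Optional[str]]:
    """Return (http_status, user_id) for an authenticated /a1/* request."""
    # A user_id supplied via body or query string is rejected outright.
    if "user_id" in query or "user_id" in body:
        return 400, None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {k.lower(): v for k, v in headers.items()}
    user_id = normalised.get(USER_ID_HEADER, "").strip()
    if not user_id:
        # The status for a missing header is not stated in the text; 400 is a guess.
        return 400, None
    return 200, user_id
```

The point of the design is that identity arrives through exactly one channel, so the audit trail cannot be overridden by a field the caller (or a prompt) controls.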

Architecture

caller → POST /a1/rag/indices/<id>/documents
       │   embed via /v1/embeddings (nomic-embed)
       │   persist (tenant_id, id, doc_id, text, metadata, embeddings_json)
       ▼   to data/sqlite/rag/<tenant_id>.sqlite
       in-memory InMemoryVectorStore + InMemoryVectorIndex (rig)

On POST .../query, the query is embedded with the same model and run as a cosine-similarity top_n search against the in-memory index. On rust-api restart, load_all_from_disk rebuilds every tenant's index from the stored embeddings, so nothing is re-embedded at boot.
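The restart and query path can be sketched in a few lines. This is an illustration only: the table name `docs`, the helper names, and the toy 3-dimensional vectors are assumptions, and the real service uses rig's in-memory vector index rather than this linear scan. The column list follows the persisted tuple shown in the diagram.

```python
import json
import math
import sqlite3
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_index(conn: sqlite3.Connection, index_id: str) -> Dict[str, Vector]:
    """Rebuild the in-memory index from stored embeddings (no re-embedding)."""
    rows = conn.execute(
        "SELECT doc_id, embeddings_json FROM docs WHERE id = ?", (index_id,)
    )
    return {doc_id: json.loads(raw) for doc_id, raw in rows}


def top_n(index: Dict[str, Vector], query_vec: Vector, n: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the n best."""
    scored = [(doc_id, cosine(vec, query_vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (tenant_id TEXT, id TEXT, doc_id TEXT, "
    "text TEXT, metadata TEXT, embeddings_json TEXT)"
)
conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?)", [
    ("acme", "berlin-info", "d1", "gate", "{}", json.dumps([1.0, 0.0, 0.0])),
    ("acme", "berlin-info", "d2", "park", "{}", json.dumps([0.0, 1.0, 0.0])),
])
index = load_index(conn, "berlin-info")
hits = top_n(index, [0.1, 0.9, 0.0], 1)  # the query vector points mostly at d2
```

Because the vectors are stored alongside the text, a restart only costs a table scan and some JSON parsing, not a round trip through the embedding model.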

Index-level operations

| Method | Path | Description |
|--------|------|-------------|
| GET | /a1/rag/indices | List the caller's indices |
| GET | /a1/rag/indices/<id> | Index summary (doc_count, embedding_model, created_at) |
| DELETE | /a1/rag/indices/<id> | Drop the entire index + all docs |
| POST | /a1/rag/indices/<id>/documents | Full-replace ingest (drops old docs first) |
| POST | /a1/rag/indices/<id>/query | Pure-retrieval, top-k chunks |

Document-level operations (Cut 2.9)

These endpoints support incremental knowledge-base maintenance, so the caller does not need to re-post existing docs:

| Method | Path | Description |
|--------|------|-------------|
| GET | /a1/rag/indices/<id>/documents | List all docs in the index (id + text + metadata, no similarity) |
| GET | /a1/rag/indices/<id>/documents/<doc_id> | Read one doc |
| POST | /a1/rag/indices/<id>/documents/append | Incremental ingest (existing docs preserved) |
| PATCH | /a1/rag/indices/<id>/documents/<doc_id> | Update text and/or metadata (re-embeds) |
| DELETE | /a1/rag/indices/<id>/documents/<doc_id> | Drop one doc (cache rebuilt) |

PATCH accepts {text?, metadata?} — at least one of the two must be provided. DELETE is idempotent (returns deleted: false if the doc was already gone).
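The PATCH and DELETE semantics can be modelled with a small in-memory stand-in. This is a sketch under stated assumptions: the `Docs` mapping, the function names, and the choice to replace (rather than merge) metadata are hypothetical, and a KeyError stands in for whatever not-found response the gateway returns.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory stand-in for one index: doc_id -> {"text", "metadata"}.
Docs = Dict[str, Dict[str, Any]]


def patch_doc(docs: Docs, doc_id: str, text: Optional[str] = None,
              metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Apply {text?, metadata?}; an empty patch is a client error."""
    if text is None and metadata is None:
        raise ValueError("PATCH needs at least one of 'text' or 'metadata'")
    doc = docs[doc_id]  # KeyError stands in for a not-found response
    if text is not None:
        doc["text"] = text  # the gateway would re-embed at this point
    if metadata is not None:
        doc["metadata"] = metadata  # assumption: replace, not merge
    return doc


def delete_doc(docs: Docs, doc_id: str) -> Dict[str, bool]:
    """Idempotent delete: a missing doc is reported, not treated as an error."""
    return {"deleted": docs.pop(doc_id, None) is not None}
```

Calling `delete_doc` twice on the same id yields `{"deleted": True}` and then `{"deleted": False}`, which lets a client retry safely after a timeout.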

Tenant-isolation

All ops are scoped by (tenant_id, public_id). Tenant A's index foo is invisible to Tenant B; both tenants can have an index named foo independently. Each tenant has its own SQLite file. Verified end-to-end in e2e/tests/a1-multi-tenant.spec.ts.
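A toy model of the (tenant_id, public_id) scoping, assuming nothing about the real IndexRegistry beyond what is stated above. The class name `ToyIndexRegistry` and its methods are hypothetical; the file layout follows the data/sqlite/rag/<tenant_id>.sqlite path from the Architecture section.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


class ToyIndexRegistry:
    """Toy registry keyed by (tenant_id, public_id); not the gateway's real type."""

    def __init__(self, root: str = "data/sqlite/rag") -> None:
        self._root = Path(root)
        self._indices: Dict[Tuple[str, str], List[str]] = {}

    def db_path(self, tenant_id: str) -> Path:
        # One SQLite file per tenant, following the path shown under Architecture.
        return self._root / f"{tenant_id}.sqlite"

    def put(self, tenant_id: str, public_id: str, docs: List[str]) -> None:
        self._indices[(tenant_id, public_id)] = list(docs)

    def get(self, tenant_id: str, public_id: str) -> Optional[List[str]]:
        # The tenant is part of the key, so there is no cross-tenant lookup path.
        return self._indices.get((tenant_id, public_id))
```

With this shape, two tenants can both store an index named foo, and a lookup with the wrong tenant simply misses instead of needing a separate permission check.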

Quick walkthrough

# -f2- keeps everything after the first '=', so tokens containing '=' are not truncated
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2-)"

# 1) Ingest a knowledge-base
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"documents":[
    {"id":"d1","text":"Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791."},
    {"id":"d2","text":"Der Tiergarten ist der größte innerstädtische Park Berlins, 210 Hektar."}
  ]}' | jq

# 2) Pure-retrieval query
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/query \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query":"Wo ist der größte Park?","top_k":2}' | jq

# 3) Incrementally add a new doc
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents/append \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"documents":[{"id":"d3","text":"Der Kurfürstendamm im Westen ist die bekannteste Einkaufsstraße."}]}' | jq

# 4) Patch an existing doc
curl -s -X PATCH http://localhost:3000/a1/rag/indices/berlin-info/documents/d1 \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"metadata":{"category":"sights","source":"manual"}}' | jq

# 5) Use it in an agent (passive: agent.yaml `rag_index: berlin-info`)
curl -s -X POST http://localhost:3000/a1/agents/berlin-rag/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"message":"Wo finde ich den größten Park?"}' | jq

# 6) Use it as active-RAG via tool (agent.yaml `tools: [rag_query]`)
curl -s -X POST http://localhost:3000/a1/agents/librarian/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"message":"Wann wurde das Brandenburger Tor gebaut?"}' | jq
# → LLM calls rag_query(index_id="berlin-info", query="...") then synthesizes

The rag_query stack-tool (Cut 2.8)

The same IndexRegistry is exposed as a tool that every LLM call (/v1, /c1, /a1) can invoke. Schema:

{
  "name": "rag_query",
  "parameters": {
    "type": "object",
    "properties": {
      "index_id": {"type": "string"},
      "query":    {"type": "string"},
      "top_k":    {"type": "integer", "default": 5, "minimum": 1, "maximum": 50}
    },
    "required": ["index_id", "query"]
  }
}

The tool inherits tenant-scoping from ToolCtx.identity.tenant_id — Tenant B cannot query Tenant A's indices regardless of how the tool is invoked.
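How a handler might validate rag_query arguments against the schema and attach the tenant is sketched below. The function name `parse_rag_query_args` and the returned dictionary shape are assumptions; the defaults and bounds are taken from the schema, and the tenant is passed in separately to mirror ToolCtx.identity.tenant_id.

```python
from typing import Any, Dict

TOP_K_DEFAULT, TOP_K_MIN, TOP_K_MAX = 5, 1, 50


def parse_rag_query_args(args: Dict[str, Any], ctx_tenant_id: str) -> Dict[str, Any]:
    """Validate tool arguments against the schema and attach the caller's tenant."""
    for field in ("index_id", "query"):
        if not isinstance(args.get(field), str):
            raise ValueError(f"'{field}' is required and must be a string")
    top_k = args.get("top_k", TOP_K_DEFAULT)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise ValueError("'top_k' must be an integer")
    if not TOP_K_MIN <= top_k <= TOP_K_MAX:
        raise ValueError(f"'top_k' must be in {TOP_K_MIN}..{TOP_K_MAX}")
    # The tenant comes from the trusted call context, never from model-supplied arguments.
    return {"tenant_id": ctx_tenant_id, "index_id": args["index_id"],
            "query": args["query"], "top_k": top_k}
```

Any tenant_id the model places in the arguments is ignored, which is what makes the isolation hold regardless of how the tool is invoked.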

The 9 RAG-CRUD stack-tools (Cut 2.11)

The same IndexRegistry is also exposed for fully agent-driven knowledge-base maintenance, so the LLM can curate its own indices end-to-end. All 9 tools inherit per-tenant isolation via ToolCtx.identity.tenant_id. They are reachable via /v1/tools/execute and agent.yaml tools: [...] lists, and are auto-injected into chat-completions when added to TOOLS_ENABLED_DEFAULT.

| Tool | Mode | Confirm | Preview |
|------|------|---------|---------|
| rag_index_list | read | none | |
| rag_index_describe | read | none | |
| rag_doc_list | read | none | |
| rag_doc_get | read | none | |
| rag_index_create | write (replace) | light B2 | shape_summary (doc_count, total_bytes, replace_semantics) |
| rag_doc_append | write (upsert) | light B2 | shape_summary + conflicting_existing_ids[] |
| rag_doc_patch | write | light B2 | which fields change + new size |
| rag_doc_delete | destructive | heavy B2 + D4 | text_excerpt (≤200 chars) + size + metadata |
| rag_index_delete | destructive | heavy B2 + D4 | doc_count + total_text_bytes + sample_doc_ids (≤10) + created_at |

B2 — Two-step-confirm

Every write or destructive RAG tool returns a confirm_required: true shape on the first call. To actually execute, the model must re-invoke the same tool with the same arguments plus the confirm_token. The token is bound to a SHA-256 fingerprint over (tenant_id, op_kind, index_id, doc_id?, body_hash), so it cannot be redeemed for a different operation. The TTL is 60 s. This defends against prompt injection that flips a benign read into a destructive write.
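The token binding can be sketched as follows. This is a minimal model, not the gateway's implementation: the serialisation used for the fingerprint, the way body_hash is derived (canonical JSON here), the `ConfirmStore` class, and single-use redemption are all assumptions. Only the fingerprint inputs, SHA-256, and the 60 s TTL come from the text.

```python
import hashlib
import json
import time
import uuid
from typing import Dict, Optional, Tuple

TTL_SECONDS = 60


def fingerprint(tenant_id: str, op_kind: str, index_id: str,
                doc_id: Optional[str], body: dict) -> str:
    """SHA-256 over (tenant_id, op_kind, index_id, doc_id?, body_hash)."""
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    material = json.dumps([tenant_id, op_kind, index_id, doc_id, body_hash])
    return hashlib.sha256(material.encode()).hexdigest()


class ConfirmStore:
    """Hypothetical in-memory token store: token -> (fingerprint, expiry)."""

    def __init__(self) -> None:
        self._pending: Dict[str, Tuple[str, float]] = {}

    def issue(self, fp: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = str(uuid.uuid4())
        self._pending[token] = (fp, now + TTL_SECONDS)
        return token

    def redeem(self, token: str, fp: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        entry = self._pending.get(token)
        if entry is None:
            return False
        stored_fp, expires_at = entry
        if now > expires_at:
            del self._pending[token]  # expired tokens are discarded
            return False
        if stored_fp != fp:
            return False  # token was issued for a different operation
        del self._pending[token]  # single use (an assumption, not stated in the text)
        return True
```

A token issued for deleting d1 fails when presented with a request to delete d2, because the second request hashes to a different fingerprint.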

D4 — Pre-exec-diff (heavy preview)

For the two destructive ops, the first-call preview field is structured (not just an echo of the input) so the LLM can audit what exactly will go away:

// rag_doc_delete first-call response:
{
  "confirm_required": true,
  "confirm_token": "{uuid}",
  "expires_in_seconds": 60,
  "op": "rag_doc_delete",
  "preview": {
    "index_id": "berlin-info",
    "doc_id": "d1",
    "text_size_bytes": 74,
    "text_excerpt": "Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791.",
    "text_excerpt_truncated": false,
    "metadata": {"category": "sights"}
  },
  "message": "rag_doc_delete is a write operation and requires explicit second-call confirmation. ..."
}
// rag_index_delete first-call response:
{
  "confirm_required": true,
  "preview": {
    "index_id": "berlin-info",
    "doc_count": 42,
    "total_text_bytes": 18372,
    "embedding_model": "nomic-embed",
    "created_at": "2026-04-30T08:14:22Z",
    "sample_doc_ids": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
    "sample_truncated_at": 10
  }
}
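How such previews could be assembled is sketched below, using the caps from the tool table (excerpt ≤200 chars, ≤10 sample ids). The function names are hypothetical, and fields whose derivation is not described here (embedding_model, created_at, sample_truncated_at) are left out rather than guessed.

```python
from typing import Any, Dict, List

EXCERPT_MAX_CHARS = 200
SAMPLE_MAX_IDS = 10


def doc_delete_preview(index_id: str, doc_id: str, text: str,
                       metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Preview for rag_doc_delete: size in bytes, excerpt capped in characters."""
    return {
        "index_id": index_id,
        "doc_id": doc_id,
        "text_size_bytes": len(text.encode("utf-8")),
        "text_excerpt": text[:EXCERPT_MAX_CHARS],
        "text_excerpt_truncated": len(text) > EXCERPT_MAX_CHARS,
        "metadata": metadata,
    }


def index_delete_preview(index_id: str, docs: Dict[str, str]) -> Dict[str, Any]:
    """Preview for rag_index_delete over a doc_id -> text mapping."""
    ids: List[str] = list(docs)
    return {
        "index_id": index_id,
        "doc_count": len(ids),
        "total_text_bytes": sum(len(t.encode("utf-8")) for t in docs.values()),
        "sample_doc_ids": ids[:SAMPLE_MAX_IDS],
    }
```

Capping the excerpt and the id sample keeps the preview small enough to fit in the model's context even when the index is large, while still showing what would be lost.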

Direct invocation example

# -f2- keeps everything after the first '=', so tokens containing '=' are not truncated
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2-)"

# 1) First call — get confirm_token + preview
RESP=$(curl -s -X POST http://localhost:3000/v1/tools/execute \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"name":"rag_doc_delete","arguments":{"index_id":"berlin-info","doc_id":"d1"}}')
TOK=$(echo "$RESP" | jq -r .result.confirm_token)
echo "$RESP" | jq .result.preview

# 2) Second call — same args + confirm_token
curl -s -X POST http://localhost:3000/v1/tools/execute \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d "{\"name\":\"rag_doc_delete\",\"arguments\":{\"index_id\":\"berlin-info\",\"doc_id\":\"d1\",\"confirm_token\":\"$TOK\"}}" | jq
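The same two-step flow can be wrapped in a small client helper. This is a sketch: `execute_confirmed`, the `ToolCaller` type, and the `approve` callback are hypothetical names, and the transport is injected so that the HTTP call to /v1/tools/execute (returning the `result` object) is left to the caller.

```python
from typing import Any, Callable, Dict

# call_tool(name, arguments) -> the "result" object of /v1/tools/execute.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def execute_confirmed(call_tool: ToolCaller, name: str, arguments: Dict[str, Any],
                      approve: Callable[[Dict[str, Any]], bool]) -> Dict[str, Any]:
    """Run a B2-guarded tool: first call for the preview, second call to execute."""
    first = call_tool(name, arguments)
    if not first.get("confirm_required"):
        return first  # read tools answer directly
    if not approve(first.get("preview", {})):
        # The caller rejected the preview, so no second call is made.
        return {"aborted": True, "preview": first.get("preview")}
    # Same arguments plus the token, as required by the two-step confirm.
    confirmed = dict(arguments, confirm_token=first["confirm_token"])
    return call_tool(name, confirmed)
```

Passing the preview through an `approve` callback gives the caller one place to inspect what will be removed before the token is spent.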

Limits

| Limit | Value | Reason |
|-------|-------|--------|
| documents per POST (full or append) | 256 | matches /v1/embeddings per-request cap |
| per-document text length | 8192 bytes (~2k tokens) | matches embedding-model --max-model-len |
| top_k on query | 1..50 | safety + UI sanity |
| embedding model | nomic-embed-text-v1.5 (dim=768) | local, low-latency, tenant-isolated |
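A client can check a batch against the ingest limits before posting. The sketch below is a hypothetical pre-flight helper (the name `check_ingest` and the error strings are not part of the API); the two constants are the documented caps.

```python
from typing import Dict, List

MAX_DOCS_PER_POST = 256
MAX_TEXT_BYTES = 8192


def check_ingest(documents: List[Dict[str, str]]) -> List[str]:
    """Return a list of limit violations; empty means the batch is acceptable."""
    errors: List[str] = []
    if len(documents) > MAX_DOCS_PER_POST:
        errors.append(f"{len(documents)} documents exceed the cap of {MAX_DOCS_PER_POST}")
    for doc in documents:
        size = len(doc["text"].encode("utf-8"))  # the limit is in bytes, not characters
        if size > MAX_TEXT_BYTES:
            errors.append(f"{doc['id']}: {size} bytes exceed {MAX_TEXT_BYTES}")
    return errors
```

Measuring UTF-8 bytes rather than characters matters for German text: each umlaut counts as two bytes, so a document can exceed the limit well before it reaches 8192 characters.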