`/a1/rag` — Retrieval-Augmented Generation

Tenant-scoped RAG indices on top of the local nomic-embed-text-v1.5 model (dim=768, ADR 0012). Indices persist per tenant in SQLite; restart survival is verified.

Cut 2.23c (ADR 0016): SPASS-User-Id-Header is mandatory on every authenticated /a1/* endpoint (including /a1/rag/* operations and agent-sessions). A user_id in the body or query string is rejected with HTTP 400. RAG indices are tenant-scoped, not user-scoped, but the audit trail still requires the header.

Two consumption patterns are supported:

Passive RAG: agent.yaml rag_index: foo injects the top-k chunks before every prompt (rig dynamic_context).
Active RAG: the rag_query stack-tool lets the LLM decide when and what to retrieve, and it can chain multiple indices per turn (the industry-standard agentic-RAG pattern).

Both patterns reuse the same tenant-scoped IndexRegistry.

Architecture

caller → POST /a1/rag/indices/<id>/documents
       │   embed via /v1/embeddings (nomic-embed)
       │   persist (tenant_id, id, doc_id, text, metadata, embeddings_json)
       ▼   to data/sqlite/rag/<tenant_id>.sqlite
       in-memory InMemoryVectorStore + InMemoryVectorIndex (rig)

On POST .../query the query is embedded with the same model and run as cosine-similarity top_n against the in-memory index. On rust-api restart, load_all_from_disk rebuilds every tenant's index from stored embeddings — no re-embedding at boot.
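
The query path can be illustrated with a small sketch. It is not the rig implementation: `cosine` and `top_k` are hypothetical helper names, the dict layout is assumed, and the 3-dimensional toy vectors stand in for the 768-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first.

    docs maps doc_id -> embedding (hypothetical in-memory layout).
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for stored embeddings loaded from SQLite.
index = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [0.7, 0.7, 0.0]}
hits = top_k([0.0, 1.0, 0.1], index, k=2)
print([doc_id for doc_id, _ in hits])  # ['d2', 'd3']
```

Because only the stored vectors are needed for scoring, rebuilding the index at boot requires no call to the embedding model.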

Index-level operations

Method	Path	Description
GET	`/a1/rag/indices`	List the caller's indices
GET	`/a1/rag/indices/<id>`	Index summary (doc_count, embedding_model, created_at)
DELETE	`/a1/rag/indices/<id>`	Drop the entire index + all docs
POST	`/a1/rag/indices/<id>/documents`	Full-replace ingest (drops old docs first)
POST	`/a1/rag/indices/<id>/query`	Pure-retrieval, top-k chunks

Document-level operations (Cut 2.9)

For incremental knowledge-base maintenance, so the caller does not need to re-post existing docs:

Method	Path	Description
GET	`/a1/rag/indices/<id>/documents`	List all docs in the index (id + text + metadata, no similarity)
GET	`/a1/rag/indices/<id>/documents/<doc_id>`	Read one doc
POST	`/a1/rag/indices/<id>/documents/append`	Incremental ingest (existing docs preserved)
PATCH	`/a1/rag/indices/<id>/documents/<doc_id>`	Update text and/or metadata (re-embeds)
DELETE	`/a1/rag/indices/<id>/documents/<doc_id>`	Drop one doc (cache rebuilt)

PATCH accepts {text?, metadata?} — at least one of the two must be provided. DELETE is idempotent (returns deleted: false if the doc was already gone).
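
The PATCH and DELETE rules can be sketched as follows. This is an illustration under assumptions, not the server code: `apply_patch` and `delete_doc` are hypothetical names, docs are plain dicts, and replacing (rather than merging) the metadata object is an assumption.

```python
def apply_patch(doc, body):
    """Apply a PATCH body of the form {text?, metadata?} to a doc dict.

    Raises ValueError when neither field is present, mirroring the
    "at least one of the two" rule.
    """
    if "text" not in body and "metadata" not in body:
        raise ValueError("PATCH body needs at least one of: text, metadata")
    if "text" in body:
        doc["text"] = body["text"]
    if "metadata" in body:
        # Assumption: metadata is replaced wholesale, not merged key by key.
        doc["metadata"] = body["metadata"]
    return doc


def delete_doc(docs, doc_id):
    """Idempotent delete: report whether anything was actually removed."""
    return {"deleted": docs.pop(doc_id, None) is not None}


docs = {"d1": {"text": "old text", "metadata": {}}}
apply_patch(docs["d1"], {"metadata": {"category": "sights"}})
print(delete_doc(docs, "d1"))  # first call removes the doc
print(delete_doc(docs, "d1"))  # second call finds nothing to remove
```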

Tenant-isolation

All ops are scoped by (tenant_id, public_id). Tenant A's index foo is invisible to Tenant B; both tenants can have an index named foo independently. Each tenant has its own SQLite file. Verified end-to-end in e2e/tests/a1-multi-tenant.spec.ts.
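
A toy model of the (tenant_id, public_id) scoping, for illustration only. `ToyIndexRegistry` is a hypothetical stand-in and does not reflect the real IndexRegistry API or its SQLite persistence.

```python
class ToyIndexRegistry:
    """In-memory registry keyed by (tenant_id, public_id)."""

    def __init__(self):
        self._indices = {}

    def put(self, tenant_id, public_id, docs):
        self._indices[(tenant_id, public_id)] = docs

    def get(self, tenant_id, public_id):
        # A lookup under another tenant_id misses, even for the same public_id.
        return self._indices.get((tenant_id, public_id))

    def list_for(self, tenant_id):
        return sorted(pid for (tid, pid) in self._indices if tid == tenant_id)


registry = ToyIndexRegistry()
registry.put("tenant-a", "foo", ["a-doc"])
registry.put("tenant-b", "foo", ["b-doc"])
print(registry.get("tenant-a", "foo"))  # tenant A sees only its own "foo"
print(registry.list_for("tenant-b"))    # tenant B also has an index named "foo"
print(registry.get("tenant-b", "bar"))  # unknown index: None
```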

Quick walkthrough

TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2)"

# 1) Ingest a knowledge-base
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"documents":[
    {"id":"d1","text":"Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791."},
    {"id":"d2","text":"Der Tiergarten ist der größte innerstädtische Park Berlins, 210 Hektar."}
  ]}' | jq

# 2) Pure-retrieval query
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/query \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"query":"Wo ist der größte Park?","top_k":2}' | jq

# 3) Incrementally add a new doc
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents/append \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"documents":[{"id":"d3","text":"Der Kurfürstendamm im Westen ist die bekannteste Einkaufsstraße."}]}' | jq

# 4) Patch an existing doc
curl -s -X PATCH http://localhost:3000/a1/rag/indices/berlin-info/documents/d1 \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"metadata":{"category":"sights","source":"manual"}}' | jq

# 5) Use it in an agent (passive: agent.yaml `rag_index: berlin-info`)
curl -s -X POST http://localhost:3000/a1/agents/berlin-rag/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"message":"Wo finde ich den größten Park?"}' | jq

# 6) Use it as active-RAG via tool (agent.yaml `tools: [rag_query]`)
curl -s -X POST http://localhost:3000/a1/agents/librarian/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"message":"Wann wurde das Brandenburger Tor gebaut?"}' | jq
# → LLM calls rag_query(index_id="berlin-info", query="...") then synthesizes

The `rag_query` stack-tool (Cut 2.8)

The same IndexRegistry is exposed as a tool that any LLM call (/v1, /c1, /a1) can invoke. Schema:

{
  "name": "rag_query",
  "parameters": {
    "type": "object",
    "properties": {
      "index_id": {"type": "string"},
      "query":    {"type": "string"},
      "top_k":    {"type": "integer", "default": 5, "minimum": 1, "maximum": 50}
    },
    "required": ["index_id", "query"]
  }
}

The tool inherits tenant-scoping from ToolCtx.identity.tenant_id — Tenant B cannot query Tenant A's indices regardless of how the tool is invoked.
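
For callers that want to pre-check arguments before invoking the tool, the schema's constraints can be expressed as a short validator. This is a sketch; `normalize_rag_query_args` is a hypothetical helper, and the server may report violations differently.

```python
def normalize_rag_query_args(args):
    """Validate rag_query arguments against the schema and apply the top_k default."""
    for field in ("index_id", "query"):
        if not isinstance(args.get(field), str):
            raise ValueError(f"{field} is required and must be a string")
    top_k = args.get("top_k", 5)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise ValueError("top_k must be an integer")
    if not 1 <= top_k <= 50:
        raise ValueError("top_k must be between 1 and 50")
    return {"index_id": args["index_id"], "query": args["query"], "top_k": top_k}


print(normalize_rag_query_args({"index_id": "berlin-info", "query": "largest park"}))
```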

The 9 RAG-CRUD stack-tools (Cut 2.11)

The same IndexRegistry is also exposed for full agent-driven knowledge-base maintenance, so the LLM can curate its own indices end to end. All 9 tools inherit per-tenant isolation via ToolCtx.identity.tenant_id. They are reachable via /v1/tools/execute and agent.yaml tools: [...] lists, and, when added to TOOLS_ENABLED_DEFAULT, are auto-injected into chat-completions.

Tool	Mode	Confirm	Preview
`rag_index_list`	read	none	—
`rag_index_describe`	read	none	—
`rag_doc_list`	read	none	—
`rag_doc_get`	read	none	—
`rag_index_create`	write (replace)	light B2	shape_summary (doc_count, total_bytes, replace_semantics)
`rag_doc_append`	write (upsert)	light B2	shape_summary + `conflicting_existing_ids[]`
`rag_doc_patch`	write	light B2	which fields change + new size
`rag_doc_delete`	destructive	heavy B2 + D4	text_excerpt (≤200 chars) + size + metadata
`rag_index_delete`	destructive	heavy B2 + D4	doc_count + total_text_bytes + sample_doc_ids (≤10) + created_at

B2 — Two-step-confirm

Every write/destructive RAG-tool returns a confirm_required: true shape on the first call. To actually execute, the model must re-invoke the same tool with the same arguments plus the confirm_token. The token is bound to a SHA-256 fingerprint over (tenant_id, op_kind, index_id, doc_id?, body_hash), so it cannot be redeemed for a different operation. The TTL is 60 s. This defends against prompt injection that flips a benign read into a destructive write.
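
The binding between token and operation can be sketched as below. This is a minimal model under stated assumptions, not the server's implementation: the JSON serialization of the fingerprint fields, the `ConfirmTokens` class, and single-use redemption are all assumptions made for illustration.

```python
import hashlib
import json
import time
import uuid

TTL_SECONDS = 60


def fingerprint(tenant_id, op_kind, index_id, doc_id, body):
    """SHA-256 over the operation's identifying fields plus a hash of the body."""
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    material = json.dumps([tenant_id, op_kind, index_id, doc_id, body_hash])
    return hashlib.sha256(material.encode()).hexdigest()


class ConfirmTokens:
    """In-memory token store (hypothetical); maps token -> (fingerprint, expiry)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = {}

    def issue(self, fp):
        token = str(uuid.uuid4())
        self._pending[token] = (fp, self._clock() + TTL_SECONDS)
        return token

    def redeem(self, token, fp):
        """True only for an unexpired token issued for exactly this fingerprint."""
        entry = self._pending.pop(token, None)  # assumption: tokens are single-use
        if entry is None:
            return False
        issued_fp, expiry = entry
        return issued_fp == fp and self._clock() < expiry


tokens = ConfirmTokens()
fp_d1 = fingerprint("tenant-a", "rag_doc_delete", "berlin-info", "d1", {})
fp_d2 = fingerprint("tenant-a", "rag_doc_delete", "berlin-info", "d2", {})
print(tokens.redeem(tokens.issue(fp_d1), fp_d2))  # token for d1 replayed against d2
print(tokens.redeem(tokens.issue(fp_d1), fp_d1))  # same operation, within the TTL
```

Because the fingerprint covers the tenant, the operation kind, the target, and the body, changing any of them between the first and second call invalidates the token.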

D4 — Pre-exec-diff (heavy preview)

For the two destructive ops, the first-call preview field is structured (not just an echo of the input) so the LLM can audit what exactly will go away:

// rag_doc_delete first-call response:
{
  "confirm_required": true,
  "confirm_token": "{uuid}",
  "expires_in_seconds": 60,
  "op": "rag_doc_delete",
  "preview": {
    "index_id": "berlin-info",
    "doc_id": "d1",
    "text_size_bytes": 74,
    "text_excerpt": "Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791.",
    "text_excerpt_truncated": false,
    "metadata": {"category": "sights"}
  },
  "message": "rag_doc_delete is a write operation and requires explicit second-call confirmation. ..."
}

// rag_index_delete first-call response:
{
  "confirm_required": true,
  "preview": {
    "index_id": "berlin-info",
    "doc_count": 42,
    "total_text_bytes": 18372,
    "embedding_model": "nomic-embed",
    "created_at": "2026-04-30T08:14:22Z",
    "sample_doc_ids": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
    "sample_truncated_at": 10
  }
}
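
How the rag_doc_delete preview fields relate to the stored document can be sketched like this. The function name is hypothetical, and measuring the size in UTF-8 bytes while cutting the excerpt at 200 characters is an assumption based on the field names and the "≤200 chars" limit.

```python
EXCERPT_MAX_CHARS = 200


def doc_delete_preview(index_id, doc_id, text, metadata):
    """Build the preview object for a pending rag_doc_delete (illustrative only)."""
    return {
        "index_id": index_id,
        "doc_id": doc_id,
        "text_size_bytes": len(text.encode("utf-8")),
        "text_excerpt": text[:EXCERPT_MAX_CHARS],
        "text_excerpt_truncated": len(text) > EXCERPT_MAX_CHARS,
        "metadata": metadata,
    }


preview = doc_delete_preview(
    "berlin-info",
    "d1",
    "Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791.",
    {"category": "sights"},
)
print(preview["text_size_bytes"], preview["text_excerpt_truncated"])
```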

Direct invocation example

TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2)"

# 1) First call — get confirm_token + preview
RESP=$(curl -s -X POST http://localhost:3000/v1/tools/execute \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"name":"rag_doc_delete","arguments":{"index_id":"berlin-info","doc_id":"d1"}}')
TOK=$(echo "$RESP" | jq -r .result.confirm_token)
echo "$RESP" | jq .result.preview

# 2) Second call — same args + confirm_token
curl -s -X POST http://localhost:3000/v1/tools/execute \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d "{\"name\":\"rag_doc_delete\",\"arguments\":{\"index_id\":\"berlin-info\",\"doc_id\":\"d1\",\"confirm_token\":\"$TOK\"}}" | jq
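
The escaped JSON in the second curl call is easy to get wrong by hand. A client could instead derive the second payload from the first response, as in this sketch. `build_confirm_request` is a hypothetical helper; the `result.confirm_token` path is taken from the jq expressions above.

```python
import json


def build_confirm_request(name, arguments, first_response):
    """Build the second-call payload: same arguments plus the confirm_token."""
    result = first_response["result"]
    if not result.get("confirm_required"):
        raise ValueError("first call did not ask for confirmation")
    confirmed = dict(arguments, confirm_token=result["confirm_token"])
    return {"name": name, "arguments": confirmed}


# Stand-in for the parsed JSON body of the first /v1/tools/execute call.
first = {"result": {"confirm_required": True, "confirm_token": "tok-123", "preview": {}}}
args = {"index_id": "berlin-info", "doc_id": "d1"}
print(json.dumps(build_confirm_request("rag_doc_delete", args, first)))
```

The original `args` dict is left unchanged, so the same arguments can be reused if the token expires and the first call has to be repeated.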

Limits

Limit	Value	Reason
documents per `POST` (full or append)	256	matches `/v1/embeddings` per-request cap
per-document text length	8192 bytes (~2k tokens)	matches embedding-model `--max-model-len`
`top_k` on query	1..50	safety + UI sanity
embedding model	`nomic-embed-text-v1.5` (dim=768)	local, low-latency, tenant-isolated
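
A client-side pre-check against these limits might look like the sketch below. The helper names are hypothetical, and counting text length in UTF-8 bytes is an assumption.

```python
MAX_DOCS_PER_POST = 256
MAX_TEXT_BYTES = 8192


def check_ingest_batch(documents):
    """Return a list of limit violations (empty if the batch is within limits)."""
    problems = []
    if len(documents) > MAX_DOCS_PER_POST:
        problems.append(f"{len(documents)} documents exceeds the cap of {MAX_DOCS_PER_POST}")
    for doc in documents:
        size = len(doc["text"].encode("utf-8"))
        if size > MAX_TEXT_BYTES:
            problems.append(f"{doc['id']}: {size} bytes exceeds the cap of {MAX_TEXT_BYTES}")
    return problems


def batches(documents, size=MAX_DOCS_PER_POST):
    """Split a list into consecutive chunks of at most `size` documents."""
    return [documents[i:i + size] for i in range(0, len(documents), size)]


corpus = [{"id": f"d{i}", "text": "x"} for i in range(600)]
print(check_ingest_batch(corpus))
print([len(chunk) for chunk in batches(corpus)])  # [256, 256, 88]
```

When loading a corpus larger than one request, only the first chunk should go to the full-replace endpoint; later chunks belong on .../documents/append, since a second full-replace would drop the docs ingested by the first.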
