/a1/rag — Retrieval-Augmented Generation
Tenant-scoped RAG indices on top of the local nomic-embed-text-v1.5 model (dim=768, ADR 0012). Indices persist per-tenant in SQLite; restart-survival verified. Two consumption patterns are supported:
Cut 2.23c (ADR 0016): the SPASS-User-Id header is mandatory on every authenticated /a1/* endpoint (incl. /a1/rag/* operations + agent-sessions). Body and query user_id are rejected with HTTP 400. RAG indices are tenant-scoped (not user-scoped), but the audit trail still requires the header.
- Passive RAG: agent.yaml `rag_index: foo` injects top-k chunks before every prompt (rig `dynamic_context`).
- Active RAG: the `rag_query` stack-tool lets the LLM decide WHEN and WHAT to retrieve and can chain multiple indices per turn (industry-standard agentic-RAG pattern).
Both patterns reuse the same tenant-scoped IndexRegistry.
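To make the difference between the two patterns concrete, the following Python sketch shows the passive variant: retrieval runs on every turn and its chunks are prepended to the prompt. The function names, the prompt layout, and the `fake_retrieve` stand-in are illustrative assumptions, not the actual rig `dynamic_context` implementation.

```python
from typing import Callable, List


def build_passive_prompt(
    user_message: str,
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> str:
    """Passive RAG: retrieve on every turn and prepend the chunks."""
    chunks = retrieve(user_message, top_k)
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {user_message}"


def fake_retrieve(query: str, top_k: int) -> List[str]:
    """Hypothetical stand-in for a tenant-scoped index lookup."""
    corpus = [
        "Der Tiergarten ist der größte innerstädtische Park Berlins.",
        "Das Brandenburger Tor steht am Pariser Platz.",
    ]
    return corpus[:top_k]


prompt = build_passive_prompt("Wo ist der größte Park?", fake_retrieve, top_k=1)
print(prompt)
```

In the active pattern the model would instead emit a `rag_query` tool call, so retrieval only runs when the model asks for it.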
Architecture
caller → POST /a1/rag/indices/<id>/documents
│ embed via /v1/embeddings (nomic-embed)
│ persist (tenant_id, id, doc_id, text, metadata, embeddings_json)
▼ to data/sqlite/rag/<tenant_id>.sqlite
in-memory InMemoryVectorStore + InMemoryVectorIndex (rig)
On POST .../query the query is embedded with the same model and run as cosine-similarity top_n against the in-memory index. On rust-api restart, load_all_from_disk rebuilds every tenant's index from stored embeddings — no re-embedding at boot.
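The query path and the restart path described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: it uses 3-dimensional vectors instead of the real 768, the row layout `(doc_id, embeddings_json)` is a simplification of the persisted columns, and `load_index` / `top_n` are hypothetical helper names rather than the rig API.

```python
import json
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def load_index(rows: List[Tuple[str, str]]) -> Dict[str, List[float]]:
    """Rebuild doc_id -> vector from stored JSON; no embedding call needed."""
    return {doc_id: json.loads(emb_json) for doc_id, emb_json in rows}


def top_n(
    index: Dict[str, List[float]], query_vec: List[float], n: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best n."""
    scored = [(cosine(vec, query_vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:n]]


# Rows as they might come back from the per-tenant SQLite file (toy data).
rows = [
    ("d1", "[1.0, 0.0, 0.0]"),
    ("d2", "[0.0, 1.0, 0.0]"),
    ("d3", "[0.7, 0.7, 0.0]"),
]
index = load_index(rows)
hits = top_n(index, [0.0, 1.0, 0.0], 2)
print(hits)
```

Because the vectors are persisted, the boot cost in this sketch is one JSON parse per document rather than one embedding request.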
Index-level operations
| Method | Path | Description |
|---|---|---|
| GET | /a1/rag/indices | List the caller's indices |
| GET | /a1/rag/indices/<id> | Index summary (doc_count, embedding_model, created_at) |
| DELETE | /a1/rag/indices/<id> | Drop the entire index + all docs |
| POST | /a1/rag/indices/<id>/documents | Full-replace ingest (drops old docs first) |
| POST | /a1/rag/indices/<id>/query | Pure-retrieval, top-k chunks |
Document-level operations (Cut 2.9)
For incremental knowledge-base maintenance, so the caller doesn't need to re-post existing docs:
| Method | Path | Description |
|---|---|---|
| GET | /a1/rag/indices/<id>/documents | List all docs in the index (id + text + metadata, no similarity) |
| GET | /a1/rag/indices/<id>/documents/<doc_id> | Read one doc |
| POST | /a1/rag/indices/<id>/documents/append | Incremental ingest (existing docs preserved) |
| PATCH | /a1/rag/indices/<id>/documents/<doc_id> | Update text and/or metadata (re-embeds) |
| DELETE | /a1/rag/indices/<id>/documents/<doc_id> | Drop one doc (cache rebuilt) |
PATCH accepts {text?, metadata?} — at least one of the two must be provided. DELETE is idempotent (returns deleted: false if the doc was already gone).
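The two rules above (PATCH needs at least one field, DELETE is idempotent) could look roughly like this in Python. The helper names and the in-memory `docs` dict are assumptions for illustration; how the real endpoint treats unknown body keys is not stated here, so this sketch simply ignores them.

```python
from typing import Any, Dict


def validate_patch(body: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only text/metadata and insist that at least one is present."""
    changes = {k: v for k, v in body.items() if k in ("text", "metadata")}
    if not changes:
        raise ValueError("PATCH body must contain 'text' and/or 'metadata'")
    return changes


def delete_doc(docs: Dict[str, Dict[str, Any]], doc_id: str) -> Dict[str, bool]:
    """Idempotent delete: reports whether anything was actually removed."""
    removed = docs.pop(doc_id, None)
    return {"deleted": removed is not None}


docs = {"d1": {"text": "Das Brandenburger Tor ...", "metadata": {}}}
first = delete_doc(docs, "d1")
second = delete_doc(docs, "d1")
print(first, second)
```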
Tenant-isolation
All ops are scoped by (tenant_id, public_id). Tenant A's index foo is invisible to Tenant B; both tenants can have an index named foo independently. Each tenant has its own SQLite file. Verified end-to-end in e2e/tests/a1-multi-tenant.spec.ts.
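A minimal Python sketch of the scoping rule, assuming a plain dict keyed by `(tenant_id, public_id)`. `ToyIndexRegistry` and `sqlite_path` are hypothetical names; only the key shape and the `data/sqlite/rag/<tenant_id>.sqlite` layout are taken from this document.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple


class ToyIndexRegistry:
    """Toy registry: every lookup needs the tenant id as part of the key."""

    def __init__(self) -> None:
        self._indices: Dict[Tuple[str, str], List[str]] = {}

    def put(self, tenant_id: str, public_id: str, docs: List[str]) -> None:
        self._indices[(tenant_id, public_id)] = docs

    def get(self, tenant_id: str, public_id: str) -> Optional[List[str]]:
        return self._indices.get((tenant_id, public_id))

    def list_ids(self, tenant_id: str) -> List[str]:
        return sorted(pid for (tid, pid) in self._indices if tid == tenant_id)


def sqlite_path(tenant_id: str) -> PurePosixPath:
    """One SQLite file per tenant, following the layout shown above."""
    return PurePosixPath("data/sqlite/rag") / f"{tenant_id}.sqlite"


registry = ToyIndexRegistry()
registry.put("tenant-a", "foo", ["doc from A"])
registry.put("tenant-b", "foo", ["doc from B"])
print(registry.get("tenant-a", "foo"), sqlite_path("tenant-a"))
```

Both tenants own an index called `foo`, yet neither can reach the other's entry because the tenant id is never optional in the key.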
Quick walkthrough
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2)"
# 1) Ingest a knowledge-base
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"documents":[
{"id":"d1","text":"Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791."},
{"id":"d2","text":"Der Tiergarten ist der größte innerstädtische Park Berlins, 210 Hektar."}
]}' | jq
# 2) Pure-retrieval query
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/query \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"query":"Wo ist der größte Park?","top_k":2}' | jq
# 3) Incrementally add a new doc
curl -s -X POST http://localhost:3000/a1/rag/indices/berlin-info/documents/append \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"documents":[{"id":"d3","text":"Der Kurfürstendamm im Westen ist die bekannteste Einkaufsstraße."}]}' | jq
# 4) Patch an existing doc
curl -s -X PATCH http://localhost:3000/a1/rag/indices/berlin-info/documents/d1 \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"metadata":{"category":"sights","source":"manual"}}' | jq
# 5) Use it in an agent (passive: agent.yaml `rag_index: berlin-info`)
curl -s -X POST http://localhost:3000/a1/agents/berlin-rag/chat \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"message":"Wo finde ich den größten Park?"}' | jq
# 6) Use it as active-RAG via tool (agent.yaml `tools: [rag_query]`)
curl -s -X POST http://localhost:3000/a1/agents/librarian/chat \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"message":"Wann wurde das Brandenburger Tor gebaut?"}' | jq
# → LLM calls rag_query(index_id="berlin-info", query="...") then synthesizes
The rag_query stack-tool (Cut 2.8)
The same IndexRegistry is exposed as a tool that every LLM call (/v1, /c1, /a1) can invoke. Schema:
{
"name": "rag_query",
"parameters": {
"type": "object",
"properties": {
"index_id": {"type": "string"},
"query": {"type": "string"},
"top_k": {"type": "integer", "default": 5, "minimum": 1, "maximum": 50}
},
"required": ["index_id", "query"]
}
}
The tool inherits tenant-scoping from ToolCtx.identity.tenant_id — Tenant B cannot query Tenant A's indices regardless of how the tool is invoked.
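As an illustration of what the schema above implies for a tool handler, here is a hedged Python sketch that checks the required fields and applies the `top_k` default and bounds. `normalize_rag_query_args` is a hypothetical helper; the real handler may validate differently.

```python
from typing import Any, Dict


def normalize_rag_query_args(args: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the schema: required strings, top_k default 5, range 1..50."""
    for key in ("index_id", "query"):
        if not isinstance(args.get(key), str):
            raise ValueError(f"'{key}' is required and must be a string")
    top_k = args.get("top_k", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise ValueError("'top_k' must be an integer")
    if not 1 <= top_k <= 50:
        raise ValueError("'top_k' must be between 1 and 50")
    return {"index_id": args["index_id"], "query": args["query"], "top_k": top_k}


normalized = normalize_rag_query_args(
    {"index_id": "berlin-info", "query": "Wo ist der größte Park?"}
)
print(normalized)
```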
The 9 RAG-CRUD stack-tools (Cut 2.11)
The same IndexRegistry is also exposed for full agent-driven knowledge-base maintenance, so the LLM can curate its own indices end-to-end. All 9 tools inherit per-tenant isolation via ToolCtx.identity.tenant_id and are reachable via /v1/tools/execute, via agent.yaml tools: [...] lists, and (when added to TOOLS_ENABLED_DEFAULT) via auto-injection into chat-completions.
| Tool | Mode | Confirm | Preview |
|---|---|---|---|
| rag_index_list | read | none | n/a |
| rag_index_describe | read | none | n/a |
| rag_doc_list | read | none | n/a |
| rag_doc_get | read | none | n/a |
| rag_index_create | write (replace) | light B2 | shape_summary (doc_count, total_bytes, replace_semantics) |
| rag_doc_append | write (upsert) | light B2 | shape_summary + conflicting_existing_ids[] |
| rag_doc_patch | write | light B2 | which fields change + new size |
| rag_doc_delete | destructive | heavy B2 + D4 | text_excerpt (≤200 chars) + size + metadata |
| rag_index_delete | destructive | heavy B2 + D4 | doc_count + total_text_bytes + sample_doc_ids (≤10) + created_at |
B2 — Two-step-confirm
Every write/destructive RAG-tool returns a confirm_required: true shape on the first call; the model must re-invoke the same tool with the same arguments PLUS the confirm_token to actually execute. The token is bound to a SHA-256 fingerprint over (tenant_id, op_kind, index_id, doc_id?, body_hash), so it cannot be redeemed for a different operation. TTL is 60 s. This defends against prompt-injection that flips a benign read into a destructive write.
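The binding between token and operation can be sketched in Python as follows. Several details are assumptions and not confirmed by this document: the field separator, the canonical JSON used for `body_hash`, the single-use behaviour of `redeem`, and the `ConfirmStore` class itself. Only the fingerprint inputs, SHA-256, and the 60 s TTL come from the description above.

```python
import hashlib
import json
import uuid
from typing import Any, Dict, Optional, Tuple

TTL_SECONDS = 60


def fingerprint(tenant_id: str, op_kind: str, index_id: str,
                doc_id: Optional[str], body: Dict[str, Any]) -> str:
    """SHA-256 over (tenant_id, op_kind, index_id, doc_id?, body_hash)."""
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    body_hash = hashlib.sha256(canonical).hexdigest()
    parts = [tenant_id, op_kind, index_id, doc_id or "", body_hash]
    # Unit-separator join (assumed) keeps field boundaries unambiguous.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class ConfirmStore:
    """Hypothetical store mapping confirm_token -> (fingerprint, expiry)."""

    def __init__(self) -> None:
        self._pending: Dict[str, Tuple[str, float]] = {}

    def issue(self, fp: str, now: float) -> str:
        token = str(uuid.uuid4())
        self._pending[token] = (fp, now + TTL_SECONDS)
        return token

    def redeem(self, token: str, fp: str, now: float) -> bool:
        # Single use (assumed): the token is consumed even when the check fails.
        entry = self._pending.pop(token, None)
        if entry is None:
            return False
        stored_fp, expires_at = entry
        return stored_fp == fp and now <= expires_at


store = ConfirmStore()
delete_fp = fingerprint("tenant-a", "rag_doc_delete", "berlin-info", "d1", {})
token = store.issue(delete_fp, now=0.0)
print(store.redeem(token, delete_fp, now=30.0))
```

A token issued for `rag_doc_delete` on `d1` fails the fingerprint comparison if it is presented with any other tenant, operation, index, document, or body.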
D4 — Pre-exec-diff (heavy preview)
For the two destructive ops, the first-call preview field is structured (not just an echo of the input) so the LLM can audit what exactly will go away:
// rag_doc_delete first-call response:
{
"confirm_required": true,
"confirm_token": "{uuid}",
"expires_in_seconds": 60,
"op": "rag_doc_delete",
"preview": {
"index_id": "berlin-info",
"doc_id": "d1",
"text_size_bytes": 74,
"text_excerpt": "Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791.",
"text_excerpt_truncated": false,
"metadata": {"category": "sights"}
},
"message": "rag_doc_delete is a write operation and requires explicit second-call confirmation. ..."
}
// rag_index_delete first-call response:
{
"confirm_required": true,
"preview": {
"index_id": "berlin-info",
"doc_count": 42,
"total_text_bytes": 18372,
"embedding_model": "nomic-embed",
"created_at": "2026-04-30T08:14:22Z",
"sample_doc_ids": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
"sample_truncated_at": 10
}
}
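The preview fields shown above could be derived along these lines. This is a Python sketch under assumptions: `doc_delete_preview` and `index_delete_sample` are hypothetical helpers, and it assumes the excerpt limit counts characters while the size counts UTF-8 bytes, as the field names suggest.

```python
from typing import Any, Dict, List

EXCERPT_MAX_CHARS = 200
SAMPLE_MAX_IDS = 10


def doc_delete_preview(index_id: str, doc_id: str, text: str,
                       metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Structured preview for a single-document delete."""
    return {
        "index_id": index_id,
        "doc_id": doc_id,
        "text_size_bytes": len(text.encode("utf-8")),
        "text_excerpt": text[:EXCERPT_MAX_CHARS],
        "text_excerpt_truncated": len(text) > EXCERPT_MAX_CHARS,
        "metadata": metadata,
    }


def index_delete_sample(doc_ids: List[str]) -> Dict[str, Any]:
    """Doc count plus a capped sample of ids for an index delete."""
    return {"doc_count": len(doc_ids), "sample_doc_ids": doc_ids[:SAMPLE_MAX_IDS]}


short_text = "Das Brandenburger Tor steht im Bezirk Mitte am Pariser Platz, gebaut 1791."
preview = doc_delete_preview("berlin-info", "d1", short_text, {"category": "sights"})
sample = index_delete_sample([f"d{i}" for i in range(1, 43)])
print(preview["text_excerpt_truncated"], sample["doc_count"])
```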
Direct invocation example
TOKEN="$(grep '^RUST_API_BEARER=' /home/dietmar/dgx-llm/.env | cut -d= -f2)"
# 1) First call — get confirm_token + preview
RESP=$(curl -s -X POST http://localhost:3000/v1/tools/execute \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"name":"rag_doc_delete","arguments":{"index_id":"berlin-info","doc_id":"d1"}}')
TOK=$(echo "$RESP" | jq -r .result.confirm_token)
echo "$RESP" | jq .result.preview
# 2) Second call — same args + confirm_token
curl -s -X POST http://localhost:3000/v1/tools/execute \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d "{\"name\":\"rag_doc_delete\",\"arguments\":{\"index_id\":\"berlin-info\",\"doc_id\":\"d1\",\"confirm_token\":\"$TOK\"}}" | jq
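The same two-call flow can be wrapped in a small client helper. The Python sketch below keeps the transport abstract: `execute` is any callable that performs the tool call and returns the unwrapped `result` object, and `approve` is a hypothetical hook that inspects the preview before confirming. `fake_execute` and its `{"deleted": true}` second-call response are stand-ins for testing, not the real server behaviour.

```python
from typing import Any, Callable, Dict

ToolResult = Dict[str, Any]


def execute_with_confirm(
    execute: Callable[[str, Dict[str, Any]], ToolResult],
    name: str,
    arguments: Dict[str, Any],
    approve: Callable[[Dict[str, Any]], bool],
) -> ToolResult:
    """First call fetches token + preview; second call repeats args + token."""
    first = execute(name, arguments)
    if not first.get("confirm_required"):
        return first  # read-only tools answer directly
    preview = first.get("preview", {})
    if not approve(preview):
        return {"aborted": True, "preview": preview}
    confirmed = dict(arguments, confirm_token=first["confirm_token"])
    return execute(name, confirmed)


def fake_execute(name: str, arguments: Dict[str, Any]) -> ToolResult:
    """Hypothetical in-process stand-in for POST /v1/tools/execute."""
    if arguments.get("confirm_token") == "tok-123":
        return {"deleted": True}
    return {
        "confirm_required": True,
        "confirm_token": "tok-123",
        "preview": {"doc_id": arguments["doc_id"]},
    }


result = execute_with_confirm(
    fake_execute,
    "rag_doc_delete",
    {"index_id": "berlin-info", "doc_id": "d1"},
    approve=lambda preview: preview.get("doc_id") == "d1",
)
print(result)
```

Putting the approval hook between the two calls is the point of the pattern: the preview is checked before anything is destroyed.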
Limits
| Limit | Value | Reason |
|---|---|---|
| documents per POST (full or append) | 256 | matches /v1/embeddings per-request cap |
| per-document text length | 8192 bytes (~2k tokens) | matches embedding-model --max-model-len |
| top_k on query | 1..50 | safety + UI sanity |
| embedding model | nomic-embed-text-v1.5 (dim=768) | local, low-latency, tenant-isolated |
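A client can check the ingest limits before sending a request. The Python sketch below is a hypothetical pre-flight helper (`check_ingest` is not part of the API); it measures text length in UTF-8 bytes, since the table states the limit in bytes, so non-ASCII text reaches the cap with fewer characters.

```python
from typing import Dict, List

MAX_DOCS_PER_POST = 256
MAX_TEXT_BYTES = 8192


def check_ingest(documents: List[Dict[str, str]]) -> List[str]:
    """Return a list of limit violations; an empty list means OK to send."""
    errors: List[str] = []
    if len(documents) > MAX_DOCS_PER_POST:
        errors.append(
            f"{len(documents)} documents exceeds the cap of {MAX_DOCS_PER_POST}"
        )
    for doc in documents:
        size = len(doc["text"].encode("utf-8"))
        if size > MAX_TEXT_BYTES:
            errors.append(f"{doc['id']}: {size} bytes exceeds {MAX_TEXT_BYTES}")
    return errors


problems = check_ingest([
    {"id": "ok", "text": "Der Tiergarten ist 210 Hektar groß."},
    {"id": "too-long", "text": "ß" * 4097},  # 2 bytes per character in UTF-8
])
print(problems)
```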