cpatpa lands design docs for five roadmap phases - web search through knowledge collections

Five docs-only commits covering Phases 10-14: web search with SSRF defence, groups and granular RBAC, multi-model side-by-side comparison, pgvector RAG with hybrid retrieval, and knowledge collections. No code yet, but the data model and migration plans are detailed enough to preview where this fork is heading.

knowledge-management, search

The web search design (fb928a96, revised in f96b450a) is the most immediately useful for other forks. The initial design picked SearXNG as the default provider; the revision in the same batch flipped it to Brave Search (cleaner commercial ToS, one env var), keeping SearXNG as a self-hosted alternative. That reversal within a single design pass suggests the author is working from actual constraints rather than preferences. The SSRF defence section is worth reading regardless of which search provider you pick: a scheme/port allowlist (http/https on ports 80/443 only); a DNS resolution check that blocks RFC1918, loopback, link-local, ULA, and cloud metadata addresses and is re-run on every redirect; a content-type allowlist (text/html, text/plain, application/pdf, application/json); and hard caps on bytes, chars, time, and redirect count. The blocklist is seeded with paste sites, *.onion, and unmoderated forums. The workspace override is narrow-only: workspace blocklists stack on top of the org list, and org entries cannot be removed at the workspace level.
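As a rough illustration of the scheme/port and address checks listed above, here is a minimal TypeScript sketch. The function names are hypothetical and the design doc does not prescribe an implementation. The sketch assumes the caller has already resolved the hostname to an IP string and calls `isBlockedAddress` again after every redirect. It is not exhaustive; for example, it does not cover hex-form IPv4-mapped IPv6 addresses.

```typescript
// Hypothetical helpers; names and structure are assumptions, not the fork's code.

/** Scheme/port allowlist: http or https, on port 80 or 443 only. */
function isAllowedUrlShape(raw: string): boolean {
  try {
    const url = new URL(raw);
    if (url.protocol !== "http:" && url.protocol !== "https:") return false;
    // URL.port is "" when the port is the default for the scheme.
    return url.port === "" || url.port === "80" || url.port === "443";
  } catch {
    return false; // unparseable input fails closed
  }
}

function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".");
  if (parts.length !== 4) return true; // malformed: fail closed
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  if (!octets.every((o) => o >= 0 && o <= 255)) return true;
  const a = octets[0]!;
  const b = octets[1]!;
  return (
    a === 0 ||
    a === 10 || // RFC1918 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // RFC1918 172.16.0.0/12
    (a === 192 && b === 168) || // RFC1918 192.168.0.0/16
    a === 127 || // loopback
    (a === 169 && b === 254) // link-local, includes the 169.254.169.254 metadata address
  );
}

function isBlockedIPv6(ip: string): boolean {
  const addr = ip.toLowerCase();
  if (addr === "::" || addr === "::1") return true; // unspecified, loopback
  const mapped = addr.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  if (mapped) return isBlockedIPv4(mapped[1]!); // IPv4-mapped, dotted form
  const first = parseInt(addr.split(":")[0] || "0", 16);
  const isUla = (first & 0xfe00) === 0xfc00; // fc00::/7
  const isLinkLocal = (first & 0xffc0) === 0xfe80; // fe80::/10
  return isUla || isLinkLocal;
}

/** True when a resolved address must not be fetched. */
function isBlockedAddress(ip: string): boolean {
  return ip.includes(":") ? isBlockedIPv6(ip) : isBlockedIPv4(ip);
}
```

Checking the resolved address rather than the hostname is what makes the per-redirect re-check meaningful: a public hostname can redirect to, or resolve to, an internal address.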

Phase 11 (groups, f96b450a) plans four new tables: groups, group_members, permissions (a capability catalogue with a default_for_role mapping and admin_locked flags), and group_permissions. Two auto-managed system groups per workspace ("All members", "Admins") are maintained by triggers on workspace_members. project_members and review_members gain a nullable group_id alongside user_id, governed by a check constraint. Effective permissions are resolved per request in lib/access.ts and exposed as req.can('capability.key'). The migration adds only nullable columns (a metadata-only change in PG16) and builds its indexes concurrently, so it should avoid long table locks.
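One way to picture the per-request resolution behind req.can() is a union of role defaults and group grants. The TypeScript sketch below is an assumption throughout: the type names, the "member" role, and the union rule are illustrative, and admin_locked is left out because its exact semantics are not stated in the commit message.

```typescript
// Hypothetical shapes; the real tables and lib/access.ts may differ.
type Role = "owner" | "admin" | "member";

interface Capability {
  key: string;
  defaultForRole: Role[]; // mirrors the default_for_role mapping
}

interface GroupGrant {
  groupId: string;
  capabilityKey: string; // a row in group_permissions
}

/** Union of role defaults and grants from every group the user belongs to. */
function effectiveCapabilities(
  role: Role,
  groupIds: string[],
  catalogue: Capability[],
  grants: GroupGrant[],
): Set<string> {
  const caps = new Set<string>();
  for (const capability of catalogue) {
    if (capability.defaultForRole.includes(role)) caps.add(capability.key);
  }
  const memberOf = new Set(groupIds);
  for (const grant of grants) {
    if (memberOf.has(grant.groupId)) caps.add(grant.capabilityKey);
  }
  return caps;
}

/** Builds the per-request predicate that would be attached as req.can. */
function makeCan(caps: Set<string>): (key: string) => boolean {
  return (key) => caps.has(key);
}
```

Resolving the set once per request and closing over it keeps each can() call a constant-time lookup, which is relevant to the parked question about req.can() cost in chat tool dispatch.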

Phase 12 (multi-model compare, 876049f9) adds mode and compare_models to chats, and turn_index, branch_index, model_id, and three cost columns (input_tokens, output_tokens, cost_cents) to chat_messages. Cost capture is on for every chat from this phase, compare and standard alike. A new admin-editable model_prices reference table holds per-model rates. Policy: allow_compare_mode defaults to off, compare_mode_admin_only defaults to on, and compare_mode_max_models must be 2 or 3.
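The policy gates above could combine along these lines. This TypeScript sketch is an assumption: the function name, the error strings, the distinct-models rule, and the default of 3 for the model cap are illustrative, and the interaction with the chats.compare capability is left out.

```typescript
// Hypothetical mirror of the three org_settings columns named in the design.
interface CompareSettings {
  allowCompareMode: boolean;
  compareModeAdminOnly: boolean;
  compareModeMaxModels: 2 | 3;
}

const DEFAULT_COMPARE_SETTINGS: CompareSettings = {
  allowCompareMode: false, // default off, per the design
  compareModeAdminOnly: true, // default on, per the design
  compareModeMaxModels: 3, // assumed default; the commit only says 2 or 3
};

/** Returns a rejection reason, or null when the compare request is allowed. */
function compareRequestError(
  settings: CompareSettings,
  isAdmin: boolean,
  models: string[],
): string | null {
  if (!settings.allowCompareMode) return "compare mode is disabled";
  if (settings.compareModeAdminOnly && !isAdmin) return "compare mode is admin-only";
  if (new Set(models).size !== models.length) return "models must be distinct";
  if (models.length < 2 || models.length > settings.compareModeMaxModels) {
    return `select between 2 and ${settings.compareModeMaxModels} models`;
  }
  return null;
}
```

With the defaults as shipped, every request is rejected until an admin turns allow_compare_mode on, which matches the opt-in posture of the other phases.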

Phase 13 (pgvector RAG, 56917d4b) is the biggest structural change. document_chunks carries page_number, heading_path text[], char_start/char_end, and a stored generated tsvector column. document_embeddings is one-to-one with chunks. Hybrid retrieval combines vector cosine similarity over an HNSW index (m=16, ef_construction=64) with ts_rank_cd full-text ranking, merged by Reciprocal Rank Fusion (RRF, k=60). The ingest queue is Postgres-backed (SELECT FOR UPDATE SKIP LOCKED, retry cap of 3) and consumed by a single in-process worker. The default embedding model is bge-m3 via Ollama (1024-dim). The vector(1024) column type is pinned at migration time, so moving to a model with a different dimension requires a paired migration plus a full reindex. The search_documents tool's external shape is unchanged, so the model side needs no update.
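For readers unfamiliar with Reciprocal Rank Fusion, the merge step can be sketched in a few lines of TypeScript. Each chunk scores the sum of 1 / (k + rank) over the lists it appears in, with rank starting at 1. The function name and the plain string IDs are assumptions for illustration; the fork's retriever may compute the fusion differently, for example in SQL.

```typescript
/**
 * Reciprocal Rank Fusion over any number of ranked ID lists.
 * A chunk missing from a list contributes nothing for that list.
 */
function rrfMerge(rankedLists: string[][], k = 60, topN = 10): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, topN)
    .map(([id]) => id);
}

// Hypothetical chunk IDs from the two retrieval arms.
const vectorHits = ["a", "b", "c"];
const fullTextHits = ["b", "d"];
const fused = rrfMerge([vectorHits, fullTextHits]);
// fused is ["b", "a", "d", "c"]: "b" wins because it appears in both lists.
```

Because RRF uses only ranks, the cosine distances and ts_rank_cd scores never need to be normalised against each other, which is the usual reason to choose it for hybrid retrieval.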

Phase 14 (knowledge collections, 5f2e7203) adds a collections table with a visibility enum (private | workspace | shared), collection_documents, and collection_members, which reuses the Phase 11 dual-principal shape as a listing ACL only. The chosen ACL model, hybrid intersection ("Model C" in the commit message), means a user can use any collection visible to them but retrieves only documents they already have access to. Adding a document to a collection never widens access. System collections, one per project, are trigger-maintained and read-only.
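The intersection rule is simple enough to show directly. In this TypeScript sketch, effectiveDocumentSet is the name used in the commit message, while its signature, the return shape, and the visibilityLabel helper are assumptions made for illustration.

```typescript
interface ScopedCollection {
  visible: string[]; // document IDs the caller may actually retrieve
  total: number; // size of the collection, safe to show as a bare count
}

/** Intersects a collection's contents with the caller's accessible documents. */
function effectiveDocumentSet(
  collectionDocIds: string[],
  accessibleDocIds: Set<string>,
): ScopedCollection {
  const visible = collectionDocIds.filter((id) => accessibleDocIds.has(id));
  return { visible, total: collectionDocIds.length };
}

/** Renders the count without naming any inaccessible document. */
function visibilityLabel(scoped: ScopedCollection): string {
  return `${scoped.visible.length} of ${scoped.total} visible to you`;
}
```

Since the result is always a subset of the caller's accessible set, collection membership can only narrow what retrieval sees, never widen it.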

So what

These are design documents, not merged code. Reading them is useful for understanding where the fork's data model is converging, especially the Phase 11 permissions catalogue and the Phase 13 pgvector schema - both will affect anything built on top of this fork after they land. The Phase 10 SSRF checklist is generic and should be copied into any web-search implementation you build regardless of cpatpa's specific choices. The Phase 14 "listing ACL never widens document ACL" assertion is the right model and worth anchoring any sharing-feature design against. One naming caveat: an earlier cpatpa topic called "Phase 10 retention" is the old Phase 10 numbering and unrelated to the web search design here.


Commits in this thread

5 commits from cpatpa/PIP, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`fb928a96`	docs: add Phase 10 (Web search) design	Claude	2026-05-16
commit body

Covers two LLM tools (web_search, fetch_url) with SearXNG as the recommended self-hosted default and Brave Search as an external alternative. Provider abstraction lives in backend/src/lib/websearch/.

Compliance posture:
- allow_web_search master switch defaults off, with per-workspace override (narrow-only on blocklist).
- Domain allowlist and blocklist at both org and workspace levels.
- Audit event for every search and every fetch, including the query text; URL only for fetches (body lives in the fetch cache for its TTL).
- Banner in chat when web search is active, alongside the existing external-AI banner.

SSRF defence:
- Scheme/port allowlist (http/https on 80/443 only).
- DNS resolution check that blocks RFC1918, loopback, link-local, ULA, cloud metadata addresses, re-checked on each redirect.
- Content-type allowlist (text/html, text/plain, application/pdf, application/json).
- Hard caps on bytes, chars, time, and redirect count.

Two new Postgres cache tables, ten new org_settings columns, three new workspaces columns, one chats column for the per-chat toggle. Single migration 0023_web_search.sql.

Includes rollout plan in five steps, risk matrix, four open questions parked for review, and full acceptance criteria.
`f96b450a`	docs: update Phase 10 defaults, add Phase 11 (Groups) design	Claude	2026-05-16
commit body

Phase 10 changes:
- Default provider switched from SearXNG to Brave Search per decision. SearXNG remains as an alternative for operators who prefer self-hosted.
- Blocklist is now the primary filtering mechanism with a seeded starter list (paste sites, *.onion, unmoderated forums). Allowlist demoted to "advanced" use for tightly scoped configurations.
- Migration default for web_search_provider flipped to 'brave'.
- Compose changes are now SearXNG-only and opt-in via profile.
- Open question reframed to Brave plan choice (commercial use).

Phase 11 (Groups and granular permissions) added:
- Four new tables: groups, group_members, permissions (catalogue), group_permissions.
- project_members and review_members extended with nullable group_id alongside user_id, enforced by check constraint.
- Two auto-managed system groups per workspace (All members, Admins) maintained by triggers on workspace_members.
- Capability set seeded by migration with default_for_role mapping and admin_locked flags.
- Effective permissions resolved per request via lib/access.ts and exposed as req.can('capability.key').
- Sharing modal accepts users or groups via a unified principal search endpoint.
- Audit events for group lifecycle and share/unshare with principal kind metadata.
- Single migration 0024_groups.sql; columns added are nullable (metadata-only in PG16), indexes built concurrently.
- Rollout in four steps with feature-flag option for behavioural rollback.

Open questions parked for review: owner-locked capabilities, cross-workspace group sharing, residual shared_with JSONB cleanup, perf of req.can() in chat tool dispatch.
`876049f9`	docs: add Phase 12 (Multi-model side-by-side) design	Claude	2026-05-16
commit body

Adds a compare mode that lets an authorised user run the same prompt against 2 or 3 models in parallel.

Data model:
- chats gets mode ('standard'|'compare') and compare_models text[].
- chat_messages gets turn_index, branch_index, model_id, and three cost-related columns (input_tokens, output_tokens, cost_cents).
- Per-message cost capture is on for every chat from this phase, including standard chats. Cheap side-benefit feeds a new admin Cost summary tab.
- New model_prices reference table, admin-editable.

Backend:
- Compare-aware POST /api/chats/:id/messages fans out N provider calls with Promise.all and multiplexes SSE chunks tagged with branch_index.
- Per-branch regenerate endpoint replaces one assistant row in place.
- Fork-to-standard endpoint clones a compare chat's history with one branch's responses into a new standard chat.

Frontend:
- Mode toggle on new-chat composer. Multi-select model picker showing availability per model. Cost amplification note.
- Chat view renders user turns full-width and assistant turns as N columns side by side on desktop, tabs on mobile.
- Per-column streaming with regen and keep-this actions, error state with retry.

Policy and gating:
- org_settings.allow_compare_mode (default off), compare_mode_max_models (cap 3, must be 2 or 3), compare_mode_admin_only (default on).
- New chats.compare capability seeded into the Phase 11 permissions catalogue with owner+admin default.
- allow_external_models and EXTERNAL_AI_DISABLED both gate which models appear in the picker. Disallowed models are not selectable.

Tools in compare mode are disabled in Phase 12. Memory injection still happens (read-only path); add_memory cannot fire because tools are off.

Single migration 0025_compare_mode.sql with one-shot backfill of turn_index for existing chats and seeded model_prices entries.

Five open questions parked for review: per-branch regen during streaming, RAG cost attribution (deferred to Phase 13), locked model set, per-user concurrency cap, mobile UX.
`56917d4b`	docs: add Phase 13 (Vector RAG with pgvector) design	Claude	2026-05-16
commit body

The foundational retrieval upgrade. Replaces the current LLM-driven in-context document scan with chunked semantic retrieval over pgvector with hybrid search.

Data model:
- pgvector extension on Postgres 16 (image swap to pgvector/pgvector:pg16).
- document_chunks with structure-aware metadata: page_number for PDF, heading_path text[] for DOCX, char_start/char_end for highlight overlays, and a stored tsvector generated column for the full-text arm.
- document_embeddings keyed one-to-one with chunks, single active embedding model at a time, vector(1024) column type pinned at migration time. Dimension swap via templated paired migration plus full reindex.
- rag_ingest_jobs as a simple Postgres-backed queue, consumed by an in-process worker (single backend replica assumption).
- Eight new org_settings columns covering provider, model, chunk shape, top-K per arm, and final top-N.

Embedding model:
- bge-m3 via Ollama as the default (1024 dim, multilingual, CPU-capable). nomic-embed-text as a lighter alternative.
- HF TEI as a dedicated-service option. OpenAI text-embedding-3 family as opt-in, gated by EXTERNAL_AI_DISABLED.

Chunking:
- Token-based (~512 tokens, 64 overlap) with paragraph/sentence/word boundary preference and heading-forced boundaries for DOCX.
- cl100k_base tokenizer for reproducibility across models.
- Per-chunk metadata: document, version, index, page, heading path, character offsets.

Retrieval:
- Hybrid: vector cosine via HNSW (m=16, ef_construction=64) plus ts_rank_cd full-text, merged with Reciprocal Rank Fusion (k=60).
- ACL filter computed up front from Phase 11 effective permissions; RLS as defence-in-depth.
- search_documents tool refactored to call the new retriever; same external shape so the model side is unchanged. Tool output includes chunk_id, document_name, page, heading_path, excerpt, score.

Worker:
- One in-process worker, SELECT FOR UPDATE SKIP LOCKED on rag_ingest_jobs, retry cap of 3, batched embedding calls (default batch 64).
- Hooks: document upload, version creation, admin "reindex" endpoint.

Frontend:
- Composer scope chip ("Searching: project X (47 documents)") with a scope-edit modal.
- search_documents tool-call card renders the hit list with links that jump to document viewer with the chunk highlighted.
- cite-button hover preview of the chunk excerpt.
- Admin AI Policy gets a Retrieval section with provider/model selection, chunk/top-K knobs, queue and index stats, and guarded "Reindex" / "Clear and reindex" actions.

Rollout in six steps gated by a new rag_enabled org switch; rollback at any step flips the switch back to keep the legacy in-context tool.

Compose image swap to pgvector/pgvector:pg16 documented in the operator deployment guide.

Risks captured for: dimension mismatch, HNSW build time, worker stalls on malformed docs, chunk explosion on multi-thousand-page PDFs, permission bypass through retrieval, stale chunks after version change, OpenAI leak under EXTERNAL_AI_DISABLED, postgres image change for operators.

Open questions parked: tokenizer choice, heading-aware tuning, RRF weighting tuning UI, per-document index opt-out, cross-encoder reranker, workspace-wide retrieval (deferred to Phase 14).
`5f2e7203`	docs: add Phase 14 (Knowledge collections) design	Claude	2026-05-16
commit body

The final phase. UX layer over Phase 13 retrieval. Lets users group documents into named collections that can be scoped at chat time via a #-mention autocomplete, and bound as the source set for tabular reviews and workflows.

Data model:
- collections (workspace-scoped, with visibility = private | workspace | shared), collection_documents, collection_members (dual-principal pattern from Phase 11, used as a LISTING ACL only).
- chats.default_scope_kind + default_scope_id for per-chat default.
- tabular_reviews.collection_id (nullable, on delete set null).
- workflows.default_scope_kind + default_scope_id.

ACL model (Model C - hybrid intersection):
- Visibility controls who can SEE the collection.
- effectiveDocumentSet intersects collection contents with the caller's accessible-document set.
- Adding a document to a collection NEVER widens access. A collection containing docs the caller cannot read surfaces a count ("X of Y visible to you") without naming inaccessible documents.
- collection_members is listing-only; cannot widen document ACL.

System collections:
- One per project, system_kind = 'project_all', auto-maintained by triggers on documents and projects.
- Visible workspace-wide; effective contents per user remain intersected with their accessible set.
- Migration backfills system collections for every existing project; duplicate project names emit a warning.

Composer UX:
- #-autocomplete dropdown grouped by Collections, Projects, Documents; capped at 20 results from /scope-search.
- Selected scope renders as a chip with kind icon; multiple chips union; submitting records the scope for the assistant turn.
- Per-chat default scope chip above the composer, editable.

Tabular reviews and workflows:
- Tabular review create modal gains a Source toggle (project or collection).
- Workflow run modal gains a scope picker honouring the workflow's default scope.
- Workflow runner already takes document_ids; Phase 14 adds a thin scope-resolution wrapper that calls effectiveDocumentSet.

Risks captured for visibility leakage via counts, name collisions, performance, autocomplete latency, orphan default scopes, expected behaviour of tabular reviews when collection contents change after review creation, and explicit assertion that shared-collection membership does NOT widen document ACL.

Open questions parked: cross-workspace collections, bulk add via CSV vs picker, system-collection rename on project rename (recommended yes), synthetic "All documents in this workspace" entry in scope picker.

Single migration 0027_collections.sql with trigger-managed system collections and backfill.
