RAG over tabular-review documents via pgvector and OpenAI embeddings

nwhitehouse added end-to-end RAG on top of the tabular review (TR): every uploaded document gets chunked and embedded via a new job type on the existing worker pool, chunks land in a `document_chunks` table with a pgvector HNSW index, and the TR chat retrieves the top-K passages and injects them into the system prompt. Four commits over roughly an hour show what actually goes wrong when you run this on real PDFs.

Tags: contract-review, search

The schema is in migration `007_document_chunks.sql`: `document_chunks(id, document_id, chunk_index, page_start, page_end, content, embedding vector(1536))` with a unique constraint on `(document_id, chunk_index)` and an HNSW index using cosine ops. The kNN search is wrapped in a `rag_search_chunks` SQL function callable via `supabase.rpc()` because the JS client can't bind `vector(1536)` parameters inline. Migration `008_tabular_jobs_job_type.sql` adds a `job_type` column to multiplex generate and embed jobs on the same worker pool - no new infrastructure, just a new dispatch case.
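As a rough illustration of the injection step, the sketch below formats rows of the kind `rag_search_chunks` might return into a passage block appended to the system prompt. The row shape (including the `similarity` field), the function name `injectPassages`, and the exact layout of the passage block are assumptions for illustration; the source does not show the RPC's return columns or the real prompt template.

```typescript
// Hypothetical row shape: the actual columns returned by rag_search_chunks
// are not shown in the writeup.
interface RetrievedChunk {
  document_id: string;
  chunk_index: number;
  page_start: number;
  page_end: number;
  content: string;
  similarity: number; // assumed: higher means closer to the query
}

// Keep the topK most similar chunks and append them to the base prompt.
// With no chunks, the base prompt is returned unchanged (cell-only context).
function injectPassages(
  basePrompt: string,
  chunks: RetrievedChunk[],
  topK: number
): string {
  const top = [...chunks]
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
  if (top.length === 0) return basePrompt;

  const passages = top.map((c, i) => {
    const pages =
      c.page_start === c.page_end
        ? `p. ${c.page_start}`
        : `pp. ${c.page_start}-${c.page_end}`;
    return `[${i + 1}] (doc ${c.document_id}, ${pages})\n${c.content}`;
  });
  return `${basePrompt}\n\nRETRIEVED PASSAGES\n${passages.join("\n\n")}`;
}
```

Carrying the page range into each passage header lets the model cite where an answer came from, which is presumably why the schema stores `page_start` and `page_end` per chunk.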

The chunker uses 800-token windows with a 150-token overlap and tracks page boundaries from the `## Page N` markers that the existing PDF extractor already emits. Embeddings use OpenAI `text-embedding-3-small` (1536 dims, ~$0.02 per 1M tokens). When `OPENAI_API_KEY` is unset, the upload path skips the embed job and TR chat falls back to cell-only context.
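A minimal sketch of a page-aware sliding-window chunker along these lines is shown below. It is not the fork's implementation: whitespace-separated words stand in for real tokens (the source does not say which tokenizer is used), and the names `chunkText` and `Chunk`, as well as the default of page 1 for text before the first marker, are assumptions.

```typescript
interface Chunk {
  chunk_index: number;
  page_start: number;
  page_end: number;
  content: string;
}

// Sliding-window chunker. ASSUMPTION: whitespace-separated words are used
// as a stand-in for model tokens.
function chunkText(text: string, windowSize = 800, overlap = 150): Chunk[] {
  if (overlap >= windowSize) {
    throw new Error("overlap must be smaller than windowSize");
  }

  // Tag every word with the page it appears on, driven by "## Page N" lines.
  const tokens: { word: string; page: number }[] = [];
  let page = 1; // assumed default for text before the first marker
  for (const line of text.split("\n")) {
    const marker = /^## Page (\d+)\s*$/.exec(line);
    if (marker) {
      page = Number(marker[1]);
      continue; // marker lines are not part of the chunk content
    }
    for (const word of line.split(/\s+/)) {
      if (word.length > 0) tokens.push({ word, page });
    }
  }

  const chunks: Chunk[] = [];
  const step = windowSize - overlap; // consecutive windows share `overlap` tokens
  for (let start = 0; start < tokens.length; start += step) {
    const slice = tokens.slice(start, start + windowSize);
    const first = slice[0];
    const last = slice[slice.length - 1];
    if (!first || !last) break;
    chunks.push({
      chunk_index: chunks.length,
      page_start: first.page,
      page_end: last.page,
      content: slice.map((t) => t.word).join(" "),
    });
    if (start + windowSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```

With the defaults, each window advances by 650 tokens, so the last 150 tokens of one chunk reappear at the start of the next and a sentence straddling a boundary is still retrievable from at least one chunk.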

The follow-up commits document real production friction. U+0000 NUL bytes in pdfjs output caused Postgres to reject the embed insert for the entire document with "unsupported Unicode escape sequence" - the fix is a one-liner, `text.replace(/\u0000/g, "")`, applied upfront in the chunker before chunking and page-marker scanning, so both the worker and the backfill script are covered. The backfill script (`scripts/backfill-embeddings.ts`) initially crashed on re-runs with unique-constraint violations on the `(document_id, chunk_index)` pair whenever a prior run had partially landed chunks for a document; the fix is wipe-then-insert, matching the worker's semantics, plus bumping the supabase-js row cap from the default 1000 to 100000. The fourth commit addresses two TR chat issues: cells completed in the gap between the last delta poll and the polling loop's terminal-status flip were silently dropped from the UI until a manual refresh, and the system prompt's column-centric framing ("call read_table_cells before answering") caused the model to refuse questions whose answers were only in the retrieved passages.
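The NUL strip and the wipe-then-insert fix can be sketched together. The example below uses an in-memory stand-in for `document_chunks` that enforces the `(document_id, chunk_index)` uniqueness rule, so it runs without a database; `ChunkTable`, `replaceChunks`, and `stripNul` are hypothetical names, and the real worker and script talk to Postgres through supabase-js rather than to a class like this.

```typescript
// Remove U+0000 characters before the text reaches storage.
function stripNul(text: string): string {
  return text.replace(/\u0000/g, "");
}

interface ChunkRow {
  document_id: string;
  chunk_index: number;
  content: string;
}

// In-memory stand-in for document_chunks (hypothetical). It rejects
// duplicate (document_id, chunk_index) pairs like the unique constraint.
class ChunkTable {
  private rows: ChunkRow[] = [];

  insert(row: ChunkRow): void {
    const clash = this.rows.some(
      (r) => r.document_id === row.document_id && r.chunk_index === row.chunk_index
    );
    if (clash) {
      throw new Error(`unique violation: ${row.document_id}/${row.chunk_index}`);
    }
    this.rows.push(row);
  }

  deleteByDocument(documentId: string): void {
    this.rows = this.rows.filter((r) => r.document_id !== documentId);
  }

  forDocument(documentId: string): ChunkRow[] {
    return this.rows.filter((r) => r.document_id === documentId);
  }
}

// Wipe-then-insert: safe to re-run whether a previous attempt landed
// none, some, or all of the document's chunks.
function replaceChunks(table: ChunkTable, documentId: string, contents: string[]): void {
  table.deleteByDocument(documentId);
  contents.forEach((content, i) => {
    table.insert({ document_id: documentId, chunk_index: i, content: stripNul(content) });
  });
}
```

Deleting first means a re-run never collides with leftovers from a partial run, at the cost of re-embedding documents that were already complete; the source indicates the script accepted that trade to match the worker's behavior.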

So what

Worth importing if your fork wants RAG on user documents. The job-type multiplexing on an existing worker pool is the structurally elegant bit - you get a second job class without any new infrastructure. The backfill script pattern (semantics carefully matched to the live writer, idempotent on re-run) is a useful default for any chunked-data feature. Two real constraints: pgvector's HNSW index requires the `vector` Postgres extension (Supabase Cloud has it; raw RDS may not), and the OpenAI dependency is a hard requirement for embeddings - there's no fallback model path.
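To make the multiplexing idea concrete, here is one way a worker could dispatch on `job_type`. This is a sketch under assumptions: the job-type values `tabular_generate` and `embed_document` are taken from the commit messages, but the `Job` shape, the handler table, and the string results are hypothetical and only illustrate the dispatch pattern.

```typescript
// Job-type values as named in the commit messages; everything else here
// is a hypothetical shape for illustration.
type JobType = "tabular_generate" | "embed_document";

interface Job {
  id: string;
  job_type: JobType;
  document_id: string;
}

type Handler = (job: Job) => string;

// One handler per job type. Record<JobType, Handler> makes the compiler
// flag a missing handler when a new job type is added to the union.
const handlers: Record<JobType, Handler> = {
  tabular_generate: (job) => `generated cells for ${job.document_id}`,
  embed_document: (job) => `embedded chunks for ${job.document_id}`,
};

// Validate the raw column value read from the jobs table.
function parseJobType(raw: string): JobType {
  if (raw === "tabular_generate" || raw === "embed_document") return raw;
  throw new Error(`unknown job_type: ${raw}`);
}

// The single dispatch point shared by every job class on the pool.
function dispatch(job: Job): string {
  return handlers[job.job_type](job);
}
```

Adding a third job class in this layout means extending the union and the handler table; the polling, claiming, and retry machinery of the pool stays untouched.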



Commits in this thread

4 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`a187e3a0`	[feat-024] RAG chat over tabular-review docs	Nick Whitehouse	2026-05-07
commit body
Embeds every doc on upload via a new embed_document job type on the bug-007 worker pool, stores chunks + embeddings in document_chunks (pgvector, HNSW), and injects top-K passages into the TR chat system prompt before the LLM call.
- migrations 007 (vector ext, table, RPC, RLS) + 008 (job_type column)
- text-embedding-3-small (1536 dim) via direct fetch with batched retry
- 800-tok chunks / 150-tok overlap, page-aware via "## Page N" markers
- POST /single-documents/embed-backfill, gated by ENABLE_EMBED_BACKFILL
- TR chat falls back to cell-only context when OPENAI_API_KEY unset
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dbf18a87`	[feat-024] Strip NUL bytes from chunk input	Nick Whitehouse	2026-05-07
commit body
Postgres text/jsonb encoding rejects U+0000 with "unsupported Unicode escape sequence" - pdfjs occasionally emits NUL in extracted text and the embed insert fails for the whole doc. Drop them upfront in the chunker so both worker + script paths are covered.
Also adds a one-shot backfill script that bypasses the HTTP endpoint + worker pool: chunks and embeds every doc that has no rows in document_chunks. Useful for backfilling docs that pre-date this feature without restarting the backend or wiring auth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`9d98a18e`	[feat-024] Backfill script: wipe-and-reinsert for idempotency	Nick Whitehouse	2026-05-07
commit body
Without the wipe, re-running the script crashes on (document_id, chunk_index) unique-constraint violations whenever a prior run partially landed chunks for a doc. Match the worker's processEmbedDocumentItem semantics so the script is safe to re-run on any state.
Also bumps the chunkedSet query past supabase-js's 1000-row default since document_chunks easily exceeds that on a real review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`4f7e58ca`	[feat-024] Refetch review on job terminate + use passages over column-refusal	Nick Whitehouse	2026-05-07
commit body
Two fixes for symptoms reported on prod:
1. After a tabular_generate job finishes, the poll loop's prev.map delta only updates cells that already exist in state - anything completed in the gap between the last delta poll and the terminal-status flip stays blank until the user hits refresh. Refetch the canonical review state once polling exits so the UI matches the DB without manual reload.
2. The TR chat system prompt framed the world entirely around columns ("call read_table_cells before answering"), so the LLM refused questions whose subject wasn't a column even when RETRIEVED PASSAGES contained the answer. Tighten the framing to tell the model both sources are valid grounding and to consult passages before declining.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
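
The first fix in `4f7e58ca` is easier to see with a small model of the poll loop. The sketch below is an assumption-laden illustration, not the fork's UI code: `Cell`, `JobStatus`, `applyDelta`, and `nextState` are hypothetical names, and the refetch is modelled as a synchronous callback where the real one would be an async request.

```typescript
interface Cell {
  id: string;
  value: string | null;
}

// Hypothetical status values; the source only says "terminal-status flip".
type JobStatus = "running" | "completed" | "failed";

// Delta merge in the style the commit describes: mapping over prev can
// only update cells that are already in state, so unseen cells are dropped.
function applyDelta(prev: Cell[], delta: Cell[]): Cell[] {
  return prev.map((cell) => delta.find((d) => d.id === cell.id) ?? cell);
}

// While the job runs, merge deltas; once it reaches a terminal status,
// replace local state with the canonical review from the backend.
function nextState(
  prev: Cell[],
  delta: Cell[],
  status: JobStatus,
  refetchReview: () => Cell[] // synchronous stand-in for an async fetch
): Cell[] {
  const merged = applyDelta(prev, delta);
  return status === "running" ? merged : refetchReview();
}
```

A one-off refetch at exit keeps the cheap delta path for the common case while guaranteeing the final UI state matches the database.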

Capture this thread into my fork

Download a single Markdown prompt that tells Claude how to port every commit above into your working tree, adapting paths and structure to match your repo. Run it via `claude -p < capture-thread-155.md` from inside the repo you want the changes in.
