nwhitehouse teaches the table reviewer to read the underlying documents

The fork's tabular-review chat can now pull answers from the source PDFs, not just the cells in the table.

contract-review, search

Until now, when a reviewer asked the chat a question, the AI could only see the columns and cells laid out in the review. nwhitehouse has wired in retrieval-augmented generation (RAG): every uploaded document gets sliced into overlapping passages, indexed by meaning, and the most relevant passages are handed to the model alongside the question. Ask about something that lives in the body of a contract rather than in a table column, and the chat can now find it.
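
As a rough illustration of the retrieval step, here is a minimal in-memory sketch in TypeScript. The fork stores its embeddings in Postgres via pgvector (see the commit log), so the `Chunk` type and the `cosineSimilarity` and `topK` helpers below are hypothetical names, and the tiny 3-dimensional vectors stand in for real embeddings.

```typescript
// Hypothetical in-memory stand-in for a vector index; not the fork's code.
interface Chunk {
  documentId: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Toy data: the query vector points in the same direction as the first chunk.
const chunks: Chunk[] = [
  { documentId: "msa.pdf", text: "Termination requires 30 days notice.", embedding: [1, 0, 0] },
  { documentId: "msa.pdf", text: "Governing law is New York.", embedding: [0, 1, 0] },
  { documentId: "nda.pdf", text: "Either party may terminate for breach.", embedding: [0.9, 0.1, 0] },
];
const hits = topK([1, 0, 0], chunks, 2);
```

Here `hits` holds the two termination-related passages, which is the kind of context that would be passed to the model alongside the question.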

The rollout is pragmatic. A back-fill tool re-indexes documents already in the system, the indexing piggybacks on existing background workers rather than standing up new infrastructure, and the chat falls back to cell-only context if the embeddings provider isn't configured. Three quick follow-up commits within half an hour (fixing a PDF encoding gotcha, a re-run crash, and a too-narrow prompt) suggest it was tested against real documents, not a demo set.

So what? Anyone running document-heavy contract or discovery review on a Mike fork should watch this: it's the difference between a chat that summarises a spreadsheet and one that reads the underlying file.

Commits in this thread

4 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA | Subject | Author | Date
a187e3a0 | [feat-024] RAG chat over tabular-review docs | Nick Whitehouse | 2026-05-07

Embeds every doc on upload via a new embed_document job type on the
bug-007 worker pool, stores chunks + embeddings in document_chunks
(pgvector, HNSW), and injects top-K passages into the TR chat system
prompt before the LLM call.

- migrations 007 (vector ext, table, RPC, RLS) + 008 (job_type column)
- text-embedding-3-small (1536 dim) via direct fetch with batched retry
- 800-tok chunks / 150-tok overlap, page-aware via "## Page N" markers
- POST /single-documents/embed-backfill, gated by ENABLE_EMBED_BACKFILL
- TR chat falls back to cell-only context when OPENAI_API_KEY unset

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
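
The chunking parameters in this commit (fixed-size chunks, a fixed overlap, and "## Page N" markers) can be sketched as a sliding window. This is a minimal sketch under stated assumptions, not the fork's chunker: whitespace-separated words stand in for real tokens, chunks are assumed never to cross a page marker, and `chunkByPage` and `PageChunk` are hypothetical names.

```typescript
// Hypothetical chunk shape; the fork's document_chunks columns may differ.
interface PageChunk {
  page: number;
  chunkIndex: number;
  text: string;
}

// Sliding-window chunker. ASSUMPTION: words approximate tokens here.
function chunkByPage(markdown: string, size = 800, overlap = 150): PageChunk[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: PageChunk[] = [];
  const step = size - overlap;
  let page = 1;
  let tokens: string[] = [];

  // Emit overlapping windows for the current page, then reset the buffer.
  const flush = (): void => {
    for (let start = 0; start < tokens.length; start += step) {
      chunks.push({
        page,
        chunkIndex: chunks.length,
        text: tokens.slice(start, start + size).join(" "),
      });
      if (start + size >= tokens.length) break; // last window reached the end
    }
    tokens = [];
  };

  for (const line of markdown.split("\n")) {
    const marker = /^## Page (\d+)\s*$/.exec(line);
    if (marker) {
      flush(); // close out the previous page before switching
      page = Number(marker[1]);
    } else {
      tokens.push(...line.split(/\s+/).filter((t) => t.length > 0));
    }
  }
  flush();
  return chunks;
}

// Small numbers make the overlap visible: windows of 4 words, overlapping by 2.
const sample = "## Page 1\na b c d e f g\n## Page 2\nh i";
const pieces = chunkByPage(sample, 4, 2);
```

With the toy sizes above, page 1 yields the windows "a b c d", "c d e f" and "e f g", and page 2 yields "h i". The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk.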

dbf18a87 | [feat-024] Strip NUL bytes from chunk input | Nick Whitehouse | 2026-05-07

Postgres text/jsonb encoding rejects U+0000 with "unsupported Unicode
escape sequence" - pdfjs occasionally emits NUL in extracted text and
the embed insert fails for the whole doc. Drop them upfront in the
chunker so both worker + script paths are covered.

Also adds a one-shot backfill script that bypasses the HTTP endpoint
+ worker pool: chunks and embeds every doc that has no rows in
document_chunks. Useful for backfilling docs that pre-date this
feature without restarting the backend or wiring auth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
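
The NUL-byte fix amounts to a one-line sanitising step before text reaches the database. A minimal sketch, assuming a hypothetical helper name `stripNulBytes` (the fork's actual function name is not given in the commit):

```typescript
// Remove every U+0000 character so Postgres text/jsonb columns accept the string.
function stripNulBytes(text: string): string {
  return text.replace(/\u0000/g, "");
}

// Example: a NUL embedded mid-word, as an extractor might occasionally emit.
const cleaned = stripNulBytes("Clause 4\u0000.2 applies");
```

Doing this in the chunker rather than at each insert site means every path that produces chunks gets the same protection, which matches the reasoning in the commit message.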

9d98a18e | [feat-024] Backfill script: wipe-and-reinsert for idempotency | Nick Whitehouse | 2026-05-07

Without the wipe, re-running the script crashes on
(document_id, chunk_index) unique-constraint violations whenever a
prior run partially landed chunks for a doc. Match the worker's
processEmbedDocumentItem semantics so the script is safe to re-run
on any state.

Also bumps the chunkedSet query past supabase-js's 1000-row default
since document_chunks easily exceeds that on a real review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
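
To see why wipe-and-reinsert makes a re-run safe, here is a small in-memory sketch. `ChunkTable` is a hypothetical stand-in that only mimics a unique constraint on (document_id, chunk_index); it is not the fork's script or the Supabase client.

```typescript
// Hypothetical in-memory table enforcing a (documentId, chunkIndex) unique key.
class ChunkTable {
  private rows = new Map<string, Map<number, string>>();

  insert(documentId: string, chunkIndex: number, text: string): void {
    const doc = this.rows.get(documentId) ?? new Map<number, string>();
    if (doc.has(chunkIndex)) {
      throw new Error(`unique violation on (${documentId}, ${chunkIndex})`);
    }
    doc.set(chunkIndex, text);
    this.rows.set(documentId, doc);
  }

  deleteDocument(documentId: string): void {
    this.rows.delete(documentId);
  }

  count(documentId: string): number {
    return this.rows.get(documentId)?.size ?? 0;
  }
}

// Wipe first, then insert: safe to call on empty, partial, or complete state.
function reindexDocument(table: ChunkTable, documentId: string, texts: string[]): void {
  table.deleteDocument(documentId);
  texts.forEach((text, index) => table.insert(documentId, index, text));
}

const table = new ChunkTable();
table.insert("doc-1", 0, "left behind by a partial run");

// An insert-only re-run would collide with the leftover row.
let insertOnlyFailed = false;
try {
  table.insert("doc-1", 0, "chunk a");
} catch {
  insertOnlyFailed = true;
}

// Wipe-and-reinsert succeeds, and succeeds again when repeated.
reindexDocument(table, "doc-1", ["chunk a", "chunk b"]);
reindexDocument(table, "doc-1", ["chunk a", "chunk b"]);
const rowCount = table.count("doc-1");
```

The trade-off is that every run re-embeds documents it touches, which costs extra work but removes the need to reason about what state a previous run left behind.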

4f7e58ca [feat-024] | Refetch review on job terminate + use passages over column-refusal | Nick Whitehouse | 2026-05-07

Two fixes for symptoms reported on prod:

1. After a tabular_generate job finishes, the poll loop's prev.map
   delta only updates cells that already exist in state - anything
   completed in the gap between the last delta poll and the
   terminal-status flip stays blank until the user hits refresh.
   Refetch the canonical review state once polling exits so the UI
   matches the DB without manual reload.

2. The TR chat system prompt framed the world entirely around
   columns ("call read_table_cells before answering"), so the LLM
   refused questions whose subject wasn't a column even when
   RETRIEVED PASSAGES contained the answer. Tighten the framing to
   tell the model both sources are valid grounding and to consult
   passages before declining.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
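
The prompt change in the second fix, together with the cell-only fallback from the first commit, can be sketched as a small prompt builder. The wording below is illustrative only and is not the fork's actual system prompt; `buildSystemPrompt` and its parameters are hypothetical, and only the "RETRIEVED PASSAGES" label comes from the commit message.

```typescript
// Hypothetical prompt builder; the instruction text is an assumption.
function buildSystemPrompt(tableSummary: string, passages: string[]): string {
  const sections: string[] = [
    "You answer questions about a tabular review.",
    "TABLE CELLS:\n" + tableSummary,
  ];
  if (passages.length > 0) {
    // Number the passages so the model can point at the one it relied on.
    const numbered = passages.map((p, i) => `[${i + 1}] ${p}`).join("\n");
    sections.push("RETRIEVED PASSAGES:\n" + numbered);
    sections.push(
      "Both the table cells and the retrieved passages are valid grounding. " +
        "Check the passages before declining to answer."
    );
  } else {
    // Fallback when retrieval is unavailable: cell-only context.
    sections.push("Answer from the table cells only.");
  }
  return sections.join("\n\n");
}

const withPassages = buildSystemPrompt("Party | Term", ["Term is 24 months."]);
const cellOnly = buildSystemPrompt("Party | Term", []);
```

The point of the sketch is the framing: when passages are present the prompt names them as a legitimate source, so a question about something that is not a column no longer reads as out of scope.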
