nwhitehouse wires up clickable citations in extracted tables

Reviewers can now jump straight from a table cell to the exact page and phrase in the source document.

contract-review, discovery

When an AI pulls structured data out of a PDF - parties, dates, clauses into neat columns - the hard part isn't extraction, it's trust. nwhitehouse's update makes every extracted cell defend itself: click a citation chip and the document viewer jumps to the cited page; click a keyword chip and the doc search runs the phrase and scrolls to the first hit.

The team tightened the model prompts so every name, date, and clause has to come with a page-and-quote citation rather than vague prose like "(see Page 5)." When the model still slips and writes the page reference in prose anyway, a frontend safety net catches it and at least preserves the page-jump. Chip labels also got smarter - "Section 2.06" or the first words of the quote, rather than a useless "Page 1" repeated four times.

So what: Anyone building review tools where a lawyer needs to verify AI-extracted facts against the source should study this pattern.



Commits in this thread

4 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA Subject Author Date
04c7218a [feat-022] Cell-detail citation + related-keyword chips Nick Whitehouse 2026-05-07
commit body
User request: clicking a citation in the cell side panel should jump
the doc viewer to the cited page; clicking a related keyword should
drive the doc-search input + jump to the first match. Both speed up
the verify/spot-check workflow that's the point of tabular review.

Backend:
- Cell-extraction JSON schema gains `keywords: string[]`. Both
  queryGemini (single-cell regenerate) and queryGeminiAllColumns
  (full review run) now require 3-5 short search terms a reviewer
  could use to verify the cell against the source. Prompts say
  "skip generic words, skip column-name restatements, prefer
  near-verbatim phrases".
- sanitiseKeywords() junk-filters the LLM output: trims, dedupes
  case-insensitively, drops <2-char strings, caps at 5, max 60 chars
  each. Bad keywords can't leak into the UI.
- CellResult type extended with `keywords?: string[]`.
- Storage piggybacks on existing tabular_cells.content jsonb
  (no migration).
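
The keyword filter described above could look roughly like the sketch below. This is not the code from lib/tabularJobs.ts, only an illustration of the stated rules (trim, case-insensitive dedupe, drop strings under 2 characters, cap at 5 entries, max 60 characters each). One assumption is labeled in the code: over-long entries are dropped here, whereas the real function may truncate them instead.

```typescript
// Hypothetical sketch of a keyword sanitiser following the rules in the
// commit message; the real sanitiseKeywords() may differ in detail.
const MAX_KEYWORDS = 5;
const MIN_CHARS = 2;
const MAX_CHARS = 60;

function sanitiseKeywords(raw: unknown): string[] {
  // LLM output is untrusted: anything that is not an array yields no keywords.
  if (!Array.isArray(raw)) return [];

  const seen = new Set<string>();
  const out: string[] = [];

  for (const item of raw) {
    if (typeof item !== "string") continue;
    const kw = item.trim();
    // Assumption: entries longer than MAX_CHARS are dropped, not truncated.
    if (kw.length < MIN_CHARS || kw.length > MAX_CHARS) continue;

    // Dedupe case-insensitively, keeping the first spelling encountered.
    const key = kw.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);

    out.push(kw);
    if (out.length === MAX_KEYWORDS) break;
  }
  return out;
}
```

Accepting `unknown` rather than `string[]` keeps the type check inside the function, so a malformed model response degrades to an empty chip row instead of throwing.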

Frontend:
- TabularCell.content type extended with keywords?: string[].
  Legacy cells (pre-feat-022) have no keywords; the chip row simply
  doesn't render.
- CellCitationChips component - pill row of clickable citations,
  one per parsed [[page:N||quote:...]] marker. Chip label is "Section
  X.Y" if the quote starts with one, else "Page N". Tooltip carries
  the full quote. Click → jumps the doc viewer to the cited page
  (via the existing setActiveCitationIdx flow).
- CellKeywordChips component - pill row of LLM-suggested doc-search
  terms. Click → setSearchTerm(kw), which drives the existing
  doc-viewer search input + scroll-to-first-match.
- Both rows render under the Explanation block in TRDocDetailView,
  hidden in edit mode and absent when there are no citations /
  keywords (graceful for legacy cells).
- Citation chip click clears any active doc-search so the new jump
  starts on a fresh viewport.
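
For readers unfamiliar with the marker format, here is a minimal sketch of how [[page:N||quote:...]] markers might be parsed into chip data. The ParsedCitation shape and the parseCitationMarkers name are assumptions for illustration; the commit only confirms that one chip is rendered per parsed marker.

```typescript
// Hypothetical shape; the real ParsedCitation type may carry more fields.
interface ParsedCitation {
  page: number;
  quote: string;
}

// Illustrative parser for [[page:N||quote:...]] markers embedded in cell text.
function parseCitationMarkers(text: string): ParsedCitation[] {
  // The quote is matched lazily so it stops at the first closing "]]".
  const re = /\[\[page:(\d+)\|\|quote:([\s\S]*?)\]\]/g;
  const out: ParsedCitation[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push({ page: Number(m[1]), quote: m[2].trim() });
  }
  return out;
}
```

Each returned entry would map to one chip: the page feeds the jump handler and the quote feeds the tooltip.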

Both components are deliberately generic ({citations, onJump} and
{keywords, onSearch}) so feat-024's RAG retrieved-passage chips can
reuse them without copy-paste.

Verified: tsc clean both sides; 16 backend tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a71f3cc1 [feat-022] Stop parseCellContent from stripping keywords Nick Whitehouse 2026-05-07
commit body
The worker stores cell content as {summary, flag, reasoning, keywords}
but the GET /tabular-review/:reviewId endpoint passes content through
parseCellContent, which had a hard-coded 3-field shape that silently
dropped keywords on the way out. Result: keyword chips never had data,
no matter what the worker actually emitted.

Updated parseCellContent (and the inner string-fallback branch) to
read + return keywords. Added sanitiseKeywordsForResponse - kept in
sync with sanitiseKeywords in lib/tabularJobs.ts (≤5 entries, trim,
≥2 chars, max 60 chars per entry) so legacy cells / corrupt rows
don't poison responses.

This was a missed plumbing step in feat-022. The DB data was correct;
it was the response shape that was wrong.
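
The bug class here is a response parser with a fixed field list. A rough sketch of a parser that carries keywords through, including a string-fallback branch, follows. The field types, the handling of non-JSON strings, and the function bodies are all assumptions; only the names parseCellContent and sanitiseKeywordsForResponse and the four stored fields come from the commit message.

```typescript
// Assumed shape: the commit names the four fields but not their types.
interface CellContent {
  summary: string;
  flag: string;
  reasoning: string;
  keywords: string[];
}

// Mirrors the rules listed in the commit: at most 5 entries, trimmed,
// at least 2 chars, at most 60 chars per entry.
function sanitiseKeywordsForResponse(raw: unknown): string[] {
  if (!Array.isArray(raw)) return [];
  const out: string[] = [];
  for (const item of raw) {
    if (typeof item !== "string") continue;
    const kw = item.trim();
    if (kw.length < 2 || kw.length > 60) continue;
    out.push(kw);
    if (out.length === 5) break;
  }
  return out;
}

function parseCellContent(content: unknown): CellContent | null {
  let obj: unknown = content;
  if (typeof content === "string") {
    // String-fallback branch. Assumption: the column may hold JSON text,
    // and anything unparseable is treated as a bare summary.
    try {
      obj = JSON.parse(content);
    } catch {
      return { summary: content, flag: "", reasoning: "", keywords: [] };
    }
  }
  if (obj === null || typeof obj !== "object") return null;
  const rec = obj as Record<string, unknown>;
  return {
    summary: typeof rec.summary === "string" ? rec.summary : "",
    flag: typeof rec.flag === "string" ? rec.flag : "",
    reasoning: typeof rec.reasoning === "string" ? rec.reasoning : "",
    // The previously missing step: read keywords and return them.
    keywords: sanitiseKeywordsForResponse(rec.keywords),
  };
}
```

Legacy rows without a keywords field come back with an empty array, which matches the chip row simply not rendering.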

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c8ad2161 [feat-022] Citation chip labels: quote excerpt, not "Page N" Nick Whitehouse 2026-05-07
commit body
Was rendering "Page 1 · p1" for every citation when the quote didn't
start with a Section reference - useless when 4 of 4 citations are on
page 1 (can't tell them apart). Now:

- Section reference at start of quote → "Section 2.06"
- Otherwise → first ~40 chars of the quote in curly quotes, e.g.
  "terminate...effective within one h..."
- "Page N" is now only the final fallback when the quote is empty

Page number moved to a small "p1" trailing suffix (was the redundant
" · p1" stacked next to "Page 1"). Full quote stays on hover tooltip.
Max chip width capped at 260px so a long quote doesn't blow the row.
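
The three-step label rule can be sketched as a small pure function. The chipLabel name, the exact section regex, and the hard 40-character cut are assumptions; the commit only says "first ~40 chars". The trailing "p1" suffix is assumed to be rendered separately by the component.

```typescript
// Hypothetical shape for a parsed citation.
interface ParsedCitation {
  page: number;
  quote: string;
}

const EXCERPT_CHARS = 40; // assumption: "~40 chars" taken as a hard cut

function chipLabel(c: ParsedCitation): string {
  const quote = c.quote.trim();

  // Final fallback: no quote at all, so only the page can identify the chip.
  if (quote === "") return `Page ${c.page}`;

  // A section reference at the start of the quote wins, e.g. "Section 2.06".
  const section = /^Section\s+(\d+(?:\.\d+)*)/i.exec(quote);
  if (section) return `Section ${section[1]}`;

  // Otherwise show the start of the quote in curly quotes.
  const excerpt =
    quote.length > EXCERPT_CHARS ? quote.slice(0, EXCERPT_CHARS) + "..." : quote;
  return `\u201C${excerpt}\u201D`;
}
```

Keeping the label logic separate from rendering means four citations on the same page still get four distinguishable chips, since the label no longer depends on the page number unless the quote is empty.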

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ada3fc90 [feat-022] Force LLM citations on lists + prose-page-ref fallback chips Nick Whitehouse 2026-05-07
commit body
Two fixes for the "Parties" column case the user found, where the
LLM described both parties but emitted no [[page:N||quote:...]]
markers, leaving the Citations chip row empty even though "Page 1"
was clearly stated in prose.

Backend prompts (queryGemini + queryGeminiAllColumns):
- Lists/bulleted summaries are explicitly NOT exempt from citation
  rules. The prompt now spells out "if the summary lists three
  parties, emit three citation markers - one per party" with a
  worked example: "ACME CORP [[page:1||quote:ACME CORP]] and
  BETA INC [[page:1||quote:BETA INC]]".
- Tightened the rule from "factual claim" to "every distinct fact
  (every name, date, number, party, clause, defined term, or
  substantive statement)" so the model can't lawyer its way out of
  citing a list.
- Explicit ban on prose page references like "(Page 5)" or
  "see Section 3" inside the summary - must use the marker format
  so the UI can render clickable chips.
- Reasoning may also include [[page:N||quote:...]] citations, and
  the prompt now encourages it.
- "keywords" guidance picks up "party names" as a preferred kind.

Frontend safety net for legacy / occasionally-still-lazy outputs:
- citation-utils gains extractProsePageRefs() - scans free text for
  "Page N" / "page N" / "p. N" / "pp. N" patterns and emits
  quoteless ParsedCitation entries, deduped against any structured
  citations on the same page.
- TRDocDetailView's allCitations now appends those prose refs after
  the structured ones, so the Citations chip row gets at least the
  page-jump affordance even when the model skipped its markers. The
  existing [N]-numbered superscripts in the cell text stay stable
  (only structured citations get numbered).
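
A minimal sketch of the prose fallback, under some assumptions: the signature below (text plus the already-parsed structured citations) and the step that strips structured markers before scanning are illustrative choices, not confirmed details of citation-utils. The patterns covered are the four named in the commit.

```typescript
// Hypothetical shape; quoteless entries carry an empty quote.
interface ParsedCitation {
  page: number;
  quote: string;
}

function extractProsePageRefs(
  text: string,
  structured: ParsedCitation[]
): ParsedCitation[] {
  // Pages already covered by structured citations are not repeated.
  const seen = new Set<number>();
  for (const c of structured) seen.add(c.page);

  // Assumption: drop structured markers first so quoted text that happens
  // to mention a page number is not counted as a prose reference.
  const prose = text.replace(/\[\[page:\d+\|\|quote:[\s\S]*?\]\]/g, " ");

  // Matches "Page N", "page N", "p. N" and "pp. N".
  const re = /\b(?:page|pp?\.)\s*(\d+)/gi;
  const out: ParsedCitation[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(prose)) !== null) {
    const page = Number(m[1]);
    if (seen.has(page)) continue;
    seen.add(page);
    out.push({ page, quote: "" });
  }
  return out;
}
```

Appending the result after the structured list, as in `[...structured, ...extractProsePageRefs(text, structured)]`, leaves the indices of the structured citations unchanged, which is consistent with the numbered superscripts staying stable.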

Net effect: a re-run of the Parties column should now produce per-party
[[page:N||quote:...]] markers and four real Citations chips. For cells
already in the DB (without re-running) the prose fallback at least
gives a clickable "Page 1" jump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
