Citation chips and keyword chips in the tabular cell side-panel

nwhitehouse extended the tabular cell detail view with two chip types: citation chips that jump the doc viewer to the cited page, and keyword chips that drive the doc-search input. All four commits landed on the same day (2026-05-07); the last three fix issues found by running the feature on real data.

contract-review · discovery

The extraction prompt gains a keywords field: an array of 3-5 short phrases a reviewer would type into a doc-search box, explicitly prompted to prefer verbatim or near-verbatim terms from the document and skip generic words like "provision" or "agreement". The worker sanitizes the output through sanitiseKeywords (trim, dedupe case-insensitively, drop strings under 2 characters, cap at 5, max 60 chars each) before storing. No migration needed - the field piggybacks on the existing tabular_cells.content jsonb column.
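The sanitising rules listed above can be sketched roughly as follows. This is an illustrative reconstruction and not the fork's code: the name `sanitiseKeywordsSketch`, the constants, and the decision to drop over-long entries instead of truncating them are all assumptions.

```typescript
// Hypothetical sketch of the keyword sanitiser; the real sanitiseKeywords
// in the fork may order or implement these steps differently.
const MAX_KEYWORDS = 5;
const MIN_LENGTH = 2;
const MAX_LENGTH = 60;

function sanitiseKeywordsSketch(raw: unknown): string[] {
  // LLM output is untrusted: anything that is not an array yields no keywords.
  if (!Array.isArray(raw)) return [];
  const seen = new Set<string>();
  const out: string[] = [];
  for (const item of raw) {
    if (typeof item !== "string") continue;
    const trimmed = item.trim();
    // Assumption: entries over the length limit are dropped, not truncated.
    if (trimmed.length < MIN_LENGTH || trimmed.length > MAX_LENGTH) continue;
    // Case-insensitive dedupe keeps the first spelling seen.
    const key = trimmed.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(trimmed);
    if (out.length === MAX_KEYWORDS) break;
  }
  return out;
}
```

Because the input is typed as `unknown`, the same function can be pointed at either raw model output or a value read back from the jsonb column.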

A missed-plumbing bug surfaced quickly: parseCellContent in routes/tabular.ts had a hard-coded 3-field shape (summary, flag, reasoning) that silently dropped keywords from the GET response. The worker stored them correctly, but they never reached the frontend. The fix updates parseCellContent (including its string-fallback branch) and adds a sanitiseKeywordsForResponse function, deliberately duplicated from the worker-side sanitiser rather than imported, with a comment noting to keep the two in sync by hand.
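To make the failure mode concrete, here is a minimal sketch of a response-side parser that carries keywords through instead of dropping them. The `CellContentSketch` shape, the string types for `flag`, the `cleanKeywords` helper, and the handling of unparseable strings are assumptions for illustration; the fork's actual parseCellContent is not shown in the commit log.

```typescript
// Hypothetical response shape; the field types are assumed, not confirmed.
interface CellContentSketch {
  summary: string;
  flag: string;
  reasoning: string;
  keywords: string[];
}

// Stand-in for sanitiseKeywordsForResponse (limits taken from the commit body).
function cleanKeywords(raw: unknown): string[] {
  if (!Array.isArray(raw)) return [];
  return raw
    .filter((k): k is string => typeof k === "string")
    .map((k) => k.trim())
    .filter((k) => k.length >= 2 && k.length <= 60)
    .slice(0, 5);
}

function parseCellContentSketch(content: unknown): CellContentSketch | null {
  let value: unknown = content;
  if (typeof content === "string") {
    try {
      value = JSON.parse(content);
    } catch {
      // Assumption: a string that is not JSON is treated as a bare summary.
      return { summary: content, flag: "", reasoning: "", keywords: [] };
    }
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return null;
  }
  const record = value as Record<string, unknown>;
  return {
    summary: typeof record.summary === "string" ? record.summary : "",
    flag: typeof record.flag === "string" ? record.flag : "",
    reasoning: typeof record.reasoning === "string" ? record.reasoning : "",
    // The original bug amounted to omitting this fourth field.
    keywords: cleanKeywords(record.keywords),
  };
}
```

Listing every field explicitly is what made the original omission silent; a test that round-trips a stored cell through the GET handler would likely have caught it.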

Citation chip labels initially rendered as "Page 1 · p1" for every citation whose quote did not start with a section reference, which is useless when all citations are on the same page. The updated chipLabel() tries a section reference match first (Section 2.06, § 4.1), then falls back to the first ~40 characters of the quote in curly quotes, and uses "Page N" only when the quote is empty. The page number moves to a small trailing "p1" suffix, and the full quote stays in the hover tooltip.
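The label fallback order can be sketched as below. The regular expression, the 40-character cut-off constant, the trailing ellipsis, and the choice to render a "§" reference as "Section N" are assumptions; only the three-step order (section reference, quote excerpt, page number) comes from the commit description.

```typescript
// Hypothetical citation shape for illustration.
interface CitationSketch {
  page: number;
  quote: string;
}

// Assumed pattern: "Section 2.06" or "§ 4.1" at the start of the quote.
const SECTION_PATTERN = /^(?:Section|§)\s*(\d+(?:\.\d+)*)/i;
const EXCERPT_LENGTH = 40;

function chipLabelSketch(citation: CitationSketch): string {
  const quote = citation.quote.trim();
  // Final fallback: nothing to excerpt, so only the page can identify the chip.
  if (quote === "") return `Page ${citation.page}`;
  // First preference: a leading section reference.
  const section = SECTION_PATTERN.exec(quote);
  if (section) return `Section ${section[1]}`;
  // Otherwise: a short excerpt of the quote wrapped in curly quotes.
  const excerpt =
    quote.length > EXCERPT_LENGTH ? `${quote.slice(0, EXCERPT_LENGTH)}…` : quote;
  return `\u201C${excerpt}\u201D`;
}
```

With this ordering, four citations on the same page still get four distinguishable labels as long as their quotes differ.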

The fourth commit addresses an LLM compliance failure on the "Parties" column: the model described both parties in prose without emitting any [[page:N||quote:...]] markers. The prompt was tightened with a concrete worked example - "ACME CORP [[page:1||quote:ACME CORP]] and BETA INC [[page:1||quote:BETA INC]]" - and an explicit ban on prose page references like "(Page 5)". As a frontend safety net, extractProsePageRefs was added to synthesize quoteless ParsedCitation objects from "Page N" / "p. N" / "pp. N" patterns in cell text, deduped against structured citations on the same page. This preserves the page-jump affordance for legacy cells and for model responses that under-cite.
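A rough sketch of the prose fallback follows. The `ParsedCitationSketch` shape, both regular expressions, and the step that strips structured markers before scanning are assumptions made for this example; the commit log only states which prose patterns are recognised and that results are deduped against structured citations on the same page.

```typescript
// Hypothetical citation shape; a prose reference carries an empty quote.
interface ParsedCitationSketch {
  page: number;
  quote: string;
}

function extractProsePageRefsSketch(
  text: string,
  structured: ParsedCitationSketch[],
): ParsedCitationSketch[] {
  // Assumed marker syntax, taken from the [[page:N||quote:...]] examples.
  const markerPattern = /\[\[page:\d+\|\|quote:.*?\]\]/g;
  // Matches "Page 5", "page 5", "p. 5" and "pp. 5".
  const prosePattern = /\b(?:page|pp?\.)\s*(\d+)/gi;

  // Pages that already have a structured citation are skipped.
  const seenPages = new Set<number>(structured.map((c) => c.page));
  const refs: ParsedCitationSketch[] = [];

  // Remove structured markers first so text inside a quote is not rescanned.
  const prose = text.replace(markerPattern, " ");
  let match: RegExpExecArray | null;
  while ((match = prosePattern.exec(prose)) !== null) {
    const page = Number(match[1]);
    // Assumption: repeated prose mentions of a page produce a single chip.
    if (seenPages.has(page)) continue;
    seenPages.add(page);
    refs.push({ page, quote: "" });
  }
  return refs;
}
```

Appending these entries after the structured citations, as the commit describes, keeps the numbering of the structured ones stable.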

So what

Worth importing for any surface that extracts structured data from PDFs and lets reviewers verify results. The citation marker as canonical source of truth, with keyword chips as a separate verification path, is a clean split. Two things to watch: the `sanitiseKeywords` / `sanitiseKeywordsForResponse` duplication is a maintenance hazard if either copy drifts; and the prompt-tightening pattern (explicit worked examples to enforce citation format) is effective but model-dependent - validate against whatever LLM your fork uses before relying on it.


Commits in this thread

4 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`04c7218a`	[feat-022] Cell-detail citation + related-keyword chips	Nick Whitehouse	2026-05-07

User request: clicking a citation in the cell side panel should jump the doc viewer to the cited page; clicking a related keyword should drive the doc-search input + jump to the first match. Both speed up the verify/spot-check workflow that's the point of tabular review.

Backend:
- Cell-extraction JSON schema gains `keywords: string[]`. Both queryGemini (single-cell regenerate) and queryGeminiAllColumns (full review run) now require 3-5 short search terms a reviewer could use to verify the cell against the source. Prompts say "skip generic words, skip column-name restatements, prefer near-verbatim phrases".
- sanitiseKeywords() junk-filters the LLM output: trims, dedupes case-insensitively, drops <2-char strings, caps at 5, max 60 chars each. Bad keywords can't leak into the UI.
- CellResult type extended with `keywords?: string[]`.
- Storage piggybacks on existing tabular_cells.content jsonb (no migration).

Frontend:
- TabularCell.content type extended with keywords?: string[]. Legacy cells (pre-feat-022) have no keywords; the chip row simply doesn't render.
- CellCitationChips component - pill row of clickable citations, one per parsed [[page:N||quote:...]] marker. Chip label is "Section X.Y" if the quote starts with one, else "Page N". Tooltip carries the full quote. Click → jumps the doc viewer to the cited page (via the existing setActiveCitationIdx flow).
- CellKeywordChips component - pill row of LLM-suggested doc-search terms. Click → setSearchTerm(kw), which drives the existing doc-viewer search input + scroll-to-first-match.
- Both rows render under the Explanation block in TRDocDetailView, hidden in edit mode and absent when there are no citations / keywords (graceful for legacy cells).
- Citation chip click clears any active doc-search so the new jump starts on a fresh viewport.

Both components are deliberately generic ({citations, onJump} and {keywords, onSearch}) so feat-024's RAG retrieved-passage chips can reuse them without copy-paste.

Verified: tsc clean both sides; 16 backend tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`a71f3cc1`	[feat-022] Stop parseCellContent from stripping keywords	Nick Whitehouse	2026-05-07

The worker stores cell content as {summary, flag, reasoning, keywords} but the GET /tabular-review/:reviewId endpoint passes content through parseCellContent, which had a hard-coded 3-field shape that silently dropped keywords on the way out. Result: keyword chips never had data, no matter what the worker actually emitted.

Updated parseCellContent (and the inner string-fallback branch) to read + return keywords. Added sanitiseKeywordsForResponse - kept in sync with sanitiseKeywords in lib/tabularJobs.ts (≤5 entries, trim, ≥2 chars, max 60 chars per entry) so legacy cells / corrupt rows don't poison responses.

This was a missed plumbing step in feat-022. The DB data was correct; it was the response shape that was wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`c8ad2161`	[feat-022] Citation chip labels: quote excerpt, not "Page N"	Nick Whitehouse	2026-05-07

Was rendering "Page 1 · p1" for every citation when the quote didn't start with a Section reference - useless when 4 of 4 citations are on page 1 (can't tell them apart).

Now:
- Section reference at start of quote → "Section 2.06"
- Otherwise → first ~40 chars of the quote in curly quotes, e.g. "terminate...effective within one h..."
- "Page N" is now only the final fallback when the quote is empty

Page number moved to a small "p1" trailing suffix (was the redundant " · p1" stacked next to "Page 1"). Full quote stays on hover tooltip. Max chip width capped at 260px so a long quote doesn't blow the row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ada3fc90`	[feat-022] Force LLM citations on lists + prose-page-ref fallback chips	Nick Whitehouse	2026-05-07

Two fixes for the "Parties" column case the user found, where the LLM described both parties but emitted no [[page:N||quote:...]] markers, leaving the Citations chip row empty even though "Page 1" was clearly stated in prose.

Backend prompts (queryGemini + queryGeminiAllColumns):
- Lists/bulleted summaries are explicitly NOT exempt from citation rules. The prompt now spells out "if the summary lists three parties, emit three citation markers - one per party" with a worked example: "ACME CORP [[page:1||quote:ACME CORP]] and BETA INC [[page:1||quote:BETA INC]]".
- Tightened the rule from "factual claim" to "every distinct fact (every name, date, number, party, clause, defined term, or substantive statement)" so the model can't lawyer its way out of citing a list.
- Explicit ban on prose page references like "(Page 5)" or "see Section 3" inside the summary - must use the marker format so the UI can render clickable chips.
- Reasoning may also include [[page:N||quote:...]] citations, and the prompt now encourages it.
- "keywords" guidance picks up "party names" as a preferred kind.

Frontend safety net for legacy / occasionally-still-lazy outputs:
- citation-utils gains extractProsePageRefs() - scans free text for "Page N" / "page N" / "p. N" / "pp. N" patterns and emits quoteless ParsedCitation entries, deduped against any structured citations on the same page.
- TRDocDetailView's allCitations now appends those prose refs after the structured ones, so the Citations chip row gets at least the page-jump affordance even when the model skipped its markers. The existing [N]-numbered superscripts in the cell text stay stable (only structured citations get numbered).

Net effect: a re-run of the Parties column should now produce per-party [[page:N||quote:...]] markers and four real Citations chips. For cells already in the DB (without re-running) the prose fallback at least gives a clickable "Page 1" jump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
