jamietso/mike-redline: feed tracked changes and comment bubbles to the LLM

Upstream mike strips redlines before the LLM sees a document - `w:del` text and comment bubbles are silently dropped. This fork keeps them, inlining `{++ins++}`, `{--del--}`, `{<<moved>>}`, and `{>>by AUTHOR: text<<}` markers so the model can answer "what did counterparty change?" on both DOCX and PDF.

contract-review, discovery

The core change is in `backend/src/lib/docxTrackedChanges.ts`. A new `extractDocxBodyText` function walks `word/document.xml` directly (replacing the mammoth dependency) and emits inline markers at `w:ins`, `w:del`, and `w:commentRangeStart` boundaries. Comments are loaded once from `word/comments.xml` into a map keyed by `w:id` and spliced in at each anchor. Both the chat assistant's `read_document` tool and the tabular per-cell extractor call this new function; the system prompts in `chatTools.ts` and `routes/tabular.ts` are updated to define the marker syntax.
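The walk described above can be sketched roughly as follows. This is a simplified illustration, not the fork's code: it tokenizes the XML with a regular expression instead of a real parser, it does not decode entities, and the names `annotateDocumentXml` and `CommentEntry` are hypothetical.

```typescript
// Hypothetical shape for one entry loaded from word/comments.xml.
interface CommentEntry {
  author: string;
  text: string;
}

// Simplified sketch: regex tokenization, no entity decoding (e.g. &amp;),
// no handling of tables, footnotes, or nested revisions.
function annotateDocumentXml(
  xml: string,
  comments: Map<string, CommentEntry>
): string {
  // Alternative 1: a tag (closing slash, name, attributes, self-closing slash).
  // Alternative 2: a run of text between tags.
  const token = /<(\/?)([\w:]+)([^>]*?)(\/?)>|([^<]+)/g;
  let out = "";
  let inText = false; // true while inside <w:t> or <w:delText>
  let m: RegExpExecArray | null;

  while ((m = token.exec(xml)) !== null) {
    const text = m[5];
    if (text !== undefined) {
      if (inText) out += text; // skip whitespace between structural tags
      continue;
    }
    const closing = m[1] === "/";
    const name = m[2];
    const attrs = m[3] ?? "";
    const selfClosing = m[4] === "/";

    if (name === "w:commentRangeStart") {
      // Splice the comment in at its anchor, keyed by w:id.
      const key = /w:id="([^"]+)"/.exec(attrs)?.[1];
      const comment = key !== undefined ? comments.get(key) : undefined;
      if (comment) out += `{>>by ${comment.author}: ${comment.text}<<}`;
    } else if (selfClosing) {
      continue; // e.g. an empty <w:t/> carries no text
    } else if (name === "w:ins") {
      out += closing ? "++}" : "{++";
    } else if (name === "w:del") {
      out += closing ? "--}" : "{--";
    } else if (name === "w:t" || name === "w:delText") {
      inText = !closing;
    } else if (name === "w:p" && closing) {
      out += "\n"; // one output line per paragraph
    }
  }
  return out;
}
```

For a paragraph that deletes `10%`, inserts `12%`, and anchors one comment, this sketch yields a line such as `Fee is {--10%--}{++12%++}{>>by Jo: Too high<<} per year`.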

PDF redlines are handled by a separate Python script, `backend/scripts/redline_extract.py`, which reads a PDF on stdin and emits marked-up text on stdout. It uses PyMuPDF to walk text spans and classifies each span's fill color against three targets: red (tolerance 30 Euclidean RGB units) becomes `{--deleted--}`, blue becomes `{++inserted++}`, green becomes `{<<moved>>}`. The Node side in `pdfRedlineExtract.ts` spawns it as a subprocess and falls back to pdfjs-dist text-only extraction if Python or PyMuPDF isn't available. The `PYTHON_BIN` environment variable sets the interpreter path.
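The color test can be illustrated in isolation. The fork does this in Python with PyMuPDF; the sketch below only restates the distance check in TypeScript. The reference colors (pure red, blue, and green), the use of one tolerance for all three targets, and every function name here are assumptions, since the writeup does not give the script's exact values.

```typescript
type Rgb = [number, number, number];
type Marker = "del" | "ins" | "moved";

// Assumed targets: pure red, blue, and green. The script's actual
// reference colors are not stated in the writeup.
const TARGETS: ReadonlyArray<{ marker: Marker; rgb: Rgb }> = [
  { marker: "del", rgb: [255, 0, 0] },
  { marker: "ins", rgb: [0, 0, 255] },
  { marker: "moved", rgb: [0, 255, 0] },
];

// Assumption: the same 30-unit tolerance applies to all three targets.
const TOLERANCE = 30;

// Unpack a 24-bit sRGB integer (0xRRGGBB) into its components.
function unpackSrgb(color: number): Rgb {
  return [(color >> 16) & 0xff, (color >> 8) & 0xff, color & 0xff];
}

// Euclidean distance between two colors in RGB space.
function distance(a: Rgb, b: Rgb): number {
  const dr = a[0] - b[0];
  const dg = a[1] - b[1];
  const db = a[2] - b[2];
  return Math.sqrt(dr * dr + dg * dg + db * db);
}

// Return the marker for the first target within tolerance, else null.
function classifyColor(rgb: Rgb): Marker | null {
  for (const target of TARGETS) {
    if (distance(rgb, target.rgb) <= TOLERANCE) return target.marker;
  }
  return null;
}

// Wrap a span's text in the marker that matches its color, if any.
function markSpan(text: string, rgb: Rgb): string {
  switch (classifyColor(rgb)) {
    case "del":
      return `{--${text}--}`;
    case "ins":
      return `{++${text}++}`;
    case "moved":
      return `{<<${text}>>}`;
    default:
      return text; // unmatched colors pass through as plain text
  }
}
```

Under these assumptions a slightly off-red span such as `(250, 10, 5)` is still treated as a deletion, while a darker red such as `(200, 0, 0)` sits 55 units from the target and passes through unmarked, which is the silent-miss behavior to watch for with custom color schemes.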

A few things to check before pulling this in. First, the new `extractDocxBodyText` produces marker-annotated text, so it must not be passed to the edit-planning anchor matcher (`flattenParagraph`), which still expects accepted-view text. The fork is explicit about this split, but you need to carry it across any downstream call sites. Second, the PDF detector is purely color-based: documents whose redline colors fall outside the 30-unit tolerance window (custom color schemes, scanned PDFs, some older Litera exports) will silently return plain text with no error. Third, the commit bundles an unrelated `storage.ts` fix (`forcePathStyle: true`) and cosmetic sidebar changes, which are worth splitting out on cherry-pick.
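One way to guard the first point at a downstream call site is to detect or strip markers before text reaches anything that expects accepted-view input. The helpers below are hypothetical and not part of the fork. They assume markers are never nested and that moved text belongs in the accepted view, and the result is not guaranteed to match what `flattenParagraph` sees, since whitespace handling may differ.

```typescript
// True if the text still contains any opening redline or comment marker.
function hasMarkers(text: string): boolean {
  return /\{(\+\+|--|<<|>>)/.test(text);
}

// Reduce marker-annotated text to an approximate accepted view:
// deletions and comments are dropped, insertions and moves are unwrapped.
// Assumes markers are not nested inside one another.
function toAcceptedView(text: string): string {
  return text
    .replace(/\{--[\s\S]*?--\}/g, "") // drop deleted text
    .replace(/\{>>[\s\S]*?<<\}/g, "") // drop comment bubbles
    .replace(/\{\+\+([\s\S]*?)\+\+\}/g, "$1") // keep inserted text
    .replace(/\{<<([\s\S]*?)>>\}/g, "$1"); // keep moved text (assumption)
}
```

A call site could check `hasMarkers(text)` and refuse or convert before anchor matching, rather than letting marker syntax leak into edit planning.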

So what

Worth a close look if your users need to interrogate redlined contracts or review comment threads. The implementation is self-contained enough to extract as three files (`docxTrackedChanges.ts`, `pdfRedlineExtract.ts`, and the Python script) with targeted edits to the two call sites. Skip if your documents are already in accepted view or if a Python runtime on the backend host is a deployment constraint you can't meet.

Commits in this thread

1 commit from jamietso/mike-redline, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`394f2ba2`	Add redline-aware DOCX/PDF extraction and comment-bubble support	Jamie Tso	2026-05-05

commit body

Feeds tracked changes and review comments to the LLM as inline markers
instead of stripping them ("accepted view"). Closes the redline-reading
gap that closed-source legal AI products like Harvey and Legora ship as
a paid feature.

DOCX
- extractDocxBodyText (lib/docxTrackedChanges.ts): walks document.xml and
  emits {++ins++} / {--del--} for w:ins/w:del, and {>>by AUTHOR: text<<}
  for comment bubbles loaded once from word/comments.xml.
- tabular's extractDocxMarkdown switches from mammoth to the same
  redline-aware extractor so column extraction sees redlines too.

PDF
- New scripts/redline_extract.py uses PyMuPDF to detect color-based
  redlines per text span: red/strikethrough -> {--del--},
  blue/underline -> {++ins++}, green -> {<<moved>>}. Algorithm ported
  from Diff Master's browserPyMuPdfProcessor (Pyodide), now spawned as a
  Node subprocess via lib/pdfRedlineExtract.ts. Falls back to pdfjs-dist
  text-only extraction if Python or pymupdf are unavailable.
- extractPdfMarkdown (tabular) and extractPdfText (chatTools) both call
  the new extractor first.

Prompts
- chatTools SYSTEM_PROMPT and tabular EXTRACTION_SYSTEM / SYSTEM all
  document the {++/--/<<>>}, {>>...<<} markers so the LLM knows how to
  read them and what "current" vs "original" means.

Misc
- storage.ts: forcePathStyle: true on the S3 client so MinIO and other
  path-style S3 endpoints work locally without subdomain DNS.
- Sidebar / layout / site-logo: brand reads "Mike (v2)" so side-by-side
  comparisons against upstream are unambiguous.
- backend/.env.example: PYTHON_BIN documented; pymupdf install line in
  README.

Adds Python 3.10+ + pymupdf as an optional runtime dep - extractor
gracefully no-ops to text-only if either is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
