jamietso teaches Mike to read redlines

Tracked changes and reviewer comments now survive the trip from document to model - instead of being silently flattened away.

contract-review, discovery

Most AI tools, when handed a Word doc or PDF full of tracked changes, quietly accept everything and feed the model a clean version. The lawyer's actual work - what was struck, what was inserted, who said what in the margin - disappears before the AI ever sees it. jamietso's fork fixes that. Both Word and PDF extractors now preserve insertions, deletions, moves, and reviewer comments as inline markers, and the model is taught what those markers mean so it can reason about the markup itself.
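As a rough illustration of the Word side, the sketch below walks a document.xml string and emits the inline markers named in the commit body further down ({++ins++}, {--del--}, {>>by AUTHOR: text<<}). It is written in Python with the standard library only, not the fork's TypeScript, and the function names (load_comments, extract_marked_text) are hypothetical. It assumes comments are anchored by w:commentReference elements and ignores moves, tables, and nested revisions.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml and word/comments.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag: str) -> str:
    """Return the namespace-qualified name ElementTree uses for w:<tag>."""
    return f"{{{W}}}{tag}"


def load_comments(comments_xml: str) -> dict:
    """Map comment id -> 'AUTHOR: text' (hypothetical helper)."""
    out = {}
    if not comments_xml:
        return out
    for c in ET.fromstring(comments_xml).iter(q("comment")):
        text = "".join(t.text or "" for t in c.iter(q("t")))
        out[c.get(q("id"))] = f"{c.get(q('author'), 'unknown')}: {text}"
    return out


def _render(node, comments, out):
    tag = node.tag
    if tag == q("ins"):
        inner = []
        for child in node:
            _render(child, comments, inner)
        text = "".join(inner)
        if text:
            out.append("{++" + text + "++}")
    elif tag == q("del"):
        # Deleted runs store their text in w:delText rather than w:t.
        text = "".join(t.text or "" for t in node.iter(q("delText")))
        if text:
            out.append("{--" + text + "--}")
    elif tag == q("t"):
        out.append(node.text or "")
    elif tag == q("commentReference"):
        body = comments.get(node.get(q("id")))
        if body:
            out.append("{>>by " + body + "<<}")
    else:
        for child in node:
            _render(child, comments, out)
        if tag == q("p"):
            out.append("\n")  # one line per paragraph


def extract_marked_text(document_xml: str, comments_xml: str = "") -> str:
    """Return body text with tracked changes and comments as inline markers."""
    out = []
    _render(ET.fromstring(document_xml), load_comments(comments_xml), out)
    return "".join(out).strip()
```

In a real .docx both XML parts would be read from the zip archive (word/document.xml and word/comments.xml) before being passed in.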

The PDF side is ported from a side project of jamietso's and detects changes visually from color cues: red or strikethrough text is treated as a deletion, blue or underlined text as an insertion, and green text as a move. The author frames this bluntly as closing a gap, since closed-source legal AI products such as Harvey and Legora ship redline reading as a paid feature.
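The color-cue idea can be sketched in a few lines of Python. This is not the fork's scripts/redline_extract.py: the dominance thresholds, the helper names (classify_color, mark_spans, extract_pdf), and the choice to look at span color only (ignoring strikethrough and underline geometry) are all assumptions made for illustration. The span dictionaries follow the shape PyMuPDF returns from page.get_text("dict"), where "color" is a packed sRGB integer.

```python
# Marker pairs taken from the commit body; everything else is illustrative.
WRAP = {"del": ("{--", "--}"), "ins": ("{++", "++}"), "move": ("{<<", ">>}")}


def classify_color(color: int) -> str:
    """Classify a packed sRGB int as 'del', 'ins', 'move', or 'plain'."""
    r, g, b = (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF
    ranked = sorted((r, g, b), reverse=True)
    # Assumed heuristic: one channel must be bright and clearly dominant.
    if ranked[0] < 120 or ranked[0] - ranked[1] < 60:
        return "plain"
    if ranked[0] == r:
        return "del"   # red   -> deletion
    if ranked[0] == b:
        return "ins"   # blue  -> insertion
    return "move"      # green -> moved text


def mark_spans(spans) -> str:
    """Join one line's spans, wrapping colored spans in redline markers."""
    parts = []
    for span in spans:
        text, kind = span["text"], classify_color(span["color"])
        if kind == "plain" or not text.strip():
            parts.append(text)
        else:
            open_, close = WRAP[kind]
            parts.append(open_ + text + close)
    return "".join(parts)


def extract_pdf(path: str) -> str:
    """Run mark_spans over every text line of a PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    lines = []
    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    lines.append(mark_spans(line["spans"]))
    return "\n".join(lines)
```

Keeping the classification separate from the PyMuPDF call means a caller can catch an ImportError from extract_pdf and fall back to plain text extraction, which is the behavior the commit body describes.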

So what: If your lawyers live in redlines, this is the upstream feature worth watching - an AI that can actually read the markup, not just the clean draft underneath.



Commits in this thread

1 commit from jamietso/mike-redline, oldest first. Source extracted verbatim from the harvested git log.

SHA | Subject | Author | Date
394f2ba2 | Add redline-aware DOCX/PDF extraction and comment-bubble support | Jamie Tso | 2026-05-05
commit body
Feeds tracked changes and review comments to the LLM as inline markers
instead of stripping them ("accepted view"). Closes the redline-reading
gap that closed-source legal AI products like Harvey and Legora ship as
a paid feature.

DOCX
- extractDocxBodyText (lib/docxTrackedChanges.ts): walks document.xml and
  emits {++ins++} / {--del--} for w:ins/w:del, and {>>by AUTHOR: text<<}
  for comment bubbles loaded once from word/comments.xml.
- tabular's extractDocxMarkdown switches from mammoth to the same
  redline-aware extractor so column extraction sees redlines too.

PDF
- New scripts/redline_extract.py uses PyMuPDF to detect color-based
  redlines per text span: red/strikethrough -> {--del--},
  blue/underline -> {++ins++}, green -> {<<moved>>}. Algorithm ported
  from Diff Master's browserPyMuPdfProcessor (Pyodide), now spawned as a
  Node subprocess via lib/pdfRedlineExtract.ts. Falls back to pdfjs-dist
  text-only extraction if Python or pymupdf are unavailable.
- extractPdfMarkdown (tabular) and extractPdfText (chatTools) both call
  the new extractor first.

Prompts
- chatTools SYSTEM_PROMPT and tabular EXTRACTION_SYSTEM / SYSTEM all
  document the {++/--/<<>>}, {>>...<<} markers so the LLM knows how to
  read them and what "current" vs "original" means.

Misc
- storage.ts: forcePathStyle: true on the S3 client so MinIO and other
  path-style S3 endpoints work locally without subdomain DNS.
- Sidebar / layout / site-logo: brand reads "Mike (v2)" so side-by-side
  comparisons against upstream are unambiguous.
- backend/.env.example: PYTHON_BIN documented; pymupdf install line in
  README.

Adds Python 3.10+ + pymupdf as an optional runtime dep - extractor
gracefully no-ops to text-only if either is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
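The Prompts section of the commit body says the model is told what "current" and "original" mean for marked-up text. The commit does not spell out those definitions, so the Python sketch below encodes one plausible reading as an assumption: "current" accepts every change (insertions kept, deletions dropped) and "original" rejects every change. Moved text is kept in both views because the marker does not record where it came from. The function names are hypothetical.

```python
import re

# Marker syntax as listed in the commit body.
DEL = re.compile(r"\{--(.*?)--\}", re.S)
INS = re.compile(r"\{\+\+(.*?)\+\+\}", re.S)
MOVE = re.compile(r"\{<<(.*?)>>\}", re.S)
COMMENT = re.compile(r"\{>>(.*?)<<\}", re.S)


def current_view(marked: str) -> str:
    """Text with all changes accepted (assumed meaning of 'current')."""
    text = COMMENT.sub("", marked)    # comments are not part of the body
    text = DEL.sub("", text)          # drop deleted text
    text = INS.sub(r"\1", text)       # keep inserted text
    return MOVE.sub(r"\1", text)      # keep moved text where it now sits


def original_view(marked: str) -> str:
    """Text with all changes rejected (assumed meaning of 'original')."""
    text = COMMENT.sub("", marked)
    text = INS.sub("", text)          # drop inserted text
    text = DEL.sub(r"\1", text)       # restore deleted text
    return MOVE.sub(r"\1", text)      # source position is unknown, so keep in place


def comments(marked: str) -> list:
    """Return comment-bubble bodies in document order."""
    return COMMENT.findall(marked)
```

For example, "Fee is {--10%--}{++12%++}." yields "Fee is 12%." as the current view and "Fee is 10%." as the original view.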

Capture this thread into my fork

Download a single Markdown prompt that tells Claude how to port every commit above into your working tree, adapting paths and structure to match your repo. Run it via claude -p < capture-thread-102.md from inside the repo you want the changes in.
