Add redline-aware DOCX/PDF extraction and comment-bubble support

Jamie Tso · 2026-05-05 · 394f2ba2

Feeds tracked changes and review comments to the LLM as inline markers
instead of stripping them (the "accepted view"). Closes the
redline-reading gap with closed-source legal AI products such as Harvey
and Legora, which ship this as a paid feature.

DOCX
- extractDocxBodyText (lib/docxTrackedChanges.ts): walks document.xml and
  emits {++ins++} / {--del--} for w:ins/w:del, and {>>by AUTHOR: text<<}
  for comment bubbles loaded once from word/comments.xml.
- tabular's extractDocxMarkdown switches from mammoth to the same
  redline-aware extractor so column extraction sees redlines too.
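
A minimal Python sketch of the walk described above, for illustration only.
The commit's extractor is TypeScript (lib/docxTrackedChanges.ts); the function
names here (load_comments, render, extract_body_text) are hypothetical, and
the sketch assumes the caller has already read word/document.xml and
word/comments.xml out of the .docx archive as strings.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by document.xml and comments.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _q(name: str) -> str:
    """Qualified tag/attribute name in ElementTree's {namespace}name form."""
    return f"{{{W}}}{name}"


def load_comments(comments_xml: str) -> dict[str, tuple[str, str]]:
    """Map comment id -> (author, text), parsed once up front."""
    comments = {}
    for c in ET.fromstring(comments_xml).iter(_q("comment")):
        text = "".join(t.text or "" for t in c.iter(_q("t")))
        comments[c.get(_q("id"))] = (c.get(_q("author"), ""), text)
    return comments


def render(node: ET.Element, comments: dict[str, tuple[str, str]]) -> str:
    """Recursively render a node, wrapping w:ins / w:del in inline markers."""
    parts = []
    for child in node:
        if child.tag == _q("ins"):
            parts.append("{++" + render(child, comments) + "++}")
        elif child.tag == _q("del"):
            parts.append("{--" + render(child, comments) + "--}")
        elif child.tag in (_q("t"), _q("delText")):
            parts.append(child.text or "")
        elif child.tag == _q("commentReference"):
            hit = comments.get(child.get(_q("id")))
            if hit:
                parts.append("{>>by " + hit[0] + ": " + hit[1] + "<<}")
        else:
            parts.append(render(child, comments))
    return "".join(parts)


def extract_body_text(document_xml: str, comments_xml: str = "") -> str:
    """One output line per w:p (nested text boxes are not de-duplicated)."""
    comments = load_comments(comments_xml) if comments_xml else {}
    body = ET.fromstring(document_xml).find(_q("body"))
    return "\n".join(render(p, comments) for p in body.iter(_q("p")))
```

Anchoring the comment bubble at w:commentReference is a choice made for this
sketch; the commit may anchor it at the start or end of the commented range
instead.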

PDF
- New scripts/redline_extract.py uses PyMuPDF to detect color-based
  redlines per text span: red/strikethrough -> {--del--},
  blue/underline -> {++ins++}, green -> {<<moved>>}. The algorithm is
  ported from Diff Master's browserPyMuPdfProcessor (Pyodide) and now
  runs as a Node-spawned subprocess via lib/pdfRedlineExtract.ts. Falls
  back to pdfjs-dist text-only extraction if Python or pymupdf is
  unavailable.
- extractPdfMarkdown (tabular) and extractPdfText (chatTools) both call
  the new extractor first.
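
The per-span color mapping above could look roughly like the sketch below.
It is not the code in scripts/redline_extract.py. It assumes spans shaped
like PyMuPDF's get_text("dict") output, where "color" is a packed sRGB
integer; the "strikethrough" and "underline" keys, the channel margin, and
the function names are all hypothetical. Treating color or decoration as
sufficient on its own is also an assumption.

```python
from itertools import groupby

# Marker pairs taken from the commit message.
MARKERS = {
    "del": ("{--", "--}"),
    "ins": ("{++", "++}"),
    "moved": ("{<<", ">>}"),
}


def split_rgb(color: int) -> tuple[int, int, int]:
    """Unpack a 0xRRGGBB integer into (r, g, b) channel values."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


def classify_span(span: dict, margin: int = 80) -> str:
    """Return 'del', 'ins', 'moved' or 'plain' for one text span.

    A channel counts as dominant when it exceeds both others by `margin`
    (a hypothetical threshold chosen for this sketch).
    """
    r, g, b = split_rgb(span.get("color", 0))
    if span.get("strikethrough") or r - max(g, b) >= margin:
        return "del"
    if span.get("underline") or b - max(r, g) >= margin:
        return "ins"
    if g - max(r, b) >= margin:
        return "moved"
    return "plain"


def render_spans(spans: list[dict]) -> str:
    """Merge adjacent spans of the same kind and wrap each run in markers."""
    out = []
    for kind, run in groupby(spans, key=classify_span):
        text = "".join(s["text"] for s in run)
        open_, close = MARKERS.get(kind, ("", ""))
        out.append(open_ + text + close)
    return "".join(out)
```

Merging adjacent spans before wrapping keeps a multi-span deletion from
turning into a string of back-to-back {--...--} markers.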

Prompts
- chatTools SYSTEM_PROMPT and tabular EXTRACTION_SYSTEM / SYSTEM all
  document the {++...++}, {--...--}, {<<...>>} and {>>...<<} markers so
  the LLM knows how to read them and what "current" vs "original" means.
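
To make "current" vs "original" concrete, here is a small sketch that
resolves marked-up text into either view. It is an illustration of the
marker semantics, not code from this commit: the function names are
hypothetical, and keeping moved text in both views while dropping comment
bubbles is an assumption.

```python
import re

INS = re.compile(r"\{\+\+(.*?)\+\+\}", re.S)
DEL = re.compile(r"\{--(.*?)--\}", re.S)
MOVED = re.compile(r"\{<<(.*?)>>\}", re.S)
COMMENT = re.compile(r"\{>>.*?<<\}", re.S)


def current_text(marked: str) -> str:
    """Text as it reads with every tracked change accepted."""
    text = COMMENT.sub("", marked)   # comments are not body text
    text = DEL.sub("", text)         # deletions disappear
    text = INS.sub(r"\1", text)      # insertions stay
    return MOVED.sub(r"\1", text)    # assumption: moved text is kept


def original_text(marked: str) -> str:
    """Text as it read before any tracked change was made."""
    text = COMMENT.sub("", marked)
    text = INS.sub("", text)         # insertions did not exist yet
    text = DEL.sub(r"\1", text)      # deletions were still present
    return MOVED.sub(r"\1", text)    # assumption: moved text is kept
```

For example, "Fee is {--10--}{++12++}%." reads "Fee is 12%." in the current
view and "Fee is 10%." in the original view.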

Misc
- storage.ts: forcePathStyle: true on the S3 client so MinIO and other
  path-style S3 endpoints work locally without subdomain DNS.
- Sidebar / layout / site-logo: brand reads "Mike (v2)" so side-by-side
  comparisons against upstream are unambiguous.
- backend/.env.example: PYTHON_BIN documented; pymupdf install line in
  README.

Adds Python 3.10+ and pymupdf as optional runtime dependencies; the
extractor falls back to text-only extraction if either is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repository jamietso/mike-redline
Author Jamie Tso <jamietso@gmail.com>
Parents d9690965
Stats 11 files changed, +361, -43
Part of Redline-aware DOCX/PDF extraction
