Add redline-aware DOCX/PDF extraction and comment-bubble support
Feeds tracked changes and review comments to the LLM as inline markers
instead of stripping them (the "accepted view"). Closes the
redline-reading gap with closed-source legal AI products like Harvey and
Legora, which ship it as a paid feature.
DOCX
- extractDocxBodyText (lib/docxTrackedChanges.ts): walks document.xml and
emits {++ins++} / {--del--} for w:ins/w:del, and {>>by AUTHOR: text<<}
for comment bubbles loaded once from word/comments.xml.
- tabular's extractDocxMarkdown switches from mammoth to the same
redline-aware extractor so column extraction sees redlines too.
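
A minimal sketch of the marker emission described above, assuming a simplified document.xml fragment and a regex tokenizer instead of a real XML parser. `renderRedlines` and `CommentMap` are hypothetical names for illustration; the actual extractDocxBodyText may be structured differently.

```typescript
// Hypothetical sketch, not the actual extractDocxBodyText implementation.
// Walks a simplified WordprocessingML fragment token by token and wraps
// w:ins / w:del content in {++...++} / {--...--} markers.
// XML entities (e.g. &amp;) are left undecoded to keep the sketch short.

type CommentMap = Map<string, { author: string; text: string }>;

function renderRedlines(xml: string, comments: CommentMap = new Map()): string {
  const tagOrText = /<[^>]+>|[^<]+/g;
  let out = "";
  let inText = false; // true while inside <w:t> or <w:delText>

  for (const token of xml.match(tagOrText) ?? []) {
    if (!token.startsWith("<")) {
      if (inText) out += token; // whitespace between tags is ignored
      continue;
    }
    const closing = token.startsWith("</");
    const selfClosing = token.endsWith("/>");
    const name = token.replace(/^<\/?/, "").split(/[\s/>]/)[0];

    switch (name) {
      case "w:ins":
        if (!selfClosing) out += closing ? "++}" : "{++";
        break;
      case "w:del":
        if (!selfClosing) out += closing ? "--}" : "{--";
        break;
      case "w:t":
      case "w:delText":
        inText = !closing && !selfClosing;
        break;
      case "w:commentReference": {
        // Comments are looked up in a map built once from word/comments.xml.
        const id = /w:id="([^"]*)"/.exec(token)?.[1];
        const c = id !== undefined ? comments.get(id) : undefined;
        if (c) out += `{>>by ${c.author}: ${c.text}<<}`;
        break;
      }
    }
  }
  return out;
}
```

Building the comment map once and passing it in keeps the body walk a single pass, which matches the "loaded once" note above.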
PDF
- New scripts/redline_extract.py uses PyMuPDF to detect color-based
redlines per text span: red/strikethrough -> {--del--},
blue/underline -> {++ins++}, green -> {<<moved>>}. The algorithm is
ported from Diff Master's browserPyMuPdfProcessor (Pyodide); the script
now runs as a Node subprocess spawned by lib/pdfRedlineExtract.ts. Falls
back to pdfjs-dist text-only extraction if Python or pymupdf is unavailable.
- extractPdfMarkdown (tabular) and extractPdfText (chatTools) both call
the new extractor first.
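
The per-span mapping above can be sketched as follows. This is written in TypeScript for illustration only (the real logic lives in the Python script), and it assumes each span carries a packed sRGB color integer plus strikethrough/underline flags. The `Span` shape, the `classify` and `mark` names, and the color thresholds are all assumptions, as is treating "red/strikethrough" as "red or struck through".

```typescript
// Hypothetical sketch of the color-to-marker mapping. Thresholds and the
// Span shape are assumptions, not taken from scripts/redline_extract.py.

interface Span {
  text: string;
  color: number; // packed sRGB integer, 0xRRGGBB
  strikethrough?: boolean;
  underline?: boolean;
}

type Kind = "del" | "ins" | "moved" | "plain";

function classify(span: Span): Kind {
  const r = (span.color >> 16) & 0xff;
  const g = (span.color >> 8) & 0xff;
  const b = span.color & 0xff;
  // A channel "dominates" when it is bright and the other two are dim.
  const dominant = (c: number, x: number, y: number) => c > 150 && x < 100 && y < 100;

  if (span.strikethrough || dominant(r, g, b)) return "del";
  if (span.underline || dominant(b, r, g)) return "ins";
  if (dominant(g, r, b)) return "moved";
  return "plain";
}

function mark(span: Span): string {
  switch (classify(span)) {
    case "del":
      return `{--${span.text}--}`;
    case "ins":
      return `{++${span.text}++}`;
    case "moved":
      return `{<<${span.text}>>}`;
    default:
      return span.text;
  }
}
```

Treating underline alone as an insertion can misfire on hyperlinks or underlined headings, so a stricter variant might require both the color and the decoration.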
Prompts
- chatTools SYSTEM_PROMPT and tabular EXTRACTION_SYSTEM / SYSTEM all
document the {++/--/<<>>}, {>>...<<} markers so the LLM knows how to
read them and what "current" vs "original" means.
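
To make the "current" vs "original" distinction concrete, here is a small sketch that resolves the markers into both readings. It assumes "current" means all changes accepted and "original" means all changes rejected, and that moved text stays in place in both; `currentText` and `originalText` are hypothetical helpers, not part of this commit.

```typescript
// Hypothetical helpers: resolve inline redline markers into two readings.
// Assumption: "current" = changes accepted, "original" = changes rejected.

function currentText(marked: string): string {
  return marked
    .replace(/\{--[\s\S]*?--\}/g, "") // drop deletions
    .replace(/\{>>[\s\S]*?<<\}/g, "") // drop comment bubbles
    .replace(/\{\+\+([\s\S]*?)\+\+\}/g, "$1") // keep insertions
    .replace(/\{<<([\s\S]*?)>>\}/g, "$1"); // keep moved text
}

function originalText(marked: string): string {
  return marked
    .replace(/\{\+\+[\s\S]*?\+\+\}/g, "") // drop insertions
    .replace(/\{>>[\s\S]*?<<\}/g, "") // drop comment bubbles
    .replace(/\{--([\s\S]*?)--\}/g, "$1") // keep deletions
    .replace(/\{<<([\s\S]*?)>>\}/g, "$1"); // keep moved text
}
```

For example, given `Fee is {--10%--}{++12%++} per year.`, the current reading has 12% and the original reading has 10%.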
Misc
- storage.ts: forcePathStyle: true on the S3 client so MinIO and other
path-style S3 endpoints work locally without subdomain DNS.
- Sidebar / layout / site-logo: brand reads "Mike (v2)" so side-by-side
comparisons against upstream are unambiguous.
- backend/.env.example: PYTHON_BIN documented; pymupdf install line in
README.
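
For context on the forcePathStyle change, a rough sketch of the two URL shapes involved. The endpoint, region, and bucket values are placeholders, and `objectUrl` is a hypothetical helper that only illustrates the addressing difference; the real requests are built by the S3 client.

```typescript
// Illustrative config: only forcePathStyle comes from this commit; the
// endpoint and region are placeholder values for a local MinIO instance.
const s3Config = {
  endpoint: "http://localhost:9000",
  region: "us-east-1",
  forcePathStyle: true,
};

// Hypothetical helper showing where the bucket name ends up in the URL.
function objectUrl(endpoint: string, bucket: string, key: string, forcePathStyle: boolean): string {
  const [scheme, host] = endpoint.split("://");
  return forcePathStyle
    ? `${scheme}://${host}/${bucket}/${key}` // path-style: bucket in the path
    : `${scheme}://${bucket}.${host}/${key}`; // virtual-hosted: bucket in the hostname
}
```

With path-style addressing the hostname stays `localhost:9000`, so no per-bucket subdomain has to resolve locally.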
Adds Python 3.10+ and pymupdf as an optional runtime dependency. If
either is missing, the extractor falls back to text-only extraction.
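
A minimal sketch of that fallback, assuming the script takes the PDF path as its only argument and writes the marked-up text to stdout. That CLI shape, the `extractPdf` name, and the use of `spawnSync` (chosen here for brevity) are assumptions; lib/pdfRedlineExtract.ts may work differently.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical wrapper: try the Python redline extractor, fall back to a
// caller-supplied text-only extractor if Python or pymupdf is unavailable.
function extractPdf(
  pdfPath: string,
  textOnly: (path: string) => string,
  pythonBin: string = process.env.PYTHON_BIN ?? "python3",
): { text: string; redlineAware: boolean } {
  const res = spawnSync(pythonBin, ["scripts/redline_extract.py", pdfPath], {
    encoding: "utf8",
  });
  // res.error is set when the binary cannot be started (e.g. ENOENT);
  // a non-zero status covers a failed pymupdf import inside the script.
  if (res.error || res.status !== 0) {
    return { text: textOnly(pdfPath), redlineAware: false };
  }
  return { text: res.stdout, redlineAware: true };
}
```

Returning a `redlineAware` flag lets callers log or surface when the text-only path was taken.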
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Repository | jamietso/mike-redline |
|---|---|
| Author | Jamie Tso <jamietso@gmail.com> |
| Authored | |
| Parents | d9690965 |
| Stats | 11 files changed, +361, -43 |
| Part of | Redline-aware DOCX/PDF extraction |