Redline-aware extraction: DOCX tracked changes and PDF color annotations surfaced as inline markers

JonasBoury added end-to-end support for preserving review markup through the document ingest pipeline. DOCX tracked changes and reviewer comments now reach the LLM as inline markers rather than being flattened away; PDFs with color-based redlines (Litera/Workshare style) get a PyMuPDF extractor, with the existing pdfjs path kept as a fallback.

The DOCX side lives in backend/src/lib/docxTrackedChanges.ts, which walks w:ins, w:del, and w:commentRangeStart elements. Insertions become {++text++}, deletions become {--text--}, and comments become {>>by AUTHOR: comment<<}, with author names and comment content pulled from word/comments.xml. Documents with no tracked changes still go through the existing Mammoth extraction path, unchanged.
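To make the marker mapping concrete, here is a minimal sketch in Python (the real implementation is the TypeScript module named above, which is not reproduced here). The function names load_comments and paragraph_to_markers are hypothetical, and emitting the comment marker at the position of w:commentRangeStart is an assumption; the commit may place it elsewhere.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml and word/comments.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def q(tag: str) -> str:
    """Return the namespace-qualified name ElementTree uses for a w: tag or attribute."""
    return "{" + W + "}" + tag


def load_comments(comments_xml: str) -> dict:
    """Map comment id -> (author, text) from the contents of word/comments.xml."""
    root = ET.fromstring(comments_xml)
    comments = {}
    for c in root.findall("w:comment", NS):
        text = "".join(t.text or "" for t in c.iter(q("t")))
        comments[c.get(q("id"))] = (c.get(q("author"), "unknown"), text)
    return comments


def paragraph_to_markers(p: ET.Element, comments: dict) -> str:
    """Render one w:p element as text with inline redline markers (hypothetical helper)."""
    parts = []
    for child in p:
        if child.tag == q("r"):
            # Plain run: keep its text as-is.
            parts.append("".join(t.text or "" for t in child.iter(q("t"))))
        elif child.tag == q("ins"):
            text = "".join(t.text or "" for t in child.iter(q("t")))
            if text:
                parts.append("{++" + text + "++}")
        elif child.tag == q("del"):
            # Deleted text is stored in w:delText, not w:t.
            text = "".join(t.text or "" for t in child.iter(q("delText")))
            if text:
                parts.append("{--" + text + "--}")
        elif child.tag == q("commentRangeStart"):
            # Assumption: the comment marker is emitted where the range starts.
            entry = comments.get(child.get(q("id")))
            if entry:
                parts.append("{>>by " + entry[0] + ": " + entry[1] + "<<}")
    return "".join(parts)
```

This sketch only looks at direct children of a paragraph; tracked changes nested inside tables, hyperlinks, or other containers would need a recursive walk.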

The PDF side is a new Python script at backend/scripts/redline_extract.py, spawned via backend/src/lib/pdfRedlineExtract.ts with a 30-second timeout and a fallback to pdfjs on any error. The script uses PyMuPDF to read each span's color field (a packed sRGB integer), decodes it to RGB, and classifies spans against three targets with a 30-unit Euclidean tolerance: blue → insertion, red → deletion, green → moved. The commit header notes this is "ported from Diff Master's browserPyMuPdfProcessor.ts (Pyodide+PyMuPDF)", so it is the same algorithm moved from WASM to a subprocess spawned from Node. Set PYTHON_BIN if you need a specific interpreter; it defaults to python3.
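The classification step can be sketched roughly as follows. This is not the code from redline_extract.py: the exact target RGB values are not stated in the summary, so pure blue, red, and green are assumed here, as is treating the 30-unit tolerance as inclusive. The names TARGETS, decode_srgb, and classify_color are hypothetical.

```python
import math

# Assumed target colors; the commit summary only names blue, red, and green.
TARGETS = {
    "insertion": (0, 0, 255),  # blue
    "deletion": (255, 0, 0),   # red
    "moved": (0, 255, 0),      # green
}
TOLERANCE = 30.0  # Euclidean distance in RGB space, per the commit summary


def decode_srgb(color: int) -> tuple:
    """Unpack a 0xRRGGBB integer (the form PyMuPDF reports for span colors)."""
    return ((color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF)


def classify_color(color: int):
    """Return the closest label within TOLERANCE, or None for ordinary text."""
    rgb = decode_srgb(color)
    best_label, best_dist = None, TOLERANCE
    for label, target in TARGETS.items():
        dist = math.dist(rgb, target)
        if dist <= best_dist:  # assumption: the tolerance bound is inclusive
            best_label, best_dist = label, dist
    return best_label
```

For example, classify_color(0x1010F0) falls about 27 units from pure blue and is treated as an insertion, while black body text (0x000000) is far from every target and comes back as None.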

Both chatTools.SYSTEM_PROMPT and the tabular extraction prompts (Gemini and Anthropic variants) are extended with a section explaining the marker format and how to handle it: accept insertions, drop deletions, treat comments as marginalia, and strip markers from extracted values unless the user asks for them.
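In the commit these rules are instructions to the model, not code, but their intended effect on an extracted value can be illustrated with a small sketch. The function resolve_markers is hypothetical and assumes markers are not nested.

```python
import re

# Patterns for the three inline marker forms described above.
INSERTION = re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL)
DELETION = re.compile(r"\{--(.*?)--\}", re.DOTALL)
COMMENT = re.compile(r"\{>>(.*?)<<\}", re.DOTALL)


def resolve_markers(text: str) -> str:
    """Return the 'accepted' reading: keep insertions, drop deletions and comments."""
    text = COMMENT.sub("", text)    # comments are marginalia, not document text
    text = DELETION.sub("", text)   # deleted text is dropped
    text = INSERTION.sub(lambda m: m.group(1), text)  # inserted text is kept, unwrapped
    return text


print(resolve_markers("Fee is {--10%--}{++12%++}{>>by Alice: Check with finance<<} per year."))
# prints "Fee is 12% per year."
```

The sketch does not tidy up doubled spaces left behind when a marker sat between two words.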

One unrelated change in the same commit: forcePathStyle: true on the R2 S3 client in storage.ts. Looks like a self-hosted MinIO compatibility tweak; harmless on real R2 but worth noting if you're reviewing the commit narrowly.
