feat(extraction): Phase 3.5a - MAR/Vitals lenses + DocView memory-leak fix

🟢 open · #4 · rmerk/mike ← rmerk/mike · opened 15d ago by rmerk · self · +1,740 -178 across 19 files

From the PR description

Summary

Two related changes on the med-mal extraction surface:

  1. Phase 3.5a - MAR + Vitals lenses (eb038cc). Generalizes Phase 3's chronology Timeline into per-lens views over the event log. New routes /projects/[id]/{mar,vitals}/[docId] mirror the timeline shape; medications and vitals JSON shapes are now locked in the extractor prompt with coerceMedications / coerceVitals validators dropping malformed entries. Default extraction model swapped to NVIDIA Catalog Kimi K2.6 (vision) via a new provider-dispatching completeMedMalExtractionPage with retry/backoff. Tuning knobs (concurrency, retry budget, reaper timeout, async mode) documented in CLAUDE.md for the multi-hour Epic ebook case.
  2. DocView memory-leak fix (a088b31). On the extraction page with a 3000-page PDF, repeated zoom/resize/bbox interactions and document switches retained worker-side PDF.js caches that the JS GC can't reclaim, so RSS climbed indefinitely. pdfDoc.destroy() and per-page cleanup() now run on the right transitions, and bbox-highlight changes no longer trigger a full re-render of every page canvas.
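
A minimal sketch of the release pattern described in item 2, not the code in a088b31. DocLike and PageLike are structural stand-ins for the PDF.js document and page proxies, modelling only the destroy() and cleanup() methods named above. DocSession, track and dispose are hypothetical names introduced for illustration.

```typescript
// Structural stand-ins for the PDF.js objects named in the PR description.
// Only the two methods mentioned there are modelled.
interface PageLike {
  cleanup(): boolean | void;
}
interface DocLike {
  destroy(): Promise<void>;
}

// HYPOTHETICAL helper: owns one loaded document and the page proxies fetched from it.
class DocSession {
  private doc: DocLike;
  private pages = new Map<number, PageLike>();
  private disposed = false;

  constructor(doc: DocLike) {
    this.doc = doc;
  }

  // Remember a page proxy so it can be released later.
  track(pageNumber: number, page: PageLike): void {
    const previous = this.pages.get(pageNumber);
    // If a different proxy was held for this page (e.g. before a zoom re-render), release it.
    if (previous && previous !== page) previous.cleanup();
    this.pages.set(pageNumber, page);
  }

  // Intended for document switch or component unmount. Safe to call twice.
  async dispose(): Promise<void> {
    if (this.disposed) return;
    this.disposed = true;
    for (const page of this.pages.values()) page.cleanup();
    this.pages.clear();
    await this.doc.destroy();
  }
}
```

In a React component, dispose() would plausibly be called from the cleanup function of the effect that loads the document, so that a document switch releases the previous session before the next one starts.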

Phase split

This PR ships Phase 3.5a only (MAR + Vitals - both columns already exist on document_events as jsonb, no SQL migration). Phases 3.5b (Labs - needs a new labs jsonb column) and 3.5c (Bills - likely needs a new document_charges table) are deferred to follow-up PRs.
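
Because medications and vitals live in untyped jsonb columns, the shape check has to happen in application code, which is the role the summary gives to coerceMedications / coerceVitals. The sketch below shows the general "drop malformed entries" pattern only. The VitalEntry fields are hypothetical, since the real shape is locked in the extractor prompt and not reproduced in this description, and coerceVitalsSketch is not the PR's function.

```typescript
// HYPOTHETICAL entry shape, for illustration only.
interface VitalEntry {
  name: string;
  value: number;
  unit: string | null;
}

function isRecord(x: unknown): x is Record<string, unknown> {
  return typeof x === "object" && x !== null && !Array.isArray(x);
}

// Accepts whatever came back from the model or the jsonb column
// and keeps only the entries that pass every check.
function coerceVitalsSketch(raw: unknown): VitalEntry[] {
  if (!Array.isArray(raw)) return [];
  const out: VitalEntry[] = [];
  for (const item of raw) {
    if (!isRecord(item)) continue;
    const { name, value, unit } = item;
    if (typeof name !== "string" || name.trim() === "") continue;
    if (typeof value !== "number" || !Number.isFinite(value)) continue;
    out.push({
      name: name.trim(),
      value,
      unit: typeof unit === "string" ? unit : null,
    });
  }
  return out;
}
```

Dropping a bad entry rather than rejecting the whole page keeps one malformed row from discarding the rest of an otherwise usable extraction.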

What's not fixed in the DocView change

The eager all-pages render (~10 GB of canvas pixel data on a 3K-page Epic ebook at 1× scale) is untouched. That needs page virtualization and is a separate change. The leak fixes here mean memory at least plateaus during a session instead of climbing with every interaction.
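
A back-of-the-envelope check of the ~10 GB figure. The page size and pixel density below are assumptions, not values taken from the PR: US Letter pages, 96 CSS pixels per inch at 1× scale, and 4 bytes per pixel for an RGBA canvas backing store.

```typescript
// ASSUMPTIONS (not from the PR): US Letter pages, 96 px per inch at 1x, RGBA at 4 bytes per pixel.
const PAGE_WIDTH_PX = 8.5 * 96; // 816
const PAGE_HEIGHT_PX = 11 * 96; // 1056
const BYTES_PER_PIXEL = 4;

// Total backing-store bytes if every page canvas is rendered and kept alive.
function estimateCanvasBytes(pageCount: number, scale: number): number {
  const width = Math.floor(PAGE_WIDTH_PX * scale);
  const height = Math.floor(PAGE_HEIGHT_PX * scale);
  return pageCount * width * height * BYTES_PER_PIXEL;
}

const bytes = estimateCanvasBytes(3000, 1);
console.log((bytes / 1e9).toFixed(1) + " GB"); // prints "10.3 GB"
```

Under these assumptions the cost grows with the square of the scale factor, so a 2× zoom would roughly quadruple the figure. That is why the eager render needs virtualization rather than further leak fixes.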

Test plan

  • Backend npm run build clean.
  • Frontend npx tsc --noEmit clean.
  • Frontend npm run lint - no new errors in changed files (one pre-existing scrollToHighlightOnPage declaration-order warning in DocView.tsx, unrelated).
  • Backend extraction tests green (43/43 per Phase 3.5a verification).
  • Manual e2e - MAR/Vitals lenses on an extracted med-mal doc:
    • + MAR and + Vitals Trend buttons appear on the project page when ≥1 PDF is fully extracted.
    • Single-PDF case routes directly; multi-PDF case opens DocPickerModal with the right target.
    • Row click on the right panel scrolls the PDF preview to source_page and overlays the bbox highlight.
  • Manual e2e - DocView memory leak on the extraction page (/projects/[id]/extraction):
    • Switch documents 5× in a row → DevTools Memory profile shows RSS plateauing instead of climbing per switch.
    • Click 10 events in the right panel → no full re-render flash; bbox overlay updates in place.
    • Pinch-zoom or trigger window resize → page-level renders happen but prior page proxies are released (verify: detached PDFPageProxy count in heap snapshot stays bounded).
  • Manual e2e - extraction throughput on a real Epic ebook:
    • Default MED_MAL_EXTRACTION_MODEL=moonshotai/kimi-k2.6 runs end-to-end without auth/4xx errors.
    • MED_MAL_MAIN_LOOP_CONCURRENCY=8 produces visible parallelism in extraction logs without rate-limiting.
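
For readers checking the throughput items, here is a generic sketch of the two mechanisms involved: retry with exponential backoff around a per-page call, and a bounded number of calls in flight. It is not the implementation of completeMedMalExtractionPage. withRetry and mapWithConcurrency are hypothetical names, and the backoff schedule is an assumption.

```typescript
// HYPOTHETICAL helper: retries `fn` up to `maxAttempts` times, doubling the delay each time.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// HYPOTHETICAL helper: runs `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

With a limit of 8, matching MED_MAL_MAIN_LOOP_CONCURRENCY=8, the logs should show up to eight pages in flight at once. Wrapping the worker in withRetry keeps a single transient provider error from failing the whole run.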
