feat(extraction): Phase 2 med-mal extraction pipeline

✅ merged · #2 · rmerk/mike ← rmerk/mike · opened 15d ago by rmerk · merged 15d ago by rmerk · self · +6,900 -21 across 44 files

From the PR description

Summary

Phase 2 med-mal extraction:

  • Postgres migrations 0002-0005.
  • patch_document_extraction_run with GRANTs.
  • Per-page Claude JSON extraction.
  • Raster + vision for pages with empty text layers (node-canvas, R2 keys per run, end-of-run sweep).
  • §145.64 vision-page peer-review prescan that halts before any event-extraction call when scanned markers are detected.
  • Optional queue mode (EXTRACTION_ASYNC_MODE=queue).
  • REST endpoints, UI, and chat tools.
  • Vitest and Supertest coverage (403/404/409).
  • Backend CI.
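As a rough illustration of the optional queue mode, the sketch below parses the two environment variables this description names (EXTRACTION_ASYNC_MODE and EXTRACTION_JOB_POLL_MS). The helper name, the "inline" label for the non-queue path, and the 2000 ms default are assumptions made for the example, not taken from the PR.

```typescript
// Hypothetical helper: the PR names the env vars but does not show how they are parsed.
type ExtractionMode = "inline" | "queue"; // "inline" is an assumed label for the non-queue path

interface ExtractionConfig {
  mode: ExtractionMode;
  pollMs: number;
}

// Assumed default; the PR does not state the real poll interval.
const DEFAULT_POLL_MS = 2000;

function readExtractionConfig(
  env: Record<string, string | undefined>,
): ExtractionConfig {
  // Only the exact value "queue" switches modes; anything else stays inline.
  const mode: ExtractionMode =
    env["EXTRACTION_ASYNC_MODE"] === "queue" ? "queue" : "inline";

  // Fall back to the default when the variable is missing, empty, or not a positive number.
  const parsed = Number(env["EXTRACTION_JOB_POLL_MS"]);
  const pollMs = Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_POLL_MS;

  return { mode, pollMs };
}
```

In a Node process this would typically be called as readExtractionConfig(process.env); a worker would then poll for jobs every pollMs milliseconds only when mode is "queue".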

Ops

  • Apply backend/migrations/0002-0005 on each Supabase environment.
  • Serverless: use queue mode + a worker process; set EXTRACTION_JOB_POLL_MS if needed.
  • The canvas native dependency (node-canvas) must be installed to rasterize scanned pages.
  • §145.64 vision prescan cost: on a 3K-page Epic with ~60% scanned pages, expect ~$5-7 in Claude marker-detection calls before event extraction begins. The prescan is unconditional by design (no kill-switch env var) so the compliance gate cannot be bypassed.
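The cost figure above can be turned into a rough per-page rate for sizing other documents. The sketch below only rearranges the numbers quoted in the Ops note (3K pages, ~60% scanned, ~$5-7); it assumes one marker-detection call per scanned page, which the PR does not state, and the function names are hypothetical.

```typescript
// Figures quoted in the Ops note.
const totalPages = 3000;
const scannedShare = 0.6;
const quotedCostUsd = { low: 5, high: 7 };

// Assumption: one Claude marker-detection call per scanned page.
const scannedPages = Math.round(totalPages * scannedShare); // 1800

// Implied cost per scanned page, in USD.
const perPageUsd = {
  low: quotedCostUsd.low / scannedPages, // ≈ 0.0028
  high: quotedCostUsd.high / scannedPages, // ≈ 0.0039
};

// Hypothetical estimator that scales the implied rate to another document.
function estimatePrescanCostUsd(
  pages: number,
  share: number,
): { low: number; high: number } {
  const scanned = Math.round(pages * share);
  return {
    low: scanned * perPageUsd.low,
    high: scanned * perPageUsd.high,
  };
}
```

Under the same assumption, a 1,000-page record with the same scanned share would come out at roughly a third of the quoted range.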

Compliance gate (closed in this PR)

The original prescan read only the text layer, so scanned pages whose peer-review markers (peer review, M&M conference, RCA report, and so on) were visible only in the raster bypassed the halt, and event rows could be written for protected content. The new peerReviewVisionPrescan.ts renders empty-text pages, asks Claude whether any canonical PEER_REVIEW_MARKERS phrase is visible, and halts via the existing red-flag insert path before any event-extraction call. Rasters are cached and reused by the main loop.
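The ordering described above (prescan first, extraction only if nothing is flagged, rasters reused afterwards) can be sketched as follows. This is not the code in peerReviewVisionPrescan.ts: every name here is hypothetical, the renderer and the Claude call are injected as plain functions, and stopping at the first flagged page is a choice made for the example.

```typescript
interface Page {
  pageNumber: number;
  text: string; // extracted text layer; empty for scanned pages
}

type Renderer = (page: Page) => Promise<Uint8Array>;
// Stands in for the vision call that asks whether a marker phrase is visible.
type MarkerDetector = (raster: Uint8Array) => Promise<boolean>;
type EventExtractor = (page: Page, raster?: Uint8Array) => Promise<void>;

interface PrescanResult {
  halted: boolean;
  flaggedPage?: number;
  rasters: Map<number, Uint8Array>; // cache reused by the main loop
}

async function visionPrescan(
  pages: Page[],
  render: Renderer,
  detect: MarkerDetector,
): Promise<PrescanResult> {
  const rasters = new Map<number, Uint8Array>();
  for (const page of pages) {
    // Pages with a text layer are assumed to be covered by the text prescan.
    if (page.text.trim().length > 0) continue;
    const raster = await render(page);
    rasters.set(page.pageNumber, raster);
    if (await detect(raster)) {
      return { halted: true, flaggedPage: page.pageNumber, rasters };
    }
  }
  return { halted: false, rasters };
}

async function runExtraction(
  pages: Page[],
  render: Renderer,
  detect: MarkerDetector,
  extractEvents: EventExtractor,
): Promise<"halted" | "completed"> {
  const prescan = await visionPrescan(pages, render, detect);
  // The gate: no event-extraction call is made once a marker is seen.
  if (prescan.halted) return "halted";
  for (const page of pages) {
    await extractEvents(page, prescan.rasters.get(page.pageNumber));
  }
  return "completed";
}
```

Because the prescan finishes before the extraction loop starts, a flagged page anywhere in the document prevents every extraction call, not only those for later pages.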

Follow-ups

  • Apply migration 0005_extraction_async_jobs_document_index.sql on each environment; verify get_advisors --type performance no longer flags the unindexed FK.
  • Gemini multimodal behind flag (deferred until Gemini path has its own JSON-schema tests).
  • Periodic R2 sweeper for orphaned rasters from hard crashes (current cleanup is best-effort end-of-run).
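For the R2 sweeper follow-up, one possible shape is a pure selection step that decides which raster keys are orphaned, kept separate from the listing and deletion calls. The sketch below assumes a key layout of rasters/<runId>/<file>, which is invented for illustration (the PR only says keys are per run), and adds a minimum-age guard so in-flight runs are not swept.

```typescript
interface StoredObject {
  key: string;
  lastModified: Date;
}

// Assumed layout: "rasters/<runId>/<file>". The real key scheme is not given in the PR.
function runIdFromKey(key: string): string | null {
  const parts = key.split("/");
  if (parts.length < 3 || parts[0] !== "rasters") return null;
  return parts[1] ?? null;
}

// Returns keys that belong to no active run and are older than minAgeMs.
function findOrphanedRasters(
  objects: StoredObject[],
  activeRunIds: Set<string>,
  now: Date,
  minAgeMs: number,
): string[] {
  return objects
    .filter((obj) => {
      const runId = runIdFromKey(obj.key);
      if (runId === null) return false; // not a raster key, leave it alone
      if (activeRunIds.has(runId)) return false; // run still in progress
      return now.getTime() - obj.lastModified.getTime() >= minAgeMs;
    })
    .map((obj) => obj.key);
}
```

Keeping the selection logic free of storage calls makes it straightforward to unit-test before wiring it to a scheduled job.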
