nwhitehouse teaches Mike to read PDFs with its eyes

Instead of extracting text from uploaded PDFs, this fork shows the document to the AI as images - and pre-loads them the moment you upload.

chat-ui · infrastructure

nwhitehouse rewired the chat flow so that any PDF attached to a conversation gets rendered into page images and handed directly to the model, which can then reason over layout, tables, signatures and stamps the way a human reader would, rather than puzzling over flattened text. To keep cost and speed sane, pages are stitched four-to-an-image - a trick that held up in test rounds on a 25-page services agreement and a 75-page financing document, at roughly a third of the token cost and with no observed loss of fidelity on factual queries (dates, currency amounts, party names, repayment terms).

The fork also pre-renders PDFs the moment they're uploaded, with a shimmer on the attachment chip and the send button disabled until the document is ready, caching results both in memory and in cloud storage so a re-opened chat doesn't pay the cost twice. A separate citation-verification pass was tried and then switched off by default - it added about twelve seconds per answer for little gain, and can still be re-enabled with an environment flag.

So what: Worth a look for any legal team whose documents lean on visual structure - contracts with signature blocks, financials with tables, anything where the layout carries meaning.

Commits in this thread

8 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA Subject Author Date
f870b0a2 [feat-007a] Vision-mode auto-on: send PDF page images to Olava Nick Whitehouse 2026-05-04
commit body
When a chat has any attached PDF, render every page (capped at 30, the
vLLM --limit-mm-per-prompt setting) to PNG via @napi-rs/canvas + pdfjs
and splice them into the last user message as OpenAI-style image_url
content blocks. Olava-001 (served from a Qwen3-VL base) reasons over
the document visually rather than waiting for read_document text
extraction.

Implementation:
  - lib/pdfRender.ts: PDF buffer → base64 PNG per page
  - lib/visionContext.ts: download PDFs from docStore, render, splice
  - llm/types.ts: LlmMessage.content now string | LlmContentBlock[]
  - llm/olava.ts: pass-through (vLLM serves multimodal natively)
  - llm/claude.ts, llm/gemini.ts: flatten text blocks (vision is
    Olava-only for now; defensive in case content reaches them)
  - lib/chatTools.ts: detect vision content and add a system-prompt
    hint so the model reads from images instead of waiting for tools

Test path: any chat with an attached PDF auto-enters vision mode. No
user-visible toggle yet - that comes after we know the quality is good.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
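
The splice step described in this commit might look roughly like the sketch below. The commit names LlmMessage and LlmContentBlock but does not show their fields, so the shapes here are assumptions modelled on the OpenAI-style image_url block; spliceImages and MAX_PAGES are hypothetical names.

```typescript
// Assumed shapes: only the type names come from the commit message.
type LlmContentBlock =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface LlmMessage {
  role: "system" | "user" | "assistant";
  content: string | LlmContentBlock[];
}

// Page cap quoted in the commit (the vLLM --limit-mm-per-prompt setting).
const MAX_PAGES = 30;

// Returns a new array in which the last user message carries the page images.
// The input array and its messages are left unmodified.
function spliceImages(messages: LlmMessage[], pagesBase64Png: string[]): LlmMessage[] {
  let lastUser = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      lastUser = i;
      break;
    }
  }
  if (lastUser === -1 || pagesBase64Png.length === 0) return messages;

  const target = messages[lastUser];
  // Promote a plain-string message to a block list so images can follow the text.
  const existing: LlmContentBlock[] =
    typeof target.content === "string"
      ? [{ type: "text", text: target.content }]
      : target.content;

  const images: LlmContentBlock[] = pagesBase64Png.slice(0, MAX_PAGES).map((b64) => ({
    type: "image_url" as const,
    image_url: { url: `data:image/png;base64,${b64}` },
  }));

  const out = messages.slice();
  out[lastUser] = { ...target, content: [...existing, ...images] };
  return out;
}
```

Providers that cannot take image blocks would then flatten such a message back to its text blocks, which is the defensive path the commit describes for claude.ts and gemini.ts.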
436d028d [feat-007a] Live citation rendering for vision mode Nick Whitehouse 2026-05-04
commit body
Three pieces working together so pills appear progressively as the
model streams, instead of all-at-once at end-of-turn:

1) Stream-parse the hidden <CITATIONS> JSON block. Once <CITATIONS>
   opens we accumulate into a buffer and brace-depth scan for newly
   completed {...} entries every delta, emitting a citation_added SSE
   event per entry the moment it's parseable. Each ref is deduped via
   a per-turn Set so the end-of-turn batched citations event doesn't
   re-emit. Resets the per-iter buffer in flushText.

2) Per-marker citation verifier (Olava non-streaming, parallel). When
   exactly one PDF is in scope we pre-extract its text once, then in
   onContentDelta scan iterText for newly-complete [N] or superscript
   runs and fire a verifyCitation Promise per marker without awaiting.
   Each resolution emits citation_added live + pushes to events. v1
   currently misses 0/N - debug logging added but the stream-parser
   path covers the live-pill UX independently.

3) Frontend rAF coalescer. A burst of 30+ citation_added events would
   otherwise yield 30 setMessages calls → 30 ChatView re-renders → 30
   updateScrollButton invocations, compounding into max-update-depth.
   Buffer pending citations in a ref and flush once per animation
   frame; force-flush at end-of-stream.

Plus: PDF render swapped from @napi-rs/canvas to node-canvas (napi
rejects pdfjs's internal Path2D objects in ctx.fill, breaking the
very first page). Frontend preprocessCitations also matches Unicode
superscript marker runs (¹²³⁴), which Olava sometimes prefers in
legal-style prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
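
As an illustration of the brace-depth scan in item 1, here is a minimal sketch of an incremental parser that emits each {...} entry as soon as it closes. The class and function names are hypothetical, and deduplication here keys on the serialized entry, whereas the commit dedupes by citation ref.

```typescript
// Scans `buffer` from `start` and returns every complete top-level {...}
// object found, plus the offset to resume from on the next delta.
function extractCompleteObjects(
  buffer: string,
  start: number,
): { objects: unknown[]; next: number } {
  const objects: unknown[] = [];
  let depth = 0;
  let objStart = -1;
  let inString = false;
  let escaped = false;
  let next = start;

  for (let i = start; i < buffer.length; i++) {
    const ch = buffer[i];
    if (inString) {
      // Braces inside JSON strings must not affect the depth counter.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) objStart = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) {
        try {
          objects.push(JSON.parse(buffer.slice(objStart, i + 1)));
        } catch {
          // Malformed entry: skip it and keep scanning.
        }
        next = i + 1;
      }
    }
  }
  return { objects, next };
}

// Accumulates streamed deltas and returns only entries not seen before.
class CitationStream {
  private buffer = "";
  private offset = 0;
  private seen = new Set<string>();

  push(delta: string): unknown[] {
    this.buffer += delta;
    const { objects, next } = extractCompleteObjects(this.buffer, this.offset);
    this.offset = next;
    return objects.filter((o) => {
      const key = JSON.stringify(o);
      if (this.seen.has(key)) return false;
      this.seen.add(key);
      return true;
    });
  }
}
```

Each call to push rescans only the unfinished tail of the buffer, so an entry split across several deltas is parsed once, on the delta that closes it.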
787258cb [feat-008] pdf-4up rendering via pdftoppm - fixes blank-image bug + 3× compression Nick Whitehouse 2026-05-06
commit body
Two outcomes in one commit:

1) Fixes a silent vision-mode bug. The prior pdfjs+node-canvas glyph
   rendering produced blank PNGs - canvas v3 dropped Path2D, pdfjs 4.x
   needs it for glyph paths, and the path2d polyfill didn't bridge the
   gap. Vision-mode answers were entirely from read_document text
   fallback; vision input was noise. Rendering now shells out to
   pdftoppm (poppler-utils), which is battle-tested and renders glyphs
   correctly.

2) Switches to 4-up grid composition (default pagesPerImage=4). Two
   independent spike rounds (25-page services agreement and 75-page
   SEK financing doc, see backend/spike-out/text-compression*) showed
   ≈3× token compression vs 1-up at no fidelity loss on legal-grade
   factual queries (dates, currency amounts, party names, repayment
   terms). 8-up was rejected - round 1 hallucinated, round 3 returned
   mostly empty.

Combined effect: 4× page capacity per request (100 images × 4 pages =
≈400 PDF pages), blank-image bug gone, ~3× cheaper. The 75-page SEK
test doc that previously truncated to 30 pages in 1-up now fits whole.

Files:
- nixpacks.toml: aptPkgs += poppler-utils
- src/lib/pdfRender.ts: rewritten - pdftoppm + grid composer
- src/lib/visionContext.ts: VISION_MAX_IMAGES_PER_REQUEST=100, pass
  pagesPerImage:4 to renderer

read_document text-fallback stays in place as a safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
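
To make the 4-up arithmetic concrete, the sketch below plans which pages land in which composite and in which grid cell. It is only an assumption about how a grid composer could group pages: planComposites and Cell are hypothetical names, and the row-major 2×2 layout is a guess at the arrangement, not taken from the commit.

```typescript
interface Cell {
  page: number; // 1-based PDF page number
  col: number;  // 0-based grid column
  row: number;  // 0-based grid row
}

// Groups pages into composites of `pagesPerImage`, placing them row-major
// on a square-ish grid (2 columns for the 4-up default).
function planComposites(pageCount: number, pagesPerImage = 4): Cell[][] {
  const cols = Math.ceil(Math.sqrt(pagesPerImage));
  const composites: Cell[][] = [];
  for (let first = 1; first <= pageCount; first += pagesPerImage) {
    const last = Math.min(first + pagesPerImage - 1, pageCount);
    const cells: Cell[] = [];
    for (let page = first; page <= last; page++) {
      const idx = page - first;
      cells.push({ page, col: idx % cols, row: Math.floor(idx / cols) });
    }
    composites.push(cells);
  }
  return composites;
}

// The 75-page test document needs ceil(75 / 4) = 19 composites,
// well under the 100-image request cap named in the commit.
const plan = planComposites(75);
console.log(plan.length, plan[plan.length - 1].length); // 19 3
```

With the cap at 100 images per request, the same arithmetic gives the roughly 400-page ceiling quoted above.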
26ef15f4 [bug] verifier: use olava-extract model name (was hardcoded olava-001) Nick Whitehouse 2026-05-06
commit body
Spike-leftover hardcode caused every per-marker verifier call to 404
against vLLM (which registers olava-extract, not olava-001). The
end-of-stream <CITATIONS> block parser still produced pills, so user-
visible behaviour was just "no progressive pill rendering during
streaming." Behaviour now matches the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6bf6d52d [feat-009] Vision perf: parallel render + tiered cache + progress UI Nick Whitehouse 2026-05-06
commit body
Four wins for the vision-mode wait time, ordered by user impact:

1) PROGRESS UI. Backend emits vision_render_start/done SSE events
   around the pdftoppm call. Frontend renders a "Reading <filename>..."
   block (matching the existing DocReadBlock pattern) instead of a
   dead spinner. SSE stream now opens BEFORE the render so the
   placeholder reaches the browser immediately. ~10s+ wait now feels
   intentional rather than broken.

2) PARALLEL RENDER. pdftoppm is CPU-bound; one process handles only
   one page at a time. Split into 4 workers via -f/-l page ranges,
   each writing to its own subdir to avoid filename collisions. 75
   pages went from 28s → 11.7s on bench. Page count discovered via
   pdfinfo before splitting (also from poppler-utils).

3) IN-MEMORY LRU CACHE (visionCache.ts). 5-entry cap (composites are
   1-2MB each, ~30MB per 75-page doc - keeps worst-case ≤150MB
   resident on the 512MB Railway box). Subsequent turns against the
   same doc skip render entirely; sub-millisecond hit. No SSE
   placeholder events on a memory hit so the UI doesn't flicker.

4) R2 PERSISTENT CACHE (visionR2Cache.ts). Sits behind memory cache.
   Single JSON manifest at vision-cache/<base64url(storagePath|p|d)>.json
   contains the array of base64 composites. Survives backend restarts
   and Railway redeploys. Render → memory write → fire-and-forget R2
   write; subsequent processes hit R2 once, then promote to memory.
   Errors swallowed - cache is best-effort.
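
The page-range split from item 2 can be sketched as follows. This is an illustration under assumptions rather than the fork's code: splitPageRanges and pdftoppmArgs are hypothetical helpers, the output directory naming is invented, and only the pdftoppm flags -f, -l and -png are relied on. Spawning the processes is left out.

```typescript
interface PageRange {
  first: number; // 1-based, inclusive
  last: number;  // 1-based, inclusive
}

// Splits pages 1..pageCount into up to `workers` contiguous, near-equal ranges.
function splitPageRanges(pageCount: number, workers = 4): PageRange[] {
  if (pageCount <= 0 || workers <= 0) return [];
  const n = Math.min(workers, pageCount);
  const base = Math.floor(pageCount / n);
  const extra = pageCount % n; // the first `extra` workers take one more page
  const ranges: PageRange[] = [];
  let first = 1;
  for (let w = 0; w < n; w++) {
    const size = base + (w < extra ? 1 : 0);
    ranges.push({ first, last: first + size - 1 });
    first += size;
  }
  return ranges;
}

// Builds one pdftoppm argument list per range. Each worker writes into its
// own subdirectory so output filenames cannot collide.
function pdftoppmArgs(pdfPath: string, outDir: string, ranges: PageRange[]): string[][] {
  return ranges.map((r, i) => [
    "-png",
    "-f", String(r.first),
    "-l", String(r.last),
    pdfPath,
    `${outDir}/worker-${i}/page`,
  ]);
}

console.log(splitPageRanges(75));
// [ {first:1,last:19}, {first:20,last:38}, {first:39,last:57}, {first:58,last:75} ]
```

The page count passed in would come from pdfinfo, as the commit notes.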

Combined effect on a 75-page PDF:
- First chat ever:      ~12s render, ~5MB R2 write
- Same chat session:    sub-ms (memory)
- After backend restart: ~1-2s (R2 read + parse)
- New process or doc:    back to first-chat numbers

Files:
- backend/src/lib/pdfRender.ts: parallelise pdftoppm; pdfinfo page count
- backend/src/lib/visionCache.ts: new - in-memory LRU
- backend/src/lib/visionR2Cache.ts: new - R2-backed manifest
- backend/src/lib/visionContext.ts: tiered lookup + SSE events + write hookup
- backend/src/routes/chat.ts: open SSE before render so placeholder ships
- frontend/src/app/hooks/useAssistantChat.ts: handle vision_render_start/done
- frontend/src/app/components/assistant/AssistantMessage.tsx: VisionRenderBlock
- frontend/src/app/components/shared/types.ts: vision_render variant on
  AssistantEvent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
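
The tiered lookup in items 3 and 4 could be sketched as below: memory first, then the persistent store, then a fresh render with a best-effort write-back. All names here (LruCache, PersistentCache, getComposites) are hypothetical, and the persistent store is abstracted behind an interface rather than tied to any real R2 client.

```typescript
// Small LRU built on Map insertion order: the first key is the least recently used.
class LruCache<V> {
  private entries = new Map<string, V>();
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

// Stand-in for the R2-backed manifest store (interface is an assumption).
interface PersistentCache {
  read(key: string): Promise<string[] | null>;
  write(key: string, value: string[]): Promise<void>;
}

type Source = "memory" | "persistent" | "render";

async function getComposites(
  key: string,
  memory: LruCache<string[]>,
  persistent: PersistentCache,
  render: () => Promise<string[]>,
): Promise<{ images: string[]; source: Source }> {
  const hit = memory.get(key);
  if (hit) return { images: hit, source: "memory" };

  // Cache errors are swallowed: a failed read is treated as a miss.
  const stored = await persistent.read(key).catch(() => null);
  if (stored) {
    memory.set(key, stored); // promote to memory for later turns
    return { images: stored, source: "persistent" };
  }

  const images = await render();
  memory.set(key, images);
  // Fire-and-forget write; failures are ignored because the cache is best-effort.
  void persistent.write(key, images).catch(() => undefined);
  return { images, source: "render" };
}
```

A capacity of 5 would match the entry cap quoted in item 3.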
b2840147 [feat-010] Pre-render PDFs at upload + shimmer chip + send-button gate Nick Whitehouse 2026-05-06
commit body
After a PDF upload completes, fire a fire-and-forget vision pre-render
in the background. By the time the user opens a chat against the doc,
the R2 cache is warm and the chat skips the ~10s pdftoppm cost
entirely. Combined with feat-009's caches, first chat against a
freshly-uploaded doc is now near-instant from the user's perspective.

UX:
  - Attachment chip in ChatInput shows immediately on attach
  - Shimmer overlay (chip-shimmer keyframe in globals.css) plays while
    pre-render is in flight - clear visual signal that the chat isn't
    quite ready yet
  - Send button disabled while ANY attached PDF is pending; tooltip
    explains why. Belt-and-braces in handleSubmit so Enter doesn't
    sneak past the disabled button

Backend:
  - lib/visionPrerender.ts: in-process pending-renders map + R2 lookup
    fallback so status survives restarts. kickOffVisionPrerender is
    idempotent (no-op if already pending or ready).
  - routes/documents.ts (handleDocumentUpload + version upload):
    fire-and-forget kick-off after the documents.update completes.
    Only PDFs - DOCX vision mode isn't wired.
  - routes/projects.ts (project upload): same.
  - GET /single-documents/:id/vision-status: returns
    {status: pending|ready|failed|missing}. Cheap - combines memory
    map with R2 manifest existence check.

Frontend:
  - hooks/useVisionStatus.ts: polls vision-status every 1s per attached
    PDF until status resolves; caps at 60 attempts so the UI never
    locks if the backend goes weird.
  - ChatInput uses the hook to drive shimmer + button-disable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
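
The idempotent kick-off and the send-button gate from this commit can be illustrated with a small sketch. It covers only the in-process map; the R2 fallback and the polling hook are omitted. PrerenderTracker and canSend are hypothetical names, while the four status values come from the vision-status endpoint described above.

```typescript
type VisionStatus = "pending" | "ready" | "failed" | "missing";

class PrerenderTracker {
  private statuses = new Map<string, VisionStatus>();

  // Starts a background render unless one is already pending or ready.
  // Returns true only when this call actually started a render.
  // `render` is assumed to return a promise and not throw synchronously.
  kickOff(docId: string, render: () => Promise<void>): boolean {
    const current = this.statuses.get(docId);
    if (current === "pending" || current === "ready") return false;
    this.statuses.set(docId, "pending");
    // Not awaited: the upload handler returns immediately.
    render().then(
      () => { this.statuses.set(docId, "ready"); },
      () => { this.statuses.set(docId, "failed"); },
    );
    return true;
  }

  status(docId: string): VisionStatus {
    return this.statuses.get(docId) ?? "missing";
  }
}

// Send stays disabled while any attached PDF is still pending.
function canSend(attachedPdfStatuses: VisionStatus[]): boolean {
  return attachedPdfStatuses.every((s) => s !== "pending");
}
```

In this sketch a failed render does not block sending, and a later kickOff for the same document retries it; that retry behaviour is a design assumption, not something the commit states.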
9c244fd2 [maint] Doc tidy: Sprint 2 backlog entries + spike artifact Nick Whitehouse 2026-05-07
commit body
- backlog.md: append Sprint 2 (vision mode) covering feat-007a through
  feat-010 with status, commits, and open issues. Add an "Open items"
  section for bug-005 (verifier blocking [DONE]), feat-011 (vLLM
  prefix caching), feat-012 (text-as-image compression), and the
  feat-006 outcome note.
- backend/.gitignore: exclude spike-out/ - local benchmark output,
  ~70 paired markdown reports per run, regenerable on demand.
- backend/scripts/spike_compression.ts: keep the spike runner as a
  reusable harness; current shape targets the round-3 SEK financing
  doc with the 5 winner-candidate variants.

SECURITY.md (untracked, from the parallel security review) deliberately
left untouched - not part of this session's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b814d76a [bug-005] Disable per-marker verifier by default; gate behind OLAVA_VERIFIER env Nick Whitehouse 2026-05-07
commit body
Verifier (feat-007a) was awaiting all in-flight per-marker Olava calls
at end-of-turn before sending [DONE]. Per-call latency observed 12-17s
on olava-extract, and 3-of-4 calls came back empty in practice. This
added ~12s to time-to-[DONE] on every chat with citations.

Empirically the model emits a clean <CITATIONS> JSON block on its own
the vast majority of the time - citations land via the existing block-
parser path regardless. The verifier is mostly redundant work today.

Single env-gated flag: when OLAVA_VERIFIER is unset (or anything other
than "on"), the pre-extract is skipped, verifierDocId stays null, and
the existing fireVerifier early-return turns marker detection into a
no-op. verifierPromises stays empty so the end-of-turn await collapses.

All supporting code preserved (verifyCitation, marker detection,
streaming SSE emit) so re-enabling is just OLAVA_VERIFIER=on if/when
we observe the model regressing on the JSON tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
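
A rough sketch of the gate described in this commit follows. The names OLAVA_VERIFIER, verifierDocId, fireVerifier and verifierPromises come from the commit message; the function signatures and the injected verify callback are assumptions made for illustration.

```typescript
// The verifier runs only when the flag is exactly "on"; unset or any
// other value leaves it disabled.
function isVerifierEnabled(env: Record<string, string | undefined>): boolean {
  return env["OLAVA_VERIFIER"] === "on";
}

// Hypothetical signature. With the flag off, verifierDocId stays null, so
// every detected marker hits the early return and nothing is queued.
function fireVerifier(
  verifierDocId: string | null,
  marker: string,
  verifierPromises: Promise<void>[],
  verify: (docId: string, marker: string) => Promise<void>,
): void {
  if (verifierDocId === null) return;
  verifierPromises.push(verify(verifierDocId, marker));
}

// Example wiring: the doc id is only set when the flag is on.
const env: Record<string, string | undefined> = {};
const verifierDocId = isVerifierEnabled(env) ? "doc-1" : null;
const verifierPromises: Promise<void>[] = [];
fireVerifier(verifierDocId, "[1]", verifierPromises, async () => undefined);
console.log(verifierPromises.length); // 0
```

Because the promise list stays empty, awaiting it at end-of-turn resolves immediately, which is how the roughly 12s delay before [DONE] goes away.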
