Vision mode: PDF pages sent as images to Olava, with tiered caching and upload-time prerender

nwhitehouse builds an eight-commit arc that renders PDF pages to PNG and splices them into the last user message as image content blocks for Qwen3-VL. The biggest lesson buried in the diff: pdfjs + node-canvas produces blank pages on this stack; the fix is `pdftoppm` from poppler-utils.

chat-ui, infrastructure

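The initial implementation auto-enables vision whenever a PDF is attached. lib/pdfRender.ts renders each page to a base64 PNG (via pdfjs + @napi-rs/canvas in this first commit; the pdftoppm rewrite arrives in feat-008), and lib/visionContext.ts downloads the PDFs from docStore, renders them, and splices the PNGs into LlmMessage.content as OpenAI-style image_url blocks. The Claude and Gemini adapters defensively flatten block content to text, since vision is Olava-only for now; the Olava adapter passes the blocks through because vLLM serves multimodal input natively. Pages are hard-capped at 30, matching vLLM's --limit-mm-per-prompt setting.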
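The splice step can be sketched roughly as follows. This is a minimal illustration, not the fork's code: the exact shape of `LlmContentBlock`, the helper names `spliceImages` and `flattenToText`, and the choice to place images before the text are all assumptions; only the `string | LlmContentBlock[]` union, the OpenAI-style `image_url` blocks, and the 30-page cap come from the commit log.

```typescript
// Hypothetical block shapes, modelled on OpenAI-style content parts.
type LlmContentBlock =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface LlmMessage {
  role: "system" | "user" | "assistant";
  content: string | LlmContentBlock[];
}

const VISION_MAX_PAGES = 30; // cap stated in the initial commit

// Returns a copy of `messages` with PNG pages attached to the last user message.
function spliceImages(messages: LlmMessage[], pngBase64: string[]): LlmMessage[] {
  let idx = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") { idx = i; break; }
  }
  if (idx === -1 || pngBase64.length === 0) return messages;

  const target = messages[idx];
  const existing: LlmContentBlock[] =
    typeof target.content === "string"
      ? [{ type: "text", text: target.content }]
      : target.content;

  const images = pngBase64.slice(0, VISION_MAX_PAGES).map(
    (b64): LlmContentBlock => ({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${b64}` },
    }),
  );

  const out = messages.slice();
  // Assumption: images go ahead of the user's text.
  out[idx] = { role: target.role, content: [...images, ...existing] };
  return out;
}

// Defensive flattening for adapters that only accept text.
function flattenToText(content: string | LlmContentBlock[]): string {
  if (typeof content === "string") return content;
  const texts: string[] = [];
  for (const block of content) {
    if (block.type === "text") texts.push(block.text);
  }
  return texts.join("\n");
}
```

An adapter that cannot take images would call something like `flattenToText(message.content)` before building its request, so a stray image block never reaches a text-only API.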

The blank-image fix (feat-008) is the most important commit. The pdfjs + node-canvas approach silently rendered blank PNGs: canvas v3 dropped Path2D, which pdfjs 4.x needs for glyph paths, and the path2d polyfill did not bridge the gap. Every "vision" answer was actually coming from the read_document text fallback. Rewriting pdfRender.ts to shell out to pdftoppm fixed it. The same commit adds a 4-up grid compositor (four pages per output image, 2x2 layout), validated against a 25-page services agreement and a 75-page SEK financing doc; it showed roughly 3x token compression versus 1-up with no fidelity loss on legal-grade factual queries. 8-up was tried and rejected: one spike round hallucinated and another returned mostly empty. With pagesPerImage=4 and a 100-image cap, effective capacity is about 400 pages per request.
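The capacity arithmetic and the page-to-cell mapping behind a 4-up layout can be sketched as below. The function names, the row-major reading order, and the 2-column default are illustrative assumptions; the commit log only confirms `pagesPerImage=4` and `VISION_MAX_IMAGES_PER_REQUEST=100`. The actual pixel compositing is left out.

```typescript
const PAGES_PER_IMAGE = 4;                 // from feat-008
const VISION_MAX_IMAGES_PER_REQUEST = 100; // from feat-008

// Groups 1-based page numbers into batches, one batch per composite image.
function groupPages(pageCount: number, pagesPerImage: number = PAGES_PER_IMAGE): number[][] {
  const groups: number[][] = [];
  for (let start = 1; start <= pageCount; start += pagesPerImage) {
    const group: number[] = [];
    for (let p = start; p < start + pagesPerImage && p <= pageCount; p++) {
      group.push(p);
    }
    groups.push(group);
  }
  return groups;
}

// Top-left pixel of a page's cell inside the composite.
// Assumption: row-major order (page 1 top-left, page 2 top-right, ...).
function cellOrigin(
  indexInGroup: number,
  cellWidth: number,
  cellHeight: number,
  columns: number = 2,
): { x: number; y: number } {
  return {
    x: (indexInGroup % columns) * cellWidth,
    y: Math.floor(indexInGroup / columns) * cellHeight,
  };
}

// Upper bound on PDF pages that fit in one request.
function maxPagesPerRequest(
  maxImages: number = VISION_MAX_IMAGES_PER_REQUEST,
  pagesPerImage: number = PAGES_PER_IMAGE,
): number {
  return maxImages * pagesPerImage;
}
```

Under these assumptions the 75-page test document needs 19 composites (18 full grids plus one holding three pages), well inside the 100-image cap.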

feat-009 cuts a 75-page render from 28s to 11.7s by splitting the work across 4 parallel pdftoppm workers via -f/-l page ranges, each writing to its own subdirectory to avoid filename collisions. A 5-entry in-memory LRU cache (~30MB per 75-page doc) makes repeat turns against the same document sub-millisecond, and an R2 persistent cache behind it survives backend restarts and redeploys.
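A rough sketch of the two mechanisms follows: splitting a document into per-worker page ranges, and a small LRU cache. All names here (`splitPageRanges`, `pdftoppmArgs`, `LruCache`), the `w0`..`w3` subdirectory naming, and the DPI default are hypothetical; the `-png`, `-r`, `-f`, and `-l` flags are standard pdftoppm options, and the 4-worker and 5-entry figures come from the commit log.

```typescript
interface PageRange { first: number; last: number }

// Splits pages 1..pageCount into near-equal contiguous ranges, one per worker.
function splitPageRanges(pageCount: number, workers: number = 4): PageRange[] {
  const n = Math.min(workers, pageCount);
  const ranges: PageRange[] = [];
  let first = 1;
  for (let i = 0; i < n; i++) {
    const size = Math.floor(pageCount / n) + (i < pageCount % n ? 1 : 0);
    ranges.push({ first, last: first + size - 1 });
    first += size;
  }
  return ranges;
}

// Builds the argument list for one pdftoppm worker.
// Assumption: each worker writes under its own subdirectory (w0, w1, ...).
function pdftoppmArgs(
  pdfPath: string,
  outDir: string,
  range: PageRange,
  workerIndex: number,
  dpi: number = 100, // hypothetical default; the source does not state a DPI
): string[] {
  return [
    "-png",
    "-r", String(dpi),
    "-f", String(range.first),
    "-l", String(range.last),
    pdfPath,
    `${outDir}/w${workerIndex}/page`,
  ];
}

// Minimal LRU built on Map insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number = 5) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Each argument list would then be handed to a separate child process (for example via Node's `execFile`), with the results merged in page order once all workers exit.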

feat-010 pre-renders at upload time and gates the send button until rendering completes. The attachment chip shows a shimmer overlay while pending. A GET /single-documents/:id/vision-status endpoint returns {status: pending|ready|failed|missing}.
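The polling and send-gate behaviour can be sketched as below. The four status values, the 1-second interval, and the 60-attempt cap come from the commit log; the function names, the injected `fetchStatus` and `sleep` callbacks, and the `"timeout"` result are assumptions made so the sketch stays independent of React and of the real endpoint client.

```typescript
type VisionStatus = "pending" | "ready" | "failed" | "missing";

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Polls until the status leaves "pending" or the attempt cap is hit.
async function waitForVision(
  fetchStatus: () => Promise<VisionStatus>,
  options: PollOptions = {},
): Promise<VisionStatus | "timeout"> {
  const intervalMs = options.intervalMs ?? 1000;
  const maxAttempts = options.maxAttempts ?? 60;
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status !== "pending") return status;
    await sleep(intervalMs);
  }
  // Giving up keeps the UI from locking if the backend never resolves.
  return "timeout";
}

// The send button stays disabled while any attached PDF is still pending.
function canSend(statuses: VisionStatus[]): boolean {
  return statuses.every((status) => status !== "pending");
}
```

In a real client, `fetchStatus` would wrap a request to GET /single-documents/:id/vision-status and read the `status` field from the response.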

The per-marker citation verifier was disabled by default in bug-005 after it added ~12s to time-to-[DONE] on every chat with citations. It awaited all in-flight parallel Olava calls (12-17s each, with 3 of 4 returning empty) before sending [DONE]. Re-enable it with OLAVA_VERIFIER=on.
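The gating described in bug-005 amounts to an exact string check plus an early return, roughly as sketched here. The function signatures are hypothetical; only the `OLAVA_VERIFIER` variable, the `"on"` value, and the idea that the pending-promise list stays empty when the flag is off come from the commit log.

```typescript
type Env = Record<string, string | undefined>;

// Enabled only when the variable is exactly "on"; unset or any other value disables it.
function verifierEnabled(env: Env): boolean {
  return env.OLAVA_VERIFIER === "on";
}

// Hypothetical marker hook: queues a verification call only when enabled.
function fireVerifier(
  env: Env,
  marker: string,
  verify: (marker: string) => Promise<boolean>,
  pending: Promise<boolean>[],
): void {
  if (!verifierEnabled(env)) return; // marker detection becomes a no-op
  pending.push(verify(marker));
}
```

With the flag off, `pending` stays empty, so an end-of-turn `await Promise.all(pending)` resolves immediately and [DONE] is not held back. In a Node backend the `env` argument would typically be `process.env`.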

So what

Worth adopting the architecture if your fork serves a multimodal model: the pdftoppm 4-up compositor, parallel render workers, and tiered memory+R2 cache all transfer cleanly. Hard requirement: `poppler-utils` must be on the deploy host. Do not try pdfjs+node-canvas for PDF page rendering; this fork confirmed it produces blank images. Skip the verifier work: it is disabled by default and mostly redundant with the <CITATIONS> block parser.


Commits in this thread

8 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`f870b0a2`	[feat-007a] Vision-mode auto-on: send PDF page images to Olava	Nick Whitehouse	2026-05-04
commit body
When a chat has any attached PDF, render every page (capped at 30, the vLLM --limit-mm-per-prompt setting) to PNG via @napi-rs/canvas + pdfjs and splice them into the last user message as OpenAI-style image_url content blocks. Olava-001 (served from a Qwen3-VL base) reasons over the document visually rather than waiting for read_document text extraction.

Implementation:
- lib/pdfRender.ts: PDF buffer → base64 PNG per page
- lib/visionContext.ts: download PDFs from docStore, render, splice
- llm/types.ts: LlmMessage.content now string | LlmContentBlock[]
- llm/olava.ts: pass-through (vLLM serves multimodal natively)
- llm/claude.ts, llm/gemini.ts: flatten text blocks (vision is Olava-only for now; defensive in case content reaches them)
- lib/chatTools.ts: detect vision content and add a system-prompt hint so the model reads from images instead of waiting for tools

Test path: any chat with an attached PDF auto-enters vision mode. No user-visible toggle yet - that comes after we know the quality is good.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`436d028d`	[feat-007a] Live citation rendering for vision mode	Nick Whitehouse	2026-05-04
commit body
Three pieces working together so pills appear progressively as the model streams, instead of all-at-once at end-of-turn:

1) Stream-parse the hidden <CITATIONS> JSON block. Once <CITATIONS> opens we accumulate into a buffer and brace-depth scan for newly completed {...} entries every delta, emitting a citation_added SSE event per entry the moment it's parseable. Each ref is deduped via a per-turn Set so the end-of-turn batched citations event doesn't re-emit. Resets the per-iter buffer in flushText.

2) Per-marker citation verifier (Olava non-streaming, parallel). When exactly one PDF is in scope we pre-extract its text once, then in onContentDelta scan iterText for newly-complete [N] or superscript runs and fire a verifyCitation Promise per marker without awaiting. Each resolution emits citation_added live + pushes to events. v1 currently misses 0/N - debug logging added but the stream-parser path covers the live-pill UX independently.

3) Frontend rAF coalescer. A burst of 30+ citation_added events would otherwise yield 30 setMessages calls → 30 ChatView re-renders → 30 updateScrollButton invocations, compounding into max-update-depth. Buffer pending citations in a ref and flush once per animation frame; force-flush at end-of-stream.

Plus: PDF render swapped from @napi-rs/canvas to node-canvas (napi rejects pdfjs's internal Path2D objects in ctx.fill, breaking the very first page). Frontend preprocessCitations also matches Unicode superscript marker runs (¹²³⁴), which Olava sometimes prefers in legal-style prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`787258cb`	[feat-008] pdf-4up rendering via pdftoppm - fixes blank-image bug + 3× compression	Nick Whitehouse	2026-05-06
commit body
Two outcomes in one commit:

1) Fixes a silent vision-mode bug. The prior pdfjs+node-canvas glyph rendering produced blank PNGs - canvas v3 dropped Path2D, pdfjs 4.x needs it for glyph paths, and the path2d polyfill didn't bridge the gap. Vision-mode answers were entirely from read_document text fallback; vision input was noise. Rendering now shells out to pdftoppm (poppler-utils), which is battle-tested and renders glyphs correctly.

2) Switches to 4-up grid composition (default pagesPerImage=4). Two independent spike rounds (25-page services agreement and 75-page SEK financing doc, see backend/spike-out/text-compression*) showed ≈3× token compression vs 1-up at no fidelity loss on legal-grade factual queries (dates, currency amounts, party names, repayment terms). 8-up was rejected - round 1 hallucinated, round 3 returned mostly empty.

Combined effect: 4× page capacity per request (100 images × 4 pages = ≈400 PDF pages), blank-image bug gone, ~3× cheaper. The 75-page SEK test doc that previously truncated to 30 pages in 1-up now fits whole.

Files:
- nixpacks.toml: aptPkgs += poppler-utils
- src/lib/pdfRender.ts: rewritten - pdftoppm + grid composer
- src/lib/visionContext.ts: VISION_MAX_IMAGES_PER_REQUEST=100, pass pagesPerImage:4 to renderer

read_document text-fallback stays in place as a safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`26ef15f4`	[bug] verifier: use olava-extract model name (was hardcoded olava-001)	Nick Whitehouse	2026-05-06
commit body
Spike-leftover hardcode caused every per-marker verifier call to 404 against vLLM (which registers olava-extract, not olava-001). The end-of-stream <CITATIONS> block parser still produced pills, so user-visible behaviour was just "no progressive pill rendering during streaming." Behaviour now matches the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`6bf6d52d`	[feat-009] Vision perf: parallel render + tiered cache + progress UI	Nick Whitehouse	2026-05-06
commit body
Four wins for the vision-mode wait time, ordered by user impact:

1) PROGRESS UI. Backend emits vision_render_start/done SSE events around the pdftoppm call. Frontend renders a "Reading <filename>..." block (matching the existing DocReadBlock pattern) instead of a dead spinner. SSE stream now opens BEFORE the render so the placeholder reaches the browser immediately. ~10s+ wait now feels intentional rather than broken.

2) PARALLEL RENDER. pdftoppm is CPU-bound; one process handles only one page at a time. Split into 4 workers via -f/-l page ranges, each writing to its own subdir to avoid filename collisions. 75 pages went from 28s → 11.7s on bench. Page count discovered via pdfinfo before splitting (also from poppler-utils).

3) IN-MEMORY LRU CACHE (visionCache.ts). 5-entry cap (composites are 1-2MB each, ~30MB per 75-page doc - keeps worst-case ≤150MB resident on the 512MB Railway box). Subsequent turns against the same doc skip render entirely; sub-millisecond hit. No SSE placeholder events on a memory hit so the UI doesn't flicker.

4) R2 PERSISTENT CACHE (visionR2Cache.ts). Sits behind memory cache. Single JSON manifest at vision-cache/<base64url(storagePath|p|d)>.json contains the array of base64 composites. Survives backend restarts and Railway redeploys. Render → memory write → fire-and-forget R2 write; subsequent processes hit R2 once, then promote to memory. Errors swallowed - cache is best-effort.

Combined effect on a 75-page PDF:
- First chat ever: ~12s render, ~5MB R2 write
- Same chat session: sub-ms (memory)
- After backend restart: ~1-2s (R2 read + parse)
- New process or doc: back to first-chat numbers

Files:
- backend/src/lib/pdfRender.ts: parallelise pdftoppm; pdfinfo page count
- backend/src/lib/visionCache.ts: new - in-memory LRU
- backend/src/lib/visionR2Cache.ts: new - R2-backed manifest
- backend/src/lib/visionContext.ts: tiered lookup + SSE events + write hookup
- backend/src/routes/chat.ts: open SSE before render so placeholder ships
- frontend/src/app/hooks/useAssistantChat.ts: handle vision_render_start/done
- frontend/src/app/components/assistant/AssistantMessage.tsx: VisionRenderBlock
- frontend/src/app/components/shared/types.ts: vision_render variant on AssistantEvent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`b2840147`	[feat-010] Pre-render PDFs at upload + shimmer chip + send-button gate	Nick Whitehouse	2026-05-06
commit body
After a PDF upload completes, fire a fire-and-forget vision pre-render in the background. By the time the user opens a chat against the doc, the R2 cache is warm and the chat skips the ~10s pdftoppm cost entirely. Combined with feat-009's caches, first chat against a freshly-uploaded doc is now near-instant from the user's perspective.

UX:
- Attachment chip in ChatInput shows immediately on attach
- Shimmer overlay (chip-shimmer keyframe in globals.css) plays while pre-render is in flight - clear visual signal that the chat isn't quite ready yet
- Send button disabled while ANY attached PDF is pending; tooltip explains why. Belt-and-braces in handleSubmit so Enter doesn't sneak past the disabled button

Backend:
- lib/visionPrerender.ts: in-process pending-renders map + R2 lookup fallback so status survives restarts. kickOffVisionPrerender is idempotent (no-op if already pending or ready).
- routes/documents.ts (handleDocumentUpload + version upload): fire-and-forget kick-off after the documents.update completes. Only PDFs - DOCX vision mode isn't wired.
- routes/projects.ts (project upload): same.
- GET /single-documents/:id/vision-status: returns {status: pending|ready|failed|missing}. Cheap - combines memory map with R2 manifest existence check.

Frontend:
- hooks/useVisionStatus.ts: polls vision-status every 1s per attached PDF until status resolves; caps at 60 attempts so the UI never locks if the backend goes weird.
- ChatInput uses the hook to drive shimmer + button-disable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`9c244fd2`	[maint] Doc tidy: Sprint 2 backlog entries + spike artifact	Nick Whitehouse	2026-05-07
commit body
- backlog.md: append Sprint 2 (vision mode) covering feat-007a through feat-010 with status, commits, and open issues. Add an "Open items" section for bug-005 (verifier blocking [DONE]), feat-011 (vLLM prefix caching), feat-012 (text-as-image compression), and the feat-006 outcome note.
- backend/.gitignore: exclude spike-out/ - local benchmark output, ~70 paired markdown reports per run, regenerable on demand.
- backend/scripts/spike_compression.ts: keep the spike runner as a reusable harness; current shape targets the round-3 SEK financing doc with the 5 winner-candidate variants.

SECURITY.md (untracked, from the parallel security review) deliberately left untouched - not part of this session's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`b814d76a`	[bug-005] Disable per-marker verifier by default; gate behind OLAVA_VERIFIER env	Nick Whitehouse	2026-05-07
commit body
Verifier (feat-007a) was awaiting all in-flight per-marker Olava calls at end-of-turn before sending [DONE]. Per-call latency observed 12-17s on olava-extract, and 3-of-4 calls came back empty in practice. This added ~12s to time-to-[DONE] on every chat with citations.

Empirically the model emits a clean <CITATIONS> JSON block on its own the vast majority of the time - citations land via the existing block-parser path regardless. The verifier is mostly redundant work today.

Single env-gated flag: when OLAVA_VERIFIER is unset (or anything other than "on"), the pre-extract is skipped, verifierDocId stays null, and the existing fireVerifier early-return turns marker detection into a no-op. verifierPromises stays empty so the end-of-turn await collapses.

All supporting code preserved (verifyCitation, marker detection, streaming SSE emit) so re-enabling is just OLAVA_VERIFIER=on if/when we observe the model regressing on the JSON tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
