[feat-024] Strip NUL bytes from chunk input

Nick Whitehouse · 2026-05-07 · dbf18a87

Postgres text/jsonb encoding rejects U+0000 with "unsupported Unicode
escape sequence". pdfjs occasionally emits NUL in extracted text, and
when it does the embed insert fails for the whole doc. Strip NULs
upfront in the chunker so both the worker and script paths are covered.
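
A minimal sketch of the idea, not the commit's actual code: `stripNulBytes` and `chunkText` are hypothetical names, and the fixed-size slicing stands in for whatever chunking strategy the repo really uses. The point is only that the strip happens once, at the chunker's entry, so every caller gets clean text.

```typescript
// Hypothetical helper: remove every U+0000 code unit from the input.
function stripNulBytes(text: string): string {
  return text.replace(/\u0000/g, "");
}

// Hypothetical chunker entry point. Fixed-size slicing is an assumption
// made for illustration; the strip runs before any chunk is produced.
function chunkText(raw: string, chunkSize: number = 1000): string[] {
  const clean = stripNulBytes(raw);
  const chunks: string[] = [];
  for (let i = 0; i < clean.length; i += chunkSize) {
    chunks.push(clean.slice(i, i + chunkSize));
  }
  return chunks;
}
```

Stripping inside the chunker rather than at each call site means a caller cannot forget it, which is presumably why one change covers both the worker and the script.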

Also adds a one-shot backfill script that bypasses the HTTP endpoint
and worker pool: it chunks and embeds every doc that has no rows in
document_chunks. Useful for backfilling docs that predate this
feature without restarting the backend or wiring auth.
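
The backfill flow described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Doc` and `BackfillDeps` shapes, the function names, and the injected dependencies are hypothetical, and the real script talks to Postgres and an embedding provider directly. Only the `document_chunks` table name comes from the commit message.

```typescript
// Hypothetical shapes; the real script's types are not shown in the commit.
interface Doc {
  id: string;
  text: string;
}

interface BackfillDeps {
  // Assumed to run something like:
  //   SELECT ... FROM documents d WHERE NOT EXISTS
  //     (SELECT 1 FROM document_chunks c WHERE c.document_id = d.id)
  // ("documents" and "document_id" are guessed names.)
  listDocsWithoutChunks(): Promise<Doc[]>;
  chunk(text: string): string[];
  embed(chunks: string[]): Promise<number[][]>;
  insertChunks(docId: string, chunks: string[], vectors: number[][]): Promise<void>;
}

// Process docs one at a time; returns how many docs received chunks.
async function backfill(deps: BackfillDeps): Promise<number> {
  const docs = await deps.listDocsWithoutChunks();
  let processed = 0;
  for (const doc of docs) {
    const chunks = deps.chunk(doc.text);
    if (chunks.length === 0) continue; // nothing to embed for empty docs
    const vectors = await deps.embed(chunks);
    await deps.insertChunks(doc.id, chunks, vectors);
    processed++;
  }
  return processed;
}
```

Because the selection is "docs with no rows in document_chunks", a run of this kind is naturally resumable: docs that were already chunked drop out of the query on the next run.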

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repository nwhitehouse/mike
Author Nick Whitehouse <nick.whitehouse@mccarthyfinch.com>
Parents a187e3a0
Stats 2 files changed, +132
Part of RAG chat over tabular-review docs (pgvector embeddings)
