[feat-024] Strip NUL bytes from chunk input
Postgres text/jsonb encoding rejects U+0000 with "unsupported Unicode escape sequence" - pdfjs occasionally emits NUL in extracted text and the embed insert fails for the whole doc. Drop them upfront in the chunker so both worker + script paths are covered.

Also adds a one-shot backfill script that bypasses the HTTP endpoint + worker pool: chunks and embeds every doc that has no rows in document_chunks. Useful for backfilling docs that pre-date this feature without restarting the backend or wiring auth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
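
As an illustration of the fix described in the message above, here is a minimal sketch of stripping NUL at the chunker boundary. The diff itself is not shown here, so the names `stripNul` and `chunkText`, the `maxChars` parameter, and the fixed-size splitting strategy are all assumptions for illustration, not the commit's actual code.

```typescript
// Hypothetical helper: remove every U+0000 code unit from extracted text.
function stripNul(text: string): string {
  return text.replace(/\u0000/g, "");
}

// Hypothetical chunker: the real chunking strategy is not shown in the commit
// message, so a plain fixed-size split stands in for it here.
function chunkText(text: string, maxChars: number = 1000): string[] {
  // Clean once, before any splitting, so every caller of the chunker
  // (worker or script) gets NUL-free chunks.
  const clean = stripNul(text);
  const chunks: string[] = [];
  for (let i = 0; i < clean.length; i += maxChars) {
    chunks.push(clean.slice(i, i + maxChars));
  }
  return chunks;
}
```

Stripping inside the chunker, rather than at each insert site, means a single change would cover every code path that produces chunks, which matches the stated goal of covering both the worker and the script.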
| Repository | nwhitehouse/mike |
|---|---|
| Author | Nick Whitehouse <nick.whitehouse@mccarthyfinch.com> |
| Authored | |
| Committed | |
| Parents | a187e3a0 |
| Stats | 2 files changed, +132 |
| Part of | RAG chat over tabular-review docs (pgvector embeddings) |