Accept .txt, .eml, .xlsx uploads with LLM-readable extraction + previews
From the PR description
Summary
- Extends the document-upload allowlist beyond pdf/docx to also accept .txt, .eml, and .xlsx, so users can hand the LLM plain text, emails, and spreadsheets.
- Server-side parsing only - mailparser and exceljs never enter the browser bundle. GET /single-documents/:id/display returns pre-parsed JSON under vendor content-types (application/vnd.mike.eml+json, application/vnd.mike.xlsx+json) for .eml/.xlsx, and text/plain for .txt.
- New TxtView, EmlView, XlsxView components render the parsed payload inline; dispatch happens inside DocView via useFetchSingleDoc's widened DocResult union, so existing pdf/docx paths and callers are untouched.
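A minimal sketch of how the content-type to viewer dispatch described above could look. The three content-type strings and the component names come from this description; the payload field names, the kind tags, and the helpers kindForContentType and viewerFor are hypothetical and only illustrate the discriminated-union idea, not the actual DocResult shape.

```typescript
// Hypothetical payload shapes; the real DocResult fields are not shown in this PR text.
type DocResult =
  | { kind: "pdf"; url: string }
  | { kind: "docx"; url: string }
  | { kind: "txt"; text: string }
  | { kind: "eml"; subject: string; body: string }
  | { kind: "xlsx"; sheets: { name: string; rows: string[][] }[] };

type ParsedKind = "txt" | "eml" | "xlsx";

// Map the Content-Type of the /display response to a payload kind.
// Parameters such as "; charset=utf-8" are stripped before matching.
function kindForContentType(header: string): ParsedKind | null {
  const [first = ""] = header.split(";");
  const mime = first.trim().toLowerCase();
  if (mime === "text/plain") return "txt";
  if (mime === "application/vnd.mike.eml+json") return "eml";
  if (mime === "application/vnd.mike.xlsx+json") return "xlsx";
  return null; // anything else stays on the existing pdf/docx path
}

// Exhaustive switch: adding a new kind to DocResult without handling it
// here becomes a compile-time error via the `never` assignment.
function viewerFor(doc: DocResult): string {
  switch (doc.kind) {
    case "txt":
      return "TxtView";
    case "eml":
      return "EmlView";
    case "xlsx":
      return "XlsxView";
    case "pdf":
    case "docx":
      return "existing viewer";
    default: {
      const unreachable: never = doc;
      return unreachable;
    }
  }
}
```

Keeping the new kinds as extra union members is what lets the pdf/docx branches stay byte-for-byte unchanged: callers that only ever produce those two kinds never reach the new cases.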
Why
The frontend accept= attribute looked like the only gate, but the backend also enforced ALLOWED_TYPES = {pdf, docx, doc} in two routes and the read_document LLM tool only dispatched pdf vs docx (with a mammoth fallback that would fail on .xls/.eml). To make new types useful rather than just accepted, the LLM read path and the in-app preview both need format-specific handling.
Design notes
- Caps (so a 10MB log dump or million-row sheet doesn't blow the LLM context or the browser): txt/eml capped at 200k chars; xlsx capped at 1000 rows × 50 cols per sheet, with explicit truncation notes in both the LLM rendering and the UI banner.
- .xls deliberately excluded - the SheetJS npm package (the only library that handles the legacy binary format) has prototype-pollution CVEs. exceljs covers .xlsx cleanly; users with old .xls can Save As .xlsx.
- No DB migration - the documents table already stores file_type as a free string. Existing pdf/docx rows are unaffected.
- No new env vars or infra. Same S3 bucket, same auth, no IAM changes.
- Encryption work in flight (Phase 2) is unaffected - storage layer is content-agnostic, KMS SSE-KMS applies regardless of file type.
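The caps in the first design note could be applied roughly as follows. This is a sketch under assumptions: the limits (200k chars, 1000 rows, 50 cols) are from the note above, but the function names capText and capSheet, the result shapes, and the wording of the truncation note are hypothetical.

```typescript
// Limits quoted in the design notes; everything else here is illustrative.
const MAX_TEXT_CHARS = 200000;
const MAX_ROWS = 1000;
const MAX_COLS = 50;

interface CappedText {
  text: string;
  truncated: boolean;
  note?: string; // shown to the LLM and in the UI banner when truncated
}

// "Chars" here means UTF-16 code units, as counted by String.length.
function capText(raw: string, limit: number = MAX_TEXT_CHARS): CappedText {
  if (raw.length <= limit) return { text: raw, truncated: false };
  return {
    text: raw.slice(0, limit),
    truncated: true,
    note: `[truncated: showing first ${limit} of ${raw.length} chars]`,
  };
}

interface CappedSheet {
  rows: string[][];
  truncated: boolean;
  note?: string;
}

// Cap one sheet to maxRows x maxCols and report the original dimensions.
function capSheet(
  rows: string[][],
  maxRows: number = MAX_ROWS,
  maxCols: number = MAX_COLS,
): CappedSheet {
  const totalRows = rows.length;
  const totalCols = rows.reduce((max, row) => Math.max(max, row.length), 0);
  const kept = rows.slice(0, maxRows).map((row) => row.slice(0, maxCols));
  const truncated = totalRows > maxRows || totalCols > maxCols;
  if (!truncated) return { rows: kept, truncated };
  return {
    rows: kept,
    truncated,
    note: `[truncated: showing ${kept.length} of ${totalRows} rows, ` +
      `${Math.min(totalCols, maxCols)} of ${totalCols} cols]`,
  };
}
```

Returning the note alongside the data, rather than appending it to the text, lets the LLM rendering and the UI banner format the same fact in their own way.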
Files touched
- Backend: 3 new extract libs + tests (21 cases), chatTools.read_document, both routes/documents.ts and routes/projects.ts (allowlist + extractStructureTree + /display endpoint), and a new SUPPORTED_DOC_TYPES / contentTypeForSuffix() helper in lib/upload.ts so the allowlist, error message, and storage content-type can't drift apart.
- Frontend: widened accept= in two upload modals, new shared types module, extended useFetchSingleDoc, three new viewer components, dispatch branch in DocView.
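For illustration, a single-source-of-truth helper along the lines of SUPPORTED_DOC_TYPES / contentTypeForSuffix() might look like the sketch below. Only those two names come from this description. The table layout, the specific MIME strings (standard registered types, assumed rather than taken from the PR), the assumption that doc remains allowed, and the isSupported and suffixOf helpers are all hypothetical.

```typescript
// One table drives the allowlist, the error message, and the storage
// content-type, so the three cannot drift apart.
// MIME values are the standard registered ones, assumed for this sketch.
const CONTENT_TYPES = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  doc: "application/msword",
  txt: "text/plain",
  eml: "message/rfc822",
  xlsx: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
} as const;

type DocSuffix = keyof typeof CONTENT_TYPES;

// The allowlist is derived from the table rather than written out again.
const SUPPORTED_DOC_TYPES = Object.keys(CONTENT_TYPES) as DocSuffix[];

// Own-property check so inherited keys like "constructor" are rejected.
function isSupported(suffix: string): suffix is DocSuffix {
  return Object.prototype.hasOwnProperty.call(CONTENT_TYPES, suffix);
}

// Lower-cased extension without the dot, or "" when there is none.
function suffixOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot < 0 ? "" : filename.slice(dot + 1).toLowerCase();
}

function contentTypeForSuffix(suffix: string): string {
  const s = suffix.toLowerCase();
  if (!isSupported(s)) {
    throw new Error(
      `Unsupported file type ".${s}". Allowed: ${SUPPORTED_DOC_TYPES.join(", ")}`,
    );
  }
  return CONTENT_TYPES[s];
}
```

Because the error message is built from the same table, adding a new suffix in one place updates the route checks, the message shown to users, and the stored content-type together.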