Accept .txt, .eml, .xlsx uploads with LLM-readable extraction + previews

✅ merged · #8 · easterbrooka/mike ← easterbrooka/mike · opened 16d ago by easterbrooka · merged 16d ago by easterbrooka · self · +2,000 -28 across 20 files

From the PR description

Summary

  • Extends the document-upload allowlist beyond pdf/docx to also accept .txt, .eml, and .xlsx, so users can hand the LLM plain text, emails, and spreadsheets.
  • Server-side parsing only - mailparser and exceljs never enter the browser bundle. GET /single-documents/:id/display returns pre-parsed JSON under vendor content-types (application/vnd.mike.eml+json, application/vnd.mike.xlsx+json) for .eml/.xlsx, and text/plain for .txt.
  • New TxtView, EmlView, XlsxView components render the parsed payload inline; dispatch happens inside DocView via useFetchSingleDoc's widened DocResult union, so existing pdf/docx paths and callers are untouched.
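
To illustrate the dispatch described above, here is a minimal sketch of how a client might map the /display response onto a widened DocResult union. The content-type strings come from the PR description; the field names, the `binary` variant, and the helpers `toDocResult` and `viewerFor` are hypothetical and are not taken from the actual code.

```typescript
// Hypothetical payload shapes: the PR names DocResult but does not show its fields.
type EmlPayload = { subject: string; from: string; body: string };
type XlsxPayload = { sheets: { name: string; rows: string[][] }[] };

type DocResult =
  | { kind: "txt"; text: string }
  | { kind: "eml"; email: EmlPayload }
  | { kind: "xlsx"; workbook: XlsxPayload }
  | { kind: "binary"; contentType: string }; // assumed stand-in for the existing pdf/docx path

// Map a /display response (content-type header plus raw body) onto the union.
function toDocResult(contentType: string, body: string): DocResult {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  switch (mime) {
    case "text/plain":
      return { kind: "txt", text: body };
    case "application/vnd.mike.eml+json":
      return { kind: "eml", email: JSON.parse(body) as EmlPayload };
    case "application/vnd.mike.xlsx+json":
      return { kind: "xlsx", workbook: JSON.parse(body) as XlsxPayload };
    default:
      return { kind: "binary", contentType: mime };
  }
}

// Pick a viewer by discriminant; returns the component name as a string
// so the sketch stays free of React imports.
function viewerFor(result: DocResult): string {
  switch (result.kind) {
    case "txt":
      return "TxtView";
    case "eml":
      return "EmlView";
    case "xlsx":
      return "XlsxView";
    case "binary":
      return "existing pdf/docx viewer";
  }
}
```

Because the union is discriminated on `kind`, adding a new format later forces every `switch` over DocResult to handle it, which is one way to keep existing callers untouched until they opt in.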

Why

The frontend accept= attribute looked like the only gate, but the backend also enforced ALLOWED_TYPES = {pdf, docx, doc} in two routes, and the read_document LLM tool only dispatched pdf vs docx (with a mammoth fallback that would fail on .xlsx/.eml). To make the new types useful rather than just accepted, the LLM read path and the in-app preview both need format-specific handling.

Design notes

  • Caps (so a 10MB log dump or million-row sheet doesn't blow the LLM context or the browser): txt/eml capped at 200k chars; xlsx capped at 1000 rows × 50 cols per sheet, with explicit truncation notes in both the LLM rendering and the UI banner.
  • .xls deliberately excluded - the SheetJS npm package (the only library that handles the legacy binary format) has prototype-pollution CVEs. exceljs covers .xlsx cleanly; users with old .xls can Save As .xlsx.
  • No DB migration - the documents table already stores file_type as a free string. Existing pdf/docx rows are unaffected.
  • No new env vars or infra. Same S3 bucket, same auth, no IAM changes.
  • Encryption work in flight (Phase 2) is unaffected - the storage layer is content-agnostic, so SSE-KMS applies regardless of file type.
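
The caps in the design notes could be applied roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: the limits (200k chars, 1000 rows, 50 columns) come from the description, while `capText`, `capSheet`, the note wording, and the choice to count "chars" as UTF-16 code units are hypothetical.

```typescript
// Limits taken from the PR description.
const MAX_TEXT_CHARS = 200_000;
const MAX_ROWS = 1000;
const MAX_COLS = 50;

interface CappedText {
  text: string;
  note: string | null; // null when nothing was cut
}

// Assumption: "chars" means UTF-16 code units (string.length).
// A cut at the limit can therefore split a surrogate pair.
function capText(raw: string, limit: number = MAX_TEXT_CHARS): CappedText {
  if (raw.length <= limit) return { text: raw, note: null };
  return {
    text: raw.slice(0, limit),
    note: `Truncated to ${limit} of ${raw.length} characters.`,
  };
}

interface CappedSheet {
  rows: string[][];
  note: string | null; // null when the sheet fits within both limits
}

function capSheet(
  rows: string[][],
  maxRows: number = MAX_ROWS,
  maxCols: number = MAX_COLS,
): CappedSheet {
  // Widest row in the original sheet, used only for the note.
  const widest = rows.reduce((w, r) => Math.max(w, r.length), 0);
  const kept = rows.slice(0, maxRows).map((r) => r.slice(0, maxCols));
  const truncated = rows.length > maxRows || widest > maxCols;
  const note = truncated
    ? `Showing ${kept.length} of ${rows.length} rows and ` +
      `${Math.min(widest, maxCols)} of ${widest} columns.`
    : null;
  return { rows: kept, note };
}
```

Returning the note alongside the data lets the same string feed both the LLM rendering and the UI banner, so the two cannot disagree about what was dropped.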

Files touched

  • Backend: 3 new extract libs + tests (21 cases), chatTools.read_document, both routes/documents.ts and routes/projects.ts (allowlist + extractStructureTree + /display endpoint), and a new SUPPORTED_DOC_TYPES/contentTypeForSuffix() helper in lib/upload.ts so the allowlist, error message, and storage content-type can't drift apart.
  • Frontend: widened accept= in two upload modals, new shared types module, extended useFetchSingleDoc, three new viewer components, dispatch branch in DocView.
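
One way to keep the allowlist, error message, and storage content-type from drifting apart is to derive all three from a single table. The sketch below assumes that approach. Only the names SUPPORTED_DOC_TYPES and contentTypeForSuffix() appear in the PR description; the table contents (including whether doc remains on the list), `suffixOf`, `isSupported`, and `unsupportedTypeMessage` are illustrative guesses.

```typescript
// Single source of truth: suffix -> storage content-type.
// The MIME strings are the standard registered types; which suffixes the
// real helper includes is an assumption.
const CONTENT_TYPES = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  doc: "application/msword",
  txt: "text/plain",
  eml: "message/rfc822",
  xlsx: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
} as const;

type SupportedDocType = keyof typeof CONTENT_TYPES;

// The allowlist is derived from the table, never written out separately.
const SUPPORTED_DOC_TYPES = Object.keys(CONTENT_TYPES) as SupportedDocType[];

// Hypothetical helper: lower-cased suffix without the dot, "" if none.
function suffixOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot < 0 ? "" : filename.slice(dot + 1).toLowerCase();
}

function isSupported(suffix: string): suffix is SupportedDocType {
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(CONTENT_TYPES, suffix);
}

function contentTypeForSuffix(suffix: string): string | null {
  return isSupported(suffix) ? CONTENT_TYPES[suffix] : null;
}

// The error message lists whatever the table currently contains.
function unsupportedTypeMessage(suffix: string): string {
  const allowed = SUPPORTED_DOC_TYPES.map((s) => "." + s).join(", ");
  return `Unsupported file type ".${suffix}". Allowed: ${allowed}`;
}
```

With this layout, accepting a new format means adding one table row; both routes and the upload error text pick it up without further edits.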
