Fix .msg body extraction + inner-msg attachment detection

✅ merged · #16 · easterbrooka/mike ← easterbrooka/mike · opened 15d ago by easterbrooka · merged 15d ago by easterbrooka · self · +159-45 across 3 files

From the PR description

Two bugs surfaced once users uploaded real .msg files from Outlook:

  1. Body missing. Outlook composes most emails in HTML mode, which means data.body (PidTagBody) is undefined and only data.bodyHtml (PidTagBodyHtml) is populated. Our extractor read body only and silently returned "". It now falls back to HTML-stripped bodyHtml, matching the .eml path's behaviour. The plain-text body still wins when both are present. RTF-compressed-only bodies remain unhandled; deferred until we hit one in practice.

  2. Inner-msg attachments dropped silently. msgreader marks embedded messages with innerMsgContent: true and stores their human-readable name in .name rather than .fileName; its getAttachment() then constructs name + ".msg" as the filename. We were filtering attachments on !!a.fileName only, which discarded inner-msg entries in both the UI's attachment chip list and the LLM's expansion loop. We now detect inner-msg entries via innerMsgContent === true, synthesise the same filename, and surface them downstream.
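
A minimal sketch of the two fixes described above, not the code from this PR. Only the field names body, bodyHtml, fileName, name and innerMsgContent come from the description; the interfaces, the helper names (stripHtml, extractBody, listAttachments), the naive tag stripper, and the "embedded" fallback for an unnamed inner message are all assumptions made for illustration.

```typescript
// Hypothetical shapes: only the field names are taken from the PR description.
interface MsgAttachment {
  fileName?: string;
  name?: string;
  innerMsgContent?: boolean;
}

interface MsgData {
  body?: string;
  bodyHtml?: string;
  attachments?: MsgAttachment[];
}

interface ListedAttachment {
  fileName: string;
  isInnerMsg: boolean;
}

// Naive HTML-to-text helper (assumption): a stand-in for whatever
// stripping the real .eml path uses.
function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/\s+/g, " ")
    .trim();
}

// Plain-text body wins when present; otherwise fall back to stripped HTML.
function extractBody(data: MsgData): string {
  if (data.body) return data.body;
  if (data.bodyHtml) return stripHtml(data.bodyHtml);
  return "";
}

// Keep regular attachments that have a fileName, and keep inner-msg
// entries by synthesising name + ".msg".
function listAttachments(data: MsgData): ListedAttachment[] {
  const out: ListedAttachment[] = [];
  for (const a of data.attachments ?? []) {
    if (a.innerMsgContent === true) {
      // "embedded" is a placeholder for the unnamed case (assumption).
      out.push({ fileName: `${a.name ?? "embedded"}.msg`, isInnerMsg: true });
    } else if (a.fileName) {
      out.push({ fileName: a.fileName, isInnerMsg: false });
    }
  }
  return out;
}
```

Checking innerMsgContent before fileName is the point of the second fix: an entry with no fileName is no longer discarded just because the old !!a.fileName filter saw nothing.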

Tests: 6 new vitest cases (HTML-body fallback, plain-text-body priority, no-body case, inner-msg listed with synthesised filename, inner-msg without a name, inner-msg expanded recursively in extractMsgForLLM). Backend suite 139/139 green; tsc clean.

Our analysis

Recover HTML bodies and inner-msg attachments from Outlook .msg uploads.