nwhitehouse wires in a self-hosted reasoning model

The fork adds a third AI option alongside the usual two - a self-hosted model the team can run on its own hardware.

infrastructuredrafting

Upstream Mike talks to two big commercial AI providers (Anthropic's Claude and Google's Gemini). nwhitehouse bolts on a third lane called Olava: a fine-tuned version of Qwen, an open-weights model from Alibaba, served from the team's own infrastructure rather than a vendor's cloud. The frontend now shows Olava as an available option without each user having to paste in their own API key.

Most of the work is plumbing for the quirks of running a reasoning model yourself - cleaning up the model's internal "thinking" chatter before it reaches the user, giving it enough room to actually think, and teaching the app to understand a non-standard way this particular model asks to use tools. There's also a small polish fix so party names in generated Word documents render bold instead of showing literal asterisks.

So what Relevant if you're weighing self-hosted open-weights models against API-only providers for sensitive legal work where data residency or per-seat costs matter.

View this fork on GitHub →

Spotted something wrong? Or know the PR text has fresher detail than the writeup above?

Commits in this thread

3 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA Subject Author Date
b04c4213 Add OLAVA provider and local dev stack support Nick Whitehouse 2026-04-30 ↗ GitHub
commit body
- New OLAVA provider (vLLM/OpenAI-compatible) wired through types,
  models registry, llm/index router, and a streaming client. Handles
  reasoning-model output: drops `delta.reasoning` /
  `delta.reasoning_content` from emitted text, strips inline
  <think>...</think> blocks, defaults max_tokens to 16384 with
  OLAVA_MAX_TOKENS env override, and logs per-iteration content vs
  reasoning byte counts. Surfaced as "Olava Extract" in the tabular
  review model dropdown only (deliberately omitted from main chat).
- GET /user/server-keys reports which provider keys are present in
  .env (booleans only, placeholder values filtered) so the frontend
  can mark env-configured providers as available without requiring
  per-user keys. Threaded through UserProfileContext, ModelToggle,
  TabularModelDropdown, and the four call sites that build the
  apiKeys check.
- forcePathStyle on the S3Client in both backend and frontend so
  MinIO works as a local R2 substitute (R2 accepts path-style too).
- Frontend dev script binds to port 9000; backend reads PORT (9001).
- supabase/ config.toml + .gitignore from `supabase init` for the
  local Postgres+Auth+Storage stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3a2f397c Olava follow-ups: regen budget, tool gating, main-chat option Nick Whitehouse 2026-04-30 ↗ GitHub
commit body
- completeOlavaText takes max(caller, OLAVA_MAX_TOKENS) so callers
  tuned for non-reasoning models (e.g. tabular regen passing 2048)
  don't undershoot the reasoning budget.
- Strip tools from the request body by default - vLLM rejects with
  HTTP 400 unless launched with --enable-auto-tool-choice. Set
  OLAVA_ENABLE_TOOLS=true to pass tools through when the server is
  configured for them.
- Olava is now offered in the main chat model dropdown alongside
  Anthropic and Google. ApiKeyMissingModal shows a server-config
  message for Olava (env vars) instead of pointing at account
  settings.
- Per-iteration log dumps the truncated response text to make
  diagnosing short / refusal responses straightforward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5a1b0cc3 Olava tool calling, docx markdown bold, tabular UX Nick Whitehouse 2026-05-01 ↗ GitHub
commit body
- Olava tool-call plumbing: switch to non-streaming when tools are
  forwarded (vLLM streaming with --enable-auto-tool-choice silently
  drops the parsed tool_calls payload, even though it sets
  finish_reason="tool_calls"). Single round-trip per iteration is
  more reliable.
- Custom client-side parser for the LoRA's <tool_call><function=...>
  <parameter=...> token format, generic across every tool. Iteratively
  strips trailing </parameter> / </function> / </tool_call> tags so
  values like read_document's docLabel aren't poisoned with markup.
  JSON-decodes parameter values, with scalar coercion fallback.
- generate_docx now parses **bold** markdown in section content via a
  small TextRun splitter so party names / defined terms render bold
  instead of leaking literal asterisks into the .docx.
- System prompt: demoted the "MUST call read_document after
  generate_docx" rule to "MAY", and explicitly forbids re-issuing
  generate_docx in the same turn to "fix" perceived imperfections -
  use edit_document or just describe the issue. Stops the model from
  emitting two duplicate downloads when it self-critiques.
- Tabular review: new "Wrap text" toolbar toggle (cells switch from
  line-clamp-1 to wrap-and-grow). Header columns are drag-to-reorder
  via HTML5 DnD, persisted through the existing columns_config saver;
  display order follows the array, cell lookup keeps using the stable
  .index. Per-column resize via a hidden right-edge drag handle with
  a 120px floor; widths are local state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Capture this thread into my fork

Download a single Markdown prompt that tells Claude how to port every commit above into your working tree — adapting paths and structure to match your repo. Run it via claude -p < capture-thread-119.md from inside the repo you want the changes in.

⬇ Download capture-thread-119.md