Third LLM provider: Olava (vLLM/Qwen LoRA) wired into the router

nwhitehouse added a vLLM-backed "Olava" provider - an OpenAI-compatible endpoint serving a Qwen3-based reasoning LoRA - alongside the existing Anthropic and Gemini paths. The interesting part is not the provider addition itself but the scaffolding built to make a small reasoning model behave: think-block stripping, a 16384-token floor, a custom tool-call parser for non-standard LoRA markup, and a tool-stripping default that prevents 400s on vLLM servers without `--enable-auto-tool-choice`.

infrastructure · drafting

backend/src/lib/llm/olava.ts handles the awkward details. Reasoning output arrives in one of three ways depending on the vLLM build: as a separate delta.reasoning field, as a separate delta.reasoning_content field, or inlined as <think>...</think> in delta.content. The adapter strips all three to keep reasoning tokens out of the visible response. completeOlavaText requests max(caller_max_tokens, OLAVA_MAX_TOKENS), where OLAVA_MAX_TOKENS defaults to 16384, so callers that pass a small cap like 2048 (tuned for non-reasoning models) don't truncate the chain-of-thought before the answer arrives.
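The stripping and the token floor described above can be sketched roughly as follows. This is a minimal illustration, not the fork's code: the names StreamDelta, stripThinkBlocks, visibleText, and effectiveMaxTokens are hypothetical, and only the field names delta.content, delta.reasoning, delta.reasoning_content and the 16384 default come from the writeup and commit log. The sketch also assumes the chunks are accumulated before stripping, which sidesteps tags split across chunk boundaries.

```typescript
// Hypothetical delta shape; only the three field names come from the writeup.
interface StreamDelta {
  content?: string | null;
  reasoning?: string | null;
  reasoning_content?: string | null;
}

// Remove complete <think>...</think> blocks, then any unterminated trailing
// <think> block (possible if generation stopped mid-reasoning).
function stripThinkBlocks(text: string): string {
  return text
    .replace(/<think>[\s\S]*?<\/think>/g, "")
    .replace(/<think>[\s\S]*$/, "")
    .trim();
}

// Only delta.content contributes to the visible text. The two reasoning
// fields are never read, which is what "dropping" them amounts to here.
function visibleText(deltas: StreamDelta[]): string {
  const raw = deltas.map((d) => d.content ?? "").join("");
  return stripThinkBlocks(raw);
}

// Floor the caller's cap at the provider-wide reasoning budget.
// 16384 mirrors the default named in the commit log.
function effectiveMaxTokens(callerMax: number | undefined, olavaMax = 16384): number {
  return Math.max(callerMax ?? 0, olavaMax);
}
```

Joining before stripping trades streaming latency for simplicity; an adapter that emits text incrementally would need to buffer around a partially received tag instead.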

The LoRA emits tool calls in a format no standard OpenAI client handles: <tool_call><function=name><parameter=foo>.... The fork ships a client-side parser that iteratively strips trailing XML tags and JSON-decodes parameter values with scalar coercion fallback. Since vLLM streaming silently drops the parsed tool_calls even when finish_reason="tool_calls", the adapter falls back to non-streaming when tools are forwarded. That decision was later revised - see post-456 - but the parser itself is the reusable piece.
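A parser along the lines described above might look like the sketch below. It is an approximation under stated assumptions: the writeup elides the full markup, so the closing tags and whitespace layout handled here are guesses based on the tag names in the commit log, and the names ParsedToolCall, stripTrailingTags, decodeValue, and parseToolCalls are hypothetical. The exact "scalar coercion" rules are not documented either; the fallback shown (case-insensitive true/false, otherwise the raw string) is one plausible reading.

```typescript
interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// One closing tag (with surrounding whitespace) at the very end of a value.
const TRAILING_TAG = /\s*<\/(?:parameter|function|tool_call)>\s*$/;

// Peel closing tags off the end one at a time until none remain.
function stripTrailingTags(value: string): string {
  let out = value;
  while (TRAILING_TAG.test(out)) {
    out = out.replace(TRAILING_TAG, "");
  }
  return out.trim();
}

// Try JSON first (covers numbers, arrays, objects, quoted strings).
// Assumed fallback: coerce loosely spelled booleans, else keep the string.
function decodeValue(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    const lower = raw.toLowerCase();
    if (lower === "true") return true;
    if (lower === "false") return false;
    return raw;
  }
}

function parseToolCalls(text: string): ParsedToolCall[] {
  const calls: ParsedToolCall[] = [];
  // Everything before the first <tool_call> is ordinary text and is skipped.
  for (const chunk of text.split("<tool_call>").slice(1)) {
    const fn = /<function=([^>\s]+)>/.exec(chunk);
    if (!fn) continue;
    const body = chunk.slice(fn.index + fn[0].length);
    // Splitting on a capturing group yields [prefix, name1, value1, name2, value2, ...].
    const pieces = body.split(/<parameter=([^>\s]+)>/);
    const args: Record<string, unknown> = {};
    for (let i = 1; i + 1 < pieces.length; i += 2) {
      args[pieces[i]] = decodeValue(stripTrailingTags(pieces[i + 1]));
    }
    calls.push({ name: fn[1], arguments: args });
  }
  return calls;
}
```

Because the parser keys only on tag names, it stays generic across tools: nothing in it depends on a particular function or parameter schema.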

Tools are stripped from the request body by default. Set OLAVA_ENABLE_TOOLS=true only if the vLLM server is running with --enable-auto-tool-choice. Without it, the server returns 400. A new GET /user/server-keys endpoint reports which provider env vars are present (booleans, no values), so the frontend can show Olava as available without per-user API keys. The ApiKeyMissingModal message branches on provider === "olava" to explain the server-side configuration path instead of the usual key-entry flow.
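The gating and the key-presence report described above could be sketched like this. It is illustrative only: buildOlavaRequest, serverKeyPresence, the ChatRequest shape, and the placeholder list are hypothetical, and the environment is passed in as a plain object so the example stays self-contained. Only OLAVA_ENABLE_TOOLS, the non-streaming fallback when tools are forwarded, and the booleans-only response come from the writeup and commit log.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical request shape for an OpenAI-compatible chat endpoint.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  max_tokens: number;
  stream: boolean;
  tools?: unknown[];
}

// Drop tools unless the operator has opted in. When tools are forwarded,
// switch to a non-streaming request, as the writeup describes.
function buildOlavaRequest(base: Omit<ChatRequest, "stream">, env: Env): ChatRequest {
  const toolsEnabled = env.OLAVA_ENABLE_TOOLS === "true";
  const { tools, ...rest } = base;
  const forwardTools = toolsEnabled && tools !== undefined && tools.length > 0;
  return forwardTools
    ? { ...rest, tools, stream: false }
    : { ...rest, stream: true }; // no tools key at all, so the server never sees one
}

// Report which variables are set, as booleans only. What counts as a
// placeholder is an assumption here and is supplied by the caller.
function serverKeyPresence(
  env: Env,
  names: string[],
  placeholders: string[] = [],
): Record<string, boolean> {
  const report: Record<string, boolean> = {};
  for (const name of names) {
    const value = (env[name] ?? "").trim();
    report[name] = value !== "" && !placeholders.includes(value);
  }
  return report;
}
```

Omitting the tools key entirely, rather than sending an empty array, keeps the request identical to a plain chat call for servers that were never started with tool support.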

Two unrelated fixes rode in the same commit cluster: a **bold** markdown-to-TextRun splitter in generate_docx so party names render correctly instead of leaking literal asterisks, and a system prompt change demoting "MUST call read_document after generate_docx" to MAY - the model was self-critiquing into duplicate downloads on the same turn.
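The bold splitter mentioned above can be approximated as follows. This is a sketch rather than the fork's implementation: Run and splitBoldRuns are hypothetical names, and the output is a plain array of segments instead of document objects so the example has no dependencies. If generate_docx builds on the docx npm package, each segment would presumably map to a TextRun with the same text and bold options.

```typescript
interface Run {
  text: string;
  bold: boolean;
}

// Split a string into alternating plain and bold segments on **...** pairs.
// Unpaired asterisks are left in place as literal text.
function splitBoldRuns(input: string): Run[] {
  const runs: Run[] = [];
  const pattern = /\*\*(.+?)\*\*/g;
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > last) {
      runs.push({ text: input.slice(last, match.index), bold: false });
    }
    runs.push({ text: match[1], bold: true });
    last = match.index + match[0].length;
  }
  if (last < input.length) {
    runs.push({ text: input.slice(last), bold: false });
  }
  return runs;
}
```

For a clause such as `between **Acme Corp** and **Beta LLC**`, this yields four segments, with the two party names marked bold and no asterisks left in any segment.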

So what

Worth importing if you're wiring in an OpenAI-compatible backend other than Anthropic or Gemini, especially a reasoning model. The reasoning-field stripping, max_tokens floor, and custom tool-call parser are useful scaffolding even if you're not running this specific LoRA. The main caveat: the tool-stripping default and the `OLAVA_ENABLE_TOOLS` flag both assume vLLM's behavior, namely a 400 on tool requests unless the server runs with `--enable-auto-tool-choice`. If your inference server handles tools differently, re-examine both assumptions before using this code.


Commits in this thread

3 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`b04c4213`	Add OLAVA provider and local dev stack support	Nick Whitehouse	2026-04-30
commit body:
- New OLAVA provider (vLLM/OpenAI-compatible) wired through types, models registry, llm/index router, and a streaming client. Handles reasoning-model output: drops `delta.reasoning` / `delta.reasoning_content` from emitted text, strips inline <think>...</think> blocks, defaults max_tokens to 16384 with OLAVA_MAX_TOKENS env override, and logs per-iteration content vs reasoning byte counts. Surfaced as "Olava Extract" in the tabular review model dropdown only (deliberately omitted from main chat).
- GET /user/server-keys reports which provider keys are present in .env (booleans only, placeholder values filtered) so the frontend can mark env-configured providers as available without requiring per-user keys. Threaded through UserProfileContext, ModelToggle, TabularModelDropdown, and the four call sites that build the apiKeys check.
- forcePathStyle on the S3Client in both backend and frontend so MinIO works as a local R2 substitute (R2 accepts path-style too).
- Frontend dev script binds to port 9000; backend reads PORT (9001).
- supabase/ config.toml + .gitignore from `supabase init` for the local Postgres+Auth+Storage stack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`3a2f397c`	Olava follow-ups: regen budget, tool gating, main-chat option	Nick Whitehouse	2026-04-30
commit body:
- completeOlavaText takes max(caller, OLAVA_MAX_TOKENS) so callers tuned for non-reasoning models (e.g. tabular regen passing 2048) don't undershoot the reasoning budget.
- Strip tools from the request body by default - vLLM rejects with HTTP 400 unless launched with --enable-auto-tool-choice. Set OLAVA_ENABLE_TOOLS=true to pass tools through when the server is configured for them.
- Olava is now offered in the main chat model dropdown alongside Anthropic and Google. ApiKeyMissingModal shows a server-config message for Olava (env vars) instead of pointing at account settings.
- Per-iteration log dumps the truncated response text to make diagnosing short / refusal responses straightforward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`5a1b0cc3`	Olava tool calling, docx markdown bold, tabular UX	Nick Whitehouse	2026-05-01
commit body:
- Olava tool-call plumbing: switch to non-streaming when tools are forwarded (vLLM streaming with --enable-auto-tool-choice silently drops the parsed tool_calls payload, even though it sets finish_reason="tool_calls"). Single round-trip per iteration is more reliable.
- Custom client-side parser for the LoRA's <tool_call><function=...> <parameter=...> token format, generic across every tool. Iteratively strips trailing </parameter> / </function> / </tool_call> tags so values like read_document's docLabel aren't poisoned with markup. JSON-decodes parameter values, with scalar coercion fallback.
- generate_docx now parses bold markdown in section content via a small TextRun splitter so party names / defined terms render bold instead of leaking literal asterisks into the .docx.
- System prompt: demoted the "MUST call read_document after generate_docx" rule to "MAY", and explicitly forbids re-issuing generate_docx in the same turn to "fix" perceived imperfections - use edit_document or just describe the issue. Stops the model from emitting two duplicate downloads when it self-critiques.
- Tabular review: new "Wrap text" toolbar toggle (cells switch from line-clamp-1 to wrap-and-grow). Header columns are drag-to-reorder via HTML5 DnD, persisted through the existing columns_config saver; display order follows the array, cell lookup keeps using the stable .index. Per-column resize via a hidden right-edge drag handle with a 120px floor; widths are local state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
