Qwen3 thinking mode gets env-var controls and a collapsed UI

nwhitehouse adds three env vars to gate Qwen3's reasoning budget and reworks the frontend thinking card so it collapses by default. The per-helper opt-out pattern - passing `enableThinking: false` on small single-purpose calls - is the most portable piece.

chat-ui, infrastructure

Three new backend env vars control Olava's reasoning behavior: `OLAVA_THINKING_MODE` (`off|low|standard`, defaults to `standard`), `OLAVA_MAX_TOKENS` (default halved from 16384 to 8192), and `OLAVA_COMPLETION_MAX_TOKENS` (2048, for helper calls). The Olava adapter sets Qwen3's thinking switch via vLLM's `chat_template_kwargs.enable_thinking`, and in `low` or `off` mode it also appends a `/no_think` hint to the system prompt, since Qwen3 doesn't expose a per-token reasoning budget.
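A minimal sketch of how an adapter might turn those env vars into a request body. This is not the fork's `olava.ts`: the type and function names, the `"qwen3"` model placeholder, and the choice to set `enable_thinking` to `false` in both `low` and `off` modes are assumptions made for illustration. The env var names, the defaults, the `chat_template_kwargs.enable_thinking` field, and the `/no_think` hint come from the commit message.

```typescript
// Sketch only: type and function names are hypothetical, not the fork's code.
type ThinkingMode = "off" | "low" | "standard";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
  chat_template_kwargs: { enable_thinking: boolean };
}

type Env = Record<string, string | undefined>;

// Unknown or missing values fall back to the documented default, "standard".
function parseThinkingMode(raw: string | undefined): ThinkingMode {
  switch ((raw ?? "").trim().toLowerCase()) {
    case "off":
      return "off";
    case "low":
      return "low";
    default:
      return "standard";
  }
}

// Accept only positive integers; anything else uses the fallback.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Append the /no_think hint to the first system message, or add one if absent.
function withNoThinkHint(messages: ChatMessage[]): ChatMessage[] {
  const copy = messages.map((m) => ({ ...m }));
  const system = copy.find((m) => m.role === "system");
  if (system) {
    system.content = `${system.content}\n/no_think`;
  } else {
    copy.unshift({ role: "system", content: "/no_think" });
  }
  return copy;
}

function buildChatRequest(env: Env, model: string, messages: ChatMessage[]): ChatRequest {
  const mode = parseThinkingMode(env["OLAVA_THINKING_MODE"]);
  // Assumption: only "standard" leaves thinking enabled.
  const thinking = mode === "standard";
  return {
    model,
    messages: thinking ? messages.map((m) => ({ ...m })) : withNoThinkHint(messages),
    max_tokens: parsePositiveInt(env["OLAVA_MAX_TOKENS"], 8192),
    chat_template_kwargs: { enable_thinking: thinking },
  };
}
```

Taking the environment as a parameter rather than reading it globally keeps the function easy to test. The input messages are copied, so the caller's array is never mutated.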

The caller-level override is the clean part. Passing enableThinking: false on calls in research/queryExpander.ts and research/triage.ts forces low mode regardless of the env var - a 5-word query rewrite doesn't need 4000 reasoning tokens. That pattern transfers to any reasoning model.
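The override rule can be sketched as a small pure function. `CompletionOptions` and `resolveMode` are hypothetical names, not taken from the fork; only the `enableThinking: false` option and its "forces low mode regardless of env" behavior come from the commit message. The source does not say what an explicit `true` does, so this sketch leaves the env setting in charge in that case.

```typescript
// Sketch only: CompletionOptions and resolveMode are hypothetical names.
type Mode = "off" | "low" | "standard";

interface CompletionOptions {
  // Leave undefined to follow the environment; pass false to opt out per call.
  enableThinking?: boolean;
}

function resolveMode(envMode: Mode, options: CompletionOptions = {}): Mode {
  // Per the commit message, an explicit false forces low mode regardless of env.
  if (options.enableThinking === false) {
    return "low";
  }
  // Assumption: undefined or true defers to the env-configured mode.
  return envMode;
}

// A helper call (for example a query rewrite) opts out of thinking.
const helperMode = resolveMode("standard", { enableThinking: false });

// An interactive chat call passes nothing and follows the env setting.
const chatMode = resolveMode("standard");
```

Here `helperMode` resolves to `"low"` and `chatMode` stays `"standard"`, so the expensive reasoning pass is kept for the calls where it is likely to matter.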

Frontend AssistantMessage.tsx gets a reworked thinking card: collapsed by default, markdown-aware rendering, bounded scroll area. One earlier feature was dropped - a hard OLAVA_REASONING_DISPLAY_CHAR_LIMIT backend truncation cap - because collapsed-by-default handles the "too much reasoning output" problem without an ugly cutoff marker.
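As a framework-free illustration of "collapsed by default" with a bounded scroll area, the sketch below builds an HTML string around a native `<details>` element, which renders closed unless it carries the `open` attribute. It is not the fork's `AssistantMessage.tsx`, which is a React component. The function names, the `thinking-card` class, and the 240px default height are assumptions, and markdown rendering is left out.

```typescript
// Sketch only: hypothetical helpers, not the fork's React component.

// Escape text so reasoning output cannot inject markup into the card.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderThinkingCard(reasoning: string, maxHeightPx: number = 240): string {
  const body = escapeHtml(reasoning.trim());
  // No `open` attribute, so the browser shows only the summary until clicked.
  // max-height plus overflow-y keeps long reasoning inside a scrollable box.
  return [
    '<details class="thinking-card">',
    "  <summary>Thinking</summary>",
    `  <div style="max-height: ${maxHeightPx}px; overflow-y: auto; white-space: pre-wrap;">${body}</div>`,
    "</details>",
  ].join("\n");
}
```

Because the full reasoning stays in the DOM and is only hidden, nothing has to be truncated on the backend, which is the trade-off the commit describes.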

Per the commit message, setting `OLAVA_THINKING_MODE=low` in the Railway env disables Qwen reasoning entirely (no `<think>` block) for faster responses, with no code change needed.

So what

Worth a look if your fork runs a reasoning model and you want cheap-tier helper calls (query rewrite, triage, classification) to skip the thinking pass. The per-call `enableThinking: false` opt-out is a clean pattern. The `chat_template_kwargs.enable_thinking` mechanism is specific to vLLM and Qwen3; the collapsed-card UI is generally good.


Commits in this thread

1 commit from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`eaef8912`	[feat-019] Thinking controls + collapsed reasoning UI	Nick Whitehouse	2026-05-07
commit body

What's in this commit:
- backend/.env.example       - OLAVA_THINKING_MODE (off|low|standard,
                               default standard), OLAVA_MAX_TOKENS
                               (default 8192, was 16384),
                               OLAVA_COMPLETION_MAX_TOKENS (2048).
- backend/src/lib/llm/olava.ts - Qwen3 thinking control via vLLM
                                 `chat_template_kwargs.enable_thinking`.
                                 In low/off mode also appends a /no_think
                                 hint to the system prompt. Caller-passed
                                 `enableThinking: false` forces low mode
                                 regardless of env (used by helper calls).
- backend/src/lib/chatTools.ts - adds a "REASONING BUDGET: keep internal
                                 analysis brief and targeted" line to the
                                 chat system prompt as soft guidance.
- backend/src/lib/research/{queryExpander,triage}.ts - non-interactive
                                 helper calls opt out of thinking
                                 (enableThinking: false) so a 5-word
                                 search-query rewrite doesn't burn 4000
                                 tokens reasoning first.
- frontend/.../AssistantMessage.tsx - thinking card collapsed by default,
                                      readable spacing, markdown-aware
                                      reasoning rendering, bounded scroll
                                      area so long reasoning doesn't
                                      dominate the message.

Defaults take effect immediately on deploy. To disable Qwen reasoning
entirely (snappier, no <think> block), set OLAVA_THINKING_MODE=low in
the Railway env. No code change needed.

Removed from earlier draft: the OLAVA_REASONING_DISPLAY_CHAR_LIMIT cap +
"[Thought process truncated by display limit.]" marker. The collapsed-
by-default UI handles "hide so much of the read out" without a hard
backend truncation; the marker was ugly when it appeared.

Backlog entries for bug-008 (assistant thinking output noisy) and
feat-019 added. Rebased onto main post-Sprint-3 so feat-017's
tool_call_id / tool_calls preservation in olava.ts is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
