nwhitehouse keeps the AI typing live during tool calls

When the assistant pauses to look something up, it no longer goes silent the whole time.

chat-ui · infrastructure

nwhitehouse runs Mike on a self-hosted setup where the AI model can pause mid-answer to fetch case law from CourtListener (a free public database of US court opinions). The problem: on turns where the model reached for a tool, the chat window froze - no live typing, no signal that anything was happening, just dead air until the full response landed.

The first fix assumed the tool-call markup would still arrive in the streamed text, just not in the structured field, so it could be parsed on the client. The team shipped it, then found by testing against the live endpoint that the underlying server (vLLM) drops the tool-call payload entirely when streaming: the turn ends flagged as a tool call, but the call itself is missing. The second fix is a workaround: when a streamed turn ends with that flag and no recoverable call, re-send that one request in non-streaming mode to capture the call, then go back to live typing for the actual written answer. The result is roughly two seconds of pause when a lookup fires, then smooth streaming for everything that follows.

So what: anyone self-hosting a chat assistant on top of an open-source model will recognise this class of bug, and the pragmatic recover-and-retry pattern is worth borrowing.



Commits in this thread

2 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA Subject Author Date
6d2aac9e [feat-001] Stream tokens during tool-using turns Nick Whitehouse 2026-05-03
commit body
Olava previously fell back to a non-streaming request whenever tools were
forwarded, because vLLM's tool-call streaming is broken for the LoRA's
custom <tool_call><function=...><parameter=...> markup - `delta.tool_calls`
arrives empty even when finish_reason is "tool_calls".

Fix it client-side: keep streaming on, accumulate raw delta.content, but
filter the user-visible stream through a small state machine that hides
<think>...</think> blocks and everything after a <tool_call> open tag.
Held-back tail handles markup that spans chunk boundaries. After the stream
ends, run the existing parseCustomToolCall() on the raw buffer to extract
the call and dispatch via runTools - same path the non-streaming branch
already used.

Also fixes a related bug: the no-tools "streaming" path was buffering the
entire response and emitting one giant onContentDelta at the end. Now
genuinely per-token in both paths.

Emergency rollback available via OLAVA_FORCE_NONSTREAM_TOOLS=true.

Adds backlog.md to track the sprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
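
The visible-stream filter described in 6d2aac9e could look roughly like the sketch below. It is not the code from the commit: the class name `VisibleStreamFilter` and the helpers `partialSuffix` and `earliest` are hypothetical, and only the tag names `<think>` and `<tool_call>` come from the commit message. The idea is a three-state machine that emits ordinary text, hides think blocks, hides everything after a tool-call open tag, and holds back any tail that might be the start of a tag split across chunks.

```typescript
// Illustrative sketch only; class and helper names are hypothetical.
const THINK_OPEN = "<think>";
const THINK_CLOSE = "</think>";
const TOOL_OPEN = "<tool_call>";

type Mode = "visible" | "think" | "tool";

// Longest suffix of `text` that is a proper prefix of one of `tags`.
function partialSuffix(text: string, tags: string[]): string {
  let best = "";
  for (const tag of tags) {
    const max = Math.min(tag.length - 1, text.length);
    for (let n = max; n > best.length; n--) {
      const suffix = text.slice(text.length - n);
      if (tag.startsWith(suffix)) {
        best = suffix;
        break;
      }
    }
  }
  return best;
}

// Smaller of two indexOf results, treating -1 as "not found".
function earliest(a: number, b: number): number {
  if (a === -1) return b;
  if (b === -1) return a;
  return Math.min(a, b);
}

class VisibleStreamFilter {
  private mode: Mode = "visible";
  private pending = ""; // unprocessed text; may end with a partial tag

  // Feed one raw chunk; returns the text that is safe to show now.
  push(chunk: string): string {
    this.pending += chunk;
    let out = "";
    while (this.pending.length > 0) {
      if (this.mode === "tool") {
        this.pending = ""; // everything after <tool_call> stays hidden
        break;
      }
      if (this.mode === "think") {
        const close = this.pending.indexOf(THINK_CLOSE);
        if (close === -1) {
          // Drop hidden text, keep only a possible start of </think>.
          this.pending = partialSuffix(this.pending, [THINK_CLOSE]);
          break;
        }
        this.pending = this.pending.slice(close + THINK_CLOSE.length);
        this.mode = "visible";
        continue;
      }
      const t = this.pending.indexOf(THINK_OPEN);
      const c = this.pending.indexOf(TOOL_OPEN);
      const next = earliest(t, c);
      if (next === -1) {
        // No full tag yet: emit all but a tail that could begin one.
        const tail = partialSuffix(this.pending, [THINK_OPEN, TOOL_OPEN]);
        out += this.pending.slice(0, this.pending.length - tail.length);
        this.pending = tail;
        break;
      }
      out += this.pending.slice(0, next);
      if (next === t) {
        this.mode = "think";
        this.pending = this.pending.slice(next + THINK_OPEN.length);
      } else {
        this.mode = "tool";
        this.pending = "";
      }
    }
    return out;
  }

  // End of stream: a held-back tail in visible mode was plain text after all.
  flush(): string {
    const out = this.mode === "visible" ? this.pending : "";
    this.pending = "";
    return out;
  }
}
```

In this sketch the caller would keep its own raw buffer of every chunk for later parsing, and pass only the return values of `push` and `flush` to the UI callback.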
e772ac55 [bug-002] Recover tool calls when vLLM streaming drops the payload Nick Whitehouse 2026-05-03
commit body
feat-001's premise was that the Olava LoRA's custom tool-call markup
(<tool_call><function=...><parameter=...>) would arrive in delta.content
during streaming, where parseCustomToolCall could extract it. Verified
against the live RunPod endpoint with a "What's the latest court opinion
involving AI" query: vLLM finishes with finish_reason="tool_calls" but
neither populates delta.tool_calls (accCalls=0) nor includes the markup
in delta.content (raw text comes through as just "\n\n"). The tool-call
info just disappears in streaming mode for this LoRA.

Fix: when streaming finishes with finish_reason="tool_calls" but no
tool call extracted from either channel, re-issue the iter as a single
non-streaming request and parse the markup from message.content. One
extra request per tool-using iter, only on iter 0. Iter 1+ (the prose
answer iters that come after the tool runs) stream normally - that's
where the streaming win actually lives.

Net behaviour:
- Tool-free turns: stream tokens (feat-001 win preserved).
- Tool-using turns iter 0: ~2s of dead air to detect + recover the
  call. Same as the original always-non-stream behaviour.
- Tool-using turns iter 1+: stream prose tokens to the user.

OLAVA_FORCE_NONSTREAM_TOOLS=true escape hatch from feat-001 still
works if the recovery itself misbehaves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
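
The recovery step described in e772ac55 can be sketched as below. This is a hedged illustration under stated assumptions, not the fork's code. `runIteration`, `IterationDeps`, `streamOnce` and `requestOnce` are hypothetical names, and the signature given to `parseCustomToolCall` is a guess; only the function's name appears in the commit. The `forceNonStream` parameter stands in for the `OLAVA_FORCE_NONSTREAM_TOOLS=true` escape hatch, which the caller would read from the environment.

```typescript
// Illustrative sketch only; all types and function names are assumptions.
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

interface StreamResult {
  finishReason: string; // e.g. "stop" or "tool_calls"
  rawContent: string;   // concatenated delta.content
  toolCalls: ToolCall[]; // accumulated delta.tool_calls (may be empty)
}

interface IterationDeps {
  // Streams one model turn, forwarding visible text to onDelta.
  streamOnce: (onDelta: (text: string) => void) => Promise<StreamResult>;
  // Sends the same turn as a single non-streaming request.
  requestOnce: () => Promise<{ content: string }>;
  // Extracts a tool call from custom markup, or null if none is present.
  parseCustomToolCall: (raw: string) => ToolCall | null;
}

// Runs one iteration and returns the tool call to dispatch, if any.
async function runIteration(
  deps: IterationDeps,
  onDelta: (text: string) => void,
  forceNonStream: boolean
): Promise<ToolCall | null> {
  if (forceNonStream) {
    // Escape hatch: skip streaming entirely for this turn.
    const message = await deps.requestOnce();
    const forced = deps.parseCustomToolCall(message.content);
    if (forced === null) onDelta(message.content); // plain answer, shown at once
    return forced;
  }

  const streamed = await deps.streamOnce(onDelta);

  // Try both channels: structured tool_calls first, then markup in the text.
  let call: ToolCall | null =
    streamed.toolCalls[0] ?? deps.parseCustomToolCall(streamed.rawContent);

  // The server signalled a tool call but the payload never arrived:
  // re-issue the turn without streaming and parse the full message.
  if (call === null && streamed.finishReason === "tool_calls") {
    const message = await deps.requestOnce();
    call = deps.parseCustomToolCall(message.content);
  }
  return call;
}
```

The extra request is only paid when the stream ends flagged as a tool call with nothing to parse, which matches the commit's note that tool-free turns and the later prose iterations keep streaming normally.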
