Olava streaming restored, then corrected after live endpoint testing

Two commits, same day, in opposite directions. feat-001 restores streaming during tool-using turns by adding a state machine that hides `<think>` blocks and `<tool_call>` markup from the user-visible stream while keeping the raw buffer intact for the custom parser. bug-002 then discovers that on the actual RunPod endpoint the markup never lands in `delta.content` at all - vLLM just drops it - and pivots to re-issuing the first iter as a non-streaming request to recover the tool call.

chat-ui, infrastructure

The prior code fell back to non-streaming on any turn with tools in scope. feat-001's fix was a `StreamingMarkupFilter` class: it maintains a state machine across chunk boundaries, suppressing `<think>...</think>` blocks and everything after a `<tool_call>` open tag, while passing through visible tokens in real time. It holds back any trailing slice that could be the start of a tag (e.g. `<`, `<t`, `<tool_ca`) so markup split across two chunks isn't accidentally emitted. After the stream ends, `parseCustomToolCall()` runs on the raw accumulated buffer. The emergency rollback path is `OLAVA_FORCE_NONSTREAM_TOOLS=true`.
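The hold-back behaviour is easier to see in code. What follows is a minimal sketch of such a filter, not the fork's actual `StreamingMarkupFilter`: the class name `MarkupFilterSketch`, its `push`/`flush` methods, and the helper `partialTagSuffix` are hypothetical names chosen for illustration, and only the three tag strings come from the writeup.

```typescript
// Sketch only: names and structure are assumptions, not the fork's code.
const THINK_OPEN = "<think>";
const THINK_CLOSE = "</think>";
const TOOL_OPEN = "<tool_call>";

type Mode = "visible" | "think" | "tool";

// Length of the longest suffix of `text` that is a proper prefix of `tag`.
function partialTagSuffix(text: string, tag: string): number {
  const max = Math.min(text.length, tag.length - 1);
  for (let n = max; n > 0; n--) {
    if (text.slice(text.length - n) === tag.slice(0, n)) return n;
  }
  return 0;
}

class MarkupFilterSketch {
  raw = ""; // every chunk, unfiltered, for the tool-call parser
  private pending = ""; // text not yet classified as visible or hidden
  private mode: Mode = "visible";

  // Feed one stream chunk; returns the text that is safe to show now.
  push(chunk: string): string {
    this.raw += chunk;
    this.pending += chunk;
    let out = "";
    for (;;) {
      if (this.mode === "tool") {
        // Everything after <tool_call> stays hidden for the rest of the stream.
        this.pending = "";
        return out;
      }
      if (this.mode === "think") {
        const close = this.pending.indexOf(THINK_CLOSE);
        if (close === -1) {
          // Drop hidden text, keeping only a tail that may start </think>.
          const keep = partialTagSuffix(this.pending, THINK_CLOSE);
          this.pending = this.pending.slice(this.pending.length - keep);
          return out;
        }
        this.pending = this.pending.slice(close + THINK_CLOSE.length);
        this.mode = "visible";
        continue;
      }
      const think = this.pending.indexOf(THINK_OPEN);
      const tool = this.pending.indexOf(TOOL_OPEN);
      if (think === -1 && tool === -1) {
        // No complete tag: emit all but a tail that could be a tag prefix.
        const keep = Math.max(
          partialTagSuffix(this.pending, THINK_OPEN),
          partialTagSuffix(this.pending, TOOL_OPEN),
        );
        const cut = this.pending.length - keep;
        out += this.pending.slice(0, cut);
        this.pending = this.pending.slice(cut);
        return out;
      }
      // Emit text before the earliest tag, then switch to the hidden mode.
      const thinkFirst = think !== -1 && (tool === -1 || think < tool);
      const at = thinkFirst ? think : tool;
      const tagLength = thinkFirst ? THINK_OPEN.length : TOOL_OPEN.length;
      out += this.pending.slice(0, at);
      this.pending = this.pending.slice(at + tagLength);
      this.mode = thinkFirst ? "think" : "tool";
    }
  }

  // Call once when the stream ends: a held-back tail that never became a
  // tag (for example a lone "<") is ordinary text and is released here.
  flush(): string {
    const tail = this.mode === "visible" ? this.pending : "";
    this.pending = "";
    return tail;
  }
}
```

In this sketch a caller would pass each `delta.content` chunk to `push()`, forward the return value to the UI, call `flush()` at end of stream, and hand `raw` to the tool-call parser.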

Then bug-002 tested that premise against the live RunPod endpoint. The actual behavior: vLLM sets `finish_reason="tool_calls"` but `delta.content` comes through as just `"\n\n"` - no markup, no `delta.tool_calls`, nothing. feat-001's premise was wrong.

The fix is `recoverToolCallNonStreaming()`: when a streaming iter ends with `finish_reason="tool_calls"` but zero extracted calls, re-issue the same request with `stream: false`. The non-streaming response puts the markup in `message.content` where the existing parser can reach it. The cost is one extra request and about 2 seconds of dead air on iter 0 of any tool-using turn, the same wait as the original always-non-streaming behaviour. Iter 1 and later - the prose-generating iterations after the tool runs - stream normally, which is where the streaming benefit actually lives.
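The trigger condition and the retry request can be sketched as two small pure functions. This is an illustration under assumptions, not the body of `recoverToolCallNonStreaming()`: the `StreamOutcome` and `ChatRequest` shapes and both function names are hypothetical, and the network call and the parse of `message.content` are left out.

```typescript
// Hypothetical shapes: the fork's real types are not shown in the writeup.
interface StreamOutcome {
  finishReason: string | null; // finish_reason of the last streamed chunk
  extractedCalls: number; // calls found in delta.tool_calls or parsed from the raw buffer
}

interface ChatRequest {
  model: string;
  messages: unknown[];
  stream: boolean;
  tools?: unknown[];
}

// Recovery is needed only when the server reports a tool call that
// neither channel of the stream actually delivered.
function needsNonStreamingRecovery(outcome: StreamOutcome): boolean {
  return outcome.finishReason === "tool_calls" && outcome.extractedCalls === 0;
}

// Same request, with streaming switched off; the original is not mutated.
function toNonStreamingRequest(request: ChatRequest): ChatRequest {
  return { ...request, stream: false };
}
```

Keeping the condition this narrow means turns that stream a usable tool call, or that finish with any other reason, never pay for the second request.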

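The `StreamingMarkupFilter` is still in the codebase and handles the case where vLLM does stream the markup. The non-streaming recovery path handles the case where it doesn't. Both stay active: the filter runs on every stream, and the recovery fires only when a stream ends with `finish_reason="tool_calls"` and no extracted call.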

So what

The recovery pattern is the portable piece: any vLLM-served model where streaming silently swallows tool calls needs this same defensive non-streaming retry on iter 0. Worth importing if your fork serves a similar model. The caveat nwhitehouse flags is real - if vLLM fixes its streaming, the recovery becomes a gratuitous extra round-trip. Keep it behind an env flag and revisit when you upgrade vLLM.
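One way to gate the recovery behind a flag is sketched below. `OLAVA_FORCE_NONSTREAM_TOOLS` is the escape hatch named in the commits; `OLAVA_TOOLCALL_RECOVERY`, the `ToolTurnPolicy` shape, the function names, and the rule that only the string `true` enables a flag are all assumptions made for illustration.

```typescript
// The env is passed in as a plain map so the sketch has no runtime
// dependency; in Node a caller could pass process.env.
type Env = { [name: string]: string | undefined };

// Assumption: a flag is on only when its value is "true" (case-insensitive).
function flagEnabled(env: Env, name: string, fallback: boolean): boolean {
  const value = env[name];
  if (value === undefined) return fallback;
  return value.trim().toLowerCase() === "true";
}

interface ToolTurnPolicy {
  streamIter0: boolean; // stream the first iter of a tool-using turn
  recoverDroppedCalls: boolean; // retry non-streaming when the call is dropped
}

function toolTurnPolicy(env: Env): ToolTurnPolicy {
  // Flag from the commits: forces the old always-non-streaming path.
  const forceNonStream = flagEnabled(env, "OLAVA_FORCE_NONSTREAM_TOOLS", false);
  // Hypothetical flag: on by default, switch off once vLLM streams tool calls.
  const recovery = flagEnabled(env, "OLAVA_TOOLCALL_RECOVERY", true);
  return {
    streamIter0: !forceNonStream,
    // With streaming forced off there is nothing left to recover.
    recoverDroppedCalls: !forceNonStream && recovery,
  };
}
```

Defaulting the recovery to on keeps current behaviour unchanged, and turning it off after a vLLM upgrade is then a config change, not a code change.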


Commits in this thread

2 commits from nwhitehouse/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`6d2aac9e`	[feat-001] Stream tokens during tool-using turns	Nick Whitehouse	2026-05-03
`e772ac55`	[bug-002] Recover tool calls when vLLM streaming drops the payload	Nick Whitehouse	2026-05-03

Commit body, `6d2aac9e`:

Olava previously fell back to a non-streaming request whenever tools were forwarded, because vLLM's tool-call streaming is broken for the LoRA's custom <tool_call><function=...><parameter=...> markup - `delta.tool_calls` arrives empty even when finish_reason is "tool_calls".

Fix it client-side: keep streaming on, accumulate raw delta.content, but filter the user-visible stream through a small state machine that hides <think>...</think> blocks and everything after a <tool_call> open tag. Held-back tail handles markup that spans chunk boundaries.

After the stream ends, run the existing parseCustomToolCall() on the raw buffer to extract the call and dispatch via runTools - same path the non-streaming branch already used.

Also fixes a related bug: the no-tools "streaming" path was buffering the entire response and emitting one giant onContentDelta at the end. Now genuinely per-token in both paths.

Emergency rollback available via OLAVA_FORCE_NONSTREAM_TOOLS=true.

Adds backlog.md to track the sprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Commit body, `e772ac55`:

feat-001's premise was that the Olava LoRA's custom tool-call markup (<tool_call><function=...><parameter=...>) would arrive in delta.content during streaming, where parseCustomToolCall could extract it.

Verified against the live RunPod endpoint with a "What's the latest court opinion involving AI" query: vLLM finishes with finish_reason="tool_calls" but neither populates delta.tool_calls (accCalls=0) nor includes the markup in delta.content (raw text comes through as just "\n\n"). The tool-call info just disappears in streaming mode for this LoRA.

Fix: when streaming finishes with finish_reason="tool_calls" but no tool call extracted from either channel, re-issue the iter as a single non-streaming request and parse the markup from message.content. One extra request per tool-using iter, only on iter 0. Iter 1+ (the prose answer iters that come after the tool runs) stream normally - that's where the streaming win actually lives.

Net behaviour:
- Tool-free turns: stream tokens (feat-001 win preserved).
- Tool-using turns iter 0: ~2s of dead air to detect + recover the call. Same as the original always-non-stream behaviour.
- Tool-using turns iter 1+: stream prose tokens to the user.

OLAVA_FORCE_NONSTREAM_TOOLS=true escape hatch from feat-001 still works if the recovery itself misbehaves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
