Use SGLang with Claude Code (but be prepared for patching)
TLDR: SGLang is likely to be the fastest way of serving your local models for Claude Code, if you can live with a bit of instability. It is much faster than Ollama. However, as of 20 June 2026, some patching is required.
As some of you may know, my AI assistant (or bot), Alice uses Claude Code - instead of a Claw-type harness - for reasons I shall not fully go into here. I then cancelled my Claude subscription with the intention to just stop using Claude Code entirely. But then - to cut a long story short - I decided to go back to Claude Code, running off local models.
I use Claude Code for two purposes: as a harness to run Alice on, and to do coding work (mainly, to support and maintain the current harness). I mention this to acknowledge that although it is possible to run local coding models at a low speed, in order for Alice to be responsive as an agent needs to be, the local model has to run at real-time (or at least, similar speeds to a Claude model on subscription), which is not just a matter of decode (generation) speed, but also prefill/KV cache speed. The former can be bumped up cheaply by MTP, but not the latter. And SGLang is an absolute beast when it comes to the latter. Why is it important? Because Claude Code has A LOT of context that is being loaded ALL the time, repeatedly. A lot of time is wasted just reading things (yes, another thing to optimize, I know) and if your inference server has slow prefill it's going to take longer to respond.
My main objective during the past week was to achieve that type of speed, using the best possible local model that I could run on my hardware. Having speed as my main criterion ruled out a lot of options. I eventually settled on qwen3-27b running at FP8, with multi-token prediction (MTP) for the speed bump. This required testing the best possible inference server I could find, and "best" is a combination of raw speed and reliability (especially accuracy in tool calling and long context work).
The part below tells the story of the week or so it took me to get SGLang working with Claude Code. It was drafted by Alice (on qwen3-27b-fp8-mtp, served using SGLang), in accordance with my instructions.
From LiteLLM to Volume-Mounted Patches: My SGLang Odyssey
June 20, 2026
I've been running local LLMs for Claude Code through a launcher that routes completions to on-metal models instead of Anthropic's API. I started out with Ollama, but it was painfully slow. Then I switched to llama.cpp, which became my workhorse: rock-solid at 52 tok/s output, 600–1000 tok/s prefill, 83% MTP acceptance. Zero drama.
Then I got ambitious. I wanted the throughput of modern serving engines — SGLang's radix cache, vLLM's PagedAttention — without sacrificing the Anthropic Messages API compatibility that Claude Code demands.
What followed was a week of LiteLLM proxies, hand-rolled Bun middleware, SSE transform streams, pull-based line buffers, and eventually the realization that the fix was two volume-mounted Python files all along.
Act I: LiteLLM — The Easy Button That Isn't
The reason why LiteLLM even came into the picture was that SGLang did not play very nice with Claude Code's messages endpoint. A couple of LLMs I surveyed, including frontier ones, suggested that I run SGLang behind LiteLLM. LiteLLM translates between provider APIs, so theoretically I could point Claude Code at LiteLLM's /v1/messages and have it proxy to SGLang.
It worked... sort of. LiteLLM silently dropped speculative token metadata, had problems with subagents and tool calls, and hit a weird limit once the context reached 100,000 tokens. It was suggested that neither vLLM nor SGLang gets MTP benefits through LiteLLM, although I'm still not sure whether that is true. My llama.cpp setup worked because there was no proxy in the chain.
Lesson #1: Proxies are transparent until they're not. When you're optimizing for token-level performance, every layer in the stack matters.
Act II: The Native Endpoint Dream
So I thought, KISS (Keep It Simple, Stupid), right? Maybe I just needed to upgrade to a newer version of SGLang. SGLang v0.5.13 ships with a native /v1/messages Anthropic-compatible endpoint. In principle, no proxy needed. Set ANTHROPIC_BASE_URL to http://localhost:8072/v1 and you're done.
Or so I thought.
Bug #1: System Messages in the Middle
Claude Code 2.1.154+ sends role: "system" entries inside the messages[] array via the mid-conversation-system-2026-04-07 beta. SGLang's Pydantic validation said no:
400: Input should be 'user' or 'assistant'
The server's schema literally doesn't know what a system message in the message stream is. GitHub issue #26773 had the fix — a model validator that hoists mid-conversation system messages to the top-level system field — but it hadn't been released.
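A minimal sketch of the hoisting idea, assuming requests are plain dicts in the Anthropic Messages shape. The function names are hypothetical and this is not the code from the upstream fix, just an illustration of moving mid-array system entries into the top-level system field:

```python
from typing import Any


def _as_text(content: Any) -> str:
    """Flatten a string, or a list of {"type": "text", "text": ...} blocks, to text."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return "\n".join(
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return ""


def hoist_system_messages(request: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the request with role == "system" entries removed from
    messages[] and appended to the top-level "system" field."""
    hoisted: list[str] = []
    kept: list[dict[str, Any]] = []
    for message in request.get("messages", []):
        if message.get("role") == "system":
            hoisted.append(_as_text(message.get("content", "")))
        else:
            kept.append(message)

    result = dict(request)  # shallow copy; the caller's dict is left untouched
    result["messages"] = kept
    if hoisted:
        existing = _as_text(request.get("system", ""))
        result["system"] = "\n\n".join(p for p in [existing, *hoisted] if p)
    return result
```

This sketch does not merge adjacent same-role messages that may end up next to each other once a system entry is removed; a real fix may need to handle that as well.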
Bug #2: Tool-Calling Streams That Lie
Even after fixing system messages, Claude Code started crashing with:
API Error: Content block is not a text block.
This was SGLang issue #24293. The Anthropic adapter in SGLang tracked whether a content block was open, but not what type it was. When Qwen3 emitted whitespace between consecutive tool calls, SGLang emitted it as text_delta on a still-open tool_use block. The Anthropic SDK asserts that tool_use blocks only accept input_json_delta — and crashed.
Fix: PR #24294 — add content_block_type tracking to the serving handler. Also not in v0.5.13.
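To make the type-tracking idea concrete, here is a rough sketch written as a filter over already-parsed stream events rather than as a change to the serving handler, so it is not the PR's code. The event shapes follow the Anthropic streaming format; the function name and the allow-list are assumptions for illustration:

```python
from typing import Any, Iterable, Iterator

# Which delta types are valid for which kind of open content block.
ALLOWED_DELTAS = {
    "text": {"text_delta"},
    "tool_use": {"input_json_delta"},
}


def drop_mismatched_deltas(
    events: Iterable[dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Remember the type of each open block by index and skip deltas whose
    type does not belong to that block (e.g. text_delta on tool_use)."""
    open_blocks: dict[int, str] = {}
    for event in events:
        kind = event.get("type")
        if kind == "content_block_start":
            open_blocks[event["index"]] = event["content_block"]["type"]
        elif kind == "content_block_delta":
            allowed = ALLOWED_DELTAS.get(open_blocks.get(event["index"]))
            if allowed is not None and event["delta"]["type"] not in allowed:
                continue  # rogue delta: drop it instead of forwarding
        elif kind == "content_block_stop":
            open_blocks.pop(event["index"], None)
        yield event
```

Block types that are not in the allow-list pass through unchanged, so the filter only intervenes in the cases it knows about.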
Bug #3: Thinking Tokens You Can't Turn Off
Qwen3.6-27B generates chain-of-thought tokens wrapped in <thinking> tags. In a Claude Code session, these waste context space and add latency. The Anthropic API lets you disable thinking per-request with thinking: { type: "disabled" }.
SGLang's Pydantic schema has no thinking field on AnthropicMessagesRequest (issue #26620). The parameter was silently dropped. The model kept thinking. I couldn't stop it.
Act III: The Proxy Spiral
Faced with three unfixed bugs and LiteLLM giving me weird errors, and egged on by yet more frontier LLMs, I decided to hand-roll a custom proxy: an 88-line Bun service on port 8073 that sat between Claude Code and SGLang.
It started innocently enough: hoist the system messages, pass everything else through. That fixed Bug #1.
Then I added thinking: { type: "disabled" } to every outgoing request. Bug #3 didn't care — SGLang's Pydantic dropped it. No-op.
So I tried filtering the SSE response. I used a TransformStream to track content block types by index and drop rogue text_delta events on tool_use blocks. Bug #2 went away. But now I got JSON Parse error: Unexpected EOF — the TransformStream + pipeThrough combination in Bun was losing buffered data on stream close.
I replaced it with an explicit ReadableStream using a start() reader loop. Still EOF errors under load.
Then a pull()-based line buffer with backpressure awareness. The same strategy LiteLLM uses internally. Partial lines stayed in the buffer between pulls. Only complete lines emitted.
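The line-buffer idea looks roughly like this. It is a Python sketch rather than the Bun code from the proxy, and the class name is hypothetical. It holds back any partial line until its newline arrives, and has an explicit flush for whatever is left when the stream closes:

```python
class SSELineBuffer:
    """Accumulates raw chunks and releases only complete lines."""

    def __init__(self) -> None:
        self._pending = b""  # bytes after the last newline seen so far

    def feed(self, chunk: bytes) -> list[bytes]:
        """Add a chunk; return the lines completed by it (without newlines)."""
        data = self._pending + chunk
        # Everything before the final b"\n" is complete; the tail is kept.
        *complete, self._pending = data.split(b"\n")
        return [line.rstrip(b"\r") for line in complete]

    def flush(self) -> list[bytes]:
        """Call once at end of stream so a trailing partial line is not lost."""
        tail, self._pending = self._pending, b""
        return [tail] if tail else []
```

An empty line in the output marks the end of one SSE event, so a consumer can group lines into events by splitting on those.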
By this point I had six commits, six iterations, and a proxy that was arguably more complex than the bugs it was working around. Each fix exposed a new problem. Piping SSE through Bun with line buffering is fundamentally lossy.
Lesson #2: When you're working around upstream bugs with a proxy, you're not fixing problems — you're moving them to a different layer where they're harder to debug.
Act IV: The Obvious Answer
Then sometime this morning, it hit me. Why use a proxy when I could directly patch SGLang? There were some PRs on these very issues that Alice had uncovered during the research phase of the project - just not fully merged to main yet. So why not just patch them using the PRs? I did faintly think of it earlier but had unconsciously blocked out that option in my mind because I was serving SGLang via Docker, which I thought meant I had to rebuild the entire image every time I wanted to test it in the patching process.
So I asked Alice - what is a good way to make sure I get the patched files into the docker container every time it starts up?
Her answer was startlingly simple (and from a local model too!): just mount the fixes into the container.
The fix was embarrassingly simple:
volumes:
  # PR #26773: system message hoist + thinking_delta types
  - ./sglang-patched-protocol.py:/sgl-workspace/sglang/python/sglang/srt/entrypoints/anthropic/protocol.py:ro
  # PR #24294: content_block_type tracking + reasoning_content
  - ./sglang-patched-serving.py:/sgl-workspace/sglang/python/sglang/srt/entrypoints/anthropic/serving.py:ro
  # Chat template with thinking disabled
  - ./chat_template_no_thinking.jinja:/models/chat_template_no_thinking.jinja:ro
Two Python files. One Jinja template. Read-only mounts. No proxy, no SSE transformation, no stream manipulation.
The thinking disable? A chat template override setting {%- set enable_thinking = false %} at the top. SGLang's per-request thinking parameter was broken, but the chat template isn't — it's how the model decides what to generate in the first place.
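One way to produce such an override file is to prepend the line to a copy of the model's stock template. The helper below is a hypothetical sketch. Whether a top-level set is enough depends on how the particular template reads enable_thinking, so treat it as a starting point and check the rendered prompt:

```python
OVERRIDE_LINE = "{%- set enable_thinking = false %}"


def force_thinking_off(template_text: str) -> str:
    """Return the template with the override as its first line.

    Idempotent: a template that already starts with the override is
    returned unchanged.
    """
    if template_text.lstrip().startswith(OVERRIDE_LINE):
        return template_text
    return OVERRIDE_LINE + "\n" + template_text


# Example usage (file names are placeholders):
#   from pathlib import Path
#   original = Path("chat_template.jinja").read_text(encoding="utf-8")
#   Path("chat_template_no_thinking.jinja").write_text(
#       force_thinking_off(original), encoding="utf-8")
```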
Then I archived the proxy, removed the systemd service, masked it so it can't come back, and pointed Claude Code directly at :8072.
The Serving Engine Comparison
Here's how the three serving engines compared for my use case — Claude Code with Qwen3.6-27B on a single RTX 4090 48GB:
| | llama.cpp | vLLM | SGLang (patched) |
|---|---|---|---|
| Output speed | 52 tok/s (steady) | 40–70 tok/s | 41 tok/s mean, 46 p50, 70 p99 |
| Prefill speed | 600–1000 tok/s | 1000+ tok/s | 1500–3700 tok/s (radix-cached: instant) |
| Speculative acceptance | 83% (MTP, n=4) | 56.8% (MTP, n=4, pos 3-4 wasteful) | 72% mean, 3.16 avg accept len |
| Anthropic API | Via proxy | First-class /v1/messages | First-class /v1/messages |
| Tool calling | Solid | Solid | Solid (after patch) |
| Multi-request | Stable | Collapses to 5–7 tok/s | Stable |
| Proxy needed? | Yes (for API) | No | No (after patch) |
| Setup friction | Low | Low | Medium (patches) |
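To get a feel for what the prefill and decode rows mean for a single request, here is a back-of-the-envelope estimate. The speeds are taken from the table (low end of each prefill range, no cache hits); the prompt and output sizes are hypothetical, and the model ignores network overhead and run-to-run variation:

```python
def response_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tok_s: float,
    decode_tok_s: float,
) -> float:
    """Time to read the prompt plus time to generate the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


PROMPT_TOKENS = 30_000  # hypothetical uncached Claude Code context
OUTPUT_TOKENS = 300     # hypothetical reply length

# (prefill tok/s, decode tok/s) from the comparison table
ENGINES = {
    "llama.cpp": (600.0, 52.0),
    "SGLang (patched)": (1500.0, 41.0),
}

for name, (prefill, decode) in ENGINES.items():
    total = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill, decode)
    print(f"{name}: {total:.1f} s")
```

With these inputs the prefill term dominates: the engine with the slower decode speed still finishes first because it reads the prompt so much faster.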
llama.cpp
The safe choice. 83% MTP acceptance means most speculative tokens stick. No proxy issues because there's no Anthropic API layer to break. But prefill is slower than PagedAttention engines, and you don't get SGLang's radix cache for shared prefixes across conversations.
vLLM
Beautiful on paper. PagedAttention gives 1000+ tok/s prefill. But MTP is finicky — position 1 accepts 76%, position 4 accepts 41%. Under concurrent requests (Claude Code subagents), throughput collapsed from ~60 tok/s to ~6 tok/s. KV cache pressure amplified the MTP overhead. Also, the vLLM nightly builds needed for MTP are — nightlies.
SGLang (patched)
Once the patches are in, it just works. Decode throughput averages 41 tok/s (median 46, peaks at 70) — close to llama.cpp's 52 tok/s steady state but with much faster prefill (1500–3700 tok/s on uncached context, instant on radix-cached prefixes). The speculative decoder accepts 72% of draft tokens at ~3.16 tokens per step — better than vLLM's 57% but without the position decay. Radix cache is great for Claude Code sessions where system prompts and early context share prefixes across requests. Multi-request stability is better than vLLM.
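The acceptance figures can be related with a little arithmetic. This sketch assumes that "accept length" counts the accepted draft tokens plus the one token the target model always contributes per verification step, and that the acceptance rate is averaged over all draft positions. Both are assumptions about how the numbers were measured, not something the engines' logs were checked against here:

```python
def tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification step: accepted drafts
    plus one token from the target model (assumed definition)."""
    return 1.0 + acceptance_rate * draft_tokens


def implied_draft_tokens(accept_len: float, acceptance_rate: float) -> float:
    """Invert the formula above to see how many drafts per step it implies."""
    return (accept_len - 1.0) / acceptance_rate


# vLLM figures from the table: 56.8% mean acceptance with n=4 drafts.
print(f"vLLM tokens/step: {tokens_per_step(0.568, 4):.2f}")

# SGLang figures from the table: 72% mean acceptance, 3.16 avg accept length.
print(f"SGLang implied drafts/step: {implied_draft_tokens(3.16, 0.72):.2f}")
```

Under these assumptions the SGLang numbers are consistent with about three draft tokens per step, which would mean the two engines were not necessarily speculating at the same depth.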
What I'd Do Differently
Check for upstream PRs AND consider implementing them, before building workarounds. PRs #26773 and #24294 existed the whole time. I should have patched the source instead of building a proxy.
Volume mounts > proxies for Docker. If you're running a containerized service and the fix is a code change, just mount the fixed files over the originals. The container then reads the patched file from the mount instead of the copy baked into the image.
Chat templates are more powerful than API params. The thinking parameter was broken, but the chat template's enable_thinking variable worked perfectly. Sometimes the lowest level of the stack is the most reliable.
LiteLLM is fine for multi-model routing, terrible for performance-critical paths. It silently drops parameters, transforms payloads, and adds latency. For a single-model setup, go direct.
The TL;DR
I spent a week building and iterating on a Bun proxy to work around three SGLang bugs. The solution was mounting two patched Python files and a chat template into the Docker container. The proxy is now archived. SGLang serves Claude Code directly on port 8072. It works.
The moral: before you build infrastructure around a bug, check if you can just fix the bug.
Coda: The Patch That Ate a Nightly Build (June 21, 2026)
I said the blog post was done. I was wrong. The day after I published, the input_json_delta error was still showing up in logs. Non-fatal — Claude Code's retry logic absorbed it every time — but it nagged at me. I knew the root cause: SGLang's Anthropic adapter was misreading Qwen3's inter-tool whitespace token as a signal to close an open tool_use block, then the Anthropic SDK would reject the next JSON chunk because the block had been prematurely terminated. PR #25876, merged June 12, claimed to fix exactly this. But was it in my v0.5.13 image?
It was not.
So I did what any reasonable person would do at this point. I consulted an AI to help me debug the AI that was running the AI. The analysis came back: two separate PRs, two separate bugs, two separate timelines. PR #25876 (the streaming fix) was merged June 12 and was in the June 21 nightly. PR #26773 (the mid-array system message fix) was merged June 21 at 04:34 UTC — after that morning's nightly build at 01:47 UTC. So no single image had both fixes. The June 22 nightly would. But that was tomorrow, and I am constitutionally incapable of waiting for tomorrow.
The plan: pull the June 21 nightly, extract its vanilla serving.py, hot-patch only the system message fix, leave serving.py unpatched to let #25876 do its work natively, and test. This is where it got interesting.
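Extracting a vanilla file from an image without starting the server can be done with docker create, docker cp and docker rm. Here is a small sketch of that step. The image tag and file path are the ones used elsewhere in this post, while the temporary container name and the function names are made up for illustration:

```python
import subprocess

IMAGE = "lmsysorg/sglang:nightly-dev-cu13-20260620-871ed0dc"
SERVING_PATH = (
    "/sgl-workspace/sglang/python/sglang/srt/entrypoints/anthropic/serving.py"
)


def extraction_commands(
    image: str,
    path_in_image: str,
    dest: str,
    container_name: str = "sglang-extract",  # hypothetical temporary name
) -> list[list[str]]:
    """Build the three docker commands: create a stopped container,
    copy the file out of it, then remove the container."""
    return [
        ["docker", "create", "--name", container_name, image],
        ["docker", "cp", f"{container_name}:{path_in_image}", dest],
        ["docker", "rm", container_name],
    ]


def extract_file(image: str, path_in_image: str, dest: str) -> None:
    """Run the commands, always removing the temporary container."""
    create, copy, remove = extraction_commands(image, path_in_image, dest)
    subprocess.run(create, check=True, stdout=subprocess.DEVNULL)
    try:
        subprocess.run(copy, check=True)
    finally:
        subprocess.run(remove, check=True, stdout=subprocess.DEVNULL)


# Example: extract_file(IMAGE, SERVING_PATH, "./serving-vanilla.py")
```

Diffing the extracted file against the patched copy is a quick way to see whether a fix has already landed in the image.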
The nightly booted. Then it immediately threw a new error — Anthropic thinking is not supported for models without a reasoning parser — because the nightly's vanilla serving.py now enforces capability declarations strictly and my requests were arriving with a thinking beta header. My old patched serving.py had been silently absorbing this. Another layer of the onion.
Fixed the reasoning parser flag. Booted again. New error: System message must be at the beginning — this time not from Pydantic validation, but from deep inside apply_chat_template(), where the Qwen3 Jinja template itself enforces the constraint. My protocol.py patch had successfully let the mid-array system messages through validation, but then handed them unsanitised to the tokenizer, which promptly threw them on the floor.
So the fix wasn't just accepting "system" at the schema layer. It also had to hoist those messages before they hit the template. That's what #26773 actually does — the full upstream fix operates at two layers. My partial port only covered one.
The agent — Qwen3-27B, 27 billion parameters, running hot on a 4090, no cloud, no subscription — pulled PR #26773's diff, read what the upstream fix actually did at both layers, extracted the vanilla serving.py from the nightly image, applied the minimal correct fix, and had the container back up inside ten minutes. Error gone. Both errors gone.
Here's the thing that still makes me shake my head: the entire debugging session — the PR archaeology, the traceback analysis, the diff reading, the patch writing, the container restart — was largely orchestrated by the same model whose inference server was broken (I used llama.cpp to serve it). A 27B local model, diagnosing and patching its own serving infrastructure, on 100% local hardware, albeit with a bit of ChatGPT style help.
The lesson from the original post was volume mounts beat proxies. The lesson from the coda is something weirder: the model you're trying to fix is good enough to fix itself. You just have to give it the logs, the diff, and some high level guidance from free frontier chat bots.
That's the stack now. Two patched Python files, a Jinja template, and a 27B model that knows where its own bodies are buried. Until the June 22 nightly drops and we get to throw the patches away and start over. Or not.
Docker compose:
services:
  sglang-qwen36:
    # Nightly 20260620 includes PR #25876 (input_json block fix); testing natively.
    # Only protocol.py is hotpatched (mid-conversation system role fix, PR #26773).
    # serving.py hotpatch commented out to test #25876 from the nightly itself.
    image: lmsysorg/sglang:nightly-dev-cu13-20260620-871ed0dc
    container_name: sglang-qwen36
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_HOME=/models/hf_cache
      - SGLANG_ENABLE_SPEC_V2=1
    volumes:
      - /(local path)/models:/models
      # Hotpatch: PR #26773 (mid-conversation system role fix)
      - /(local path)/sglang-docker/sglang-patched-protocol.py:/sgl-workspace/sglang/python/sglang/srt/entrypoints/anthropic/protocol.py:ro
      # serving.py hotpatch disabled: test PR #25876 natively from nightly
      # - /(local path)/sglang-patched-serving.py:/sgl-workspace/sglang/python/sglang/srt/entrypoints/anthropic/serving.py:ro
      - /(local path)/sglang-docker/chat_template_no_thinking.jinja:/models/chat_template_no_thinking.jinja:ro
    ports:
      - "8072:8072"
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    command: >
      python3 -m sglang.launch_server
      --model-path /models/Qwen3.6-27B-FP8
      --tp-size 1
      --mem-fraction-static 0.78
      --context-length 256000
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8072/health" ]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    # Quoted so YAML reads it as the string "no" rather than a boolean.
    restart: "no"
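If you script around the container, it can be handy to wait for the same /health endpoint the healthcheck uses before pointing Claude Code at the server. A minimal sketch using only the standard library follows; the function name and default timings are arbitrary choices, not anything SGLang prescribes:

```python
import time
import urllib.error
import urllib.request


def wait_for_health(
    url: str, timeout_s: float = 120.0, interval_s: float = 5.0
) -> bool:
    """Poll url until it answers HTTP 200; return False once timeout_s passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet (connection refused, timeout, HTTP error status)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example: wait_for_health("http://localhost:8072/health")
```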
This post documents lessons from running Qwen3.6-27B on a consumer RTX 4090 through the SGLang nightly build lmsysorg/sglang:nightly-dev-cu13-20260620-871ed0dc with Claude Code 2.1.181.