Notes on Qwen3.5 vs Gemma4 for Local Agentic Coding
Gemma4 was released by Google on April 2nd earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes from benchmarking the two model families. I ran two types of tests:
- Standard llama-bench benchmarks for raw prefill and generation speed
- Single-shot agentic coding tasks using Open Code to see how these models actually perform on real multi-step coding workflows
Quick Summary:
| Model | Gen tok/s | Turn(correct) | Code Quality | VRAM | Max Context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |
Max Context is the largest context size that fits in VRAM with acceptable generation speed.
- MoE models are 3x faster but both dense models got the complex task right on the first try.
- My pick is Qwen3.5-27B, which is still the best model for local agentic coding on a 24 GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code and fits comfortably on a 4090.
Models
Below are the models and the quantizations I used for benchmarking:
| Model | Architecture | Quant | Model size | Total Params | Active Params |
|---|---|---|---|---|---|
| Qwen3.5-27B | Dense | Q4_K_XL | 16.40 GiB | 26.90 B | 26.90 B |
| Qwen3.5-35B-A3B | MoE | Q4_K_XL | 20.70 GiB | 34.66 B | ~3 B |
| Gemma4-26B-A4B | MoE | Q4_K_XL | 15.95 GiB | 25.23 B | ~4 B |
| Gemma4-31B | Dense | Q4_K_XL | 17.46 GiB | 30.70 B | 30.70 B |
Standard Benchmarks with llama-bench
llama.cpp has a llama-bench utility that runs standard prefill and generation (decode) benchmarks. It is a quick way to get raw throughput numbers.
This is the command I used to run the benchmarks:
./llama.cpp/llama-bench \
-m $MODEL_PATH \
-ctk q8_0 -ctv q8_0 -fa 1 -b 2048 -ub 512 \
-p 512,2048,4096,8192,16384,32768,65336 \
-n 128 -r 3 -o md

| Flag | Value | Meaning |
|---|---|---|
| `-ctk` | `q8_0` | KV cache keys quantized to 8-bit |
| `-ctv` | `q8_0` | KV cache values quantized to 8-bit |
| `-fa` | `1` | Flash Attention enabled |
| `-b` | `2048` | Batch size (max tokens processed per batch) |
| `-ub` | `512` | Micro-batch size (tokens processed per CUDA kernel call) |
| `-p` | `512,2048,...,65336` | Prefill token counts to sweep |
| `-n` | `128` | Decode (generation) tokens per run |
| `-r` | `3` | Repeat each test 3 times and report the mean |
| `-o` | `md` | Output as a markdown table |
Prefill Speed
| Context | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Gemma4-31B |
|---|---|---|---|---|
| 512 | 3,037 | 6,666 | 8,597 | 3,100 |
| 2K | 3,069 | 6,674 | 8,710 | 2,992 |
| 4K | 3,025 | 6,633 | 8,733 | 2,925 |
| 8K | 2,957 | 6,524 | 8,443 | 2,811 |
| 16K | 2,841 | 6,308 | 7,961 | 2,614 |
| 32K | 2,632 | 5,920 | 7,097 | 2,304 |
| 65K | 2,290 | 5,273 | 5,917 | 1,869 |
Generation Speed (tg128)
| Model | Architecture | Generation (tokens/s) |
|---|---|---|
| Qwen3.5-35B-A3B | MoE | 165.84 |
| Gemma4-26B-A4B | MoE | 164.38 |
| Qwen3.5-27B | Dense | 45.88 |
| Gemma4-31B | Dense | 44.42 |
Notes on llama-bench Results
- As expected, the MoE models dominate both prefill and generation speed.
- Generation speed for the two MoE models is nearly identical (~165 tok/s), and the same holds for the two dense models (~45 tok/s). This is expected: decode is memory-bandwidth bound, so models with similar active parameter counts at the same quantization land at similar speeds.
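As a sanity check on the bandwidth-bound claim, here is a rough back-of-the-envelope calculation. This is my own sketch: the ~1008 GB/s 4090 bandwidth figure and the one-pass-over-active-weights assumption are simplifications, not measurements from this benchmark.

```python
# Crude decode-speed ceiling for a memory-bandwidth-bound model: every
# generated token streams the active weights from VRAM roughly once.

BANDWIDTH_GBPS = 1008  # approximate RTX 4090 memory bandwidth, GB/s

def decode_ceiling_tok_s(active_weight_bytes):
    """Upper bound on tokens/s: bandwidth divided by bytes read per token."""
    return BANDWIDTH_GBPS * 1e9 / active_weight_bytes

# Dense Qwen3.5-27B: the whole 16.40 GiB Q4 file is active for every token.
dense = decode_ceiling_tok_s(16.40 * 1024**3)

# MoE Qwen3.5-35B-A3B: only ~3B of 34.66B params are active, so roughly
# 3/34.66 of the 20.70 GiB file streams per token (this ignores shared
# layers and routing overhead, so the ceiling is very optimistic).
moe = decode_ceiling_tok_s(20.70 * 1024**3 * 3 / 34.66)

print(f"dense ceiling ~{dense:.0f} tok/s (measured ~46)")
print(f"moe ceiling   ~{moe:.0f} tok/s (measured ~166)")
```

The dense measurement (45.88 tok/s) sits close to its ~57 tok/s ceiling, consistent with bandwidth-bound decode; the MoE models land far below their crude ceiling because attention, routing and kernel overheads dominate once the weight traffic gets this small.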
Agentic Coding: One-Prompt Test
The llama-bench numbers tell you how fast tokens move (roughly the upper limit you can expect), but they say nothing about how a model actually performs at reasoning, tool calls and writing code, or how fast it feels inside a coding assistant.
To test that, I ran a simple practical experiment: give the model one prompt and see if it can figure out the rest on its own. No hand-holding, no multi-turn guidance. The idea is to see how the model performs in exactly that scenario.
This is not a formal benchmark. It is two prompts at different complexity levels to see how well each model handles multi-step workflows. That is how most of us work anyway: describe what we want in a single prompt and let the model do its thing.
Setup
I used Open Code as the agentic coding frontend because I find it easier to set up with a local llama-server backend. I also configured Context7 as a skills + MCP server so the models could fetch up-to-date library documentation and API docs during their runs.
llama-server was configured with a q8_0 KV cache, and the context size varied per model based on VRAM constraints to maximize generation speed (full config in Appendix A).
- Speed metrics came from llama-server's `/metrics` endpoint.
- Token usage breakdowns were estimated using the opencode-tokenscope plugin.
I also made sure to restart llama-server between model runs so the counters would not carry over.
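Pulling throughput out of the Prometheus text is only a few lines of Python. This is a sketch: the `llamacpp:*` metric names are what a recent llama.cpp build exposes and may differ across versions.

```python
# Minimal parser for llama-server's Prometheus-format /metrics output.
# Metric names below are an assumption based on a recent llama.cpp build.

def parse_metrics(text):
    """Parse 'name value' lines, skipping comment lines, into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

def throughput(metrics):
    """Return (prefill tok/s, generation tok/s) from cumulative counters."""
    prefill = metrics["llamacpp:prompt_tokens_total"] / metrics["llamacpp:prompt_seconds_total"]
    gen = metrics["llamacpp:tokens_predicted_total"] / metrics["llamacpp:tokens_predicted_seconds_total"]
    return prefill, gen

sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
llamacpp:prompt_tokens_total 17847
llamacpp:prompt_seconds_total 4.11
llamacpp:tokens_predicted_total 1623
llamacpp:tokens_predicted_seconds_total 11.98
"""
pf, tg = throughput(parse_metrics(sample))
print(f"prefill {pf:.0f} tok/s, generation {tg:.1f} tok/s")
```

The sample counters are the Gemma4-26B-A4B Prompt 1 numbers from the results below; dividing them approximately reproduces the reported prefill and generation speeds.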
Prompt 1: Simple (httpx + pytest)
use context7 to look up the httpx library docs. then write me a python script
that fetches the post from https://jsonplaceholder.typicode.com/posts/1 and
prints the title. also write a pytest test for it, no mocks, hit the real API.
use uv run to run everything so we don't install anything in the current
environment. run the test and make sure it passes.
This tests the basics: can the model call Context7 to look up docs, write a simple script and a real integration test (no mocks), use uv run for execution and dependency management, and actually run everything to verify it works.
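For reference, the script Prompt 1 asks for might look roughly like the sketch below. This is my own illustration, not any model's output; it assumes httpx is supplied at run time (e.g. `uv run --with httpx`).

```python
# Sketch of the Prompt 1 task: fetch a post and print its title. The JSON
# handling is a separate pure function so it can be tested without the network.
POST_URL = "https://jsonplaceholder.typicode.com/posts/1"

def extract_title(payload):
    """Pull the title field out of a JSONPlaceholder post payload."""
    return payload["title"]

def fetch_title(url=POST_URL):
    """Fetch the post and return its title."""
    import httpx  # third-party; assumed supplied via `uv run --with httpx`
    resp = httpx.get(url, timeout=10.0)
    resp.raise_for_status()
    return extract_title(resp.json())
```

The matching no-mock pytest test is then essentially `assert fetch_title()` against the live API.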
Prompt 2: Comprehensive (Image Gen API calls + TDD)
use context7 to search for the latest google gemini image generation API docs.
I want you to write a python script that uses the google-genai SDK to generate
images using the gemini-3.1-flash-preview model (nano banana). use TDD
red-green methodology, write failing tests first then make them pass. do not
use any mock tests. use uv run to run everything so we don't install anything
in the current environment. test the script and if it works fine and generates
an image, then use this script to run image generation on the five prompts
given in prompts.json. save the images to an images folder, make sure the
folder exists, if it doesn't then create it.
This is a slightly heavier multi-step workflow. The model has to:
- Look up the Gemini image generation API docs via Context7
- Write a Python script using the google-genai SDK
- Follow TDD red-green methodology (write failing tests first, then make them pass)
- Use real API calls, no mocks
- Use `uv run` for dependencies
- Read `prompts.json` and generate images for all five prompts
- Handle file I/O (create the output directory, save the images)
The idea is to see how well the model executes this multi-step workflow correctly.
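A skeleton of what the models were expected to produce might look like this. It is my own sketch, not any model's output; the google-genai calls follow that SDK's documented client API but should be treated as an assumption, and the import is kept function-local so the file helpers run standalone.

```python
# Skeleton of the Prompt 2 workflow. File helpers are pure so they can run
# offline; the image generation call is isolated in one function.
import json
from pathlib import Path

def load_prompts(path):
    """Read the list of prompt strings from prompts.json."""
    return json.loads(Path(path).read_text())

def ensure_images_dir(path="images"):
    """Create the output folder if it does not exist."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out

def generate_image(prompt, out_file, model="gemini-3.1-flash-preview"):
    """Generate one image and write its bytes to out_file.
    The SDK usage below follows the google-genai docs but is an assumption."""
    from google import genai  # third-party; assumed supplied via `uv run`
    client = genai.Client()   # picks up the API key from the environment
    response = client.models.generate_content(model=model, contents=prompt)
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:  # image bytes come back inline
            Path(out_file).write_bytes(part.inline_data.data)
```

A driver would then loop `for i, p in enumerate(load_prompts("prompts.json"))` and call `generate_image(p, ensure_images_dir() / f"image_{i}.png")`.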
Results: Gemma4-26B-A4B
VRAM: ~21 GB | Context: 256K tokens
| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 4,338 | 4,560 |
| Generation tok/s | 135.5 | 134.4 |
| Total prompt tokens processed | 17,847 | 23,204 |
| Total tokens generated | 1,623 | 3,435 |
| Prompt processing time | 4.11s | 5.09s |
| Generation time | 11.98s | 125.56s |
| API calls | 10 | 13 |
| Tool calls | 7 | 11 |
| Correct on turn | 1st | 3rd |
API calls is the number of calls Open Code makes to the LLM over the course of the session.
- At ~135 tok/s generation and 4.3K+ tok/s prefill, it is the fastest model of the four.
- It is also by far the most concise model, judged by generated token counts.
- But it needed 3 attempts on Prompt 2. Despite being the fastest and most concise, it struggled with the multi-step instructions.
Results: Gemma4-31B
VRAM: ~24 GB | Context: 65K tokens
Note: I had to drop the context size from 128K to 65K to maintain reasonable generation speed. At 128K, generation degrades to around ~10 tok/s; the best speed is only achievable at around 65K tokens.
| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 1,466 | 1,357 |
| Generation tok/s | 37.7 | 35.2 |
| Total prompt tokens processed | 16,618 | 25,070 |
| Total tokens generated | 2,903 | 5,968 |
| Prompt processing time | 11.34s | 18.48s |
| Generation time | 77.07s | 169.53s |
| API calls | 10 | 16 |
| Tool calls | 8 | 14 |
| Correct on turn | 1st | 1st |
- Got Prompt 2 correct on the first turn. The dense model's reliability on the complex task was noticeably better.
- It generated nearly twice as many tokens as the MoE variant (2,903 vs 1,623 on Prompt 1), including 1,548 reasoning tokens.
- The 65K context cap is a real practical limitation; it is unclear whether the speed degradation at higher context will be fixed in future builds.
Results: Qwen3.5-35B-A3B
VRAM: ~23 GB | Context: 200K tokens
| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 3,179 | 3,056 |
| Generation tok/s | 136.7 | 132.3 |
| Total prompt tokens processed | 16,145 | 92,375 |
| Total tokens generated | 7,564 | 32,904 |
| Prompt processing time | 5.08s | 30.23s |
| Generation time | 55.32s | 248.75s |
| API calls | 13 | 30 |
| Tool calls | 11 | 28 |
| Correct on turn | 1st | 2nd |
- Generation speed is essentially identical to Gemma4-26B-A4B.
- The model is extremely verbose: ~7.5K and ~33K generated tokens on Prompts 1 and 2.
- Prompt 2 was the most intensive run of the entire benchmark: 30 API calls and 28 tool calls.
- It got Prompt 2 correct on the 2nd turn, one better than Gemma4-26B-A4B.
- The 248.7s generation time on Prompt 2 is a direct result of that volume of API and tool calls.
Results: Qwen3.5-27B
VRAM: ~21 GB | Context: 130K tokens
| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 2,474 | 2,188 |
| Generation tok/s | 44.9 | 44.6 |
| Total prompt tokens processed | 15,043 | 24,385 |
| Total tokens generated | 2,867 | 11,824 |
| Prompt processing time | 6.08s | 11.14s |
| Generation time | 63.91s | 265.00s |
| API calls | 9 | 18 |
| Tool calls | 7 | 14 |
| Correct on turn | 1st | 1st |
- Got Prompt 2 correct on the first turn, same as Gemma4-31B.
- Most efficient session on Prompt 1, with the fewest API calls (9) and tool calls (7).
- Generation at 44.9 tok/s is slower than the MoE models but faster than Gemma4-31B (37.7).
- 130K context fits comfortably in VRAM. This is a practical sweet spot with a decent context size.
Comparing the Performance
Summary Tables
Speed
| Model | Prefill tok/s (P1) | Prefill tok/s (P2) | Gen tok/s (P1) | Gen tok/s (P2) |
|---|---|---|---|---|
| Gemma4-26B-A4B | 4,338 | 4,560 | 135.5 | 134.4 |
| Qwen3.5-35B-A3B | 3,179 | 3,056 | 136.7 | 132.3 |
| Gemma4-31B | 1,466 | 1,357 | 37.7 | 35.2 |
| Qwen3.5-27B | 2,474 | 2,188 | 44.9 | 44.6 |
Efficiency and Completion
| Model | Tokens Gen (P1) | Tokens Gen (P2) | API Calls (P1) | API Calls (P2) | Tool Calls (P2) | Correct Turn (P2) |
|---|---|---|---|---|---|---|
| Gemma4-26B-A4B | 1,623 | 3,435 | 10 | 13 | 11 | 3rd |
| Qwen3.5-35B-A3B | 7,564 | 32,904 | 13 | 30 | 28 | 2nd |
| Gemma4-31B | 2,903 | 5,968 | 10 | 16 | 14 | 1st |
| Qwen3.5-27B | 2,867 | 11,824 | 9 | 18 | 14 | 1st |
Hardware Fit (RTX 4090 24 GB)
| Model | VRAM Usage | Max Context |
|---|---|---|
| Gemma4-26B-A4B | ~21 GB | 256,000 |
| Qwen3.5-35B-A3B | ~23 GB | 200,000 |
| Qwen3.5-27B | ~21 GB | 130,672 |
| Gemma4-31B | ~24 GB | 65,336 |
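The Max Context column is largely a function of weight size plus KV cache growth. A rough KV cache estimator, as a sketch: the 48-layer/8-KV-head/128-dim shape below is a placeholder, not any of these models' real config, and q8_0 is approximated as 1 byte per element.

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=1.0):
    """Approximate KV cache size in GiB: keys + values stored for every
    layer, KV head, and position. q8_0 is roughly 1 byte per element."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # 2 = keys + values
    return elems * bytes_per_elem / 1024**3

# Hypothetical dense-model shape (NOT any of these models' real config):
# 48 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (65_336, 130_672):
    print(f"{ctx:>7} ctx -> ~{kv_cache_gib(ctx, 48, 8, 128):.1f} GiB of KV cache")
```

With a 16-17 GiB weight file on a 24 GB card, those few GiB of KV cache are exactly what separates a 130K context from a 65K one.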
Code Quality
I looked at the working code each model produced for Prompt 2 (the Nano Banana image generation task) and used Opus to compare them on structure, error handling, TDD compliance, API correctness and overall cleanliness.
| Aspect | Gemma4-26B-A4B | Gemma4-31B | Qwen3.5-35B-A3B | Qwen3.5-27B |
|---|---|---|---|---|
| Structure | 2 files, basic separation | 3 files, clean separation | Class-based with helpers, cleanest design | 3 files + dead main.py stub |
| Error handling | Minimal, no API error handling | Poor, no try/except around API | Adequate but no batch error recovery | Weak, silent failures |
| TDD | Placeholder test, no real TDD | One integration test, superficial | Integration tests only, claimed but not real | Integration tests only, claimed but not real |
| Cleanliness | Acceptable, concise | Good, readable, concise | Good structure but unused base64 import |
Good docstrings, type hints, pathlib usage |
| Critical issues | Broken summary, no uv run setup |
New client per API call | Hardcoded API key in tests, wrong model | Dead main.py, new client per call |
- None of the models truly followed TDD. All of them claimed red-green methodology in their summaries but wrote integration tests that hit the real API. No model used mocks or wrote genuinely failing tests first.
- Qwen3.5-27B produced the most correct code. It got the model name right, used type hints and docstrings, used pathlib properly and had the cleanest overall implementation. Its issues (a dead `main.py` stub, a new client created per call) are minor compared to the others.
- Qwen3.5-35B-A3B had the best code structure with a proper class-based design, but committed a security sin by hardcoding an API key in the test file and used the wrong model name entirely. For a task that specifically asked for `gemini-3.1-flash-preview`, using `gemini-2.5-flash-image` is a correctness failure.
- Gemma4-31B was clean and concise but shallow: minimal, readable code with no error handling and superficial testing.
- Gemma4-26B-A4B was the weakest: a missing critical API parameter, a broken summary file, and no `uv run` integration despite being asked for it. This lines up with it needing 3 attempts to get working code.
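For contrast, genuinely failing-first red-green TDD on one small slice of the task might look like the sketch below (my own illustration using stdlib unittest; `read_prompts` is a hypothetical helper, not from any model's output).

```python
# Red phase: this test class is written FIRST, while read_prompts does not
# exist yet; the initial run fails, which is the "red" no model ever showed.
import json
import tempfile
import unittest
from pathlib import Path

# Green phase: the minimal implementation, added only after watching the
# tests fail. (Both phases are shown in one file to keep the sketch short.)
def read_prompts(path):
    """Hypothetical helper: read a JSON list of prompt strings."""
    prompts = json.loads(Path(path).read_text())
    if not isinstance(prompts, list):
        raise ValueError("prompts.json must contain a list")
    return prompts

class TestReadPrompts(unittest.TestCase):
    def test_reads_five_prompts(self):
        with tempfile.TemporaryDirectory() as tmp:
            p = Path(tmp) / "prompts.json"
            p.write_text(json.dumps([f"prompt {i}" for i in range(5)]))
            self.assertEqual(len(read_prompts(p)), 5)

    def test_rejects_non_list(self):
        with tempfile.TemporaryDirectory() as tmp:
            p = Path(tmp) / "prompts.json"
            p.write_text(json.dumps({"not": "a list"}))
            with self.assertRaises(ValueError):
                read_prompts(p)
```

Run with `uv run python -m unittest`; the point is that the tests exist and fail before the implementation does, which none of the four models actually demonstrated.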
Takeaways
Speed and Efficiency
- Dense models were more reliable on the complex task. Both Qwen3.5-27B and Gemma4-31B got Prompt 2 right on the first turn. Both MoE models needed retries. Two data points is not a conclusion, but it is a pattern worth noting.
- MoE speed advantage is real but verbosity can eat it up. Both MoE models hit ~135 tok/s generation vs ~40-45 tok/s for dense. But Qwen3.5-35B-A3B generated 32,904 tokens on Prompt 2 which means 248 seconds of generation even at MoE speeds. Gemma4-26B-A4B was the only model that was both fast and concise.
- Gemma4-26B-A4B is the speed king. If you are doing high-volume simpler tasks where first-try reliability matters less, it is hard to beat.
Code Quality
- Qwen3.5-27B produced the most correct and cleanest code overall. Right model name, type hints, docstrings, pathlib usage. Its issues are minor compared to every other model.
- None of the models truly followed TDD. All claimed red-green methodology but wrote integration tests hitting the real API. No mocks, no genuinely failing tests first.
- Better structure does not mean better code. Qwen3.5-35B-A3B had the cleanest design (class-based) but hardcoded an API key and used the wrong model name. Structure alone is not enough.
Bottom Line
- Qwen3.5-27B feels like the best overall pick for agentic coding on a 4090.
- Reliable: got the complex task right on the first try
- 130K context is a practical sweet spot for long agentic sessions without maxing out the card
- 44.9 tok/s is slower than MoE but fast enough for interactive use
- Most efficient on the simple task (fewest API calls)
- Only uses ~21 GB VRAM, leaving headroom
- Produced the most correct and cleanest code of all four models
These are notes from a single benchmarking session with two prompts and my experience over the last 2 days. I am not claiming any of this is statistically rigorous.
Appendix
A: Hardware Fit and Server Config
llama-server Launch Config
Base config used for all models:
llama-server \
--model $MODEL_PATH \
--jinja \
--host 100.80.101.103 \
--port 8001 \
--parallel 1 \
--batch-size 2048 \
--ubatch-size 512 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--context-shift \
--metrics
Per-Model Overrides
Gemma4 models:
--ctx-size 256000 # MoE (26B-A4B)
--ctx-size 65336 # Dense (31B) - reduced due to VRAM constraints
--temp 1.0
--top-p 0.95
--top-k 64
--min-p 0.00
Qwen3.5 models:
--ctx-size 200000 # MoE (35B-A3B)
--ctx-size 130672 # Dense (27B)
--temp 0.6
--top-k 20
--chat-template-file $TEMPLATES_DIR/qwen35-chat-template-corrected.jinja
--chat-template-kwargs '{"enable_thinking":true}'
B: Installation
- llama.cpp Installation
sudo apt-get update
sudo apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev libssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && git pull origin master && cd ..
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j24 --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split llama-bench
cp llama.cpp/build/bin/llama-* llama.cpp
- Tokenscope: I used the opencode-tokenscope plugin to get per-session token breakdowns. You need to add `"plugin": ["@ramtinj95/opencode-tokenscope"]` to your `opencode.json`, then create a `/tokenscope` slash command in `~/.config/opencode/command/tokenscope.md`.
- llama-server /metrics: llama-server exposes a `/metrics` endpoint (enabled with the `--metrics` flag) that returns Prometheus-format counters.
Troubleshooting
- Qwen3.5-35B-A3B todowrite Parse Error: Qwen3.5-35B-A3B sometimes returned tool call arguments as a raw JSON string instead of a parsed object. This caused the `todowrite` tool to fail because Open Code expected `todos` to be an array, not a string containing an array. You can fix this with a small plugin at `~/.opencode/plugins/todo-fix-plugins.ts`:
// Coerce stringified todo arrays back into real arrays before the tool runs.
export const TodoFixPlugin = async (ctx) => {
  return {
    "tool.execute.before": async (input, output) => {
      // The model sometimes emits `todos` as a JSON string; parse it in place.
      if (input.tool === "todowrite" && typeof output.args?.todos === "string") {
        output.args.todos = JSON.parse(output.args.todos)
      }
    },
  }
}
- Gemma4-31B Context Size: I had to reduce the context to 65,336 tokens to maintain ~40 tok/s generation. You can push it higher, but generation speed degrades as context grows.
- Qwen3.5 Chat Template: The Qwen3.5 models needed a corrected Jinja chat template (`qwen35-chat-template-corrected.jinja`, https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96) to work properly with llama-server. The default template had issues with thinking mode.