Notes on Qwen3.5 vs Gemma4 for Local Agentic Coding

Comparing Qwen 3.5 and Gemma4 (dense and MoE) for local agentic coding on an RTX 4090 using llama-bench and one-prompt coding tasks with Open Code.
Published

April 5, 2026

Gemma4 was released by Google on April 2nd, earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes on benchmarking the two model families. I ran two types of tests: raw throughput benchmarks with llama-bench, and one-prompt agentic coding tasks with Open Code.

Quick Summary:

| Model | Gen tok/s | Correct on Turn | Code Quality | VRAM | Max Context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |

Max Context is the largest context size that fits in VRAM with acceptable generation speed.

Models

Below are the models and their quantization that I used for benchmarking:

| Model | Architecture | Quant | Model Size | Total Params | Active Params |
|---|---|---|---|---|---|
| Qwen3.5-27B | Dense | Q4_K_XL | 16.40 GiB | 26.90 B | 26.90 B |
| Qwen3.5-35B-A3B | MoE | Q4_K_XL | 20.70 GiB | 34.66 B | ~3 B |
| Gemma4-26B-A4B | MoE | Q4_K_XL | 15.95 GiB | 25.23 B | ~4 B |
| Gemma4-31B | Dense | Q4_K_XL | 17.46 GiB | 30.70 B | 30.70 B |

  • All four models were run on April 3rd, 2026, with thinking mode enabled.
  • I used the Unsloth GGUF versions of the models on llama.cpp.
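As a quick sanity check on these numbers, file size divided by parameter count gives the effective bits per weight, which should land around 5 for Q4_K_XL (4-bit weights plus scales and a few higher-precision tensors). A sketch of that arithmetic using the table's figures:

```python
# Effective bits per weight implied by the model table above.
# Sizes are in GiB (2**30 bytes); parameter counts in units of 1e9.
models = {
    "Qwen3.5-27B":     (16.40, 26.90e9),
    "Qwen3.5-35B-A3B": (20.70, 34.66e9),
    "Gemma4-26B-A4B":  (15.95, 25.23e9),
    "Gemma4-31B":      (17.46, 30.70e9),
}

for name, (size_gib, n_params) in models.items():
    bits_per_weight = size_gib * 2**30 * 8 / n_params
    print(f"{name}: {bits_per_weight:.2f} bits/weight")
```

All four come out between roughly 4.9 and 5.4 bits/weight, so the quoted sizes and parameter counts are consistent with the quantization.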

Standard Benchmarks with llama-bench

llama.cpp ships a llama-bench utility that runs standard prefill and generation (decode) benchmarks. It is a quick way to get raw throughput numbers.

This is the command I used to run the benchmarks:

./llama.cpp/llama-bench \
  -m $MODEL_PATH \
  -ctk q8_0 -ctv q8_0 -fa 1 -b 2048 -ub 512 \
  -p 512,2048,4096,8192,16384,32768,65336 \
  -n 128 -r 3 -o md
| Flag | Value | Meaning |
|---|---|---|
| -ctk | q8_0 | KV cache keys quantized to 8-bit |
| -ctv | q8_0 | KV cache values quantized to 8-bit |
| -fa | 1 | Flash Attention enabled |
| -b | 2048 | Batch size (max tokens processed per batch) |
| -ub | 512 | Micro-batch size (tokens processed per CUDA kernel call) |
| -p | 512,2048,...,65336 | Prefill token counts to sweep |
| -n | 128 | Decode (generation) tokens per run |
| -r | 3 | Repeat each test 3 times and report the mean |
| -o | md | Output as a markdown table |

Prefill Speed

All values are prefill speed in tokens/s.

| Context | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Gemma4-31B |
|---|---|---|---|---|
| 512 | 3,037 | 6,666 | 8,597 | 3,100 |
| 2K | 3,069 | 6,674 | 8,710 | 2,992 |
| 4K | 3,025 | 6,633 | 8,733 | 2,925 |
| 8K | 2,957 | 6,524 | 8,443 | 2,811 |
| 16K | 2,841 | 6,308 | 7,961 | 2,614 |
| 32K | 2,632 | 5,920 | 7,097 | 2,304 |
| 65K | 2,290 | 5,273 | 5,917 | 1,869 |
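One way to make the prefill numbers tangible: context size divided by prefill speed approximates the time spent processing the prompt before the first token appears. A rough sketch using the 32K row above:

```python
# Rough time-to-first-token at a fully-filled 32K context,
# using the 32K-row prefill speeds (tokens/s) from the table above.
prefill_32k = {
    "Qwen3.5-27B": 2632,
    "Qwen3.5-35B-A3B": 5920,
    "Gemma4-26B-A4B": 7097,
    "Gemma4-31B": 2304,
}

context = 32768  # tokens of prompt to process
for name, tok_s in prefill_32k.items():
    print(f"{name}: ~{context / tok_s:.1f}s to first token at 32K context")
```

So at a long context the dense models make you wait 12-14 seconds per turn just for prefill, while the MoE models stay around 5 seconds.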

Generation Speed (tg128)

| Model | Architecture | Generation (tokens/s) |
|---|---|---|
| Qwen3.5-35B-A3B | MoE | 165.84 |
| Gemma4-26B-A4B | MoE | 164.38 |
| Qwen3.5-27B | Dense | 45.88 |
| Gemma4-31B | Dense | 44.42 |

Notes on llama-bench Results

  • As expected, the MoE models dominate both prefill and generation speed.
  • Generation speed for the two MoE models is nearly identical (~165 tok/s), and the same holds for the two dense models (~45 tok/s): decode is memory-bandwidth bound, so models that read a similar number of bytes per token land at similar speeds.
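The bandwidth-bound claim can be checked with back-of-the-envelope math: each decoded token has to stream the active weights from VRAM, so tok/s is roughly bounded by bandwidth divided by bytes read per token. A sketch, assuming the RTX 4090's ~1,008 GB/s spec-sheet bandwidth; the MoE estimate is crude, because attention and shared tensors are read on every token regardless of expert routing:

```python
# Back-of-the-envelope decode ceiling: tok/s <= bandwidth / bytes read per token.
# Dense models read all weights per token; the MoE read is approximated here as
# active_params / total_params of the file size (a deliberate oversimplification).
BW_BYTES_PER_S = 1008e9  # RTX 4090 spec-sheet memory bandwidth (assumption)
GiB = 2**30

models = {
    # name: (file size in GiB, total params, active params)
    "Qwen3.5-27B":     (16.40, 26.90e9, 26.90e9),
    "Qwen3.5-35B-A3B": (20.70, 34.66e9, 3e9),
}

for name, (size_gib, total, active) in models.items():
    bytes_per_token = size_gib * GiB * (active / total)
    print(f"{name}: ceiling ~{BW_BYTES_PER_S / bytes_per_token:.0f} tok/s")
```

The dense ceiling (~57 tok/s) sits just above the measured 45.88 tok/s, which fits the bandwidth-bound picture; the MoE ceiling is far above the measured ~165 tok/s precisely because this approximation ignores the always-read shared weights.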

Agentic Coding: One-Prompt Test

The llama-bench numbers tell you how fast tokens move (roughly the upper limit we can expect), but they say nothing about how a model actually performs at reasoning, tool calls, and writing code, or how fast it feels inside a coding assistant.

To test that, I ran a simple practical test: give the model one prompt and see if it can figure out the rest on its own. No hand-holding, no multi-turn guidance. The idea is to see how the model performs in that scenario.

This is not a formal test. It is two prompts at different complexity levels to see how well the model handles multi-step workflows. This is how most of us work anyway: we describe what we want in a single prompt and let the model do its thing.

Setup

I used Open Code as the agentic coding frontend because I find it easier to set up with a local llama-server backend. I also configured Context7 as a skills + MCP server to let the models fetch up-to-date library documentation and API docs during their runs.

llama-server was configured with a q8_0 KV cache, and the context size varied per model based on VRAM constraints to maximize generation speed (full config in Appendix A).

  • Speed metrics came from llama-server’s /metrics endpoint.
  • Token usage breakdowns were estimated using the opencode-tokenscope plugin.

I also made sure to restart llama-server between model runs so the counters would not carry over.
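A minimal sketch of how those speed numbers can be pulled from /metrics. The endpoint returns Prometheus-style `name value` lines; the counter names below (llamacpp:prompt_tokens_total and friends) match what recent llama-server builds expose, but verify them against your build's actual output:

```python
import urllib.request

def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore anything that is not a plain numeric sample
    return metrics

def session_speeds(base_url: str = "http://127.0.0.1:8001") -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) averaged over the session."""
    with urllib.request.urlopen(f"{base_url}/metrics") as resp:
        m = parse_metrics(resp.read().decode())
    prefill = m["llamacpp:prompt_tokens_total"] / m["llamacpp:prompt_seconds_total"]
    gen = m["llamacpp:tokens_predicted_total"] / m["llamacpp:tokens_predicted_seconds_total"]
    return prefill, gen
```

Both ratios are cumulative since server start, which is exactly why the server was restarted between model runs.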

Prompt 1: Simple (httpx + pytest)

use context7 to look up the httpx library docs. then write me a python script
that fetches the post from https://jsonplaceholder.typicode.com/posts/1 and
prints the title. also write a pytest test for it, no mocks, hit the real API.
use uv run to run everything so we don't install anything in the current
environment. run the test and make sure it passes.

This tests the basics: can the model call Context7 to look up docs, write a simple script and a real integration test (no mocks), use uv run for execution and dependency management, and actually run everything to verify it works.
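For scale, a passing answer to Prompt 1 only needs a few lines. This is my own sketch of the expected shape, not any model's output; the URL and the title field come from the real JSONPlaceholder API, while the function names are mine:

```python
# fetch_title.py - sketch of the script Prompt 1 asks for (not model output)

def get_title(post: dict) -> str:
    """Extract the title field from a JSONPlaceholder post object."""
    return post["title"]

def fetch_post(post_id: int = 1) -> dict:
    """Fetch a post from the live API. httpx is imported lazily so the
    pure logic above stays dependency-free."""
    import httpx
    resp = httpx.get(f"https://jsonplaceholder.typicode.com/posts/{post_id}")
    resp.raise_for_status()
    return resp.json()

# Usage: print(get_title(fetch_post(1)))
# A matching integration test would assert get_title(fetch_post(1)) is a
# non-empty string, run with: uv run --with httpx --with pytest pytest
```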

Prompt 2: Comprehensive (Image Gen API calls + TDD)

use context7 to search for the latest google gemini image generation API docs.
I want you to write a python script that uses the google-genai SDK to generate
images using the gemini-3.1-flash-preview model (nano banana). use TDD
red-green methodology, write failing tests first then make them pass. do not
use any mock tests. use uv run to run everything so we don't install anything
in the current environment. test the script and if it works fine and generates
an image, then use this script to run image generation on the five prompts
given in prompts.json. save the images to an images folder, make sure the
folder exists, if it doesn't then create it.

This is a slightly heavier multi-step workflow. The model has to:

  • Look up the Gemini image generation API docs via Context7
  • Write a Python script using the google-genai SDK
  • Follow TDD red-green methodology (write failing tests first, then make them pass)
  • Use real API calls, no mocks
  • Use uv run for dependencies
  • Read prompts.json, generate images for all five prompts
  • Handle file I/O (create output directory, save images)

The idea is to see how well the model executes this multi-step workflow correctly.

Results: Gemma4-26B-A4B

VRAM: ~21 GB   |   Context: 256K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 4,338 | 4,560 |
| Generation tok/s | 135.5 | 134.4 |
| Total prompt tokens processed | 17,847 | 23,204 |
| Total tokens generated | 1,623 | 3,435 |
| Prompt processing time | 4.11s | 5.09s |
| Generation time | 11.98s | 125.56s |
| API calls | 10 | 13 |
| Tool calls | 7 | 11 |
| Correct on turn | 1st | 3rd |

API calls is the number of API calls Open Code makes to the LLM.

  • It is fast: ~135 tok/s generation and 4.3K+ tok/s prefill, the fastest of all four models.
  • It is also by far the most concise model based on tokens generated.
  • It needed 3 attempts on Prompt 2. Despite being the fastest and most concise, it struggled with the multi-step instructions.

Results: Gemma4-31B

VRAM: ~24 GB   |   Context: 65K tokens

Note: I had to drop the context size from 128K to 65K to maintain reasonable generation speed. At 128K, generation degrades to around ~10 tok/s; the best speed was only achievable at around 65K tokens.

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 1,466 | 1,357 |
| Generation tok/s | 37.7 | 35.2 |
| Total prompt tokens processed | 16,618 | 25,070 |
| Total tokens generated | 2,903 | 5,968 |
| Prompt processing time | 11.34s | 18.48s |
| Generation time | 77.07s | 169.53s |
| API calls | 10 | 16 |
| Tool calls | 8 | 14 |
| Correct on turn | 1st | 1st |

  • Got Prompt 2 correct on the first turn. The dense model's reliability on the complex task was noticeably better.
  • The model generated nearly twice as many tokens as the MoE variant (2,903 vs 1,623 on Prompt 1), including 1,548 reasoning tokens.
  • The 65K context limit is a real practical limitation; I am not sure whether this speed degradation at long context will be solved in the future.

Results: Qwen3.5-35B-A3B

VRAM: ~23 GB   |   Context: 200K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 3,179 | 3,056 |
| Generation tok/s | 136.7 | 132.3 |
| Total prompt tokens processed | 16,145 | 92,375 |
| Total tokens generated | 7,564 | 32,904 |
| Prompt processing time | 5.08s | 30.23s |
| Generation time | 55.32s | 248.75s |
| API calls | 13 | 30 |
| Tool calls | 11 | 28 |
| Correct on turn | 1st | 2nd |

  • The generation speed is nearly identical to Gemma4-26B-A4B.
  • The model is extremely verbose, though: ~7.5K and ~32K tokens on Prompts 1 and 2.
  • Prompt 2 was the most intensive run of the entire benchmark: 30 API calls with 28 tool calls.
  • Got Prompt 2 correct on the 2nd turn, one turn better than Gemma4-26B-A4B.
  • The 248.75s generation time on Prompt 2 is a direct result of that volume of API and tool calls.

Results: Qwen3.5-27B

VRAM: ~21 GB   |   Context: 130K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 2,474 | 2,188 |
| Generation tok/s | 44.9 | 44.6 |
| Total prompt tokens processed | 15,043 | 24,385 |
| Total tokens generated | 2,867 | 11,824 |
| Prompt processing time | 6.08s | 11.14s |
| Generation time | 63.91s | 265.00s |
| API calls | 9 | 18 |
| Tool calls | 7 | 14 |
| Correct on turn | 1st | 1st |

  • Got Prompt 2 correct on the first turn. Same as Gemma4-31B.
  • Most efficient session on Prompt 1 with fewest API calls (9) and tool calls.
  • Generation at 44.9 tok/s is slower than MoE but faster than Gemma4-31B (37.7).
  • 130K context fits comfortably in VRAM. This is a practical sweet spot with decent enough context size.

Comparing the Performance

Summary Tables

Speed

| Model | Prefill tok/s (P1) | Prefill tok/s (P2) | Gen tok/s (P1) | Gen tok/s (P2) |
|---|---|---|---|---|
| Gemma4-26B-A4B | 4,338 | 4,560 | 135.5 | 134.4 |
| Qwen3.5-35B-A3B | 3,179 | 3,056 | 136.7 | 132.3 |
| Gemma4-31B | 1,466 | 1,357 | 37.7 | 35.2 |
| Qwen3.5-27B | 2,474 | 2,188 | 44.9 | 44.6 |

Efficiency and Completion

| Model | Tokens Gen (P1) | Tokens Gen (P2) | API Calls (P1) | API Calls (P2) | Tool Calls (P2) | Correct Turn (P2) |
|---|---|---|---|---|---|---|
| Gemma4-26B-A4B | 1,623 | 3,435 | 10 | 13 | 11 | 3rd |
| Qwen3.5-35B-A3B | 7,564 | 32,904 | 13 | 30 | 28 | 2nd |
| Gemma4-31B | 2,903 | 5,968 | 10 | 16 | 14 | 1st |
| Qwen3.5-27B | 2,867 | 11,824 | 9 | 18 | 14 | 1st |

Hardware Fit (RTX 4090 24 GB)

| Model | VRAM Usage | Max Context |
|---|---|---|
| Gemma4-26B-A4B | ~21 GB | 256,000 |
| Qwen3.5-35B-A3B | ~23 GB | 200,000 |
| Qwen3.5-27B | ~21 GB | 130,672 |
| Gemma4-31B | ~24 GB | 65,336 |

Code Quality

I looked at the working code each model produced for Prompt 2 (the Nano Banana image generation task) and used Opus to compare them on structure, error handling, TDD compliance, API correctness and overall cleanliness.

| Aspect | Gemma4-26B-A4B | Gemma4-31B | Qwen3.5-35B-A3B | Qwen3.5-27B |
|---|---|---|---|---|
| Structure | 2 files, basic separation | 3 files, clean separation | Class-based with helpers, cleanest design | 3 files + dead main.py stub |
| Error handling | Minimal, no API error handling | Poor, no try/except around API | Adequate but no batch error recovery | Weak, silent failures |
| TDD | Placeholder test, no real TDD | One integration test, superficial | Integration tests only, claimed but not real | Integration tests only, claimed but not real |
| Cleanliness | Acceptable, concise | Good, readable, concise | Good structure but unused base64 import | Good docstrings, type hints, pathlib usage |
| Critical issues | Broken summary, no uv run setup | New client per API call | Hardcoded API key in tests, wrong model | Dead main.py, new client per call |

  • None of the models truly followed TDD. All of them claimed red-green methodology in their summaries but wrote integration tests that hit the real API. No model used mocks or wrote genuinely failing tests first.
  • Qwen3.5-27B produced the most correct code. It got the model name right, used type hints and docstrings, used pathlib properly and had the cleanest overall implementation. Its issues (dead main.py stub, client created per call) are minor compared to the others.
  • Qwen3.5-35B-A3B had the best code structure with a proper class-based design, but committed a security sin by hardcoding an API key in the test file and used the wrong model name entirely. For a task that specifically asked for gemini-3.1-flash-preview, using gemini-2.5-flash-image is a correctness failure.
  • Gemma4-31B was clean and concise but shallow. Minimal code, readable but no error handling and superficial testing.
  • Gemma4-26B-A4B was the weakest: a missing critical API parameter, a broken summary file, and no uv run integration despite being asked for it. This lines up with it needing 3 attempts to get working code.

Takeaways

Speed and Efficiency

  • Dense models were more reliable on the complex task. Both Qwen3.5-27B and Gemma4-31B got Prompt 2 right on the first turn. Both MoE models needed retries. Two data points is not a conclusion, but it is a pattern worth noting.
  • MoE speed advantage is real but verbosity can eat it up. Both MoE models hit ~135 tok/s generation vs ~40-45 tok/s for dense. But Qwen3.5-35B-A3B generated 32,904 tokens on Prompt 2 which means 248 seconds of generation even at MoE speeds. Gemma4-26B-A4B was the only model that was both fast and concise.
  • Gemma4-26B-A4B is the speed king. If you are doing high-volume simpler tasks where first-try reliability matters less, it is hard to beat.

Code Quality

  • Qwen3.5-27B produced the most correct and cleanest code overall. Right model name, type hints, docstrings, pathlib usage. Its issues are minor compared to every other model.
  • None of the models truly followed TDD. All claimed red-green methodology but wrote integration tests hitting the real API. No mocks, no genuinely failing tests first.
  • Better structure does not mean better code. Qwen3.5-35B-A3B had the cleanest design (class-based) but hardcoded an API key and used the wrong model name. Structure alone is not enough.

Bottom Line

  • Qwen3.5-27B feels like the best overall pick for agentic coding on a 4090.
    • Reliable: got the complex task right on the first try
    • 130K context is a practical sweet spot for long agentic sessions without maxing out the card
    • 44.9 tok/s is slower than MoE but fast enough for interactive use
    • Most efficient on the simple task (fewest API calls)
    • Only uses ~21 GB VRAM, leaving headroom
    • Produced the most correct and cleanest code of all four models

These are notes from a single benchmarking session with two prompts and my experience over the last 2 days. I am not claiming any of this is statistically rigorous.

Appendix

A: Hardware Fit and Server Config

llama-server Launch Config

Base config used for all models:

llama-server \
    --model $MODEL_PATH \
    --jinja \
    --host 100.80.101.103 \
    --port 8001 \
    --parallel 1 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on \
    --context-shift \
    --metrics

Per-Model Overrides

Gemma4 models:

--ctx-size 256000     # MoE (26B-A4B)
--ctx-size 65336      # Dense (31B) - reduced due to VRAM constraints
--temp 1.0
--top-p 0.95
--top-k 64
--min-p 0.00

Qwen 3.5 models:

--ctx-size 200000     # MoE (35B-A3B)
--ctx-size 130672     # Dense (27B)
--temp 0.6
--top-k 20
--chat-template-file $TEMPLATES_DIR/qwen35-chat-template-corrected.jinja
--chat-template-kwargs '{"enable_thinking":true}'

B: Installation

  1. llama.cpp Installation
sudo apt-get update
sudo apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev libssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && git pull origin master && cd ..
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j24 --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split llama-bench
cp llama.cpp/build/bin/llama-* llama.cpp
  2. Tokenscope: I used the opencode-tokenscope plugin to get per-session token breakdowns. You need to add "plugin": ["@ramtinj95/opencode-tokenscope"] to your opencode.json, then create a /tokenscope slash command in ~/.config/opencode/command/tokenscope.md.

  3. llama-server /metrics: llama-server exposes a /metrics endpoint (enabled with the --metrics flag) that returns Prometheus-format counters.

Troubleshooting

  1. Qwen3.5-35B-A3B todowrite Parse Error: Qwen3.5-35B-A3B sometimes returned tool call arguments as a raw JSON string instead of a parsed object. This caused the todowrite tool to fail because Open Code expected todos to be an array, not a string containing an array. You can fix this using a small plugin at ~/.opencode/plugins/todo-fix-plugins.ts:
export const TodoFixPlugin = async (ctx) => {
  return {
    "tool.execute.before": async (input, output) => {
      // Qwen3.5 sometimes emits `todos` as a JSON-encoded string rather than
      // an array; parse it back into a real array before the tool executes.
      if (input.tool === "todowrite" && typeof output.args.todos === "string") {
        output.args.todos = JSON.parse(output.args.todos)
      }
    }
  }
}
  2. Gemma4-31B Context Size: I had to reduce the context to 65,336 tokens to maintain ~40 tok/s generation. You can push it higher, but generation speed degrades as context grows.

  3. Qwen3.5 Chat Template: Qwen3.5 models needed a corrected Jinja chat template (qwen35-chat-template-corrected.jinja, from https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96) to work properly with llama-server. The default template had issues with thinking mode.