Notes on Qwen3.5 vs Gemma4 for Local Agentic Coding

Comparing Qwen 3.5 and Gemma4 (dense and MoE) for local agentic coding on an RTX 4090 using llama-bench and one-prompt coding tasks with Open Code.
Published

April 5, 2026

Gemma4 was released by Google on April 2nd, earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes on benchmarking the two model families. I ran two types of tests: raw throughput benchmarks with llama-bench, and one-prompt agentic coding tasks with Open Code.

Quick Summary:

| Model | Gen tok/s | Correct on Turn | Code Quality | VRAM | Max Context |
|---|---|---|---|---|---|
| Gemma4-26B-A4B | ~135 | 3rd | Weakest | ~21 GB | 256K |
| Qwen3.5-35B-A3B | ~136 | 2nd | Best structure, wrong API | ~23 GB | 200K |
| Qwen3.5-27B | ~45 | 1st | Cleanest and best overall | ~21 GB | 130K |
| Gemma4-31B | ~38 | 1st | Clean but shallow | ~24 GB | 65K |

Max Context is the largest context size that fits in VRAM with acceptable generation speed.

Models

Below are the models and their quantization that I used for benchmarking:

| Model | Architecture | Quant | Model Size | Total Params | Active Params |
|---|---|---|---|---|---|
| Qwen3.5-27B | Dense | Q4_K_XL | 16.40 GiB | 26.90 B | 26.90 B |
| Qwen3.5-35B-A3B | MoE | Q4_K_XL | 20.70 GiB | 34.66 B | ~3 B |
| Gemma4-26B-A4B | MoE | Q4_K_XL | 15.95 GiB | 25.23 B | ~4 B |
| Gemma4-31B | Dense | Q4_K_XL | 17.46 GiB | 30.70 B | 30.70 B |

  • All four models were run on April 3rd, 2026, with thinking mode enabled.
  • I used the Unsloth GGUF versions of the models on llama.cpp.
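As a quick sanity check on these numbers, file size divided by parameter count gives the effective bits per weight, which should land around 5 for Q4_K_XL (4-bit weights plus scales and a few higher-precision tensors). A sketch of that arithmetic using the table's figures:

```python
# Effective bits per weight implied by the model table above.
# Sizes are in GiB (2**30 bytes); parameter counts in units of 1e9.
models = {
    "Qwen3.5-27B":     (16.40, 26.90e9),
    "Qwen3.5-35B-A3B": (20.70, 34.66e9),
    "Gemma4-26B-A4B":  (15.95, 25.23e9),
    "Gemma4-31B":      (17.46, 30.70e9),
}

for name, (size_gib, n_params) in models.items():
    bits_per_weight = size_gib * 2**30 * 8 / n_params
    print(f"{name}: {bits_per_weight:.2f} bits/weight")
```

All four come out between roughly 4.9 and 5.4 bits/weight, so the quoted sizes and parameter counts are consistent with the quantization.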

Standard Benchmarks with llama-bench

llama.cpp ships a llama-bench utility that runs standard prefill and generation (decode) benchmarks. It is a quick way to get raw throughput numbers.

This is the command I used to run the benchmarks:

./llama.cpp/llama-bench \
  -m $MODEL_PATH \
  -ctk q8_0 -ctv q8_0 -fa 1 -b 2048 -ub 512 \
  -p 512,2048,4096,8192,16384,32768,65336 \
  -n 128 -r 3 -o md
| Flag | Value | Meaning |
|---|---|---|
| -ctk | q8_0 | KV cache keys quantized to 8-bit |
| -ctv | q8_0 | KV cache values quantized to 8-bit |
| -fa | 1 | Flash Attention enabled |
| -b | 2048 | Batch size (max tokens processed per batch) |
| -ub | 512 | Micro-batch size (tokens processed per CUDA kernel call) |
| -p | 512,2048,...,65336 | Prefill token counts to sweep |
| -n | 128 | Decode (generation) tokens per run |
| -r | 3 | Repeat each test 3 times and report the mean |
| -o | md | Output as a markdown table |

Prefill Speed

All values are prefill speed in tokens/s.

| Context | Qwen3.5-27B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Gemma4-31B |
|---|---|---|---|---|
| 512 | 3,037 | 6,666 | 8,597 | 3,100 |
| 2K | 3,069 | 6,674 | 8,710 | 2,992 |
| 4K | 3,025 | 6,633 | 8,733 | 2,925 |
| 8K | 2,957 | 6,524 | 8,443 | 2,811 |
| 16K | 2,841 | 6,308 | 7,961 | 2,614 |
| 32K | 2,632 | 5,920 | 7,097 | 2,304 |
| 65K | 2,290 | 5,273 | 5,917 | 1,869 |
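One way to make the prefill numbers tangible: context size divided by prefill speed approximates the time spent processing the prompt before the first token appears. A rough sketch using the 32K row above:

```python
# Rough time-to-first-token at a fully-filled 32K context,
# using the 32K-row prefill speeds (tokens/s) from the table above.
prefill_32k = {
    "Qwen3.5-27B": 2632,
    "Qwen3.5-35B-A3B": 5920,
    "Gemma4-26B-A4B": 7097,
    "Gemma4-31B": 2304,
}

context = 32768  # tokens of prompt to process
for name, tok_s in prefill_32k.items():
    print(f"{name}: ~{context / tok_s:.1f}s to first token at 32K context")
```

So at a long context the dense models make you wait 12-14 seconds per turn just for prefill, while the MoE models stay around 5 seconds.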

Generation Speed (tg128)

| Model | Architecture | Generation (tokens/s) |
|---|---|---|
| Qwen3.5-35B-A3B | MoE | 165.84 |
| Gemma4-26B-A4B | MoE | 164.38 |
| Qwen3.5-27B | Dense | 45.88 |
| Gemma4-31B | Dense | 44.42 |

Notes on llama-bench Results

  • As expected, the MoE models dominate both prefill and generation speed.
  • Generation speed for the two MoE models is nearly identical (~165 tok/s), and the same holds for the two dense models (~45 tok/s): decode is memory-bandwidth bound, so models that read a similar number of bytes per token land at similar speeds.
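The bandwidth-bound claim can be checked with back-of-the-envelope math: each decoded token has to stream the active weights from VRAM, so tok/s is roughly bounded by bandwidth divided by bytes read per token. A sketch, assuming the RTX 4090's ~1,008 GB/s spec-sheet bandwidth; the MoE estimate is crude, because attention and shared tensors are read on every token regardless of expert routing:

```python
# Back-of-the-envelope decode ceiling: tok/s <= bandwidth / bytes read per token.
# Dense models read all weights per token; the MoE read is approximated here as
# active_params / total_params of the file size (a deliberate oversimplification).
BW_BYTES_PER_S = 1008e9  # RTX 4090 spec-sheet memory bandwidth (assumption)
GiB = 2**30

models = {
    # name: (file size in GiB, total params, active params)
    "Qwen3.5-27B":     (16.40, 26.90e9, 26.90e9),
    "Qwen3.5-35B-A3B": (20.70, 34.66e9, 3e9),
}

for name, (size_gib, total, active) in models.items():
    bytes_per_token = size_gib * GiB * (active / total)
    print(f"{name}: ceiling ~{BW_BYTES_PER_S / bytes_per_token:.0f} tok/s")
```

The dense ceiling (~57 tok/s) sits just above the measured 45.88 tok/s, which fits the bandwidth-bound picture; the MoE ceiling is far above the measured ~165 tok/s precisely because this approximation ignores the always-read shared weights.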

Agentic Coding: One-Prompt Test

The llama-bench numbers tell you how fast tokens move (roughly the upper limit we can expect), but they say nothing about how a model actually performs at reasoning, tool calls, and writing code, or how fast it feels inside a coding assistant.

To test that, I ran a simple practical test: give the model one prompt and see if it can figure out the rest on its own. No hand-holding, no multi-turn guidance. The idea is to see how the model performs in that scenario.

This is not a formal test. It is two prompts at different complexity levels to see how well the model handles multi-step workflows. This is how most of us work anyway: we describe what we want in a single prompt and let the model do its thing.

Setup

I used Open Code as the agentic coding frontend because I find it easier to set up with a local llama-server backend. I also configured Context7 as a skills + MCP server to let the models fetch up-to-date library documentation and API docs during their runs.

llama-server was configured with a q8_0 KV cache, and the context size varied per model based on VRAM constraints to maximize generation speed (full config in Appendix A).

  • Speed metrics came from llama-server’s /metrics endpoint.
  • Token usage breakdowns were estimated using the opencode-tokenscope plugin.

I also made sure to restart llama-server between model runs so the counters would not carry over.
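A minimal sketch of how those speed numbers can be pulled from /metrics. The endpoint returns Prometheus-style `name value` lines; the counter names below (llamacpp:prompt_tokens_total and friends) match what recent llama-server builds expose, but verify them against your build's actual output:

```python
import urllib.request

def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore anything that is not a plain numeric sample
    return metrics

def session_speeds(base_url: str = "http://127.0.0.1:8001") -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) averaged over the session."""
    with urllib.request.urlopen(f"{base_url}/metrics") as resp:
        m = parse_metrics(resp.read().decode())
    prefill = m["llamacpp:prompt_tokens_total"] / m["llamacpp:prompt_seconds_total"]
    gen = m["llamacpp:tokens_predicted_total"] / m["llamacpp:tokens_predicted_seconds_total"]
    return prefill, gen
```

Both ratios are cumulative since server start, which is exactly why the server was restarted between model runs.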

Prompt 1: Simple (httpx + pytest)

use context7 to look up the httpx library docs. then write me a python script
that fetches the post from https://jsonplaceholder.typicode.com/posts/1 and
prints the title. also write a pytest test for it, no mocks, hit the real API.
use uv run to run everything so we don't install anything in the current
environment. run the test and make sure it passes.

This tests the basics: can the model call Context7 to look up docs, write a simple script and a real integration test (no mocks), use uv run for execution and dependency management, and actually run everything to verify it works.
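For scale, a passing answer to Prompt 1 only needs a few lines. This is my own sketch of the expected shape, not any model's output; the URL and the title field come from the real JSONPlaceholder API, while the function names are mine:

```python
# fetch_title.py - sketch of the script Prompt 1 asks for (not model output)

def get_title(post: dict) -> str:
    """Extract the title field from a JSONPlaceholder post object."""
    return post["title"]

def fetch_post(post_id: int = 1) -> dict:
    """Fetch a post from the live API. httpx is imported lazily so the
    pure logic above stays dependency-free."""
    import httpx
    resp = httpx.get(f"https://jsonplaceholder.typicode.com/posts/{post_id}")
    resp.raise_for_status()
    return resp.json()

# Usage: print(get_title(fetch_post(1)))
# A matching integration test would assert get_title(fetch_post(1)) is a
# non-empty string, run with: uv run --with httpx --with pytest pytest
```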

Prompt 2: Comprehensive (Image Gen API calls + TDD)

use context7 to search for the latest google gemini image generation API docs.
I want you to write a python script that uses the google-genai SDK to generate
images using the gemini-3.1-flash-preview model (nano banana). use TDD
red-green methodology, write failing tests first then make them pass. do not
use any mock tests. use uv run to run everything so we don't install anything
in the current environment. test the script and if it works fine and generates
an image, then use this script to run image generation on the five prompts
given in prompts.json. save the images to an images folder, make sure the
folder exists, if it doesn't then create it.

This is a slightly heavier multi-step workflow. The model has to:

  • Look up the Gemini image generation API docs via Context7
  • Write a Python script using the google-genai SDK
  • Follow TDD red-green methodology (write failing tests first, then make them pass)
  • Use real API calls, no mocks
  • Use uv run for dependencies
  • Read prompts.json, generate images for all five prompts
  • Handle file I/O (create output directory, save images)

The idea is to see how well the model executes this multi-step workflow correctly.

Results: Gemma4-26B-A4B

VRAM: ~21 GB   |   Context: 256K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 4,338 | 4,560 |
| Generation tok/s | 135.5 | 134.4 |
| Total prompt tokens processed | 17,847 | 23,204 |
| Total tokens generated | 1,623 | 3,435 |
| Prompt processing time | 4.11s | 5.09s |
| Generation time | 11.98s | 125.56s |
| API calls | 10 | 13 |
| Tool calls | 7 | 11 |
| Correct on turn | 1st | 3rd |

API calls is the number of API calls Open Code makes to the LLM.

  • It is fast: ~135 tok/s generation and 4.3K+ tok/s prefill, the fastest of all four models.
  • It is also by far the most concise model based on tokens generated.
  • It needed 3 attempts on Prompt 2. Despite being the fastest and most concise, it struggled with the multi-step instructions.

Results: Gemma4-31B

VRAM: ~24 GB   |   Context: 65K tokens

Note: I had to drop the context size from 128K to 65K to maintain reasonable generation speed. At 128K, generation degrades to around ~10 tok/s; the best speed was only achievable at around 65K tokens.

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 1,466 | 1,357 |
| Generation tok/s | 37.7 | 35.2 |
| Total prompt tokens processed | 16,618 | 25,070 |
| Total tokens generated | 2,903 | 5,968 |
| Prompt processing time | 11.34s | 18.48s |
| Generation time | 77.07s | 169.53s |
| API calls | 10 | 16 |
| Tool calls | 8 | 14 |
| Correct on turn | 1st | 1st |

  • Got Prompt 2 correct on the first turn. The dense model's reliability on the complex task was noticeably better.
  • The model generated nearly twice as many tokens as the MoE variant (2,903 vs 1,623 on Prompt 1), including 1,548 reasoning tokens.
  • The 65K context limit is a real practical limitation; I am not sure whether this speed degradation at long context will be solved in the future.

Results: Qwen3.5-35B-A3B

VRAM: ~23 GB   |   Context: 200K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 3,179 | 3,056 |
| Generation tok/s | 136.7 | 132.3 |
| Total prompt tokens processed | 16,145 | 92,375 |
| Total tokens generated | 7,564 | 32,904 |
| Prompt processing time | 5.08s | 30.23s |
| Generation time | 55.32s | 248.75s |
| API calls | 13 | 30 |
| Tool calls | 11 | 28 |
| Correct on turn | 1st | 2nd |

  • The generation speed is nearly identical to Gemma4-26B-A4B.
  • The model is extremely verbose, though: ~7.5K and ~32K tokens on Prompts 1 and 2.
  • Prompt 2 was the most intensive run of the entire benchmark: 30 API calls with 28 tool calls.
  • Got Prompt 2 correct on the 2nd turn, one turn better than Gemma4-26B-A4B.
  • The 248.75s generation time on Prompt 2 is a direct result of that volume of API and tool calls.

Results: Qwen3.5-27B

VRAM: ~21 GB   |   Context: 130K tokens

| Metric | Prompt 1 | Prompt 2 |
|---|---|---|
| Prefill tok/s | 2,474 | 2,188 |
| Generation tok/s | 44.9 | 44.6 |
| Total prompt tokens processed | 15,043 | 24,385 |
| Total tokens generated | 2,867 | 11,824 |
| Prompt processing time | 6.08s | 11.14s |
| Generation time | 63.91s | 265.00s |
| API calls | 9 | 18 |
| Tool calls | 7 | 14 |
| Correct on turn | 1st | 1st |

  • Got Prompt 2 correct on the first turn. Same as Gemma4-31B.
  • Most efficient session on Prompt 1 with fewest API calls (9) and tool calls.
  • Generation at 44.9 tok/s is slower than MoE but faster than Gemma4-31B (37.7).
  • 130K context fits comfortably in VRAM. This is a practical sweet spot with decent enough context size.

Comparing the Performance

Summary Tables

Speed

| Model | Prefill tok/s (P1) | Prefill tok/s (P2) | Gen tok/s (P1) | Gen tok/s (P2) |
|---|---|---|---|---|
| Gemma4-26B-A4B | 4,338 | 4,560 | 135.5 | 134.4 |
| Qwen3.5-35B-A3B | 3,179 | 3,056 | 136.7 | 132.3 |
| Gemma4-31B | 1,466 | 1,357 | 37.7 | 35.2 |
| Qwen3.5-27B | 2,474 | 2,188 | 44.9 | 44.6 |

Efficiency and Completion

| Model | Tokens Gen (P1) | Tokens Gen (P2) | API Calls (P1) | API Calls (P2) | Tool Calls (P2) | Correct Turn (P2) |
|---|---|---|---|---|---|---|
| Gemma4-26B-A4B | 1,623 | 3,435 | 10 | 13 | 11 | 3rd |
| Qwen3.5-35B-A3B | 7,564 | 32,904 | 13 | 30 | 28 | 2nd |
| Gemma4-31B | 2,903 | 5,968 | 10 | 16 | 14 | 1st |
| Qwen3.5-27B | 2,867 | 11,824 | 9 | 18 | 14 | 1st |

Hardware Fit (RTX 4090 24 GB)

| Model | VRAM Usage | Max Context |
|---|---|---|
| Gemma4-26B-A4B | ~21 GB | 256,000 |
| Qwen3.5-35B-A3B | ~23 GB | 200,000 |
| Qwen3.5-27B | ~21 GB | 130,672 |
| Gemma4-31B | ~24 GB | 65,336 |

Code Quality

I looked at the working code each model produced for Prompt 2 (the Nano Banana image generation task) and used Opus to compare them on structure, error handling, TDD compliance, API correctness and overall cleanliness.

| Aspect | Gemma4-26B-A4B | Gemma4-31B | Qwen3.5-35B-A3B | Qwen3.5-27B |
|---|---|---|---|---|
| Structure | 2 files, basic separation | 3 files, clean separation | Class-based with helpers, cleanest design | 3 files + dead main.py stub |
| Error handling | Minimal, no API error handling | Poor, no try/except around API | Adequate but no batch error recovery | Weak, silent failures |
| TDD | Placeholder test, no real TDD | One integration test, superficial | Integration tests only, claimed but not real | Integration tests only, claimed but not real |
| Cleanliness | Acceptable, concise | Good, readable, concise | Good structure but unused base64 import | Good docstrings, type hints, pathlib usage |
| Critical issues | Broken summary, no uv run setup | New client per API call | Hardcoded API key in tests, wrong model | Dead main.py, new client per call |

  • None of the models truly followed TDD. All of them claimed red-green methodology in their summaries but wrote integration tests that hit the real API. No model used mocks or wrote genuinely failing tests first.
  • Qwen3.5-27B produced the most correct code. It got the model name right, used type hints and docstrings, used pathlib properly and had the cleanest overall implementation. Its issues (dead main.py stub, client created per call) are minor compared to the others.
  • Qwen3.5-35B-A3B had the best code structure with a proper class-based design, but committed a security sin by hardcoding an API key in the test file and used the wrong model name entirely. For a task that specifically asked for gemini-3.1-flash-preview, using gemini-2.5-flash-image is a correctness failure.
  • Gemma4-31B was clean and concise but shallow. Minimal code, readable but no error handling and superficial testing.
  • Gemma4-26B-A4B was the weakest: a missing critical API parameter, a broken summary file, and no uv run integration despite being asked for it. This lines up with it needing 3 attempts to get working code.

Takeaways

Speed and Efficiency

  • Dense models were more reliable on the complex task. Both Qwen3.5-27B and Gemma4-31B got Prompt 2 right on the first turn. Both MoE models needed retries. Two data points is not a conclusion, but it is a pattern worth noting.
  • MoE speed advantage is real but verbosity can eat it up. Both MoE models hit ~135 tok/s generation vs ~40-45 tok/s for dense. But Qwen3.5-35B-A3B generated 32,904 tokens on Prompt 2 which means 248 seconds of generation even at MoE speeds. Gemma4-26B-A4B was the only model that was both fast and concise.
  • Gemma4-26B-A4B is the speed king. If you are doing high-volume simpler tasks where first-try reliability matters less, it is hard to beat.

Code Quality

  • Qwen3.5-27B produced the most correct and cleanest code overall. Right model name, type hints, docstrings, pathlib usage. Its issues are minor compared to every other model.
  • None of the models truly followed TDD. All claimed red-green methodology but wrote integration tests hitting the real API. No mocks, no genuinely failing tests first.
  • Better structure does not mean better code. Qwen3.5-35B-A3B had the cleanest design (class-based) but hardcoded an API key and used the wrong model name. Structure alone is not enough.

Bottom Line

  • Qwen3.5-27B feels like the best overall pick for agentic coding on a 4090.
    • Reliable: got the complex task right on the first try
    • 130K context is a practical sweet spot for long agentic sessions without maxing out the card
    • 44.9 tok/s is slower than MoE but fast enough for interactive use
    • Most efficient on the simple task (fewest API calls)
    • Only uses ~21 GB VRAM, leaving headroom
    • Produced the most correct and cleanest code of all four models

These are notes from a single benchmarking session with two prompts and my experience over the last 2 days. I am not claiming any of this is statistically rigorous.

Appendix

A: Hardware Fit and Server Config

llama-server Launch Config

Base config used for all models:

llama-server \
    --model $MODEL_PATH \
    --jinja \
    --host 100.80.101.103 \
    --port 8001 \
    --parallel 1 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on \
    --context-shift \
    --metrics

Per-Model Overrides

Gemma4 models:

--ctx-size 256000     # MoE (26B-A4B)
--ctx-size 65336      # Dense (31B) - reduced due to VRAM constraints
--temp 1.0
--top-p 0.95
--top-k 64
--min-p 0.00

Qwen 3.5 models:

--ctx-size 200000     # MoE (35B-A3B)
--ctx-size 130672     # Dense (27B)
--temp 0.6
--top-k 20
--chat-template-file $TEMPLATES_DIR/qwen35-chat-template-corrected.jinja
--chat-template-kwargs '{"enable_thinking":true}'

B: Installation

  1. llama.cpp Installation
sudo apt-get update
sudo apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev libssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && git pull origin master && cd ..
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j24 --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split llama-bench
cp llama.cpp/build/bin/llama-* llama.cpp
  2. Tokenscope: I used the opencode-tokenscope plugin to get per-session token breakdowns. You need to add "plugin": ["@ramtinj95/opencode-tokenscope"] to your opencode.json, then create a /tokenscope slash command in ~/.config/opencode/command/tokenscope.md.

  3. llama-server /metrics: llama-server exposes a /metrics endpoint (enabled with the --metrics flag) that returns Prometheus-format counters.

Troubleshooting

  1. Qwen3.5-35B-A3B todowrite Parse Error: Qwen3.5-35B-A3B sometimes returned tool call arguments as a raw JSON string instead of a parsed object. This caused the todowrite tool to fail because Open Code expected todos to be an array, not a string containing an array. You can fix this using a small plugin at ~/.opencode/plugins/todo-fix-plugins.ts:
export const TodoFixPlugin = async (ctx) => {
  return {
    "tool.execute.before": async (input, output) => {
      // Qwen3.5 sometimes emits `todos` as a JSON-encoded string rather than
      // an array; parse it back into a real array before the tool executes.
      if (input.tool === "todowrite" && typeof output.args.todos === "string") {
        output.args.todos = JSON.parse(output.args.todos)
      }
    }
  }
}
  2. Gemma4-31B Context Size: I had to reduce the context to 65,336 tokens to maintain ~40 tok/s generation. You can push it higher, but generation speed degrades as context grows.

  3. Qwen3.5 Chat Template: Qwen3.5 models needed a corrected Jinja chat template (qwen35-chat-template-corrected.jinja, from https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96) to work properly with llama-server. The default template had issues with thinking mode.