<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Aayush Garg</title>
<link>https://garg-aayush.github.io/blog/</link>
<atom:link href="https://garg-aayush.github.io/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>From-scratch LLM builds, deep-dives and practical AI engineering.</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Wed, 29 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Building a Live India Weather Dashboard with Claude Code</title>
  <link>https://garg-aayush.github.io/posts/2026-04-29-india-weather-with-claude/</link>
  <description><![CDATA[ 




<p>It is summer in India, with temperatures soaring above 40°C in many parts, and like a lot of folks here, I have a habit of checking the temperature and humidity more often than I would like to admit. If you are from northern or western India, you are probably also checking the AQI, given how big and genuine the pollution problem is in these parts.</p>
<p>As a fun project, I built a small dashboard that shows the weather and AQI for the major Indian cities, with the option of viewing the history over 24 hours, 7 days and 30 days. The dashboard fetches fresh data every 15 minutes, so you get a more or less live view of the weather and AQI across these cities.</p>
<p>The dashboard is live at <a href="../../india-weather.html">aayushgarg.dev/india-weather</a>.</p>
<table align="center">
<tbody><tr>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/weather_dashboard_snap1.jpg" alt="Live India weather map" width="100%">
</td>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/weather_dashboard_snap2.jpg" alt="Per-city history charts" width="100%">
</td>
</tr>
</tbody></table>
<p>As usual (and like most of the devs out there) I used <a href="https://claude.ai/code">Claude Code</a> to build this dashboard. I am an ML Researcher/Engineer, so not the brightest when it comes to frontend and devops. This post is my reflection on how I went about building (prompting!) this dashboard in stages.</p>
<section id="what-it-does" class="level2">
<h2 class="anchored" data-anchor-id="what-it-does">What it does</h2>
<p>Just a quick overview of what the dashboard does:</p>
<ul>
<li>I have picked 20 Indian metropolitan cities and displayed them on an interactive map, with live temperature, humidity and AQI refreshed every 15 minutes</li>
<li>When you click a marker or a leaderboard entry, you get 24h / 7d / 30d charts for the temperature, humidity and AQI</li>
<li>The map is based on <a href="https://www.mapbox.com/">Mapbox</a>, locked to India’s bounds</li>
<li>The site is static, so fetching runs server-side on a cron and the page just reads JSON</li>
</ul>
<p>All the numbers are fetched from <a href="https://open-meteo.com/">Open-Meteo</a> and <a href="https://aqicn.org/">WAQI</a>:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 17%">
<col style="width: 16%">
<col style="width: 12%">
<col style="width: 18%">
<col style="width: 34%">
</colgroup>
<thead>
<tr class="header">
<th>Metric</th>
<th style="text-align: center;">Live (15 min)</th>
<th style="text-align: center;">24h chart</th>
<th style="text-align: center;">7d / 30d charts</th>
<th>Source</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Temperature</td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;">✓</td>
<td>Open-Meteo Forecast</td>
</tr>
<tr class="even">
<td>Humidity</td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;">✓</td>
<td>Open-Meteo Forecast</td>
</tr>
<tr class="odd">
<td>Live AQI</td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td>WAQI / CPCB stations</td>
</tr>
<tr class="even">
<td>Historical AQI</td>
<td style="text-align: center;"></td>
<td style="text-align: center;">✓</td>
<td style="text-align: center;">✓</td>
<td>Open-Meteo Air Quality (CAMS)</td>
</tr>
</tbody>
</table>
<p>Why these two options:</p>
<ul>
<li>I wanted free APIs with no client-exposed key, which rules out paid services like AccuWeather and OpenWeatherMap (and even Google Maps, I guess).</li>
<li>Open-Meteo is free for non-commercial use and runs on the same ECMWF + GFS models behind most weather apps.</li>
<li><strong>WAQI is the more interesting pick.</strong> Its Indian feed comes directly from the CPCB ground stations (the same network behind <code>airquality.cpcb.gov.in</code>) and republishes on the natural CPCB cadence of about 15 minutes, so it is effectively the official live AQI reading with a free developer API in front.</li>
<li>Open-Meteo’s air-quality stream is different in kind. It is a global chemistry model on a coarse grid, which is good for the 24h/7d/30d historical trends but tends to under-read Indian pollution episodes compared to CPCB stations.</li>
</ul>
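<p>To make the Open-Meteo side concrete, here is a minimal Python sketch of a keyless request for the two live weather values. It is an illustration under stated assumptions, not the dashboard's actual code: the helper names, the Delhi coordinates and the sample values are hypothetical, and the query parameters follow Open-Meteo's public forecast API.</p>

```python
from urllib.parse import urlencode

OPEN_METEO_FORECAST = "https://api.open-meteo.com/v1/forecast"


def build_current_weather_url(lat: float, lon: float) -> str:
    """Build a keyless Open-Meteo request for current temperature and humidity."""
    query = urlencode({
        "latitude": lat,
        "longitude": lon,
        "current": "temperature_2m,relative_humidity_2m",
        "timezone": "Asia/Kolkata",
    })
    return f"{OPEN_METEO_FORECAST}?{query}"


def parse_current_weather(payload: dict) -> dict:
    """Pull the two live-tile values out of an Open-Meteo style response."""
    current = payload["current"]
    return {
        "time": current["time"],
        "temp_c": current["temperature_2m"],
        "humidity_pct": current["relative_humidity_2m"],
    }


# Hypothetical response trimmed to the fields used above (values are made up).
sample = {
    "current": {
        "time": "2026-04-29T14:15",
        "temperature_2m": 41.2,
        "relative_humidity_2m": 18,
    }
}
print(build_current_weather_url(28.61, 77.21))
print(parse_current_weather(sample))
```

<p>Because no key is involved, a request like this can run from any server-side job without secrets management. The WAQI call would be the one place a token has to be kept out of client code.</p>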
<p>I built this in three loose stages, with stages 2 and 3 overlapping in practice. Below is roughly how each went.</p>
</section>
<section id="stage-1-brainstorming-and-building-a-working-v1" class="level2">
<h2 class="anchored" data-anchor-id="stage-1-brainstorming-and-building-a-working-v1">Stage 1: brainstorming and building a working v1</h2>
<p>When I am building something new my first instinct is to brainstorm with the coding assistant (Claude Code in my case, codex or whichever one you use) before even asking it to write a single line of code. The shape of the prompt is roughly:</p>
<blockquote class="blockquote">
<p>here is what I want to build, here are the options I see, here are the constraints, help me decide.</p>
</blockquote>
<p>For this dashboard the biggest constraint was easy to state: since the site is my personal portfolio, anything I pulled in had to fit inside a free tier with enough headroom for the kind of traffic a personal site sees.</p>
<p>The one line that <strong>ALWAYS</strong> helps me is appending this at the end of the prompt:</p>
<blockquote class="blockquote">
<p>Interview me until you have 95% confidence about what I actually want, not what I think I should want.</p>
</blockquote>
<p>It tells the model to keep questioning your assumptions instead of taking them at face value, which is exactly what you want when you are scoping a new or large feature. Whenever I am about to integrate something I have not built before, this single line saves me a lot of time by fast-tracking development and arriving at a plan that actually fits.</p>
<p>For this dashboard the back-and-forth helped me make the important decisions early.</p>
<ul>
<li>I had said Google Maps; Claude pointed out that on a static site any key would leak in client JS and steered me to Mapbox’s free tier.</li>
<li>I had said AccuWeather; it asked whether brand mattered more than accuracy for India, and pushed me to Open-Meteo + WAQI, both free and key-optional.</li>
<li>It also laid out the cadence logic for me: a GitHub Actions cron every 15 minutes that writes a single JSON to a <code>data</code> branch on the same repo, which the page then reads on load.</li>
<li>By the end of the session I had a live page with 8 metros, a marker-plus-card UI and a working live tile, with data being fetched every 15 minutes.</li>
</ul>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage1_snap.jpg" class="img-fluid" style="width:100.0%"></p>
<p><em>Full Claude Code conversation for this stage: <a href="https://gist.github.com/garg-aayush/ad4dce3170fddcefdfb75eabc5b15620">stage-1-brainstorm-and-v1.txt</a>.</em></p>
</section>
<section id="stage-2-improving-the-dashboard-with-history-charts-and-ui-fixes" class="level2">
<h2 class="anchored" data-anchor-id="stage-2-improving-the-dashboard-with-history-charts-and-ui-fixes">Stage 2: improving the dashboard with history charts and UI fixes</h2>
<p>The next thing that I wanted to do was to add history charts under the map. I wanted to be able to view the history of the temperature, humidity and AQI for each city over certain time periods. This is the first version that I made conversationally with Claude:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage2_hischarts.jpg" class="img-fluid" style="width:60.0%"></p>
<p>Obviously, this was not the best way to view the values. The 7d and 30d views looked terrible with the day-night cycle turning every chart into a sawtooth. Next, I sent Claude the screenshot and asked one question:</p>
<blockquote class="blockquote">
<p>Currently, this is how the values look over the 7d and 30d period, I dont think that is the best way to view the values. What is the standard way to show these values?</p>
</blockquote>
<p>Claude pointed me at the weather.com pattern, where weather apps aggregate the values by day, show min/max bands with a daily mean line for temperature and humidity, and color the AQI bars by EPA category. A second daily cron rebuilds the 30-day window from Open-Meteo on each run. The result looked much nicer:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage2_hischarts_better.jpg" class="img-fluid" style="width:60.0%"></p>
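<p>The daily-aggregation idea can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's implementation: the function names and the input shape (ISO timestamp plus value pairs) are assumptions, while the category breakpoints and colors are the standard US EPA AQI scale.</p>

```python
from collections import defaultdict
from statistics import mean

# US EPA AQI breakpoints: (upper bound, category, conventional color).
EPA_CATEGORIES = [
    (50, "Good", "green"),
    (100, "Moderate", "yellow"),
    (150, "Unhealthy for Sensitive Groups", "orange"),
    (200, "Unhealthy", "red"),
    (300, "Very Unhealthy", "purple"),
]


def epa_category(aqi: float) -> tuple:
    """Return (category, color) for an AQI value; above 300 is Hazardous."""
    for upper, name, color in EPA_CATEGORIES:
        if aqi <= upper:
            return name, color
    return "Hazardous", "maroon"


def daily_bands(hourly: list) -> list:
    """Collapse (ISO timestamp, value) pairs into one min/mean/max row per day."""
    by_day = defaultdict(list)
    for timestamp, value in hourly:
        by_day[timestamp[:10]].append(value)  # "YYYY-MM-DD" prefix is the day key
    return [
        {"day": day, "min": min(vals), "mean": round(mean(vals), 1), "max": max(vals)}
        for day, vals in sorted(by_day.items())
    ]


# Made-up readings: a cool night and a hot afternoon on two consecutive days.
hourly_temps = [
    ("2026-04-28T03:00", 27.0),
    ("2026-04-28T15:00", 41.0),
    ("2026-04-29T03:00", 28.0),
    ("2026-04-29T15:00", 42.0),
]
bands = daily_bands(hourly_temps)
print(bands)
print(epa_category(180))
```

<p>Grouping by day is what removes the sawtooth: the day-night swing becomes the width of the band instead of a zigzag in the line.</p>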
<p>Finally, I improved the UI by fixing a few small issues. Some of them you can see below:</p>
<table align="center">
<tbody><tr>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage2_ui_issues1.jpg" alt="UI issue: label clutter" width="100%">
</td>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage2_ui_issues2.jpg" alt="UI issue: bounds and styling" width="100%">
</td>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage2_ui_issues3.jpg" alt="UI issue: caption and controls" width="100%">
</td>
</tr>
</tbody></table>
<ul>
<li>Better labeling for the cities that sit close together, better placement of the labels and leader lines</li>
<li>Locking the map to India’s bounds</li>
<li>Adding a reset-view control to the map</li>
<li>A few alignment and styling tweaks</li>
</ul>
<p><em>Full Claude Code conversations for this stage: <a href="https://gist.github.com/garg-aayush/90d071d907958bc086562e5c17652d33">history charts</a>, <a href="https://gist.github.com/garg-aayush/4e7cefa56c5bba94c1ef3f830b3f0f48">daily aggregates</a>, <a href="https://gist.github.com/garg-aayush/259733c719cf10f77cf035e1c28b9380">UI polish</a>.</em></p>
</section>
<section id="stage-3-moving-the-cron-from-github-actions-to-cloudflare" class="level2">
<h2 class="anchored" data-anchor-id="stage-3-moving-the-cron-from-github-actions-to-cloudflare">Stage 3: moving the cron from GitHub Actions to Cloudflare</h2>
<p>After the dashboard was deployed I noticed the live tile timestamps were drifting. The label said “every 15 minutes” but the GitHub Actions free-tier cron was firing erratically, sometimes 40 minutes or even &gt;1 hour late. This is common with free-tier cron jobs on GitHub Actions.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage3_github_actions.jpg" class="img-fluid" style="width:80.0%"></p>
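<p>One way to quantify this kind of drift is to compare each snapshot's timestamp with the expected cadence. The Python sketch below is illustrative only: the function name and the idea that the JSON carries an ISO 8601 fetch timestamp are assumptions, not details taken from the dashboard's code.</p>

```python
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(minutes=15)


def minutes_late(fetched_at: str, now: datetime) -> float:
    """Minutes by which a snapshot has overshot the expected 15-minute cadence."""
    fetched = datetime.fromisoformat(fetched_at)
    overdue = (now - fetched) - EXPECTED_INTERVAL
    return max(overdue.total_seconds() / 60, 0.0)


now = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
on_time = minutes_late("2026-04-29T11:50:00+00:00", now)  # 10 minutes old
drifted = minutes_late("2026-04-29T11:05:00+00:00", now)  # 55 minutes old
print(on_time, drifted)
```

<p>A snapshot that is 55 minutes old counts as 40 minutes late against a 15-minute schedule, which is the sort of gap described above.</p>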
<p>I decided to move the cron to Cloudflare Workers, which has a free tier and fires within seconds of the schedule.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-29/stage3_cloudflare.jpg" class="img-fluid" style="width:100.0%"></p>
<p>The interesting thing for me here is that I had never used Cloudflare Workers for cron jobs before. Given how good these assistants are at writing code, that was not a problem: Claude wrote the code and walked me through the setup steps, and the live cadence has been working well since. <strong>Even unfamiliar backend infra is something you can prompt your way through</strong> (obviously you should have the fundamental knowledge to nudge it in the right direction).</p>
<p><em>Full Claude Code conversation for this stage: <a href="https://gist.github.com/garg-aayush/aae13309e3c8c8ab42499b2eb9987b7c">Cloudflare Worker migration</a>.</em></p>
</section>
<section id="what-i-would-carry-forward" class="level2">
<h2 class="anchored" data-anchor-id="what-i-would-carry-forward">What I would carry forward</h2>
<p>A few things from this build that I will keep doing:</p>
<ul>
<li><strong>Brainstorm before you build</strong>: Asking the model to interview you back, with a 95% confidence target, saves a lot of rework downstream. This has always worked for me!</li>
<li><strong>Let it disagree with you</strong>: Mapbox over Google Maps, daily bands over raw sawtooth points, Cloudflare over GitHub Actions. None of those calls were mine. However, make sure your fundamental knowledge is strong enough to nudge it in the right direction.</li>
<li><strong>Iterate from screenshots</strong>: A picture of what looks wrong plus one question is usually enough to move forward. I also use the Playwright MCP to let Claude test the UI (though it can consume a lot of tokens at times).</li>
</ul>


</section>

 ]]></description>
  <category>Tools &amp; Infra</category>
  <guid>https://garg-aayush.github.io/posts/2026-04-29-india-weather-with-claude/</guid>
  <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3.5 vs Gemma4 for Local Agentic Coding</title>
  <link>https://garg-aayush.github.io/posts/2026-04-05-qwen35-vs-gemma4/</link>
  <description><![CDATA[ 




<p><a href="https://deepmind.google/models/gemma/gemma-4/">Gemma4</a> was released by Google on April 2nd, earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes on benchmarking the two model families. I ran two types of tests:</p>
<ul>
<li><strong>Standard llama-bench benchmarks</strong> for raw prefill and generation speed</li>
<li><strong>Single-shot agentic coding tasks</strong> using <a href="https://opencode.ai">Open Code</a> to see how these models actually perform on real multi-step coding workflows</li>
</ul>
<p><strong>Quick Summary:</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Gen tok/s</th>
<th>Turn(correct)</th>
<th>Code Quality</th>
<th>VRAM</th>
<th>Max Context</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Gemma4-26B-A4B</td>
<td>~135</td>
<td>3rd</td>
<td>Weakest</td>
<td>~21 GB</td>
<td>256K</td>
</tr>
<tr class="even">
<td>Qwen3.5-35B-A3B</td>
<td>~136</td>
<td>2nd</td>
<td>Best structure, wrong API</td>
<td>~23 GB</td>
<td>200K</td>
</tr>
<tr class="odd">
<td>Qwen3.5-27B</td>
<td>~45</td>
<td>1st</td>
<td>Cleanest and best overall</td>
<td>~21 GB</td>
<td>130K</td>
</tr>
<tr class="even">
<td>Gemma4-31B</td>
<td>~38</td>
<td>1st</td>
<td>Clean but shallow</td>
<td>~24 GB</td>
<td>65K</td>
</tr>
</tbody>
</table>
<blockquote class="blockquote">
<p><strong>Max Context</strong> is the largest context size that fits in VRAM with acceptable generation speed.</p>
</blockquote>
<ul>
<li>MoE models are about 3x faster at generation, but both dense models got the complex task right on the <strong>first try</strong>.</li>
<li><strong>My pick is Qwen3.5-27B, which is still the best model for local agentic coding</strong> on a 24GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code and fits comfortably on a 4090.</li>
</ul>
<section id="models" class="level2">
<h2 class="anchored" data-anchor-id="models">Models</h2>
<p>Below are the models and quantizations I used for benchmarking:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Architecture</th>
<th>Quant</th>
<th>Model size</th>
<th>Total Params</th>
<th>Active Params</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Qwen3.5-27B</td>
<td>Dense</td>
<td>Q4_K_XL</td>
<td>16.40 GiB</td>
<td>26.90 B</td>
<td>26.90 B</td>
</tr>
<tr class="even">
<td>Qwen3.5-35B-A3B</td>
<td>MoE</td>
<td>Q4_K_XL</td>
<td>20.70 GiB</td>
<td>34.66 B</td>
<td>~3 B</td>
</tr>
<tr class="odd">
<td>Gemma4-26B-A4B</td>
<td>MoE</td>
<td>Q4_K_XL</td>
<td>15.95 GiB</td>
<td>25.23 B</td>
<td>~4 B</td>
</tr>
<tr class="even">
<td>Gemma4-31B</td>
<td>Dense</td>
<td>Q4_K_XL</td>
<td>17.46 GiB</td>
<td>30.70 B</td>
<td>30.70 B</td>
</tr>
</tbody>
</table>
<ul>
<li>All four models were run on April 3rd, 2026 with thinking mode enabled.</li>
<li>I used the <a href="https://unsloth.ai/">Unsloth</a> GGUF versions of the models with <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>.</li>
</ul>
</section>
<section id="standard-benchmarks-with-llama-bench" class="level2">
<h2 class="anchored" data-anchor-id="standard-benchmarks-with-llama-bench">Standard Benchmarks with llama-bench</h2>
<p>llama.cpp has a <code>llama-bench</code> utility that runs standard prefill and generation (decode) benchmarks. It is a quick way to get raw throughput numbers.</p>
<p>This is the command I used to run the benchmarks:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">./llama.cpp/llama-bench</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$MODEL_PATH</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-ctk</span> q8_0 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-ctv</span> q8_0 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fa</span> 1 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-b</span> 2048 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-ub</span> 512 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-p</span> 512,2048,4096,8192,16384,32768,65336 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-n</span> 128 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-r</span> 3 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-o</span> md</span></code></pre></div></div>
<div class="callout callout-style-default callout-note callout-titled" title="llama-bench flags used">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>llama-bench flags used
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Flag</th>
<th>Value</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>-ctk</code></td>
<td><code>q8_0</code></td>
<td>KV cache keys quantized to 8-bit</td>
</tr>
<tr class="even">
<td><code>-ctv</code></td>
<td><code>q8_0</code></td>
<td>KV cache values quantized to 8-bit</td>
</tr>
<tr class="odd">
<td><code>-fa</code></td>
<td><code>1</code></td>
<td>Flash Attention enabled</td>
</tr>
<tr class="even">
<td><code>-b</code></td>
<td><code>2048</code></td>
<td>Batch size (max tokens processed per batch)</td>
</tr>
<tr class="odd">
<td><code>-ub</code></td>
<td><code>512</code></td>
<td>Micro-batch size (tokens processed per CUDA kernel call)</td>
</tr>
<tr class="even">
<td><code>-p</code></td>
<td><code>512,2048,...,65336</code></td>
<td>Prefill token counts to sweep</td>
</tr>
<tr class="odd">
<td><code>-n</code></td>
<td><code>128</code></td>
<td>Decode (generation) tokens per run</td>
</tr>
<tr class="even">
<td><code>-r</code></td>
<td><code>3</code></td>
<td>Repeat each test 3 times and report mean</td>
</tr>
<tr class="odd">
<td><code>-o</code></td>
<td><code>md</code></td>
<td>Output as markdown table</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<section id="prefill-speed" class="level3">
<h3 class="anchored" data-anchor-id="prefill-speed">Prefill Speed</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Context</th>
<th>Qwen3.5-27B</th>
<th>Qwen3.5-35B-A3B</th>
<th>Gemma4-26B-A4B</th>
<th>Gemma4-31B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>512</td>
<td>3,037</td>
<td>6,666</td>
<td><strong>8,597</strong></td>
<td>3,100</td>
</tr>
<tr class="even">
<td>2K</td>
<td>3,069</td>
<td>6,674</td>
<td><strong>8,710</strong></td>
<td>2,992</td>
</tr>
<tr class="odd">
<td>4K</td>
<td>3,025</td>
<td>6,633</td>
<td><strong>8,733</strong></td>
<td>2,925</td>
</tr>
<tr class="even">
<td>8K</td>
<td>2,957</td>
<td>6,524</td>
<td><strong>8,443</strong></td>
<td>2,811</td>
</tr>
<tr class="odd">
<td>16K</td>
<td>2,841</td>
<td>6,308</td>
<td><strong>7,961</strong></td>
<td>2,614</td>
</tr>
<tr class="even">
<td>32K</td>
<td>2,632</td>
<td>5,920</td>
<td><strong>7,097</strong></td>
<td>2,304</td>
</tr>
<tr class="odd">
<td>65K</td>
<td>2,290</td>
<td>5,273</td>
<td><strong>5,917</strong></td>
<td>1,869</td>
</tr>
</tbody>
</table>
</section>
<section id="generation-speed-tg128" class="level3">
<h3 class="anchored" data-anchor-id="generation-speed-tg128">Generation Speed (tg128)</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Architecture</th>
<th>Generation (tokens/s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Qwen3.5-35B-A3B</td>
<td>MoE</td>
<td><strong>165.84</strong></td>
</tr>
<tr class="even">
<td>Gemma4-26B-A4B</td>
<td>MoE</td>
<td>164.38</td>
</tr>
<tr class="odd">
<td>Qwen3.5-27B</td>
<td>Dense</td>
<td>45.88</td>
</tr>
<tr class="even">
<td>Gemma4-31B</td>
<td>Dense</td>
<td>44.42</td>
</tr>
</tbody>
</table>
</section>
<section id="notes-on-llama-bench-results" class="level3">
<h3 class="anchored" data-anchor-id="notes-on-llama-bench-results">Notes on llama-bench Results</h3>
<ul>
<li>As expected, the MoE models dominate both prefill and generation speed.</li>
<li>Generation speed for the two MoE models is nearly identical (~165 tok/s), and the same holds for the two dense models (~45 tok/s), as decoding is memory-bandwidth bound.</li>
</ul>
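<p>A back-of-the-envelope check of the memory-bandwidth point for the dense models: if every generated token has to stream the full set of quantized weights from VRAM once, decode speed cannot exceed bandwidth divided by model size. The sketch below assumes an RTX 4090 with roughly 1008 GB/s of peak memory bandwidth (the card and the figure are assumptions, not measurements from these runs) and ignores KV-cache reads, so treat the result as a rough ceiling.</p>

```python
GIB = 2**30


def decode_ceiling_tok_s(bytes_read_per_token: float, bandwidth_bytes_s: float) -> float:
    """Upper bound on decode speed if every token must stream these bytes once."""
    return bandwidth_bytes_s / bytes_read_per_token


# Assumption: RTX 4090 peak memory bandwidth of about 1008 GB/s (spec-sheet figure).
BANDWIDTH = 1008e9

# Dense models read all weights per token; sizes and speeds are from the tables above.
dense_models = [("Qwen3.5-27B", 16.40, 45.88), ("Gemma4-31B", 17.46, 44.42)]
ceilings = {}
for name, size_gib, measured in dense_models:
    ceilings[name] = decode_ceiling_tok_s(size_gib * GIB, BANDWIDTH)
    print(f"{name}: ceiling ~{ceilings[name]:.0f} tok/s, measured {measured} tok/s")
```

<p>Under these assumptions the ceilings land in the mid-50s tok/s, so the measured ~45 tok/s sits at roughly 80% of the bound, which is consistent with a bandwidth-limited decode.</p>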
</section>
</section>
<section id="agentic-coding-one-prompt-test" class="level2">
<h2 class="anchored" data-anchor-id="agentic-coding-one-prompt-test">Agentic Coding: One-Prompt Test</h2>
<blockquote class="blockquote">
<p>The llama-bench numbers tell you how fast tokens move (sort of the max limit we can expect) but they say nothing about how a model actually performs in reasoning, tool calls, writing code and <strong>actual speed with coding assistants</strong>.</p>
</blockquote>
<p>To test it, I ran a simple practical test: give the model one prompt and see if it can figure out the rest on its own, with no hand-holding or multi-turn guidance.</p>
<p><strong>This is not a formal test</strong>. It is two prompts at different complexity levels to see how well the model handles multi-step workflows. This is how most of us usually work: we describe what we want in a single prompt and let the model do its thing.</p>
<section id="setup" class="level3">
<h3 class="anchored" data-anchor-id="setup">Setup</h3>
<p>I used <a href="https://opencode.ai">Open Code</a> as the agentic coding frontend because I find it easier to set up with a local llama-server backend. I also configured <a href="https://context7.com">Context7</a> as a skills + MCP server to let the models fetch up-to-date library documentation and API docs during their runs.</p>
<p>llama-server was configured with <strong>q8_0 KV cache (turboquant)</strong> and context size varied per model based on VRAM constraints to maximize generation speed (full config in Appendix A).</p>
<ul>
<li>Speed metrics came from llama-server’s <code>/metrics</code> endpoint.</li>
<li>Token usage breakdowns were estimated using the <a href="https://www.npmjs.com/package/@ramtinj95/opencode-tokenscope">opencode-tokenscope</a> plugin.</li>
</ul>
<p>I also made sure to restart llama-server between model runs so the counters would not carry over.</p>
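<p>For reference, this is roughly how tok/s figures can be derived from cumulative counters in Prometheus text format. It is a minimal sketch, not the script used for this post: the metric names are an assumption about what llama-server exposes (check them against your build), and the sample text is constructed from the Gemma4-26B-A4B Prompt 1 numbers reported further down.</p>

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple `name value` lines from Prometheus text output (labels not handled)."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


def throughput(metrics: dict) -> dict:
    """Derive prefill and generation tok/s from cumulative token and time counters."""
    prefill = metrics["llamacpp:prompt_tokens_total"] / metrics["llamacpp:prompt_seconds_total"]
    generation = (
        metrics["llamacpp:tokens_predicted_total"]
        / metrics["llamacpp:tokens_predicted_seconds_total"]
    )
    return {"prefill_tok_s": prefill, "generation_tok_s": generation}


# Sample built from the Gemma4-26B-A4B Prompt 1 row; metric names are an assumption.
sample = """\
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 17847
llamacpp:prompt_seconds_total 4.11
llamacpp:tokens_predicted_total 1623
llamacpp:tokens_predicted_seconds_total 11.98
"""
rates = throughput(parse_prometheus(sample))
print({key: round(value, 1) for key, value in rates.items()})
```

<p>Since these are running totals, the division is only meaningful per model if the server starts from zero, which is why the restart between runs matters.</p>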
</section>
<section id="prompt-1-simple-httpx-pytest" class="level3">
<h3 class="anchored" data-anchor-id="prompt-1-simple-httpx-pytest">Prompt 1: Simple (httpx + pytest)</h3>
<pre><code>use context7 to look up the httpx library docs. then write me a python script
that fetches the post from https://jsonplaceholder.typicode.com/posts/1 and
prints the title. also write a pytest test for it, no mocks, hit the real API.
use uv run to run everything so we don't install anything in the current
environment. run the test and make sure it passes.</code></pre>
<p>This tests the basics: can the model call Context7 to look up docs, write a simple script and a real integration test (no mocks), use <code>uv run</code> for running and dependency management, and actually execute everything to verify it works?</p>
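<p>For a sense of scale, a passing answer to Prompt 1 can be very small. The sketch below is one hypothetical shape of such a script, not the output of any of the four models; the function names are made up, and <code>httpx</code> is imported inside the fetch function so the pure helper can be used without the dependency installed.</p>

```python
API_URL = "https://jsonplaceholder.typicode.com/posts/1"


def extract_title(post: dict) -> str:
    """Return the title field of a JSONPlaceholder-style post."""
    return post["title"]


def fetch_post_title(url: str = API_URL) -> str:
    """Fetch the post over HTTP and return its title (needs network and httpx)."""
    import httpx  # imported lazily so extract_title works without the dependency

    response = httpx.get(url, timeout=10.0)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing them
    return extract_title(response.json())


# Offline check of the parsing step with a made-up payload.
print(extract_title({"id": 1, "title": "example title"}))
```

<p>A matching integration test would call <code>fetch_post_title()</code> against the real API and assert that it returns a non-empty string. Assuming the file is saved as <code>fetch_title.py</code> (a hypothetical name), something like <code>uv run --with httpx --with pytest pytest fetch_title.py</code> would run it without touching the current environment.</p>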
</section>
<section id="prompt-2-comprehensive-image-gen-api-calls-tdd" class="level3">
<h3 class="anchored" data-anchor-id="prompt-2-comprehensive-image-gen-api-calls-tdd">Prompt 2: Comprehensive (Image Gen API calls + TDD)</h3>
<pre><code>use context7 to search for the latest google gemini image generation API docs.
I want you to write a python script that uses the google-genai SDK to generate
images using the gemini-3.1-flash-preview model (nano banana). use TDD
red-green methodology, write failing tests first then make them pass. do not
use any mock tests. use uv run to run everything so we don't install anything
in the current environment. test the script and if it works fine and generates
an image, then use this script to run image generation on the five prompts
given in prompts.json. save the images to an images folder, make sure the
folder exists, if it doesn't then create it.</code></pre>
<p>This is a slightly heavier multi-step workflow. The model has to:</p>
<ul>
<li>Look up the Gemini image generation API docs via Context7</li>
<li>Write a Python script using the google-genai SDK</li>
<li>Follow TDD red-green methodology (write failing tests first, then make them pass)</li>
<li>Use real API calls, no mocks</li>
<li>Use <code>uv run</code> for dependencies</li>
<li>Read <code>prompts.json</code>, generate images for all five prompts</li>
<li>Handle file I/O (create output directory, save images)</li>
</ul>
<p>The idea is to see how well the model handles and correctly executes multi-step workflows.</p>
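<p>The file-handling half of this workflow is easy to get subtly wrong, so here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the layout of <code>prompts.json</code> (taken to be a flat list of strings), the output file naming, and the <code>generate</code> callable, which stands in for the real image API call purely to keep the demo offline.</p>

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_prompts(path: Path) -> list:
    """Read prompts from JSON; assumes the file holds a flat list of strings."""
    return json.loads(path.read_text(encoding="utf-8"))


def run_batch(prompts: list, out_dir: Path, generate: Callable[[str], bytes]) -> list:
    """Generate one image per prompt and save it, creating out_dir if missing."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, prompt in enumerate(prompts, start=1):
        target = out_dir / f"image_{index:02d}.png"
        target.write_bytes(generate(prompt))
        saved.append(target)
    return saved


with tempfile.TemporaryDirectory() as tmp:
    prompts_file = Path(tmp) / "prompts.json"
    prompts_file.write_text(json.dumps(["a red kite", "a blue door"]), encoding="utf-8")
    # Placeholder generator: a real run would return image bytes from the API here.
    saved = run_batch(load_prompts(prompts_file), Path(tmp) / "images", lambda p: p.encode())
    saved_names = [path.name for path in saved]
    folder_created = (Path(tmp) / "images").is_dir()
print(saved_names)
```

<p>Using <code>mkdir(parents=True, exist_ok=True)</code> covers both branches of the "make sure the folder exists" instruction in a single call.</p>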
</section>
<section id="results-gemma4-26b-a4b" class="level3">
<h3 class="anchored" data-anchor-id="results-gemma4-26b-a4b">Results: Gemma4-26B-A4B</h3>
<p><strong>VRAM:</strong> ~21 GB &nbsp;&nbsp;|&nbsp;&nbsp; <strong>Context:</strong> 256K tokens</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Prompt 1</th>
<th>Prompt 2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Prefill tok/s</td>
<td>4,338</td>
<td>4,560</td>
</tr>
<tr class="even">
<td>Generation tok/s</td>
<td>135.5</td>
<td>134.4</td>
</tr>
<tr class="odd">
<td>Total prompt tokens processed</td>
<td>17,847</td>
<td>23,204</td>
</tr>
<tr class="even">
<td>Total tokens generated</td>
<td>1,623</td>
<td>3,435</td>
</tr>
<tr class="odd">
<td>Prompt processing time</td>
<td>4.11s</td>
<td>5.09s</td>
</tr>
<tr class="even">
<td>Generation time</td>
<td>11.98s</td>
<td>25.56s</td>
</tr>
<tr class="odd">
<td>API calls</td>
<td>10</td>
<td>13</td>
</tr>
<tr class="even">
<td>Tool calls</td>
<td>7</td>
<td>11</td>
</tr>
<tr class="odd">
<td>Correct on turn</td>
<td>1st</td>
<td>3rd</td>
</tr>
</tbody>
</table>
<blockquote class="blockquote">
<p><strong>API calls</strong> is the number of API calls Open Code makes to the LLM.</p>
</blockquote>
<ul>
<li>It is fast: 135 tok/s generation, and its 4.3K+ tok/s prefill is the fastest of all four models.</li>
<li>It is also the most concise model by far, based on generated tokens.</li>
<li><strong>Needed 3 attempts on Prompt 2.</strong> Despite being the fastest and most concise, it struggled with the multi-step instructions.</li>
</ul>
</section>
<section id="results-gemma4-31b" class="level3">
<h3 class="anchored" data-anchor-id="results-gemma4-31b">Results: Gemma4-31B</h3>
<p><strong>VRAM:</strong> ~24 GB &nbsp;&nbsp;|&nbsp;&nbsp; <strong>Context:</strong> 65K tokens</p>
<blockquote class="blockquote">
<p>Note: I had to drop the context size from 128K to 65K to maintain reasonable generation speed. At 128K, generation speed degrades to around 10 tok/s; the best speed is only achievable at around 65K tokens.</p>
</blockquote>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Prompt 1</th>
<th>Prompt 2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Prefill tok/s</td>
<td>1,466</td>
<td>1,357</td>
</tr>
<tr class="even">
<td>Generation tok/s</td>
<td>37.7</td>
<td>35.2</td>
</tr>
<tr class="odd">
<td>Total prompt tokens processed</td>
<td>16,618</td>
<td>25,070</td>
</tr>
<tr class="even">
<td>Total tokens generated</td>
<td>2,903</td>
<td>5,968</td>
</tr>
<tr class="odd">
<td>Prompt processing time</td>
<td>11.34s</td>
<td>18.48s</td>
</tr>
<tr class="even">
<td>Generation time</td>
<td>77.07s</td>
<td>169.53s</td>
</tr>
<tr class="odd">
<td>API calls</td>
<td>10</td>
<td>16</td>
</tr>
<tr class="even">
<td>Tool calls</td>
<td>8</td>
<td>14</td>
</tr>
<tr class="odd">
<td>Correct on turn</td>
<td>1st</td>
<td>1st</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>Got Prompt 2 correct on the first turn.</strong> The dense model's reliability on the complex task was noticeably better.</li>
<li>The model generated nearly twice as many tokens as the MoE variant (2,903 vs 1,623 on Prompt 1), including 1,548 reasoning tokens.</li>
<li>The 65K context limit is a real practical limitation. I am not sure whether this speed degradation at higher context will be solved in the future.</li>
</ul>
</section>
<section id="results-qwen3.5-35b-a3b" class="level3">
<h3 class="anchored" data-anchor-id="results-qwen3.5-35b-a3b">Results: Qwen3.5-35B-A3B</h3>
<p><strong>VRAM:</strong> ~23 GB &nbsp;&nbsp;|&nbsp;&nbsp; <strong>Context:</strong> 200K tokens</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Prompt 1</th>
<th>Prompt 2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Prefill tok/s</td>
<td>3,179</td>
<td>3,056</td>
</tr>
<tr class="even">
<td>Generation tok/s</td>
<td>136.7</td>
<td>132.3</td>
</tr>
<tr class="odd">
<td>Total prompt tokens processed</td>
<td>16,145</td>
<td>92,375</td>
</tr>
<tr class="even">
<td>Total tokens generated</td>
<td>7,564</td>
<td>32,904</td>
</tr>
<tr class="odd">
<td>Prompt processing time</td>
<td>5.08s</td>
<td>30.23s</td>
</tr>
<tr class="even">
<td>Generation time</td>
<td>55.32s</td>
<td>248.75s</td>
</tr>
<tr class="odd">
<td>API calls</td>
<td>13</td>
<td>30</td>
</tr>
<tr class="even">
<td>Tool calls</td>
<td>11</td>
<td>28</td>
</tr>
<tr class="odd">
<td>Correct on turn</td>
<td>1st</td>
<td>2nd</td>
</tr>
</tbody>
</table>
<ul>
<li>The generation speed is nearly identical to Gemma4-26B-A4B (136.7 vs 135.5 tok/s on Prompt 1).</li>
<li>However, the model is <strong>extremely verbose</strong>: ~7.6K and ~33K tokens generated on Prompts 1 and 2.</li>
<li>Prompt 2 was the most intensive run of the entire benchmark: 30 API calls and 28 tool calls.</li>
<li><strong>Got Prompt 2 correct on the 2nd turn</strong>, better than Gemma4-26B-A4B (3rd turn).</li>
<li>The 248.75s generation time on Prompt 2 is a direct result of the 32,904 tokens generated across that many API and tool calls.</li>
</ul>
</section>
<section id="results-qwen3.5-27b" class="level3">
<h3 class="anchored" data-anchor-id="results-qwen3.5-27b">Results: Qwen3.5-27B</h3>
<p><strong>VRAM:</strong> ~21 GB &nbsp;&nbsp;|&nbsp;&nbsp; <strong>Context:</strong> 130K tokens</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Prompt 1</th>
<th>Prompt 2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Prefill tok/s</td>
<td>2,474</td>
<td>2,188</td>
</tr>
<tr class="even">
<td>Generation tok/s</td>
<td>44.9</td>
<td>44.6</td>
</tr>
<tr class="odd">
<td>Total prompt tokens processed</td>
<td>15,043</td>
<td>24,385</td>
</tr>
<tr class="even">
<td>Total tokens generated</td>
<td>2,867</td>
<td>11,824</td>
</tr>
<tr class="odd">
<td>Prompt processing time</td>
<td>6.08s</td>
<td>11.14s</td>
</tr>
<tr class="even">
<td>Generation time</td>
<td>63.91s</td>
<td>265.00s</td>
</tr>
<tr class="odd">
<td>API calls</td>
<td>9</td>
<td>18</td>
</tr>
<tr class="even">
<td>Tool calls</td>
<td>7</td>
<td>14</td>
</tr>
<tr class="odd">
<td>Correct on turn</td>
<td>1st</td>
<td>1st</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>Got Prompt 2 correct on the first turn.</strong> Same as Gemma4-31B.</li>
<li>Most efficient session on Prompt 1, with the fewest API calls (9) and tool calls (7).</li>
<li>Generation at 44.9 tok/s is slower than the MoE models but faster than Gemma4-31B (37.7).</li>
<li>130K context fits comfortably in VRAM. This is a practical sweet spot with a decent context size.</li>
</ul>
</section>
</section>
<section id="comparing-the-performance" class="level2">
<h2 class="anchored" data-anchor-id="comparing-the-performance">Comparing the Performance</h2>
<section id="summary-tables" class="level3">
<h3 class="anchored" data-anchor-id="summary-tables">Summary Tables</h3>
<section id="speed" class="level4">
<h4 class="anchored" data-anchor-id="speed">Speed</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Prefill tok/s (P1)</th>
<th>Prefill tok/s (P2)</th>
<th>Gen tok/s (P1)</th>
<th>Gen tok/s (P2)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Gemma4-26B-A4B</td>
<td><strong>4,338</strong></td>
<td><strong>4,560</strong></td>
<td>135.5</td>
<td><strong>134.4</strong></td>
</tr>
<tr class="even">
<td>Qwen3.5-35B-A3B</td>
<td>3,179</td>
<td>3,056</td>
<td><strong>136.7</strong></td>
<td>132.3</td>
</tr>
<tr class="odd">
<td>Gemma4-31B</td>
<td>1,466</td>
<td>1,357</td>
<td>37.7</td>
<td>35.2</td>
</tr>
<tr class="even">
<td>Qwen3.5-27B</td>
<td>2,474</td>
<td>2,188</td>
<td>44.9</td>
<td>44.6</td>
</tr>
</tbody>
</table>
</section>
<section id="efficiency-and-completion" class="level4">
<h4 class="anchored" data-anchor-id="efficiency-and-completion">Efficiency and Completion</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 14%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Tokens Gen (P1)</th>
<th>Tokens Gen (P2)</th>
<th>API Calls (P1)</th>
<th>API Calls (P2)</th>
<th>Tool Calls (P2)</th>
<th>Correct Turn (P2)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Gemma4-26B-A4B</td>
<td><strong>1,623</strong></td>
<td><strong>3,435</strong></td>
<td>10</td>
<td><strong>13</strong></td>
<td>11</td>
<td>3rd</td>
</tr>
<tr class="even">
<td>Qwen3.5-35B-A3B</td>
<td>7,564</td>
<td>32,904</td>
<td>13</td>
<td>30</td>
<td>28</td>
<td>2nd</td>
</tr>
<tr class="odd">
<td>Gemma4-31B</td>
<td>2,903</td>
<td>5,968</td>
<td>10</td>
<td>16</td>
<td>14</td>
<td>1st</td>
</tr>
<tr class="even">
<td>Qwen3.5-27B</td>
<td>2,867</td>
<td>11,824</td>
<td><strong>9</strong></td>
<td>18</td>
<td>14</td>
<td>1st</td>
</tr>
</tbody>
</table>
</section>
<section id="hardware-fit-rtx-4090-24-gb" class="level4">
<h4 class="anchored" data-anchor-id="hardware-fit-rtx-4090-24-gb">Hardware Fit (RTX 4090 24 GB)</h4>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>VRAM Usage</th>
<th>Max Context</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Gemma4-26B-A4B</td>
<td>~21 GB</td>
<td>256,000</td>
</tr>
<tr class="even">
<td>Qwen3.5-35B-A3B</td>
<td>~23 GB</td>
<td>200,000</td>
</tr>
<tr class="odd">
<td>Qwen3.5-27B</td>
<td>~21 GB</td>
<td>130,672</td>
</tr>
<tr class="even">
<td>Gemma4-31B</td>
<td>~24 GB</td>
<td>65,336</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="code-quality" class="level3">
<h3 class="anchored" data-anchor-id="code-quality">Code Quality</h3>
<p>I looked at the working code each model produced for Prompt 2 (the Nano Banana image generation task) and used Opus to compare them on structure, error handling, TDD compliance, API correctness and overall cleanliness.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Aspect</th>
<th>Gemma4-26B-A4B</th>
<th>Gemma4-31B</th>
<th>Qwen3.5-35B-A3B</th>
<th>Qwen3.5-27B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Structure</strong></td>
<td>2 files, basic separation</td>
<td>3 files, clean separation</td>
<td>Class-based with helpers, cleanest design</td>
<td>3 files + dead <code>main.py</code> stub</td>
</tr>
<tr class="even">
<td><strong>Error handling</strong></td>
<td>Minimal, no API error handling</td>
<td>Poor, no try/except around API</td>
<td>Adequate but no batch error recovery</td>
<td>Weak, silent failures</td>
</tr>
<tr class="odd">
<td><strong>TDD</strong></td>
<td>Placeholder test, no real TDD</td>
<td>One integration test, superficial</td>
<td>Integration tests only, claimed but not real</td>
<td>Integration tests only, claimed but not real</td>
</tr>
<tr class="even">
<td><strong>Cleanliness</strong></td>
<td>Acceptable, concise</td>
<td>Good, readable, concise</td>
<td>Good structure but unused <code>base64</code> import</td>
<td>Good docstrings, type hints, pathlib usage</td>
</tr>
<tr class="odd">
<td><strong>Critical issues</strong></td>
<td>Broken summary, no <code>uv run</code> setup</td>
<td>New client per API call</td>
<td><strong>Hardcoded API key in tests</strong>, wrong model</td>
<td>Dead <code>main.py</code>, new client per call</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>None of the models truly followed TDD.</strong> All of them claimed red-green methodology in their summaries but wrote integration tests that hit the real API. No model used mocks or wrote genuinely failing tests first.</li>
<li><strong>Qwen3.5-27B produced the most correct code.</strong> It got the model name right, used type hints and docstrings, used pathlib properly and had the cleanest overall implementation. Its issues (dead <code>main.py</code> stub, client created per call) are minor compared to the others.</li>
<li><strong>Qwen3.5-35B-A3B had the best code structure</strong> with a proper class-based design, but committed a security sin by hardcoding an API key in the test file and used the wrong model name entirely. For a task that specifically asked for <code>gemini-3.1-flash-preview</code>, using <code>gemini-2.5-flash-image</code> is a correctness failure.</li>
<li><strong>Gemma4-31B was clean and concise</strong> but shallow. The code is minimal and readable, but it has no error handling and only superficial testing.</li>
<li><strong>Gemma4-26B-A4B was the weakest.</strong> It was missing a critical API parameter, produced a broken summary file and had no <code>uv run</code> integration despite being asked for it. This lines up with it needing 3 attempts to get working code.</li>
</ul>
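<p>For contrast with the integration tests described above, here is a minimal sketch of a unit test that swaps the real API client for a mock, so it can run offline and be written before the implementation exists. The function <code>generate_image</code> and the <code>client.generate(...)</code> method are hypothetical stand-ins. They are not the Gemini SDK's API and not code that any of the four models produced.</p>

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from unittest.mock import Mock


def generate_image(client, prompt: str, out_path: Path) -> Path:
    """Hypothetical function under test: ask the client for image bytes and save them."""
    image_bytes = client.generate(prompt=prompt)  # hypothetical client method
    out_path.write_bytes(image_bytes)
    return out_path


def test_generate_image_writes_file() -> None:
    # The mock stands in for the real API client, so no network call is made.
    fake_client = Mock()
    fake_client.generate.return_value = b"fake-png-bytes"

    with TemporaryDirectory() as tmp:
        out = generate_image(fake_client, "a nano banana", Path(tmp) / "out.png")
        assert out.read_bytes() == b"fake-png-bytes"

    # The client was called exactly once with the prompt we passed in.
    fake_client.generate.assert_called_once_with(prompt="a nano banana")


test_generate_image_writes_file()
```

<p>In a red-green workflow this test would be written first and fail until <code>generate_image</code> is implemented. The same test also runs under pytest if the final direct call is removed.</p>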
</section>
<section id="takeaways" class="level3">
<h3 class="anchored" data-anchor-id="takeaways">Takeaways</h3>
<section id="speed-and-efficiency" class="level4">
<h4 class="anchored" data-anchor-id="speed-and-efficiency">Speed and Efficiency</h4>
<ul>
<li><strong>Dense models were more reliable on the complex task.</strong> Both Qwen3.5-27B and Gemma4-31B got Prompt 2 right on the first turn. Both MoE models needed retries. Two data points are not a conclusion, but it is a pattern worth noting.</li>
<li><strong>MoE speed advantage is real but verbosity can eat it up.</strong> Both MoE models hit ~135 tok/s generation vs ~35-45 tok/s for dense. But Qwen3.5-35B-A3B generated 32,904 tokens on Prompt 2, which means ~249 seconds of generation even at MoE speeds. Gemma4-26B-A4B was the only model that was both fast and concise.</li>
<li><strong>Gemma4-26B-A4B is the speed king.</strong> If you are doing high-volume simpler tasks where first-try reliability matters less, it is hard to beat.</li>
</ul>
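<p>The trade-off between speed and verbosity can be made concrete by dividing tokens generated by generation speed. This is a rough sketch using the Prompt 2 figures from the summary tables. It estimates decode time only and ignores prefill and tool execution time.</p>

```python
# Prompt 2 figures copied from the summary tables: (tokens generated, generation tok/s).
prompt2 = {
    "Gemma4-26B-A4B": (3_435, 134.4),
    "Qwen3.5-35B-A3B": (32_904, 132.3),
    "Gemma4-31B": (5_968, 35.2),
    "Qwen3.5-27B": (11_824, 44.6),
}


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Estimated decode time: tokens generated divided by generation speed."""
    return tokens / tok_per_s


estimates = {name: generation_seconds(*row) for name, row in prompt2.items()}
for name, seconds in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: ~{seconds:.0f}s of generation")
```

<p>By this estimate Gemma4-26B-A4B needs roughly 26 seconds of generation on Prompt 2, while Qwen3.5-35B-A3B needs roughly 249 seconds at almost the same tok/s.</p>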
</section>
<section id="code-quality-1" class="level4">
<h4 class="anchored" data-anchor-id="code-quality-1">Code Quality</h4>
<ul>
<li><strong>Qwen3.5-27B produced the most correct and cleanest code overall.</strong> Right model name, type hints, docstrings, pathlib usage. Its issues are minor compared to every other model.</li>
<li><strong>None of the models truly followed TDD.</strong> All claimed red-green methodology but wrote integration tests hitting the real API. No mocks, no genuinely failing tests first.</li>
<li><strong>Better structure does not mean better code.</strong> Qwen3.5-35B-A3B had the cleanest design (class-based) but hardcoded an API key and used the wrong model name. Structure alone is not enough.</li>
</ul>
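<p>Two of the issues flagged above, the hardcoded API key and a new client per call, have a common fix: read the key from the environment and build the client once. This is a sketch under assumptions. <code>ImageClient</code> and the <code>IMAGE_API_KEY</code> variable name are hypothetical placeholders, not the real SDK client or its expected variable.</p>

```python
import os


class ImageClient:
    """Hypothetical stand-in for an SDK client that holds auth state."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


_client = None  # module-level cache so the client is built only once


def get_client(env_var: str = "IMAGE_API_KEY") -> ImageClient:
    """Return a shared client, reading the key from the environment instead of source code."""
    global _client
    if _client is None:
        api_key = os.environ.get(env_var)
        if not api_key:
            raise RuntimeError(f"Set {env_var} before running")
        _client = ImageClient(api_key)
    return _client
```

<p>Every call to <code>get_client()</code> after the first returns the same object, and nothing secret ends up in the test files.</p>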
</section>
<section id="bottom-line" class="level4">
<h4 class="anchored" data-anchor-id="bottom-line">Bottom Line</h4>
<ul>
<li><strong>Qwen3.5-27B feels like the best overall pick for agentic coding on a 4090.</strong>
<ul>
<li>Reliable: got the complex task right on the first try</li>
<li>130K context is a practical sweet spot for long agentic sessions without maxing out the card</li>
<li>44.9 tok/s is slower than MoE but fast enough for interactive use</li>
<li>Most efficient on the simple task (fewest API calls)</li>
<li>Only uses ~21 GB VRAM, leaving headroom</li>
<li>Produced the most correct and cleanest code of all four models</li>
</ul></li>
</ul>
<blockquote class="blockquote">
<p>These are notes from a single benchmarking session with two prompts and my experience over the last 2 days. I am not claiming any of this is statistically rigorous.</p>
</blockquote>
</section>
</section>
</section>
<section id="appendix" class="level2">
<h2 class="anchored" data-anchor-id="appendix">Appendix</h2>
<section id="appendix-a-hardware-fit-and-server-config" class="level3">
<h3 class="anchored" data-anchor-id="appendix-a-hardware-fit-and-server-config">A: Hardware Fit and Server Config</h3>
<section id="llama-server-launch-config" class="level4">
<h4 class="anchored" data-anchor-id="llama-server-launch-config">llama-server Launch Config</h4>
<p>Base config used for all models:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">llama-server</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-2">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$MODEL_PATH</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--jinja</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> 100.80.101.103 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> 8001 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--parallel</span> 1 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--batch-size</span> 2048 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ubatch-size</span> 512 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-k</span> q8_0 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-v</span> q8_0 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--flash-attn</span> on <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--context-shift</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--metrics</span></span></code></pre></div></div>
</section>
<section id="per-model-overrides" class="level4">
<h4 class="anchored" data-anchor-id="per-model-overrides">Per-Model Overrides</h4>
<p><strong>Gemma4 models:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--ctx-size</span> 256000     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># MoE (26B-A4B)</span></span>
<span id="cb5-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--ctx-size</span> 65336      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dense (31B) - reduced due to VRAM constraints</span></span>
<span id="cb5-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--temp</span> 1.0</span>
<span id="cb5-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--top-p</span> 0.95</span>
<span id="cb5-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--top-k</span> 64</span>
<span id="cb5-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--min-p</span> 0.00</span></code></pre></div></div>
<p><strong>Qwen 3.5 models:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--ctx-size</span> 200000     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># MoE (35B-A3B)</span></span>
<span id="cb6-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--ctx-size</span> 130672     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dense (27B)</span></span>
<span id="cb6-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--temp</span> 0.6</span>
<span id="cb6-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--top-k</span> 20</span>
<span id="cb6-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--chat-template-file</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$TEMPLATES_DIR</span>/qwen35-chat-template-corrected.jinja</span>
<span id="cb6-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">--chat-template-kwargs</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{"enable_thinking":true}'</span></span></code></pre></div></div>
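<p>To show how the base config and the per-model overrides fit together, here is a small Python sketch that assembles the full <code>llama-server</code> argument list for one model. The flags and values are copied from the blocks above. The dictionary keys, the example model path and the default <code>templates</code> directory are placeholders chosen for illustration.</p>

```python
import shlex

# Flags shared by every model, copied from the base config above.
BASE_FLAGS = [
    "--jinja", "--host", "100.80.101.103", "--port", "8001",
    "--parallel", "1", "--batch-size", "2048", "--ubatch-size", "512",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--flash-attn", "on", "--context-shift", "--metrics",
]

GEMMA_SAMPLING = ["--temp", "1.0", "--top-p", "0.95", "--top-k", "64", "--min-p", "0.00"]
QWEN_SAMPLING = ["--temp", "0.6", "--top-k", "20"]

# Context sizes from the per-model overrides; the keys are labels chosen for this sketch.
CTX_SIZE = {
    "gemma4-26b-a4b": "256000",
    "gemma4-31b": "65336",
    "qwen3.5-35b-a3b": "200000",
    "qwen3.5-27b": "130672",
}


def build_command(model: str, model_path: str, templates_dir: str = "templates") -> list[str]:
    """Assemble the llama-server argv for one model from base flags plus overrides."""
    cmd = ["llama-server", "--model", model_path, *BASE_FLAGS, "--ctx-size", CTX_SIZE[model]]
    if model.startswith("gemma4"):
        cmd += GEMMA_SAMPLING
    else:
        cmd += QWEN_SAMPLING + [
            "--chat-template-file", f"{templates_dir}/qwen35-chat-template-corrected.jinja",
            "--chat-template-kwargs", '{"enable_thinking":true}',
        ]
    return cmd


# Placeholder model path; prints a shell-quoted command line.
print(shlex.join(build_command("qwen3.5-27b", "/models/qwen3.5-27b.gguf")))
```

<p>Keeping the shared flags in one place makes it harder for the four launch commands to drift apart between benchmark runs.</p>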
</section>
</section>
<section id="appendix-b-installation" class="level3">
<h3 class="anchored" data-anchor-id="appendix-b-installation">B: Installation</h3>
<ol type="1">
<li><strong>llama.cpp Installation</strong></li>
</ol>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update</span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev libssl-dev <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-y</span></span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/ggml-org/llama.cpp</span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> llama.cpp <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> pull origin master <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ..</span>
<span id="cb7-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> llama.cpp <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-B</span> llama.cpp/build <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DBUILD_SHARED_LIBS</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>OFF <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DGGML_CUDA</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ON <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DLLAMA_CURL</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ON</span>
<span id="cb7-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--build</span> llama.cpp/build <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--config</span> Release <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-j24</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--clean-first</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--target</span> llama-cli llama-mtmd-cli llama-server llama-gguf-split llama-bench</span>
<span id="cb7-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cp</span> llama.cpp/build/bin/llama-<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">*</span> llama.cpp</span></code></pre></div></div>
<ol start="2" type="1">
<li><p><strong>Tokenscope</strong>: I used the <a href="https://www.npmjs.com/package/@ramtinj95/opencode-tokenscope">opencode-tokenscope</a> plugin to get per-session token breakdowns. You need to add <code>"plugin": ["@ramtinj95/opencode-tokenscope"]</code> to your <code>opencode.json</code>, then create a <code>/tokenscope</code> slash command in <code>~/.config/opencode/command/tokenscope.md</code>.</p></li>
<li><p><strong>llama-server /metrics</strong>: llama-server exposes a <code>/metrics</code> endpoint (enabled with the <code>--metrics</code> flag) that returns Prometheus-format counters.</p></li>
</ol>
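<p>Prometheus-format output is plain text with one <code>name value</code> sample per line plus <code># HELP</code> and <code># TYPE</code> comment lines, so it is easy to read programmatically. Here is a minimal parsing sketch that handles only simple samples without labels. The metric names and values in <code>SAMPLE</code> are illustrative assumptions, so check the actual output of your own <code>/metrics</code> endpoint.</p>

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse simple `name value` samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


# Illustrative sample only: names and numbers are assumptions, not captured output.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1000
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 250
"""

for name, value in parse_prometheus_text(SAMPLE).items():
    print(f"{name} = {value:g}")
```

<p>Since these are counters, reading them before and after a session and subtracting gives per-session totals.</p>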
</section>
<section id="appendix-c-troubleshooting" class="level3">
<h3 class="anchored" data-anchor-id="appendix-c-troubleshooting">C: Troubleshooting</h3>
<ol type="1">
<li><strong>Qwen3.5-35B-A3B todowrite Parse Error</strong>: Qwen3.5-35B-A3B sometimes returned tool call arguments as a raw JSON string instead of a parsed object. This caused the <code>todowrite</code> tool to fail because Open Code expected <code>todos</code> to be an array, not a string containing an array. You can fix this using a small plugin at <code>~/.opencode/plugins/todo-fix-plugins.ts</code>:</li>
</ol>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode typescript code-with-copy"><code class="sourceCode typescript"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">export</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> TodoFixPlugin <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> (ctx) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">=&gt;</span> {</span>
<span id="cb8-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb8-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tool.execute.before"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> (input<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> output) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">=&gt;</span> {</span>
<span id="cb8-4">      <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (input<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">===</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"todowrite"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">typeof</span> output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">args</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">todos</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">===</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span>) {</span>
<span id="cb8-5">        output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">args</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">todos</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">JSON</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">parse</span>(output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">args</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">todos</span>)</span>
<span id="cb8-6">      }</span>
<span id="cb8-7">    }</span>
<span id="cb8-8">  }</span>
<span id="cb8-9">}</span></code></pre></div></div>
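<p>The underlying problem is double-encoded JSON: the <code>todos</code> value arrives as a string that itself contains a JSON array. The same normalization can be sketched in Python, with a guard for strings that are not valid JSON. Python is used here only for illustration, so this is not a drop-in replacement for the plugin, and the <code>content</code> field in the usage example is a made-up todo shape.</p>

```python
import json


def normalize_todos(args: dict) -> dict:
    """Return args with `todos` parsed into a list when it arrived as a JSON string."""
    todos = args.get("todos")
    if isinstance(todos, str):
        try:
            parsed = json.loads(todos)
        except json.JSONDecodeError:
            return args  # leave malformed input untouched so the tool reports the error
        if isinstance(parsed, list):
            return {**args, "todos": parsed}
    return args


# Hypothetical tool arguments: a string containing an array instead of an array.
broken = {"todos": '[{"content": "write tests"}]'}
print(normalize_todos(broken))
```

<p>Arguments that already hold a list pass through unchanged, so the fix is safe to apply to every <code>todowrite</code> call.</p>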
<ol start="2" type="1">
<li><p><strong>Gemma4-31B Context Size</strong>: I had to reduce the context to 65,336 tokens to maintain ~35-38 tok/s generation. You can push it higher, but generation speed degrades as context grows.</p></li>
<li><p><strong>Qwen3.5 Chat Template</strong>: Qwen3.5 models needed a corrected Jinja chat template (<a href="https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96">qwen35-chat-template-corrected.jinja</a>) to work properly with llama-server. The default template had issues with thinking mode.</p></li>
</ol>


</section>
</section>

 ]]></description>
  <category>Local LLMs</category>
  <guid>https://garg-aayush.github.io/posts/2026-04-05-qwen35-vs-gemma4/</guid>
  <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Self-Hosted Gemma 4 Chat with Web UI</title>
  <link>https://garg-aayush.github.io/posts/2026-04-03-self-hosted-gemma4-chat/</link>
  <description><![CDATA[ 




<p>These are the steps to set up a self-hosted Gemma 4 chat with a web UI that you can use from your phone and laptop, keeping all your data and models private. It is just llama.cpp’s built-in web UI served over Tailscale.</p>
<blockquote class="blockquote">
<p>This post is based on my <a href="https://gist.github.com/garg-aayush/5c93e167831330c5d7c96dbc8541ef80">gist</a> which I keep as context for future reference.</p>
</blockquote>
<p>The setup gives you:</p>
<ul>
<li>A chat interface accessible from any device on your Tailscale network</li>
<li>Web search via MCP so the model can look things up (important since models have a knowledge cutoff)</li>
<li>Streaming responses, conversation history and the same UI everywhere</li>
</ul>
<p>Here it is running on my iPhone:</p>
<center>
<div class="quarto-video"><iframe data-external="1" src="https://www.youtube.com/embed/azq2ADEMKQA" width="250" height="500" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</center>
<section id="my-setup" class="level2">
<h2 class="anchored" data-anchor-id="my-setup">My Setup</h2>
<ul>
<li>RTX 4090 GPU server running Ubuntu with CUDA installed</li>
<li><a href="https://tailscale.com/">Tailscale</a> set up on the server and all my devices (phone, laptop)</li>
<li><a href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF">gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf</a> from the Unsloth HuggingFace repo. If you also want vision/image support, use <code>mmproj-BF16.gguf</code> from the same repo.</li>
<li>The Q4_K_XL quant fits on a 4090 even with full 256K context at approximately 20.5 GB VRAM</li>
</ul>
</section>
<section id="build-llama.cpp" class="level2">
<h2 class="anchored" data-anchor-id="build-llama.cpp">1. Build llama.cpp</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev libssl-dev <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-y</span></span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/ggml-org/llama.cpp</span>
<span id="cb1-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> llama.cpp</span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> . <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-B</span> build <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DBUILD_SHARED_LIBS</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>OFF <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DGGML_CUDA</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ON <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DLLAMA_CURL</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ON</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--build</span> build <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--config</span> Release <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-j</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$(</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nproc</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">)</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--clean-first</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--target</span> llama-server</span></code></pre></div></div>
<p>Verify OpenSSL is linked:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ldd</span> build/bin/llama-server <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-i</span> ssl</span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Should show: libssl.so.3 =&gt; /lib/x86_64-linux-gnu/libssl.so.3</span></span></code></pre></div></div>
<blockquote class="blockquote">
<p><strong>Note:</strong> OpenSSL is needed because the MCP proxy makes HTTPS calls to external servers (like Exa for web search). Without it, you’ll get a 500 error about <code>CPPHTTPLIB_OPENSSL_SUPPORT</code> not being defined.</p>
</blockquote>
</section>
<section id="create-the-mcp-config" class="level2">
<h2 class="anchored" data-anchor-id="create-the-mcp-config">2. Create the MCP Config</h2>
<p>Before launching the server, create a config file that sets up web search and a system prompt with today’s date.</p>
<p>The <code>systemMessage</code> is important. I found that without it, the model usually won’t initiate a web search on its own when you ask for current information or facts. It just responds with its training data. The system prompt with today’s date nudges it to actually use the search tools.</p>
<p>Create <code>~/MODELS/templates/llamacpp-webui-chat-template.json</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"systemMessage"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant. Today's date is {{DATE}}. When the user asks for current or recent information, use the available search tools to find up-to-date answers rather than relying on your training data."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"mcpServers"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb3-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb3-5">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"url"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://mcp.exa.ai/mcp?exaApiKey={{EXA_API_KEY}}"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"exa"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"useProxy"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb3-8">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"enabled"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span></span>
<span id="cb3-9">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb3-10">  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb3-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>I use Exa because they give you 1,000 free searches per month (no credit card required). You can get your API key at <a href="https://dashboard.exa.ai">dashboard.exa.ai</a>. Other web search MCP options:</p>
<ul>
<li><a href="https://github.com/nicholasgriffintn/brave-search-mcp">Brave Search MCP</a></li>
<li><a href="https://tavily.com/">Tavily</a></li>
</ul>
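<p>A typo in this file is easy to miss, so it may be worth checking that the template parses as JSON before moving on. Here is a minimal sketch, assuming <code>python3</code> is installed. The function name <code>check_json</code> is hypothetical and not part of llama.cpp. The placeholders sit inside string values, so the unrendered template is still valid JSON:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># Hypothetical helper: succeed only if the file parses as JSON.
# Assumes python3 is on PATH; json.tool exits non-zero on a syntax error.
check_json() {
    python3 -m json.tool "$1" &gt; /dev/null
}</code></pre></div></div>
<p>For example, <code>check_json ~/MODELS/templates/llamacpp-webui-chat-template.json</code> prints nothing and returns 0 when the file is well-formed, and prints the parser error otherwise.</p>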
</section>
<section id="create-the-launch-script" class="level2">
<h2 class="anchored" data-anchor-id="create-the-launch-script">3. Create the Launch Script</h2>
<p>I use a wrapper script that injects today’s date and the API key into the config, then starts the server. This way the date stays fresh on every restart.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#!/bin/bash</span></span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># export EXA_API_KEY="" or source from bashrc/zshrc</span></span>
<span id="cb4-4"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">TEMPLATE_FILE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/templates/llamacpp-webui-chat-template.json</span>
<span id="cb4-5"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">CONFIG_FILE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/templates/temp-<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$(</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">basename</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$TEMPLATE_FILE)</span></span>
<span id="cb4-6"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">HOSTNAME</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;YOUR_TAILSCALE_IP&gt;"</span></span>
<span id="cb4-7"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">PORT</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>8001</span>
<span id="cb4-8"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">CONTEXT_SIZE</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>65536</span>
<span id="cb4-9"></span>
<span id="cb4-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sed</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"s/{{DATE}}/</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$(</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">date</span> +%Y-%m-%d<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">/"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"s/{{EXA_API_KEY}}/</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$EXA_API_KEY</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">/"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$TEMPLATE_FILE</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$CONFIG_FILE</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb4-13"></span>
<span id="cb4-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># start the server</span></span>
<span id="cb4-15"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">MODEL_PATH</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf</span>
<span id="cb4-16"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">MMPROJ_PATH</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf</span>
<span id="cb4-17"></span>
<span id="cb4-18"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">./llama.cpp/build/bin/llama-server</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$MODEL_PATH</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-20">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--mmproj</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$MMPROJ_PATH</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--jinja</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-22">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$HOSTNAME</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-23">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$PORT</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-24">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ctx-size</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$CONTEXT_SIZE</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--parallel</span> 1 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-ngl</span> 999 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-27">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--batch-size</span> 2048 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-28">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ubatch-size</span> 512 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-29">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--temp</span> 1.0 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-p</span> 0.95 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-31">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-k</span> 64 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-32">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-k</span> q8_0 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-v</span> q8_0 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-33">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--flash-attn</span> on <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-34">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--context-shift</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-35">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--metrics</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-36">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--webui-mcp-proxy</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-37">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--webui-config-file</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$CONFIG_FILE</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span></code></pre></div></div>
<p>This assumes Tailscale is already set up on your system. Replace <code>&lt;YOUR_TAILSCALE_IP&gt;</code> with your server’s Tailscale IP (find it with <code>tailscale ip -4</code>).</p>
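<p>The <code>sed</code> call in the script uses <code>/</code> as its delimiter and does not check that <code>EXA_API_KEY</code> is set, so an empty or unusual key would silently produce a broken config. A minimal sketch of a more defensive rendering step could look like the following. The helper name <code>render_config</code> is hypothetical (it is not part of llama.cpp), and it assumes the same <code>{{DATE}}</code> and <code>{{EXA_API_KEY}}</code> placeholders as the template in Section 2:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># Hypothetical helper: print the rendered config to stdout.
# Usage: render_config TEMPLATE_FILE API_KEY
render_config() {
    local template="$1" key="$2" today escaped_key

    # Refuse to render without a key; otherwise the URL ends up with an empty exaApiKey.
    if [ -z "$key" ]; then
        echo "render_config: API key is empty" &gt;&amp;2
        return 1
    fi

    today="$(date +%Y-%m-%d)"

    # Escape the characters that are special in a sed replacement
    # (backslash, ampersand) plus the | delimiter used below.
    escaped_key="$(printf '%s' "$key" | sed -e 's/[\\&amp;|]/\\&amp;/g')"

    sed -e "s|{{DATE}}|${today}|" \
        -e "s|{{EXA_API_KEY}}|${escaped_key}|" \
        "$template"
}</code></pre></div></div>
<p>In the launch script this would replace the <code>sed</code> block with something like <code>render_config "$TEMPLATE_FILE" "$EXA_API_KEY" &gt; "$CONFIG_FILE" || exit 1</code>, so the server never starts with a half-rendered config.</p>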
<p><strong>What the key flags do:</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 54%">
<col style="width: 45%">
</colgroup>
<thead>
<tr class="header">
<th>Flag</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>--jinja</code></td>
<td>Required for tool-call formatting via the model’s chat template</td>
</tr>
<tr class="even">
<td><code>--webui-mcp-proxy</code></td>
<td>Enables the CORS proxy so the web UI can reach external MCP servers</td>
</tr>
<tr class="odd">
<td><code>--webui-config-file</code></td>
<td>Bakes MCP config server-side so it persists across restarts</td>
</tr>
<tr class="even">
<td><code>-ngl 999</code></td>
<td>Offloads all layers to GPU</td>
</tr>
<tr class="odd">
<td><code>--ctx-size 65536</code></td>
<td>64K context window. You can go up to 256K on a 4090 but 64K is plenty for chat</td>
</tr>
<tr class="even">
<td><code>--temp 1.0 --top-p 0.95 --top-k 64</code></td>
<td>Google’s recommended sampling defaults for Gemma 4</td>
</tr>
</tbody>
</table>
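<p>Save the script as <code>~/start-server.sh</code>. It calls <code>./llama.cpp/build/bin/llama-server</code> by relative path, so run it from the directory that contains your <code>llama.cpp</code> checkout. Start the server:</p>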
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span> ~/start-server.sh</span></code></pre></div></div>
<blockquote class="blockquote">
<p><strong>Note:</strong> The full 256K context size works on a 4090 with Q4_K_XL, but I don’t think it’s needed for chat. I usually run with 64K or 128K.</p>
</blockquote>
</section>
<section id="connect-from-your-devices" class="level2">
<h2 class="anchored" data-anchor-id="connect-from-your-devices">4. Connect from Your Devices</h2>
<p>Open a browser on your phone or laptop and go to:</p>
<pre><code>http://&lt;YOUR_TAILSCALE_IP&gt;:8001</code></pre>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-04-03/snapshot-iphone.jpg" class="img-fluid figure-img" style="width:50.0%"></p>
<figcaption>llama.cpp web UI running Gemma 4 on iPhone over Tailscale</figcaption>
</figure>
</div>
</section>
<section id="verify-web-search-is-working" class="level2">
<h2 class="anchored" data-anchor-id="verify-web-search-is-working">5. Verify Web Search Is Working</h2>
<p>The MCP config should be loaded automatically, but it’s worth verifying:</p>
<ol type="1">
<li>Open the web UI and go to <strong>MCP server settings</strong></li>
<li>You should see the Exa entry already configured and enabled</li>
<li>Send a message like <em>“What happened in tech news today?”</em></li>
<li>The model should trigger a search tool call and cite results</li>
</ol>
<p>To confirm tools are being sent, open your browser’s DevTools, go to the Network tab, send a message, and click the <code>completions</code> request. The request payload should contain a <code>tools</code> array.</p>
<blockquote class="blockquote">
<p><strong>Note:</strong> If the model says “I can’t search the web” or “my knowledge cutoff is January 2025”, the MCP toggle may have auto-disabled itself. Edit the MCP entry in settings and flip the toggle back ON.</p>
</blockquote>
</section>
<section id="enable-vision-optional" class="level2">
<h2 class="anchored" data-anchor-id="enable-vision-optional">6. Enable Vision (Optional)</h2>
<p>The launch script in Section 3 already includes the <code>--mmproj</code> flag, so vision is enabled by default. If you don’t need it, remove the <code>--mmproj $MMPROJ_PATH</code> line from the script.</p>
<p>The web UI will automatically show an image upload button when vision is enabled.</p>
<blockquote class="blockquote">
<p><strong>Note:</strong> The BF16 projector takes roughly 800 MB to 1 GB of VRAM. If VRAM is tight, add <code>--no-mmproj-offload</code> to keep it on the CPU (image processing is slightly slower, but it saves VRAM).</p>
</blockquote>
</section>
<section id="troubleshooting" class="level2">
<h2 class="anchored" data-anchor-id="troubleshooting">Troubleshooting</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 42%">
<col style="width: 33%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th>Problem</th>
<th>Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>500: <code>CPPHTTPLIB_OPENSSL_SUPPORT is not defined</code></td>
<td>Built without OpenSSL</td>
<td>Rebuild with <code>-DLLAMA_CURL=ON</code> and <code>libssl-dev</code> installed</td>
</tr>
<tr class="even">
<td>Model says “I can’t search” or “my knowledge cutoff is…”</td>
<td>MCP toggle auto-disabled or system prompt missing</td>
<td>Re-enable MCP toggle in settings, check <code>systemMessage</code> in config</td>
</tr>
<tr class="odd">
<td>No <code>tools</code> array in request payload</td>
<td>MCP server not connected</td>
<td>Check Connection Log, enable “Use llama-server proxy” via edit icon</td>
</tr>
<tr class="even">
<td>MCP toggle keeps turning itself off</td>
<td>Connection fails on startup</td>
<td>Use <code>--webui-config-file</code> (Section 3) instead of manual UI config</td>
</tr>
<tr class="odd">
<td>Model ignores tools even though they’re in payload</td>
<td>Chat template not applied</td>
<td>Make sure <code>--jinja</code> flag is set</td>
</tr>
<tr class="even">
<td><code>Failed to fetch</code> in Connection Log</td>
<td>CORS blocking direct request</td>
<td>Enable “Use llama-server proxy” on the MCP entry</td>
</tr>
<tr class="odd">
<td>Can’t reach UI from phone</td>
<td>Wrong bind address</td>
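<td>Set <code>--host</code> to your Tailscale IP. <code>127.0.0.1</code> only accepts connections from the server itself, and <code>0.0.0.0</code> listens on every network interface rather than just the tailnet</td>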
</tr>
</tbody>
</table>
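<p>For the bind-address row, a quick sanity check can catch a mistyped <code>--host</code> value before the server starts. Tailscale assigns IPv4 addresses from the <code>100.64.0.0/10</code> range, so the first octet is 100 and the second is between 64 and 127. The sketch below is a rough check only (it does not validate the last two octets), and the function name <code>is_tailscale_ip</code> is hypothetical:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># Hypothetical helper: succeed if the argument looks like an IPv4 address
# in 100.64.0.0/10 (first octet 100, second octet 64..127).
is_tailscale_ip() {
    local ip="$1" second

    # The first octet must be exactly 100.
    case "$ip" in
        100.*.*.*) ;;
        *) return 1 ;;
    esac

    second="${ip#100.}"      # strip the first octet
    second="${second%%.*}"   # keep only the second octet

    # The second octet must be a non-empty run of digits.
    case "$second" in
        ''|*[!0-9]*) return 1 ;;
    esac

    if [ "$second" -lt 64 ]; then return 1; fi
    if [ "$second" -gt 127 ]; then return 1; fi
    return 0
}</code></pre></div></div>
<p>On the server, <code>is_tailscale_ip "$(tailscale ip -4)"</code> should succeed, while values such as <code>127.0.0.1</code> or <code>0.0.0.0</code> are rejected.</p>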


</section>

 ]]></description>
  <category>Local LLMs</category>
  <guid>https://garg-aayush.github.io/posts/2026-04-03-self-hosted-gemma4-chat/</guid>
  <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Using a local LLM in OpenCode with llama.cpp</title>
  <link>https://garg-aayush.github.io/posts/2026-03-29-local-llm-opencode/</link>
  <description><![CDATA[ 




<p>This post covers the full setup for running a local LLM (<a href="https://huggingface.co/unsloth/Qwen3.5-27B-GGUF">Qwen3.5-27B</a>) with <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> and using it as an <a href="https://opencode.ai">OpenCode</a> provider.</p>
<p>I have focused on actually getting it to work well with agentic coding tools like OpenCode/Codex. When you try to do that, you run into a bunch of choices and gotchas: <strong>which model variant, which quantization, why the chat template breaks with tool-calling, how much context you can actually fit on your GPU, and so on.</strong> I have made sure to cover all of these so that you can set it up whether your setup is similar to mine or different.</p>
<p>My setup is an RTX 4090 workstation running the model, my personal MacBook as the client, and <a href="https://tailscale.com">Tailscale</a> connecting the two.</p>
<blockquote class="blockquote">
<p>If you already know how to set up a local model and use it with OpenCode, I would recommend skipping to “Reasoning and things I learned along the way”. There might be something new you can pick up.</p>
</blockquote>
<section id="step-1-build-llama.cpp-on-your-gpu-machine" class="level2">
<h2 class="anchored" data-anchor-id="step-1-build-llama.cpp-on-your-gpu-machine">Step 1: Build llama.cpp on your GPU machine</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-y</span></span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/ggml-org/llama.cpp</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> llama.cpp <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-B</span> llama.cpp/build <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DBUILD_SHARED_LIBS</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>OFF <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-DGGML_CUDA</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ON</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmake</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--build</span> llama.cpp/build <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--config</span> Release <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-j4</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--clean-first</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--target</span> llama-cli llama-mtmd-cli llama-server llama-gguf-split</span></code></pre></div></div>
<p>Pass an explicit job count such as <code>-j2</code>, <code>-j4</code>, or <code>-j8</code> instead of a bare <code>-j</code>. This limits the number of parallel compile jobs and helps avoid OOM errors during compilation.</p>
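<p>If you would rather derive the job count than guess it, one option is to cap it by both CPU cores and available RAM. The sketch below assumes roughly 2 GB of RAM per compile job, which is only a rule of thumb I am using for illustration, not a figure from the llama.cpp docs. The helper name <code>pick_jobs</code> is hypothetical:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"># Hypothetical helper: pick a -j value from core count and RAM.
# Usage: pick_jobs CORES MEM_GB [GB_PER_JOB]
# GB_PER_JOB defaults to 2, an assumed rule of thumb.
pick_jobs() {
    local cores="$1" mem_gb="$2" per_job="${3:-2}" by_mem n

    by_mem=$(( mem_gb / per_job ))   # how many jobs the RAM can sustain
    n="$cores"

    # Take the smaller of the two limits, but never go below one job.
    if [ "$by_mem" -lt "$n" ]; then n="$by_mem"; fi
    if [ "$n" -lt 1 ]; then n=1; fi

    echo "$n"
}</code></pre></div></div>
<p>You could then build with <code>cmake --build llama.cpp/build --config Release -j"$(pick_jobs "$(nproc)" 32)"</code>, replacing <code>32</code> with your machine’s RAM in GB.</p>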
</section>
<section id="step-2-install-tailscale-on-both-machines" class="level2">
<h2 class="anchored" data-anchor-id="step-2-install-tailscale-on-both-machines">Step 2: Install Tailscale on both machines</h2>
<blockquote class="blockquote">
<p>If you are running everything on the same machine, you can skip this and just use <code>127.0.0.1</code>.</p>
</blockquote>
<section id="on-the-gpu-machine-rtx-4090" class="level3">
<h3 class="anchored" data-anchor-id="on-the-gpu-machine-rtx-4090">On the GPU machine (RTX 4090)</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install Tailscale</span></span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fsSL</span> https://tailscale.com/install.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sh</span></span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start with SSH enabled and authenticate</span></span>
<span id="cb2-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> tailscale up <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ssh</span></span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Enable on boot</span></span>
<span id="cb2-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> systemctl enable tailscaled</span>
<span id="cb2-9"></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check your IP and hostname</span></span>
<span id="cb2-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tailscale</span> status</span></code></pre></div></div>
</section>
<section id="on-your-macbook" class="level3">
<h3 class="anchored" data-anchor-id="on-your-macbook">On your MacBook</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install via Homebrew (or get it from the Mac App Store)</span></span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cask</span> tailscale</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Open the app, log in from the menu bar icon, then:</span></span>
<span id="cb3-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> tailscale up <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ssh</span></span></code></pre></div></div>
<p>If everything worked, you should be able to ping your GPU machine from your MacBook using its Tailscale IP and see both devices connected to your Tailscale network:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-03-29/tailscale-status.png" class="img-fluid" style="width:75.0%"></p>
</section>
</section>
<section id="step-3-download-the-qwen3.5-27b-gguf-model" class="level2">
<h2 class="anchored" data-anchor-id="step-3-download-the-qwen3.5-27b-gguf-model">Step 3: Download the Qwen3.5-27B GGUF model</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mkdir</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-p</span> ~/MODELS</span>
<span id="cb4-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/MODELS</span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">uv</span> run <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--with</span> huggingface_hub<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">cli</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span> hf download unsloth/Qwen3.5-27B-GGUF <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--local-dir</span> unsloth/Qwen3.5-27B-GGUF <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--include</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*mmproj-F16*"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb4-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--include</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*UD-Q4_K_XL*"</span></span>
<span id="cb4-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-</span></span></code></pre></div></div>
<p><a href="https://docs.astral.sh/uv/"><code>uv run</code></a> with <code>--with</code> pulls <code>huggingface_hub[cli]</code> into a temporary environment, so you don’t need to install it into your venv separately.</p>
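<p>Before moving on, it can help to confirm that both files landed where the next step expects them. The sketch below assumes the file names used in Step 4 (<code>Qwen3.5-27B-UD-Q4_K_XL.gguf</code> and <code>mmproj-F16.gguf</code>) sit at the top level of the download directory. <code>check_gguf_files</code> is a hypothetical helper name.</p>

```bash
# Report whether the two GGUF files from the download step exist and are non-empty.
# check_gguf_files is a hypothetical helper; the file names are the ones Step 4 uses.
check_gguf_files() {
    local dir="$1" missing=0 f
    for f in Qwen3.5-27B-UD-Q4_K_XL.gguf mmproj-F16.gguf; do
        if [ -s "$dir/$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status 0 only when both files were found.
    return "$missing"
}
```

<p>For example: <code>check_gguf_files ~/MODELS/unsloth/Qwen3.5-27B-GGUF</code>.</p>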
</section>
<section id="step-4-test-the-llama.cpp-server-locally" class="level2">
<h2 class="anchored" data-anchor-id="step-4-test-the-llama.cpp-server-locally">Step 4: Test the llama.cpp server locally</h2>
<p>Start the server on localhost first to make sure everything works:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">QWEN35_27B_MODEL_PATH</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/unsloth/Qwen3.5-27B-GGUF</span>
<span id="cb5-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">./llama.cpp/build/bin/llama-server</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$QWEN35_27B_MODEL_PATH</span>/Qwen3.5-27B-UD-Q4_K_XL.gguf <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--mmproj</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$QWEN35_27B_MODEL_PATH</span>/mmproj-F16.gguf <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> 127.0.0.1 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> 8001 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ctx-size</span> 16384 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--temp</span> 0.6 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-p</span> 0.95 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-k</span> 20 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb5-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--min-p</span> 0.00</span></code></pre></div></div>
<p>Test it from another terminal:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> http://127.0.0.1:8001/v1/chat/completions <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb6-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-H</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Content-Type: application/json"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb6-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-H</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Authorization: Bearer sk-no-key-required"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb6-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-d</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{</span></span>
<span id="cb6-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "model": "Qwen3.5-27B",</span></span>
<span id="cb6-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "messages": [</span></span>
<span id="cb6-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      {"role": "user", "content": "What is 2+2?"}</span></span>
<span id="cb6-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    ]</span></span>
<span id="cb6-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  }'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python3</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> json.tool</span></code></pre></div></div>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-03-29/local-server-test.png" class="img-fluid" style="width:100.0%"></p>
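<p>If you only want the answer text rather than the full JSON, a small filter can pull it out of the response. This is a minimal sketch that assumes the usual OpenAI-compatible response shape (<code>choices[0].message.content</code>). <code>extract_reply</code> is a hypothetical helper name.</p>

```bash
# Read a chat completion response on stdin and print only the assistant's reply.
# extract_reply is a hypothetical helper; it assumes the OpenAI-compatible
# response layout: choices[0].message.content.
extract_reply() {
    python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

<p>To use it, replace <code>python3 -m json.tool</code> at the end of the <code>curl</code> command above with <code>extract_reply</code>.</p>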
</section>
<section id="step-5-start-the-server-on-the-tailscale-ip" class="level2">
<h2 class="anchored" data-anchor-id="step-5-start-the-server-on-the-tailscale-ip">Step 5: Start the server on the Tailscale IP</h2>
<p>Now start the server bound to your Tailscale IP so it is accessible from your MacBook:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">QWEN35_27B_MODEL_PATH</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/unsloth/Qwen3.5-27B-GGUF</span>
<span id="cb7-2"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">TEMPLATES_DIR</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>~/MODELS/templates</span>
<span id="cb7-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">./llama.cpp/build/bin/llama-server</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--model</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$QWEN35_27B_MODEL_PATH</span>/Qwen3.5-27B-UD-Q4_K_XL.gguf <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--jinja</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--chat-template-file</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$TEMPLATES_DIR</span>/qwen35-chat-template-corrected.jinja <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>YOUR_GPU_SERVER_IP<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> 8001 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ctx-size</span> 65536 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--parallel</span> 1 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--batch-size</span> 2048 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ubatch-size</span> 512 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--temp</span> 0.6 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-p</span> 0.95 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--top-k</span> 20 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--min-p</span> 0.00 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-k</span> bf16 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cache-type-v</span> bf16 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--flash-attn</span> on <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--context-shift</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-20">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--metrics</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb7-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--chat-template-kwargs</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{"enable_thinking":true}'</span></span></code></pre></div></div>
<ul>
<li>Replace <code>&lt;YOUR_GPU_SERVER_IP&gt;</code> with your GPU server’s IP (Tailscale IP if remote or <code>127.0.0.1</code> if local). Check with <code>tailscale status</code>.</li>
<li>I would recommend starting with a smaller <code>--ctx-size</code> (e.g., 16384) first to verify everything works. The server starts faster with less KV cache to allocate, so you can catch misconfigurations quickly. Once confirmed, restart with your target context size.</li>
<li>The sampling parameters (<code>--temp 0.6</code>, <code>--top-p 0.95</code>, <code>--top-k 20</code>) are the <a href="https://unsloth.ai/docs/models/qwen3.5#thinking-mode">recommended values from Qwen3.5</a> for thinking mode with precise coding tasks.</li>
<li>I chose <code>--ctx-size 65536</code> because at this context length the total VRAM usage sits around 22 GB (including the model) on a 24 GB card. I could probably go about 10K tokens higher, but this leaves enough breathing room to avoid OOM on longer prompts or during prefill spikes.</li>
</ul>
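<p>To get a feel for how context length drives VRAM, here is a back-of-the-envelope estimate of KV cache size for standard attention layers: two tensors (K and V) per layer, each holding <code>kv_heads * head_dim</code> values per token. This is only a sketch. <code>kv_cache_mib</code> is a hypothetical helper, and the dimensions in the example are placeholders, not the real Qwen3.5-27B values. For a hybrid model, only the full-attention layers would count toward <code>layers</code>.</p>

```bash
# Estimate KV cache size in MiB for one sequence.
# Arguments: layers kv_heads head_dim ctx_tokens bytes_per_element
# (bytes_per_element is 2 for bf16 or f16).
# kv_cache_mib is a hypothetical helper; pass your model's real dimensions.
kv_cache_mib() {
    local layers="$1" kv_heads="$2" head_dim="$3" ctx="$4" bytes="$5"
    # Factor 2 = one K tensor plus one V tensor per layer.
    echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Placeholder dimensions (NOT the real Qwen3.5-27B values):
# 16 attention layers, 4 KV heads, head_dim 256, 65536 tokens, bf16.
kv_cache_mib 16 4 256 65536 2   # prints 4096
```

<p>Because the estimate is linear in <code>ctx_tokens</code>, halving the context halves the cache, which is why a small <code>--ctx-size</code> is a cheap first test.</p>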
<section id="about-the-corrected-chat-template" class="level3">
<h3 class="anchored" data-anchor-id="about-the-corrected-chat-template">About the corrected chat template</h3>
<p>The <code>--chat-template-file</code> flag overrides the template embedded in the GGUF. The corrected template fixes system message ordering that tools like OpenCode and Codex depend on. Without the fix, the model may misinterpret tool-calling system prompts. The <code>--jinja</code> flag is required for the template and thinking toggle to work. You can grab the corrected template <a href="https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96">here</a>.</p>
</section>
<section id="test-it-over-tailscale" class="level3">
<h3 class="anchored" data-anchor-id="test-it-over-tailscale">Test it over Tailscale</h3>
<p>From your MacBook:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> http://<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>YOUR_GPU_SERVER_HOSTNAME<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>:8001/v1/chat/completions <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-H</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Content-Type: application/json"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-H</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Authorization: Bearer sk-no-key-required"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-d</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{</span></span>
<span id="cb8-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "model": "Qwen3.5-27B",</span></span>
<span id="cb8-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    "messages": [</span></span>
<span id="cb8-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">      {"role": "user", "content": "What is 2+2?"}</span></span>
<span id="cb8-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    ]</span></span>
<span id="cb8-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">  }'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python3</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> json.tool</span></code></pre></div></div>
<p>You can use either the Tailscale hostname or IP.</p>
<div class="callout callout-style-default callout-note callout-titled" title="Understanding the llama.cpp flags">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Understanding the llama.cpp flags
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 68%">
</colgroup>
<thead>
<tr class="header">
<th>Flag</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>--model</code></td>
<td>Path to quantized model weights</td>
</tr>
<tr class="even">
<td><code>--jinja</code></td>
<td>Enable Jinja2 template engine (needed for thinking toggle)</td>
</tr>
<tr class="odd">
<td><code>--chat-template-file</code></td>
<td>Patched template that fixes system message ordering for OpenCode/Codex</td>
</tr>
<tr class="even">
<td><code>--host</code></td>
<td>IP to bind the server to (Tailscale IP for remote access)</td>
</tr>
<tr class="odd">
<td><code>--port</code></td>
<td>Port to listen on</td>
</tr>
<tr class="even">
<td><code>--ctx-size</code></td>
<td>Max context window in tokens (default 262K would OOM)</td>
</tr>
<tr class="odd">
<td><code>--parallel</code></td>
<td>Number of concurrent request slots (each reserves its own KV cache)</td>
</tr>
<tr class="even">
<td><code>--batch-size</code></td>
<td>Tokens scheduled per prompt processing chunk</td>
</tr>
<tr class="odd">
<td><code>--ubatch-size</code></td>
<td>Tokens hitting GPU at once (controls peak VRAM during prefill)</td>
</tr>
<tr class="even">
<td><code>--temp</code></td>
<td>Sampling temperature (0.6 for precise coding, 1.0 for general)</td>
</tr>
<tr class="odd">
<td><code>--top-p</code></td>
<td>Nucleus sampling cutoff</td>
</tr>
<tr class="even">
<td><code>--top-k</code></td>
<td>Keep top K tokens before sampling</td>
</tr>
<tr class="odd">
<td><code>--min-p</code></td>
<td>Minimum probability threshold (disabled at 0.00)</td>
</tr>
<tr class="even">
<td><code>--cache-type-k/v</code></td>
<td>KV cache precision (bf16 works best for hybrid architectures)</td>
</tr>
<tr class="odd">
<td><code>--flash-attn</code></td>
<td>Reduces VRAM usage and speeds up attention computation</td>
</tr>
<tr class="even">
<td><code>--context-shift</code></td>
<td>Auto-trims oldest tokens when context fills up</td>
</tr>
<tr class="odd">
<td><code>--metrics</code></td>
<td>Enables the Prometheus-compatible <code>/metrics</code> endpoint for performance stats (e.g. tokens/s)</td>
</tr>
<tr class="even">
<td><code>--chat-template-kwargs</code></td>
<td>Enable thinking/reasoning mode by default</td>
</tr>
</tbody>
</table>
<p><strong>Some flags worth understanding in more detail:</strong></p>
<ul>
<li><p><strong><code>--ctx-size</code></strong> must be set explicitly. If omitted, llama.cpp tries to allocate the full 262K context window from the model metadata. On a 24GB card, this will OOM immediately.</p></li>
<li><p><strong><code>--parallel</code></strong> is more expensive than it looks. Each slot gets its own KV cache. <code>--parallel 4</code> with <code>--ctx-size 16384</code> allocates 4 separate 16K KV caches. For single-user OpenCode, <code>--parallel 1</code> is the right choice.</p></li>
<li><p><strong><code>--batch-size</code> and <code>--ubatch-size</code></strong> only affect prompt ingestion, not generation. These matter when sending large system prompts or codebases as context. <code>--ubatch-size</code> controls peak VRAM during prefill. If you OOM only on large prompts (not during generation), reduce <code>--ubatch-size</code> first.</p></li>
<li><p><strong><code>--cache-type-k bf16 --cache-type-v bf16</code></strong> is the safe choice for Qwen3.5.</p></li>
<li><p><strong><code>--context-shift</code></strong> silently drops the oldest tokens when context fills up. For coding workflows this can be dangerous since the model might lose your original instructions. OpenCode manages its own context so this acts as a safety net.</p></li>
<li><p><strong><code>--chat-template-file</code></strong> overrides the embedded template completely. If the GGUF ships an updated template in a future release, you won’t get those improvements unless you re-extract and re-patch.</p></li>
</ul>
</div>
</div>
</div>
</section>
</section>
<section id="step-6-add-the-provider-in-opencode" class="level2">
<h2 class="anchored" data-anchor-id="step-6-add-the-provider-in-opencode">Step 6: Add the provider in OpenCode</h2>
<p>Update <code>~/.config/opencode/opencode.json</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"$schema"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://opencode.ai/config.json"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb9-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"provider"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-4">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"llama-local"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-5">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Llama.cpp (RTX4090)"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb9-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"npm"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"@ai-sdk/openai-compatible"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb9-7">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"options"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-8">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"baseURL"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://&lt;YOUR_GPU_SERVER_IP&gt;:8001/v1"</span></span>
<span id="cb9-9">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb9-10">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"models"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-11">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"unsloth/Qwen3.5-27B-GGUF"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb9-12">          <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Qwen3.5-27B Q4_K_XL"</span></span>
<span id="cb9-13">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb9-14">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb9-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb9-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb9-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>If everything worked, you should see the local model available in OpenCode’s model selector:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-03-29/models-in-opencode.png" class="img-fluid" style="width:50.0%"></p>
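<p>If the model does not show up, one quick thing to rule out is a malformed config or a <code>baseURL</code> that does not match the host and port <code>llama-server</code> is listening on. The sketch below parses the config and prints each provider with its <code>baseURL</code>. <code>opencode_base_urls</code> is a hypothetical helper, and it assumes the file is plain JSON without comments.</p>

```bash
# Print "provider-name baseURL" for every provider in an OpenCode config file.
# opencode_base_urls is a hypothetical helper; it assumes plain JSON (no comments).
# A JSON syntax error makes python3 exit non-zero with a traceback.
opencode_base_urls() {
    python3 -c '
import json, sys
with open(sys.argv[1]) as fh:
    config = json.load(fh)
for name, provider in config.get("provider", {}).items():
    print(name, provider.get("options", {}).get("baseURL", "(none)"))
' "$1"
}
```

<p>For example: <code>opencode_base_urls ~/.config/opencode/opencode.json</code>.</p>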
</section>
<section id="trying-it-out" class="level2">
<h2 class="anchored" data-anchor-id="trying-it-out">Trying it out</h2>
<p>I ran through a few prompts in OpenCode using the <code>Qwen3.5-27B</code> model to see how well it handles agentic coding tasks, tool calls and skills:</p>
<ol type="1">
<li>I ask it to write a Python script for Gemini image generation, pointing it at <code>context7</code> to fetch the latest docs</li>
<li>The model initially uses Gemini 2.5, so I tell it to switch to the 3.1 image generation model and it updates the script</li>
<li>I run the script with <code>uv run</code> to generate an image of a cat on a window sill</li>
<li>I use <code>/explain-code</code> (a custom skill) to have the model explain the generated script</li>
<li>Finally, I ask it to save the explanation as a readme</li>
</ol>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/Hc98yck9AXU" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><strong>The model handles all of this well. It picks up the tool calls, uses the skills correctly, follows up on corrections and produces working code. Honestly, for a 27B model running quantized on a single 4090, the quality is surprisingly good.</strong></p>
<p>For reference, here are the speeds I am getting across some of the sessions:</p>
<table class="caption-top table">
<tbody>
<tr class="odd">
<td>Prefill speed</td>
<td>~2400 tokens/s</td>
</tr>
<tr class="even">
<td>Generation speed</td>
<td>~40 tokens/s</td>
</tr>
</tbody>
</table>
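<p>To translate these rates into wall-clock time, divide the prompt length by the prefill speed and the output length by the generation speed. The sketch below hard-codes the approximate rates from the table. <code>estimate_seconds</code> is a hypothetical helper, and the token counts in the example are made up for illustration.</p>

```bash
# Rough request time in whole seconds: prefill time plus generation time.
# estimate_seconds is a hypothetical helper; the rates are the approximate
# figures from the table above, and integer division truncates fractions.
estimate_seconds() {
    local prompt_tokens="$1" gen_tokens="$2"
    local prefill_tps=2400 gen_tps=40
    echo $(( prompt_tokens / prefill_tps + gen_tokens / gen_tps ))
}

# Illustrative numbers: a 24000-token prompt and a 400-token reply.
estimate_seconds 24000 400   # prints 20
```

<p>In this example prefill and generation each take about 10 seconds, which shows how quickly output length starts to dominate at ~40 tokens/s.</p>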
</section>
<section id="using-it-with-codex" class="level2">
<h2 class="anchored" data-anchor-id="using-it-with-codex">Using it with Codex</h2>
<p>This setup also works with <a href="https://github.com/openai/codex">Codex</a>. Add this to your <code>~/.codex/config.toml</code> (refer to this <a href="https://github.com/ggml-org/llama.cpp/issues/14702#issuecomment-3825824862">thread</a> for more details):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode toml code-with-copy"><code class="sourceCode toml"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[model_providers.llama_cpp]</span></span>
<span id="cb10-2"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">name</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"llama_cpp API"</span></span>
<span id="cb10-3"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">base_url</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"http://&lt;YOUR_GPU_SERVER_IP&gt;:8001/v1"</span></span>
<span id="cb10-4"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">wire_api</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"responses"</span></span>
<span id="cb10-5"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">stream_idle_timeout_ms</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000000</span></span>
<span id="cb10-6"></span>
<span id="cb10-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[profiles.gpt-oss]</span></span>
<span id="cb10-8"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">model</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-oss"</span></span>
<span id="cb10-9"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">model_provider</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"llama_cpp"</span></span>
<span id="cb10-10"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">web_search</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"disabled"</span></span></code></pre></div></div>
<p>Then start Codex with:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb11-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">codex</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-p</span> gpt-oss</span></code></pre></div></div>
</section>
<section id="reasoning-and-things-i-learned-along-the-way" class="level2">
<h2 class="anchored" data-anchor-id="reasoning-and-things-i-learned-along-the-way">Reasoning and things I learned along the way</h2>
<p>Some of the choices I made and what I picked up in the process.</p>
<ul>
<li><strong>Run inference on a separate machine if you can.</strong> You don’t need two machines for this but if you have a personal GPU workstation or a Mac Mini, I would recommend running the model there instead of on your daily use machine. Running inference on your laptop eats into your available RAM, drains your battery fast and makes the laptop heat up badly.</li>
<li><strong>Why llama.cpp?</strong> I started with the <a href="https://unsloth.ai/docs/models/qwen3.5">Unsloth guide</a> which uses llama.cpp with the GGUF format. Since I am setting this up locally for myself, llama.cpp felt like an easier choice than vLLM.</li>
<li><strong>Why Qwen3.5-27B over 35B-A3B?</strong> The MoE variant is 3-5x faster (~60-100 tok/s) because only 3B parameters are active per token but the 27B has all 27B parameters active and <a href="https://www.reddit.com/r/LocalLLaMA/comments/1rivckt/visualizing_all_qwen_35_vs_qwen_3_benchmarks/">consistently scores higher across benchmarks</a>. For coding tasks, I preferred quality.</li>
<li><strong>Why UD-Q4_K_XL quantization?</strong> <a href="https://unsloth.ai/blog/dynamic-v2">Unsloth’s Dynamic 2.0</a> quantization selectively upcasts important layers to 8 or 16-bit precision, so you get better quality without paying the full VRAM cost of a higher quant. <a href="https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations">Benjamin Marie’s benchmarks</a> show UD-Q4_K_XL stays within a 1-point accuracy drop of the original while being ~8GB smaller than comparable quants.</li>
<li><strong>Hybrid architecture and KV cache.</strong> Qwen3.5 uses a Gated DeltaNet + Gated Attention hybrid architecture. Only every 4th layer has standard attention (16 out of 64 for 27B) and the rest use DeltaNet which maintains a fixed-size state regardless of context length. This makes the KV cache dramatically smaller than a pure transformer of the same size which is why 64K context fits on a 24 GB card at all. <img src="https://garg-aayush.github.io/static/img/blog-2026-03-29/kv-cache-comparison.png" class="img-fluid" style="width:75.0%"></li>
<li><strong>KV cache type.</strong> Qwen3.5 is trained in bfloat16, so bf16 is a better choice than llama.cpp’s default f16 given it has a better dynamic range. This <a href="https://www.reddit.com/r/LocalLLaMA/comments/1ryoab7/qwen_35_27b_quantize_kv_cache_or_not/">r/LocalLLaMA discussion</a> mentions that <code>q8_0</code> doesn’t seem to hurt quality too much but I haven’t tested it myself and decided to go with the safe option of <code>bf16</code>.</li>
<li><strong>Start with a small context size.</strong> Begin with <code>--ctx-size 16384</code> to verify everything works (correct IP, template path, model loading) before committing more VRAM. The server starts faster with a smaller KV cache, so you can iterate quickly on configuration issues.</li>
<li><strong>Use <code>-j4</code> instead of <code>-j</code> when building llama.cpp.</strong> The <code>-j</code> flag without a number spawns as many parallel compiler processes as it can. This can lead to an OOM kill (Error 137). Limiting it to a fixed number such as <code>-j2</code>, <code>-j4</code> or <code>-j8</code>, depending on your available RAM, avoids this.</li>
<li><strong>Use <code>uv run</code> for one-off CLI tools.</strong> <code>uv run --with huggingface_hub[cli]</code> lets you run <code>hf download</code> without installing the package into your venv. It keeps your environment clean.</li>
<li><strong>The chat template fix is critical for OpenCode/Codex.</strong> The default Qwen3.5 template throws a 500 error when OpenCode or Codex sends messages where the system message isn’t strictly first. The <a href="https://gist.github.com/garg-aayush/c0211a5fdca3e237d248d52806ff8d96">corrected template</a> removes this restriction. Without it, the server will reject most agentic tool-calling prompts.</li>
<li><strong>Use <a href="https://github.com/upstash/context7">Context7</a> with local models.</strong> Because of their size, smaller models are more likely to hallucinate APIs or use outdated syntax. They also rely much more heavily on the context you give them. Using Context7 to inject up-to-date documentation into the prompt makes a noticeable difference in code quality.</li>
</ul>
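<p>To make the hybrid-architecture point above a bit more concrete, here is a back-of-the-envelope sketch of the KV cache size. The 16-of-64 layer split and the 64K context come from the notes above. The KV head count and head dimension are placeholder values picked purely for illustration (they are not Qwen3.5’s actual configuration), and the fixed-size DeltaNet state is ignored.</p>

```python
def kv_cache_bytes(attn_layers, ctx_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for every full-attention layer."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * attn_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem


CTX = 64 * 1024   # 64K context
N_KV_HEADS = 4    # hypothetical placeholder, not the real model config
HEAD_DIM = 256    # hypothetical placeholder, not the real model config

# bytes_per_elem=2 corresponds to a bf16 (or f16) cache.
full = kv_cache_bytes(64, CTX, N_KV_HEADS, HEAD_DIM)    # if all 64 layers used attention
hybrid = kv_cache_bytes(16, CTX, N_KV_HEADS, HEAD_DIM)  # only every 4th layer does

print(f"all-attention: {full / 2**30:.1f} GiB")
print(f"hybrid:        {hybrid / 2**30:.1f} GiB")
print(f"ratio:         {full / hybrid:.0f}x")
```

<p>Whatever the real head configuration is, the ratio between the two cases stays at 4x because the cache size scales linearly with the number of attention layers.</p>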


</section>

 ]]></description>
  <category>Local LLMs</category>
  <guid>https://garg-aayush.github.io/posts/2026-03-29-local-llm-opencode/</guid>
  <pubDate>Sun, 29 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>FlashAttention: Making Attention I/O-Aware</title>
  <link>https://garg-aayush.github.io/posts/2026-03-27-flash-attention/</link>
  <description><![CDATA[ 




<p><a href="https://arxiv.org/abs/2205.14135">FlashAttention</a> is the default attention implementation across the stack. Whether you are training or running inference on GPUs and whether using MHA/GQA/MLA variants, you are almost certainly running a variant of it.</p>
<p>Standard attention is memory-bound: its implementation does not account for the GPU memory hierarchy and repeatedly shuffles large intermediate matrices between slow and fast GPU memory. FlashAttention addresses this by making attention <strong>IO-aware</strong>. It computes exact standard attention with the same numerical output but restructures the computation to minimize data movement between these memory levels. It does this through a combination of operator fusion, tiling, recomputation and a particularly elegant online softmax algorithm that computes softmax in a single pass without needing to see all the scores first. The result is faster training, longer context lengths and lower memory usage, all without approximation.</p>
<p>This post walks through the “why” behind each of these pieces along with a somewhat deeper discussion of the online softmax derivation. I also plan to follow this up with blogs on implementing FlashAttention in pure PyTorch and as a fused Triton kernel to build a deeper hands-on understanding of these ideas.</p>
<section id="the-gpu-memory-hierarchy" class="level2">
<h2 class="anchored" data-anchor-id="the-gpu-memory-hierarchy">The GPU Memory Hierarchy</h2>
<p>It is helpful to have a basic understanding of GPU memory hierarchy before diving into FlashAttention. A GPU has two levels of memory that matter here:</p>
<ul>
<li><p><strong>HBM (High Bandwidth Memory)</strong>: It is the GPU’s main (slow) memory, the one you see when you run <code>nvidia-smi</code>. It sits off the compute chip. An A100 has about 80 GB of HBM with a bandwidth of ~2 TB/s.</p></li>
<li><p><strong>SRAM (Static RAM)</strong>: It is the on-chip memory. An A100 has ~20 MB in total (spread across 108 SMs) with ~19 TB/s bandwidth. This is roughly <strong>10x</strong> the bandwidth of HBM but nearly <strong>4000x</strong> smaller in capacity.</p></li>
</ul>
<p>These numbers scale with each generation: an H100 SXM has 80 GB HBM3 at 3.35 TB/s, a B200 pushes to 192 GB HBM3e at 8 TB/s but the SRAM-to-HBM bandwidth gap persists across all of them. The principles apply regardless of which GPU you are on.</p>
<p>Every GPU kernel (operation) must load its inputs from HBM into SRAM, do the computation and write results back to HBM. The key intuition: <strong>SRAM is where compute happens. HBM is where data lives</strong>.</p>
<section id="compute-bound-vs.-memory-bound-operations" class="level3">
<h3 class="anchored" data-anchor-id="compute-bound-vs.-memory-bound-operations">Compute-bound vs.&nbsp;memory-bound operations</h3>
<p>Given these two memory levels, the operations on GPU can either be <code>compute-bound</code> or <code>memory-bound</code>.</p>
<ul>
<li><p><strong>Compute-bound</strong>: An operation where the GPU’s compute cores are the bottleneck because they cannot perform the arithmetic as fast as the data is fed to them. Typical examples are large matrix multiplications and convolutions.</p></li>
<li><p><strong>Memory-bound</strong>: An operation where the GPU’s memory bandwidth is the bottleneck, leaving the compute cores sitting idle while they wait for data to arrive from main memory (HBM &lt;-&gt; SRAM). Examples include elementwise operations (activations, dropout) and reductions (sum, softmax, batch norm, layer norm).</p></li>
</ul>
<p>The way to quantify this is <strong>arithmetic intensity</strong>: how many FLOPs the operation performs per byte it moves to/from HBM.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BArithmetic%20Intensity%7D%20=%20%5Cfrac%7B%5Ctext%7BFLOPs%7D%7D%7B%5Ctext%7BBytes%20accessed%20from%20HBM%7D%7D%0A"></p>
<p>Every GPU has a theoretical arithmetic intensity where compute time and memory transfer time are exactly balanced. For an A100, this is <img src="https://latex.codecogs.com/png.latex?%5Capprox"> <strong>156 FLOPs/byte</strong>. Operations above this threshold are compute-bound, operations below it are memory-bound. This framework is known as the <a href="https://en.wikipedia.org/wiki/Roofline_model">roofline model</a>.</p>
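<p>As a quick sanity check of the ~156 FLOPs/byte figure, the balance point can be worked out as peak compute divided by HBM bandwidth. This sketch assumes a dense FP16/BF16 tensor-core peak of 312 TFLOP/s for the A100 (a spec-sheet number that is not stated above) together with the ~2 TB/s bandwidth quoted earlier. The FLOP and byte counts for the two example operations are idealized estimates.</p>

```python
PEAK_FLOPS = 312e12     # assumed A100 dense FP16/BF16 tensor-core peak, FLOP/s
HBM_BANDWIDTH = 2e12    # ~2 TB/s, as quoted in the memory hierarchy section


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity at which compute time equals memory transfer time."""
    return peak_flops / bandwidth


def classify(flops, bytes_moved, ridge):
    """Return (arithmetic intensity, regime) for an operation."""
    intensity = flops / bytes_moved
    return intensity, ("compute-bound" if intensity > ridge else "memory-bound")


ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)

n = 4096
# Square fp16 matmul: 2*n^3 FLOPs; read A and B, write C (2 bytes per element).
matmul = classify(2 * n**3, 3 * n * n * 2, ridge)
# Elementwise fp16 activation on n*n elements: ~1 FLOP each; one read and one write.
elementwise = classify(n * n, 2 * n * n * 2, ridge)

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"matmul:      {matmul[0]:.1f} FLOPs/byte -> {matmul[1]}")
print(f"elementwise: {elementwise[0]:.2f} FLOPs/byte -> {elementwise[1]}")
```

<p>The large matmul lands well above the ridge point while the elementwise op sits orders of magnitude below it, which is the gap the rest of this post is about.</p>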
<blockquote class="blockquote">
<p>Note, <a href="https://horace.io/brrr_intro.html">Horace He’s “Making Deep Learning Go Brrrr”</a> is a great resource to understand this concept in detail. Also, remember a punchline from the blog, <em>if an operation is memory-bound, making it faster is not about fewer FLOPs. It is about moving fewer bytes.</em> This holds true for FlashAttention.</p>
</blockquote>
</section>
</section>
<section id="standard-attention-and-its-io-cost" class="level2">
<h2 class="anchored" data-anchor-id="standard-attention-and-its-io-cost">Standard Attention and Its IO Cost</h2>
<p>Now that we have the memory hierarchy picture, we can revisit the standard attention implementation, both its forward and backward passes.</p>
<section id="the-attention-formula" class="level3">
<h3 class="anchored" data-anchor-id="the-attention-formula">The attention formula</h3>
<p>Standard scaled dot-product attention computes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AO%20=%20%5Ctext%7Bsoftmax%7D%5C!%5Cleft(%5Cfrac%7BQK%5ET%7D%7B%5Csqrt%7Bd%7D%7D%5Cright)%20V%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%20%5Ctimes%20d%7D">, <img src="https://latex.codecogs.com/png.latex?N"> is the sequence length and <img src="https://latex.codecogs.com/png.latex?d"> is the head dimension. This is for a single attention head; multi-head attention just runs this independently across heads. Note that I am ignoring masking and dropout here to keep the IO analysis clean; they do not change the fundamental bottleneck.</p>
<p>The formula looks like one operation but in practice PyTorch executes it as a sequence of separate GPU kernels. Each kernel reads its inputs from HBM, does its computation in SRAM and writes the results back to HBM. Then the next kernel reads those results from HBM again.</p>
</section>
<section id="standard-forward-pass" class="level3">
<h3 class="anchored" data-anchor-id="standard-forward-pass">Standard forward pass</h3>
<blockquote class="blockquote">
<p><strong>Algorithm: Standard Attention Forward</strong></p>
<p>Require: <img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%20%5Ctimes%20d%7D"> in HBM.</p>
<ol type="1">
<li>Load <img src="https://latex.codecogs.com/png.latex?Q,%20K"> from HBM, compute <img src="https://latex.codecogs.com/png.latex?S%20=%20QK%5ET">, write <img src="https://latex.codecogs.com/png.latex?S"> to HBM.</li>
<li>Read <img src="https://latex.codecogs.com/png.latex?S"> from HBM, compute <img src="https://latex.codecogs.com/png.latex?P%20=%20%5Ctext%7Bsoftmax%7D(S)">, write <img src="https://latex.codecogs.com/png.latex?P"> to HBM.</li>
<li>Load <img src="https://latex.codecogs.com/png.latex?P"> and <img src="https://latex.codecogs.com/png.latex?V"> from HBM, compute <img src="https://latex.codecogs.com/png.latex?O%20=%20PV">, write <img src="https://latex.codecogs.com/png.latex?O"> to HBM.</li>
<li>Return <img src="https://latex.codecogs.com/png.latex?O">.</li>
</ol>
</blockquote>
<p>We have <strong>3 separate kernels and 3 HBM round-trips</strong>! The <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> attention matrix is read and written multiple times. The total HBM IO is <img src="https://latex.codecogs.com/png.latex?O(N%5E2)">, dominated by the sequence length <img src="https://latex.codecogs.com/png.latex?N">.</p>
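<p>The three steps can be sketched in NumPy as follows. This is only an illustration of the data flow, not how PyTorch is implemented: the <code>hbm_elems</code> counter is a hypothetical bookkeeping device that tallies the elements each “kernel” would read from or write to HBM, and I include the <img src="https://latex.codecogs.com/png.latex?1/%5Csqrt%7Bd%7D"> scaling from the formula even though the algorithm box omits it.</p>

```python
import numpy as np


def softmax_rows(S):
    """Numerically stable row-wise softmax."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)


def standard_attention_forward(Q, K, V):
    """Return O, P and a rough count of elements moved to/from HBM."""
    N, d = Q.shape
    hbm_elems = 0

    # Kernel 1: read Q and K, write the N x N score matrix S.
    S = (Q @ K.T) / np.sqrt(d)
    hbm_elems += 2 * N * d + N * N

    # Kernel 2: read S, write the N x N probability matrix P.
    P = softmax_rows(S)
    hbm_elems += 2 * N * N

    # Kernel 3: read P and V, write the N x d output O.
    O = P @ V
    hbm_elems += N * N + 2 * N * d

    return O, P, hbm_elems


rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O, P, hbm_elems = standard_attention_forward(Q, K, V)
print(hbm_elems, "elements moved;", 4 * N * N, "of them belong to N x N matrices")
```

<p>The counter adds up to <code>4*N*d + 4*N*N</code> elements, so for any realistic sequence length the <code>N*N</code> terms dominate.</p>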
</section>
<section id="standard-backward-pass" class="level3">
<h3 class="anchored" data-anchor-id="standard-backward-pass">Standard backward pass</h3>
<blockquote class="blockquote">
<p><strong>Algorithm: Standard Attention Backward</strong></p>
<p>Require: <img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V,%20dO%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%20%5Ctimes%20d%7D">, <img src="https://latex.codecogs.com/png.latex?P%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%20%5Ctimes%20N%7D"> in HBM.</p>
<ol type="1">
<li>Load <img src="https://latex.codecogs.com/png.latex?P,%20dO">, compute <img src="https://latex.codecogs.com/png.latex?dV%20=%20P%5ET%20dO">, write <img src="https://latex.codecogs.com/png.latex?dV"> to HBM.</li>
<li>Load <img src="https://latex.codecogs.com/png.latex?dO,%20V">, compute <img src="https://latex.codecogs.com/png.latex?dP%20=%20dO%20V%5ET">, write <img src="https://latex.codecogs.com/png.latex?dP"> to HBM.</li>
<li>Read <img src="https://latex.codecogs.com/png.latex?P,%20dP">, compute <img src="https://latex.codecogs.com/png.latex?dS_%7Bij%7D%20=%20P_%7Bij%7D(dP_%7Bij%7D%20-%20%5Csum_l%20P_%7Bil%7D%20dP_%7Bil%7D)">, write <img src="https://latex.codecogs.com/png.latex?dS"> to HBM.</li>
<li>Load <img src="https://latex.codecogs.com/png.latex?dS"> and <img src="https://latex.codecogs.com/png.latex?K">, compute <img src="https://latex.codecogs.com/png.latex?dQ%20=%20dS%20K">, write <img src="https://latex.codecogs.com/png.latex?dQ"> to HBM.</li>
<li>Load <img src="https://latex.codecogs.com/png.latex?dS"> and <img src="https://latex.codecogs.com/png.latex?Q">, compute <img src="https://latex.codecogs.com/png.latex?dK%20=%20dS%5ET%20Q">, write <img src="https://latex.codecogs.com/png.latex?dK"> to HBM.</li>
<li>Return <img src="https://latex.codecogs.com/png.latex?dQ,%20dK,%20dV">.</li>
</ol>
</blockquote>
<p>The backward pass generates even more HBM traffic: it reads and writes multiple <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> intermediates (<img src="https://latex.codecogs.com/png.latex?P">, <img src="https://latex.codecogs.com/png.latex?dP">, <img src="https://latex.codecogs.com/png.latex?dS">), giving <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> IO again.</p>
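<p>A minimal NumPy sketch of the backward steps, with a finite-difference check on one entry of <code>dQ</code>. To stay consistent with the algorithm box, the helper <code>forward</code> uses unscaled scores <code>S = Q @ K.T</code>. The function names and the tiny problem size are my own choices for illustration.</p>

```python
import numpy as np


def forward(Q, K, V):
    """Unscaled standard attention, matching the algorithm box: returns O and P."""
    S = Q @ K.T
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V, P


def standard_attention_backward(Q, K, V, P, dO):
    dV = P.T @ dO                                    # step 1
    dP = dO @ V.T                                    # step 2
    row_dot = (P * dP).sum(axis=-1, keepdims=True)   # sum_l P_il * dP_il per row
    dS = P * (dP - row_dot)                          # step 3: softmax backward
    dQ = dS @ K                                      # step 4
    dK = dS.T @ Q                                    # step 5
    return dQ, dK, dV


rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V, dO = (rng.standard_normal((N, d)) for _ in range(4))
O, P = forward(Q, K, V)
dQ, dK, dV = standard_attention_backward(Q, K, V, P, dO)

# Central finite difference of loss = sum(O * dO) with respect to Q[0, 0].
eps = 1e-6
Q_plus = Q.copy()
Q_plus[0, 0] += eps
Q_minus = Q.copy()
Q_minus[0, 0] -= eps
numeric = ((forward(Q_plus, K, V)[0] - forward(Q_minus, K, V)[0]) * dO).sum() / (2 * eps)
print("analytic:", dQ[0, 0], "numeric:", numeric)
```

<p>Note that <code>P</code>, <code>dP</code> and <code>dS</code> are all full <code>N x N</code> arrays here, which is exactly the traffic described above.</p>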
</section>
<section id="the-two-problems" class="level3">
<h3 class="anchored" data-anchor-id="the-two-problems">The two problems</h3>
<p>Looking at these algorithms, there are two issues:</p>
<p><strong>Problem 1: Multiple HBM round-trips.</strong> Every step shuffles <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> matrices between HBM and SRAM. As a result, your standard attention is dominated by comparatively slow HBM accesses while the compute cores sit idle waiting for data to arrive.</p>
<p><strong>Problem 2: <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> activation memory.</strong> The forward pass must save <img src="https://latex.codecogs.com/png.latex?P"> (<img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N">) per head per layer for backpropagation. This not only creates extra HBM traffic but also consumes a lot of GPU memory especially for long sequences.</p>
<blockquote class="blockquote">
<p><strong>Standard attention implementation is memory-bound and not “IO-AWARE”!</strong> The <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> matrices are repeatedly read from and written to HBM during both forward and backward passes.</p>
</blockquote>
</section>
</section>
<section id="what-flashattention-does" class="level2">
<h2 class="anchored" data-anchor-id="what-flashattention-does">What FlashAttention Does</h2>
<p><strong>FlashAttention is an IO-aware, exact attention algorithm</strong>. It computes the same output as standard attention but it never materializes the <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> attention matrix in HBM. The total HBM IO drops from <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> to <img src="https://latex.codecogs.com/png.latex?O(N%5E2%20d%5E2%20/%20M)">, where <img src="https://latex.codecogs.com/png.latex?M"> is the SRAM size per SM. For typical values of <img src="https://latex.codecogs.com/png.latex?d"> and <img src="https://latex.codecogs.com/png.latex?M">, this is dramatically less than <img src="https://latex.codecogs.com/png.latex?O(N%5E2)">.</p>
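<p>To get a feel for the size of this reduction, here is a rough order-of-magnitude comparison of the two expressions. Constant factors are dropped, and the SRAM budget <code>M</code> (counted in elements) is a hypothetical value chosen for illustration, so the printed ratio should be read as a ballpark rather than a measured speedup.</p>

```python
def standard_io(N, d):
    """Dominant HBM traffic of standard attention, in elements (constants dropped)."""
    return N * N


def flash_io(N, d, M):
    """HBM traffic of FlashAttention, in elements (constants dropped)."""
    return N * N * d * d / M


N = 4096      # sequence length
d = 64        # head dimension
M = 50_000    # hypothetical SRAM budget per SM, in elements

ratio = standard_io(N, d) / flash_io(N, d, M)
print(f"standard: {standard_io(N, d):.3e} elements")
print(f"flash:    {flash_io(N, d, M):.3e} elements")
print(f"ratio:    {ratio:.1f}x (equals M / d^2)")
```

<p>The ratio simplifies to <code>M / d**2</code>, independent of <code>N</code>, which is why the saving holds as long as the head dimension is small relative to the SRAM size.</p>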
<p>The algorithm achieves this through five interdependent fixes:</p>
<ol type="1">
<li><strong>Operator fusion</strong>: run the entire attention computation (matmul, softmax, matmul) in a single kernel so intermediates stay in SRAM.</li>
<li><strong>Tiling</strong>: partition <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, <img src="https://latex.codecogs.com/png.latex?V"> into blocks that fit in SRAM, processing one block at a time.</li>
<li><strong>Recomputation</strong>: do not store the <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> probability matrix <img src="https://latex.codecogs.com/png.latex?P">. Recompute it from <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, and a tiny normalization constant during the backward pass.</li>
<li><strong>Online softmax</strong>: compute softmax incrementally across tiles using running statistics so that tiling produces the exact result without ever seeing the full row.</li>
<li><strong>The logsumexp value <img src="https://latex.codecogs.com/png.latex?L"></strong>: a single scalar per query row that encodes everything needed to recover <img src="https://latex.codecogs.com/png.latex?P"> during the backward pass.</li>
</ol>
<p>These fixes are not independent. Fusion needs tiling (intermediates are too large for SRAM otherwise). Tiling needs online softmax (softmax is not directly associative). Recomputation needs <img src="https://latex.codecogs.com/png.latex?L"> (the backward pass must recover <img src="https://latex.codecogs.com/png.latex?P"> without storing it). I will walk through each fix in turn, building up the complete algorithm.</p>
<img src="https://garg-aayush.github.io/static/img/blog-2026-03-27/flashattention.png" class="img-fluid" style="width:100.0%">
<p align="center">
<em>Left: GPU memory hierarchy. Center: FlashAttention tiled computation with Q, K, V blocks loaded into SRAM. Right: FlashAttention fuses all attention ops into a single kernel, eliminating separate HBM round-trips. (Source: <a href="https://arxiv.org/abs/2205.14135">Dao et al., 2022</a>)</em>
</p>
<section id="operator-fusion" class="level3">
<h3 class="anchored" data-anchor-id="operator-fusion">Operator Fusion</h3>
<p>As shown above, standard attention launches three separate GPU kernels, with <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> intermediates written to and read back from HBM between them. The softmax step is especially wasteful: it does very little arithmetic per element but reads and writes the entire attention matrix. The compute cores sit almost entirely idle waiting for data.</p>
<p>The most common approach to accelerate memory-bound operations is <strong>kernel fusion</strong>: if there are multiple operations applied to the same input, the input can be loaded once from HBM instead of multiple times for each operation.</p>
<p>Often, compilers (e.g.&nbsp;PyTorch’s torch.compile) can automatically fuse operations but they are not always able to fuse complex operations like matmul + softmax in attention as it requires domain specific knowledge about how to tile and reorder the computation. <strong>FlashAttention is a hand-crafted fused kernel that exploits the specific structure of attention.</strong></p>
<p>Fusion alone is not enough. For it to actually deliver the speedups we want, two separate problems need to be solved:</p>
<ol type="1">
<li><p><strong>The intermediates must fit in SRAM.</strong> Even a single block of query rows scored against all <img src="https://latex.codecogs.com/png.latex?N"> keys produces a score matrix that exceeds SRAM capacity, and this only gets worse as <img src="https://latex.codecogs.com/png.latex?N"> grows. We need a way to break the computation into pieces that fit. That is <strong>tiling</strong> (next section).</p></li>
<li><p><strong>We need to avoid storing <img src="https://latex.codecogs.com/png.latex?P"> for the backward pass.</strong> Even if we fuse the forward pass perfectly, backpropagation requires the <img src="https://latex.codecogs.com/png.latex?P"> matrix to be saved in HBM for gradient computation. This reintroduces the <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> memory cost we are trying to eliminate. The fix is <strong>recomputation</strong>.</p></li>
</ol>
</section>
<section id="tiling" class="level3">
<h3 class="anchored" data-anchor-id="tiling">Tiling</h3>
<p>As mentioned above, the full <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> attention matrix does not fit in SRAM. Fortunately, we do not need to hold it all at once. <strong>The attention computation does not require the full matrix to be in memory simultaneously. The output <img src="https://latex.codecogs.com/png.latex?O"> can be computed tile by tile.</strong> This keeps all intermediate values in SRAM: HBM is touched only to read blocks of <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, <img src="https://latex.codecogs.com/png.latex?V"> and to write <img src="https://latex.codecogs.com/png.latex?O">.</p>
<p>We partition <img src="https://latex.codecogs.com/png.latex?Q"> into row blocks and <img src="https://latex.codecogs.com/png.latex?K">, <img src="https://latex.codecogs.com/png.latex?V"> into column blocks. For each query block, we loop over all key/value blocks, compute a small score tile that fits in SRAM, apply softmax, multiply by the value block, and accumulate into an output block that also stays in SRAM. <strong>The block sizes are chosen so that all the tiles (query, key, value, score, output) fit in SRAM simultaneously.</strong></p>
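<p>The key point is that the full <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> attention matrix is never materialized, neither in HBM nor in SRAM. HBM traffic now consists only of <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, <img src="https://latex.codecogs.com/png.latex?V"> and <img src="https://latex.codecogs.com/png.latex?O"> blocks: each query block is read once while the <img src="https://latex.codecogs.com/png.latex?K"> and <img src="https://latex.codecogs.com/png.latex?V"> blocks are re-read for every query block. This is what gives the <img src="https://latex.codecogs.com/png.latex?O(N%5E2%20d%5E2%20/%20M)"> HBM IO cost quoted earlier, instead of <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> for vanilla attention.</p>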
<section id="the-softmax-problem" class="level4">
<h4 class="anchored" data-anchor-id="the-softmax-problem">The Softmax Problem</h4>
<p>There is a subtlety that makes tiling attention fundamentally harder than tiling a standard matrix multiplication. For matmul, tiling works because addition is associative: the partial products from each tile can simply be summed. But attention has a softmax sandwiched between two matmuls and <strong>softmax is not directly associative across tiles</strong>. Softmax for each query row needs the global maximum (<img src="https://latex.codecogs.com/png.latex?m">) and global denominator (<img src="https://latex.codecogs.com/png.latex?l">) over all keys. If we are processing key tiles one at a time, we do not know these global statistics until we have seen every tile. The online softmax algorithm (discussed later) solves this problem.</p>
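<p>A tiny pure-Python example makes the problem visible. Here a row of four made-up scores is split into two tiles, and each tile is normalized with its own local max and local sum. The values and the tile size are arbitrary choices for illustration.</p>

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


row = [1.0, 3.0, 2.0, 5.0]
tiles = [row[:2], row[2:]]

# Naive tiling: softmax applied to each tile independently, then concatenated.
per_tile = [p for tile in tiles for p in softmax(tile)]
# Reference: softmax over the whole row at once.
exact = softmax(row)

print("per-tile:", [round(p, 3) for p in per_tile], "sum =", round(sum(per_tile), 3))
print("exact:   ", [round(p, 3) for p in exact], "sum =", round(sum(exact), 3))
```

<p>The per-tile result sums to roughly 2 instead of 1 because every tile was normalized on its own, so the tiles cannot simply be concatenated or summed the way matmul partial products can.</p>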
</section>
</section>
<section id="recomputation" class="level3">
<h3 class="anchored" data-anchor-id="recomputation">Recomputation</h3>
<p>Tiling eliminates the <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> matrix from the forward pass. However, the gradient computation requires the probability matrix <img src="https://latex.codecogs.com/png.latex?P"> (see the backward algorithm above). If we store <img src="https://latex.codecogs.com/png.latex?P"> for the backward pass, we reintroduce <img src="https://latex.codecogs.com/png.latex?O(N%5E2)"> memory per head which is exactly the cost we just eliminated.</p>
<p>FlashAttention makes a deliberate choice: <strong>do not store <img src="https://latex.codecogs.com/png.latex?P">, recompute it during the backward pass from the saved <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, and <img src="https://latex.codecogs.com/png.latex?V"> tiles.</strong></p>
<p>This trades memory for compute. Since standard attention is memory-bound, the additional compute is cheap in comparison: one extra forward pass over the <img src="https://latex.codecogs.com/png.latex?QK%5ET"> tiles per backward pass. Since the backward pass already has to load <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, and <img src="https://latex.codecogs.com/png.latex?V"> from HBM, recomputation adds no extra HBM reads. The memory savings are enormous: we only need to store <img src="https://latex.codecogs.com/png.latex?Q">, <img src="https://latex.codecogs.com/png.latex?K">, <img src="https://latex.codecogs.com/png.latex?V">, <img src="https://latex.codecogs.com/png.latex?O"> (all <img src="https://latex.codecogs.com/png.latex?O(Nd)">) and one additional scalar <img src="https://latex.codecogs.com/png.latex?L"> per query row.</p>
<blockquote class="blockquote">
<p>In practice, the FlashAttention backward pass is <em>faster</em> than the standard attention backward pass despite doing more FLOPs, because it eliminates the massive HBM reads and writes of <img src="https://latex.codecogs.com/png.latex?P"> (remember, standard attention is memory-bound!). This is the same principle as <a href="https://arxiv.org/abs/1604.06174">gradient checkpointing</a> but applied surgically to a single intermediate rather than at the coarse granularity of entire layers.</p>
</blockquote>
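<p>The reason one scalar per row is enough can be shown on a single row of scores. In this sketch the forward pass keeps only the logsumexp <code>L</code> of the row, and the “backward pass” recovers each probability as <code>exp(x - L)</code>, one tile at a time. The score values are made up, and in the real algorithm they would be recomputed from <img src="https://latex.codecogs.com/png.latex?Q"> and <img src="https://latex.codecogs.com/png.latex?K"> rather than held in a list.</p>

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


scores = [0.5, 2.0, -1.0, 3.5]   # one row of S (illustrative values)

# Forward pass: save a single scalar for this row instead of the whole row of P.
L = logsumexp(scores)

# Backward pass: scores arrive tile by tile and each tile only needs L to
# turn its scores back into exact softmax probabilities.
tile_a, tile_b = scores[:2], scores[2:]
recovered = [math.exp(x - L) for x in tile_a] + [math.exp(x - L) for x in tile_b]

print([round(p, 4) for p in recovered], "sum =", round(sum(recovered), 6))
```

<p>Because <code>exp(x - L)</code> equals <code>exp(x - m) / l</code>, the single value <code>L</code> carries both the row max and the denominator.</p>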
</section>
<section id="online-softmax" class="level3">
<h3 class="anchored" data-anchor-id="online-softmax">Online Softmax</h3>
<p>As hinted before, tiling works naturally for matmul because addition is associative. But attention has a softmax sandwiched between two matmuls and softmax needs two global statistics, the row maximum <img src="https://latex.codecogs.com/png.latex?m"> and the denominator <img src="https://latex.codecogs.com/png.latex?l">, before any output can be produced. The numerically stable (“safe”) softmax for a row <img src="https://latex.codecogs.com/png.latex?x%20=%20(x_1,%20%5Cldots,%20x_N)"> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bsoftmax%7D(x)_i%20=%20%5Cfrac%7Be%5E%7Bx_i%20-%20m%7D%7D%7B%5Csum_%7Bj=1%7D%5E%7BN%7D%20e%5E%7Bx_j%20-%20m%7D%7D,%20%5Cquad%20m%20=%20%5Cmax_%7Bj=1%7D%5E%7BN%7D%20x_j%0A"></p>
<p>Computing this requires three sequential passes over the data (<a href="https://arxiv.org/abs/1805.02867">Milakov &amp; Gimelshein, 2018</a>):</p>
<ul>
<li>Pass 1 sweeps for the max <img src="https://latex.codecogs.com/png.latex?m_N"></li>
<li>Pass 2 uses <img src="https://latex.codecogs.com/png.latex?m_N"> to accumulate the denominator <img src="https://latex.codecogs.com/png.latex?%5Cell_N%20=%20%5Csum_j%20e%5E%7Bx_j%20-%20m_N%7D"></li>
<li>Pass 3 uses both to emit the final values <img src="https://latex.codecogs.com/png.latex?a_i%20=%20e%5E%7Bx_i%20-%20m_N%7D/%5Cell_N">.</li>
</ul>
<p>Each pass depends on the result of the previous one. If the full row of logits does not fit in SRAM (which it generally does not for long sequences), each pass must re-read <img src="https://latex.codecogs.com/png.latex?Q"> and <img src="https://latex.codecogs.com/png.latex?K"> from HBM to recompute logits on the fly. Three passes mean three HBM round-trips!</p>
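<p>Written out as plain Python loops, the three passes look like this. It is a direct transcription of the bullet points above for a single row, with the input held in a list for simplicity.</p>

```python
import math


def softmax_three_pass(x):
    # Pass 1: sweep for the row maximum m_N.
    m = -math.inf
    for xi in x:
        m = max(m, xi)

    # Pass 2: accumulate the denominator l_N using the final max.
    l = 0.0
    for xi in x:
        l += math.exp(xi - m)

    # Pass 3: emit the normalized values a_i.
    return [math.exp(xi - m) / l for xi in x]


# Large logits would overflow a naive exp(x) but are fine after subtracting the max.
probs = softmax_three_pass([1000.0, 1001.0, 1002.0])
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 6))
```

<p>Pass 2 cannot start until pass 1 has finished, and pass 3 cannot start until pass 2 has finished, which is the dependency chain the next sections remove.</p>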
<section id="passes-to-2-the-surrogate-denominator" class="level4">
<h4 class="anchored" data-anchor-id="passes-to-2-the-surrogate-denominator">3 passes to 2: the surrogate denominator</h4>
<p>The denominator update <img src="https://latex.codecogs.com/png.latex?%5Cell_i%20=%20%5Cell_%7Bi-1%7D%20+%20e%5E%7Bx_i%20-%20m_N%7D"> depends on the <em>final</em> max <img src="https://latex.codecogs.com/png.latex?m_N"> which blocks fusion with the max pass. The trick (<a href="https://arxiv.org/abs/1805.02867">Milakov &amp; Gimelshein, 2018</a>) is to define a <strong>surrogate denominator</strong> that uses the running max <img src="https://latex.codecogs.com/png.latex?m_i"> instead:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cell_i%20:=%20%5Csum_%7Bj=1%7D%5E%7Bi%7D%20e%5E%7Bx_j%20-%20m_i%7D%0A"></p>
<p>At position <img src="https://latex.codecogs.com/png.latex?N"> the running max equals the final max, so the surrogate <img src="https://latex.codecogs.com/png.latex?%5Cell_N"> equals the true denominator. If we can update <img src="https://latex.codecogs.com/png.latex?%5Cell_i"> incrementally, we get the final denominator for free.</p>
<p>Start from the definition and split off the <img src="https://latex.codecogs.com/png.latex?i">-th term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cell_i%20=%20%5Cleft(%5Csum_%7Bj=1%7D%5E%7Bi-1%7D%20e%5E%7Bx_j%20-%20m_i%7D%5Cright)%20+%20e%5E%7Bx_i%20-%20m_i%7D%0A"></p>
<p>The key move is to relate each <img src="https://latex.codecogs.com/png.latex?e%5E%7Bx_j%20-%20m_i%7D"> in the sum to <img src="https://latex.codecogs.com/png.latex?e%5E%7Bx_j%20-%20m_%7Bi-1%7D%7D"> (which is what <img src="https://latex.codecogs.com/png.latex?%5Cell_%7Bi-1%7D"> uses) by factoring out <img src="https://latex.codecogs.com/png.latex?e%5E%7Bm_%7Bi-1%7D%20-%20m_i%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cell_i%20=%20%5Cunderbrace%7B%5Cleft(%5Csum_%7Bj=1%7D%5E%7Bi-1%7D%20e%5E%7Bx_j%20-%20m_%7Bi-1%7D%7D%5Cright)%7D_%7B%5Cell_%7Bi-1%7D%7D%20%5Ccdot%5C;%20e%5E%7Bm_%7Bi-1%7D%20-%20m_i%7D%20%5C;+%5C;%20e%5E%7Bx_i%20-%20m_i%7D%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cell_i%20=%20%5Cell_%7Bi-1%7D%20%5Ccdot%20e%5E%7Bm_%7Bi-1%7D%20-%20m_i%7D%20+%20e%5E%7Bx_i%20-%20m_i%7D%7D%0A"></p>
<p>Everything on the right is available at step <img src="https://latex.codecogs.com/png.latex?i"> with no dependency on the future. This fuses the max and denominator into a single pass, reducing 3 passes to 2.</p>
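<p>To make the boxed recurrence concrete, here is a minimal pure-Python sketch of the resulting two-pass softmax over one row of logits. It is an illustration only, not the actual GPU kernel: the function name <code>softmax_two_pass</code> is hypothetical and the inputs are assumed to be finite floats.</p>

```python
import math

def softmax_two_pass(x):
    """Softmax over one row using a fused running max and surrogate denominator."""
    m = float("-inf")  # running max m_i
    l = 0.0            # surrogate denominator l_i
    # Pass 1: update the max and the denominator together.
    for x_i in x:
        m_new = max(m, x_i)
        # Rescale the old sum to the new max, then add the new term.
        l = l * math.exp(m - m_new) + math.exp(x_i - m_new)
        m = m_new
    # Pass 2: normalize every logit with the final statistics.
    return [math.exp(x_i - m) / l for x_i in x]
```

<p>On the first element the rescale factor is <code>exp(-inf) = 0</code>, so the empty running sum contributes nothing and the loop needs no special case.</p>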
</section>
<section id="passes-to-1-the-surrogate-output" class="level4">
<h4 class="anchored" data-anchor-id="passes-to-1-the-surrogate-output">2 passes to 1: the surrogate output</h4>
<p>In attention, the final target is not the attention score matrix but the output matrix <img src="https://latex.codecogs.com/png.latex?O">, or more specifically <img src="https://latex.codecogs.com/png.latex?O%20=%20A%20%5Ccdot%20V">, which still requires a second sweep. The same surrogate trick eliminates it: define a <strong>surrogate output</strong> <img src="https://latex.codecogs.com/png.latex?o'_i"> using the running statistics <img src="https://latex.codecogs.com/png.latex?m_i"> and <img src="https://latex.codecogs.com/png.latex?%5Cell_i"> instead of the final ones. Applying the same factor-and-rescale algebra yields:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7Bo'_i%20=%20o'_%7Bi-1%7D%20%5Ccdot%20%5Cfrac%7B%5Cell_%7Bi-1%7D%20%5Ccdot%20e%5E%7Bm_%7Bi-1%7D%20-%20m_i%7D%7D%7B%5Cell_i%7D%20+%20%5Cfrac%7Be%5E%7Bx_i%20-%20m_i%7D%7D%7B%5Cell_i%7D%20%5Ccdot%20V%5Bi,%20:%5D%7D%0A"></p>
<p>At position <img src="https://latex.codecogs.com/png.latex?N">, the running statistics equal the final statistics, so <img src="https://latex.codecogs.com/png.latex?o'_N"> equals the true output <img src="https://latex.codecogs.com/png.latex?o_N">. <strong>The online update is not an approximation</strong>. It produces the same result as computing softmax over the full row.</p>
<p>In practice, FlashAttention tracks <img src="https://latex.codecogs.com/png.latex?m"> and <img src="https://latex.codecogs.com/png.latex?%5Cell"> as per-row scalars kept on-chip (registers or SRAM), updated once per key tile per query tile.</p>
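<p>The surrogate-output recurrence can be sketched for a single query row as follows. This is a plain-Python illustration that processes one key at a time rather than one tile at a time. The function name <code>attention_row_online</code> and the list-of-lists layout for <code>values</code> are assumptions made for the example.</p>

```python
import math

def attention_row_online(scores, values):
    """One-pass attention output for one query row.

    scores: logits x_1..x_N for this row.
    values: N value vectors (rows of V), each a list of floats.
    """
    m = float("-inf")            # running max m_i
    l = 0.0                      # running denominator l_i
    o = [0.0] * len(values[0])   # surrogate output o'_i
    for x_i, v_i in zip(scores, values):
        m_new = max(m, x_i)
        p = math.exp(x_i - m_new)
        l_new = l * math.exp(m - m_new) + p
        # Rescale the old output and mix in the new value row.
        scale_old = l * math.exp(m - m_new) / l_new
        scale_new = p / l_new
        o = [scale_old * o_k + scale_new * v_k for o_k, v_k in zip(o, v_i)]
        m, l = m_new, l_new
    return o
```

<p>After the last key, <code>o</code> matches what you would get from computing the full softmax row first and then multiplying by <code>V</code>, up to floating-point rounding.</p>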
<blockquote class="blockquote">
<p>I would recommend reading these excellent notes <a href="https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf">“From online softmax to flash attention”</a> to understand all the derivations in more detail.</p>
</blockquote>
</section>
</section>
<section id="the-logsumexp-value-l" class="level3">
<h3 class="anchored" data-anchor-id="the-logsumexp-value-l">The Logsumexp Value L</h3>
<p>At the end of the forward pass, the online softmax loop has produced the final row maximum <img src="https://latex.codecogs.com/png.latex?m_i"> and denominator <img src="https://latex.codecogs.com/png.latex?%5Cell_i"> for each query row <img src="https://latex.codecogs.com/png.latex?i">. <a href="https://tridao.me/publications/flash2/flash2.pdf">FlashAttention-2</a> compresses these two scalars into a single value, the <strong>logsumexp value</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL_i%20=%20m_i%20+%20%5Clog%20%5Cell_i%20=%20%5Clog%5C!%5Cleft(%5Csum_j%20e%5E%7BS_%7Bij%7D%7D%5Cright)%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?L_i"> is the log of the softmax partition function for row <img src="https://latex.codecogs.com/png.latex?i">. It is one scalar per query row and can be used to recompute <img src="https://latex.codecogs.com/png.latex?P"> as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP_%7Bij%7D%20=%20e%5E%7BS_%7Bij%7D%20-%20L_i%7D%0A"></p>
<p>This means we do not need to store <img src="https://latex.codecogs.com/png.latex?P"> during the forward pass. With <img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V,%20O"> and <img src="https://latex.codecogs.com/png.latex?L">, we have everything we need to compute the backward pass.</p>
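<p>As a small sanity check of the two formulas above, the sketch below computes <code>L</code> for one row of scores and then recovers the probabilities from the scores and <code>L</code> alone. The helper names <code>logsumexp_row</code> and <code>recompute_probs</code> are hypothetical and only illustrate the identity. A real kernel does this per tile on the GPU.</p>

```python
import math

def logsumexp_row(s_row):
    """L = m + log(l), computed stably by subtracting the row max."""
    m = max(s_row)
    l = sum(math.exp(s - m) for s in s_row)
    return m + math.log(l)

def recompute_probs(s_row, L):
    """Recover the softmax row from the scores and the saved scalar L."""
    return [math.exp(s - L) for s in s_row]

row = [2.0, 0.0, -1.0]
L = logsumexp_row(row)
probs = recompute_probs(row, L)  # a valid softmax row; no stored P needed
```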
<blockquote class="blockquote">
<p><a href="https://tridao.me/publications/flash2/flash2.pdf">FlashAttention-2</a> also introduces other parallelization and loop-order engineering efficiencies, which I will discuss in later posts. For now, I have focused on the core conceptual improvements that are part of the FlashAttention algorithm.</p>
</blockquote>
</section>
</section>
<section id="wrapping-up" class="level2">
<h2 class="anchored" data-anchor-id="wrapping-up">Wrapping Up</h2>
<p>The core insight behind FlashAttention is not about clever matmul tricks. It is about bytes: minimizing data movement. Standard attention is slow because it moves too much data through the memory bus: the <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> attention matrix gets written to and read from HBM multiple times across three separate kernel launches.</p>
<p>The five fixes of FlashAttention address that bottleneck systematically:</p>
<ul>
<li><strong>Operator fusion</strong> eliminates HBM round-trips by running the entire attention computation in a single kernel.</li>
<li><strong>Tiling</strong> breaks the computation into SRAM-sized blocks so that fusion is actually possible.</li>
<li><strong>Recomputation</strong> removes the need to store the <img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%20N"> probability matrix <img src="https://latex.codecogs.com/png.latex?P">, trading cheap extra compute for a massive reduction in activation memory.</li>
<li><strong>Online softmax</strong> makes tiling exact by maintaining running statistics that converge to the correct answer in a single pass.</li>
<li><strong>The logsumexp value <img src="https://latex.codecogs.com/png.latex?L"></strong> bridges the forward and backward passes, compressing everything needed to recover <img src="https://latex.codecogs.com/png.latex?P"> into one scalar per query row.</li>
</ul>
<p>The result is an attention algorithm with the same numerical output, dramatically less HBM IO and less memory. It is faster despite doing slightly more work because the bottleneck is memory, not compute.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<section id="papers" class="level3">
<h3 class="anchored" data-anchor-id="papers">Papers</h3>
<ul>
<li><a href="https://arxiv.org/abs/2205.14135">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</a>: The original FlashAttention paper</li>
<li><a href="https://tridao.me/publications/flash2/flash2.pdf">FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning</a>: FlashAttention-2 paper that improves upon the original FlashAttention using better parallelism and work partitioning</li>
<li><a href="https://arxiv.org/abs/1805.02867">Online Normalizer Calculation for Softmax</a>: Original derivation of the online softmax recurrence</li>
<li><a href="https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf">From Online Softmax to FlashAttention</a>: Excellent notes bridging online softmax to FlashAttention</li>
</ul>
</section>
<section id="blogs" class="level3">
<h3 class="anchored" data-anchor-id="blogs">Blogs</h3>
<ul>
<li><a href="https://horace.io/brrr_intro.html">Making Deep Learning Go Brrrr From First Principles</a>: A must read to understand memory-bound vs compute-bound workloads</li>
<li><a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad">ELI5: FlashAttention</a>: Accessible conceptual explanation of FlashAttention</li>
<li><a href="https://modal.com/gpu-glossary">Modal GPU Performance Glossary</a>: Dictionary of terms and concepts related to programming GPUs</li>
</ul>


</section>
</section>

 ]]></description>
  <category>Transformers</category>
  <guid>https://garg-aayush.github.io/posts/2026-03-27-flash-attention/</guid>
  <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Building Browser Tools with Claude Code</title>
  <link>https://garg-aayush.github.io/posts/2026-03-20-browser-tools/</link>
  <description><![CDATA[ 




<p>I have been an admirer of Simon Willison’s <a href="https://tools.simonwillison.net/">tools site</a> for a while now. On his website you can find more than a hundred tools he has built using LLMs, mostly self-contained single-file HTML + JavaScript pages.</p>
<p>I always wanted something of my own: specific tools that I use often and that would be handy to have on my website. However, I never got around to actually writing them. I am an AI/ML engineer and HTML and JavaScript are not my strongest suits. A few days ago I decided to just start building them using <a href="https://claude.ai/code">Claude Code</a> and see how far I could get. Below are some of the tools I have built for myself and you might find them useful too.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-03-20/tools_preview.jpg" class="img-fluid" style="width:100.0%"></p>
<section id="the-tools" class="level2">
<h2 class="anchored" data-anchor-id="the-tools">The Tools</h2>
<p>Here is what is there so far, grouped by category. All tools support drag-and-drop for file input and run entirely in the browser (nothing leaves your machine).</p>
<section id="image-tools" class="level3">
<h3 class="anchored" data-anchor-id="image-tools">Image Tools</h3>
<ul>
<li><a href="../../tools/image-converter.html"><strong>Image Format Converter</strong></a>: Convert between PNG, JPEG, and WebP with transparency handling.</li>
<li><a href="../../tools/image-resizer.html"><strong>Image Resizer</strong></a>: Resize images by dimensions, aspect ratio, or percentage.</li>
<li><a href="../../tools/image-cropper.html"><strong>Image Cropper</strong></a>: Interactive crop with draggable selection and aspect ratio presets.</li>
<li><a href="../../tools/image-mask-creator.html"><strong>Binary Mask Creator</strong></a>: Paint binary black and white masks for AI inpainting and fill models.</li>
<li><a href="../../tools/image-operations.html"><strong>Image Adjust &amp; Transform</strong></a>: Flip, rotate, invert, grayscale, brightness, and contrast adjustments.</li>
<li><a href="../../tools/svg-viewer.html"><strong>SVG Viewer &amp; Converter</strong></a>: Preview SVGs, inspect metadata, export to raster formats.</li>
</ul>
<p>I find the Binary Mask Creator pretty handy, especially when working with inpainting models like <a href="https://replicate.com/black-forest-labs/flux-fill-dev">FLUX Fill</a> or <a href="https://replicate.com/ideogram-ai/ideogram-v3-turbo">Ideogram v3</a>. Creating clean masks is usually a bit of a pain using some clunky web app. The SVG Viewer is another one I reach for often to preview and convert SVGs on the go (saves me time).</p>
<table align="center">
<tbody><tr>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-03-20/binary_mask_creator.jpg" alt="Binary Mask Creator" width="100%">
</td>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-03-20/svg_convertor.jpg" alt="SVG Viewer &amp; Converter" width="100%">
</td>
</tr>
</tbody></table>
<p align="center">
<em>Binary Mask Creator and SVG Viewer &amp; Converter</em>
</p>
</section>
<section id="viewer-tools" class="level3">
<h3 class="anchored" data-anchor-id="viewer-tools">Viewer Tools</h3>
<ul>
<li><a href="../../tools/html-viewer.html"><strong>HTML Viewer</strong></a>: Paste HTML and see it rendered in a sandboxed frame.</li>
<li><a href="../../tools/markdown-viewer.html"><strong>Markdown Viewer</strong></a>: Live markdown preview with syntax highlighting and LaTeX math support.</li>
<li><a href="../../tools/json-formatter.html"><strong>JSON Formatter &amp; Viewer</strong></a>: Format, validate, collapsible tree view, and search.</li>
<li><a href="../../tools/diff-viewer.html"><strong>Diff Viewer</strong></a>: Side-by-side and unified diff with word-level highlighting.</li>
<li><a href="../../tools/latex-preview.html"><strong>LaTeX Math Preview</strong></a>: Live LaTeX math rendering with a symbol palette and saved snippets.</li>
</ul>
<p>The HTML and Markdown viewers come in handy, especially if you are working with an OCR model and quickly want to preview the output it generates for an image or PDF. And whenever I am working with math equations, the LaTeX Preview is a nice scratchpad where I can quickly test out an equation and make sure it renders correctly.</p>
</section>
<section id="data-tools" class="level3">
<h3 class="anchored" data-anchor-id="data-tools">Data Tools</h3>
<ul>
<li><a href="../../tools/yaml-json-converter.html"><strong>YAML/JSON Converter</strong></a>: Bidirectional conversion with validation and configurable indentation.</li>
<li><a href="../../tools/base64.html"><strong>Base64 Encoder/Decoder</strong></a>: Encode/decode images and text with image preview.</li>
<li><a href="../../tools/url-encoder.html"><strong>URL Encoder/Decoder</strong></a>: Percent-encode and decode strings, useful for API calls and debugging.</li>
</ul>
<p>I like the YAML/JSON converter a lot, especially when working with training configs. The Base64 tool is another one I use often when working with LLM/image-gen APIs. They sometimes send back base64-encoded images or require them as input, and I can do the conversion directly here.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-03-20/yaml2json.jpg" class="img-fluid" style="width:100.0%"></p>
<blockquote class="blockquote">
<p>For what it is worth, I do have all of these set up as skills in my Claude Code setup too. But it is always nice to have something in the browser where you can quickly drag and drop an image and get the <strong>result without opening a terminal session or burning tokens</strong>.</p>
</blockquote>
</section>
</section>
<section id="how-i-built-them" class="level2">
<h2 class="anchored" data-anchor-id="how-i-built-them">How I Built Them</h2>
<p>Nathan Lambert wrote a good piece on how <a href="https://www.interconnects.ai/p/claude-code-hits-different">Claude Code hits different</a> and Sergey Karayev put it well when he said Claude Code with Opus is <em>“moving software creation from an artisanal, craftsman activity to a true industrial process”</em>. I think that captures it. The bottleneck is no longer writing the code, it is planning, designing and validating.</p>
<p>When I was building these tools, most of my time went into two things. First, planning out which tools to build, with a focus on simple, standalone HTML files that serve my purpose. I would write a detailed plan with a description of each tool and then fire multiple sub-agents in Claude Code to build them in parallel. Second, <strong>doing that last ~10% of the work</strong> where you polish things according to your needs: making sure there are no missing libraries and no edge cases falling through, and testing each tool to make sure it actually works as expected.</p>
<p>I did not sit down and write HTML or JavaScript. I spent my time figuring out what these tools should do and then verifying that they do it correctly. Without Claude Code I would not have attempted this, not because the tools are complex but because HTML and JavaScript are not what I work in daily. With Claude Code, I could just focus on the requirements and planning part, ensure a correct test path, and let it handle the implementation.</p>
</section>
<section id="try-them-out" class="level2">
<h2 class="anchored" data-anchor-id="try-them-out">Try Them Out</h2>
<p>All tools are at <a href="../../tools/">aayushgarg.dev/tools</a>. If you have ideas for useful tools, feel free to reach out.</p>


</section>

 ]]></description>
  <category>Tools &amp; Infra</category>
  <guid>https://garg-aayush.github.io/posts/2026-03-20-browser-tools/</guid>
  <pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>GRPO: Building Intuition Through Ablation Studies</title>
  <link>https://garg-aayush.github.io/posts/2026-02-26-grpo-from-scratch/</link>
  <description><![CDATA[ 




<p>Following the same approach as I did for <a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">SFT</a> and <a href="https://huggingface.co/blog/garg-aayush/expert-iteration-math-reasoning">Expert Iteration</a>, I have written the GRPO training code from scratch. I loosely followed the GRPO experiment part of Stanford CS336 <a href="https://github.com/stanford-cs336/assignment5-alignment/blob/main/cs336_spring2025_assignment5_alignment.pdf">Assignment 5</a> as a reference point and trained <a href="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B">Qwen2.5-Math-1.5B</a> with verifiable math rewards. This time around I had 3 main motivations:</p>
<ol type="1">
<li>As usual, write the GRPO code from scratch for the sake of understanding.</li>
<li>Train Qwen2.5-Math-1.5B model with verifiable math rewards and get a feel of what kind of accuracy we can push with pure RL (no supervised fine-tuning).</li>
<li><strong>Most importantly</strong>, run a lot of ablation studies to understand and build intuition on what matters in GRPO training, the different design choices we can make and how to interpret the different metrics. <strong>Looking back, I think this is the most important part of this long exercise.</strong></li>
</ol>
<p align="center">
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/best_run/best_run.png" alt="Best GRPO run: eval reward accuracy and mean response length" width="700">
</p><p align="center">
<em>Best: <b>~75% reward accuracy</b> on MATH validation (up from ~3% base model accuracy)</em>
</p>
<p></p>
<section id="quick-recap-on-grpo" class="level2">
<h2 class="anchored" data-anchor-id="quick-recap-on-grpo">Quick Recap on GRPO</h2>
<p><a href="https://huggingface.co/blog/garg-aayush/derive-grpo-loss">GRPO</a> (Group Relative Policy Optimization) is an RL algorithm that eliminates the need for a separate critic/value model by using group-relative advantages. It generates multiple candidate outputs per prompt, scores them and normalizes rewards within the group to get advantages. It also uses ratio clipping (similar to PPO) to prevent the policy from drifting too far from the rollout distribution. Instead of imitating expert reasoning traces (SFT), it lets the model discover its own strategies by reinforcing the better of its own candidate solutions.</p>
<blockquote class="blockquote">
<p>You can read more about it in the <a href="https://huggingface.co/blog/garg-aayush/derive-grpo-loss">GRPO derivation blog post</a> I wrote a few weeks back and in this <a href="https://substack.com/home/post/p-177823868">blog post</a>.</p>
</blockquote>
</section>
<section id="building-the-training-loop" class="level2">
<h2 class="anchored" data-anchor-id="building-the-training-loop">Building the Training Loop</h2>
<p>I followed the same approach as suggested in the assignment: I wrote and tested the helper functions first, made sure each piece works in isolation, and finally wired them together into the full training loop.</p>
<p>The GRPO algorithm has two nested loops:</p>
<ul>
<li><strong>Outer loop</strong>: sample a batch of prompts, generate G <code>rollouts</code> per prompt via vLLM, compute rewards, normalize advantages within each group</li>
<li><strong>Inner loop</strong>: policy gradient updates over the rollout batch (using gradient accumulation to fit on the GPU)</li>
</ul>
<blockquote class="blockquote">
<p>Note, I have made sure to reuse the functions, data pipelines etc. from the <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/sft">SFT code</a> wherever possible.</p>
</blockquote>
<p>For the training data, I used the MATH dataset (problems only, no reasoning traces) sourced from this <a href="https://github.com/kkaitlyn111/cs336-a5-RL/tree/main/MATH">CS336 MATH dataset</a> repo.</p>
<p>The core GRPO functions (group-normalized rewards, four loss variants, microbatch train step) live in <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/grpo/utils/grpo.py"><code>utils/grpo.py</code></a> and the main training script that wires the outer and inner loops together is <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/grpo/train_grpo.py"><code>train_grpo.py</code></a>. You can find all the key files and details in the <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/grpo/Readme.md">GRPO README</a>.</p>
<p>A few important notes:</p>
<ul>
<li>I used the same vLLM colocate setup and workarounds as I did for SFT (see the <a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">SFT blog post</a> for the debugging details). This allowed me to run the training loop and intermediate evaluations and rollout generations on a single GPU.</li>
<li>All training configs are managed via <a href="https://omegaconf.readthedocs.io/">OmegaConf</a> structured configs and yaml files. This is especially useful for ablation studies where each experiment config is a minimal diff from the defaults, making it easy to see exactly what changed and importantly to reproduce any run.</li>
<li>As usual, I added intermediate evaluations, model checkpointing at configurable intervals, wandb logging, and other eval and timing metrics for better observability.</li>
</ul>
<blockquote class="blockquote">
<p>To keep costs manageable, all intermediate evaluations use a 1024-example subset of the validation set rather than the full ~5K.</p>
</blockquote>
</section>
<section id="gpu-memory-optimization-fitting-on-24-gb" class="level2">
<h2 class="anchored" data-anchor-id="gpu-memory-optimization-fitting-on-24-gb">GPU Memory Optimization (Fitting on 24 GB)</h2>
<p>I wrote the training code and tested it on my personal RTX 4090 with 24 GB of VRAM. With the default parameters suggested in the assignment, I immediately ran into OOM errors and had to do some memory optimizations to fit the training loop.</p>
<ul>
<li><strong>Peak memory tracking</strong>: Not an optimization in itself but an essential step to keep track of the peak memory usage. I made sure to log the peak memory at important junctures in the training loop.</li>
<li><strong>Gradient checkpointing</strong>: The simplest one is to enable gradient checkpointing, which recomputes activations during the backward pass instead of storing them. It trades some speed for memory savings of up to ~30%.</li>
<li><strong>vLLM sleep mode</strong>: This is quite a nice trick to offload vLLM KV cache and weights to CPU during the training phase (when vLLM is not generating). This frees GPU memory for the backward pass and prevents the vLLM cache from competing with training activations.</li>
<li><strong>8-bit AdamW</strong>: I added an option to use bitsandbytes <code>AdamW8bit</code> optimizer instead of the default <code>AdamW</code> optimizer. This reduces the optimizer state memory by almost half.</li>
</ul>
<p>After all the above optimization tricks, I was able to successfully train with <code>rollout_batch_size=256</code>, <code>group_size=8</code>, <code>gradient_accumulation_steps=256</code> (microbatch size = 1) on 24 GB.</p>
<blockquote class="blockquote">
<p>You should be able to run most of the ablation studies on your local RTX 4090/3090 GPU with the default configs and optimization flags.</p>
</blockquote>
</section>
<section id="scaling-to-modal-h100s" class="level2">
<h2 class="anchored" data-anchor-id="scaling-to-modal-h100s">Scaling to Modal (H100s)</h2>
<p>The local training script runs fine on a 24 GB GPU (e.g.&nbsp;RTX 4090) but there are two practical limitations that made me want to scale to <a href="https://modal.com/">Modal</a>:</p>
<ol type="1">
<li><strong>Speed</strong>: Even with all memory optimizations, a single experiment on the 4090 takes a few hours. Some of the ablation studies and hyperparameter searches become impractical. It always helps to run the experiments on a larger GPU for faster iterations.</li>
<li><strong>Parallelism</strong>: I wanted to run a lot of ablation experiments. On a single local GPU that means running them one at a time which would take weeks. <strong>I needed a way to fire off multiple experiments in parallel and compare them side-by-side in wandb.</strong> At the same time, I am not spending my full time on this and I work on it whenever I get time. Thus, I did not want to deal with spinning up and shutting down GPU instances each time.</li>
</ol>
<blockquote class="blockquote">
<p><a href="https://modal.com/">Modal</a> lets you define GPU workloads purely in Python (no Docker/Kubernetes), with pay-per-second billing and containers that spin up in seconds. I could fire off multiple H100 runs in parallel and everything scales back to zero when done.</p>
</blockquote>
<section id="h100-config-optimization" class="level3">
<h3 class="anchored" data-anchor-id="h100-config-optimization">H100 Config Optimization</h3>
<p>On the H100, I disabled the memory tricks that only exist to fit on 24 GB (gradient checkpointing, 8-bit AdamW) and used larger microbatches (4 instead of 1) for better tensor core utilization. I kept vLLM sleep mode on since it is still useful to free KV cache during training.</p>
<p><strong>Timing comparison (20 GRPO steps, <code>reinforce_with_baseline</code>):</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Hardware</th>
<th>Config</th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>RTX 4090 (24 GB)</td>
<td>RTX 4090 defaults</td>
<td>~28 min</td>
</tr>
<tr class="even">
<td>H100 (80 GB)</td>
<td>RTX 4090 defaults (unchanged)</td>
<td>~18 min</td>
</tr>
<tr class="odd">
<td>H100 (80 GB)</td>
<td>H100 optimized config</td>
<td>~10 min</td>
</tr>
</tbody>
</table>
<blockquote class="blockquote">
<p><strong>Cost note:</strong> Running all the ablation studies discussed below, including failed experiments and runs I terminated early, cost approximately <strong>$140</strong> on Modal. I think that is well worth it for the understanding I gained.</p>
</blockquote>
</section>
</section>
<section id="ablation-studies" class="level2">
<h2 class="anchored" data-anchor-id="ablation-studies">Ablation Studies</h2>
<p>I ran a series of ablation studies as per the assignment to understand what matters in GRPO training. Each ablation isolates one design choice while keeping everything else fixed.</p>
<blockquote class="blockquote">
<p>You will notice the experiments vary in length: some run for 200 GRPO steps, some for 100 or 50, and some were terminated midway. This is intentional and done to keep the total cost manageable while still getting the information I needed.</p>
</blockquote>
<section id="learning-rate-sweep" class="level3">
<h3 class="anchored" data-anchor-id="learning-rate-sweep">Learning Rate Sweep</h3>
<p>The learning rate is the most critical hyperparameter to get right first. It determines whether the policy updates are large enough to learn but not so large that they cause the policy to collapse. Moreover, unlike supervised learning where a bad <code>lr</code> just causes loss divergence, in GRPO a high <code>lr</code> can cause the policy to collapse onto degenerate outputs before learning anything useful.</p>
<p>To find the right <code>lr</code>, I ran a log-spaced search from <code>1e-6</code> to <code>1e-4</code> for 100 steps each on H100.</p>
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/lr_sweep/lr_sweep.png" class="img-fluid"></p>
<ul>
<li><code>1e-6</code> and <code>3e-6</code> barely move the eval reward accuracy. The gradient signal is too small to update the policy meaningfully.</li>
<li><code>1e-4</code> shows policy collapse, with spikes in mean response length and token entropy dropping to near zero.</li>
<li><code>1e-5</code> to <code>3e-5</code>: reward rises steadily, response length stabilizes and token entropy drops smoothly.</li>
</ul>
<blockquote class="blockquote">
<p><strong>Winner: <code>lr=3e-5</code></strong> as it gives the most stable training and highest reward accuracy. I used this for all future runs.</p>
</blockquote>
</section>
<section id="baseline-ablation" class="level3">
<h3 class="anchored" data-anchor-id="baseline-ablation">Baseline Ablation</h3>
<p>The vanilla REINFORCE gradient has notoriously high variance. A common technique is to subtract a baseline (the group mean reward) from the advantage which reduces variance without introducing bias. Here I tested whether that variance reduction actually matters in practice.</p>
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/baselines/baseline_ablation.png" class="img-fluid"></p>
<ul>
<li><code>reinforce_with_baseline</code>: evaluation reward accuracy steadily climbs to ~0.61 with stable gradient norm and consistent mean response length around 300-350 tokens.</li>
<li>Both <code>no_baseline</code> runs peak early then decline. Their gradient norms are much higher and appear to keep increasing, and both suffer rapid response length collapse after some steps.</li>
</ul>
<blockquote class="blockquote">
<p>Subtracting the group mean reward from the advantage reduces variance and prevents response length collapse. <code>reinforce_with_baseline</code> is the clear choice.</p>
</blockquote>
</section>
<section id="length-normalization" class="level3">
<h3 class="anchored" data-anchor-id="length-normalization">Length Normalization</h3>
<p>When aggregating per-token losses over the sequence dimension, the choice of normalization affects how much gradient signal each individual token receives. As noted in the assignment, it is not necessary or even correct to always average losses by sequence length. I tested three modes:</p>
<ul>
<li><code>mean</code>: divide by number of response tokens per sequence. In this case, short correct answers get disproportionately large per-token gradients.</li>
<li><code>constant</code>: divide by a fixed constant (<code>max_gen_len</code>=1024). Every token gets the same gradient magnitude regardless of sequence length (used in DeepSeek).</li>
<li><code>microbatch</code>: normalize by the longest response in the current microbatch. This is a middle ground between <code>mean</code> and <code>constant</code>.</li>
</ul>
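<p>The three modes differ only in the denominator used when summing per-token losses over a sequence. The sketch below illustrates that idea in plain Python on nested lists. It is not the code from <code>utils/grpo.py</code>: the function name <code>aggregate_loss</code>, its arguments and the final average over sequences are assumptions made for the example.</p>

```python
def aggregate_loss(per_token_loss, mask, mode, max_gen_len=1024):
    """Aggregate per-token losses for one microbatch.

    per_token_loss: list of sequences (lists of floats), padded to equal length.
    mask: same shape, 1 for response tokens and 0 for padding.
    mode: "mean", "constant" or "microbatch".
    """
    # Longest response in this microbatch (used by the "microbatch" mode).
    longest = max(sum(m) for m in mask)
    seq_losses = []
    for losses, m in zip(per_token_loss, mask):
        total = sum(l * k for l, k in zip(losses, m))
        if mode == "mean":
            denom = sum(m)        # number of response tokens in this sequence
        elif mode == "constant":
            denom = max_gen_len   # same fixed constant for every sequence
        elif mode == "microbatch":
            denom = longest
        else:
            raise ValueError("unknown mode: " + mode)
        seq_losses.append(total / denom)
    # Average over the sequences in the microbatch.
    return sum(seq_losses) / len(seq_losses)
```

<p>With <code>mean</code>, a 2-token response and a 4-token response with the same per-token loss contribute equally, so each token of the shorter response carries more weight. With <code>constant</code>, every token is scaled by the same factor.</p>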
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/length_normalization/length_normalization.png" class="img-fluid"></p>
<ul>
<li>All three modes converge to similar final reward accuracy and mean response length.</li>
<li>The main difference is in gradient norm. <code>constant</code> produces consistently lower norms than <code>microbatch</code> and <code>mean</code>. This is expected since <code>constant</code> divides everything by 1024 (max generation length), which is 2-2.5x larger than typical response length.</li>
</ul>
<blockquote class="blockquote">
<p>Length normalization mode has minimal impact on final reward for math reasoning with binary reward. The primary observable difference is in gradient scale and not learning dynamics. I kept <code>mean</code> as the default.</p>
</blockquote>
</section>
<section id="standard-deviation-normalization" class="level3">
<h3 class="anchored" data-anchor-id="standard-deviation-normalization">Standard Deviation Normalization</h3>
<p>The standard GRPO advantage computation divides by the group standard deviation: <code>advantage_i = (reward_i - mean(group)) / (std(group) + eps)</code>. But <a href="https://arxiv.org/abs/2503.20783">Dr.&nbsp;GRPO</a> argued that this can introduce unwanted biases where groups with low variance (too easy or too hard questions where all rollouts get the same reward) produce near-zero std deviation, inflating their advantages disproportionately. They proposed removing the division entirely. This ablation tests whether removing that division actually helps.</p>
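<p>The two variants compared in this ablation can be sketched as follows. This is a minimal pure-Python illustration, not the repo's implementation: the function name <code>group_advantages</code>, the <code>use_std</code> flag and the use of the population standard deviation are assumptions for the example (a PyTorch implementation may use the sample standard deviation instead).</p>

```python
import statistics

def group_advantages(rewards, group_size, use_std=True, eps=1e-6):
    """Group-relative advantages for a flat list of rewards.

    rewards: rewards ordered so that each consecutive block of
             `group_size` entries belongs to the same prompt.
    use_std: if False, only subtract the group mean (no std division).
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        centered = [r - mean for r in group]
        if use_std:
            # Assumption: population std; eps avoids division by zero.
            std = statistics.pstdev(group)
            centered = [c / (std + eps) for c in centered]
        advantages.extend(centered)
    return advantages
```

<p>For a group with rewards <code>[1, 0, 0, 0]</code>, the single correct rollout gets an advantage of 0.75 without std normalization and roughly 1.73 with it. A group where every rollout receives the same reward yields all-zero advantages in both modes, since the centered rewards are already zero.</p>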
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/std_dev/std_dev_normalization.png" class="img-fluid"></p>
<ul>
<li><strong>With std normalization</strong>: reaches higher final reward accuracy (~0.72).</li>
<li><strong>Without</strong>: plateaus at ~0.65, but gradient norms are lower and more stable.</li>
</ul>
<p>Removing std dev normalization does improve gradient stability, which confirms the observation that dividing by the group std dev amplifies gradients for low-variance groups. However, the improved stability does not translate to better performance here. The ~0.07 reward gap is substantial enough to justify the slightly noisier gradients.</p>
<blockquote class="blockquote">
<p><strong>Winner: keep std normalization</strong> as the reward benefit outweighs the noisier gradients.</p>
</blockquote>
</section>
<section id="off-policy-sweep" class="level3">
<h3 class="anchored" data-anchor-id="off-policy-sweep">Off-Policy Sweep</h3>
<p>On-policy training is clean but we are generating a full batch of rollouts just to take a single gradient step. I wanted to test how far I could push off-policy reuse (multiple gradient steps per rollout batch) before the policy drifts too far and training destabilizes.</p>
<section id="broad-sweep-50-steps" class="level4">
<h4 class="anchored" data-anchor-id="broad-sweep-50-steps">Broad sweep (50 steps)</h4>
<p>First, I ran a broad sweep over 6 configs varying <code>epochs_per_rollout_batch</code> and <code>train_batch_size</code> ranging from on-policy (1 optimizer step per GRPO step) to aggressive off-policy (16 optimizer steps per GRPO step). All runs use <code>grpo_clip</code> loss, <code>lr=3e-5</code>.</p>
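<p>The number of optimizer steps per GRPO step follows from these two knobs. The sketch below assumes a rollout batch of 256 responses per GRPO step, which is an assumption chosen to be consistent with the step counts quoted in this section and not a value stated here. It also ignores gradient accumulation, on the assumption that it only splits a train batch into microbatches without adding optimizer steps.</p>

```python
def optimizer_steps_per_grpo_step(epochs_per_rollout_batch, train_batch_size,
                                  rollout_batch_size=256):
    """Optimizer steps taken on one batch of rollouts.

    rollout_batch_size=256 is an assumed value, not taken from the configs.
    """
    if rollout_batch_size % train_batch_size != 0:
        raise ValueError("rollout batch must be a multiple of the train batch")
    steps_per_epoch = rollout_batch_size // train_batch_size
    return epochs_per_rollout_batch * steps_per_epoch


# (epochs_per_rollout_batch, train_batch_size) pairs read off the config names
for epochs, train_batch in ((1, 256), (2, 256), (4, 64)):
    print(epochs, train_batch, optimizer_steps_per_grpo_step(epochs, train_batch))
```

<p>Under that assumption, <code>e1_tb256</code> gives 1 step (on-policy), <code>e2_tb256</code> gives 2 and <code>e4_tb64</code> gives 16.</p>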
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/off_policy_sweep/off_policy_sweep.png" class="img-fluid"></p>
<ul>
<li>Most configs converge to ~0.55-0.65. The clear outlier is <code>e4_tb64_ga16</code> (16 opt steps/GRPO). It collapses mid-way with gradient norm spikes and response length collapsing to ~100 tokens. <strong>This is a classic failure mode where the policy drifts too far from the rollout distribution and the model learns to produce minimal outputs.</strong></li>
<li>Mild off-policy (2 opt steps/GRPO) works as well as on-policy.</li>
</ul>
<blockquote class="blockquote">
<p>Aggressive off-policy reuse (16 opt steps/GRPO) causes policy collapse. Mild off-policy (2 opt steps) looks comparable to on-policy in this short sweep.</p>
</blockquote>
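<p>For intuition on why drift matters, here is a minimal per-token sketch of a PPO-style clipped objective, which is what I assume <code>grpo_clip</code> refers to. The function name and <code>cliprange=0.2</code> are illustrative assumptions and this is not the exact training code.</p>

```python
import math


def grpo_clip_token_loss(new_logprob, old_logprob, advantage, cliprange=0.2):
    """Clipped surrogate loss for a single token (to be minimized)."""
    # Importance ratio between the current policy and the rollout policy.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = min(max(ratio, 1.0 - cliprange), 1.0 + cliprange)
    # Take the more pessimistic of the unclipped and clipped objectives.
    return -min(ratio * advantage, clipped_ratio * advantage)


# On-policy: the ratio is exactly 1 and the loss is just -advantage.
print(grpo_clip_token_loss(-1.0, -1.0, advantage=1.0))
# After several updates on the same rollouts the ratio drifts (here ~1.65),
# the clipped branch is selected and this token stops contributing gradient.
print(grpo_clip_token_loss(-0.5, -1.0, advantage=1.0))
```

<p>On-policy, every token has a ratio of 1 and clipping never triggers. The more optimizer steps are taken on the same rollouts, the further the ratios move from 1.</p>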
</section>
<section id="full-sweep-200-steps" class="level4">
<h4 class="anchored" data-anchor-id="full-sweep-200-steps">Full sweep (200 steps)</h4>
<p>I then selected the three most promising configs (on-policy and two mild off-policy) for full 200 step training.</p>
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/off_policy_full_sweep/off_policy_full_sweep.png" class="img-fluid"></p>
<ul>
<li>On-policy (<code>e1_tb256_ga64</code>) is consistently the best. It converges fastest and maintains highest reward accuracy (~0.65-0.75). The two mild off-policy configs track slightly behind it.</li>
<li><code>e2_tb256_ga64</code> (2 epochs) shows higher gradient norm variance with spikes but doesn’t destabilize.</li>
</ul>
<blockquote class="blockquote">
<p>On-policy training works better in this case. Reusing rollouts does not improve reward here, so the extra optimizer steps per GRPO step are not justified.</p>
</blockquote>
</section>
</section>
<section id="prompt-template-ablation" class="level3">
<h3 class="anchored" data-anchor-id="prompt-template-ablation">Prompt Template Ablation</h3>
<p>Here, I compared the <code>r1_zero</code> prompt (structured <code>&lt;think&gt;...&lt;/think&gt;</code> and <code>&lt;answer&gt;...&lt;/answer&gt;</code> blocks) against question-only (just <code>{question}</code>), each with a matching reward function.</p>
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/prompt_ablation/prompt_ablation.png" class="img-fluid"></p>
<ul>
<li>Question-only starts with much higher accuracy because Qwen2.5-Math-1.5B seems to be pre-trained on math data with <code>\boxed{}</code> formatting. It already solves nearly half the problems out of the box. In comparison, r1-zero starts near zero (unfamiliar format) but catches up quickly once the model learns the structured format. Finally, r1-zero consistently performs better than question-only by the end.</li>
<li>r1-zero settles at much lower entropy than question-only. This is expected since the r1-zero prompt is more structured and constrains the output space.</li>
</ul>
<blockquote class="blockquote">
<p><strong>Winner: R1-zero structured prompt</strong> as it gives the model a dedicated scratchpad to reason in before committing to an answer. Without this, reasoning is interleaved with the answer in less predictable ways.</p>
</blockquote>
</section>
<section id="sft-checkpoint-initialization" class="level3">
<h3 class="anchored" data-anchor-id="sft-checkpoint-initialization">SFT Checkpoint Initialization</h3>
<p>This was not part of the assignment but I thought it would be interesting to see how starting from an SFT checkpoint affects performance. We already have an SFT model that gets ~53% accuracy. Can GRPO push it even higher?</p>
<p>I ran five runs: the base model (no SFT), three SFT checkpoints (early/mid/final) and the final checkpoint with a lower <code>lr</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/grpo/results/sft_grpo/sft_grpo.png" class="img-fluid figure-img"></p>
<figcaption>SFT -&gt; GRPO sweep: eval reward, format reward, entropy</figcaption>
</figure>
</div>
<ul>
<li>The base model (no SFT) still performs better than the SFT runs. The more SFT training the starting checkpoint has seen, the lower the accuracy at which GRPO plateaus. Similarly, we see higher entropy in the SFT runs in comparison to the base model.</li>
<li>This indicates that the SFT checkpoints are not helping GRPO training and are actually hurting it. SFT narrows the policy distribution early on, limiting the exploration that RL needs to discover better strategies. The more SFT training the checkpoint has seen, the narrower the distribution and the lower the GRPO ceiling.</li>
</ul>
<blockquote class="blockquote">
<p><strong>SFT initialization hurts in this case</strong>: the pre-narrowed distribution limits exploration before RL even starts.</p>
</blockquote>
</section>
</section>
<section id="summary-and-key-takeaways" class="level2">
<h2 class="anchored" data-anchor-id="summary-and-key-takeaways">Summary and Key Takeaways</h2>
<p><strong>Best configuration:</strong></p>
<ul>
<li>on-policy (<code>epochs_per_rollout_batch=1</code>)</li>
<li>lr=<code>3e-5</code></li>
<li>loss_type=<code>grpo_clip</code></li>
<li>use_std_normalization=<code>True</code></li>
<li><code>r1_zero</code> structured prompt</li>
<li>base model (no SFT initialization)</li>
</ul>
<p><strong>Best performance:</strong> ~0.75 accuracy on MATH validation (up from ~0.03 for the base model).</p>
<p><strong>Key lessons:</strong></p>
<ul>
<li><strong>Eval reward accuracy is the north star metric</strong>. It directly measures what we care about, especially in reinforcement learning with verifiable rewards.</li>
<li>However, <strong>gradient norm</strong> and <strong>mean_response_length</strong> are the two other important metrics to watch. They are the early warning signals for instability and reward collapse.</li>
<li><strong>Binary math reward is robust to some design choices</strong> (length normalization) <strong>but sensitive to others</strong> (baseline subtraction, learning rate).</li>
<li><strong>On-policy training wins</strong> in this case: reusing rollouts introduces policy drift that is not worth the compute savings.</li>
<li><strong>The R1-zero structured prompt matters</strong>. A dedicated reasoning scratchpad produces sharper final policies and higher accuracy.</li>
<li><strong>SFT initialization does not help</strong> in this case.</li>
</ul>
<p>Ideally, I should now try to push the accuracy further by training for longer, using curriculum strategies or modifying the GRPO loss itself. <strong>But that’s for another time! I think it is enough learning and compute expense for now.</strong></p>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<section id="papers-and-blog-posts" class="level3">
<h3 class="anchored" data-anchor-id="papers-and-blog-posts">Papers and Blog Posts</h3>
<ul>
<li><a href="https://huggingface.co/blog/garg-aayush/derive-grpo-loss">Deriving the GRPO Loss</a>: My blog post deriving the GRPO loss function</li>
<li><a href="https://github.com/stanford-cs336/assignment5-alignment">CS336 Assignment 5</a>: Stanford CS336 alignment assignment I followed as a reference</li>
<li><a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">Building SFT from Ground Up</a>: My previous SFT experiments blog post</li>
<li><a href="https://huggingface.co/blog/garg-aayush/expert-iteration-math-reasoning">Expert Iteration for Math Reasoning</a>: My Expert Iteration experiments blog post</li>
</ul>
</section>
<section id="code-and-artifacts" class="level3">
<h3 class="anchored" data-anchor-id="code-and-artifacts">Code and Artifacts</h3>
<ul>
<li><strong>Code</strong>: <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/grpo">building-from-scratch/grpo</a></li>
<li><strong>Configs</strong>: <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/grpo/configs">grpo/configs</a></li>
<li><strong>Trained Checkpoints</strong>: <a href="https://huggingface.co/garg-aayush/cs336-grpo-exps">garg-aayush/cs336-grpo-exps</a></li>
<li><strong>Datasets</strong>: <a href="https://github.com/kkaitlyn111/cs336-a5-RL/tree/main/MATH">CS336 MATH dataset</a></li>
<li><strong>Training Logs</strong>: <a href="https://wandb.ai/garg-aayush/grpo">wandb.ai/garg-aayush/grpo</a></li>
</ul>


</section>
</section>

 ]]></description>
  <category>RL &amp; Alignment</category>
  <guid>https://garg-aayush.github.io/posts/2026-02-26-grpo-from-scratch/</guid>
  <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Tools for Better Claude Code and Terminal Experience</title>
  <link>https://garg-aayush.github.io/posts/2026-02-12-cc-tools/</link>
  <description><![CDATA[ 




<p>I have been using <a href="https://cursor.com/">Cursor</a> for a long time and recently started using <a href="https://claude.com/code">Claude Code</a> in my day-to-day work. As a result, over the past few weeks I have spent a lot more time in the terminal and a handful of tools and practices have stuck with me. None of them are groundbreaking or elaborate but they have made my overall experience noticeably more productive. This is a short post sharing what has worked for me in case any of it is useful to you.</p>
<section id="tracking-your-ai-usage" class="level2">
<h2 class="anchored" data-anchor-id="tracking-your-ai-usage">Tracking Your AI Usage</h2>
<p>If you are using Claude Code, you are likely to hit the session limit fairly often. Knowing how much you are consuming and how close you are to the limits therefore comes in handy, and having this visibility across Claude Code and other providers is even better.</p>
<section id="codexbar" class="level3">
<h3 class="anchored" data-anchor-id="codexbar">CodexBar</h3>
<p><a href="https://github.com/steipete/CodexBar">CodexBar</a> is a lightweight menu bar app that tracks usage and limits across different AI providers. It shows session and weekly limits (credits) right in the menu bar. This is genuinely useful as you always know where you stand.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/codexbar-snapshot.jpg" class="img-fluid" style="width:40.0%"></p>
</section>
<section id="ccusage" class="level3">
<h3 class="anchored" data-anchor-id="ccusage">ccusage</h3>
<p><a href="https://github.com/ryoppippi/ccusage">ccusage</a> is a CLI tool for analyzing Claude Code usage from local JSON files. It gives you session-wise, daily and monthly breakdowns of your usage. It is useful for understanding your consumption patterns and keeping an eye on how much you are actually using. Most importantly, it is a great way to get inline usage info in Claude Code. <img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/ccusage-snapshot.jpg" class="img-fluid" style="width:100.0%"></p>
</section>
</section>
<section id="managing-your-ai-context" class="level2">
<h2 class="anchored" data-anchor-id="managing-your-ai-context">Managing Your AI Context</h2>
<p>One of the biggest challenges with AI assistants is getting the right context in and keeping the context window healthy.</p>
<section id="context-commands-context-clear-compact" class="level3">
<h3 class="anchored" data-anchor-id="context-commands-context-clear-compact">Context Commands: <code>/context</code>, <code>/clear</code>, <code>/compact</code></h3>
<p>Most of us are aware of these three Claude Code commands but, maybe out of habit, we don’t use them often enough and end up hitting the session limit.</p>
<ul>
<li><strong><code>/context</code></strong> — it provides a great visualization of your current conversation context usage as a colored grid. <strong>I prefer to use it often to understand how much of the context window is consumed, loaded skills, mcps, tools and corresponding tokens used.</strong> <img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/context-snapshot.jpg" class="img-fluid" style="width:75.0%"></li>
<li><strong><code>/clear</code></strong> — this completely resets your conversation history. <strong>I use it between unrelated tasks to start fresh with an empty context window avoiding both context degradation and session limit.</strong></li>
<li><strong><code>/compact [instructions]</code></strong> — It summarizes your conversation to free up context space while preserving important information.</li>
</ul>
<blockquote class="blockquote">
<p><strong>Rules of thumb I follow while using Claude Code:</strong></p>
<ul>
<li><strong>Haiku / Sonnet:</strong> I start a new conversation or use <code>/compact</code> at ~50-60% context usage. Beyond this, I consistently see context rot with noticeable drop in model performance.</li>
<li><strong>Opus:</strong> I can push it to ~70-80% but usually start a fresh conversation whenever possible to save tokens and avoid hitting session limits.</li>
<li><strong>Never start a new task mid-conversation.</strong> Either compact or preferably start a new one.</li>
</ul>
</blockquote>
</section>
<section id="context7" class="level3">
<h3 class="anchored" data-anchor-id="context7">Context7</h3>
<p><a href="https://context7.com/">Context7</a> MCP server provides AI assistants with up-to-date documentation and code examples for various libraries and frameworks. As a developer, I find this genuinely useful as it solves the problem of outdated training data by fetching current docs in real-time. Whether I am writing boilerplate code, setting up tests or working with a library I have not used recently, Context7 ensures the AI has accurate and up to date API references instead of hallucinating outdated patterns.</p>
<p>The free plan gives you around <code>1000</code> API requests per month which is more than enough for most use cases. <strong>This is one MCP server I use regularly and recommend to everyone.</strong></p>
</section>
<section id="repomix-and-jina-reader" class="level3">
<h3 class="anchored" data-anchor-id="repomix-and-jina-reader">RepoMix and Jina Reader</h3>
<p>Sometimes you need to provide context as a single file, whether for web-based AI interfaces like ChatGPT or Claude where you cannot point at a codebase directly or when you want to feed the content of a webpage to an LLM.</p>
<p><a href="https://repomix.com/">RepoMix</a> handles the codebase side. It lets you take a git repo or any local folder and concatenate all the files (based on a pattern) into a single file. You can create a context file for an entire repo or filter it down to just the files relevant to the bug or feature you are working on. It is along similar lines to Karpathy’s <a href="https://github.com/karpathy/rendergit">rendergit</a> which renders any git repo into a single static HTML page for humans or LLMs.</p>
<p><a href="https://github.com/jina-ai/reader">Jina Reader</a> handles the web side. It converts any HTML page into clean markdown which is much better than feeding raw HTML as context to an LLM. You just need to prepend <code>https://r.jina.ai/</code> to any URL.</p>
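<p>The prefix trick is easy to script. Here is a minimal sketch using only the Python standard library. The helper names are hypothetical, and <code>fetch_as_markdown</code> assumes the service answers a plain GET request with the converted page, so treat it as a starting point and not a tested client.</p>

```python
import urllib.request

JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(page_url):
    """Build the Reader URL by prepending the prefix to the page URL."""
    return JINA_READER_PREFIX + page_url


def fetch_as_markdown(page_url, timeout=30):
    """Fetch the markdown rendering of a page (assumes a plain GET works)."""
    with urllib.request.urlopen(jina_reader_url(page_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(jina_reader_url("https://example.com/some-post"))
```

<p>The returned markdown can then be saved to a file or piped into whichever LLM tool you are using.</p>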
</section>
</section>
<section id="prefer-skills-over-mcp-servers" class="level2">
<h2 class="anchored" data-anchor-id="prefer-skills-over-mcp-servers">Prefer Skills Over MCP Servers</h2>
<p>MCP server tool descriptions consume tokens upfront, often hundreds or even thousands, regardless of whether you actually use them in that session. <a href="https://agentskills.io/home">Skills</a> on the other hand use progressive loading where Claude sees only the name and description (~30-100 tokens) at startup and loads the full instructions only when relevant.</p>
<p>For example, here is the difference in context usage between <a href="https://context7.com/skills/playwright-mcp">Playwright as an MCP server</a> versus <a href="https://github.com/microsoft/playwright-cli">Playwright as a skill</a>:</p>
<table align="center">
<tbody><tr>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/context-skills.jpg" alt="Playwright Skills" width="100%">
</td>
<td>
<img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/context-mcp.jpg" alt="Playwright MCP" width="100%">
</td>
</tr>
</tbody></table>
<p align="center">
<em>Difference in context usage between Playwright as a skill vs Playwright as an MCP server</em>
</p>
<p>My recommendation is to use skills whenever possible. For example, use <a href="https://github.com/huggingface/skills">Hugging Face Skills</a> and <a href="https://github.com/microsoft/playwright-cli">Playwright CLI Skills</a> instead of the corresponding MCP servers.</p>
</section>
<section id="the-llm-cli-tool" class="level2">
<h2 class="anchored" data-anchor-id="the-llm-cli-tool">The <code>llm</code> CLI Tool</h2>
<p>This is the tool I use a lot in the terminal outside of Claude Code itself. Simon Willison’s <a href="https://llm.datasette.io/en/stable/">llm</a> is a command-line tool for interacting with LLMs directly from the terminal. It supports multiple providers and models, stores conversation logs and is endlessly composable with other CLI tools.</p>
<p>I still prefer working in <code>iTerm2</code>. I do not use agentic terminals like <a href="https://warp.dev/">Warp</a>. Cursor and Claude Code are my standard coding assistants but not everything needs a full agent session. For example, while I am in the terminal, I sometimes need to quickly answer a question, run a single bash command or explain/review a file. These are the tasks where <code>llm</code> comes in really handy. It is fast, intuitive to use and does not consume my Claude Code session limits.</p>
<p>One way I use <code>llm</code> in my day-to-day work is by wrapping it as shell functions in my <code>.zshrc</code>. Each function has a specific system prompt tuned for its task, giving me dedicated LLM-based commands I can use without thinking. Here are some I use most often:</p>
<pre><code>cmd &lt;query&gt;                       Convert natural language to a shell command
explain &lt;question&gt; [file]         Answer a question, optionally using a file as context
image_qa &lt;question&gt; &lt;image&gt;       Ask a question about an image (vision)
pycode [-x|--exec] &lt;task&gt;         Generate a Python script from a task description
                                    -x  also execute the script via uv run</code></pre>
<p>The pattern I follow for each function is the same: define a model, write a system prompt, wrap <code>llm</code> in a function. For example, here is the <code>cmd</code> function:</p>
<section id="cmd" class="level4">
<h4 class="anchored" data-anchor-id="cmd"><code>cmd</code></h4>
<p>It converts natural language to a raw shell command (something we all need to do every day).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">CMD_LLM</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-5-mini"</span></span>
<span id="cb2-2"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">CMD_SYSTEM_PROMPT</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are inside a macOS terminal. Output only the raw shell command(s). </span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb2-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">No formatting, no code blocks, no explanations, no extra text."</span></span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cmd()</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">{</span></span>
<span id="cb2-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[[</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$1</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--help"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">||</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">-z</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$1</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">]];</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span></span>
<span id="cb2-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Usage: cmd &lt;natural language shell command&gt;"</span></span>
<span id="cb2-8">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"  Converts natural language to a raw shell command using LLM."</span></span>
<span id="cb2-9">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"  Example: cmd 'find all png files larger than 1MB'"</span></span>
<span id="cb2-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">fi</span></span>
<span id="cb2-12">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">llm</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$CMD_LLM</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-s</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$CMD_SYSTEM_PROMPT</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$1</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb2-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">}</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">$</span> cmd <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"find all png files larger than 1MB"</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">find</span> . <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-name</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*.png"</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-size</span> +1M</span></code></pre></div></div>
<p>I would encourage you to write your own wrappers for whatever repetitive tasks you have. And since <code>llm</code> supports <a href="https://llm.datasette.io/en/stable/plugins/directory.html">many providers and models</a>, you can swap in whichever model works best for each task.</p>
</section>
</section>
<section id="upgrading-your-terminal-basics" class="level2">
<h2 class="anchored" data-anchor-id="upgrading-your-terminal-basics">Upgrading Your Terminal Basics</h2>
<p>The tools below are not AI-specific but make my terminal experience much better.</p>
<section id="eza" class="level3">
<h3 class="anchored" data-anchor-id="eza">eza</h3>
<p><a href="https://github.com/eza-community/eza">eza</a> is a feature rich replacement for <code>ls</code> with color highlighting, icons for different file types, git awareness and tree views.</p>
<p>I have the following aliases in my <code>.zshrc</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">alias</span> lla=<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'eza -alh --git --sort=modified --icons'</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># list view with hidden files</span></span>
<span id="cb4-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">alias</span> ll=<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'eza -lh --git --sort=modified --icons'</span>    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># list view sorted by timestamp</span></span>
<span id="cb4-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">alias</span> lt=<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'eza -lh --git --sort=modified --tree --level=2 --icons'</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tree view</span></span></code></pre></div></div>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/eza-snapshot.jpg" class="img-fluid" style="width:75.0%"></p>
</section>
<section id="bat" class="level3">
<h3 class="anchored" data-anchor-id="bat">bat</h3>
<p><a href="https://github.com/sharkdp/bat">bat</a> is simply <code>cat</code> with syntax highlighting, line numbers and git integration. It automatically detects the file type and renders accordingly.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/bat-snapshot.jpg" class="img-fluid" style="width:75.0%"></p>
</section>
<section id="yazi" class="level3">
<h3 class="anchored" data-anchor-id="yazi">yazi</h3>
<p><a href="https://github.com/sxyazi/yazi">yazi</a> is a fast terminal file manager with support for previewing different file types including images and PDFs right in the terminal. <strong>This is especially useful when you are already deep in a terminal session with Claude Code and need to quickly browse files or check an image without switching to Finder.</strong></p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-02-12/yazi-snapshot.jpg" class="img-fluid" style="width:100.0%"></p>
</section>
</section>
<section id="wrapping-up" class="level2">
<h2 class="anchored" data-anchor-id="wrapping-up">Wrapping Up</h2>
<p>None of these tools and tips are complex or elaborate and most of them are not even AI-specific. But together they have quietly improved how I work: better visibility into usage, healthier context windows, quicker LLM access from the terminal and a slightly nicer shell experience.</p>


</section>

 ]]></description>
  <category>Tools &amp; Infra</category>
  <guid>https://garg-aayush.github.io/posts/2026-02-12-cc-tools/</guid>
  <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Expert Iteration for Math Reasoning</title>
  <link>https://garg-aayush.github.io/posts/2026-01-23-expert-iteration/</link>
  <description><![CDATA[ 




<p>Expert Iteration, unlike SFT, does not require human-annotated reasoning traces. Instead, the model generates candidate solutions, filters for correct ones and trains on its own successful attempts, thus bypassing the need for expensive, high-quality external annotations.</p>
<p>This post discusses my experiments with Expert Iteration for math reasoning as part of <a href="https://github.com/stanford-cs336/assignment5-alignment">CS336 Assignment 5</a>. This post continues from my <a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">previous SFT experiments</a>. I will share what makes it work (and what does not), explore hyperparameter choices and compare the performance against SFT.</p>
<section id="what-is-expert-iteration" class="level2">
<h2 class="anchored" data-anchor-id="what-is-expert-iteration">What is Expert Iteration?</h2>
<p>Expert Iteration was first introduced by <a href="https://arxiv.org/abs/1705.08439">Anthony et al.&nbsp;(2017)</a> in the context of game-playing AI. The core idea is to alternate between using a slow but powerful <strong>expert</strong> to find high-quality solutions and training a fast <strong>apprentice</strong> to imitate those solutions. As the apprentice improves, it provides a better starting point for the expert, creating a virtuous cycle of improvement.</p>
<p>The <a href="https://arxiv.org/abs/2203.14465">Self-taught Reasoner (STaR)</a> uses the Expert Iteration idea to improve reasoning capabilities. It works as follows: at each iteration, we prompt the model to generate rationales for many problems, filter to keep only those that lead to correct answers, finetune on the filtered set and repeat.</p>
<blockquote class="blockquote">
<p>Here, the filtering step acts as the “expert”. It identifies which of the model’s own generations are correct. <strong>The model then learns to imitate its own best behaviors.</strong></p>
</blockquote>
<p>I see Expert Iteration for LLMs as <strong>SFT on self-generated, filtered data, repeated iteratively</strong> as shown in this loop:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/expert-iteration-diagram.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>Note: The STaR paper retrains from the base pretrained model at each iteration to avoid overfitting and introduces rationalization for failed problems. The approach used here (following the CS336 assignment) does not use rationalization and continues training from the previous iteration’s checkpoint.</p>
</blockquote>
<p>A key advantage of Expert Iteration over SFT is that <strong>we do not need reasoning traces generated by humans or other LLMs</strong>. The model generates its own reasoning traces and correctness filtering ensures quality. This makes Expert Iteration particularly attractive for domains with verifiable rewards like math or code.</p>
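<p>The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from my repository: <code>generate</code> and <code>finetune</code> are hypothetical callables standing in for the rollout and SFT steps, and correctness is checked by plain string equality against the ground-truth answer.</p>
<pre><code>import random

def expert_iteration(questions, answers, generate, finetune,
                     num_ei=5, batch_per_ei=4, num_rollouts=2, seed=0):
    """Sketch of the loop: sample, generate, filter, finetune, repeat.

    `generate(question)` returns one (trace, final_answer) pair and
    `finetune(examples)` runs one SFT pass. Both are hypothetical callables.
    Returns the number of filtered examples kept at each iteration.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(num_ei):
        # Sample D question indices for this iteration.
        batch = rng.sample(range(len(questions)), k=min(batch_per_ei, len(questions)))
        kept = []
        for idx in batch:
            # Generate R rollouts per question.
            for _ in range(num_rollouts):
                trace, predicted = generate(questions[idx])
                # The filter plays the role of the "expert": keep correct rollouts only.
                if predicted == answers[idx]:
                    kept.append((questions[idx], trace))
        if kept:
            # Continue from the previous checkpoint (no reset to the base model).
            finetune(kept)
        history.append(len(kept))
    return history</code></pre>
<p>In a real run, <code>generate</code> would wrap the sampling call of your inference engine and <code>finetune</code> would wrap the existing SFT training loop.</p>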
<section id="connection-to-reinforcement-learning" class="level3">
<h3 class="anchored" data-anchor-id="connection-to-reinforcement-learning">Connection to Reinforcement Learning</h3>
<p>You can also understand expert iteration as an approximation to policy gradient RL. To see this, consider the REINFORCE objective with a binary reward <img src="https://latex.codecogs.com/png.latex?r%20%5Cin%20%5C%7B0,%201%5C%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7Bo%20%5Csim%20%5Cpi_%5Ctheta%7D%5Br%20%5Ccdot%20%5Cnabla%20%5Clog%20%5Cpi_%5Ctheta(o%7Cq)%5D%0A"></p>
<p>When we filter for correct outputs (where <img src="https://latex.codecogs.com/png.latex?r%20=%201">) we are essentially zeroing out the gradient contribution from incorrect samples. The remaining gradient update:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla%20J(%5Ctheta)%20%5Capprox%20%5Cmathbb%7BE%7D_%7Bo%20%5Csim%20%5Cpi_%5Ctheta,%20r=1%7D%5B%5Cnabla%20%5Clog%20%5Cpi_%5Ctheta(o%7Cq)%5D%0A"></p>
<p>is exactly what SFT on filtered data computes.</p>
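<p>A small numerical check makes this concrete. The sketch below assumes a toy softmax policy over three discrete outputs (my own stand-in, not the actual model) and treats each output as a single sample rather than a token sequence. With a binary reward, the sample estimate of the REINFORCE gradient equals the mean log-likelihood gradient over the correct samples multiplied by the fraction of samples that are correct, so the two updates point in the same direction.</p>
<pre><code>import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs.
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_estimate(probs, samples, rewards):
    # Monte Carlo estimate of E[r * grad log pi(o)] over all samples.
    n = len(samples)
    grad = [0.0] * len(probs)
    for o, r in zip(samples, rewards):
        g = grad_log_prob(probs, o)
        grad = [acc + r * gi / n for acc, gi in zip(grad, g)]
    return grad

def filtered_sft_gradient(probs, samples, rewards):
    # Mean log-likelihood gradient over the correct (r == 1) samples only.
    kept = [o for o, r in zip(samples, rewards) if r == 1]
    grad = [0.0] * len(probs)
    for o in kept:
        g = grad_log_prob(probs, o)
        grad = [acc + gi / len(kept) for acc, gi in zip(grad, g)]
    return grad</code></pre>
<p>For any set of samples with at least one correct output, <code>reinforce_estimate</code> equals <code>filtered_sft_gradient</code> scaled by the fraction of correct samples.</p>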
</section>
</section>
<section id="experimental-setup" class="level2">
<h2 class="anchored" data-anchor-id="experimental-setup">Experimental Setup</h2>
<p>I use the same model and evaluation setup as my <a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">previous SFT experiments</a>. Looking at the Expert Iteration diagram above, you can see how straightforward it is to extend an <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/sft">SFT</a> to <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/expert-iteration">Expert Iteration</a> codebase. You basically just wrap the training loop with generation and filtering steps with other minor changes.</p>
<section id="model-and-dataset" class="level3">
<h3 class="anchored" data-anchor-id="model-and-dataset">Model and Dataset</h3>
<ul>
<li><strong>Model</strong>: <a href="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B">Qwen2.5-Math-1.5B</a> (base, not instruction-tuned)</li>
<li><strong>Training data</strong>: ~3.5K problems from the <a href="https://huggingface.co/datasets/hiyouga/math12k">MATH</a> dataset (same data as in the SFT experiments)</li>
<li><strong>Validation data</strong>: 5K problems from CS336 Assignment 5 for evaluation</li>
</ul>
<blockquote class="blockquote">
<p>This is the same dataset I used in the SFT experiments. However, unlike SFT where I trained on GPT-generated reasoning traces, <strong>Expert Iteration only needs the problems and their ground-truth answers</strong>. The model generates candidate reasoning traces on the fly.</p>
</blockquote>
</section>
<section id="key-hyperparameters" class="level3">
<h3 class="anchored" data-anchor-id="key-hyperparameters">Key Hyperparameters</h3>
<p>The Expert Iteration loop has three key parameters:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Parameter</th>
<th>Symbol</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>batch_per_ei</code></td>
<td><img src="https://latex.codecogs.com/png.latex?D"></td>
<td>number of questions sampled per iteration</td>
</tr>
<tr class="even">
<td><code>num_rollouts</code></td>
<td><img src="https://latex.codecogs.com/png.latex?R"></td>
<td>number of outputs generated per question</td>
</tr>
<tr class="odd">
<td><code>num_ei</code></td>
<td><img src="https://latex.codecogs.com/png.latex?G"></td>
<td>number of expert iteration steps (fixed at 5)</td>
</tr>
</tbody>
</table>
<p>Each iteration samples <img src="https://latex.codecogs.com/png.latex?D"> questions, generates <img src="https://latex.codecogs.com/png.latex?R"> candidate solutions per question (giving <img src="https://latex.codecogs.com/png.latex?D%20%5Ctimes%20R"> total rollouts), filters for correct answers and finetunes on the filtered set.</p>
<p>I also use an <strong>adaptive learning rate and batch size scheme</strong> that scales based on the number of filtered examples per iteration. However, more on that in the next section. You can find the full hyperparameters configuration details in the <a href="https://github.com/garg-aayush/building-from-scratch/blob/exp-iter/expert-iteration/train_exp_iter.py">train_exp_iter.py</a> script.</p>
</section>
</section>
<section id="tuning-the-training-setup" class="level2">
<h2 class="anchored" data-anchor-id="tuning-the-training-setup">Tuning the Training Setup</h2>
<section id="finding-the-right-learning-rate" class="level3">
<h3 class="anchored" data-anchor-id="finding-the-right-learning-rate">Finding the Right Learning Rate</h3>
<p>Expert Iteration training differs from standard SFT because the number of filtered examples varies significantly across iterations: early iterations yield fewer correct solutions since the model hasn’t improved yet. <strong>This makes learning rate selection important.</strong></p>
<p>My initial experiments with a constant learning rate of <code>7e-5</code> showed overfitting, with validation accuracy plateauing and then declining. Using <code>lr=1e-5</code> avoided overfitting but resulted in extremely slow learning.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/lr_sweep_ei_acc.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:70.0%"></p>
</figure>
</div>
<blockquote class="blockquote">
<p><strong>Note</strong>: All accuracy values reported in the current and following sections are on the validation data.</p>
</blockquote>
<p>The solution was an <strong>adaptive scheme</strong> that scales both learning rate and batch size based on filtered data size:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Filtered Examples</th>
<th>Batch Size</th>
<th>Learning Rate</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>&lt; 24</td>
<td>8</td>
<td>3.5e-5</td>
</tr>
<tr class="even">
<td>24–128</td>
<td>32</td>
<td>5e-5</td>
</tr>
<tr class="odd">
<td>&gt; 128</td>
<td>64</td>
<td>7e-5</td>
</tr>
</tbody>
</table>
<p>When we have only a few filtered examples, we use a smaller batch size and a lower learning rate to avoid overfitting. As more rollouts pass the filter, we can train more aggressively. This setup achieved the best validation accuracy of <strong>32.6%</strong> after 5 iterations.</p>
<blockquote class="blockquote">
<p>As the filtered dataset size varies across iterations, an adaptive learning rate and batch size prevent overfitting when data is scarce while enabling efficient training when data is abundant.</p>
</blockquote>
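<p>The table above maps directly to a small lookup function. This is a sketch under my reading of the table, where exactly 24 and exactly 128 filtered examples both fall into the middle row. The function name is made up for illustration and the actual logic in <code>train_exp_iter.py</code> may be written differently.</p>
<pre><code>from bisect import bisect_right

# First filtered-set size that belongs to the second and third table rows.
_ROW_STARTS = [24, 129]
# One (batch size, learning rate) pair per table row.
_SCHEDULE = [(8, 3.5e-5), (32, 5e-5), (64, 7e-5)]

def adaptive_batch_and_lr(num_filtered):
    """Return (batch_size, learning_rate) for a given number of filtered examples."""
    return _SCHEDULE[bisect_right(_ROW_STARTS, num_filtered)]

print(adaptive_batch_and_lr(10))   # (8, 3.5e-05)
print(adaptive_batch_and_lr(100))  # (32, 5e-05)
print(adaptive_batch_and_lr(500))  # (64, 7e-05)</code></pre>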
</section>
<section id="single-vs.-multiple-reasoning-traces-per-question" class="level3">
<h3 class="anchored" data-anchor-id="single-vs.-multiple-reasoning-traces-per-question">Single vs.&nbsp;Multiple Reasoning Traces per Question</h3>
<p>When generating <img src="https://latex.codecogs.com/png.latex?R"> rollouts per question, multiple rollouts can produce correct answers for the same question often with different reasoning paths. This leads to the following question: <strong>should we keep just one correct trace per question in every iteration or all of them?</strong></p>
<p>I ran two experiments:</p>
<ul>
<li><strong>Single-trace</strong>: Keep only 1 correct trace per question (randomly selected if multiple are correct)</li>
<li><strong>Multi-trace</strong>: Keep all correct traces per question</li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/sampling_strategy_ei_acc.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:70.0%"></p>
</figure>
</div>
<p>Multi-trace sampling reaches higher accuracy than single-trace. Interestingly, single-trace starts faster but plateaus early while multi-trace continues improving. My understanding is that diverse reasoning paths to the same answer provide a richer training signal.</p>
<blockquote class="blockquote">
<p>Keeping all correct traces per question rather than just one provides diverse reasoning examples for improved training signal.</p>
</blockquote>
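<p>The two strategies differ only in the selection step after filtering. A minimal sketch, assuming rollouts arrive as <code>(question_id, trace, is_correct)</code> tuples (a data layout I am inventing for illustration):</p>
<pre><code>import random
from collections import defaultdict

def select_traces(rollouts, strategy="multi", seed=0):
    """Keep all correct traces per question ("multi") or one random one ("single").

    rollouts: list of (question_id, trace, is_correct) tuples.
    Returns a list of (question_id, trace) training examples.
    """
    rng = random.Random(seed)
    correct = defaultdict(list)
    for question_id, trace, is_correct in rollouts:
        if is_correct:
            correct[question_id].append(trace)
    selected = []
    for question_id, traces in correct.items():
        if strategy == "single":
            # One randomly chosen correct trace per question.
            traces = [rng.choice(traces)]
        selected.extend((question_id, trace) for trace in traces)
    return selected</code></pre>
<p>Questions with no correct rollout drop out under both strategies, so the strategies only differ for questions solved more than once.</p>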
</section>
</section>
<section id="exploring-d-sample-size-and-r-rollouts" class="level2">
<h2 class="anchored" data-anchor-id="exploring-d-sample-size-and-r-rollouts">Exploring D (Sample Size) and R (Rollouts)</h2>
<p>The assignment suggests varying the batch size <img src="https://latex.codecogs.com/png.latex?D"> and rollout count <img src="https://latex.codecogs.com/png.latex?R"> to understand what configurations work best. I ran a grid of experiments to find the best configuration and see if I could make some sense of the results:</p>
<ul>
<li><strong><img src="https://latex.codecogs.com/png.latex?D"> (batch_per_ei)</strong>: <img src="https://latex.codecogs.com/png.latex?%7B512,%201024,%202048%7D"></li>
<li><strong><img src="https://latex.codecogs.com/png.latex?R"> (num_rollouts)</strong>: <img src="https://latex.codecogs.com/png.latex?%7B2,%204%7D"></li>
</ul>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/ei_grid_plot.png" class="img-fluid"></p>
<p>Looking at the results, some observations stand out:</p>
<ul>
<li><p>The best configuration is <img src="https://latex.codecogs.com/png.latex?D=1024,%20R=4"> achieving <strong>41.7%</strong> accuracy.</p></li>
<li><p>Increasing <img src="https://latex.codecogs.com/png.latex?D"> does not guarantee better performance: both <img src="https://latex.codecogs.com/png.latex?D=2048"> configurations underperform their <img src="https://latex.codecogs.com/png.latex?D=1024"> counterparts.</p></li>
</ul>
<blockquote class="blockquote">
<p>Beyond a certain batch size, sampling more questions with limited rollouts means pulling in harder problems without correct solutions (contributing nothing to training) and a filtered dataset skewed toward easy problems the model can already solve. This results in less diversity per question and degrades performance.</p>
</blockquote>
<ul>
<li>Increasing <img src="https://latex.codecogs.com/png.latex?R"> from 2 to 4 improves accuracy across all values of <img src="https://latex.codecogs.com/png.latex?D">, hinting that rollouts matter more than batch size.</li>
</ul>
<blockquote class="blockquote">
<p>More rollouts means higher probability of solving each question, more diverse reasoning traces when you do solve (which we saw helps in the multi-trace experiment) and better coverage of harder problems that rarely get solved with fewer attempts.</p>
</blockquote>
<p>This aligns with an observation from the <a href="https://arxiv.org/abs/2203.14465">STaR paper</a>, where performance stalled when the model received no direct training signal for problems it failed to solve. A larger rollout count helps extract some training signal from more of the sampled questions.</p>
<section id="filter-rate-analysis" class="level3">
<h3 class="anchored" data-anchor-id="filter-rate-analysis">Filter Rate Analysis</h3>
<p>To better understand the above accuracy results, I also tracked the <strong>filter rate (correct rollouts / total rollouts)</strong> across iterations for different experiments.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/ei_filter_rate_plot.png" class="img-fluid"></p>
<p>A few things stand out:</p>
<ul>
<li><p>The filter rate increases across iterations as the model improves, generating more correct solutions. This in turn provides more training signal and improves the model. <strong>This is the core mechanism behind Expert Iteration.</strong></p></li>
<li><p>The filter rate correlates with final accuracy better than absolute example count. <img src="https://latex.codecogs.com/png.latex?D=2048,%20R=4"> produces the most correct examples (~4000) but <img src="https://latex.codecogs.com/png.latex?D=1024,%20R=4"> achieves the highest filter rate (~57%) and best accuracy. Basically, more data is not always better. You need better training examples rather than more examples.</p></li>
</ul>
<blockquote class="blockquote">
<p>The filter rate reflects how well the model is improving relative to what it attempts. A high filter rate means the model is genuinely getting better at solving problems, not just accumulating easy examples.</p>
</blockquote>
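<p>To make the comparison concrete, the filter rate can be computed from the counts quoted above. The helper name is mine, and the inputs are the approximate figures read off the plot, so treat the outputs as rough.</p>
<pre><code>def filter_rate(num_correct, batch_per_ei, num_rollouts):
    """Fraction of rollouts that pass the correctness filter (correct / total)."""
    total_rollouts = batch_per_ei * num_rollouts
    return num_correct / total_rollouts

# D=2048, R=4: about 4000 correct rollouts out of 8192 generated.
print(round(filter_rate(4000, 2048, 4), 3))  # 0.488

# D=1024, R=4: a filter rate of about 57% of 4096 rollouts is roughly 2300 examples.
print(round(0.57 * 1024 * 4))  # 2335</code></pre>
<p>So the larger batch yields more examples in absolute terms but a lower share of its rollouts survive the filter.</p>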
</section>
</section>
<section id="pushing-accuracy-with-multi-epoch-training" class="level2">
<h2 class="anchored" data-anchor-id="pushing-accuracy-with-multi-epoch-training">Pushing Accuracy with Multi-Epoch Training</h2>
<p>So far, all my experiments used just 1 epoch of SFT per expert iteration step. I wanted to see if I could push accuracy even higher and the simplest way to check this is by training for more epochs per iteration. I ran additional experiments with 2 epochs per iteration for two batch sizes: <img src="https://latex.codecogs.com/png.latex?D=512"> and <img src="https://latex.codecogs.com/png.latex?D=1024"> (both with <img src="https://latex.codecogs.com/png.latex?R=4">).</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-23/ei_multiepoch_plot.png" class="img-fluid"></p>
<p>Training for 2 epochs per iteration consistently outperforms 1 epoch across both configurations. My understanding is that with a large rollout count <img src="https://latex.codecogs.com/png.latex?R"> the filtered dataset is already diverse enough that training for multiple epochs extracts more signal without memorizing. This improved learning leads to better filter rates and accuracy in subsequent iterations.</p>
</section>
<section id="comparison-to-sft-experiments" class="level2">
<h2 class="anchored" data-anchor-id="comparison-to-sft-experiments">Comparison to SFT Experiments</h2>
<p>In my <a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">previous SFT experiments</a>, I finetuned the same <code>Qwen2.5-Math-1.5B</code> model using the same train and eval data, with the train data being GPT-OSS-120B-generated reasoning traces. Below is a comparison of the best SFT and Expert Iteration experiments:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Method</th>
<th>Configuration</th>
<th>Reward Accuracy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Baseline</td>
<td>Untrained Qwen2.5-Math-1.5B</td>
<td>2.9%</td>
</tr>
<tr class="even">
<td><strong>Expert Iteration (best)</strong></td>
<td>D=1024, R=4, 2 epochs</td>
<td><strong>47.1%</strong></td>
</tr>
<tr class="odd">
<td><strong>SFT (best)</strong></td>
<td>Filtered data, 2 epochs</td>
<td><strong>53.4%</strong></td>
</tr>
</tbody>
</table>
<p>Despite extensive tuning across batch sizes, rollout counts and learning rate schedules, <strong>Expert Iteration reached 47.1% accuracy, which is around 6 percentage points below SFT’s 53.4%</strong>.</p>
<p>In my understanding, the gap comes down to two factors: <strong>data quality</strong> and <strong>rationalization</strong>:</p>
<ul>
<li>SFT trains on reasoning traces from GPT-OSS-120B, a much more capable model. These traces aren’t just correct but also demonstrate sophisticated problem-solving strategies. Expert Iteration, by contrast, trains on the model’s <em>own</em> correct outputs, which are inherently less refined. The model can only learn from its own best behaviors.</li>
<li>Moreover, the STaR paper shows that rationalization (providing the answer as a hint to generate rationales for failed problems) provides crucial training signal that accelerates learning. Without it, the model receives no direct signal for problems it cannot solve.</li>
</ul>
<blockquote class="blockquote">
<p>There are still experiments one could run to close this gap like larger rollout counts (8, 12), adaptive sampling strategies to limit correct rollouts per question and prevent overfitting etc. However, I didn’t find it compelling enough to run more experiments.</p>
</blockquote>
<p>Also, I have not come across many experiments, blogs or papers on Expert Iteration replacing SFT and, given my experiments, I think there are a few reasons:</p>
<ol type="1">
<li><p>Expert Iteration requires a model that can already solve some problems correctly. A weaker base model would struggle to provide any training signal.</p></li>
<li><p>The method is sensitive to learning rate, batch size, and rollout count. You need to find the right balance otherwise you risk overfitting. My grid search showed final accuracy ranging from 21.9% to 47.1% depending on configuration.</p></li>
<li><p>Each iteration requires generating <img src="https://latex.codecogs.com/png.latex?D%20%5Ctimes%20R"> rollouts before any training happens. With <img src="https://latex.codecogs.com/png.latex?D=1024"> and <img src="https://latex.codecogs.com/png.latex?R=4">, that is 4,096 generations per iteration. You end up spending as much or more compute on generation as on training.</p></li>
</ol>
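<p>The generation cost in the last point is easy to tabulate for the grid explored in this post. A small sketch (the helper name is mine) that counts rollouts per iteration and over all 5 iterations:</p>
<pre><code>from itertools import product

def generation_budget(batch_sizes, rollout_counts, num_ei=5):
    """Map each (D, R) pair to (rollouts per iteration, rollouts over all iterations)."""
    budget = {}
    for d, r in product(batch_sizes, rollout_counts):
        budget[(d, r)] = (d * r, d * r * num_ei)
    return budget

budget = generation_budget([512, 1024, 2048], [2, 4])
print(budget[(1024, 4)])  # (4096, 20480)
print(budget[(2048, 4)])  # (8192, 40960)</code></pre>
<p>Training for more epochs per iteration does not change these numbers since the rollouts are generated once per iteration.</p>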
<p>Despite all the above, Expert Iteration has one compelling advantage: <strong>it does not require expensive annotated data and is well-suited for domains with verifiable rewards like math and code</strong>.</p>
<p>It is also worth noting that the core mechanism in Expert Iteration, generating many outputs and filtering for correct ones, is linked to the <strong>rejection sampling</strong> used in the <a href="https://arxiv.org/abs/2402.03300">DeepSeek-Math</a> and <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> papers. You can think of Expert Iteration as rejection sampling applied iteratively: sample, filter, finetune, repeat.</p>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<section id="papers-and-blogposts" class="level3">
<h3 class="anchored" data-anchor-id="papers-and-blogposts">Papers and blogposts</h3>
<ul>
<li><a href="https://arxiv.org/abs/1705.08439">Thinking Fast and Slow with Deep Learning and Tree Search</a>: Original Expert Iteration paper</li>
<li><a href="https://arxiv.org/abs/2203.14465">STaR: Bootstrapping Reasoning With Reasoning</a>: Expert Iteration for LLM reasoning</li>
<li><a href="https://github.com/stanford-cs336/assignment5-alignment">CS336 Assignment 5</a>: Stanford CS336 alignment assignment</li>
<li><a href="https://huggingface.co/blog/garg-aayush/building-sft-from-ground-up">Building SFT from Ground Up</a>: Previous SFT experiments blogpost</li>
</ul>
</section>
<section id="code" class="level3">
<h3 class="anchored" data-anchor-id="code">Code</h3>
<ul>
<li><a href="https://github.com/garg-aayush/building-from-scratch/tree/main/expert-iteration">Expert Iteration Implementation</a>: My Expert Iteration codebase</li>
<li><a href="https://huggingface.co/datasets/garg-aayush/sft-cs336-assign5-datasets">Training and validation datasets</a>: Training and validation datasets used in the experiments</li>
<li><a href="https://wandb.ai/garg-aayush/expert-iter">wandb Training logs</a></li>
<li><a href="https://huggingface.co/garg-aayush/cs336_exp-iter_exps/">Finetuned checkpoints</a></li>
</ul>


</section>
</section>

 ]]></description>
  <category>RL &amp; Alignment</category>
  <category>LLM Training</category>
  <guid>https://garg-aayush.github.io/posts/2026-01-23-expert-iteration/</guid>
  <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>A Brief Introduction to Claude Agent Skills</title>
  <link>https://garg-aayush.github.io/posts/2026-01-13-claude-skills/</link>
  <description><![CDATA[ 




<p>If you have been following X or LinkedIn for the last few weeks, you have probably noticed the buzz around Claude Skills (or Agent Skills), apart from all the fanfare around Claude Code. There are tweets and posts appreciating their simplicity, devs sharing custom skills for everything from document generation to API integrations. I would say the hype is well deserved and genuine.</p>
<p>I had been aware of <a href="https://www.claude.com/news/skills">Claude Skills</a> when they launched in October 2025 (thanks to <a href="https://simonwillison.net/2025/Oct/16/claude-skills/">Simon Willison’s blog post</a>). However, I did not really dig into them until I came across <a href="https://huggingface.co/blog/hf-skills-training">this Hugging Face blog post</a> where they used Claude Code to fine-tune an open-source LLM. They built <a href="https://github.com/huggingface/skills">Hugging Face Skills</a> that let you do something like this <code>Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots</code> and Claude handles everything from GPU selection, script generation, job submission, progress monitoring and pushing the finished model to the Hub. That was a woah moment for me!</p>
<p>Skills deserve all the attention they are getting. They provide the domain knowledge that modern LLMs and agents need despite their impressive general capabilities. They are simple folders with packaged expertise that agents can dynamically invoke for relevant requests.</p>
<p>In this post, I will explain what skills are, why they matter and walk you through one of the skills I use daily: a <a href="https://github.com/garg-aayush/tutorials/tree/main/claude-skills/basic-image-editing">simple image editing skill</a> as an example of how you can quickly build skills for your own use.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/image-editing-skill-ex-1.png" class="img-fluid"></p>
<section id="what-are-agent-skills" class="level2">
<h2 class="anchored" data-anchor-id="what-are-agent-skills">What Are Agent Skills?</h2>
<p>Simply put, Skills are organized folders that package expertise into discoverable capabilities. Each skill contains a <code>SKILL.md</code>, a markdown file with YAML metadata and instructions that Claude reads when relevant, along with optional supporting files like scripts and templates.</p>
<blockquote class="blockquote">
<p>As Barry described in <a href="https://www.youtube.com/watch?v=CEvIs9y1uog">his AI Engineer talk</a>, think of them as <strong>“expertise packages”</strong> that Claude can discover and load dynamically.</p>
</blockquote>
<p>That is really all there is to skills. There are no complex protocols, no server infrastructure and no elaborate configuration. Just text files that describe how to do something, optionally paired with scripts that make the task more reliable. Simon Willison put it well in <a href="https://simonwillison.net/2025/Oct/16/claude-skills/">his blog post</a>: <strong>“Skills feel a lot closer to the spirit of LLMs - throw in some text and let the model figure it out.”</strong></p>
<p>Since Anthropic released <a href="https://agentskills.io">Agent Skills</a> as an open standard in December 2025, skills are rapidly becoming available across different coding agents: GitHub Copilot, Codex CLI, Cursor and more.</p>
<p>It is also important to distinguish skills from other customization options like <code>claude.md</code>, MCP servers, and subagents.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Skills</strong></th>
<th><strong>Claude.md</strong></th>
<th><strong>MCP Servers</strong></th>
<th><strong>Subagents</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>What it is</strong></td>
<td>Folders containing instructions, scripts, and resources that teach Claude <em>how</em> to perform specialized tasks</td>
<td>A markdown file that tells Claude <em>about</em> a specific project</td>
<td>A protocol that <em>connects</em> Claude to external data sources</td>
<td>Specialized AI assistants with fixed roles and their own context window</td>
</tr>
<tr class="even">
<td><strong>Scope</strong></td>
<td>Portable across projects and agents</td>
<td>Project-specific</td>
<td>Tool/service-specific</td>
<td>Task-specific</td>
</tr>
<tr class="odd">
<td><strong>Lives in</strong></td>
<td><code>~/.claude/skills/</code> or <code>.claude/skills/</code></td>
<td>Root of your repository</td>
<td>External servers (GitHub, Slack, databases)</td>
<td>Spawned during task execution</td>
</tr>
<tr class="even">
<td><strong>Example</strong></td>
<td>Generate accessible PDF reports according to company guidelines</td>
<td>This project uses Next.js 14, Tailwind, and PostgreSQL</td>
<td>Connect to our PostgreSQL database</td>
<td>Code review agent with read-only permissions</td>
</tr>
<tr class="odd">
<td><strong>Use when</strong></td>
<td>You need repeatable domain expertise across multiple projects</td>
<td>You are onboarding Claude to a specific codebase</td>
<td>You need Claude to access external data or services</td>
<td>You want to delegate distinct subtasks to specialized assistants</td>
</tr>
</tbody>
</table>
<p>This is how I differentiate them:</p>
<ul>
<li><strong>Claude.md</strong> gives Claude <em>context</em> about where it’s working</li>
<li><strong>MCP</strong> gives Claude <em>access</em> to data and services</li>
<li><strong>Skills</strong> give Claude <em>expertise</em> on how to do things well</li>
<li><strong>Subagents</strong> let Claude <em>delegate</em> work to specialized assistants</li>
</ul>
<section id="how-skills-are-dynamically-loaded" class="level3">
<h3 class="anchored" data-anchor-id="how-skills-are-dynamically-loaded">How Skills are Dynamically Loaded</h3>
<p>I have used the term “dynamically loaded” quite a few times now. This is because Skills are <strong>progressively disclosed</strong> using a three-tier loading mechanism:</p>
<ol type="1">
<li><p><strong>Metadata only (~30-100 tokens):</strong> At startup, Claude sees just the <code>name</code> and <code>description</code> from each skill’s YAML frontmatter. This is enough to know the skill exists.</p></li>
<li><p><strong>Full instructions (when relevant):</strong> When your request matches a skill’s description, Claude loads the complete <code>SKILL.md</code> content.</p></li>
<li><p><strong>Supporting files (on demand):</strong> Scripts, templates, and reference docs are loaded only when actually needed.</p></li>
</ol>
<p><strong>This is what makes skills so useful and efficient</strong>. Claude can have hundreds of skills installed without blowing up the context window. Compare this to MCP servers where tool descriptions often consume hundreds or even thousands of tokens upfront regardless of whether you use them.</p>
<blockquote class="blockquote">
<p>For a deeper dive into the loading mechanics, please go through the <a href="https://platform.claude.com/cookbook/skills-notebooks-01-skills-introduction">Claude Skills Cookbook</a>.</p>
</blockquote>
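<p>The three tiers can be mimicked with a small Python class. To be clear, this is only an illustration of the idea and not how Claude implements skill loading. <code>SkillIndex</code> and <code>split_frontmatter</code> are names I made up, and the frontmatter parser is deliberately simplified: it handles flat <code>key: value</code> lines only, not full YAML.</p>
<pre><code>from pathlib import Path

def split_frontmatter(text):
    """Split a SKILL.md string into (metadata dict, markdown body).

    Simplified: only flat `key: value` lines are parsed, not full YAML.
    """
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, text
    end = stripped.index("---", 1)  # closing delimiter of the frontmatter
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])

class SkillIndex:
    def __init__(self, skills_dir):
        self.paths = {}
        self.metadata = {}
        # Tier 1: at startup, keep only the name and description of each skill.
        for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
            meta, _ = split_frontmatter(skill_md.read_text(encoding="utf-8"))
            name = meta.get("name", skill_md.parent.name)
            self.paths[name] = skill_md
            self.metadata[name] = meta.get("description", "")

    def load_instructions(self, name):
        # Tier 2: read the full SKILL.md body only when the skill is relevant.
        _, body = split_frontmatter(self.paths[name].read_text(encoding="utf-8"))
        return body

    def load_supporting_file(self, name, relative_path):
        # Tier 3: read a script or template only when it is actually needed.
        return (self.paths[name].parent / relative_path).read_text(encoding="utf-8")</code></pre>
<p>The point of the design is that only <code>metadata</code> is held up front, so the startup cost grows with the number of skills times a short description rather than with the size of every instruction file.</p>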
</section>
<section id="what-makes-skills-valuable" class="level3">
<h3 class="anchored" data-anchor-id="what-makes-skills-valuable">What Makes Skills Valuable</h3>
<p>Once you start using skills, you will discover many benefits. For me, the following ones stand out:</p>
<ul>
<li>Skills provide Claude domain expertise, turning <strong>Claude from a brilliant generalist into a domain expert for your specific workflows</strong>.</li>
<li><strong>Write a skill once, use it anywhere</strong>. The same skill works in Claude.ai, Claude Code, and <a href="https://platform.claude.com/docs/en/build-with-claude/skills-guide">via the API</a>. And since Agent Skills became an open standard, they work across other agents like GitHub Copilot, Codex CLI and Cursor.</li>
<li>Skills ensure <strong>repeatability and reliability</strong>. Often, skills include scripts and pre-defined workflows, so Claude does not reinvent the wheel every time. It uses code that is known to work consistently.</li>
<li>They are progressively loaded which ensures not just <strong>token efficiency</strong> but also <strong>time saving</strong>. For example, it forces Claude to use existing solutions instead of generating code from scratch.</li>
<li>Moreover, you <strong>don’t need to be a developer to create skills</strong>. You can use the <code>Skills-Creator</code> skill in the Claude app or Claude Code to build your own skills just by describing (and iterating) what you want.</li>
</ul>
</section>
</section>
<section id="how-skills-are-organized" class="level2">
<h2 class="anchored" data-anchor-id="how-skills-are-organized">How Skills are Organized</h2>
<p>The basic structure is surprisingly simple. Every skill follows the same pattern:</p>
<pre><code>my-skill/
├── SKILL.md          # Required: the brain of the skill
├── scripts/          # Optional: utility scripts
│   └── helper.py
│.....                # Other optional files</code></pre>
<p>The only required file is <code>SKILL.md</code>. Everything else is optional and loaded only when needed. <code>SKILL.md</code> itself has two parts: YAML frontmatter for metadata and the markdown content for instructions.</p>
<p>For example, here is the frontmatter from my <a href="https://github.com/garg-aayush/tutorials/blob/main/claude-skills/basic-image-editing/SKILL.md">basic image editing skill</a>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">---</span></span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> basic-image-editing</span></span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">description</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Image manipulation tool for resizing, rotation, flipping, cropping, padding, format conversion (JPEG/PNG/WebP/TIFF/HEIC), transparency operations (remove/replace/extract/blend), grayscale conversion, auto-cropping borders, and file size optimization. Use when users need to transform, convert, or optimize images.</span></span>
<span id="cb2-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">---</span></span></code></pre></div></div>
<blockquote class="blockquote">
<p>This is what gets loaded in Claude’s context at startup. You can see the full <a href="https://github.com/garg-aayush/tutorials/blob/main/claude-skills/basic-image-editing/SKILL.md">SKILL.md</a> on GitHub.</p>
</blockquote>
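<p>To make the two-part structure concrete, here is a minimal sketch that splits a <code>SKILL.md</code> file into its frontmatter and its markdown body. The function name <code>split_skill_md</code> is hypothetical, the description is shortened, and the parser only handles flat <code>key: value</code> pairs. It is not a full YAML parser and not how Claude actually loads skills.</p>

```python
def split_skill_md(text):
    """Split SKILL.md text into (metadata dict, markdown body).

    Simplified on purpose: handles only flat `key: value` pairs,
    not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---' delimiter of the frontmatter.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Shortened, illustrative version of the frontmatter shown above.
example = """---
name: basic-image-editing
description: Image manipulation tool for resizing, rotation and cropping.
---
# Basic Image Editing

Instructions go here.
"""
meta, body = split_skill_md(example)
print(meta["name"])
print(body.splitlines()[0])
```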
<p>After the frontmatter comes the actual instruction content. The SKILL.md content usually includes:</p>
<ul>
<li><strong>A brief definition</strong> of what the skill does</li>
<li><strong>Clear instructions</strong> on how to use it (commands, workflows, steps)</li>
<li><strong>Examples</strong> showing common use cases with actual command snippets. <strong>I find that providing examples in the SKILL.md file really helps Claude understand the correct way to call the scripts.</strong></li>
<li><strong>Edge cases or constraints</strong> if any exist</li>
<li><strong>References</strong> to supporting scripts or templates if the skill uses them</li>
</ul>
</section>
<section id="building-your-own-skill" class="level2">
<h2 class="anchored" data-anchor-id="building-your-own-skill">Building Your Own Skill</h2>
<p>There are multiple ways to create skills:</p>
<ol type="1">
<li><p><strong>Write everything from scratch</strong>: You have full control. Create the folder structure manually, write the SKILL.md yourself and add your own scripts. I would not recommend this approach. It is better to use Claude to generate the first version and iterate over it.</p></li>
<li><p><strong>Use Claude Code with the Skills Creator skill</strong>: If you are already working in Claude Code, you can describe what you want and let Claude scaffold the skill for you. See <a href="https://www.youtube.com/watch?v=7LtCEJ4sfSE">Eleanor Berger’s videos</a> where she walks through an example of building invoice and reports generation skills in Claude Code.</p></li>
<li><p><strong>Use the Claude web app</strong>: This is my preferred approach, especially for personal use. Everything happens in a UI-based interface. You describe what you want, have a conversation with Claude, test it immediately and iterate until it works.</p></li>
</ol>
<section id="walkthrough-creating-the-basic-image-editing-skill" class="level3">
<h3 class="anchored" data-anchor-id="walkthrough-creating-the-basic-image-editing-skill">Walkthrough: Creating the Basic Image Editing Skill</h3>
<p>Let me walk through how I created my basic image editing skill using the third approach.</p>
<p><strong>Step 1: Have Clarity on What the Skill Should Do</strong></p>
<p>Before creating a skill, <strong>you should have a clear idea of what it should do</strong>. Maybe you want a skill that generates PDF reports on the fly or one that handles data analysis or something else entirely. It doesn’t matter if you don’t know <em>how</em> to implement it. What matters is knowing <em>what</em> you want it to accomplish.</p>
<p>For example, I work a lot with images. I often need to do basic editing operations like resizing, rotating, and cropping. These are simple operations but I do them constantly. <strong>A basic image editing skill accessible in Claude makes a lot of sense for my day to day work.</strong></p>
<p><strong>Step 2: Bootstrap with the Skills Creator</strong></p>
<p>First of all, ensure the <code>skill-creator</code> skill is enabled in your Claude account. You can do this by going to Settings -&gt; Capabilities and enabling the <code>skill-creator</code> skill.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_1.png" class="img-fluid"></p>
<p>Once you have enabled the skill-creator, describe the skill you want to create. For example, I asked for a basic image editing skill:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_2.png" class="img-fluid"></p>
<p>Claude used the skill-creator framework to scaffold the initial structure: a <code>SKILL.md</code> with proper YAML frontmatter and a Python script with core operations.</p>
<p><strong>Step 3: Test Immediately, Iterate Constantly</strong></p>
<p>One thing I always make sure of: I don’t try to create a polished skill in one go. Whatever skill gets created, I test it immediately. This is one of the things I appreciate about creating skills in the Claude web app. You can test them instantly in the same conversation.</p>
<p>For example, I uploaded a test image and asked Claude to change the background to orange:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_3.png" class="img-fluid"></p>
<p>This gives you a tight feedback loop. You test what you’re building, and if something doesn’t work, Claude can fix it for you right there. No context-switching between environments. You build, test, and refine in one place.</p>
<p><strong>Step 4: Iterate and Improve</strong></p>
<p>Once you have a basic working version, you can start adding more operations and optimizing.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_4.png" class="img-fluid"></p>
<p>Similarly, since this skill runs Python scripts locally, I switched to PEP 723 inline metadata so <code>uv run</code> installs dependencies automatically on the fly:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_4.png" class="img-fluid"></p>
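<p>For reference, a PEP 723 header is a specially formatted comment block inside the script itself. The sketch below shows roughly what such a script can look like. The file name <code>resize.py</code>, the helper names and the <code>pillow</code> dependency are my assumptions for illustration, not the actual script from the skill.</p>

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pillow",
# ]
# ///
"""Hypothetical resize helper; run with `uv run resize.py in.jpg out.jpg 512`."""
import sys


def fit_within(width, height, max_side):
    """Scale (width, height) so the longer side equals max_side."""
    scale = max_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_image(src, dst, max_side):
    # Imported lazily so the pure helper above works without Pillow installed.
    from PIL import Image

    with Image.open(src) as img:
        img.resize(fit_within(img.width, img.height, max_side)).save(dst)


if __name__ == "__main__" and len(sys.argv) == 4:
    resize_image(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

<p>With this header in place, <code>uv run</code> reads the comment block, creates a temporary environment with the listed dependencies and then runs the script, so the skill does not need a separate requirements file.</p>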
<p>If you have some domain knowledge, use it. Go through the code Claude generated and check if it’s doing something redundant. You can review and ask Claude to make those optimizations.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-13/walkthrough_5.png" class="img-fluid"></p>
<p>Finally, once you have a version you are happy with, go through the <code>SKILL.md</code> file and make sure the examples are good. If not, add them manually. <strong>Good examples help Claude understand exactly how to use the skill.</strong></p>
<p><strong>Step 5: Export and Use Across Agents</strong></p>
<p>Once you have a skill you want to keep, export it. You can use it in the Claude web app, Claude Code, or other agents like Codex CLI and Cursor.</p>
</section>
<section id="the-iterative-philosophy" class="level3">
<h3 class="anchored" data-anchor-id="the-iterative-philosophy">The Iterative Philosophy</h3>
<p><strong>Treat skills as living documents, not finished products.</strong> Your first version may not be perfect. You will find gaps and discover better ways to do the same tasks. Maybe the description doesn’t trigger for certain phrasings. Maybe you forgot an edge case. That is expected.</p>
<p>I think of skills the same way I think of code: you write it, use it, find gaps and improve it. Always make sure the skill you use today will be more polished than the one you started with.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Skills are simple yet powerful. A skill is just a folder with a markdown file and optionally some scripts that provides domain expertise to your agents without blowing up the context window. If you work with AI agents regularly, building your own skills is worth the investment. Start with something you do repeatedly. Create the simplest version that works and improve it iteratively.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://simonwillison.net/2025/Oct/16/claude-skills/">Simon Willison’s Blog Post</a>: A good overview of Claude Skills.</li>
<li><a href="https://www.youtube.com/watch?v=CEvIs9y1uog">Barry’s AI Engineer Talk</a>: I highly recommend watching this talk if you are interested in skills.</li>
<li><a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview">Agent Skills Documentation</a></li>
<li><a href="https://platform.claude.com/cookbook/skills-notebooks-01-skills-introduction">Claude Skills Cookbook</a></li>
<li><a href="https://platform.claude.com/docs/en/build-with-claude/skills-guide">Using Skills with the API</a></li>
<li><a href="https://agentskills.io">Agent Skills Open Standard</a></li>
</ul>


</section>

 ]]></description>
  <category>Tools &amp; Infra</category>
  <guid>https://garg-aayush.github.io/posts/2026-01-13-claude-skills/</guid>
  <pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Key Insights from DeepSeekMath paper</title>
  <link>https://garg-aayush.github.io/posts/2026-01-06-review-deepseek-math.html</link>
  <description><![CDATA[ 




<p>Over the weekend, I finished reading the <a href="https://arxiv.org/abs/2402.03300">DeepSeekMath</a> paper which introduced GRPO (the RL algorithm I covered in my <a href="https://aayushgarg.dev/posts/2026-01-01-understanding-grpo.html">previous post</a>). Below are my thoughts and key takeaways from the paper.</p>
<p>In this paper, the authors show that a small domain-specific model (7B parameters) can approach the performance of SOTA general models like GPT-4 on competition-level math when it is pre-trained on a sufficiently large, well-curated math corpus (120B tokens) and then reinforced with RL. It outperforms the major open-source models available at that time, including math-specialized models of the same or larger size like <a href="https://arxiv.org/abs/2308.09583">WizardMath-v1.1 7B</a>, <a href="https://arxiv.org/abs/2310.10631">Llemma 34B</a> and <a href="https://arxiv.org/abs/2309.12284">MetaMath 70B</a>. As shown below, DeepSeekMath-7B achieves 51.7% on the competition-level <a href="https://arxiv.org/abs/2103.03874">MATH benchmark</a>.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image1-benchmark.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>In a nutshell: <strong>data quality and domain-specific pre-training are more critical for mathematical reasoning than sheer parameter count</strong>.</p>
</blockquote>
<section id="key-insights" class="level2">
<h2 class="anchored" data-anchor-id="key-insights">Key Insights</h2>
<section id="iterative-data-curation-pipeline" class="level3">
<h3 class="anchored" data-anchor-id="iterative-data-curation-pipeline">Iterative Data Curation Pipeline</h3>
<p>One of the biggest contributors to DeepSeekMath-7B performance is its pre-training corpus which is a 120B-token, high-quality mathematical dataset built from <a href="https://commoncrawl.org/">Common Crawl</a> using an <strong>iterative <a href="https://arxiv.org/abs/1612.03651">fastText</a> classifier-based pipeline</strong>.</p>
<p>What stood out to me here is not just the dataset and data curation pipeline itself but what it implies:</p>
<ol type="1">
<li><p>Firstly, they created the pre-training corpus from Common Crawl, which shows that a well-thought-out data curation pipeline can extract high-quality domain-specific data from public Common Crawl data.</p></li>
<li><p>Secondly, the resulting corpus is substantially larger than <a href="https://arxiv.org/abs/2310.06786">OpenWebMath</a> (roughly 9 times larger), reinforcing the point that <em>scale matters</em>, as long as quality is maintained.</p></li>
</ol>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image2-math-datacorpus-comparison.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>Note, I will go deeper into the pipeline later in this post because I feel it is broadly reusable beyond math.</p>
</blockquote>
</section>
<section id="grpo-ppo-without-the-critic" class="level3">
<h3 class="anchored" data-anchor-id="grpo-ppo-without-the-critic">GRPO: PPO Without the Critic</h3>
<p>The second major novelty is their <strong>more memory-efficient alternative to <a href="https://arxiv.org/abs/1707.06347">PPO</a></strong>: <strong>GRPO (Group Relative Policy Optimization)</strong>.</p>
<p>At a high level, GRPO removes the need to train a separate critic (value) model. Instead of learning a value function for advantage estimation, GRPO samples a <em>group</em> of completions per prompt (64 in their experiments) and uses the group’s mean reward as the baseline, with the resulting advantages normalized by the group’s standard deviation. This significantly reduces memory and compute overhead while preserving the stability mechanisms associated with PPO (clipping and KL regularization).</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image3-grpo.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>For a detailed explanation of GRPO, see my <a href="https://aayushgarg.dev/posts/2026-01-01-understanding-grpo.html">previous post on Understanding GRPO</a>.</p>
</blockquote>
</section>
<section id="code-training-helps-math" class="level3">
<h3 class="anchored" data-anchor-id="code-training-helps-math">Code Training Helps Math</h3>
<p>The paper also provides evidence for the hypothesized connection between code training and reasoning. In their results, models that underwent code pre-training before math training showed improved performance on mathematical benchmarks, both with and without tool use. Thus, DeepSeekMath-Base is initialized with <a href="https://arxiv.org/abs/2401.14196">DeepSeek-Coder-Base-v1.5 7B</a>, not a general language model.</p>
<blockquote class="blockquote">
<p>My understanding is code pushes the model toward more structured, step-wise patterns of reasoning and that structure transfers well to math.</p>
</blockquote>
</section>
<section id="arxiv-papers-are-surprisingly-ineffective" class="level3">
<h3 class="anchored" data-anchor-id="arxiv-papers-are-surprisingly-ineffective">ArXiv Papers are Surprisingly Ineffective</h3>
<p>ArXiv papers are often a default ingredient in many math pre-training recipes, but the authors report that pre-training on arXiv content was not helpful in their setup. In some cases, it led to no improvement or even to worse performance.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image4-archive.png" class="img-fluid"></p>
<p>Note that the authors are cautious here: they do not claim this is definitively true and instead present it as an empirical finding that needs further study to confirm.</p>
<blockquote class="blockquote">
<p>Still, I found this interesting. It suggests arXiv-style technical writing might be more useful for formal exposition (or informalization) than for improving competition-style problem solving.</p>
</blockquote>
</section>
<section id="online-rl-training-is-superior-to-offline" class="level3">
<h3 class="anchored" data-anchor-id="online-rl-training-is-superior-to-offline">Online RL Training is Superior to Offline</h3>
<p>Sampling training data from the real-time policy model (online) significantly outperforms sampling from the initial SFT model (offline).</p>
<p>In their experiments, Online <a href="https://arxiv.org/abs/2308.01825">Rejection Sampling Fine-Tuning (RFT)</a> significantly outperformed standard (offline) RFT on both <a href="https://arxiv.org/abs/2110.14168">GSM8K</a> and MATH benchmarks. While the two methods perform similarly in the early stages of training, Online RFT gains a distinct advantage as training progresses.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image5-online-offline.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>As the policy diverges from the initial SFT model, data sampled from SFT becomes less relevant to the current model’s decision boundaries. Early on when the policy is close to SFT, it doesn’t matter much. Later, this staleness hurts the performance.</p>
</blockquote>
</section>
<section id="rl-sharpens-distribution-does-not-expand-model-capability" class="level3">
<h3 class="anchored" data-anchor-id="rl-sharpens-distribution-does-not-expand-model-capability">RL Sharpens Distribution, Does not Expand Model Capability</h3>
<p>This is one of my favorite analyses in the paper. The authors compare <strong>Maj@K</strong> (majority voting accuracy) and <strong>Pass@K</strong> (whether any of K samples is correct) for both the Instruct and RL models.</p>
<p>At K=64 samples, both models reach similar Pass@K ceilings (around 83-85% on MATH) which indicates that the fundamental capability is the same. However, RL consistently outperforms on Maj@K.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image6-majpass.png" class="img-fluid"></p>
<blockquote class="blockquote">
<p>This suggests RL isn’t expanding fundamental reasoning abilities. Instead it is sharpening the distribution of the model’s output which boosts the correct responses that were already within the model’s capability.</p>
</blockquote>
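<p>To make the two metrics concrete, here is a small sketch of how Maj@K and Pass@K can be computed for a single problem. The answers are made-up toy data (not results from the paper) and the function names are mine.</p>

```python
from collections import Counter


def pass_at_k(samples, truth):
    """1.0 if any of the K sampled answers matches the ground truth."""
    return float(any(answer == truth for answer in samples))


def maj_at_k(samples, truth):
    """1.0 if the most frequent sampled answer matches the ground truth."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == truth)


# Toy data (made up): final answers from K=5 samples for a problem whose answer is "42".
before_rl = ["41", "42", "41", "40", "41"]  # correct answer present but rare
after_rl = ["42", "42", "41", "42", "42"]   # correct answer is now the mode

print(pass_at_k(before_rl, "42"), maj_at_k(before_rl, "42"))
print(pass_at_k(after_rl, "42"), maj_at_k(after_rl, "42"))
```

<p>In this toy case Pass@K is identical for both sets of samples because the correct answer appears at least once in each, while Maj@K flips from wrong to right once the correct answer becomes the most frequent one. That is the “sharpening” pattern described above.</p>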
</section>
</section>
<section id="the-iterative-data-curation-pipeline" class="level2">
<h2 class="anchored" data-anchor-id="the-iterative-data-curation-pipeline">The Iterative Data Curation Pipeline</h2>
<p>As I mentioned earlier, one of the key contributions of this paper is their data curation pipeline that extracts high-quality mathematical content from Common Crawl.</p>
<p>The reason I am going into detail here is that you can <strong>draw parallels from this approach to create any other domain-specific dataset from publicly available data like Common Crawl</strong>. The pipeline is iterative and uses a fastText classifier at its core. Crudely, this is how the pipeline works:</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2026-01-06/image7-data-pipeline.png" class="img-fluid"></p>
<section id="train-a-fasttext-classifier" class="level3">
<h3 class="anchored" data-anchor-id="train-a-fasttext-classifier">1) Train a fastText Classifier</h3>
<p>They start with <a href="https://huggingface.co/datasets/open-web-math/open-web-math">OpenWebMath</a> as a seed corpus (high-quality math web text). Using this seed, they build a binary classification dataset:</p>
<ul>
<li>Sample ~500K examples from the seed corpus as <strong>positive</strong> examples</li>
<li>Sample ~500K random web pages from Common Crawl as <strong>negative</strong> examples</li>
<li>Train a fastText binary classifier on this labeled data</li>
</ul>
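<p>The steps above can be sketched roughly as follows. fastText’s supervised mode expects one example per line, prefixed with <code>__label__&lt;name&gt;</code>. The label names <code>math</code> and <code>other</code>, the helper names and the default hyperparameters are my assumptions, not details taken from the paper’s code.</p>

```python
import random


def to_fasttext_line(text, label):
    """Format one example as fastText expects: '__label__<name> <single-line text>'."""
    return f"__label__{label} " + " ".join(text.split())


def build_training_lines(positives, negatives, seed=0):
    """Label seed-corpus pages as 'math', random crawl pages as 'other', then shuffle."""
    lines = [to_fasttext_line(t, "math") for t in positives]
    lines += [to_fasttext_line(t, "other") for t in negatives]
    random.Random(seed).shuffle(lines)
    return lines


def train_classifier(train_path):
    # Requires `pip install fasttext`; imported lazily so the helpers above stay stdlib-only.
    import fasttext

    # Hyperparameters are left at the library defaults here as a placeholder.
    return fasttext.train_supervised(input=train_path)


# Toy stand-ins for the ~500K positive and ~500K negative pages.
positives = ["Prove that the sum of two even\nintegers is even."]
negatives = ["Top 10 travel destinations for this summer."]
lines = build_training_lines(positives, negatives)
print("\n".join(lines))
```

<p>Writing <code>lines</code> to a text file and passing its path to <code>train_classifier</code> would give a model whose <code>predict</code> method returns a label and a score for each page.</p>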
</section>
<section id="recall-math-related-web-pages-from-common-crawl" class="level3">
<h3 class="anchored" data-anchor-id="recall-math-related-web-pages-from-common-crawl">2) Recall Math-Related Web Pages from Common Crawl</h3>
<p>Next, they run the trained fastText model over a deduplicated Common Crawl snapshot to score pages, rank them by score and keep the highest-scoring subset.</p>
<blockquote class="blockquote">
<p>The paper also mentions reducing the overall Common Crawl size to 40B pages via URL-based deduplication and near-deduplication before applying the classifier at scale.</p>
</blockquote>
</section>
<section id="find-new-math-related-domains" class="level3">
<h3 class="anchored" data-anchor-id="find-new-math-related-domains">3) Find New Math-Related Domains</h3>
<p>One issue is that the initial seed (OpenWebMath) is not diverse enough for the classifier to recall all the math content in Common Crawl. So they add a domain discovery step to expand on it. To find what was missed in the first round of collection:</p>
<ul>
<li>Group Common Crawl into domains (pages sharing the same base URL)</li>
<li>For each domain, calculate what percentage of its pages were already collected</li>
<li>Treat domains where <strong>over 10% of pages are captured</strong> as candidate math-related domains</li>
<li>For the newly discovered math-related domains, use manual annotators to identify which URL paths actually contain mathematical content</li>
</ul>
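<p>The automatic part of this step (everything before the manual annotation) can be sketched in a few lines. I am approximating “same base URL” with the URL’s network location, and the URLs below are made up for illustration.</p>

```python
from collections import Counter
from urllib.parse import urlparse


def candidate_math_domains(all_urls, collected_urls, threshold=0.10):
    """Return {domain: captured fraction} for domains above the threshold."""
    total = Counter(urlparse(u).netloc for u in all_urls)
    hits = Counter(urlparse(u).netloc for u in collected_urls)
    return {
        domain: hits[domain] / count
        for domain, count in total.items()
        if hits[domain] / count > threshold
    }


# Toy crawl: 4 pages on a math site, 20 pages on a news site.
all_urls = [f"https://mathsite.example/page{i}" for i in range(4)]
all_urls += [f"https://news.example/story{i}" for i in range(20)]
# Pages the classifier already collected in the previous round.
collected_urls = [
    "https://mathsite.example/page0",
    "https://mathsite.example/page1",
    "https://news.example/story0",
]

print(candidate_math_domains(all_urls, collected_urls))
```

<p>Here the math site has half of its pages captured and is flagged as a candidate, while the news site (1 of 20 pages) stays below the 10% threshold. The remaining pages of a flagged domain are what the annotators then look at.</p>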
<blockquote class="blockquote">
<p>The paper is not very detailed about the annotation mechanism beyond describing it as manual. My guess is it is a mix of human annotations and LLMs-as-a-judge.</p>
</blockquote>
</section>
<section id="expand-the-seed-corpus-and-repeat" class="level3">
<h3 class="anchored" data-anchor-id="expand-the-seed-corpus-and-repeat">4) Expand the Seed Corpus and Repeat</h3>
<p>Finally, these new math-related web pages are added to the seed corpus and the fastText classifier is retrained. This process is repeated until an iteration adds very little new data.</p>
<p>This approach enables training an improved classifier with each iteration leading to better recall of math-related web pages in each subsequent iteration.</p>
<blockquote class="blockquote">
<p>According to the paper, the pipeline converges after 4 iterations with 98% of the data already collected in the third round.</p>
</blockquote>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>DeepSeekMath is worth reading for a few reasons.</p>
<ul>
<li>First, it shows that a well-curated dataset matters more than model size for math reasoning.</li>
<li>Second, the data curation pipeline is explained in enough detail that you can adapt it for other domains.</li>
<li>Third, the ablation studies and experiments it discusses are genuinely useful.</li>
</ul>


</section>

 ]]></description>
  <category>Paper Notes</category>
  <guid>https://garg-aayush.github.io/posts/2026-01-06-review-deepseek-math.html</guid>
  <pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding GRPO: PPO without the Critic</title>
  <link>https://garg-aayush.github.io/posts/2026-01-01-understanding-grpo.html</link>
  <description><![CDATA[ 




<p>In my previous posts, I worked through the derivations of <a href="https://aayushgarg.dev/posts/2025-12-25-deriving-ppo-loss.html">PPO</a> and <a href="https://aayushgarg.dev/posts/2025-12-30-deriving-dpo-loss.html">DPO</a> for LLM post-training. PPO gave us a full-fledged RL approach with clipped surrogate objectives, value functions and GAE-based advantage estimation. DPO, on the other hand, showed a clever way to bypass RL entirely by reformulating the optimization as a simple classification loss on preference pairs.</p>
<p>That brings us to Group Relative Policy Optimization (GRPO), introduced in the <a href="https://arxiv.org/abs/2402.03300">DeepSeekMath</a> paper. If you have been following recent developments in reasoning models throughout 2025, GRPO has become one of the most widely used post-training algorithms behind open-source reasoning models.</p>
<p>In simple terms, GRPO can be thought of as PPO without the critic (value function). Recall that PPO trains a value function in addition to the policy to estimate baselines for advantage computation. GRPO takes a simpler approach where it samples multiple completions (“group”) for each prompt and uses their rewards to form a baseline for advantage computation. This group-derived baseline replaces the learned value function entirely (no need to train a critic!).</p>
<p>The practical implication is lower memory consumption and reduced training complexity relative to PPO while still preserving PPO’s core stability mechanisms, including the clipped surrogate objective and KL regularization.</p>
<p>In this blog, I will discuss and derive the GRPO objective step by step showing exactly how it simplifies PPO.</p>
<section id="i-the-ppo-objective-and-the-critic-problem" class="level2">
<h2 class="anchored" data-anchor-id="i-the-ppo-objective-and-the-critic-problem">I: The PPO Objective and the Critic Problem</h2>
<p>Let’s briefly recap the key relevant elements of PPO. For the full derivation and PPO details, see my <a href="https://aayushgarg.dev/posts/2025-12-25-deriving-ppo-loss.html">previous blog on PPO</a>.</p>
<p>PPO optimizes an LLM by maximizing a <strong>clipped surrogate objective</strong> (constrained using KL regularization):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cmin%5Cleft(r_t(%5Ctheta)%20%5Chat%7BA%7D_t,%20%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5Ccdot%20%5Chat%7BA%7D_t%5Cright)%5Cright%5D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%7Cs_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%7Cs_t)%7D"> is the probability ratio between current and old policies.</p>
<p>The critical component here is the <strong>advantage estimate</strong> (<img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t">). The advantage measures how much better (or worse) a specific action is compared to what we expected:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA%5E%5Cpi(s_t,%20a_t)%20=%20Q%5E%5Cpi(s_t,%20a_t)%20-%20V%5E%5Cpi(s_t)%0A"></p>
<p>To compute it, PPO uses a <strong>value function</strong> <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"> (the baseline, also called the <strong>critic</strong>) that predicts expected future rewards from any state. The critic is trained alongside the policy and PPO uses <strong>Generalized Advantage Estimation (GAE)</strong> to compute advantages from per-token value predictions.</p>
<section id="the-value-function-problem-in-ppo" class="level3">
<h3 class="anchored" data-anchor-id="the-value-function-problem-in-ppo">The Value Function Problem in PPO</h3>
<p>The value function is implemented as a <strong>learned critic model</strong> with the same architecture as the policy (i.e.&nbsp;another full LLM copy). This critic is trained alongside the policy using a regression loss:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BVF%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cleft(V_%5Ctheta(s_t)%20-%20V_t%5E%7B%5Ctext%7Btarget%7D%7D%5Cright)%5E2%5Cright%5D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?V_t%5E%7B%5Ctext%7Btarget%7D%7D"> is typically the discounted return-to-go from the sampled trajectory.</p>
<p>In PPO, we train two large neural networks (policy and critic) together rather than a single model. This substantially increases computational and memory overhead. Maintaining and training the critic alongside the policy not only increases memory consumption but also adds significant complexity to the training pipeline. In practice, PPO requires four models to be resident in memory at the same time: the policy, the critic, the reference model and the reward model.</p>
<p>One more issue with PPO is that GAE needs <strong>per-token rewards</strong> to compute Temporal Difference (TD) residuals at each position. But in LLM fine-tuning, we typically get <strong>outcome rewards</strong> which is a single score for the entire completion assigned only at the final token.</p>
<blockquote class="blockquote">
<p><em>From DeepSeekMath</em>: “During RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. While in the LLM context, <strong>usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token</strong>.”</p>
</blockquote>
<p>It raises a fundamental question: how can the critic learn accurate per-token values when all training signal comes from a single final reward?</p>
</section>
</section>
<section id="ii-replacing-the-critic-with-group-sampling" class="level2">
<h2 class="anchored" data-anchor-id="ii-replacing-the-critic-with-group-sampling">II: Replacing the Critic with Group Sampling</h2>
<p>As mentioned earlier, in PPO the value function <img src="https://latex.codecogs.com/png.latex?V(s)"> acts as a <strong>baseline <img src="https://latex.codecogs.com/png.latex?b(s)"></strong> for advantage estimation: <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_t%20=%20Q(s_t,%20a_t)%20-%20V(s_t).%0A"> Subtracting this baseline reduces the variance of the policy gradient estimator which in turn stabilizes training.</p>
<blockquote class="blockquote">
<p><strong>Key insight:</strong> the value function is just <em>one possible</em> choice of baseline. In principle, <strong>any function <img src="https://latex.codecogs.com/png.latex?b(s)"> that depends only on the state and not on the action can be used without introducing bias into the gradient estimates.</strong></p>
</blockquote>
<p>Common baseline choices are:</p>
<ul>
<li><strong>Constant baseline:</strong> The average reward across samples. This is the simplest option and is the classic choice for REINFORCE with a baseline.</li>
<li><strong>Learned value function:</strong> <img src="https://latex.codecogs.com/png.latex?V(s)"> trained alongside the policy as in PPO.</li>
<li><strong>Monte Carlo estimate:</strong> An empirical average of returns computed from multiple samples starting from the same state.</li>
</ul>
<p><strong>GRPO adopts the third approach. Instead of learning a value function, it directly estimates the expected return using multiple samples.</strong></p>
<section id="monte-carlo-baseline" class="level3">
<h3 class="anchored" data-anchor-id="monte-carlo-baseline">Monte Carlo Baseline</h3>
<p>For each prompt <img src="https://latex.codecogs.com/png.latex?q">, GRPO samples <strong>multiple completions</strong> <img src="https://latex.codecogs.com/png.latex?%5C%7Bo_1,%20o_2,%20%5Cldots,%20o_G%5C%7D"> from the policy and obtains their rewards <img src="https://latex.codecogs.com/png.latex?%5C%7Br_1,%20r_2,%20%5Cldots,%20r_G%5C%7D">. The average reward across these completions provides a Monte Carlo estimate of the expected return:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ab(q)%20=%20%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bi=1%7D%5EG%20r_i%20%5Capprox%20%5Cmathbb%7BE%7D_%7Bo%20%5Csim%20%5Cpi_%5Ctheta(o%7Cq)%7D%5Br(q,%20o)%5D%0A"></p>
<p>This is a natural and unbiased estimator. With enough samples it converges to the true expected reward for that prompt, <strong>similar to what a well-trained value function would predict, which eliminates the need to train a separate critic model</strong>.</p>
</section>
<section id="group-relative-advantage" class="level3">
<h3 class="anchored" data-anchor-id="group-relative-advantage">Group-Relative Advantage</h3>
<p>Using the average reward as a baseline, the advantage for completion <img src="https://latex.codecogs.com/png.latex?i"> becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_i%20=%20r_i%20-%20%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bj=1%7D%5EG%20r_j%20=%20r_i%20-%20%5Ctext%7Bmean%7D(r_1,%20%5Cldots,%20r_G)%0A"></p>
<p>GRPO normalizes the advantage by the standard deviation of rewards in the group:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_i%20=%20%5Cfrac%7Br_i%20-%20%5Ctext%7Bmean%7D(r_1,%20%5Cldots,%20r_G)%7D%7B%5Ctext%7Bstd%7D(r_1,%20%5Cldots,%20r_G)%7D%20%5Ctag%7BII.I%7D%0A"></p>
<p>This normalization ensures that advantages are on a comparable scale regardless of the prompt’s inherent difficulty.</p>
<blockquote class="blockquote">
<p>Think of it this way: different prompts can have vastly different reward scales. For example, a simple arithmetic question might yield rewards clustered around 0.9 while a challenging proof might have rewards spread across, say, 0.1-0.9. Without normalization, the policy gradient updates would be dominated by high-variance prompts, which can destabilize training.</p>
</blockquote>
<p>GRPO’s group-relative advantage mirrors the comparative nature of reward models, as we are asking “<em>how good is this completion relative to other completions for the same prompt</em>?”.</p>
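<p>As a quick sanity check, here is a minimal sketch of Eq. II.I in plain Python. The reward values are made up, the small <code>eps</code> term is my addition to avoid dividing by zero when every completion in a group gets the same reward, and I use the population standard deviation, which is an assumption on my part.</p>

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the group's mean and standard deviation (Eq. II.I)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


easy_prompt = [0.9, 0.9, 1.0, 0.8]  # rewards clustered together
hard_prompt = [0.1, 0.9, 0.5, 0.5]  # rewards spread out

print([round(a, 2) for a in group_relative_advantages(easy_prompt)])
print([round(a, 2) for a in group_relative_advantages(hard_prompt)])
```

<p>Even though the raw rewards of the two prompts live on very different scales, the best completion in each group ends up with roughly the same advantage after normalization. A group where all rewards are identical gets zero advantage everywhere and therefore contributes no learning signal.</p>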
</section>
</section>
<section id="iii-the-grpo-objective" class="level2">
<h2 class="anchored" data-anchor-id="iii-the-grpo-objective">III: The GRPO Objective</h2>
<p>We now have all the pieces needed to construct the full GRPO objective. The construction follows three key modifications:</p>
<p><strong>1. Start with PPO’s clipped surrogate.</strong> Recall from Section I that PPO optimizes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cmin%5Cleft(r_t(%5Ctheta)%20%5Chat%7BA%7D_t,%20%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5Ccdot%20%5Chat%7BA%7D_t%5Cright)%5Cright%5D%0A"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?clip(.)"> is <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)"> and this clipping mechanism provides a <strong>soft trust region</strong> that prevents destructively large policy updates.</p>
<p><strong>2. Replace GAE advantage with group-relative advantage.</strong> Instead of computing <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t"> using a learned critic and GAE, we substitute the group-relative advantage from Section II:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_i%20=%20%5Cfrac%7Br_i%20-%20%5Ctext%7Bmean%7D(r_1,%20%5Cldots,%20r_G)%7D%7B%5Ctext%7Bstd%7D(r_1,%20%5Cldots,%20r_G)%7D%0A"></p>
<p>This is the <strong>key simplification</strong>. We no longer need per-token value predictions. Instead, we estimate the baseline directly from sampled completions.</p>
<p><strong>3. Move KL penalty from reward to loss.</strong> In PPO, the KL penalty is typically subtracted from the reward signal (<img src="https://latex.codecogs.com/png.latex?r_t">) before computing advantages:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7Br%7D_t%20=%20r_t%20-%20%5Cbeta%20%5Ccdot%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%7Cs_t)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(a_t%7Cs_t)%7D%0A"></p>
<p>GRPO takes a different approach by adding the KL divergence directly as a penalty term in the loss function. This design choice simplifies advantage computation, since we don’t need to account for KL penalties in the baseline estimation.</p>
<blockquote class="blockquote">
<p><em>From DeepSeekMath</em>: “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_i">.”</p>
</blockquote>
<section id="the-full-grpo-objective" class="level3">
<h3 class="anchored" data-anchor-id="the-full-grpo-objective">The Full GRPO Objective</h3>
<p>Combining these modifications, the GRPO objective (to be maximized) is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AJ_%7B%5Ctext%7BGRPO%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7Bq%20%5Csim%20%5Cmathcal%7BD%7D,%5C,%20%5C%7Bo_i%5C%7D_%7Bi=1%7D%5EG%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o%7Cq)%7D%5Cleft%5B%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bi=1%7D%5EG%20%5Cleft(%20%5Cmin%5Cleft(%5Cfrac%7B%5Cpi_%5Ctheta(o_i%7Cq)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o_i%7Cq)%7D%5Chat%7BA%7D_i,%20%5C;%20%5Ctext%7Bclip%7D(%5Ccdot)%20%5Ccdot%20%5Chat%7BA%7D_i%5Cright)%20-%20%5Cbeta%20%5C,%20D_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D%5Cright)%20%5Cright)%5Cright%5D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?q"> is a prompt sampled from the training distribution <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5C%7Bo_1,%20o_2,%20%5Cldots,%20o_G%5C%7D"> are <img src="https://latex.codecogs.com/png.latex?G"> completions sampled from the old policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_i"> is the group-relative advantage for completion <img src="https://latex.codecogs.com/png.latex?i"> (from II.I)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is the KL penalty coefficient</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D"> is the frozen reference model (typically the SFT checkpoint)</li>
</ul>
<p>The objective averages over all completions in the group, treating each completion equally in the policy update.</p>
<p>For implementation, we expand the sequence-level objective over individual tokens. Since autoregressive models factor the probability of a completion as a product of token probabilities:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_%5Ctheta(o_i%7Cq)%20=%20%5Cprod_%7Bt=1%7D%5E%7B%7Co_i%7C%7D%20%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%0A"></p>
<p>The per-token formulation of GRPO becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AJ_%7B%5Ctext%7BGRPO%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7Bq%20%5Csim%20%5Cmathcal%7BD%7D,%5C,%20%5C%7Bo_i%5C%7D_%7Bi=1%7D%5EG%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o%7Cq)%7D%20%5Cleft%5B%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bi=1%7D%5EG%20%5Cfrac%7B1%7D%7B%7Co_i%7C%7D%5Csum_%7Bt=1%7D%5E%7B%7Co_i%7C%7D%20%5Cleft(%20%5Cmin%5Cleft(%5Cfrac%7B%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%5Chat%7BA%7D_%7Bi,t%7D,%20%5C;%20%5Ctext%7Bclip%7D(%5Ccdot)%20%5Ccdot%20%5Chat%7BA%7D_%7Bi,t%7D%5Cright)%20-%20%5Cbeta%20%5C,%20D_%7B%5Ctext%7BKL%7D%7D%5E%7B(t)%7D%20%5Cright)%5Cright%5D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?D_%7B%5Ctext%7BKL%7D%7D%5E%7B(t)%7D"> is the per-token KL divergence between the current policy and the reference model.</p>
<p><strong>Note, all tokens in completion <img src="https://latex.codecogs.com/png.latex?i"> receive the same advantage:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_%7Bi,t%7D%20=%20%5Chat%7BA%7D_i%20%5Cquad%20%5Cforall%20%5C,%20t%20%5Cin%20%5C%7B1,%202,%20%5Cldots,%20%7Co_i%7C%5C%7D%0A"></p>
<p>This is a deliberate simplification. Since we only receive a single reward for the entire completion, trying to learn which specific tokens were “good” or “bad” can be difficult.</p>
<blockquote class="blockquote">
<p><em>From DeepSeekMath</em>: “While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.”</p>
</blockquote>
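<p>The per-token objective can be sketched for a single prompt as follows. This is an illustrative pure-Python version, not a training-ready implementation: the function name, the nested-list layout of log-probabilities and the default values for <code>eps</code> and <code>beta</code> are assumptions. The KL term uses the per-token estimator given in (IV.I).</p>

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Per-token GRPO objective for one prompt's group (to be maximized).

    logp_*: one list of per-token log-probabilities per completion.
    advantages: one scalar per completion, shared by all of its tokens.
    """
    total = 0.0
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        per_completion = 0.0
        for ln, lo, lr in zip(new, old, ref):
            ratio = math.exp(ln - lo)
            clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
            surrogate = min(ratio * adv, clipped * adv)
            # Per-token KL estimate (IV.I): pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
            log_ref_ratio = lr - ln
            kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
            per_completion += surrogate - beta * kl
        total += per_completion / len(new)   # 1/|o_i| average over tokens
    return total / len(logp_new)             # 1/G average over completions

# Hypothetical group of G = 2 completions, evaluated before any update,
# so the new, old and reference log-probabilities coincide.
logps = [[-0.5, -1.2, -0.3], [-0.9, -0.4]]
value = grpo_objective(logps, logps, logps, advantages=[1.0, -1.0])
```

<p>Before any update the ratio is 1 and the KL estimate is 0 for every token, so the value is just the mean advantage of the group, which is zero here.</p>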
</section>
<section id="single-gradient-step-simplification" class="level3">
<h3 class="anchored" data-anchor-id="single-gradient-step-simplification">Single Gradient Step Simplification</h3>
<p>In practice (as mentioned in the RLHF Book), GRPO is often run with only <strong>one gradient step per batch</strong> of sampled data. In this case, <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta%20=%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"> at the start of the update, which means the policy ratio equals 1 and the clipping mechanism has no effect:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ar_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%20=%201%0A"></p>
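<p>At this point the gradient of the ratio equals the gradient of the log-probability, so the objective has the same gradient as a weighted policy gradient objective:</p>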
<p><img src="https://latex.codecogs.com/png.latex?%0AJ_%7B%5Ctext%7BGRPO%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7Bq%20%5Csim%20%5Cmathcal%7BD%7D,%5C,%20%5C%7Bo_i%5C%7D_%7Bi=1%7D%5EG%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(o%7Cq)%7D%20%5Cleft%5B%5Cfrac%7B1%7D%7BG%7D%5Csum_%7Bi=1%7D%5EG%20%5Cfrac%7B1%7D%7B%7Co_i%7C%7D%5Csum_%7Bt=1%7D%5E%7B%7Co_i%7C%7D%20%5Cleft(%20%5Chat%7BA%7D_i%20%5Ccdot%20%5Clog%20%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%20-%20%5Cbeta%20%5C,%20D_%7B%5Ctext%7BKL%7D%7D%5E%7B(t)%7D%20%5Cright)%5Cright%5D%0A"></p>
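<p>A quick numerical check of this simplification, under the assumption that we look at a single token and treat its log-probability as the free variable: at the old policy, the ratio term and the log-probability term have the same derivative. The helper names and numbers below are hypothetical.</p>

```python
import math

def ratio_term(logp, logp_old, advantage):
    """Unclipped surrogate term (pi_theta / pi_old) * A, in log-prob form."""
    return math.exp(logp - logp_old) * advantage

def log_prob_term(logp, advantage):
    """Policy-gradient surrogate term A * log pi_theta."""
    return advantage * logp

def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

logp_old = math.log(0.3)   # hypothetical token probability under the old policy
advantage = 1.7            # hypothetical group-relative advantage

grad_ratio = central_difference(lambda lp: ratio_term(lp, logp_old, advantage), logp_old)
grad_logp = central_difference(lambda lp: log_prob_term(lp, advantage), logp_old)
```

<p>Both derivatives come out equal to the advantage (up to finite-difference error), which is why a single on-policy step behaves like a plain advantage-weighted policy gradient.</p>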
</section>
</section>
<section id="iv-kl-divergence-in-grpo" class="level2">
<h2 class="anchored" data-anchor-id="iv-kl-divergence-in-grpo">IV: KL Divergence in GRPO</h2>
<p>The KL Divergence is a measure of the difference between two probability distributions. It is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%20=%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(x)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(x)%7D%5Cright%5D%0A"></p>
<p>From a single sample <img src="https://latex.codecogs.com/png.latex?x%20%5Csim%20%5Cpi_%5Ctheta">, it can simply be estimated as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%20%5Capprox%20%5Clog%20%5Cpi_%5Ctheta(x)%20-%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(x)%0A"></p>
<p>However, this can be <strong>negative</strong> for individual samples (when <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta%20%3C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D">) even though KL divergence is always non-negative. This may lead to high variance in gradient estimates.</p>
<p>GRPO uses an alternative estimator that is both <strong>unbiased</strong> and <strong>guaranteed non-negative</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D%5E%7B(t)%7D%20=%20%5Cfrac%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%7B%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%20-%20%5Clog%20%5Cfrac%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%7B%5Cpi_%5Ctheta(o_%7Bi,t%7D%7Cq,%20o_%7Bi,%3Ct%7D)%7D%20-%201%20%5Ctag%7BIV.I%7D%0A"></p>
<p>This estimator can be understood as measuring the gap between <img src="https://latex.codecogs.com/png.latex?%5Clog(x)"> and its tangent line at <img src="https://latex.codecogs.com/png.latex?x=1">. Since <img src="https://latex.codecogs.com/png.latex?%5Clog"> is concave, this gap is always non-negative, ensuring the KL penalty never incorrectly suggests that diverging from the reference <em>decreases</em> the penalty.</p>
<blockquote class="blockquote">
<p>For a detailed derivation of why this estimator is unbiased and non-negative, see <a href="http://joschu.net/blog/kl-approx.html">John Schulman’s excellent blog post on approximating KL divergence</a>.</p>
</blockquote>
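<p>The two properties can be checked exactly on a toy example. The sketch below assumes a hypothetical three-token vocabulary and computes expectations under the current policy in closed form, so there is no sampling noise. The names <code>k1</code> and <code>k3</code> follow the naming in John Schulman’s post.</p>

```python
import math

pi_theta = [0.5, 0.3, 0.2]   # hypothetical current-policy token distribution
pi_ref = [0.2, 0.3, 0.5]     # hypothetical reference distribution

def k1(p, q):
    """Naive per-sample estimate: log pi_theta(x) - log pi_ref(x)."""
    return math.log(p) - math.log(q)

def k3(p, q):
    """Per-sample estimate from (IV.I): q/p - log(q/p) - 1."""
    return q / p - math.log(q / p) - 1.0

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
k1_samples = [k1(p, q) for p, q in zip(pi_theta, pi_ref)]
k3_samples = [k3(p, q) for p, q in zip(pi_theta, pi_ref)]

# Exact expectations under x ~ pi_theta (no Monte Carlo noise).
mean_k1 = sum(p * s for p, s in zip(pi_theta, k1_samples))
mean_k3 = sum(p * s for p, s in zip(pi_theta, k3_samples))
```

<p>Both estimators average to the true KL divergence, but only <code>k1</code> produces a negative value for an individual token (the one where the current policy assigns less probability than the reference).</p>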
</section>
<section id="v-outcome-vs.-process-supervision" class="level2">
<h2 class="anchored" data-anchor-id="v-outcome-vs.-process-supervision">V: Outcome vs.&nbsp;Process Supervision</h2>
<p>The GRPO formulation presented so far assumes <strong>outcome supervision</strong>, which provides a single reward at the end of each completion, with the same advantage assigned to every token. However, for complex reasoning tasks, knowing only the final answer reward might not be sufficient.</p>
<blockquote class="blockquote">
<p><em>From DeepSeekMath Paper</em>: “Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks.”</p>
</blockquote>
<p><strong>Process supervision</strong> addresses this by providing rewards at the end of each reasoning step. Given a completion <img src="https://latex.codecogs.com/png.latex?o_i"> with <img src="https://latex.codecogs.com/png.latex?K_i"> reasoning steps, a <strong>process reward model (PRM)</strong> assigns rewards <img src="https://latex.codecogs.com/png.latex?%5C%7Br_i%5E%7B%5Ctext%7Bindex%7D(1)%7D,%20%5Cldots,%20r_i%5E%7B%5Ctext%7Bindex%7D(K_i)%7D%5C%7D"> at step boundaries where <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bindex%7D(j)"> is the end token index of the <img src="https://latex.codecogs.com/png.latex?j">-th step.</p>
<p>GRPO extends to process supervision with two modifications:</p>
<p><strong>1. Normalize across all step rewards in the group:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7Br%7D_i%5E%7B%5Ctext%7Bindex%7D(j)%7D%20=%20%5Cfrac%7Br_i%5E%7B%5Ctext%7Bindex%7D(j)%7D%20-%20%5Ctext%7Bmean%7D(%5Cmathcal%7BR%7D)%7D%7B%5Ctext%7Bstd%7D(%5Cmathcal%7BR%7D)%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BR%7D"> contains all step rewards across all <img src="https://latex.codecogs.com/png.latex?G"> completions.</p>
<p><strong>2. Compute advantages as cumulative future rewards:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BA%7D_%7Bi,t%7D%20=%20%5Csum_%7B%5Ctext%7Bindex%7D(j)%20%5Cgeq%20t%7D%20%5Ctilde%7Br%7D_i%5E%7B%5Ctext%7Bindex%7D(j)%7D%20%5Ctag%7BV.I%7D%0A"></p>
<p>This mirrors <strong>return-to-go</strong> in traditional RL where earlier tokens accumulate rewards from all subsequent steps, while tokens near the end see only remaining rewards.</p>
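<p>The two modifications can be sketched together as below. The function name, the argument layout, the 0-based token indices, the <code>eps</code> guard and the population standard deviation are all assumptions made for this illustration.</p>

```python
from statistics import fmean, pstdev

def process_advantages(step_rewards, step_end_indices, num_tokens, eps=1e-8):
    """Token-level advantages under process supervision (V.I).

    step_rewards[i][j]: PRM reward for step j of completion i.
    step_end_indices[i][j]: 0-based index of the last token of that step.
    num_tokens[i]: number of tokens in completion i.
    """
    # 1. Normalize every step reward with statistics pooled over the group.
    all_rewards = [r for rewards in step_rewards for r in rewards]
    mean, std = fmean(all_rewards), pstdev(all_rewards)
    advantages = []
    for rewards, ends, length in zip(step_rewards, step_end_indices, num_tokens):
        normalized = [(r - mean) / (std + eps) for r in rewards]
        # 2. Sum the normalized rewards of every step ending at or after token t.
        token_adv = [
            sum(n for n, end in zip(normalized, ends) if end >= t)
            for t in range(length)
        ]
        advantages.append(token_adv)
    return advantages

# Hypothetical group: completion 0 has two steps, completion 1 has one.
adv = process_advantages(
    step_rewards=[[1.0, 0.0], [1.0]],
    step_end_indices=[[1, 3], [2]],
    num_tokens=[4, 3],
)
```

<p>In this example the tokens of the first step in completion 0 see the sum of both normalized step rewards, while the tokens of its second step see only the last one.</p>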
<blockquote class="blockquote">
<p>The DeepSeekMath experiments found process supervision can accelerate learning, though the gap narrows with iterative training. For domains with reliable verifiers (code execution, math answer checking), outcome supervision with RLVR has become dominant. DeepSeek-R1 uses only outcome-level verification.</p>
</blockquote>
</section>
<section id="vi-connection-to-reinforce-leave-one-out-rloo" class="level2">
<h2 class="anchored" data-anchor-id="vi-connection-to-reinforce-leave-one-out-rloo">VI: Connection to REINFORCE Leave-One-Out (RLOO)</h2>
<p>GRPO is not the only critic-free algorithm leveraging group sampling. <a href="https://arxiv.org/html/2402.14740v1">REINFORCE Leave-One-Out (RLOO)</a> takes a similar approach but computes the baseline as the mean reward over all <em>other</em> completions, <strong>excluding</strong> the current sample:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA_i%5E%7B%5Ctext%7BRLOO%7D%7D%20=%20r_i%20-%20%5Cfrac%7B1%7D%7BG-1%7D%5Csum_%7Bj=1,%20j%20%5Cneq%20i%7D%5E%7BG%7D%20r_j%20%5Ctag%7BVI.I%7D%0A"></p>
<p>This “leave-one-out” baseline avoids a subtle correlation that exists when the baseline includes the sample being evaluated.</p>
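<p>A small sketch of (VI.I), with hypothetical rewards and function names. It also checks an algebraic identity that follows directly from the definition: the leave-one-out advantage equals the mean-baseline advantage scaled by G / (G - 1), so the two baselines differ only by a constant factor before GRPO’s division by the standard deviation.</p>

```python
from statistics import fmean

def rloo_advantages(rewards):
    """Leave-one-out advantage (VI.I): r_i minus the mean of the other rewards."""
    total, group_size = sum(rewards), len(rewards)
    return [r - (total - r) / (group_size - 1) for r in rewards]

def mean_baseline_advantages(rewards):
    """Unnormalized GRPO-style advantage: r_i minus the mean of all rewards."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 1.0]   # hypothetical binary rewards, G = 5
rloo = rloo_advantages(rewards)
centered = mean_baseline_advantages(rewards)
scale = len(rewards) / (len(rewards) - 1)   # G / (G - 1)
```

<p>Here each entry of <code>rloo</code> matches <code>scale</code> times the corresponding entry of <code>centered</code>.</p>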
<p>The two algorithms are conceptually very similar. However, there are some key differences:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Aspect</th>
<th>RLOO</th>
<th>GRPO</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Baseline</strong></td>
<td>Mean of <em>other</em> samples</td>
<td>Mean of <em>all</em> samples</td>
</tr>
<tr class="even">
<td><strong>Normalization</strong></td>
<td>None</td>
<td>Divide by std</td>
</tr>
<tr class="odd">
<td><strong>Clipping</strong></td>
<td>No</td>
<td>Yes (PPO-style)</td>
</tr>
<tr class="even">
<td><strong>KL Placement</strong></td>
<td>In reward</td>
<td>In loss</td>
</tr>
</tbody>
</table>
<p><strong>GRPO can be understood as inheriting PPO’s clipping mechanism for stability while adopting RLOO-style group sampling to eliminate the critic.</strong></p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>GRPO’s ingenuity comes from recognizing that PPO’s value function is fundamentally just a baseline for advantage computation and that an estimate obtained via group sampling can serve the same role. By sampling multiple completions per prompt and using their mean reward as the baseline, GRPO achieves the stability provided by PPO’s clipped surrogate objective without the memory overhead or complexity of training a separate critic model.</p>
<p>This design choice is what makes GRPO a preferred approach for RLVR training of LLMs focused on reasoning capabilities.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p><strong>Papers:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a>: The original GRPO paper</li>
<li><a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a>: DeepSeek’s reasoning model trained with GRPO</li>
<li><a href="https://arxiv.org/abs/2402.14740">Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs</a>: The RLOO paper comparing critic-free methods</li>
</ul>
<p><strong>Blogs:</strong></p>
<ul>
<li><a href="http://joschu.net/blog/kl-approx.html">John Schulman’s post on approximating KL divergence</a>: Derivation of the unbiased non-negative KL estimator</li>
<li><a href="https://rlhfbook.com/">RLHF Book by Nathan Lambert</a>: Comprehensive resource on RLHF algorithms including GRPO implementation details</li>
<li><a href="https://substack.com/home/post/p-177823868">Understanding GRPO by Cameron R. Wolfe</a>: Deep dive into GRPO mechanics and implementation</li>
</ul>


</section>

 ]]></description>
  <category>RL &amp; Alignment</category>
  <guid>https://garg-aayush.github.io/posts/2026-01-01-understanding-grpo.html</guid>
  <pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Deriving the DPO Loss from First Principles</title>
  <link>https://garg-aayush.github.io/posts/2025-12-30-deriving-dpo-loss.html</link>
  <description><![CDATA[ 




<p>In my <a href="https://aayushgarg.dev/posts/2025-12-25-deriving-ppo-loss.html">previous post</a>, I worked through the derivation of the PPO loss used in RLHF for LLMs. By the end, we arrived at a fairly daunting objective function with multiple components: clipped surrogate, value function loss, entropy bonus and KL penalty. It is not just the final objective that is intimidating; the entire RLHF pipeline is complex and multi-step. You first train a separate reward model to reflect human preferences, then fine-tune the LLM using RL with PPO.</p>
<p>That brings us to <a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization</a> (DPO). DPO is a computationally lightweight alternative that directly optimizes LLMs to adhere to human preferences <strong>without explicit reward modeling or reinforcement learning</strong>. The key insight is that DPO implicitly optimizes the <strong>same</strong> objective as PPO-based RLHF (reward maximization with a KL-divergence constraint) but it replaces the entire reward model + PPO loop with a single supervised objective on preference pairs. There is no sampling during training, no value function, no clipping, just a classification loss!</p>
<p>Here I derive the DPO loss showing exactly how this simplification is possible. I will assume familiarity with concepts from the PPO post, particularly the reward model and the KL-constrained RLHF objective.</p>
<blockquote class="blockquote">
<p>Again, a huge shoutout to Umar Jamil’s <a href="https://www.youtube.com/watch?v=hvGa5Mba4c8">video on DPO</a> for an excellent walkthrough that helped me understand the derivation.</p>
</blockquote>
<section id="i-the-rlhf-objective" class="level2">
<h2 class="anchored" data-anchor-id="i-the-rlhf-objective">I: The RLHF Objective</h2>
<p>Let’s recall the RLHF objective from the PPO blog. The goal of RLHF is to find a policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> that maximizes expected reward while staying close to a reference model <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AJ_%7B%5Ctext%7BRLHF%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20%5Cmathcal%7BD%7D,%20y%20%5Csim%20%5Cpi_%5Ctheta(y%7Cx)%7D%5Cleft%5Br_%5Cphi(x,%20y)%5Cright%5D%20-%20%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%5Ctheta(y%7Cx)%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%5Cright)%0A"></p>
<p>The first term encourages the model to generate high-reward responses. The second term (KL penalty) prevents the model from drifting too far from the reference which helps avoid reward hacking and maintains language quality.</p>
<p>As we saw in the PPO blog, we can’t optimize this objective directly with gradient descent because the expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi_%5Ctheta(y%7Cx)%7D%5B%5Ccdot%5D"> requires sampling from the policy and sampling is non-differentiable. This is why we needed reinforcement learning algorithms like REINFORCE and PPO. They provide ways to estimate policy gradients without differentiating through the sampling process.</p>
<blockquote class="blockquote">
<p>What if we could reformulate the problem so that we don’t need to sample from the policy during training? This is exactly what DPO will achieve.</p>
</blockquote>
</section>
<section id="ii-the-bradley-terry-model-for-preference-learning" class="level2">
<h2 class="anchored" data-anchor-id="ii-the-bradley-terry-model-for-preference-learning">II: The Bradley-Terry Model for Preference Learning</h2>
<p>We also need to understand the <a href="https://www.jstor.org/stable/2334029">Bradley-Terry model</a> for reward model training in a bit more detail with focus on <em>why</em> it works the way it does.</p>
<p>Training a reward model requires human-labeled preference data that compares pairs of responses: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BD%7D%20=%20%5Cleft%5C%7B(x%5E%7B(i)%7D,%20y_w%5E%7B(i)%7D,%20y_l%5E%7B(i)%7D)%5Cright%5C%7D_%7Bi=1%7D%5E%7BN%7D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?x"> is the prompt</li>
<li><img src="https://latex.codecogs.com/png.latex?y_w"> is the <strong>preferred</strong> (winning) response</li>
<li><img src="https://latex.codecogs.com/png.latex?y_l"> is the <strong>dispreferred</strong> (losing) response</li>
</ul>
<section id="the-bradley-terry-probability-model" class="level3">
<h3 class="anchored" data-anchor-id="the-bradley-terry-probability-model">The Bradley-Terry Probability Model</h3>
<p>The Bradley-Terry model provides a principled way to convert comparison data (defined above) into a probabilistic model. It assumes there exists some latent reward function <img src="https://latex.codecogs.com/png.latex?r%5E*(x,%20y)"> that captures true response quality and models the probability that response <img src="https://latex.codecogs.com/png.latex?y_w"> is preferred over <img src="https://latex.codecogs.com/png.latex?y_l"> as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Cfrac%7Be%5E%7Br%5E*(x,%20y_w)%7D%7D%7Be%5E%7Br%5E*(x,%20y_w)%7D%20+%20e%5E%7Br%5E*(x,%20y_l)%7D%7D%20%5Ctag%7BII.I%7D%0A"></p>
<p>The intuition is straightforward: responses with higher reward are exponentially more likely to be preferred.</p>
<p>A key step is recognizing that this <strong>ratio of exponentials can be written as a sigmoid function</strong>. This is important because it connects Bradley-Terry to standard binary classification.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?A%20=%20r(x,%20y_w)"> and <img src="https://latex.codecogs.com/png.latex?B%20=%20r(x,%20y_l)">. We want to show:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Be%5EA%7D%7Be%5EA%20+%20e%5EB%7D%20=%20%5Csigma(A%20-%20B)%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Csigma(z)%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-z%7D%7D"> is the sigmoid function.</p>
<p>Starting with the left-hand side:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Be%5EA%7D%7Be%5EA%20+%20e%5EB%7D%0A"></p>
<p>we can rewrite it as: <img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cfrac%7Be%5EA%20/%20e%5EA%7D%7B(e%5EA%20+%20e%5EB)%20/%20e%5EA%7D%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5EB%20/%20e%5EA%7D%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7BB-A%7D%7D%0A"></p>
<p>which is the sigmoid function of <img src="https://latex.codecogs.com/png.latex?(A%20-%20B)">: <img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-(A-B)%7D%7D%20=%20%5Csigma(A%20-%20B)%0A"></p>
<p>Therefore, the Bradley-Terry model can be written as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Csigma%5Cleft(r(x,%20y_w)%20-%20r(x,%20y_l)%5Cright)%7D%20%5Ctag%7BII.II%7D%0A"></p>
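<p>The identity is easy to confirm numerically. The reward values below are hypothetical and the function names are chosen only for this sketch.</p>

```python
import math

def bradley_terry(r_w, r_l):
    """P(y_w preferred over y_l) as a ratio of exponentials (II.I)."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

r_w, r_l = 2.0, 0.5    # hypothetical rewards for the winning and losing response
p_ratio = bradley_terry(r_w, r_l)
p_sigmoid = sigmoid(r_w - r_l)
```

<p>Both expressions give the same probability, and it is above 0.5 because the winning response has the higher reward.</p>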
</section>
<section id="reward-model-loss" class="level3">
<h3 class="anchored" data-anchor-id="reward-model-loss">Reward Model Loss</h3>
<p>Given a dataset of preferences <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">, we can train a parameterized reward model <img src="https://latex.codecogs.com/png.latex?r_%5Cphi(x,%20y)"> using maximum likelihood estimation. We want to maximize the probability of observing the preferences in our dataset:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmax_%5Cphi%20%5Cprod_%7B(x,%20y_w,%20y_l)%20%5Cin%20%5Cmathcal%7BD%7D%7D%20P(y_w%20%5Csucc%20y_l%20%7C%20x)%0A"></p>
<p>Taking the log and negating (to turn maximization into minimization), we get the negative log-likelihood loss:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cmathcal%7BL%7D_%7B%5Ctext%7BRM%7D%7D(%5Cphi)%20=%20-%5Cmathbb%7BE%7D_%7B(x,%20y_w,%20y_l)%20%5Csim%20%5Cmathcal%7BD%7D%7D%5Cleft%5B%5Clog%20%5Csigma%5Cleft(r_%5Cphi(x,%20y_w)%20-%20r_%5Cphi(x,%20y_l)%5Cright)%5Cright%5D%7D%20%5Ctag%7BII.III%7D%0A"></p>
<p>This is just a binary cross-entropy objective, which helps the reward model learn to assign higher rewards to preferred responses. Notice that the Bradley-Terry model depends <strong>only on the difference of rewards</strong>: <img src="https://latex.codecogs.com/png.latex?r(x,%20y_w)%20-%20r(x,%20y_l)">. The absolute values don’t matter; only the difference between them does. This means:</p>
<blockquote class="blockquote">
<p>If we add to the reward any constant <img src="https://latex.codecogs.com/png.latex?c"> or any function <img src="https://latex.codecogs.com/png.latex?f(x)"> that depends only on the prompt (not the response), the <strong>preference probabilities don’t change</strong>. This invariance property will be the key to deriving DPO.</p>
</blockquote>
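<p>A minimal sketch of the loss in (II.III) and of the invariance property, assuming we already have scalar reward scores for each pair. The scores, the per-prompt shifts and the function names are hypothetical.</p>

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def reward_model_loss(reward_pairs):
    """Mean negative log-likelihood (II.III) over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs) / len(reward_pairs)

pairs = [(2.0, 0.5), (0.1, 0.4), (1.2, 1.0)]   # hypothetical (r_w, r_l) scores
prompt_shifts = [5.0, -3.0, 0.7]               # hypothetical f(x), one per prompt

# Add the same prompt-dependent offset to both responses of each pair.
shifted = [(r_w + f, r_l + f) for (r_w, r_l), f in zip(pairs, prompt_shifts)]
loss = reward_model_loss(pairs)
loss_shifted = reward_model_loss(shifted)
```

<p>The two losses agree up to floating-point error, because the offset cancels inside every reward difference.</p>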
</section>
</section>
<section id="iii-optimal-policy-in-closed-form" class="level2">
<h2 class="anchored" data-anchor-id="iii-optimal-policy-in-closed-form">III: Optimal Policy in Closed Form</h2>
<p>Here, we will find the <strong>exact</strong> analytical solution for the optimal policy. We want to find the policy that maximizes expected reward while penalizing KL divergence from the reference policy:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmax_%5Cpi%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20%5Cmathcal%7BD%7D,%20y%20%5Csim%20%5Cpi(y%7Cx)%7D%5Cleft%5Br(x,%20y)%5Cright%5D%20-%20%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi(y%7Cx)%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%5Cright)%20%5Ctag%7BIII.I%7D%0A"></p>
<p>Note that I am writing <img src="https://latex.codecogs.com/png.latex?%5Cpi"> instead of <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> to emphasize that we are looking for the optimal policy in general, not just the parameterized version.</p>
<p>Expanding the KL divergence:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi(y%7Cx)%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%5Cright)%20=%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(y%7Cx)%7D%5Cleft%5B%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%5Cright%5D%0A"></p>
<p>So our objective becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmax_%5Cpi%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20%5Cmathcal%7BD%7D%7D%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(y%7Cx)%7D%5Cleft%5Br(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%5Cright%5D%0A"></p>
<p>For a fixed prompt <img src="https://latex.codecogs.com/png.latex?x">, we want to find the distribution <img src="https://latex.codecogs.com/png.latex?%5Cpi(%5Ccdot%7Cx)"> that maximizes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(y%7Cx)%7D%5Cleft%5Br(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%5Cright%5D%0A"></p>
<p>This is a constrained optimization problem over probability distributions. We can solve it using the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>, enforcing that <img src="https://latex.codecogs.com/png.latex?%5Cpi(y%7Cx)"> sums to 1. For discrete <img src="https://latex.codecogs.com/png.latex?y">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D(%5Cpi,%20%5Clambda)%20=%20%5Csum_y%20%5Cpi(y%7Cx)%20%5Cleft%5Br(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%5Cright%5D%20+%20%5Clambda%20%5Cleft(1%20-%20%5Csum_y%20%5Cpi(y%7Cx)%5Cright)%0A"></p>
<p>Taking the derivative with respect to <img src="https://latex.codecogs.com/png.latex?%5Cpi(y%7Cx)"> and setting it to zero (stationary point):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%20%5Cmathcal%7BL%7D%7D%7B%5Cpartial%20%5Cpi(y%7Cx)%7D%20=%20r(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%20-%20%5Cbeta%20-%20%5Clambda%20=%200%0A"></p>
<p>Now, solving for <img src="https://latex.codecogs.com/png.latex?%5Cpi(y%7Cx)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%20=%20%5Cfrac%7B1%7D%7B%5Cbeta%7D%5Cleft(r(x,%20y)%20-%20%5Cbeta%20-%20%5Clambda%5Cright)%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%20=%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%20%5Ccdot%20%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(y%7Cx)%20=%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Ccdot%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%20%5Ccdot%20%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)%0A"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)"> is a constant (with respect to <img src="https://latex.codecogs.com/png.latex?y">) that ensures normalization. To find its value, we enforce that <img src="https://latex.codecogs.com/png.latex?%5Cpi(y%7Cx)"> must be a valid probability distribution and sum to 1:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Csum_y%20%5Cpi(y%7Cx)%20=%201%0A"></p>
<p>Substituting our expression for <img src="https://latex.codecogs.com/png.latex?%5Cpi(y%7Cx)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Csum_y%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Ccdot%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%20%5Ccdot%20%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)%20=%201%0A"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)"> doesn’t depend on <img src="https://latex.codecogs.com/png.latex?y">, we can factor it out of the sum:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)%20%5Ccdot%20%5Csum_y%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Ccdot%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%20=%201%0A"></p>
<p>Solving for the constant:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cexp%5Cleft(-1%20-%20%5Cfrac%7B%5Clambda%7D%7B%5Cbeta%7D%5Cright)%20=%20%5Cfrac%7B1%7D%7B%5Csum_y%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Ccdot%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%7D%0A"></p>
<p>We define this normalizing sum as the <strong>partition function</strong> <img src="https://latex.codecogs.com/png.latex?Z(x)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AZ(x)%20=%20%5Csum_y%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%20%5Ctag%7BIII.II%7D%0A"></p>
<p>Substituting back, we get the optimal policy:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cpi_r(y%7Cx)%20=%20%5Cfrac%7B1%7D%7BZ(x)%7D%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%7D%20%5Ctag%7BIII.III%7D%0A"></p>
<p>We have an exact closed-form expression for the optimal policy. However, we cannot compute it directly because <img src="https://latex.codecogs.com/png.latex?Z(x)"> is intractable. To compute it, we would need to sum over <strong>all possible responses</strong> <img src="https://latex.codecogs.com/png.latex?y">, which is not possible.</p>
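<p>On a toy problem where the response set is small enough to enumerate, the closed form can be evaluated directly. The sketch below assumes a hypothetical prompt with only three possible responses; the probabilities, rewards and function names are made up for illustration.</p>

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum (III.III) over a small, enumerable response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)                     # partition function Z(x) from (III.II)
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a fixed prompt."""
    return sum(p * (r - beta * math.log(p / q)) for p, q, r in zip(pi, pi_ref, rewards))

pi_ref = [0.5, 0.3, 0.2]     # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]    # hypothetical rewards r(x, y)
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)
# Any other distribution should score no higher, e.g. the reference itself.
baseline = objective(pi_ref, pi_ref, rewards, beta)
```

<p>The optimal policy shifts probability mass toward high-reward responses while staying anchored to the reference. With a real LLM the sum inside <code>optimal_policy</code> would run over every possible response, which is exactly why this direct route is not usable in practice.</p>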
</section>
<section id="iv-the-reparameterization-trick" class="level2">
<h2 class="anchored" data-anchor-id="iv-the-reparameterization-trick">IV: The Reparameterization Trick</h2>
<p>The key insight of DPO is to flip the relationship between reward and policy. We now frame the problem as: “given an optimal policy, what reward function does it correspond to?”</p>
<p>Starting from the optimal policy equation (III.III):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_r(y%7Cx)%20=%20%5Cfrac%7B1%7D%7BZ(x)%7D%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20%5Cexp%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%5Cright)%0A"></p>
<p>We solve for the reward <img src="https://latex.codecogs.com/png.latex?r(x,%20y)"> by first taking the log of both sides:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20%5Cpi_r(y%7Cx)%20=%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20+%20%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%20-%20%5Clog%20Z(x)%0A"></p>
<p>Now rearrange to get <img src="https://latex.codecogs.com/png.latex?r(x,%20y)"> on the left-hand side:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B1%7D%7B%5Cbeta%7Dr(x,%20y)%20=%20%5Clog%20%5Cpi_r(y%7Cx)%20-%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20+%20%5Clog%20Z(x)%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ar(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cpi_r(y%7Cx)%20-%20%5Cbeta%20%5Clog%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%20+%20%5Cbeta%20%5Clog%20Z(x)%0A"></p>
<p>This can be written more compactly as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7Br(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_r(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)%7D%20%5Ctag%7BIV.I%7D%0A"></p>
<p>The reward is expressed as the sum of two terms:</p>
<ul>
<li>A term involving the log-ratio of the optimal policy to the reference policy</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbeta%20%5Clog%20Z(x)">, which depends only on <img src="https://latex.codecogs.com/png.latex?x"> (not on <img src="https://latex.codecogs.com/png.latex?y">)</li>
</ul>
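<p>As a sanity check on (IV.I), the following sketch builds the optimal policy for a toy enumerable response set and then recovers the reward from the policy. All numbers are assumed for illustration.</p>

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # assumed reference probabilities
rewards = [1.0, 2.0, 0.0]  # assumed true rewards

# Forward direction (III.III): reward -> optimal policy
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(weights)
pi_r = [w / Z for w in weights]

# Reverse direction (IV.I): optimal policy -> reward
recovered = [beta * math.log(q / p) + beta * math.log(Z)
             for q, p in zip(pi_r, pi_ref)]

# Dropping beta * log Z shifts every reward by the same constant
implicit = [beta * math.log(q / p) for q, p in zip(pi_r, pi_ref)]

print(recovered)
print(implicit[0] - implicit[1], rewards[0] - rewards[1])
```

<p>The recovered rewards match the original ones up to floating-point error, and the log-ratio term alone already reproduces every reward <em>difference</em>.</p>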
</section>
<section id="v-deriving-the-dpo-loss" class="level2">
<h2 class="anchored" data-anchor-id="v-deriving-the-dpo-loss">V: Deriving the DPO Loss</h2>
<p>Finally, we have all the pieces to derive the DPO loss. From Section II, the Bradley-Terry preference model is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Csigma%5Cleft(r(x,%20y_w)%20-%20r(x,%20y_l)%5Cright)%0A"></p>
<p>From Section IV, assuming we have access to an optimal policy <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*">, the reward can be written as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ar(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)%0A"></p>
<p>Substituting this into Bradley-Terry:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Csigma%5Cleft(%5Cleft%5B%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_w%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_w%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)%5Cright%5D%20-%20%5Cleft%5B%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_l%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_l%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)%5Cright%5D%5Cright)%0A"></p>
<p>Simplifying the expression inside the sigmoid:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20%5Csigma%5Cleft(%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_w%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_w%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_l%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_l%7Cx)%7D%20-%20%5Cbeta%20%5Clog%20Z(x)%5Cright)%0A"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Cbeta%20%5Clog%20Z(x)"> terms cancel:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20%5Csigma%5Cleft(%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_w%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_w%7Cx)%7D%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_l%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_l%7Cx)%7D%5Cright)%0A"></p>
<blockquote class="blockquote">
<p>Now recall the critical insight from Section II where we mentioned that the Bradley-Terry model depends only on reward <strong>differences</strong>. Thus, when we compute <img src="https://latex.codecogs.com/png.latex?r(x,%20y_w)%20-%20r(x,%20y_l)"> the intractable partition function <img src="https://latex.codecogs.com/png.latex?Z(x)"> cancels out. This is what makes DPO possible.</p>
</blockquote>
<p>We can write this more cleanly by defining the <strong>implicit reward</strong> in terms of the optimal policy:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7Br%7D(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%0A"></p>
<p>Thus:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Csigma%5Cleft(%5Chat%7Br%7D(x,%20y_w)%20-%20%5Chat%7Br%7D(x,%20y_l)%5Cright)%0A"></p>
<p>We don’t actually have access to the optimal policy <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*">. But we can <strong>parameterize</strong> a policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> and optimize it to maximize the likelihood of the observed preferences. This is exactly what the reward model loss (II.III) does, except now <strong>our reward is implicitly defined by the policy itself</strong>.</p>
<p>The DPO loss is the negative log-likelihood:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cmathcal%7BL%7D_%7B%5Ctext%7BDPO%7D%7D(%5Cpi_%5Ctheta;%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%20=%20-%5Cmathbb%7BE%7D_%7B(x,%20y_w,%20y_l)%20%5Csim%20%5Cmathcal%7BD%7D%7D%5Cleft%5B%5Clog%20%5Csigma%5Cleft(%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(y_w%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_w%7Cx)%7D%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(y_l%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y_l%7Cx)%7D%5Cright)%5Cright%5D%7D%20%5Ctag%7BV.I%7D%0A"></p>
<p>In implicit reward notation, it can be written as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_%7B%5Ctext%7BDPO%7D%7D(%5Cpi_%5Ctheta;%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%20=%20-%5Cmathbb%7BE%7D_%7B(x,%20y_w,%20y_l)%20%5Csim%20%5Cmathcal%7BD%7D%7D%5Cleft%5B%5Clog%20%5Csigma%5Cleft(%5Chat%7Br%7D_%5Ctheta(x,%20y_w)%20-%20%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%5Cright)%5Cright%5D%20%5Ctag%7BV.II%7D%0A"></p>
<p>Some key insights from the above DPO loss:</p>
<ul>
<li>The policy implicitly defines its own reward via the log-ratio with the reference. There is <strong>no separate reward model</strong>.</li>
<li>This is just a supervised classification loss on preference pairs with <strong>no RL</strong>.</li>
<li>DPO uses the fixed preference dataset <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D">. Thus, <strong>no sampling during training</strong>.</li>
<li><strong>No value function</strong> is needed since we’re not doing policy gradients.</li>
<li>DPO still optimizes the same KL-constrained reward maximization objective, just without an explicit reward model or an RL loop.</li>
</ul>
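<p>For a single preference pair, the loss in (V.I) reduces to a few lines once the four sequence log-probabilities are available. The sketch below is plain Python with a hypothetical function name <code>dpo_loss</code> and an assumed default of <code>beta=0.1</code>. A real training loop would compute the same quantity on batched tensors.</p>

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = -log(1 + exp(-z)); branch on the sign to avoid overflow
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probabilities."""
    r_hat_w = beta * (policy_logp_w - ref_logp_w)  # implicit reward of y_w
    r_hat_l = beta * (policy_logp_l - ref_logp_l)  # implicit reward of y_l
    return -log_sigmoid(r_hat_w - r_hat_l)

# Policy identical to the reference: margin is 0, so the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has raised y_w and lowered y_l relative to the reference: lower loss
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

<p>Note that only log-probabilities from the policy and the frozen reference enter the computation. No reward model and no sampling are involved.</p>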
</section>
<section id="vi-building-intuition-for-dpo" class="level2">
<h2 class="anchored" data-anchor-id="vi-building-intuition-for-dpo">VI: Building Intuition for DPO</h2>
<p>Now that we have the DPO loss we can build some intuition around the implicit reward model and its gradient updates.</p>
<section id="implicit-reward-model" class="level3">
<h3 class="anchored" data-anchor-id="implicit-reward-model">Implicit Reward Model</h3>
<p>The DPO paper’s subtitle is <strong>“Your Language Model is Secretly a Reward Model”</strong>, and it captures the key insight: the LLM itself serves as the reward model. The policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> defines an implicit reward function:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7Br%7D_%5Ctheta(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(y%7Cx)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)%7D%0A"></p>
<p>This reward measures how much more likely the current policy is to generate response <img src="https://latex.codecogs.com/png.latex?y"> compared to the reference policy, scaled by <img src="https://latex.codecogs.com/png.latex?%5Cbeta">.</p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(y%7Cx)%20%3E%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)">: The implicit reward is positive (the policy “likes” this response more than reference)</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(y%7Cx)%20%3C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(y%7Cx)">: The implicit reward is negative (the policy “likes” this response less than reference)</li>
</ul>
</section>
<section id="analyzing-the-gradient-update" class="level3">
<h3 class="anchored" data-anchor-id="analyzing-the-gradient-update">Analyzing the Gradient Update</h3>
<p>We can flex our brain muscles one more time and compute the gradient for the DPO objective:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_%7B%5Ctext%7BDPO%7D%7D%20=%20-%5Cmathbb%7BE%7D%5Cleft%5B%5Clog%20%5Csigma%5Cleft(%5Chat%7Br%7D_%5Ctheta(x,%20y_w)%20-%20%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%5Cright)%5Cright%5D%0A"></p>
<p>Let <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Chat%7Br%7D_%5Ctheta(x,%20y_w)%20-%20%5Chat%7Br%7D_%5Ctheta(x,%20y_l)">. Using the chain rule:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20%5Cmathcal%7BL%7D_%7B%5Ctext%7BDPO%7D%7D%20=%20-%5Cmathbb%7BE%7D%5Cleft%5B%5Cfrac%7B%5Csigma'(u)%7D%7B%5Csigma(u)%7D%20%5Cnabla_%5Ctheta%20u%5Cright%5D%0A"></p>
<p>Using the property <img src="https://latex.codecogs.com/png.latex?%5Csigma'(u)%20=%20%5Csigma(u)(1%20-%20%5Csigma(u))">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20-%5Cmathbb%7BE%7D%5Cleft%5B%5Cfrac%7B%5Csigma(u)(1-%5Csigma(u))%7D%7B%5Csigma(u)%7D%20%5Cnabla_%5Ctheta%20u%5Cright%5D%20=%20-%5Cmathbb%7BE%7D%5Cleft%5B(1%20-%20%5Csigma(u))%20%5Cnabla_%5Ctheta%20u%5Cright%5D%0A"></p>
<p>Using <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Csigma(u)%20=%20%5Csigma(-u)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20-%5Cmathbb%7BE%7D%5Cleft%5B%5Csigma(-u)%20%5Cnabla_%5Ctheta%20u%5Cright%5D%0A"></p>
<p>Now, <img src="https://latex.codecogs.com/png.latex?-u%20=%20%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%20-%20%5Chat%7Br%7D_%5Ctheta(x,%20y_w)">, and:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20u%20=%20%5Cnabla_%5Ctheta%20%5Chat%7Br%7D_%5Ctheta(x,%20y_w)%20-%20%5Cnabla_%5Ctheta%20%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%20=%20%5Cbeta%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_w%7Cx)%20-%20%5Cbeta%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_l%7Cx)%0A"></p>
<p>Putting it together:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20%5Cmathcal%7BL%7D_%7B%5Ctext%7BDPO%7D%7D%20=%20-%5Cbeta%20%5Cmathbb%7BE%7D%5Cleft%5B%5Cunderbrace%7B%5Csigma%5Cleft(%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%20-%20%5Chat%7Br%7D_%5Ctheta(x,%20y_w)%5Cright)%7D_%7B%5Ctext%7Bweight%7D%7D%5Cleft(%5Cunderbrace%7B%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_w%7Cx)%7D_%7B%5Ctext%7Bincrease%20%7D%20y_w%7D%20-%20%5Cunderbrace%7B%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_l%7Cx)%7D_%7B%5Ctext%7Bdecrease%20%7D%20y_l%7D%5Cright)%5Cright%5D%7D%20%5Ctag%7BVI.I%7D%0A"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_w%7Cx)"> points in the direction that increases probability of the preferred response</li>
<li><img src="https://latex.codecogs.com/png.latex?-%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(y_l%7Cx)"> points in the direction that decreases probability of the dispreferred response</li>
<li>The weight term is high when <img src="https://latex.codecogs.com/png.latex?%5Chat%7Br%7D_%5Ctheta(x,%20y_l)%20%3E%20%5Chat%7Br%7D_%5Ctheta(x,%20y_w)">, i.e.&nbsp;when the model currently assigns higher implicit reward to the losing response than the winning response. In other words:
<ul>
<li><strong>When the model is wrong</strong> (ranks <img src="https://latex.codecogs.com/png.latex?y_l"> above <img src="https://latex.codecogs.com/png.latex?y_w">), we get large gradient updates</li>
<li><strong>When the model is right</strong> (ranks <img src="https://latex.codecogs.com/png.latex?y_w"> above <img src="https://latex.codecogs.com/png.latex?y_l">), we get small gradient updates</li>
</ul></li>
</ul>
<p>This dynamic sigmoid weighting is crucial. It naturally focuses learning on the examples the model currently gets wrong.</p>
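<p>A quick numerical sketch of this weighting, assuming scalar implicit rewards for illustration. It prints the weight for a wrong, a neutral and a correct ranking, and checks the derivative of the per-pair loss with respect to the margin <code>u</code> against a finite difference.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_weight(r_hat_w, r_hat_l):
    """Weight sigma(r_hat_l - r_hat_w) from equation (VI.I)."""
    return sigmoid(r_hat_l - r_hat_w)

# Margin r_hat_w - r_hat_l: negative means the model currently ranks the pair wrongly
for margin in (-2.0, 0.0, 2.0):
    print(margin, round(dpo_weight(margin, 0.0), 3))

# Check d/du [-log sigma(u)] = -sigma(-u) with a central finite difference
def pair_loss(u):
    return -math.log(sigmoid(u))

u, eps = 0.7, 1e-6
numeric = (pair_loss(u + eps) - pair_loss(u - eps)) / (2 * eps)
analytic = -sigmoid(-u)
print(numeric, analytic)
```

<p>The weights come out to roughly 0.881, 0.5 and 0.119, so a pair the model ranks wrongly contributes about seven times more gradient than one it already ranks correctly by the same margin.</p>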
</section>
</section>
<section id="vii-computing-log-probabilities-in-practice" class="level2">
<h2 class="anchored" data-anchor-id="vii-computing-log-probabilities-in-practice">VII: Computing Log Probabilities in Practice</h2>
<blockquote class="blockquote">
<p>This section is fully adapted from Umar Jamil’s video. I think it is essential to understand how log probabilities are computed in practice.</p>
</blockquote>
<p>The DPO loss requires computing <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta(y%7Cx)">, the log probability of a complete response <img src="https://latex.codecogs.com/png.latex?y"> given a prompt <img src="https://latex.codecogs.com/png.latex?x">. Let’s see how this works in practice with LLMs.</p>
<p>Language models are autoregressive: they generate text one token at a time, conditioning on all previous tokens. For a response <img src="https://latex.codecogs.com/png.latex?y%20=%20(y_1,%20y_2,%20%5Cldots,%20y_T)">, the probability factorizes as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_%5Ctheta(y%7Cx)%20=%20%5Cprod_%7Bt=1%7D%5E%7BT%7D%20%5Cpi_%5Ctheta(y_t%20%7C%20x,%20y_1,%20%5Cldots,%20y_%7Bt-1%7D)%20=%20%5Cprod_%7Bt=1%7D%5E%7BT%7D%20%5Cpi_%5Ctheta(y_t%20%7C%20x,%20y_%7B%3Ct%7D)%0A"></p>
<p>Taking the logarithm:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Clog%20%5Cpi_%5Ctheta(y%7Cx)%20=%20%5Csum_%7Bt=1%7D%5E%7BT%7D%20%5Clog%20%5Cpi_%5Ctheta(y_t%20%7C%20x,%20y_%7B%3Ct%7D)%7D%20%5Ctag%7BVII.I%7D%0A"></p>
<p><strong>The log probability of the full response is the sum of log probabilities at each position.</strong></p>
<p>Here’s how to compute <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta(y%7Cx)">:</p>
<ol type="1">
<li><p><strong>Prepare input</strong>: Concatenate the prompt and response into a single sequence</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Binput%7D%20=%20%5Bx_1,%20x_2,%20%5Cldots,%20x_n,%20y_1,%20y_2,%20%5Cldots,%20y_T%5D"></p></li>
<li><p><strong>Forward pass</strong>: Run the transformer to get hidden states at each position</p></li>
<li><p><strong>Project to logits</strong>: Apply the language model head (typically a linear layer) to get vocabulary logits at each position</p></li>
<li><p><strong>Log softmax</strong>: Convert logits to log probabilities over the vocabulary using <code>logsoftmax</code></p></li>
<li><p><strong>Gather relevant log probs</strong>: For each position <img src="https://latex.codecogs.com/png.latex?t"> in the response, extract the log probability of the actual next token <img src="https://latex.codecogs.com/png.latex?y_t"> (since we know the output)</p></li>
<li><p><strong>Sum with masking</strong>: Sum the log probabilities but only for response tokens (not prompt tokens)</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta(y%7Cx)%20=%20%5Csum_%7Bt%20%5Cin%20%5Ctext%7Bres%20pos%7D%7D%20%5Clog%20%5Cpi_%5Ctheta(y_t%20%7C%20x,%20y_%7B%3Ct%7D)"></p></li>
</ol>
<p>This gives us <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta(y%7Cx)"> for one response. We do this for both the preferred response <img src="https://latex.codecogs.com/png.latex?y_w"> and the dispreferred response <img src="https://latex.codecogs.com/png.latex?y_l"> and we also do it for both the policy model <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> and the frozen reference model <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D">. With these four log probabilities in hand, we can compute the DPO loss.</p>
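<p>Steps 4 to 6 can be sketched in plain Python as follows. The logits are assumed to be given as a list of per-position vocabulary scores (in practice they come from the model’s forward pass), and <code>sequence_logprob</code> is a hypothetical helper name. The sketch also makes the usual one-position shift explicit: in a causal language model, the logits at position <code>i</code> score the token at position <code>i + 1</code>.</p>

```python
import math

def log_softmax(logits):
    """Log-probabilities over the vocabulary for one position."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def sequence_logprob(logits, input_ids, prompt_len):
    """Sum of next-token log-probabilities over response tokens only.

    logits[i] are the vocabulary logits produced at position i, which
    score the token at position i + 1. Assumes prompt_len is at least 1.
    """
    total = 0.0
    # Positions prompt_len .. end hold response tokens; prompt tokens are masked out
    for pos in range(prompt_len, len(input_ids)):
        log_probs = log_softmax(logits[pos - 1])
        total += log_probs[input_ids[pos]]  # gather the actual next token
    return total

# Toy example (assumed values): vocabulary of 3 tokens, 1 prompt token, 2 response tokens
input_ids = [0, 2, 1]
logits = [
    [2.0, 0.5, 0.1],  # scores the token at position 1
    [0.3, 0.2, 1.5],  # scores the token at position 2
    [0.0, 0.0, 0.0],  # scores a token after the sequence; unused here
]
print(sequence_logprob(logits, input_ids, prompt_len=1))
```

<p>Running this once per (model, response) combination yields the four numbers that the DPO loss needs.</p>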
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Once you derive the DPO loss, you start appreciating the simplicity and elegance of the solution, especially when compared to PPO. The derivation hinges on one observation: the Bradley-Terry model only cares about reward differences, which causes the intractable partition function from the analytical solution to cancel out completely. What remains is a straightforward classification loss.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><strong>Papers:</strong>
<ul>
<li><a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization: Your Language Model is Secretly a Reward Model</a>: The original DPO paper</li>
<li><a href="https://arxiv.org/pdf/2203.02155">Training language models to follow instructions with human feedback</a>: The InstructGPT paper that established PPO-based RLHF</li>
</ul></li>
<li><strong>Videos:</strong>
<ul>
<li><a href="https://www.youtube.com/watch?v=hvGa5Mba4c8">Umar Jamil’s video on DPO</a>: Excellent walkthrough of the DPO derivation</li>
</ul></li>
</ul>


</section>

 ]]></description>
  <category>RL &amp; Alignment</category>
  <guid>https://garg-aayush.github.io/posts/2025-12-30-deriving-dpo-loss.html</guid>
  <pubDate>Tue, 30 Dec 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Deriving the PPO Loss from First Principles</title>
  <link>https://garg-aayush.github.io/posts/2025-12-25-deriving-ppo-loss.html</link>
  <description><![CDATA[ 




<p>I have been trying to wrap my head around reinforcement learning methods like <a href="https://arxiv.org/abs/2305.18290">DPO</a>, <a href="https://arxiv.org/pdf/2402.03300">GRPO</a>, and <a href="https://arxiv.org/abs/2506.14245">RLVR</a> for a while now, especially with all the recent work showing how effective they can be for LLM post-training. Since I am still pretty new to RL, I figured the best place to start was Proximal Policy Optimization (PPO), the algorithm OpenAI used to show how reinforcement learning could meaningfully improve LLM alignment (<a href="https://arxiv.org/pdf/2203.02155">InstructGPT</a> paper). My hope is that getting comfortable with PPO will give me the right mental model for the policy-gradient side of things and make it easier to understand the newer LLM-specific RL methods built on similar ideas.</p>
<p>If you start learning RL, you quickly realize it involves a lot of math! So I decided to lean into that and do a few (possibly annoying) derivation sessions to really understand the PPO objective by building it up from first principles, similar to how Umar Jamil does in his video.</p>
<blockquote class="blockquote">
<p>A huge shoutout to Umar Jamil’s <a href="https://www.youtube.com/watch?v=qGyFrqc34yc">video on RLHF and PPO</a>: it was incredibly helpful for building intuition and understanding the math behind the PPO loss.</p>
</blockquote>
<p>Below is my attempt at the derivation based on the original PPO and InstructGPT papers and Umar Jamil’s video.</p>
<section id="i-reinforcement-learning-core-definitions" class="level2">
<h2 class="anchored" data-anchor-id="i-reinforcement-learning-core-definitions">I: Reinforcement Learning: Core Definitions</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 9%">
<col style="width: 39%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Concept</th>
<th>General RL Definition</th>
<th>LLM Context (RLHF)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Reinforcement Learning</strong></td>
<td>A learning setup where an agent learns to act in an environment to maximize expected cumulative reward.</td>
<td>Fine-tuning a language model to generate responses that better match human preferences using reward-based feedback.</td>
</tr>
<tr class="even">
<td><strong>Environment</strong></td>
<td>Everything outside the agent that it interacts with and that produces observations and rewards.</td>
<td>The prompt distribution and interaction loop and the reward signal from a reward model evaluating generated responses.</td>
</tr>
<tr class="odd">
<td><strong>Agent</strong></td>
<td>The learner/decision-maker that observes states, takes actions, and receives rewards.</td>
<td>The language model generating text token by token.</td>
</tr>
<tr class="even">
<td><strong>Action (<img src="https://latex.codecogs.com/png.latex?a">)</strong></td>
<td>A choice made by the agent, usually conditioned on the state <img src="https://latex.codecogs.com/png.latex?s">.</td>
<td>Picking the next token at each step of generation.</td>
</tr>
<tr class="odd">
<td><strong>State (<img src="https://latex.codecogs.com/png.latex?s">)</strong></td>
<td>The information available to the agent at a given time step.</td>
<td>The prompt plus the response generated so far (the current token context).</td>
</tr>
<tr class="even">
<td><strong>Reward (<img src="https://latex.codecogs.com/png.latex?r">)</strong></td>
<td>A scalar signal telling the agent how good or bad an outcome was.</td>
<td>A score from the reward model (trained on preference data) that judges how good or bad a response is.</td>
</tr>
<tr class="odd">
<td><strong>Policy (<img src="https://latex.codecogs.com/png.latex?%5Cpi">)</strong></td>
<td>A stochastic mapping from states to a distribution over actions.</td>
<td>The model’s probability distribution over the next token given the context.</td>
</tr>
<tr class="even">
<td><strong>Goal</strong></td>
<td>Find an optimal policy <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*"> that maximizes expected cumulative reward over time.</td>
<td>Update (align) the model so it tends to generate responses with higher reward-model scores.</td>
</tr>
</tbody>
</table>
</section>
<section id="ii-reward-model-in-rlhf-for-llms" class="level2">
<h2 class="anchored" data-anchor-id="ii-reward-model-in-rlhf-for-llms">II: Reward Model in RLHF for LLMs</h2>
<p>A <strong>Reward Model (RM)</strong> is a neural network that takes a prompt <img src="https://latex.codecogs.com/png.latex?x"> and a response <img src="https://latex.codecogs.com/png.latex?y"> as input and outputs a <strong>scalar reward</strong> <img src="https://latex.codecogs.com/png.latex?r_%5Cphi(x,%20y)%20%5Cin%20%5Cmathbb%7BR%7D"> indicating how “good” or “aligned” that response is according to human preferences.</p>
<p>Policy-gradient methods (including PPO) require a scalar reward signal to update the policy parameters. In standard RL, the environment provides this signal. However, for language generation, there is no natural environment giving us rewards for “good” responses, and having humans rate every sampled output during training is impractical. Thus, we require a cheap, automated proxy for human preferences that can score any response on demand during RL training. A learned RM provides exactly this.</p>
<section id="how-is-the-reward-model-trained" class="level3">
<h3 class="anchored" data-anchor-id="how-is-the-reward-model-trained">How is the Reward Model Trained?</h3>
<p>The standard procedure for training the reward model is:</p>
<ol type="1">
<li>Sample prompts (<img src="https://latex.codecogs.com/png.latex?x">)</li>
<li>Generate multiple candidate completions (<img src="https://latex.codecogs.com/png.latex?y_1,%20y_2,%20%5Cldots,%20y_K">) from a baseline policy (often an SFT model).</li>
<li>Ask humans to <strong>compare</strong> candidates (pairwise preferences are easier than absolute scoring).</li>
<li>Train the RM (<img src="https://latex.codecogs.com/png.latex?r_%5Cphi">) to predict those preferences.</li>
</ol>
<p>Architecturally, the reward model is typically built as follows:</p>
<ul>
<li>It is initialized from a pretrained language model (often the SFT model itself).</li>
<li>The final unembedding layer (which projects to the vocabulary) is <strong>removed</strong>.</li>
<li>It is replaced with a <strong>linear layer</strong> that projects the hidden state of the last token to a single scalar output.</li>
</ul>
</section>
<section id="reward-model-loss-function" class="level3">
<h3 class="anchored" data-anchor-id="reward-model-loss-function">Reward Model Loss Function</h3>
<p>The reward model is trained using the <strong>Bradley-Terry model</strong> for pairwise comparisons. The probability that response <img src="https://latex.codecogs.com/png.latex?y_w"> (preferred) is preferred over <img src="https://latex.codecogs.com/png.latex?y_l"> (less preferred) for any prompt <img src="https://latex.codecogs.com/png.latex?x"> is modeled as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(y_w%20%5Csucc%20y_l%20%7C%20x)%20=%20%5Csigma%5Cleft(r_%5Cphi(x,%20y_w)%20-%20r_%5Cphi(x,%20y_l)%5Cright)%20%5Ctag%7BII.I%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Csigma"> is the sigmoid function: <img src="https://latex.codecogs.com/png.latex?%5Csigma(z)%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-z%7D%7D"></p>
<p>The <strong>negative log-likelihood loss</strong> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_%7B%5Ctext%7BRM%7D%7D(%5Cphi)%20=%20-%5Cmathbb%7BE%7D_%7B(x,%20y_w,%20y_l)%20%5Csim%20%5Cmathcal%7BD%7D%7D%20%5Cleft%5B%20%5Clog%20%5Csigma%5Cleft(r_%5Cphi(x,%20y_w)%20-%20r_%5Cphi(x,%20y_l)%5Cright)%20%5Cright%5D%0A"></p>
<p>One can verify that this loss forces the reward model to assign higher rewards to preferred responses (see <a href="https://arxiv.org/pdf/2203.02155">InstructGPT paper</a> or Umar Jamil’s video for a detailed walkthrough).</p>
<p>There are two key insights here:</p>
<ol type="1">
<li>We don’t need absolute scores; we only need the reward model to <strong>correctly rank</strong> responses.</li>
<li>The loss depends only on <strong>differences</strong> (<img src="https://latex.codecogs.com/png.latex?r_%5Cphi(x,%20y_w)%20-%20r_%5Cphi(x,%20y_l)">), so it is invariant to adding a constant to all rewards. This will be useful later when we discuss the PPO loss.</li>
</ol>
<p>The reward model serves as a <strong>learned proxy for human preferences</strong>, converting the intractable problem of getting human feedback on every generation into a tractable supervised learning problem. Once trained, it provides the scalar signal <img src="https://latex.codecogs.com/png.latex?r_%5Cphi(x,%20y)"> needed to optimize our policy (LLM) using RL algorithms like PPO.</p>
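<p>The per-pair reward model loss and its invariance to constant shifts can be checked with a tiny sketch. The reward values are made up, and <code>rm_pair_loss</code> is a hypothetical name; a real implementation would operate on batched model outputs.</p>

```python
import math

def rm_pair_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l) for one preference pair."""
    # -log sigma(z) = log(1 + exp(-z)); fine for moderate margins
    return math.log1p(math.exp(-(r_w - r_l)))

# Correct ranking (preferred response scored higher) gives a small loss
print(rm_pair_loss(2.0, 0.0))
# Wrong ranking gives a large loss
print(rm_pair_loss(0.0, 2.0))
# Adding the same constant to both rewards leaves the loss unchanged
print(rm_pair_loss(1.0, 0.5), rm_pair_loss(6.0, 5.5))
```

<p>Minimizing this loss only pushes the <em>gap</em> between preferred and less preferred rewards upward, so the absolute scale of the rewards is never pinned down.</p>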
</section>
</section>
<section id="iii-trajectories-and-returns" class="level2">
<h2 class="anchored" data-anchor-id="iii-trajectories-and-returns">III: Trajectories and Returns</h2>
<section id="trajectory" class="level3">
<h3 class="anchored" data-anchor-id="trajectory">Trajectory</h3>
<p>A <strong>trajectory</strong> (also called a rollout or episode) is a sequence of states (<img src="https://latex.codecogs.com/png.latex?s">), actions (<img src="https://latex.codecogs.com/png.latex?a">), and rewards (<img src="https://latex.codecogs.com/png.latex?r">) generated by an agent interacting with an environment:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctau%20=%20(s_0,%20a_0,%20r_0,%20s_1,%20a_1,%20r_1,%20%5Cldots,%20s_T,%20a_T,%20r_T)%0A"></p>
<p>In the context of LLMs, a trajectory corresponds to the entire sequence of token generations. It is the prompt followed by all generated tokens until the end-of-sequence token.</p>
<p>Note that state transitions are modeled stochastically, so <img src="https://latex.codecogs.com/png.latex?s_%7Bt+1%7D"> can be represented as <img src="https://latex.codecogs.com/png.latex?s_%7Bt+1%7D%20%5Csim%20P(s_%7Bt+1%7D%20%7C%20s_t,%20a_t)">. Given a stochastic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta%20(a_t%20%7C%20s_t)">, the probability of a trajectory <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is the product of:</p>
<ol type="1">
<li>The initial state distribution <img src="https://latex.codecogs.com/png.latex?%5Crho_0(s_0)"></li>
<li>The stochastic policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta%20(a_t%20%7C%20s_t)"></li>
<li>The environment transition dynamics <img src="https://latex.codecogs.com/png.latex?P(s_%7Bt+1%7D%20%7C%20s_t,%20a_t)"></li>
</ol>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(%5Ctau%20%7C%20%5Cpi_%5Ctheta)%20=%20%5Crho_0(s_0)%20%5Cprod_%7Bt=0%7D%5E%7BT-1%7D%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20P(s_%7Bt+1%7D%20%7C%20s_t,%20a_t)%20%5Ctag%7BIII.I%7D%0A"></p>
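<p>As an illustration of (III.I), here is a minimal sketch that evaluates the trajectory probability in a made-up two-state tabular environment. The state names, action names and all probabilities are assumptions chosen only to make the product easy to follow.</p>

```python
def trajectory_probability(states, actions, rho0, policy, transition):
    """P(tau | pi) = rho0(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t).

    states has one more entry than actions (s_0 .. s_T and a_0 .. a_{T-1}).
    """
    prob = rho0[states[0]]
    for t in range(len(actions)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        prob *= policy[s][a] * transition[(s, a)][s_next]
    return prob

# Toy environment (assumed values)
rho0 = {"A": 1.0, "B": 0.0}
policy = {
    "A": {"go": 0.8, "stay": 0.2},
    "B": {"go": 0.5, "stay": 0.5},
}
transition = {
    ("A", "go"): {"A": 0.1, "B": 0.9},
    ("A", "stay"): {"A": 1.0, "B": 0.0},
    ("B", "go"): {"A": 1.0, "B": 0.0},
    ("B", "stay"): {"A": 0.0, "B": 1.0},
}

# Trajectory A --go--> B --stay--> B: 1.0 * (0.8 * 0.9) * (0.5 * 1.0), about 0.36
p = trajectory_probability(["A", "B", "B"], ["go", "stay"], rho0, policy, transition)
print(p)
```

<p>In the LLM setting the transition is deterministic (the next state is simply the context with the chosen token appended), so every transition factor equals 1 and only the policy terms remain.</p>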
</section>
<section id="return" class="level3">
<h3 class="anchored" data-anchor-id="return">Return</h3>
<p>The <strong>return</strong> is the cumulative reward collected over the full trajectory (<img src="https://latex.codecogs.com/png.latex?%5Ctau">). The simplest form is the <strong>undiscounted return</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR(%5Ctau)%20=%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20r_t%0A"></p>
<p>More generally, we use the <strong>discounted return</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR(%5Ctau)%20=%20%5Csum_%7Bk=0%7D%5E%7B%5Cinfty%7D%20%5Cgamma%5Ek%20r_%7Bk%7D%20=%20r_0%20+%20%5Cgamma%20r_%7B1%7D%20+%20%5Cgamma%5E2%20r_%7B2%7D%20+%20%5Ccdots%20%5Ctag%7BIII.II%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Cin%20%5B0,%201%5D"> is the <strong>discount factor</strong>. The discount factor <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> serves a couple of purposes:</p>
<ol type="1">
<li>It ensures the return is finite for infinite-horizon tasks (<img src="https://latex.codecogs.com/png.latex?T%5Cto%5Cinfty">).</li>
<li>It prioritizes immediate rewards over distant ones.</li>
</ol>
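<p>Both forms of the return can be computed with one short function for a finite list of rewards. The function name <code>discounted_return</code> and the example rewards are illustrative assumptions.</p>

```python
def discounted_return(rewards, gamma=1.0):
    """R(tau) = sum_k gamma^k * r_k for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: R_k = r_k + gamma * R_{k+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards))             # undiscounted: 1 + 1 + 1
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25
```

<p>With <code>gamma=1.0</code> this reduces to the undiscounted sum. With a smaller discount factor, later rewards contribute progressively less.</p>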
</section>
</section>
<section id="iv-policy-gradient-optimization-and-reinforce-algorithm" class="level2">
<h2 class="anchored" data-anchor-id="iv-policy-gradient-optimization-and-reinforce-algorithm">IV: Policy Gradient Optimization and REINFORCE Algorithm</h2>
<p>The goal of reinforcement learning is to find a policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> that maximizes the <strong>expected return</strong> over all possible trajectories:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BJ(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5BR(%5Ctau)%5D%7D%20%5Ctag%7BIV.I%7D%0A"></p>
<p>This is our objective function and we want to find parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta%5E*"> such that:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta%5E*%20=%20%5Carg%5Cmax_%5Ctheta%20J(%5Ctheta)%0A"></p>
<p>To maximize <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> using gradient-based methods, we need to compute <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20J(%5Ctheta)"> and perform gradient ascent:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Ctheta_%7Bk+1%7D%20=%20%5Ctheta_k%20+%20%5Calpha%20%5Cleft.%20%5Cnabla_%5Ctheta%20J(%5Cpi_%5Ctheta)%20%5Cright%7C_%7B%5Ctheta_k%7D%7D%20%5Ctag%7BIV.II%7D%0A"></p>
<p>This policy gradient looks simple in equation form, but it is intractable to compute exactly. The expectation is over trajectories sampled from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta">, which itself depends on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, and we cannot enumerate all possible trajectories for any reasonably sized state-action space (and certainly not for LLMs!).</p>
<p>Thus, as a next step we need to derive some sort of reasonable and tractable approximation for <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20J(%5Ctheta)">. We do this by using the <strong>log-derivative trick</strong>.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cnabla_%5Ctheta%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5BR(%5Ctau)%5D%0A"></p>
<p>This expectation can be written as an integral: <img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cnabla_%5Ctheta%20%5Cint_%5Ctau%20P(%5Ctau%20%7C%20%5Ctheta)%20R(%5Ctau)%20%5C,%20d%5Ctau%0A"></p>
<p>Bringing the gradient inside the integral: <img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cint_%5Ctau%20%5Cnabla_%5Ctheta%20P(%5Ctau%20%7C%20%5Ctheta)%20R(%5Ctau)%20%5C,%20d%5Ctau%0A"></p>
<p>Now we apply the <strong>log-derivative trick</strong>: <img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)%20=%20%5Cfrac%7B%5Cnabla_%5Ctheta%20P(%5Ctau%20%7C%20%5Ctheta)%7D%7BP(%5Ctau%20%7C%20%5Ctheta)%7D%0A"></p>
<p>Rearranging: <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20P(%5Ctau%20%7C%20%5Ctheta)%20=%20P(%5Ctau%20%7C%20%5Ctheta)%20%5Cnabla_%5Ctheta%20%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)"> and substituting back, we get: <img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cint_%5Ctau%20P(%5Ctau%20%7C%20%5Ctheta)%20%5Cnabla_%5Ctheta%20%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)%20R(%5Ctau)%20%5C,%20d%5Ctau%0A"></p>
<p>which can also be written as the following expectation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%20%5Cnabla_%5Ctheta%20%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)%20%5Ccdot%20R(%5Ctau)%20%5Cright%5D%7D%20%5Ctag%7BIV.III%7D%0A"></p>
<p>Note that the gradient is now an expectation of the gradient of the trajectory’s log-probability, weighted by the trajectory’s return. This can be simplified further using the trajectory probability expression (III.I): <img src="https://latex.codecogs.com/png.latex?%0AP(%5Ctau%20%7C%20%5Ctheta)%20=%20%5Crho_0(s_0)%20%5Cprod_%7Bt=0%7D%5E%7BT-1%7D%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20P(s_%7Bt+1%7D%20%7C%20s_t,%20a_t)%0A"></p>
<p>Taking the log:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)%20=%20%5Clog%20%5Crho_0(s_0)%20+%20%5Csum_%7Bt=0%7D%5E%7BT-1%7D%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20+%20%5Csum_%7Bt=0%7D%5E%7BT-1%7D%20%5Clog%20P(s_%7Bt+1%7D%20%7C%20s_t,%20a_t)%0A"></p>
<p>When we take <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta">, only the policy term depends on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20%5Clog%20P(%5Ctau%20%7C%20%5Ctheta)%20=%20%5Csum_%7Bt=0%7D%5E%7BT-1%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%0A"></p>
<p>The initial state distribution and transition dynamics are independent of <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, so their gradients vanish. Substituting back, we obtain the <strong>policy gradient theorem</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20R(%5Ctau)%20%5Cright%5D%7D%20%5Ctag%7BIV.IV%7D%0A"></p>
<p>This is a remarkable result. We can compute the gradient of our objective without differentiating through the environment dynamics and <strong>only need gradients of the log-probabilities of our policy</strong>.</p>
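<p>As a sanity check, the identity can be verified numerically on a tiny problem. The sketch below is an illustration under stated assumptions, not part of the derivation: it assumes a one-step "trajectory" (a single action), a softmax policy over three actions, and made-up rewards. It compares the score-function form of the gradient, computed by exact enumeration, against finite differences of <code>J</code>.</p>

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(logits, rewards):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step toy problem."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """E_{a ~ pi}[grad log pi(a) * R(a)], computed by exact enumeration.

    For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            indicator = 1.0 if a == k else 0.0
            grad[k] += p_a * (indicator - probs[k]) * r_a
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central finite differences of J(theta), used only as a reference."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]   # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]   # hypothetical return of each one-step trajectory
pg = score_function_gradient(logits, rewards)
fd = finite_difference_gradient(logits, rewards)
print(pg)
print(fd)
```

<p>The two printed vectors should agree up to finite-difference error, even though the score-function version never differentiates the reward itself.</p>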
<p>Since we cannot compute the expectation exactly, we approximate it with a sample mean by sampling <img src="https://latex.codecogs.com/png.latex?N"> trajectories:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20%5Capprox%20%5Chat%7Bg%7D%20=%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Cleft(%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%20%5Cright)%20R(%5Ctau_i)%7D%20%5Ctag%7BIV.V%7D%0A"></p>
<p>This gives us the <strong>REINFORCE algorithm</strong>:</p>
<ol type="1">
<li><p><strong>Initialize</strong>: Start with a pretrained or supervised fine-tuned (SFT) language model <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"></p></li>
<li><p><strong>Sample prompts</strong>: Draw a batch of <img src="https://latex.codecogs.com/png.latex?N"> prompts <img src="https://latex.codecogs.com/png.latex?%5C%7Bx_1,%20x_2,%20%5Cldots,%20x_N%5C%7D"> from a dataset</p></li>
<li><p><strong>Generate trajectories</strong>: For each prompt <img src="https://latex.codecogs.com/png.latex?x_i">, generate a response <img src="https://latex.codecogs.com/png.latex?y_i%20=%20(a_0,%20a_1,%20%5Cldots,%20a_T)"> by sampling tokens from the policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta">. Each trajectory is the sequence of states (prompt + generated tokens so far) and actions (selected tokens).</p></li>
<li><p><strong>Compute log-probabilities</strong>: For each trajectory, compute the log-probability of each generated token given its context: <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Cquad%20%5Ctext%7Bfor%20%7D%20t%20=%200,%201,%20%5Cldots,%20T"></p></li>
<li><p><strong>Compute rewards</strong>: Score each complete (prompt, response) pair using the reward model: <img src="https://latex.codecogs.com/png.latex?R(%5Ctau_i)%20=%20r_%5Cphi(x_i,%20y_i)"></p></li>
<li><p><strong>Estimate policy gradient</strong>: Compute the gradient estimate using (IV.V): <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bg%7D%20=%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Cleft(%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%20%5Cright)%20R(%5Ctau_i)"></p></li>
<li><p><strong>Update policy</strong>: Perform a gradient ascent step: <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cleftarrow%20%5Ctheta%20+%20%5Calpha%20%5Chat%7Bg%7D"></p></li>
<li><p><strong>Repeat</strong>: Go back to Step 2 and iterate until convergence</p></li>
</ol>
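<p>The steps above can be sketched on a toy problem. This is a minimal illustration, not an LLM training loop: it assumes a one-step setting where each "trajectory" is a single action, a softmax policy parameterized directly by logits, and a fixed hypothetical reward per action standing in for the reward model. The gradient of the log-probability is written out by hand so that no autodiff library is needed.</p>

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(rewards, num_iters=200, batch_size=64, lr=0.5, seed=0):
    """REINFORCE on a one-step toy problem with a softmax policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)                  # Step 1: initialize the policy
    actions = list(range(len(rewards)))
    for _ in range(num_iters):                    # Step 8: repeat
        probs = softmax(theta)
        grad = [0.0] * len(theta)
        for _ in range(batch_size):               # Steps 2-3: sample trajectories
            a = rng.choices(actions, weights=probs)[0]
            ret = rewards[a]                      # Step 5: reward R(tau_i)
            for k in range(len(theta)):           # Steps 4 and 6: grad log pi * R
                indicator = 1.0 if a == k else 0.0
                grad[k] += (indicator - probs[k]) * ret / batch_size
        # Step 7: gradient ascent on the estimated policy gradient
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

# Hypothetical rewards: only the last action is rewarded.
final_probs = reinforce(rewards=[0.0, 0.0, 1.0])
print(final_probs)
```

<p>With these settings the policy should concentrate most of its probability on the rewarded action. Note that every iteration draws a fresh batch from the current policy, which is exactly the on-policy cost discussed next.</p>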
<p>While REINFORCE provides an unbiased gradient estimate, it suffers from two critical issues that make it impractical for LLM training:</p>
<ol type="1">
<li><p><strong>High Variance</strong>: The gradient estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bg%7D"> depends heavily on which trajectories happen to be sampled. This variance can be large, leading to noisy gradients and unstable training.</p>
<blockquote class="blockquote"><p>If you look again at (IV.V), the gradient estimate for each action is weighted by the return of the <em>entire</em> trajectory <img src="https://latex.codecogs.com/png.latex?R(%5Ctau)">. This means that even if an action was good, it might receive a negative gradient update simply because other actions in the trajectory led to poor outcomes (or vice versa). Over many samples, the noise introduced by this coupling can be substantial, leading to high variance.</p></blockquote></li>
<li><p><strong>On-Policy Constraint (Sample Inefficiency)</strong>: REINFORCE requires trajectories sampled from the <em>current</em> policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta">. Thus after every gradient update, previously collected trajectories must be discarded and new ones need to be sampled from the updated policy. For LLMs, where each trajectory requires a full forward pass through a billion(s)-parameter model, this is prohibitively expensive especially when we need many small gradient steps to train effectively.</p></li>
</ol>
</section>
<section id="v-reducing-variance-and-the-advantage-function" class="level2">
<h2 class="anchored" data-anchor-id="v-reducing-variance-and-the-advantage-function">V: Reducing Variance and the Advantage Function</h2>
<p>The REINFORCE gradient estimate (IV.V) is unbiased, but it suffers from <strong>high variance</strong>.</p>
<section id="replacing-full-trajectory-return-with-reward-to-go-using-causality" class="level3">
<h3 class="anchored" data-anchor-id="replacing-full-trajectory-return-with-reward-to-go-using-causality">Replacing Full-Trajectory Return with Reward-to-Go (using causality)</h3>
<p>A first variance reduction comes from noticing that action <img src="https://latex.codecogs.com/png.latex?a_t"> taken at time <img src="https://latex.codecogs.com/png.latex?t"> <strong>cannot influence rewards that were received before time <img src="https://latex.codecogs.com/png.latex?t"></strong>. This is a direct consequence of causality. These past reward terms add variance to the gradient estimate without contributing any signal, so we can remove them and keep only the <strong>rewards-to-go</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7BR%7D_t%20=%20%5Csum_%7Bt'=t%7D%5E%7BT%7D%20r_%7Bt'%7D%0A"></p>
<p>This gives us a lower-variance estimator:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20%5Capprox%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%20%5Ccdot%20%5Chat%7BR%7D_%7Bi,t%7D%7D%20%5Ctag%7BV.I%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Chat%7BR%7D_%7Bi,t%7D%20=%20%5Csum_%7Bt'=t%7D%5E%7BT%7D%20r_%7Bi,t'%7D"> is the rewards-to-go for trajectory <img src="https://latex.codecogs.com/png.latex?i"> starting from time <img src="https://latex.codecogs.com/png.latex?t">.</p>
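<p>Computing the rewards-to-go is a reverse cumulative sum over the per-step rewards. A minimal sketch (the function name and the reward values are made up for illustration):</p>

```python
def rewards_to_go(rewards):
    """Return [R_hat_0, ..., R_hat_T], where R_hat_t sums the rewards from t onward."""
    result = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        result[t] = running
    return result

per_step_rewards = [1.0, 0.0, 2.0, -1.0]
print(rewards_to_go(per_step_rewards))        # [2.0, 1.0, 1.0, -1.0]

# If only the final step carries a reward (an assumption here),
# every step gets the same rewards-to-go.
print(rewards_to_go([0.0, 0.0, 0.0, 0.7]))    # [0.7, 0.7, 0.7, 0.7]
```

<p>The second call shows a caveat: when the only reward arrives at the end of the sequence, the rewards-to-go equals the full return at every step, so this trick alone changes nothing in that case.</p>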
</section>
<section id="subtracting-a-baseline" class="level3">
<h3 class="anchored" data-anchor-id="subtracting-a-baseline">Subtracting a Baseline</h3>
<p>A second complementary technique for variance reduction is to subtract a <strong>baseline</strong> <img src="https://latex.codecogs.com/png.latex?b(s_t)"> from the rewards. The key insight is that we can subtract <strong>any function that does not depend on the action</strong> from our reward signal without changing the expected value of the gradient.</p>
<p>Thus we can subtract a state-dependent baseline <img src="https://latex.codecogs.com/png.latex?b(s_t)"> from our rewards-to-go to yield an <strong>unbiased</strong> gradient estimator:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20%5Capprox%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%20%5Ccdot%20%5Cleft(%5Chat%7BR%7D_%7Bi,t%7D%20-%20b(s_%7Bi,t%7D)%5Cright)%7D%20%5Ctag%7BV.II%7D%0A"></p>
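<p>The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a single state, a softmax policy over three actions, and hypothetical rewards that are all large and positive. It enumerates every action to compute the exact mean and total variance of the one-sample gradient estimator, with and without a baseline equal to the expected reward.</p>

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_mean_and_variance(logits, rewards, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (R(a) - baseline)."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        weight = rewards[a] - baseline
        # For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
        g = [((1.0 if a == k else 0.0) - probs[k]) * weight for k in range(n)]
        for k in range(n):
            mean[k] += probs[a] * g[k]
        second_moment += probs[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

logits = [0.0, 0.0, 0.0]        # uniform policy (hypothetical)
rewards = [10.0, 11.0, 12.0]    # hypothetical rewards
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = gradient_mean_and_variance(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_mean_and_variance(logits, rewards, baseline=expected_reward)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

<p>Both runs give the same mean gradient, while the variance drops sharply once the baseline is subtracted. The baseline removes the large constant offset in the rewards, which carries no information about which action is better.</p>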
</section>
<section id="value-functions-vpis-and-qpis-a" class="level3">
<h3 class="anchored" data-anchor-id="value-functions-vpis-and-qpis-a">Value Functions: <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"> and <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%20a)"></h3>
<p>So far the baseline is an arbitrary function. To choose it in a principled way, we use two fundamental functions from RL theory.</p>
<p><strong>State Value Function:</strong> The <strong>state value function</strong> <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"> is the expected return when the agent is in state <img src="https://latex.codecogs.com/png.latex?s"> and acts according to policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AV%5E%5Cpi(s)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi%7D%5Cleft%5B%5Csum_%7Bt=0%7D%5E%7B%5Cinfty%7D%20%5Cgamma%5Et%20r_t%20%5C;%5Cmiddle%7C%5C;%20s_0%20=%20s%5Cright%5D%20%20"></p>
<p>Intuitively, <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"> tells <strong>“How good is this state on average?”</strong> and is used as a baseline <img src="https://latex.codecogs.com/png.latex?b(s)%20=%20V%5E%5Cpi(s)">.</p>
<p><strong>Action Value Function (Q-function):</strong> The <strong>action value function</strong> <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%20a)"> is the expected return when starting in state <img src="https://latex.codecogs.com/png.latex?s"> and taking action <img src="https://latex.codecogs.com/png.latex?a"> and then acting according to policy <img src="https://latex.codecogs.com/png.latex?%5Cpi">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AQ%5E%5Cpi(s,%20a)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi%7D%5Cleft%5B%5Csum_%7Bt=0%7D%5E%7B%5Cinfty%7D%20%5Cgamma%5Et%20r_t%20%5C;%5Cmiddle%7C%5C;%20s_0%20=%20s,%20a_0%20=%20a%5Cright%5D%0A"></p>
<p>Intuitively, <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%20a)"> tells <strong>“How good is this specific action in this state?”</strong> The rewards-to-go of a sampled trajectory is a single-sample estimate of <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%20a)">.</p>
<p>In the LLM context:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)"> estimates the expected reward for a given prompt + partial response, assuming the model continues generating according to its current policy.</li>
<li><img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s,%20a)"> estimates the expected reward if, from the current prompt + partial response, the model generates a specific next token <img src="https://latex.codecogs.com/png.latex?a"> and then continues according to its policy.</li>
</ul>
</section>
<section id="advantage-function" class="level3">
<h3 class="anchored" data-anchor-id="advantage-function">Advantage Function</h3>
<p>The <strong>advantage function</strong> <img src="https://latex.codecogs.com/png.latex?A%5E%5Cpi(s,%20a)"> measures how much better (or worse) a specific action <img src="https://latex.codecogs.com/png.latex?a"> is compared to the average action under the policy:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BA%5E%5Cpi(s,%20a)%20=%20Q%5E%5Cpi(s,%20a)%20-%20V%5E%5Cpi(s)%7D%20%5Ctag%7BV.III%7D%0A"></p>
<p>The advantage function directly tells us: <strong>“How much better is this particular action compared to what we would typically do in this state?”</strong> This is precisely the signal we want for policy improvement. We want to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage.</p>
<blockquote class="blockquote">
<p>From Umar Jamil’s video:<br>
In the LLM context, consider a state where the prompt is “Where is Shanghai?” and the model has generated “Shanghai is”. From this state:</p>
<ul>
<li>If the model samples the token “in” (leading toward “Shanghai is in China”), this action likely has <strong>positive advantage</strong>. This is because it is better than the average token the model might produce.</li>
<li>If the model samples the token “delicious” (leading toward an incoherent response), this action likely has <strong>negative advantage</strong>. This is because it is worse than the average token the model might produce.</li>
</ul>
</blockquote>
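<p>To make the Shanghai example concrete, here is a small sketch with made-up numbers. The three-token vocabulary, the probabilities and the Q-values are all hypothetical. It uses the standard relation that the state value is the policy-weighted average of the action values, and then applies (V.III).</p>

```python
# Hypothetical next-token probabilities pi(a | s) after "Shanghai is", and
# hypothetical Q-values (expected final reward if that token is chosen).
policy = {"in": 0.70, "a": 0.25, "delicious": 0.05}
q_values = {"in": 0.90, "a": 0.60, "delicious": 0.05}

# V(s) is the policy-weighted average of Q(s, a).
state_value = sum(policy[tok] * q_values[tok] for tok in policy)

# A(s, a) = Q(s, a) - V(s), equation (V.III).
advantages = {tok: q_values[tok] - state_value for tok in policy}

for tok, adv in advantages.items():
    print(tok, round(adv, 4))

# The advantage averages to zero under the policy itself.
mean_advantage = sum(policy[tok] * advantages[tok] for tok in policy)
print(mean_advantage)
```

<p>Because the advantage averages to zero under the policy, it only says which tokens are better or worse than what the model would typically produce, not how good the state is overall.</p>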
</section>
<section id="advantage-weighted-policy-gradient" class="level3">
<h3 class="anchored" data-anchor-id="advantage-weighted-policy-gradient">Advantage-Weighted Policy Gradient</h3>
<p>Substituting the rewards-to-go and the value function as a baseline, we get the following form of the policy gradient: <img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20(Q%5E%5Cpi(s_t,%20a_t)%20-%20V%5E%5Cpi(s_t))%5Cright%5D%0A"></p>
<p>which can be written as: <img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20A%5E%7B%5Cpi_%5Ctheta%7D(s_t,%20a_t)%5Cright%5D%7D%20%5Ctag%7BV.IV%7D%0A"></p>
<p>and for sample-based approximation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cnabla_%5Ctheta%20J(%5Ctheta)%20%5Capprox%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%20%5Ccdot%20%5Chat%7BA%7D_%7Bi,t%7D%7D%20%5Ctag%7BV.V%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_%7Bi,t%7D"> is an estimate of the advantage function at time <img src="https://latex.codecogs.com/png.latex?t"> in trajectory <img src="https://latex.codecogs.com/png.latex?i">. This is the form of the policy gradient often used.</p>
<p>In practice, <img src="https://latex.codecogs.com/png.latex?A%5E%5Cpi(s_t,%20a_t)"> can be estimated as follows:</p>
<ol type="1">
<li><p><strong>Learn a value function:</strong> Train a neural network <img src="https://latex.codecogs.com/png.latex?V_%5Cphi(s)"> (often called the “critic” or “value head”) to approximate <img src="https://latex.codecogs.com/png.latex?V%5E%5Cpi(s)">. In LLM fine-tuning, this is often a linear layer on top of the same transformer backbone used for the policy.</p></li>
<li><p><strong>Estimate <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi"> from samples:</strong> Given a trajectory, the (discounted) rewards-to-go <img src="https://latex.codecogs.com/png.latex?%5Chat%7BR%7D_t%20=%20%5Csum_%7Bt'=t%7D%5E%7BT%7D%20%5Cgamma%5E%7Bt'-t%7D%20r_%7Bt'%7D"> provides an unbiased (but high-variance) estimate of <img src="https://latex.codecogs.com/png.latex?Q%5E%5Cpi(s_t,%20a_t)">.</p></li>
<li><p><strong>Compute advantage estimates:</strong> <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20=%20%5Chat%7BR%7D_t%20-%20V_%5Cphi(s_t)"></p></li>
</ol>
<p>More sophisticated methods like <strong>Generalized Advantage Estimation (GAE)</strong> interpolate between high-variance, low-bias estimates and low-variance, high-bias estimates by using a weighted combination of multi-step returns. See the <a href="https://arxiv.org/abs/1506.02438">GAE paper</a> for more details.</p>
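<p>A minimal sketch of GAE for a single finished trajectory is shown below. It follows the recursion from the GAE paper, where the TD residual is <code>delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)</code> and the advantage is a sum of residuals discounted by <code>gamma * lam</code>. The reward and value numbers are hypothetical, and the value after the last step is assumed to be zero.</p>

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished trajectory.

    rewards[t] is r_t and values[t] is V_phi(s_t); the value after the
    final step is taken to be 0 (the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 != len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]   # hypothetical: reward only at the last token
values = [0.4, 0.5, 0.7]    # hypothetical critic outputs V_phi(s_t)

adv_mc = gae_advantages(rewards, values, lam=1.0)   # rewards-to-go minus V_phi(s_t)
adv_td = gae_advantages(rewards, values, lam=0.0)   # one-step TD residuals
print(adv_mc)
print(adv_td)
```

<p>Setting <code>lam=1.0</code> recovers the simple estimate from step 3 above (rewards-to-go minus the value), while <code>lam=0.0</code> relies entirely on the critic through one-step residuals. Values in between trade variance against bias.</p>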
</section>
</section>
<section id="vi-importance-sampling-and-off-policy-policy-gradients" class="level2">
<h2 class="anchored" data-anchor-id="vi-importance-sampling-and-off-policy-policy-gradients">VI: Importance Sampling and Off-Policy Policy Gradients</h2>
<blockquote class="blockquote">
<p><strong>Note:</strong> In RL literature, “off-policy” typically refers to methods where the <em>behavior policy</em> (generating data) can be arbitrarily different from the <em>target policy</em> (being optimized), for example when transitions from policies thousands of updates old are reused. What we call “off-policy” in this section is more precisely “local off-policy”: the data comes from a recent snapshot of the policy being optimized.</p>
</blockquote>
<p>The advantage-weighted policy gradient (V.IV) requires trajectories sampled from the current policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta">. This creates a fundamental <strong>inefficiency</strong>: after each gradient update <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cto%20%5Ctheta'">, all previously collected trajectories become “stale”, so we must discard them and sample new ones from the updated policy.</p>
<p>For LLMs, where each trajectory requires a full forward pass through a billion(s)-parameter model, this is prohibitively expensive, especially when we need many small gradient steps to train effectively.</p>
<p>We need a way to reuse the same trajectories for multiple gradient updates. <strong>Importance sampling</strong> provides the mathematical machinery to do exactly this!</p>
<section id="importance-sampling" class="level3">
<h3 class="anchored" data-anchor-id="importance-sampling">Importance Sampling</h3>
<p>Importance sampling is a technique for estimating expectations under one probability distribution using samples drawn from a different distribution. Consider an expectation under a distribution <img src="https://latex.codecogs.com/png.latex?p(x)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20p%7D%5Bf(x)%5D%20=%20%5Cint%20p(x)%20f(x)%20%5C,%20dx%0A"></p>
<p>We can rewrite this by multiplying and dividing by another distribution <img src="https://latex.codecogs.com/png.latex?q(x)"> (with <img src="https://latex.codecogs.com/png.latex?q(x)%20%3E%200"> wherever <img src="https://latex.codecogs.com/png.latex?p(x)%20%3E%200">):</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cint%20q(x)%20%5Cfrac%7Bp(x)%7D%7Bq(x)%7D%20f(x)%20%5C,%20dx%20=%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20q%7D%5Cleft%5B%5Cfrac%7Bp(x)%7D%7Bq(x)%7D%20f(x)%5Cright%5D%0A"></p>
<p>The ratio <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bp(x)%7D%7Bq(x)%7D"> is called the <strong>importance weight</strong>. This identity tells us:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20p%7D%5Bf(x)%5D%20=%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20q%7D%5Cleft%5B%5Cfrac%7Bp(x)%7D%7Bq(x)%7D%20f(x)%5Cright%5D%7D%20%5Ctag%7BVI.I%7D%0A"></p>
<p>We can now estimate the expectation under <img src="https://latex.codecogs.com/png.latex?p"> using samples from <img src="https://latex.codecogs.com/png.latex?q"> as long as we reweight each sample by the ratio of probabilities.</p>
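<p>The identity (VI.I) is easy to check on a small discrete example. In the sketch below the two distributions and the function <code>f</code> are made up for illustration. It first verifies the identity by exact enumeration, and then forms a Monte Carlo estimate that only ever samples from <code>q</code>.</p>

```python
import random

def f(x):
    return x * x

outcomes = [0, 1, 2, 3]
p = [0.1, 0.2, 0.3, 0.4]        # target distribution (hypothetical numbers)
q = [0.25, 0.25, 0.25, 0.25]    # sampling distribution, positive wherever p is

# Left-hand side of (VI.I): expectation under p, by direct enumeration.
exact_under_p = sum(p_x * f(x) for x, p_x in zip(outcomes, p))

# Right-hand side of (VI.I): expectation under q of the reweighted function.
exact_reweighted = sum(
    q_x * (p_x / q_x) * f(x) for x, p_x, q_x in zip(outcomes, p, q)
)

# Monte Carlo version: draw from q only, reweight each sample by p(x) / q(x).
rng = random.Random(0)
samples = rng.choices(outcomes, weights=q, k=20000)
estimate = sum((p[x] / q[x]) * f(x) for x in samples) / len(samples)

print(exact_under_p, exact_reweighted, estimate)
```

<p>The sample estimate is unbiased, but its variance grows as <code>p</code> and <code>q</code> move apart, because a few samples end up carrying very large importance weights.</p>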
</section>
<section id="applying-importance-sampling-to-policy-gradients" class="level3">
<h3 class="anchored" data-anchor-id="applying-importance-sampling-to-policy-gradients">Applying Importance Sampling to Policy Gradients</h3>
<p>We can apply this technique to the policy gradient setting. The on-policy advantage-weighted gradient (V.IV) is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20A%5E%7B%5Cpi_%5Ctheta%7D(s_t,%20a_t)%5Cright%5D%0A"></p>
<p>To apply importance sampling, we work at time-step level rather than trajectory level (full trajectory importance weights have extremely high variance). For a single timestep: <img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B(s_t,%20a_t)%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20A%5E%7B%5Cpi_%5Ctheta%7D(s_t,%20a_t)%5Cright%5D%0A"></p>
<p>Using importance sampling with samples from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A=%20%5Cmathbb%7BE%7D_%7B(s_t,%20a_t)%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D%5Cleft%5B%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%20%5Ccdot%20A%5E%7B%5Cpi_%5Ctheta%7D(s_t,%20a_t)%5Cright%5D%0A"></p>
<p>Now we apply the log-derivative identity <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta%20=%20%5Cfrac%7B%5Cnabla_%5Ctheta%20%5Cpi_%5Ctheta%7D%7B%5Cpi_%5Ctheta%7D"> and approximate <img src="https://latex.codecogs.com/png.latex?A%5E%7B%5Cpi_%5Ctheta%7D"> by <img src="https://latex.codecogs.com/png.latex?A%5E%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D">, which is reasonable as long as the two policies stay close. This gives us a <strong>surrogate objective <img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)"></strong> whose gradient equals this importance-weighted policy gradient:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cnabla_%5Ctheta%20J(%5Ctheta)%20%5Capprox%20%5Cmathbb%7BE%7D_%7B(s_t,%20a_t)%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D%5Cleft%5B%5Cfrac%7B%5Cnabla_%5Ctheta%20%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20A%5E%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D(s_t,%20a_t)%5Cright%5D%0A"></p>
<p>where the importance-weighted surrogate objective, also known as the <strong>Conservative Policy Iteration (CPI)</strong> objective, is: <img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B(s_t,%20a_t)%20%5Csim%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D%5Cleft%5B%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20A%5E%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D%7D(s_t,%20a_t)%5Cright%5D%0A"></p>
<p>We also define the <strong>probability ratio</strong> as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ar_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20%5Ctag%7BVI.II%7D%0A"></p>
<p>Note that <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta_%7B%5Ctext%7Bold%7D%7D)%20=%201"> by construction. Thus, the CPI objective can be written as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BL%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20%5Chat%7BA%7D_t%5Cright%5D%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5Br_t(%5Ctheta)%20%5Chat%7BA%7D_t%5Cright%5D%7D%20%5Ctag%7BVI.III%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t"> is the estimated advantage at timestep <img src="https://latex.codecogs.com/png.latex?t">, and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_t%5B%5Ccdot%5D"> denotes the empirical average over a batch of samples collected under <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D">.</p>
<p>This objective has a clear interpretation:</p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20%3E%200"> (action better than average), we want to <strong>increase</strong> <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">, i.e., make the new policy more likely to take this action.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20%3C%200"> (action worse than average), we want to <strong>decrease</strong> <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">, i.e., make the new policy less likely to take this action.</li>
</ul>
<p>The corresponding sample-based approximation is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BL%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)%20%5Capprox%20%5Cfrac%7B1%7D%7BN%7D%20%5Csum_%7Bi=1%7D%5E%7BN%7D%20%5Csum_%7Bt=0%7D%5E%7BT%7D%20%5Cfrac%7B%5Cpi_%5Ctheta(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)%7D%20%5Chat%7BA%7D_%7Bi,t%7D%7D%20%5Ctag%7BVI.IV%7D%0A"></p>
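<p>In code, the probability ratio is usually formed from stored log-probabilities rather than raw probabilities. The sketch below computes the sample estimate (VI.IV) this way. The function name, the nested-list data layout and all numbers are assumptions made for illustration.</p>

```python
import math

def cpi_objective(new_logprobs, old_logprobs, advantages):
    """Sample estimate of L^CPI as in (VI.IV).

    Each argument is a list of trajectories; each trajectory is a list with
    one entry per timestep (token).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        for lp_new, lp_old, a_hat in zip(new_lp, old_lp, adv):
            ratio = math.exp(lp_new - lp_old)   # r_t(theta), equation (VI.II)
            total += ratio * a_hat
    return total / len(new_logprobs)            # average over the N trajectories

# Two hypothetical trajectories of two tokens each.
old_logprobs = [[-1.0, -2.0], [-0.5, -1.5]]
advantages = [[0.5, -0.2], [1.0, 0.3]]

# At theta = theta_old every ratio is 1, so the objective reduces to the
# summed advantages averaged over trajectories.
at_old = cpi_objective(old_logprobs, old_logprobs, advantages)

# Raising the log-probability of a positive-advantage token raises the objective.
new_logprobs = [[-0.8, -2.0], [-0.5, -1.5]]
after_update = cpi_objective(new_logprobs, old_logprobs, advantages)
print(at_old, after_update)
```

<p>Working with the difference of log-probabilities avoids dividing two very small numbers, which matters when individual token probabilities are tiny.</p>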
</section>
<section id="off-policy-learning-reusing-trajectories" class="level3">
<h3 class="anchored" data-anchor-id="off-policy-learning-reusing-trajectories">Off-Policy Learning: Reusing Trajectories</h3>
<p>The CPI objective enables <strong>off-policy learning</strong>: we can sample trajectories from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D">, store them and then perform multiple gradient updates on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> using the same batch of data. The typical workflow becomes:</p>
<ol type="1">
<li><strong>Collect</strong>: Sample trajectories <img src="https://latex.codecogs.com/png.latex?%5C%7B%5Ctau_i%5C%7D"> from the current policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"></li>
<li><strong>Compute</strong>: Calculate advantages <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_%7Bi,t%7D"> and log-probabilities <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_%7Bi,t%7D%20%7C%20s_%7Bi,t%7D)"></li>
<li><strong>Store</strong>: Save the trajectories along with their advantages and old log-probabilities</li>
<li><strong>Optimize</strong>: Perform multiple gradient ascent steps on <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)"> using mini-batches from the stored data</li>
<li><strong>Repeat</strong>: Set <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7B%5Ctext%7Bold%7D%7D%20%5Cleftarrow%20%5Ctheta"> and return to step 1</li>
</ol>
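<p>The workflow above can be sketched end to end on a toy problem. This is an illustration under several assumptions, not an LLM implementation: each "trajectory" is a single action, the policy is a softmax over logits, rewards are fixed hypothetical numbers, the advantage is the reward minus the batch mean (no learned critic), and each optimization step uses the full batch rather than mini-batches.</p>

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(rewards, outer_iters=30, batch_size=256, epochs=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    arms = list(range(len(rewards)))
    for _ in range(outer_iters):
        # 1. Collect: sample actions from the frozen policy pi_theta_old.
        old_probs = softmax(theta)
        batch = rng.choices(arms, weights=old_probs, k=batch_size)
        # 2-3. Compute and store advantages and old log-probabilities.
        batch_rewards = [rewards[a] for a in batch]
        mean_reward = sum(batch_rewards) / batch_size
        advantages = [r - mean_reward for r in batch_rewards]
        old_logprobs = [math.log(old_probs[a]) for a in batch]
        # 4. Optimize: several ascent steps on L^CPI using the same batch.
        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0] * len(theta)
            for a, adv, old_lp in zip(batch, advantages, old_logprobs):
                ratio = math.exp(math.log(probs[a]) - old_lp)
                for k in range(len(theta)):
                    indicator = 1.0 if a == k else 0.0
                    # d(ratio * adv) / d theta_k = ratio * adv * (1[a == k] - pi(k))
                    grad[k] += ratio * adv * (indicator - probs[k]) / batch_size
            theta = [t + lr * g for t, g in zip(theta, grad)]
        # 5. Repeat: the updated policy becomes pi_theta_old on the next pass.
    return softmax(theta)

final_probs = train(rewards=[0.0, 0.5, 1.0])   # hypothetical rewards
print(final_probs)
```

<p>Each batch of samples is used for <code>epochs</code> gradient steps instead of one. Nothing in this sketch limits how far the policy moves within those steps, which is the weakness discussed next.</p>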
<p>This dramatically improves sample efficiency. Instead of discarding trajectories after a single gradient step, we can extract multiple updates from each batch of expensive LLM rollouts.</p>
</section>
<section id="the-instability-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-instability-problem">The Instability Problem</h3>
<p>While the CPI objective improves sample efficiency, <strong>unconstrained optimization of <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)"> is unstable</strong>. The core issue is that importance sampling becomes unreliable when <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> drifts far from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D">:</p>
<ul>
<li><strong>Extreme probability ratios</strong>: The ratio <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> can become arbitrarily large or small, destabilizing gradient estimates.</li>
<li><strong>Stale advantages</strong>: The estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t"> were computed under <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"> and become inaccurate as <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> diverges. The optimizer may exploit these stale estimates, making updates that appear beneficial but are actually harmful.</li>
</ul>
<p>In practice, unconstrained maximization of <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)"> often leads to excessively large policy updates that cause catastrophic performance collapse.</p>
<blockquote class="blockquote">
<p><strong>LLM Context (from Umar Jamil):</strong> Suppose we have a trajectory where the model generated “Shanghai is in China” with high advantage. Unconstrained optimization might dramatically upweight “China” as the next token given “Shanghai is in”—but this could simultaneously cause unintended probability shifts elsewhere, perhaps making the model overly likely to say “China” in completely unrelated contexts, or disrupting the probability mass across the entire vocabulary in unpredictable ways.</p>
</blockquote>
<p>We need a mechanism that prevents <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> from deviating too far from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"> and keeps the ratio <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> close to 1, while still allowing meaningful policy improvement.</p>
</section>
</section>
<section id="vii-trust-region-policy-optimization-trpo" class="level2">
<h2 class="anchored" data-anchor-id="vii-trust-region-policy-optimization-trpo">VII: Trust Region Policy Optimization (TRPO)</h2>
<p>The CPI objective is attractive because it lets us reuse data via importance ratios, but <strong>unconstrained optimization is unstable</strong>. When <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> drifts far from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D">, the probability ratios <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> become extreme and the advantage estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t"> become stale and can be exploited by the optimizer.</p>
<p>The key insight of Trust Region Policy Optimization (<a href="https://arxiv.org/abs/1502.05477">TRPO</a>) is that the surrogate objective <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)"> is only a valid approximation to the true objective within a local neighborhood of <img src="https://latex.codecogs.com/png.latex?%5Ctheta_%7B%5Ctext%7Bold%7D%7D">. TRPO paper formalized this by proving policy performance is guaranteed to improve as long as the KL divergence between consecutive policies remains bounded. This theoretical result motivates constraining the policy update to stay within a “trust region” where the surrogate objective remains reliable. See the <a href="https://arxiv.org/abs/1502.05477">TRPO paper</a> for the formal proof.</p>
<p>TRPO converts this insight into a <strong>constrained optimization problem</strong>: maximize the surrogate objective subject to a bound on the average KL divergence between the old and the new policy.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%0A%5Cbegin%7Baligned%7D%0A%5Cmax_%5Ctheta%20%5Cquad%20&amp;%20L%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%20%5Chat%7BA%7D_t%5Cright%5D%20%5C%5C%5B6pt%5D%0A%5Ctext%7Bsubject%20to%7D%20%5Cquad%20&amp;%20%5Cmathbb%7BE%7D_t%5Cleft%5BD_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(%5Ccdot%7Cs_t)%20%5C%7C%20%5Cpi_%5Ctheta(%5Ccdot%7Cs_t)%5Cright)%5Cright%5D%20%5Cleq%20%5Cdelta%0A%5Cend%7Baligned%7D%0A%7D%20%5Ctag%7BVII.I%7D%0A"></p>
<p>The hyperparameter <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> defines the trust region size, the maximum allowed divergence between consecutive policies. This constraint ensures that <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> remains close to 1, keeping our importance-weighted estimates reliable.</p>
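<p>To make the constraint in (VII.I) concrete, here is a minimal sketch that computes the average KL divergence between an old and a new policy over a small batch of states. The distributions, the helper names <code>kl_divergence</code> and <code>mean_kl</code>, and the value of <code>delta</code> are illustrative assumptions, not values from the TRPO paper.</p>

```python
import math

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po != 0.0)

def mean_kl(old_dists, new_dists):
    """Average the per-state KL over a batch of states, as in the constraint of (VII.I)."""
    kls = [kl_divergence(po, pn) for po, pn in zip(old_dists, new_dists)]
    return sum(kls) / len(kls)

# Hypothetical next-token distributions over a 3-token vocabulary, for two states.
old = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
small_step = [[0.52, 0.29, 0.19], [0.1, 0.61, 0.29]]
large_step = [[0.9, 0.05, 0.05], [0.1, 0.6, 0.3]]

delta = 0.01  # illustrative trust region size, not a recommended value
kl_small = mean_kl(old, small_step)
kl_large = mean_kl(old, large_step)
print(f"small step KL = {kl_small:.5f}, large step KL = {kl_large:.5f}, delta = {delta}")
```

<p>Under these assumed numbers, the small update stays inside the trust region while the large one falls far outside it, so TRPO would reject or shrink the large step.</p>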
<p>Solving (VII.I) requires <strong>second-order optimization</strong>. TRPO approximates the objective linearly and the KL constraint quadratically (using the Fisher Information Matrix) and then solves the resulting problem via the <strong>conjugate gradient algorithm</strong> followed by a <strong>line search</strong> to ensure constraints are satisfied.</p>
<p>For large-scale LLM training, this approach is impractical:</p>
<ul>
<li><strong>Computational overhead</strong>: Each policy update requires multiple conjugate gradient iterations and line search steps, significantly more expensive than standard gradient descent.</li>
<li><strong>Memory requirements</strong>: Computing Fisher-vector products adds substantial memory overhead for billion-parameter models.</li>
</ul>
<p>The theory behind TRPO also suggests using a <strong>KL penalty</strong> rather than a hard constraint. It is easier to implement and more computationally efficient.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmax_%5Ctheta%20%5C;%20%5Cmathbb%7BE%7D_t%5Cleft%5Br_t(%5Ctheta)%20%5Chat%7BA%7D_t%20-%20%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(%5Ccdot%7Cs_t)%20%5C%7C%20%5Cpi_%5Ctheta(%5Ccdot%7Cs_t)%5Cright)%5Cright%5D%20%5Ctag%7BVII.II%7D%0A"></p>
<p>However, choosing a penalty coefficient <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> that works across different problems or even across different training stages is notoriously difficult. This motivates Proximal Policy Optimization (PPO): a <strong>first-order method</strong> that achieves TRPO’s stability through a <strong>clipped surrogate objective</strong> rather than explicit constraints.</p>
</section>
<section id="viii-proximal-policy-optimization-ppo" class="level2">
<h2 class="anchored" data-anchor-id="viii-proximal-policy-optimization-ppo">VIII: Proximal Policy Optimization (PPO)</h2>
<p>Proximal Policy Optimization (PPO) aims to retain most of TRPO’s stability while using only <strong>first-order optimization</strong>. Instead of explicitly constraining the KL divergence, PPO modifies the objective function itself to discourage large policy updates through a <strong>clipping mechanism</strong>. Clipping implicitly limits how far the policy can move, providing a “soft” trust region using only standard gradient-based updates.</p>
<section id="clipped-surrogate-objective" class="level3">
<h3 class="anchored" data-anchor-id="clipped-surrogate-objective">Clipped Surrogate Objective</h3>
<p>Recall the CPI objective and the probability ratio from Section VI:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCPI%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5Br_t(%5Ctheta)%20%5Chat%7BA%7D_t%5Cright%5D%20%5Cquad%20%5Ctext%7Bwhere%7D%20%5Cquad%20r_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%7C%20s_t)%7D%0A"></p>
<p>The problem with <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D"> is that nothing prevents <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> from becoming arbitrarily large or small. PPO addresses this by <strong>clipping</strong> the probability ratio to stay within <img src="https://latex.codecogs.com/png.latex?%5B1-%5Cepsilon,%201+%5Cepsilon%5D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BL%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cmin%5Cleft(r_t(%5Ctheta)%20%5Chat%7BA%7D_t,%20%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5Ccdot%20%5Chat%7BA%7D_t%5Cright)%5Cright%5D%7D%20%5Ctag%7BVIII.I%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> is a hyperparameter (<img src="https://latex.codecogs.com/png.latex?%5Cepsilon%20=%200.2"> from the <a href="https://arxiv.org/abs/1707.06347">PPO paper</a>) and the clip function is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bclip%7D(r,%201-%5Cepsilon,%201+%5Cepsilon)%20=%20%5Cbegin%7Bcases%7D%0A1-%5Cepsilon%20&amp;%20%5Ctext%7Bif%20%7D%20r%20%3C%201-%5Cepsilon%20%5C%5C%0Ar%20&amp;%20%5Ctext%7Bif%20%7D%201-%5Cepsilon%20%5Cleq%20r%20%5Cleq%201+%5Cepsilon%20%5C%5C%0A1+%5Cepsilon%20&amp;%20%5Ctext%7Bif%20%7D%20r%20%3E%201+%5Cepsilon%0A%5Cend%7Bcases%7D%0A"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Cmin"> operator in (VIII.I) is important. It ensures we take the <strong>more pessimistic</strong> (lower) estimate between the clipped and unclipped objectives. This creates different behavior depending on the sign of the advantage:</p>
<p><strong>Case 1: Positive Advantage (<img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20%3E%200">)</strong></p>
<p>When an action is better than average, we want to <strong>increase</strong> its probability, which means increasing <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">. The objective becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCLIP%7D%7D_t%20=%20%5Cmin%5Cleft(r_t(%5Ctheta),%201+%5Cepsilon%5Cright)%20%5Ccdot%20%5Chat%7BA%7D_t%0A"></p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%5Cleq%201+%5Cepsilon">: The objective is <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%5Chat%7BA%7D_t">, so gradient ascent increases <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%3E%201+%5Cepsilon">: The objective becomes <img src="https://latex.codecogs.com/png.latex?(1+%5Cepsilon)%5Chat%7BA%7D_t">, which is constant in <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, so its gradient is zero.</li>
</ul>
<p>The clipping <strong>removes the incentive</strong> to increase <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> beyond <img src="https://latex.codecogs.com/png.latex?1+%5Cepsilon">.</p>
<p><strong>Case 2: Negative Advantage (<img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20%3C%200">)</strong></p>
<p>When an action is worse than average, we want to <strong>decrease</strong> its probability, which means decreasing <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">. Since <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%20%3C%200">, multiplying by a smaller <img src="https://latex.codecogs.com/png.latex?r_t"> makes the product <em>less negative</em> (larger). The objective becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BCLIP%7D%7D_t%20=%20%5Cmax%5Cleft(r_t(%5Ctheta),%201-%5Cepsilon%5Cright)%20%5Ccdot%20%5Chat%7BA%7D_t%0A"></p>
<p>(The <img src="https://latex.codecogs.com/png.latex?%5Cmin"> with negative values becomes a <img src="https://latex.codecogs.com/png.latex?%5Cmax"> in terms of which <img src="https://latex.codecogs.com/png.latex?r_t"> is selected.)</p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%5Cgeq%201-%5Cepsilon">: The objective is <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%5Chat%7BA%7D_t">, so gradient ascent decreases <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)">.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%3C%201-%5Cepsilon">: The objective becomes <img src="https://latex.codecogs.com/png.latex?(1-%5Cepsilon)%5Chat%7BA%7D_t">, which is constant in <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, so its gradient is zero.</li>
</ul>
<p>The clipping <strong>removes the incentive</strong> to decrease <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> beyond <img src="https://latex.codecogs.com/png.latex?1-%5Cepsilon">.</p>
<p>The takeaway here is that the clipped objective <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCLIP%7D%7D"> is a <strong>pessimistic lower bound</strong> on <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCPI%7D%7D">. We ignore updates when they would make things “too good to be true.”</p>
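<p>The two cases above can be checked numerically. The following is a minimal per-sample sketch of (VIII.I) in plain Python; the function names <code>clip</code> and <code>clipped_surrogate</code> and the sample ratios are assumptions chosen for illustration.</p>

```python
def clip(r, eps):
    """Clamp the probability ratio r to the interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(r, 1.0 + eps))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample L^CLIP: the minimum of the unclipped and the clipped term."""
    return min(ratio * advantage, clip(ratio, eps) * advantage)

# Sweep a few ratios for a positive and a negative advantage.
for adv in (1.0, -1.0):
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
        value = clipped_surrogate(ratio, adv)
        print(f"A={adv:+.0f}  r={ratio:.1f}  L_clip={value:+.2f}")
```

<p>Note the asymmetry: with a positive advantage the objective is capped once the ratio exceeds the upper bound, but with a negative advantage and a ratio above the upper bound the unclipped (more negative) term is kept, so the optimizer is still pushed to undo a move in the wrong direction.</p>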
<blockquote class="blockquote">
<p><strong>LLM Context (from Umar Jamil Video):</strong> In language model fine-tuning, the policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%7Cs_t)"> is the probability the model assigns to token <img src="https://latex.codecogs.com/png.latex?a_t"> given the context <img src="https://latex.codecogs.com/png.latex?s_t"> (prompt + previously generated tokens). The probability ratio <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)"> measures how much more or less likely the current model is to generate a particular token compared to the old policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D"> that sampled it. Clipping removes the incentive to change any single token’s probability by more than a factor of <img src="https://latex.codecogs.com/png.latex?(1%20%5Cpm%20%5Cepsilon)"> within one update iteration, which keeps the model from “overreacting” to high-advantage tokens.</p>
</blockquote>
</section>
<section id="ppo-objective" class="level3">
<h3 class="anchored" data-anchor-id="ppo-objective">PPO Objective</h3>
<p>In practice, PPO combines the clipped policy objective with two additional terms:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BL%5E%7B%5Ctext%7BPPO%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5BL%5E%7B%5Ctext%7BCLIP%7D%7D_t(%5Ctheta)%20-%20c_1%20L%5E%7B%5Ctext%7BVF%7D%7D_t(%5Ctheta)%20+%20c_2%20S%5B%5Cpi_%5Ctheta%5D(s_t)%5Cright%5D%7D%20%5Ctag%7BVIII.II%7D%0A"></p>
<p><strong>1. Value Function Loss (<img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BVF%7D%7D">):</strong> Recall from Section V that we need a value function <img src="https://latex.codecogs.com/png.latex?V_%5Cphi(s)"> to compute advantage estimates. The value function is trained to minimize the squared error between its predictions and the actual returns:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BVF%7D%7D_t(%5Ctheta)%20=%20%5Cleft(V_%5Ctheta(s_t)%20-%20V_t%5E%7B%5Ctext%7Btarget%7D%7D%5Cright)%5E2%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?V_t%5E%7B%5Ctext%7Btarget%7D%7D"> is typically the discounted return-to-go. When the policy and value function share parameters (common in LLM fine-tuning where both use the same transformer backbone), this loss is subtracted from the objective (hence the negative sign, since we maximize <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BPPO%7D%7D"> but minimize <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BVF%7D%7D">).</p>
<p><strong>2. Entropy Bonus (<img src="https://latex.codecogs.com/png.latex?S%5B%5Cpi_%5Ctheta%5D">):</strong> To encourage exploration and prevent premature convergence to deterministic policies, PPO adds an entropy bonus:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS%5B%5Cpi_%5Ctheta%5D(s_t)%20=%20-%5Csum_a%20%5Cpi_%5Ctheta(a%7Cs_t)%20%5Clog%20%5Cpi_%5Ctheta(a%7Cs_t)%0A"></p>
<p>Here, the coefficients <img src="https://latex.codecogs.com/png.latex?c_1,%20c_2%20%3E%200"> control the regularization strength.</p>
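<p>As a rough illustration of how the three terms in (VIII.II) combine for a single timestep, here is a minimal sketch. The helper names and the coefficient values <code>c1=0.5</code> and <code>c2=0.01</code> are assumptions for illustration only; real implementations average over a batch and tune these coefficients.</p>

```python
import math

def entropy(probs):
    """Shannon entropy S = -sum_a p(a) log p(a) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p != 0.0)

def value_loss(v_pred, v_target):
    """Squared error between the predicted value and its target."""
    return (v_pred - v_target) ** 2

def ppo_objective(l_clip, v_pred, v_target, probs, c1=0.5, c2=0.01):
    """Per-timestep objective to maximize: L_CLIP - c1 * L_VF + c2 * S."""
    return l_clip - c1 * value_loss(v_pred, v_target) + c2 * entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximum-entropy policy over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic policy

print(f"entropy(uniform) = {entropy(uniform):.4f}")
print(f"entropy(peaked)  = {entropy(peaked):.4f}")
print(f"objective        = {ppo_objective(1.2, 2.0, 3.0, uniform):.4f}")
```

<p>The uniform distribution earns the larger entropy bonus, which is the sense in which this term rewards keeping some probability mass on alternative tokens.</p>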
</section>
</section>
<section id="ix-complete-ppo-objective-with-kl-penalty" class="level2">
<h2 class="anchored" data-anchor-id="ix-complete-ppo-objective-with-kl-penalty">IX: Complete PPO Objective with KL Penalty</h2>
<p>When fine-tuning an LLM with “vanilla” PPO, the policy learns to maximize rewards from the reward model. However, the reward model is an imperfect proxy for human preferences. It is a neural network trained on limited data that can be exploited. Without constraints, the policy may discover adversarial outputs that achieve high reward scores while producing text that:</p>
<ul>
<li>Degenerates into repetitive or nonsensical patterns that “fool” the reward model</li>
<li>Drifts far from natural language, losing fluency and coherence</li>
<li>Exploits spurious correlations learned by the reward model</li>
</ul>
<p>This phenomenon is called <strong>reward hacking</strong>. The policy finds a way to “game” the reward model rather than genuinely improving response quality.</p>
<p>To prevent reward hacking, the <a href="https://arxiv.org/pdf/2203.02155">InstructGPT paper</a> adds a <strong>KL divergence penalty</strong> that regularizes the policy to stay close to a <strong>reference model</strong> <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D"> (typically the SFT model before RL fine-tuning).</p>
<p>From Section VIII, the PPO objective (to be maximized via gradient ascent) consists of three terms:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AL%5E%7B%5Ctext%7BPPO%7D%7D(%5Ctheta)%20=%20%5Cunderbrace%7BL%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%7D_%7B%5Ctext%7BClipped%20Policy%20Objective%7D%7D%20-%20%5Cunderbrace%7Bc_1%20L%5E%7B%5Ctext%7BVF%7D%7D(%5Ctheta)%7D_%7B%5Ctext%7BValue%20Function%20Loss%7D%7D%20+%20%5Cunderbrace%7Bc_2%20S%5B%5Cpi_%5Ctheta%5D%7D_%7B%5Ctext%7BEntropy%20Bonus%7D%7D%0A"></p>
<p>Now, we don’t use raw reward model scores directly. Instead, we define a <strong>KL-penalized reward</strong> that regularizes the policy to stay close to a reference model <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7Br_%7B%5Ctext%7Btotal%7D%7D(s_t,%20a_t)%20=%20r_%7B%5Ctext%7BRM%7D%7D(s_t,%20a_t)%20-%20%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%5Ctheta(%5Ccdot%7Cs_t)%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(%5Ccdot%7Cs_t)%5Cright)%7D%20%5Ctag%7BIX.I%7D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?r_%7B%5Ctext%7BRM%7D%7D(s_t,%20a_t)"> is the reward signal at timestep <img src="https://latex.codecogs.com/png.latex?t"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is the KL penalty coefficient</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bref%7D%7D"> is the frozen reference model</li>
</ul>
<p>At each token position, the KL divergence simplifies to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctext%7BKL%7D%7D%5Cleft(%5Cpi_%5Ctheta(%5Ccdot%7Cs_t)%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D(%5Ccdot%7Cs_t)%5Cright)%20=%20%5Cmathbb%7BE%7D_%7Ba%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5B%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(a%7Cs_t)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(a%7Cs_t)%7D%5Cright%5D%0A"></p>
<p>In practice we estimate this expectation with the sampled token <img src="https://latex.codecogs.com/png.latex?a_t">, yielding: <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20d_t=%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%7Cs_t)%7D%7B%5Cpi_%7B%5Cmathrm%7Bref%7D%7D(a_t%7Cs_t)%7D%0A"></p>
<p>Note that the reward model <img src="https://latex.codecogs.com/png.latex?r_%5Cphi(x,%20y)"> produces a single scalar for the complete response <img src="https://latex.codecogs.com/png.latex?(x,%20y)">. This score is assigned only at the <strong>final token</strong> <img src="https://latex.codecogs.com/png.latex?T">, while the KL penalty applies at <strong>every token</strong>. <img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7Br%7D_%5Cphi%20=%20%5Cbegin%7Bcases%7D%0A-%5Cbeta%20%5Ccdot%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%7C%20s_t)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(a_t%20%7C%20s_t)%7D%20&amp;%20%5Ctext%7Bif%20%7D%20t%20%3C%20T%20%5C%5C%5B8pt%5D%0Ar_%5Cphi(x,%20y)%20-%20%5Cbeta%20%5Ccdot%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(a_T%20%7C%20s_T)%7D%7B%5Cpi_%7B%5Ctext%7Bref%7D%7D(a_T%20%7C%20s_T)%7D%20&amp;%20%5Ctext%7Bif%20%7D%20t%20=%20T%0A%5Cend%7Bcases%7D%0A"></p>
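<p>The piecewise definition above can be sketched in a few lines. This is a simplified illustration that assumes we already have per-token log-probabilities of the sampled tokens under the policy and the reference model; the function name <code>kl_penalized_rewards</code>, the numbers, and <code>beta=0.1</code> are hypothetical.</p>

```python
def kl_penalized_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards: -beta * log-ratio at every token, plus the RM score at the last token.

    Assumes a non-empty response; logp_policy[t] and logp_ref[t] are the
    log-probabilities of the sampled token a_t under the policy and the reference.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # the scalar reward model score lands on the final token only
    return rewards

# Hypothetical 3-token response.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]
rewards = kl_penalized_rewards(logp_policy, logp_ref, rm_score=1.0)
print([round(r, 4) for r in rewards])
```

<p>Tokens the policy makes more likely than the reference receive a small negative reward, tokens it makes less likely receive a small positive one, and only the last position carries the reward model score.</p>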
<p>The KL penalty serves two purposes:</p>
<ol type="1">
<li><strong>Prevents reward hacking</strong>: The policy cannot drift arbitrarily far from natural language</li>
<li><strong>Maintains fluency</strong>: Outputs remain similar in distribution to the well-trained SFT model</li>
</ol>
<p>The KL penalty enters PPO through the modified per-token rewards, and therefore through the advantage estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t">. At the level of the expected return, this is equivalent to subtracting the KL term directly from the objective, which is often the simpler form to write down and implement. The PPO objective with KL penalty is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AJ(%5Ctheta)%20=%20%5Cunderbrace%7B%5Cmathbb%7BE%7D_%7Ba%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cleft%5Br_%7B%5Ctext%7BRM%7D%7D(s,%20a)%5Cright%5D%7D_%7B%5Ctext%7BVanilla%20PPO%20objective%7D%7D%20-%20%5Cunderbrace%7B%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%7D_%7B%5Ctext%7BKL%20penalty%20term%7D%7D%0A"></p>
<p>The first term is exactly what vanilla PPO optimizes using the clipped surrogate. The KL penalty term appears as a separate additive component that penalizes divergence from the reference model. Substituting the PPO clipped surrogate for the first term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AJ_%7B%5Ctext%7Bc%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5Cleft%5B%5Cmin%5Cleft(r_t(%5Ctheta)%20%5Chat%7BA%7D_t,%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5Chat%7BA%7D_t%5Cright)%5Cright%5D%20-%20%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%0A"></p>
<p>Combining all components, the <strong>complete PPO objective with KL penalty</strong> (to be maximized) is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7BL%5E%7B%5Ctext%7BRLHF%7D%7D(%5Ctheta)%20=%20%5Cunderbrace%7BL%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%7D_%7B%5Ctext%7BPolicy%20Objective%7D%7D%20-%20%5Cunderbrace%7Bc_1%20L%5E%7B%5Ctext%7BVF%7D%7D(%5Ctheta)%7D_%7B%5Ctext%7BValue%20Loss%7D%7D%20+%20%5Cunderbrace%7Bc_2%20S%5B%5Cpi_%5Ctheta%5D%7D_%7B%5Ctext%7BEntropy%20Bonus%7D%7D%20-%20%5Cunderbrace%7B%5Cbeta%20%5Ccdot%20D_%7B%5Ctext%7BKL%7D%7D(%5Cpi_%5Ctheta%20%5C%7C%20%5Cpi_%7B%5Ctext%7Bref%7D%7D)%7D_%7B%5Ctext%7BKL%20Penalty%7D%7D%7D%20%5Ctag%7BIX.II%7D%0A"></p>
<p>Here, each term serves a distinct purpose:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Term</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Policy Objective</strong> <img src="https://latex.codecogs.com/png.latex?L%5E%7B%5Ctext%7BCLIP%7D%7D"></td>
<td>Improves the policy while preventing destructive updates via clipping</td>
</tr>
<tr class="even">
<td><strong>Value Loss</strong> <img src="https://latex.codecogs.com/png.latex?c_1%20L%5E%7B%5Ctext%7BVF%7D%7D"></td>
<td>Trains the critic for accurate advantage estimation (subtracted to minimize)</td>
</tr>
<tr class="odd">
<td><strong>Entropy Bonus</strong> <img src="https://latex.codecogs.com/png.latex?c_2%20S%5B%5Cpi_%5Ctheta%5D"></td>
<td>Encourages exploration, prevents premature convergence</td>
</tr>
<tr class="even">
<td><strong>KL Penalty</strong> <img src="https://latex.codecogs.com/png.latex?%5Cbeta%20D_%7B%5Ctext%7BKL%7D%7D"></td>
<td>Prevents reward hacking, maintains language quality (subtracted to penalize drift)</td>
</tr>
</tbody>
</table>
<p>It is important to distinguish the two mechanisms in the complete loss that limit policy drift. The PPO clipping mechanism acts as a <strong>short-term anchor</strong> that constrains how much the policy can change in a single update, while the KL penalty is a <strong>long-term anchor</strong> that constrains how far the policy can drift from its starting point across all of training.</p>
</section>
<section id="finally-done" class="level2">
<h2 class="anchored" data-anchor-id="finally-done">Finally done…</h2>
<p>And that’s the full derivation! What I find satisfying is that every term in the final loss has a specific purpose. Each one exists because we ran into a specific problem along the way and needed to fix it. I will admit it was not easy to understand all the math and concepts behind the loss. I still do not fully understand every detail but I understand it far better than I did a few days ago.</p>
<p>I hope this was useful. If you spot any errors in the derivation (which I’m sure there are) or have suggestions, feel free to reach out.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><strong>Video:</strong>
<ul>
<li><a href="https://www.youtube.com/watch?v=qGyFrqc34yc">Umar Jamil’s video on RLHF and PPO</a>: A comprehensive and must-watch video covering RLHF and PPO concepts.</li>
</ul></li>
<li><strong>Papers:</strong>
<ul>
<li><a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization Algorithms</a>: The foundational PPO paper introducing the clipped surrogate objective.</li>
<li><a href="https://arxiv.org/pdf/2203.02155">Training language models to follow instructions with human feedback</a>: The InstructGPT paper demonstrating PPO with KL penalty to mitigate reward hacking in LLM fine-tuning.</li>
<li><a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a>: The TRPO paper that motivates the trust region constraints used in PPO.</li>
<li><a href="https://arxiv.org/abs/1506.02438">High-Dimensional Continuous Control Using Generalized Advantage Estimation</a>: GAE paper introducing the exponentially-weighted advantage estimator for variance reduction in policy gradients.</li>
</ul></li>
</ul>


</section>

 ]]></description>
  <category>RL &amp; Alignment</category>
  <guid>https://garg-aayush.github.io/posts/2025-12-25-deriving-ppo-loss.html</guid>
  <pubDate>Thu, 25 Dec 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>What I Learned Building SFT from the Ground Up</title>
  <link>https://garg-aayush.github.io/posts/2025-12-28-sft-from-scratch.html</link>
  <description><![CDATA[ 




<p>Over the past few weeks, I implemented supervised fine-tuning (SFT) from scratch, continuing a series of projects where I’m building foundational LLM components as a learning exercise from the ground up. Previously, I’ve worked through <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/gpt-2">implementing GPT-2 from scratch</a> and <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/llm-inference">writing LLM inference scripts from the ground up</a>. Naturally, SFT was the next step in this series.</p>
<p>One thing I realized pretty quickly: writing the training scripts from scratch is not the most difficult part. Making them actually work and produce results that seem reasonable is where the real challenge begins 😅. You run into all sorts of difficulties: debugging annoying errors, dealing with gradient instabilities, getting vLLM to cooperate for intermediate evaluation (especially with limited GPU memory), and so on. <strong>These are the things that eat up your time but teach you the most</strong>.</p>
<p>In this post, I want to share not just what I built, but the building and debugging journey that got me there.</p>
<section id="what-i-built" class="level2">
<h2 class="anchored" data-anchor-id="what-i-built">What I Built</h2>
<p>I loosely followed Stanford’s <a href="https://github.com/stanford-cs336/assignment5-alignment">CS336 Assignment 5</a> as a guide, wrote all the SFT core components, and ran two sets of experiments:</p>
<p><strong>1. Reasoning SFT</strong>: Fine-tuned <a href="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B">Qwen2.5-Math-1.5B</a> on math reasoning traces to improve step-by-step problem solving capabilities.</p>
<p align="center">
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/sft_reasoning_summary.png" alt="Reasoning SFT Results" width="600">
</p><p align="center">
<em>Best: <b>53.4% reward accuracy</b> (up from 2.9% baseline) with 99.3% format accuracy</em>
</p>
<p><strong>2. Instruction SFT</strong>: Fine-tuned <a href="https://huggingface.co/meta-llama/Llama-3.1-8B">Llama-3.1-8B</a> on UltraChat-200K + SafetyLlama for general instruction following and safety.</p>
<p align="center">
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/instruct_finetune_results_nomask.png" alt="Instruction SFT Results">
</p><p align="center">
<em>Best: GSM8K 16-&gt;<b>33%</b>, Safety 62-&gt;<b>78%</b>, AlpacaEval 1.6-&gt;<b>5.3%</b>, MMLU ~58%</em>
</p>
<p></p>
<p>All experiment code, training scripts, and detailed notes are available in my <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/sft">building-from-scratch</a> repo.</p>
</section>
<section id="part-1-reasoning-sft-with-qwen2.5-math-1.5b" class="level2">
<h2 class="anchored" data-anchor-id="part-1-reasoning-sft-with-qwen2.5-math-1.5b">Part 1: Reasoning SFT with Qwen2.5-Math-1.5B</h2>
<p>The idea behind reasoning SFT is simple. You take a base model that barely outputs correct answers, show it high-quality examples of <em>how</em> to solve problems step-by-step, and train it to mimic that reasoning process. The model learns to think in a structured format: it first generates its reasoning inside <code>&lt;think&gt;</code> tags, then outputs the final answer in <code>&lt;answer&gt;</code> tags.</p>
<p>My starting point was <code>Qwen2.5-Math-1.5B</code>, which had quite poor baseline accuracies on the math validation set: <strong>~2.9%</strong> for answers and <strong>~14%</strong> for format.</p>
<section id="creating-the-dataset-first-challenge" class="level3">
<h3 class="anchored" data-anchor-id="creating-the-dataset-first-challenge">Creating the Dataset: First Challenge</h3>
<p>The original CS336 MATH dataset used for SFT training is not publicly available, so I had to create my own. My dataset creation pipeline had three steps:</p>
<ol type="1">
<li><p><strong>Source problems</strong>: I used the <a href="https://huggingface.co/datasets/hiyouga/math12k">hiyouga/math12k</a> dataset to create the training set, carefully filtering out any problems that appeared in the validation set to avoid data leakage.</p></li>
<li><p><strong>Generate reasoning traces</strong>: The next and most important step is to <strong>generate the reasoning traces for each problem</strong>. I used the <code>gpt-oss-120b</code> model to generate them via the Fireworks Batch Inference API. It cost me about $4 to generate the reasoning traces.</p></li>
<li><p><strong>Filter for quality</strong>: I also created a subset of about 3.6K examples by filtering out the reasoning traces that led to wrong answers.</p></li>
</ol>
</section>
<section id="the-training-loop-per-token-vs.-sequence-loss" class="level3">
<h3 class="anchored" data-anchor-id="the-training-loop-per-token-vs.-sequence-loss">The Training Loop: Per-Token vs.&nbsp;Sequence Loss</h3>
<p>The original assignment uses sequence-level loss normalization, where you sum the loss over all tokens in a sequence and normalize by a constant, not by the variable number of tokens.</p>
<p>While running the initial experiments, I noticed that the gradient norms were very large and training felt unstable. Even though the loss seemed to be going in the right direction, something didn’t feel right. After some investigation, I realized the issue: with variable-length sequences (my training examples ranged from short to quite long), longer sequences contribute more to the gradient than shorter ones. This creates high variance in gradient updates.</p>
<table align="center">
<tbody><tr>
<td>
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/grad_norm_wo_token_loss.png" alt="Gradient Norm without Per-Token Loss" width="400">
</td>
<td>
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/grad_norm_w_token_loss.png" alt="Gradient Norm with Per-Token Loss" width="400">
</td>
</tr>
</tbody></table>
<p align="center">
<em>Left: Sequence-level loss (high variance gradients) | Right: Per-token loss (stable gradients)</em>
</p>
<p>Thus, I added a <code>per_token_loss</code> flag to my training step which, when enabled, normalizes the loss by the actual number of response tokens in each sequence. Accuracy improved slightly and, more importantly, the gradients became much more stable with per-token normalization.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Run</th>
<th>Loss Normalization</th>
<th>Reward Accuracy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>run_filtered</td>
<td>Per-token</td>
<td>0.5204</td>
</tr>
<tr class="even">
<td>run_filtered-res-len</td>
<td>Sequence-level</td>
<td>0.5106</td>
</tr>
</tbody>
</table>
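<p>To see why sequence length matters, here is a minimal sketch contrasting the two normalizations on a short and a long response with identical per-token losses. The helper names and the toy numbers are assumptions for illustration; this is not the training code from the repo.</p>

```python
def masked_sum(token_losses, response_mask):
    """Sum the loss over response tokens only (mask is 1 for response tokens, 0 otherwise)."""
    return sum(loss * m for loss, m in zip(token_losses, response_mask))

def sequence_level_loss(token_losses, response_mask, normalize_constant=1.0):
    """Sum over response tokens, divided by a fixed constant."""
    return masked_sum(token_losses, response_mask) / normalize_constant

def per_token_normalized_loss(token_losses, response_mask):
    """Sum over response tokens, divided by the number of response tokens."""
    n_response = max(sum(response_mask), 1)  # guard against an all-zero mask
    return masked_sum(token_losses, response_mask) / n_response

# Two prompt tokens followed by 2 or 200 response tokens, each with loss 1.0.
short = ([1.0] * 4, [0, 0, 1, 1])
long = ([1.0] * 202, [0, 0] + [1] * 200)

for name, (losses, mask) in (("short", short), ("long", long)):
    seq = sequence_level_loss(losses, mask)
    tok = per_token_normalized_loss(losses, mask)
    print(f"{name}: sequence-level = {seq:.1f}, per-token = {tok:.1f}")
```

<p>With a fixed constant, the long response produces a loss (and hence a gradient) about 100 times larger than the short one even though the model is equally wrong on every token. Per-token normalization puts both on the same scale.</p>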
</section>
<section id="vllm-integration-the-debugging-nightmare" class="level3">
<h3 class="anchored" data-anchor-id="vllm-integration-the-debugging-nightmare">vLLM Integration: The Debugging Nightmare</h3>
<p>Here’s where things got really tricky and painful. I wanted to run intermediate evaluations during training using vLLM for fast inference. The assignment provided code for this but it was written for an older vLLM version and nothing worked out of the box 😅.</p>
<p><strong>Problem 1: vLLM initialization changed</strong></p>
<p>The assignment’s approach used a separate GPU dedicated to running vLLM as an inference server. I wasn’t keen on this setup anyway as it meant paying for an extra GPU just for inference. But more importantly, the approach broke completely with the vLLM version I was using (0.7+). The initialization logic had changed, and the old code just wouldn’t run.</p>
<p><strong><em>Solution</em></strong>: I switched to the <code>colocate</code> approach, running vLLM on the same device as the training model. I came across this in the excellent <a href="https://huggingface.co/blog/vllm-colocate">HuggingFace blog post on co-located vLLM</a>. This requires being more careful about GPU memory (setting appropriate values for <code>gpu_memory_utilization</code>, <code>max_model_len</code>, and <code>max_num_seqs</code>), but it works and saves on GPU costs.</p>
<p><strong>Problem 2: Missing <code>model_executor</code> attribute</strong></p>
<p>When I tried to load updated model weights into the vLLM instance during training, I hit this error:</p>
<pre><code>AttributeError: 'LLMEngine' object has no attribute 'model_executor'</code></pre>
<p>This was really annoying because the attribute clearly existed in the vLLM source code. After much debugging, I found two solutions:</p>
<ul>
<li>Downgrade to vLLM 0.10.2, or</li>
<li>If using vLLM 0.11.0, set the environment variable <code>VLLM_ENABLE_V1_MULTIPROCESSING=0</code> at the start of the script</li>
</ul>
<p>I went with the environment variable approach since I didn’t want to deal with version conflicts.</p>
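<p>A minimal sketch of the environment variable approach is below. It assumes the variable is set at the very top of the training script, before vLLM is imported or the engine is created. Whether it has the same effect on other vLLM versions is not something this post establishes.</p>

```python
import os

# Set this at the very top of the script, before vLLM is imported or an engine is built.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

# import vllm  # left commented out so the sketch runs without vLLM installed

print(os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"])
```

<p>Setting it inside the script rather than in the shell keeps the workaround next to the code that depends on it.</p>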
<p><strong>Problem 3: The <code>_orig_mod</code> issue</strong></p>
<p>With <code>torch.compile</code> enabled on my model (for faster training), loading weights into vLLM failed with the error below. The issue is that <code>torch.compile</code> wraps the original model and keeps it under the <code>_orig_mod</code> attribute, so the parameter names of the compiled model carry an <code>_orig_mod.</code> prefix that vLLM does not recognize. When loading weights into vLLM, you need to read them from the wrapped module, not from the compiled wrapper.</p>
<pre><code>ValueError: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM</code></pre>
<p><strong><em>Solution</em></strong>: In my <code>load_policy_into_vllm_instance</code> function, I made sure to load from <code>model._orig_mod</code> when the model is compiled.</p>
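<p>The unwrapping itself is small. Here is a minimal sketch of the idea with two hypothetical helpers (<code>unwrap_compiled</code> and <code>strip_compile_prefix</code> are names I made up for illustration, not functions from the repo or from PyTorch). The actual call that hands the weights to vLLM is left out.</p>

```python
def unwrap_compiled(model):
    """Return the original module if `model` was wrapped by torch.compile,
    otherwise return `model` unchanged."""
    return getattr(model, "_orig_mod", model)


def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Alternative route: drop the wrapper prefix from parameter names so
    they match the names the uncompiled architecture expects."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state_dict.items()
    }


# Usage sketch (assumes `policy` is the possibly compiled training model):
#   weights = unwrap_compiled(policy).state_dict()
#   ...then load `weights` into the vLLM instance.
```

<p>Using <code>getattr</code> with a fallback means the same loading code works whether or not <code>torch.compile</code> is enabled.</p>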
<p><strong>These three issues cost me almost a day. However, it was worth it because I learned a lot about vLLM and how to integrate it into a training run.</strong></p>
</section>
<section id="results" class="level3">
<h3 class="anchored" data-anchor-id="results">Results</h3>
<p>After all that debugging, here’s what the training runs achieved:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/sft_train_reasoning_results.png" class="img-fluid figure-img"></p>
<figcaption>Reasoning SFT Results</figcaption>
</figure>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Run</th>
<th>Training Data</th>
<th>Reward Accuracy</th>
<th>Format Accuracy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>baseline</td>
<td>-</td>
<td>0.0288</td>
<td>0.1438</td>
</tr>
<tr class="even">
<td>run_all</td>
<td>Full 4.8K (correct + incorrect)</td>
<td>0.4214</td>
<td>0.9924</td>
</tr>
<tr class="odd">
<td>run_filtered</td>
<td>Filtered 3.6K (correct only)</td>
<td>0.5204</td>
<td>0.9906</td>
</tr>
<tr class="even">
<td>run_filtered-2epoch</td>
<td>Filtered 3.6K (2 epochs)</td>
<td>0.5336</td>
<td>0.9926</td>
</tr>
</tbody>
</table>
<p><strong>Key takeaways</strong>:</p>
<ul>
<li>Filtering out incorrect reasoning traces boosted accuracy from 42% to 52%. Training on wrong traces teaches the model wrong patterns.</li>
<li>The model quickly learned the output format (99%+ format accuracy after training).</li>
<li>Running for 2 epochs gave a further boost in accuracy, though a marginal one (52.0% to 53.4%).</li>
</ul>
</section>
</section>
<section id="part-2-instruction-sft-with-llama-3.1-8b" class="level2">
<h2 class="anchored" data-anchor-id="part-2-instruction-sft-with-llama-3.1-8b">Part 2: Instruction SFT with Llama-3.1-8B</h2>
<p>With the reasoning SFT working, I moved on to the second part: instruction fine-tuning. This loosely follows the <a href="https://github.com/stanford-cs336/assignment5-alignment/blob/main/cs336_spring2025_assignment5_supplement_safety_rlhf.pdf">CS336 Supplementary Assignment 5</a>, where the goal is to build a model that can follow diverse instructions and refuse harmful requests.</p>
<p>Unlike reasoning SFT, instruction fine-tuning uses conversational instruction-response pairs. The training data combines <strong>UltraChat-200K</strong> (diverse multi-turn conversations) and <strong>SafetyLlama</strong> (safety-focused examples) totaling around 200K examples, formatted using the Alpaca prompt template.</p>
<p>For evaluation, I used four benchmarks as specified in the assignment:</p>
<ul>
<li><strong>GSM8K</strong>: Grade-school math problems (tests math reasoning)</li>
<li><strong>MMLU</strong>: Multiple-choice questions across 57 subjects (tests factual knowledge)</li>
<li><strong>AlpacaEval</strong>: Open-ended instructions judged by LLM-as-judge (tests instruction-following quality)</li>
<li><strong>Simple Safety Tests (SST)</strong>: Harmful prompts to test refusal behavior (tests safety)</li>
</ul>
<section id="the-prompt-masking-implementation-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-prompt-masking-implementation-problem">The Prompt Masking Implementation Problem</h3>
<p>I wanted to experiment with <strong>prompt masking</strong> i.e.&nbsp;masking prompt tokens (labels = -100) so the loss is computed only on response tokens, helping the model focus on generating good responses.</p>
<p><strong>Problem 1: BPE tokenization boundary issues</strong></p>
<p>Implementing this led to an interesting debugging session. When I tokenized the prompt separately (ending with <code>"### Response:\n"</code>) and compared it to the tokens in the full sequence (prompt + response), the boundary tokens didn’t match. This is a known issue of BPE tokenization: subword merging behavior changes based on context.</p>
<p>My first instinct was to implement complex boundary detection logic. However, I decided to try the simplest fix that works first.</p>
<p><strong><em>Solution</em></strong>: I decided to drop the last token from the prompt before masking. This is a bit of a quick fix: I might train on one extra formatting token (likely just a newline), but I will never accidentally mask response tokens.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Conservative fix: drop last prompt token to avoid boundary issues</span></span>
<span id="cb3-2">prompt_length <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(prompt_tokens) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-3">labels[:prompt_length] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span></code></pre></div></div>
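<p>To make the fix concrete, here is a self-contained sketch of building the labels from plain lists of token ids. The function name <code>build_labels</code> and the toy token ids are assumptions for illustration; the real pipeline works on tokenizer output and tensors.</p>

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips


def build_labels(input_ids, prompt_ids):
    """Mask the prompt part of `input_ids`, keeping one extra token unmasked
    so a boundary token that merged differently is never hidden by mistake."""
    n_masked = max(len(prompt_ids) - 1, 0)    # drop the last prompt token
    n_masked = min(n_masked, len(input_ids))  # never mask past the sequence
    return [IGNORE_INDEX] * n_masked + list(input_ids[n_masked:])


# Toy example: the separately tokenized prompt ends in token 99, but in the
# full sequence the boundary merged into a different token (13).
full_sequence = [10, 11, 12, 13, 14, 15]
prompt_only = [10, 11, 12, 99]
labels = build_labels(full_sequence, prompt_only)
print(labels)  # [-100, -100, -100, 13, 14, 15]
```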
<p><strong>Problem 2: Very short or empty responses</strong></p>
<p>Another issue I ran into with prompt masking was that some training examples had very short or empty responses. When almost every token is masked, leaving only a few (or zero) response tokens, the cross-entropy loss calculation can produce extreme values or NaNs.</p>
<p><strong><em>Solution</em></strong>: The fix was simple. I filtered out examples with very short responses (0-2 words) from both training and validation sets.</p>
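<p>A filter like this takes only a few lines. The sketch below assumes each example is a dict with a <code>response</code> field, which is a hypothetical schema; adapt the key to your dataset.</p>

```python
def has_enough_words(response, min_words=3):
    """True if the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def filter_short_responses(examples, min_words=3, key="response"):
    """Drop examples whose response is empty or only 0-2 words long."""
    return [
        ex for ex in examples
        if has_enough_words(ex.get(key) or "", min_words)
    ]


# Usage sketch: apply the same filter to both splits.
#   train = filter_short_responses(train)
#   val = filter_short_responses(val)
```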
</section>
<section id="setting-up-alpacaeval" class="level3">
<h3 class="anchored" data-anchor-id="setting-up-alpacaeval">Setting Up AlpacaEval</h3>
<p>A quick note on the AlpacaEval evaluation setup. It uses an LLM-as-judge approach where an annotator model compares outputs from your model against GPT-4 reference responses.</p>
<p>The assignment suggested deploying <code>Llama-3.3-70B-Instruct</code> locally as the annotator, but that requires at least two GPUs, which is not cost-effective (at least in my case). Instead, I used <code>Llama-3.3-70B-Instruct</code> via the Fireworks API. This required some config tweaking (API key mapping, judge configuration) but works well.</p>
</section>
<section id="results-and-analysis" class="level3">
<h3 class="anchored" data-anchor-id="results-and-analysis">Results and Analysis</h3>
<p>I ran two experiments: one with prompt masking (<code>mask</code>) and one without (<code>no-mask</code>).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/instruct_finetune_results_comparison.png" class="img-fluid figure-img"></p>
<figcaption>Instruction Fine-tuning Comparison</figcaption>
</figure>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Benchmark</th>
<th>Baseline</th>
<th>No-Mask</th>
<th>Mask</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>GSM8K</strong></td>
<td>16.4%</td>
<td>29.0%</td>
<td><strong>32.7%</strong></td>
</tr>
<tr class="even">
<td><strong>MMLU</strong></td>
<td>58.1%</td>
<td>58.4%</td>
<td>58.2%</td>
</tr>
<tr class="odd">
<td><strong>SST Safety</strong></td>
<td>62.0%</td>
<td><strong>78.0%</strong></td>
<td>77.0%</td>
</tr>
<tr class="even">
<td><strong>AlpacaEval</strong></td>
<td>1.57%</td>
<td><strong>5.3%</strong></td>
<td>4.5%</td>
</tr>
</tbody>
</table>
<ul>
<li><p><strong>GSM8K (16% -&gt; 29-33%)</strong>: Both approaches significantly improved math reasoning, but masking helped more (32.7% vs 29.0%).</p></li>
<li><p><strong>Safety (62% -&gt; 78%)</strong>: There is a big improvement, as expected, since the training data includes SafetyLlama examples.</p></li>
<li><p><strong>AlpacaEval (1.6% -&gt; 5.3%)</strong>: The conversational instruction-following improved substantially. Interestingly, no-mask performed slightly better (5.3% vs 4.5%). My guess: training on the full sequence helps the model learn overall conversational patterns and produce more naturally flowing responses that match the prompt style.</p></li>
<li><p><strong>MMLU (~58% -&gt; ~58%)</strong>: This stayed flat and that’s actually good news. MMLU tests factual knowledge, which is encoded during pre-training. SFT teaches the model <em>how</em> to respond, not <em>what</em> to know. The fact that MMLU didn’t drop means we avoided catastrophic forgetting.</p></li>
</ul>
<p align="center">
<img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/sft/results/plots/instruct_finetune_mmlu_comparison.png" alt="MMLU Subject Comparison">
</p><p align="center">
<em>Looking at individual MMLU subjects, some regressed slightly (college math: 33% -&gt; 26%) while others improved slightly, leading to near-zero net change.</em>
</p>
<p></p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>While writing the SFT code from scratch, I ran into a lot of debugging challenges. It was at times painstaking and frustrating but was also a valuable learning experience. By debugging, I learned a lot about how things work under the hood, and the whole experience prepares you for how to go about debugging code/projects in the future.</p>
<p>I leave you with some of the debugging tips I came across:</p>
<ul>
<li><strong>vLLM OOM</strong>: Tune <code>max_model_len</code>, <code>max_num_seqs</code>, and <code>gpu_memory_utilization</code> and start conservative.</li>
<li><strong>Per-token loss</strong>: Normalize by response token count to prevent long sequences from dominating gradients.</li>
<li><strong>torch.compile + vLLM</strong>: Access weights via <code>model._orig_mod</code> when loading into vLLM.</li>
<li><strong>BPE boundaries</strong>: Drop last prompt token before masking to avoid tokenization edge cases.</li>
<li><strong>Data quality matters</strong>: Filtering incorrect traces gave me a 10-percentage-point accuracy boost (42% to 52%).</li>
<li><strong>vLLM version issues</strong>: Set <code>VLLM_ENABLE_V1_MULTIPROCESSING=0</code> if <code>model_executor</code> is missing.</li>
</ul>
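<p>The per-token loss tip above is easiest to see with plain numbers. This is a minimal sketch in pure Python, assuming the per-token losses are already computed and that masked positions carry the label <code>-100</code>; the function names are hypothetical, and a real training loop would do the same thing with tensors.</p>

```python
IGNORE_INDEX = -100  # label used for masked (prompt/padding) positions


def sequence_loss(token_losses, labels):
    """Mean loss over the response tokens of one sequence."""
    kept = [loss for loss, label in zip(token_losses, labels)
            if label != IGNORE_INDEX]
    if not kept:
        # An all-masked sequence would otherwise mean dividing by zero.
        raise ValueError("sequence has no response tokens to train on")
    return sum(kept) / len(kept)


def batch_loss(batch):
    """Average the per-sequence means so every sequence counts equally,
    regardless of how many response tokens it has."""
    per_sequence = [sequence_loss(losses, labels) for losses, labels in batch]
    return sum(per_sequence) / len(per_sequence)


short = ([2.0, 2.0], [IGNORE_INDEX, 5])                 # 1 response token
long = ([1.0, 1.0, 1.0, 1.0], [IGNORE_INDEX, 7, 8, 9])  # 3 response tokens
print(batch_loss([short, long]))  # 1.5
```

<p>Summing over all tokens instead would weight the long sequence three times as heavily as the short one in this toy batch.</p>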
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<p>I have made all the code, datasets, and model checkpoints publicly accessible.</p>
<ul>
<li><strong>Code</strong>: <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/sft">building-from-scratch/sft</a></li>
<li><strong>Datasets</strong>: <a href="https://huggingface.co/datasets/garg-aayush/sft-cs336-assign5-datasets">garg-aayush/sft-cs336-assign5-datasets</a></li>
<li><strong>Checkpoints</strong>:
<ul>
<li>Reasoning:
<ul>
<li>run_all: <a href="https://huggingface.co/garg-aayush/qwen-2.5-math-sft-all">qwen-2.5-math-sft-all</a></li>
<li>run_filtered: <a href="https://huggingface.co/garg-aayush/qwen-2.5-math-sft-filtered">qwen-2.5-math-sft-filtered</a></li>
<li>run_filtered-res-len: <a href="https://huggingface.co/garg-aayush/qwen-2.5-math-sft-filtered-res-len">qwen-2.5-math-sft-filtered-res-len</a></li>
<li>run_filtered-2epoch: <a href="https://huggingface.co/garg-aayush/qwen-2.5-math-sft-filtered-2epoch">qwen-2.5-math-sft-filtered-2epoch</a></li>
</ul></li>
<li>Instruction:
<ul>
<li>run_mask: <a href="https://huggingface.co/garg-aayush/llama31-8b-sft-mask">llama31-8b-sft-mask</a></li>
<li>run_nomask: <a href="https://huggingface.co/garg-aayush/llama31-8b-sft-nomask">llama31-8b-sft-nomask</a></li>
</ul></li>
</ul></li>
<li><strong>Training logs</strong>: <a href="https://wandb.ai/garg-aayush/sft">wandb/sft</a> and <a href="https://wandb.ai/garg-aayush/sft_instruct">wandb/sft_instruct</a></li>
</ul>


</section>

 ]]></description>
  <category>LLM Training</category>
  <guid>https://garg-aayush.github.io/posts/2025-12-28-sft-from-scratch.html</guid>
  <pubDate>Wed, 03 Dec 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>A Guide to Building Custom Nodes in ComfyUI</title>
  <link>https://garg-aayush.github.io/posts/2025-09-10-build-custom-comfyui-node.html</link>
  <description><![CDATA[ 




<p><a href="https://www.comfy.org/">ComfyUI</a> is by far my favorite open-source software right now. Its intuitive node-based interface has transformed the way we build AI image and video generation workflows.</p>
<p>What I really appreciate about ComfyUI is its flexibility. You can easily extend it with your own custom nodes. Here, I’ll show you how to create custom nodes that let you add exactly the tools you need. I’ll use parts from my <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster">Svg2Raster</a> nodes as the running example for this purpose.</p>
<section id="svg2raster" class="level2">
<h2 class="anchored" data-anchor-id="svg2raster"><a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster">Svg2Raster</a></h2>
<p>ComfyUI does not natively support vector graphics like <a href="https://en.wikipedia.org/wiki/SVG">SVGs</a>. I often work with them and needed lightweight nodes to load (SVG-&gt;JPEGs/PNGs) and manipulate SVGs in ComfyUI.</p>
<p>Thus, I built <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster">Svg2Raster</a>, a small custom node package that makes it easy to use SVGs with other nodes. <img src="https://garg-aayush.github.io/static/img/blog-2025-09-10/workflow.png" class="img-fluid" alt="SVG2Raster"></p>
</section>
<section id="writing-your-custom-comfyui-node" class="level1">
<h1>Writing your Custom ComfyUI Node</h1>
<p>I am assuming that you are fairly comfortable using ComfyUI and already have it installed locally or on a cloud instance.</p>
<section id="step-1-validate-the-core-logic-first" class="level2">
<h2 class="anchored" data-anchor-id="step-1-validate-the-core-logic-first">Step 1: Validate the core logic first</h2>
<p>I prefer not to start with the ComfyUI node API. First, I like to write a simple Python notebook to test the functionalities I actually need. This validates your core code logic and packages in isolation.</p>
<p>In my case, I needed a way to read, rasterize and manipulate the SVGs. Thus, I tested all the relevant operations using the core packages <a href="https://cairosvg.org/">CairoSVG</a> and <a href="https://python-pillow.org/">Pillow</a>.</p>
<p>For example:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simple SVG read and conversion check</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cairosvg</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> PIL <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Image, ImageOps</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> io</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Read SVG file</span></span>
<span id="cb1-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'logo.svg'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>, encoding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utf-8'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb1-8">    svg_text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f.read()</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Basic conversion to PNG</span></span>
<span id="cb1-11">img_bytes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cairosvg.svg2png(bytestring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>svg_text.encode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utf-8'</span>), </span>
<span id="cb1-12">                              output_width<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>)</span>
<span id="cb1-13">img <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Image.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(io.BytesIO(img_bytes)).convert(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'RGBA'</span>)</span>
<span id="cb1-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Image size: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>img<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>size<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<p>If you want to see all the code snippets (width/height controls, color and border manipulations etc.), please check out the full <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster/blob/main/svg2png.ipynb">notebook</a>.</p>
<p><strong>Once you have a working standalone script or code, wrapping it as a ComfyUI node is mostly boilerplate.</strong></p>
</section>
<section id="step-2-understand-the-anatomy-of-a-custom-node" class="level2">
<h2 class="anchored" data-anchor-id="step-2-understand-the-anatomy-of-a-custom-node">Step 2: Understand the Anatomy of a Custom Node</h2>
<p>Every ComfyUI node is a Python class with specific methods that ComfyUI expects. Here are the essential components:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Component</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>INPUT_TYPES</code></td>
<td>What inputs your node accepts</td>
</tr>
<tr class="even">
<td><code>RETURN_TYPES</code></td>
<td>What it outputs to other nodes</td>
</tr>
<tr class="odd">
<td><code>RETURN_NAMES</code></td>
<td>Optional labels for outputs</td>
</tr>
<tr class="even">
<td><code>FUNCTION</code></td>
<td>The method name that runs your logic</td>
</tr>
<tr class="odd">
<td><code>CATEGORY</code></td>
<td>Where it appears in ComfyUI’s node menu</td>
</tr>
</tbody>
</table>
<p>ComfyUI handles the rest: the UI, connections and execution order. For example, this is what a simple custom node class looks like:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LoadSVG:</span>
<span id="cb2-2">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb2-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> INPUT_TYPES(cls):</span>
<span id="cb2-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb2-5">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"required"</span>: {</span>
<span id="cb2-6">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"svg_file"</span>: (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"STRING"</span>, {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"default"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"file.svg"</span>}),</span>
<span id="cb2-7">            }</span>
<span id="cb2-8">        }</span>
<span id="cb2-9">    </span>
<span id="cb2-10">    RETURN_TYPES <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"IMAGE"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"STRING"</span>)</span>
<span id="cb2-11">    RETURN_NAMES <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"image"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"svg_text"</span>)</span>
<span id="cb2-12">    FUNCTION <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"load_svg"</span></span>
<span id="cb2-13">    CATEGORY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Svg2Raster"</span></span>
<span id="cb2-14">    </span>
<span id="cb2-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> load_svg(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, svg_file):</span>
<span id="cb2-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your actual logic here</span></span>
<span id="cb2-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (image_tensor, svg_text)</span></code></pre></div></div>
<p>This is a minimal ComfyUI node class explanation that I believe is good enough to start writing your own nodes. If you want more details, check the official ComfyUI custom node <a href="https://docs.comfy.org/custom-nodes/overview">documentation</a>.</p>
</section>
<section id="step-3-implementing-the-loadsvgimage-node" class="level2">
<h2 class="anchored" data-anchor-id="step-3-implementing-the-loadsvgimage-node">Step 3: Implementing the <strong>LoadSVGImage</strong> Node</h2>
<p>First, I set up the file structure in ComfyUI’s <code>custom_nodes</code> folder for my nodes package:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ComfyUI/custom_nodes</span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mkdir</span> svg2raster</span>
<span id="cb3-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> svg2raster</span></code></pre></div></div>
<p>Then I create two essential files:</p>
<p><code>__init__.py</code>: it allows ComfyUI to import your custom nodes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> .svg2raster_node <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__all__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [ <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NODE_CLASS_MAPPINGS"</span>,</span>
<span id="cb4-4">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NODE_DISPLAY_NAME_MAPPINGS"</span>]</span></code></pre></div></div>
<p><code>svg2raster_node.py</code>: this is where the actual node code is written.</p>
<p>You can find the <strong>complete code</strong> for these nodes here: <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster/blob/main/svg2raster_node.py">svg2raster_node.py</a>. Here’s the boilerplate structure of <strong>LoadSVGImage</strong> node:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LoadSVGImage:</span>
<span id="cb5-2">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb5-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> INPUT_TYPES(cls):</span>
<span id="cb5-4">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Define what inputs this node accepts"""</span></span>
<span id="cb5-5">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use `folder_paths` to access ComfyUI's input directory</span></span>
<span id="cb5-6">        input_dir <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> folder_paths.get_input_directory()</span>
<span id="cb5-7">        files <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [f <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> f <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> os.listdir(input_dir) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> os.path.isfile(os.path.join(input_dir, f)) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> f.lower().endswith(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.svg'</span>)]</span>
<span id="cb5-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb5-9">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"required"</span>: {</span>
<span id="cb5-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"svg"</span>: (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(files), {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"image_upload"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>}),</span>
<span id="cb5-11">            }</span>
<span id="cb5-12">        }</span>
<span id="cb5-13">    </span>
<span id="cb5-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Output configuration</span></span>
<span id="cb5-15">    RETURN_TYPES <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"STRING"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"IMAGE"</span>)</span>
<span id="cb5-16">    RETURN_NAMES <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"svg_text"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"preview_image"</span>)</span>
<span id="cb5-17">    FUNCTION <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"load_svg"</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Method name to execute</span></span>
<span id="cb5-18">    CATEGORY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FromSVG/Tools"</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Menu location</span></span>
<span id="cb5-19">    </span>
<span id="cb5-20">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> load_svg(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, svg, background<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#FFFFFF"</span>):</span>
<span id="cb5-21">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Main execution method - does the actual work"""</span></span>
<span id="cb5-22">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your logic here</span></span>
<span id="cb5-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (svg_text, image_tensor)</span>
<span id="cb5-24">    </span>
<span id="cb5-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb5-26">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> IS_CHANGED(cls, svg):</span>
<span id="cb5-27">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Returns file hash or modification time</span></span>
<span id="cb5-28">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span>
<span id="cb5-29">    </span>
<span id="cb5-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb5-31">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> VALIDATE_INPUTS(cls, svg):</span>
<span id="cb5-32">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check if file exists, return error string if invalid</span></span>
<span id="cb5-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div></div>
<p>Here, the helper methods serve crucial purposes:</p>
<ul>
<li><strong><code>IS_CHANGED</code></strong>: Tells ComfyUI when to re-execute the node</li>
<li><strong><code>VALIDATE_INPUTS</code></strong>: Prevents crashes by validating inputs before execution</li>
</ul>
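<p>As a rough sketch of what these two helpers might do for a file-based node: the function names <code>file_fingerprint</code> and <code>check_svg_path</code> are hypothetical and not taken from the repo. They only illustrate the "return a file hash" and "return an error string if invalid" ideas from the comments in the skeleton.</p>

```python
import hashlib
import os


def file_fingerprint(path: str) -> str:
    """Return a SHA-256 hex digest of the file contents.

    Hypothetical helper: an IS_CHANGED classmethod could return this value,
    so the result differs from the previous run only when the bytes change.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_svg_path(path: str):
    """Return True if the path looks usable, otherwise an error string.

    Hypothetical helper: a VALIDATE_INPUTS classmethod could return this.
    """
    if not os.path.isfile(path):
        return f"Invalid SVG file: {path}"
    if not path.lower().endswith(".svg"):
        return f"Not an .svg file: {path}"
    return True
```

<p>A modification time would be cheaper to compute than a hash, at the cost of re-running the node when a file is re-saved without changes.</p>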
<p>ComfyUI expects images as tensors in BHWC format (<code>batch</code>, <code>height</code>, <code>width</code>, <code>channels</code>) with values normalized to 0-1. So you need a function that converts a PIL image to a tensor in that layout.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _pil_to_tensor(pil_img: Image.Image):</span>
<span id="cb6-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Convert PIL image to ComfyUI IMAGE tensor: (B, H, W, C) in [0,1]"""</span></span>
<span id="cb6-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># conversion logic</span></span></code></pre></div></div>
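<p>To make the BHWC layout concrete, here is a dependency-free illustration on plain Python lists. It is not the node's actual conversion code: a real node would typically do this with NumPy and PyTorch, and <code>rgb_bytes_to_bhwc</code> is a hypothetical name. The input is assumed to be row-major RGB bytes, such as what PIL's <code>Image.tobytes()</code> returns for an RGB image.</p>

```python
def rgb_bytes_to_bhwc(pixels: bytes, width: int, height: int):
    """Arrange raw RGB bytes as a nested list shaped [1][H][W][3] in [0, 1]."""
    channels = 3
    if len(pixels) != width * height * channels:
        raise ValueError("pixel buffer does not match width * height * 3")
    image = []
    for row in range(height):
        line = []
        for col in range(width):
            start = (row * width + col) * channels
            # Scale each 0-255 byte to a float in [0, 1].
            line.append([value / 255.0 for value in pixels[start:start + channels]])
        image.append(line)
    # Wrap in an outer list to add the batch dimension (B = 1).
    return [image]


# A 2x1 image: one red pixel followed by one green pixel.
batch = rgb_bytes_to_bhwc(bytes([255, 0, 0, 0, 255, 0]), width=2, height=1)
print(batch)  # prints: [[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]]
```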
<p>Finally, you need the mappings for ComfyUI to discover your nodes:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">NODE_CLASS_MAPPINGS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb7-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"LoadSVGImage"</span>: LoadSVGImage,</span>
<span id="cb7-3">}</span>
<span id="cb7-4">NODE_DISPLAY_NAME_MAPPINGS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb7-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"LoadSVGImage"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Load SVG Image"</span>,</span>
<span id="cb7-6">}</span></code></pre></div></div>
<p>Without these mappings, ComfyUI won’t find your nodes even if the code is correct.</p>
<p>Similarly, I also wrote a <code>RasterizeSVG</code> class for manipulating the loaded SVG. It takes the SVG text from <code>LoadSVGImage</code> and lets you adjust scale, dimensions, borders, and more. You will see the pattern is identical: define inputs, process with CairoSVG/PIL, convert to tensor, return.</p>
<p>That’s how you implement a node. Write a standalone functionality script, wrap it in the ComfyUI class structure, handle the tensor conversions, add the helper methods, and register it with the mappings.</p>
</section>
<section id="step-4-testing-your-custom-node-in-comfyui" class="level2">
<h2 class="anchored" data-anchor-id="step-4-testing-your-custom-node-in-comfyui">Step 4: Testing Your Custom Node in ComfyUI</h2>
<p>Once you are done implementing, testing is straightforward.</p>
<ol type="1">
<li>Ensure all your dependencies are installed, including <code>CairoSVG</code>.</li>
<li>Restart ComfyUI for it to detect the new node.</li>
<li>Find your nodes under the category you defined.</li>
</ol>
<p><strong>Note</strong>: I have also added the installation steps <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster?tab=readme-ov-file#installation">here</a>.</p>
</section>
<section id="step-5-sharing-and-publishing-your-node" class="level2">
<h2 class="anchored" data-anchor-id="step-5-sharing-and-publishing-your-node">Step 5: Sharing and Publishing Your Node</h2>
<p>Once everything worked, I created a GitHub repo for the nodes. You can see how I structured mine in the <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster">Svg2Raster</a> repo.</p>
<p>Some essential files in the repo are:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>File</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>README.md</td>
<td>Clear installation instructions and usage examples</td>
</tr>
<tr class="even">
<td>requirements.txt</td>
<td>Python dependencies (cairosvg in my case)</td>
</tr>
<tr class="odd">
<td>pyproject.toml</td>
<td>Required if you plan to publish to ComfyUI Registry</td>
</tr>
<tr class="even">
<td>examples/</td>
<td>Optional sample SVG files and workflow JSON files</td>
</tr>
</tbody>
</table>
<p><strong>Note</strong>: Having a good README and examples makes a huge difference for users trying to understand and use your nodes.</p>
<p>Once your repo is ready, you can even publish it to the <a href="https://registry.comfy.org/nodes/svg2raster">ComfyUI Registry</a>. There’s an excellent guide on <a href="https://docs.comfy.org/registry/publishing">publishing to ComfyUI Registry</a> - just follow those steps.</p>
<p>I also set up a GitHub Actions workflow that automatically publishes updates to the ComfyUI Registry whenever I push changes to my repo. This ensures the registry always has the latest version. You can check out my <a href="https://github.com/garg-aayush/ComfyUI-Svg2Raster/blob/main/.github/workflows/publish.yaml">workflow file</a> to see how I did it.</p>


</section>
</section>

 ]]></description>
  <category>Tools &amp; Infra</category>
  <guid>https://garg-aayush.github.io/posts/2025-09-10-build-custom-comfyui-node.html</guid>
  <pubDate>Wed, 10 Sep 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Building GPT from Scratch: Following Karpathy’s Tutorial</title>
  <link>https://garg-aayush.github.io/posts/2025-09-08-building-gpt-from-scratch.html</link>
  <description><![CDATA[ 




<p>The Transformer architecture has become the workhorse behind modern LLMs. GPT-2/3/4/5, Llama, Claude, Gemini: they are all built on the core architecture from the 2017 “Attention Is All You Need” paper, or a variant of it. I wanted to understand this architecture properly, so I followed Andrej Karpathy’s <a href="https://www.youtube.com/watch?v=kCc8FmEb1nY">“Let’s Build GPT from Scratch”</a> video. It’s a 2-hour walkthrough where you start from an empty file and end up with a working Transformer.</p>
<p>As I followed along, I captured each architectural addition as a separate commit. This let me see exactly how each component pulled down the validation loss. In this walkthrough, the training data is ~1M characters of Shakespeare and the goal is to generate Shakespeare-like text.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th>Component</th>
<th>Val Loss</th>
<th>Commit</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Baseline</td>
<td>Bigram Model</td>
<td>~2.49</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/e0b5864"><code>e0b5864</code></a></td>
</tr>
<tr class="even">
<td>Update 1</td>
<td>Single Head Self-Attention</td>
<td>~2.4</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/7b0e03a"><code>7b0e03a</code></a></td>
</tr>
<tr class="odd">
<td>Update 2</td>
<td>Multi-Head Attention</td>
<td>~2.28</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/9d2a7b5"><code>9d2a7b5</code></a></td>
</tr>
<tr class="even">
<td>Update 3</td>
<td>Feed-Forward Network</td>
<td>~2.27</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/c4c46ff"><code>c4c46ff</code></a></td>
</tr>
<tr class="odd">
<td>Update 4</td>
<td>Residual Connections</td>
<td>~2.09</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/0239c07"><code>0239c07</code></a></td>
</tr>
<tr class="even">
<td>Update 5</td>
<td>Layer Normalization</td>
<td>~2.076</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/63ef5f8"><code>63ef5f8</code></a></td>
</tr>
<tr class="odd">
<td>Update 6</td>
<td>Pre-LayerNorm (modern)</td>
<td>~2.076</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/4f5bef8"><code>4f5bef8</code></a></td>
</tr>
<tr class="even">
<td>Update 7</td>
<td>Scaling Up + Dropout</td>
<td>~1.48</td>
<td><a href="https://github.com/garg-aayush/building-from-scratch/commit/d4141d7"><code>d4141d7</code></a></td>
</tr>
</tbody>
</table>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/basic-gpt/images/loss_curves.png" class="img-fluid figure-img"></p>
<figcaption>Loss Curves</figcaption>
</figure>
</div>
<p>You can find all the code and notebooks in the repo: <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/basic-gpt">building-from-scratch/basic-gpt</a></p>
<section id="baseline-bigram-model-e0b5864" class="level2">
<h2 class="anchored" data-anchor-id="baseline-bigram-model-e0b5864">Baseline: Bigram Model (<a href="https://github.com/garg-aayush/building-from-scratch/commit/e0b5864"><code>e0b5864</code></a>)</h2>
<p>Karpathy starts with the simplest possible language model: a bigram model. It predicts the next character based only on the current character. No context at all. The tokens aren’t talking to each other.</p>
<p>This still works somewhat because some characters naturally follow others (the letter ‘q’ is almost always followed by ‘u’). But the output is complete gibberish because the model has no way to look at what came before.</p>
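<p>The "next character depends only on the current character" idea can be illustrated with simple counting. Note that this is a count-based stand-in, not the trained embedding-table model from the video, and the helper names are my own.</p>

```python
from collections import Counter, defaultdict


def fit_bigram_counts(text: str):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts, current: str):
    """Return the most frequent follower of `current`, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


counts = fit_bigram_counts("the queen quietly quit")
print(most_likely_next(counts, "q"))  # prints: u
```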
<p><strong>Result</strong>: ~2.49 validation loss.</p>
</section>
<section id="update-1-self-attention-7b0e03a" class="level2">
<h2 class="anchored" data-anchor-id="update-1-self-attention-7b0e03a">Update 1: Self-Attention (<a href="https://github.com/garg-aayush/building-from-scratch/commit/7b0e03a"><code>7b0e03a</code></a>)</h2>
<p>We want tokens to communicate with each other and predictions to consider context from previous tokens, not just the current one. A token at position 5 should be able to look at tokens 1-4 and gather information from them. But at the same time, it can’t look at tokens 6, 7, 8 because those are the future we’re trying to predict.</p>
<p>Self-attention solves this. Every token is represented by 3 vectors:</p>
<ul>
<li><strong>Query</strong>: “What am I looking for?”</li>
<li><strong>Key</strong>: “What do I contain?”</li>
<li><strong>Value</strong>: “If you find me interesting, here’s what I’ll tell you.”</li>
</ul>
<p>The query dot-products with all the keys. High dot product means high affinity: “I find you interesting.” The values of interesting tokens get aggregated via weighted sum.</p>
<blockquote class="blockquote">
<p><strong>Note</strong>: Attention is really a communication mechanism. You can think of it as nodes in a directed graph where every node aggregates information from nodes that point to it. In our case, token 5 can receive information from tokens 1-4 (and itself), but not from tokens 6-8. The triangular mask creates this directed structure and is what makes this a “decoder” block.</p>
</blockquote>
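<p>A minimal pure-Python sketch of a single causal attention head may help here. It assumes the query, key and value vectors are already given (in the real model they come from learned linear projections), and it divides the scores by the square root of the head size as in the original paper. The function name is hypothetical.</p>

```python
import math


def causal_self_attention(queries, keys, values):
    """Single-head causal attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how much token i attends to j.
    """
    head_size = len(queries[0])
    outputs, weights = [], []
    for i, query in enumerate(queries):
        # Scaled dot-product affinity with every token up to and including i.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(head_size)
            for j in range(i + 1)
        ]
        # Softmax over the visible positions only (subtract max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        # Future positions get exactly zero weight: this is the triangular mask.
        weights.append(row + [0.0] * (len(queries) - i - 1))
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][dim] for j, w in enumerate(row))
            for dim in range(len(values[0]))
        ])
    return outputs, weights
```

<p>Printing <code>weights</code> for a short sequence shows the lower-triangular pattern: the first token can only attend to itself, while the last token spreads its attention over the whole prefix.</p>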
<p>One subtle but important point: attention has no notion of space. The tokens don’t inherently know where they are in the sequence. That’s why we add <strong>positional embeddings</strong>. Each position gets its own learned embedding that’s added to the token embedding, giving the model spatial information.</p>
<p><strong>Result</strong>: ~2.4 validation loss. Tokens can now see context.</p>
</section>
<section id="update-2-multi-head-attention-9d2a7b5" class="level2">
<h2 class="anchored" data-anchor-id="update-2-multi-head-attention-9d2a7b5">Update 2: Multi-Head Attention (<a href="https://github.com/garg-aayush/building-from-scratch/commit/9d2a7b5"><code>9d2a7b5</code></a>)</h2>
<p>Tokens have a lot to talk about. One head might look for consonants, another for vowels, another for word boundaries, another for patterns at specific positions. Having multiple independent communication channels lets the model gather diverse types of data in parallel.</p>
<blockquote class="blockquote">
<p><strong>Note</strong>: This is similar to grouped convolutions. Instead of one large convolution, you do it in groups. With 4 heads of 8 dimensions each, we get the same total dimensionality (32) but with 4 separate communication channels. Each head can specialize in different patterns.</p>
</blockquote>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/basic-gpt/images/MHA.png" class="img-fluid figure-img"></p>
<figcaption>Multi-Head Attention</figcaption>
</figure>
</div>
<p><strong>Result</strong>: ~2.28 validation loss.</p>
</section>
<section id="update-3-feed-forward-network-c4c46ff" class="level2">
<h2 class="anchored" data-anchor-id="update-3-feed-forward-network-c4c46ff">Update 3: Feed-Forward Network (<a href="https://github.com/garg-aayush/building-from-scratch/commit/c4c46ff"><code>c4c46ff</code></a>)</h2>
<p>The FFN layer addresses a key problem. Until now, “the tokens looked at each other but didn’t have enough time to think about what they found.”</p>
<p>Self-attention is the <strong>communication</strong> phase. Tokens gather data from each other. But then they need to <strong>compute</strong> on that data individually. That’s what the feed-forward network does. It operates on a per-token level. All the tokens process their gathered information independently.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://raw.githubusercontent.com/garg-aayush/building-from-scratch/main/basic-gpt/images/FFN.png" class="img-fluid figure-img"></p>
<figcaption>Feed-Forward Network</figcaption>
</figure>
</div>
<p>So the Transformer block becomes: <strong>communicate</strong> (attention) → <strong>compute</strong> (feed-forward). This pattern repeats for every layer.</p>
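<p>The per-token nature of the feed-forward step can be sketched as below. The linear, ReLU, linear shape is a common choice that I am assuming here, and the weights in the usage lines are toy values, not learned ones.</p>

```python
def feed_forward(token, w1, b1, w2, b2):
    """Two-layer MLP applied to a single token vector: linear, ReLU, linear.

    w1 and w2 hold one weight vector per output unit of their layer.
    """
    hidden = [
        max(0.0, sum(x * w for x, w in zip(token, unit)) + bias)
        for unit, bias in zip(w1, b1)
    ]
    return [
        sum(h * w for h, w in zip(hidden, unit)) + bias
        for unit, bias in zip(w2, b2)
    ]


def apply_per_token(tokens, w1, b1, w2, b2):
    """Run the same MLP on every token independently (no cross-token mixing)."""
    return [feed_forward(token, w1, b1, w2, b2) for token in tokens]


# Toy weights: 2 inputs, 3 hidden units, 2 outputs.
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -10.0]
w2, b2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]], [0.0, 0.0]
print(apply_per_token([[1.0, 2.0], [3.0, 4.0]], w1, b1, w2, b2))
# prints: [[3.0, -1.0], [7.0, -1.0]]
```

<p>Because each token goes through <code>feed_forward</code> on its own, changing one token never changes another token's output at this stage. Only attention mixes information across positions.</p>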
<p><strong>Result</strong>: ~2.27 validation loss. The architecture now has both communication and computation.</p>
</section>
<section id="update-4-residual-connections-0239c07" class="level2">
<h2 class="anchored" data-anchor-id="update-4-residual-connections-0239c07">Update 4: Residual Connections (<a href="https://github.com/garg-aayush/building-from-scratch/commit/0239c07"><code>0239c07</code></a>)</h2>
<p>This is one of two optimizations that make deep networks actually trainable. Without it, stacking many layers leads to vanishing gradients and optimization difficulties.</p>
<p>Karpathy visualizes it nicely: imagine a residual pathway running from top to bottom. You can “fork off” from this pathway, do some computation, and project back via addition. The path from inputs to outputs is just a series of additions.</p>
<blockquote class="blockquote">
<p><strong>Note</strong>: Why does this help? During backpropagation, addition distributes gradients equally to both branches. The gradients “hop” through every addition node directly to the input. This creates a <strong>“gradient superhighway”</strong> from supervision to input, unimpeded. The residual blocks are initialized to contribute very little at first, then “come online” over time during optimization.</p>
</blockquote>
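<p>A small numeric experiment makes the "gradient superhighway" tangible. This is a toy setup of my own, not code from the video: each block is the scalar function <code>weight * tanh(x)</code>, and we compare the derivative of the output with respect to the input with and without the skip connection.</p>

```python
import math


def input_gradient(depth: int, weight: float, x: float, residual: bool) -> float:
    """Return d(output)/d(input) for `depth` stacked blocks f(x) = weight * tanh(x).

    With residual=True each block computes x + f(x); otherwise just f(x).
    """
    gradient = 1.0
    for _ in range(depth):
        branch = weight * math.tanh(x)
        # Derivative of the branch f at the current activation.
        branch_grad = weight * (1.0 - math.tanh(x) ** 2)
        if residual:
            # d/dx [x + f(x)] = 1 + f'(x): the "1" is the skip path.
            gradient *= 1.0 + branch_grad
            x = x + branch
        else:
            gradient *= branch_grad
            x = branch
    return gradient


plain = input_gradient(depth=10, weight=0.1, x=0.5, residual=False)
skip = input_gradient(depth=10, weight=0.1, x=0.5, residual=True)
print(f"plain: {plain:.2e}, residual: {skip:.2f}")
```

<p>With a small branch weight, the plain stack multiplies ten factors below 0.1 and the gradient all but disappears, while every factor in the residual stack is at least 1, so the signal reaches the input intact.</p>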
<p><strong>Result</strong>: ~2.09 validation loss. Now we can stack layers without vanishing gradients.</p>
</section>
<section id="update-5-6-layer-normalization-63ef5f8-4f5bef8" class="level2">
<h2 class="anchored" data-anchor-id="update-5-6-layer-normalization-63ef5f8-4f5bef8">Update 5 &amp; 6: Layer Normalization (<a href="https://github.com/garg-aayush/building-from-scratch/commit/63ef5f8"><code>63ef5f8</code></a>, <a href="https://github.com/garg-aayush/building-from-scratch/commit/4f5bef8"><code>4f5bef8</code></a>)</h2>
<p>Batch normalization normalizes columns (across examples in a batch). Layer normalization normalizes rows (across features for each example). The implementation is almost identical; you just change which dimension you normalize over.</p>
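<p>The row-versus-column difference can be shown in a few lines. This sketch leaves out the learnable scale and shift parameters and the running statistics that a real batch norm layer keeps. The function names are my own.</p>

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def layer_norm(batch):
    """Normalize each row: across the features of one example."""
    return [normalize(row) for row in batch]


def batch_norm(batch):
    """Normalize each column: across the examples for one feature."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    # Transpose back to rows of examples.
    return [list(row) for row in zip(*columns)]
```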
<p>Layer norm has advantages for Transformers:</p>
<ul>
<li>No dependency on batch size (works even with batch size 1)</li>
<li>No running buffers to maintain</li>
<li>No distinction between training and test time</li>
</ul>
<p>The original Transformer paper used <strong>post-layer norm</strong> (normalize after attention/FFN). Modern implementations use <strong>pre-layer norm</strong> (normalize before). Pre-layer norm creates a cleaner residual pathway since the transformation happens on normalized inputs, leading to more stable training.</p>
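<p>The two orderings differ only in where the normalization sits relative to the residual addition. In this sketch, <code>center</code> (mean subtraction only) is a simplified stand-in for layer norm and <code>double</code> is a stand-in for attention or the FFN. Both are assumptions made purely for illustration.</p>

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def post_norm_block(x, sublayer, norm):
    """Original Transformer ordering: normalize after the residual addition."""
    return norm(add(x, sublayer(x)))


def pre_norm_block(x, sublayer, norm):
    """Pre-norm ordering: normalize only the branch input, keep the skip path raw."""
    return add(x, sublayer(norm(x)))


def center(vector):
    """Toy stand-in for layer norm: subtract the mean."""
    mean = sum(vector) / len(vector)
    return [v - mean for v in vector]


def double(vector):
    """Toy stand-in for an attention or feed-forward sublayer."""
    return [2.0 * v for v in vector]


print(pre_norm_block([1.0, 3.0], double, center))   # prints: [-1.0, 5.0]
print(post_norm_block([1.0, 3.0], double, center))  # prints: [-3.0, 3.0]
```

<p>In the pre-norm version the input <code>x</code> reaches the output through a bare addition, whereas in the post-norm version the normalization also sits on the skip path.</p>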
<p><strong>Result</strong>: ~2.076 validation loss.</p>
</section>
<section id="update-7-scaling-up-d4141d7" class="level2">
<h2 class="anchored" data-anchor-id="update-7-scaling-up-d4141d7">Update 7: Scaling Up (<a href="https://github.com/garg-aayush/building-from-scratch/commit/d4141d7"><code>d4141d7</code></a>)</h2>
<p>With all the architectural pieces in place, Karpathy scales up the architecture:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Parameter</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Block size (context)</td>
<td>8</td>
<td>256</td>
</tr>
<tr class="even">
<td>Embedding dim</td>
<td>32</td>
<td>384</td>
</tr>
<tr class="odd">
<td>Heads</td>
<td>4</td>
<td>6</td>
</tr>
<tr class="even">
<td>Layers</td>
<td>3</td>
<td>6</td>
</tr>
<tr class="odd">
<td>Dropout</td>
<td>0</td>
<td>0.2</td>
</tr>
</tbody>
</table>
<p>Dropout is added for regularization. It randomly shuts off neurons during training, effectively training an ensemble of sub-networks. At test time, everything is enabled and the sub-networks merge.</p>
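<p>Here is a minimal sketch of dropout on a list of activations. It uses the "inverted" variant, where surviving values are scaled up during training, which is my assumption about the flavour in use. The function name and the optional <code>rng</code> argument are my own.</p>

```python
import random


def dropout(values, p, training, rng=random):
    """Inverted dropout: zero each value with probability p while training.

    Survivors are scaled by 1 / (1 - p) so the expected value is unchanged,
    which lets evaluation mode return the input untouched.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 8
print(dropout(activations, p=0.2, training=True, rng=rng))
print(dropout(activations, p=0.2, training=False))
```

<p>The first call prints a mix of zeros and scaled-up values, and the second returns the activations unchanged.</p>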
<p><strong>Result</strong>: ~1.48 validation loss. The generated text now looks like Shakespeare (structure, dialogue formatting, character names) even though it’s nonsensical when you actually read it.</p>
</section>
<section id="how-this-compares-to-gpt-3" class="level2">
<h2 class="anchored" data-anchor-id="how-this-compares-to-gpt-3">How This Compares to GPT-3</h2>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th>My Model</th>
<th>GPT-3</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Parameters</td>
<td>~10M</td>
<td>175B</td>
</tr>
<tr class="even">
<td>Dataset</td>
<td>~300K tokens</td>
<td>300B tokens</td>
</tr>
<tr class="odd">
<td>Architecture</td>
<td>Nearly identical</td>
<td>Nearly identical</td>
</tr>
</tbody>
</table>
<p>The architecture we built is essentially the same as GPT-3. The difference is pure scale: 17,500x more parameters trained on 1 million times more data. By today’s standards, even GPT-3’s 300B tokens is considered modest. Current models train on 1T+ tokens.</p>
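<p>The two ratios follow directly from the table. A quick check, using the approximate figures above:</p>

```python
# Figures taken from the comparison table above (both are approximate).
my_params, gpt3_params = 10_000_000, 175_000_000_000
my_tokens, gpt3_tokens = 300_000, 300_000_000_000

param_ratio = gpt3_params // my_params
token_ratio = gpt3_tokens // my_tokens
print(f"{param_ratio:,}x parameters, {token_ratio:,}x tokens")
# prints: 17,500x parameters, 1,000,000x tokens
```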
<p>This is what makes the Transformer architecture so remarkable. The same fundamental design (attention for communication, feed-forward for computation, residual connections, layer norm) scales from a 10M parameter Shakespeare generator to a 175B parameter model!</p>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ul>
<li><strong>Code</strong>: <a href="https://github.com/garg-aayush/building-from-scratch/tree/main/basic-gpt">building-from-scratch/basic-gpt</a></li>
<li><strong>Video</strong>: <a href="https://www.youtube.com/watch?v=kCc8FmEb1nY">Let’s Build GPT from Scratch</a> by Andrej Karpathy</li>
</ul>


</section>

 ]]></description>
  <category>Transformers</category>
  <guid>https://garg-aayush.github.io/posts/2025-09-08-building-gpt-from-scratch.html</guid>
  <pubDate>Mon, 08 Sep 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Key Takeaways from Lecture 1: LLM Evaluation Lifecycle</title>
  <link>https://garg-aayush.github.io/posts/2025-09-02-llm-evaluation-lifecycle.html</link>
  <description><![CDATA[ 




<p>A couple of months back, I enrolled in <a href="https://maven.com/parlance-labs/evals">AI Evals for Engineers and PMs</a>, a course by <a href="https://hamel.dev/">Hamel</a> and <a href="https://www.sh-reya.com/">Shreya</a>. The live cohort ran from July to mid-August, but due to work commitments I couldn’t follow along in real time.</p>
<p>I have now started following it as a self-paced course and plan to write a blog for each lesson as I progress. This will be my way to capture what I learn and to reflect on the material. In this first blog 🤞, I’ll walk through my key takeaways from the introductory <strong>Lecture 1</strong>.</p>
<section id="key-takeaways" class="level1">
<h1>Key Takeaways</h1>
<section id="evaluation-isnt-optional-but-fundamental" class="level2">
<h2 class="anchored" data-anchor-id="evaluation-isnt-optional-but-fundamental">1. Evaluation isn’t Optional but Fundamental</h2>
<p>Anyone who has built or worked with LLM pipelines knows that their outputs are open-ended, subjective, and unstructured (unless you enforce structure). Relying on ad-hoc checks, which I have been guilty of, often leads to knee-jerk fixes. Moreover, it completely misses the long-term need for continuous tracking, which is essential for improving your pipeline’s reliability and usefulness. This is why <strong>Evaluation, the systematic measurement of an LLM pipeline’s quality, is critical!</strong></p>
</section>
<section id="the-three-gulfs" class="level2">
<h2 class="anchored" data-anchor-id="the-three-gulfs">2. The Three Gulfs</h2>
<p>The image below beautifully captures and categorizes the challenges associated with any LLM application: <img src="https://garg-aayush.github.io/static/img/blog-2025-09-02/three-gulfs.png" class="img-fluid" alt="Three gulfs"></p>
<ul>
<li><p><strong>Gulf of Comprehension</strong>: This is a result of limited understanding of the input data (user queries) and the pipeline’s outputs (behavior). Bridging it requires examining examples to identify common failure modes. This brings its own challenge: <strong>“How do you manually review every input or output to identify failure modes?”</strong></p></li>
<li><p><strong>Gulf of Specification</strong>: It refers to the difficulty of translating a user’s high-level intent into unambiguous, precise instructions for the LLM. Bridging it requires writing detailed prompts that capture the “true intent”, which in itself is challenging due to the <strong>ambiguous nature of natural language.</strong></p></li>
<li><p><strong>Gulf of Generalization</strong>: This is due to LLMs’ unexpected and inconsistent behavior on new or unusual (out-of-distribution) inputs. Bridging it requires a good understanding of your model’s capabilities. This leads to the question: <strong>“How do you improve the LLM?”</strong></p></li>
</ul>
</section>
<section id="analyze-measure-improve-lifecycle" class="level2">
<h2 class="anchored" data-anchor-id="analyze-measure-improve-lifecycle">3. Analyze → Measure → Improve Lifecycle</h2>
<p>Hamel and Shreya introduced a structured way to bridge the above gulfs: <strong>Analyze → Measure → Improve</strong> lifecycle.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2025-09-02/pitfalls.png" class="img-fluid figure-img"></p>
<figcaption>Analyze → Measure → Improve Lifecycle</figcaption>
</figure>
</div>
<p>However, the most important takeaway for me was not what each phase means but the <strong>pitfalls</strong> that often derail it:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 48%">
<col style="width: 45%">
</colgroup>
<thead>
<tr class="header">
<th>Phase</th>
<th>Pitfalls</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Analyze</strong></td>
<td>Outsourcing annotation; looking at too few examples and forming shaky hypotheses</td>
<td>This is where you learn the most. Spend <strong>~75–80%</strong> of your time here—good analysis sets up everything else.</td>
</tr>
<tr class="even">
<td><strong>Measure</strong></td>
<td>Misaligned or poorly designed LLM judges; “overfitting” by testing judges on the same examples used in the judge prompt</td>
<td>In this phase, you need the rigor of data science. <strong>NEVER</strong> leak test data into judge prompts.</td>
</tr>
<tr class="odd">
<td><strong>Improve</strong></td>
<td>Prematurely jumping to fixes; defaulting to the most complex solution first (fine-tuning, bigger models)</td>
<td><strong>Start simple</strong>. Prompt tweaks and improvements often go a long way before heavier changes are needed.</td>
</tr>
</tbody>
</table>
</section>
<section id="llms-are-imperfectprompt-iteratively" class="level2">
<h2 class="anchored" data-anchor-id="llms-are-imperfectprompt-iteratively">4. LLMs are Imperfect—Prompt Iteratively</h2>
<p>When we write prompts it’s easy to ignore that LLMs are non-deterministic, prompt-sensitive and can confidently hallucinate. Thus, always remember: <strong><em>“LLMs are powerful but imperfect components. Leverage strengths, anticipate weaknesses.”</em></strong></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2025-09-02/llm-strengths-weaknesses.jpg" class="img-fluid figure-img"></p>
<figcaption>LLM Strengths vs.&nbsp;Weaknesses</figcaption>
</figure>
</div>
<p><strong>Effective prompting starts with you.</strong> You should not delegate the prompting to an LLM or you will miss important failure modes. Instead, write your own draft prompt and if needed, use an LLM only to polish clarity.</p>
<p>From there on, treat prompting as an iterative process where the first draft is a starting point which you refine based on observed outputs.</p>
</section>
<section id="reference-based-vs-reference-free-metrics" class="level2">
<h2 class="anchored" data-anchor-id="reference-based-vs-reference-free-metrics">5. Reference-based vs Reference-free Metrics</h2>
<p>The evaluation metrics broadly fall into two categories: <strong>reference-free</strong> and <strong>reference-based</strong>. Both of them are useful but in different contexts.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 10%">
<col style="width: 44%">
<col style="width: 44%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Reference-Free</strong></th>
<th><strong>Reference-Based</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>What it means</strong></td>
<td>Evaluates properties of the output itself (no golden answer required)</td>
<td>Compares output against a golden reference or ground truth</td>
</tr>
<tr class="even">
<td><strong>When to use</strong></td>
<td>Creative or open-ended tasks, formatting/structure checks, validity tests</td>
<td>Tasks with clearly defined correct answers (e.g., factual QA, deterministic outputs)</td>
</tr>
<tr class="odd">
<td><strong>Examples</strong></td>
<td>- Does the output follow the JSON format?<br>- Does generated code/SQL run without errors?</td>
<td>- Exact match against a gold SQL query<br>- ROUGE/BLEU score for text generation</td>
</tr>
</tbody>
</table>
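<p>To make the distinction concrete, here is one check of each kind. These are simplified illustrations with hypothetical function names and are not taken from the course material.</p>

```python
import json


def is_valid_json(output: str) -> bool:
    """Reference-free check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def exact_match(output: str, reference: str) -> bool:
    """Reference-based check: compare against a gold answer, ignoring
    case and runs of whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(output) == normalize(reference)


print(is_valid_json('{"city": "Delhi"}'))                           # prints: True
print(exact_match("SELECT *  FROM users", "select * from users"))   # prints: True
```

<p>The first function needs nothing but the model output, while the second cannot run without a golden answer to compare against.</p>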


</section>
</section>

 ]]></description>
  <category>Paper Notes</category>
  <guid>https://garg-aayush.github.io/posts/2025-09-02-llm-evaluation-lifecycle.html</guid>
  <pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Part III: Fine-tuning Llama-3-8B for Structured Functional Representation Extraction</title>
  <link>https://garg-aayush.github.io/posts/2024-07-15-finetune-llama3-8B-predibase.html</link>
  <description><![CDATA[ 




<p>Last week, I published the <a href="https://aayushgarg.dev/2024-07-09-compare-models-structured-data/">second blog</a> in my LLM fine-tuning series, comparing various models’ performance in functional representation extraction.</p>
<p>In this third part of the series, I discuss the <strong>first steps toward fine-tuning an open-source LLM for functional representation extraction</strong>. My aim is to give you all a sneak peek at the kind of performance you can expect from fine-tuning an LLM for a custom task. To streamline this step (and to satisfy my own curiosity 😊), I will use <a href="https://predibase.com/">Predibase</a>. It is a fast, cheap, and efficient open-source LLM fine-tuning and deployment platform.</p>
<blockquote class="blockquote">
<p>FYI: I have some free Predibase credits through Dan’s and Hamel’s LLM course. Therefore, it is a perfect opportunity to put those credits to good use! 😬</p>
</blockquote>
<blockquote class="blockquote">
<p>Note: Whenever I mention “finetuning LLM,” I am specifically referring to LoRA (Low-Rank Adaptation) finetuning of a Large Language Model. For an overview of LoRA, please read Sebastian Raschka’s blogs (<a href="https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html">LORA Blog 1</a>, <a href="https://magazine.sebastianraschka.com/p/llm-research-insights-instruction">LORA Blog 2</a>).</p>
</blockquote>
<section id="task-and-dataset" class="level2">
<h2 class="anchored" data-anchor-id="task-and-dataset">Task and Dataset</h2>
<p>As in my previous blogs, the custom task is to predict the structured functional representation of a given piece of text (a video game opinion) from the <a href="https://huggingface.co/datasets/GEM/viggo">ViGGO validation dataset</a>.</p>
<p><strong>To make this exercise interesting and challenging for future experiments, I will use a maximum of 1000 examples for fine-tuning any LLM, instead of the full ~5K train dataset</strong>.</p>
<p>Below is an example from the randomly selected <code>1K</code> train dataset:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">Text                      : I remember you saying that you loved The Room. Do you tend to enjoy PC games <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2012</span>?</span>
<span id="cb1-2">functional_representation : verify_attribute(name[The Room], release_year[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2012</span>], rating[excellent], platforms[PC])</span></code></pre></div></div>
<p>As shown in the graph below, the selected <code>1K</code> dataset is a fairly representative sample of the full ViGGO train dataset.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/viggo_function_name_distribution_1K.png" class="img-fluid figure-img"></p>
<figcaption>Understanding Data Distribution</figcaption>
</figure>
</div>
</section>
<section id="upload-the-dataset-to-predibase" class="level2">
<h2 class="anchored" data-anchor-id="upload-the-dataset-to-predibase">Upload the dataset to Predibase</h2>
<p>Predibase requires you to upload the instruction fine-tuning dataset in a particular format. This is from the <a href="https://docs.predibase.com/user-guide/fine-tuning/prepare-data#how-to-structure-your-dataset">Predibase docs</a>:</p>
<blockquote class="blockquote">
<p>For instruction fine-tuning, your dataset must contain two columns named prompt and completion:</p>
<ul>
<li>prompt: Your input prompt. It serves as the starting point or the guiding information for the model.</li>
<li>completion: The expected response that corresponds to the input provided in the “prompt” column.</li>
<li>split (optional): Should be either train or evaluation. To learn more, check out this section.</li>
</ul>
</blockquote>
<p><strong>Make sure to add the prompt template to the examples and convert them to the correct format.</strong> For this exercise, I use the following prompt template:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">prompt_template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""Given a target sentence convert it structured functional representation.</span></span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">### Target sentence: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{text}</span></span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">### Output Functional representation:</span></span>
<span id="cb2-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span></code></pre></div></div>
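<p>As a minimal sketch of the conversion step, the snippet below wraps each example in the prompt template and emits <code>prompt</code>, <code>completion</code> and <code>split</code> columns as CSV text. The <code>examples</code> list is a hypothetical stand-in for the sampled ViGGO rows, and the helper name <code>to_rows</code> is my own; only the column names come from the Predibase docs quoted above.</p>

```python
import csv
import io

# Same template as above.
prompt_template = """Given a target sentence convert it structured functional representation.

### Target sentence: {text}

### Output Functional representation:
"""

# Hypothetical stand-in for the sampled ViGGO rows. The keys "target" and
# "meaning_representation" are the GEM/viggo field names used in this post.
examples = [
    {
        "target": "I remember you saying that you loved The Room. "
                  "Do you tend to enjoy PC games from 2012?",
        "meaning_representation": "verify_attribute(name[The Room], "
                                  "release_year[2012], rating[excellent], platforms[PC])",
    },
]


def to_rows(examples, split="train"):
    """Turn raw examples into prompt/completion rows (illustrative helper)."""
    return [
        {
            "prompt": prompt_template.format(text=ex["target"]),
            "completion": ex["meaning_representation"],
            "split": split,
        }
        for ex in examples
    ]


rows = to_rows(examples)

# Write the rows as CSV text; multi-line prompts are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion", "split"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

<p>Writing <code>csv_text</code> to a file would give a dataset in the expected shape; the real notebook may build the file differently.</p>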
<p>You can connect your dataset to Predibase via the UI or the Python SDK. Here, I will upload the dataset using the SDK.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize Predibase client</span></span>
<span id="cb3-2">pb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Predibase(api_token<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PREDIBASE_API_TOKEN"</span>])</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Upload the dataset</span></span>
<span id="cb3-5">dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pb.datasets.from_file(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"viggo_train_val_dataset_1K.csv"</span>, </span>
<span id="cb3-6">                                name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"viggo_train_val_dataset_1K"</span>)</span></code></pre></div></div>
<p>Once uploaded, you can check the dataset on the Predibase UI. <img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/predibase_ui_dataset.png" class="img-fluid" alt="Dataset"> <img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/predibase_ui_dataset2.png" class="img-fluid" alt="Dataset"></p>
<p>For detailed steps on uploading the dataset to Predibase, please refer to the companion <a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Finetune_llama-3-8b_Predibase.ipynb">blog notebook</a>.</p>
</section>
<section id="setup-and-finetune" class="level2">
<h2 class="anchored" data-anchor-id="setup-and-finetune">Setup and Finetune</h2>
<p>Once you have uploaded the dataset, running the fine-tuning process is refreshingly simple. For this example, I fine-tune the base <code>llama-3-8b</code> model with the following parameters: <code>epochs=3</code>, <code>rank=16</code>, and <code>learning_rate=2e-4</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create an adapter repository</span></span>
<span id="cb4-2">repo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pb.repos.create(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"viggo-finetune-1K"</span>, </span>
<span id="cb4-3">                description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Llama-3-8b adapter repository for viggo 1K examples"</span></span>
<span id="cb4-4">                )</span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create and run the fine-tuning job</span></span>
<span id="cb4-7">adapter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pb.adapters.create(</span>
<span id="cb4-8">   config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>FinetuningConfig(</span>
<span id="cb4-9">       base_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"llama-3-8b"</span>,</span>
<span id="cb4-10">       epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb4-11">       rank<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>,</span>
<span id="cb4-12">       learning_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0002</span>,</span>
<span id="cb4-13">   ),</span>
<span id="cb4-14">   dataset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dataset,</span>
<span id="cb4-15">   repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>repo,</span>
<span id="cb4-16">   description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"baseline-llama-3-8b"</span>,</span>
<span id="cb4-17">)</span></code></pre></div></div>
<p>That’s all you need to do to submit a job! Once completed, it will be available on the Predibase platform.</p>
<p><img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/predibase_ui_train1.png" class="img-fluid" alt="train-1"> <img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/predibase_ui_train2.png" class="img-fluid" alt="train-2"></p>
<p>You can always tweak multiple hyperparameters (see <a href="https://docs.predibase.com/sdk-guide/SDKv2/ConfigClasses/FineTuningConfig">Finetuning Config</a>) and run the fine-tune job again. All your fine-tune jobs will be available on the Predibase platform.</p>
</section>
<section id="evaluate-the-fine-tuned-model" class="level2">
<h2 class="anchored" data-anchor-id="evaluate-the-fine-tuned-model">Evaluate the Fine-tuned Model</h2>
<p>Predibase provides both <code>Serverless endpoints</code> and <code>Dedicated deployments</code> options for open-source LLMs and their fine-tuned LoRA checkpoints. I will use a serverless endpoint for this case.</p>
<blockquote class="blockquote">
<p>Note: at least for now, <a href="https://docs.predibase.com/user-guide/inference/serverless_endpoints">serverless deployments</a> are available for free.</p>
</blockquote>
<section id="generate-the-responses-for-validation-dataset" class="level3">
<h3 class="anchored" data-anchor-id="generate-the-responses-for-validation-dataset">Generate the responses for validation dataset</h3>
<p>Similar to my previous blogs, I will evaluate the finetuned model on the ViGGO <code>validation</code> dataset and calculate custom <a href="https://aayushgarg.dev/2024-07-09-compare-models-structured-data/">performance metrics</a> for a better understanding of the finetuned model's performance.</p>
<p>First, I generate the responses for the validation dataset:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize the Predibase deployment client</span></span>
<span id="cb5-2">lorax_client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pb.deployments.client(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"llama-3-8b"</span>)</span>
<span id="cb5-3"></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the validation dataset</span></span>
<span id="cb5-5">viggo_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_dataset(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GEM/viggo"</span>)</span>
<span id="cb5-6">val_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> viggo_dataset[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'validation'</span>]</span>
<span id="cb5-7"></span>
<span id="cb5-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># finetuned adapter id</span></span>
<span id="cb5-9">adapter_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"viggo-finetune-1K/2"</span> </span>
<span id="cb5-10"></span>
<span id="cb5-11">responses_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb5-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(val_dataset)):</span>
<span id="cb5-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Processing </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>idx<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(val_dataset)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-14">    output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lorax_client.generate(prompt_template.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(text<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>val_dataset[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"target"</span>][idx]), adapter_id<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>adapter_id, max_new_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>).generated_text</span>
<span id="cb5-15">    ground_truth <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_dataset[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"meaning_representation"</span>][idx]</span>
<span id="cb5-16">    text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> val_dataset[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"target"</span>][idx]</span>
<span id="cb5-17">    responses_dict[idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"output"</span>: output, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ground_truth"</span>: ground_truth, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>: text}</span></code></pre></div></div>
<p><strong>Note: Remember to replace “viggo-finetune-1K/2” with the correct adapter ID. You can find the adapter ID in the Predibase dashboard.</strong></p>
<p>Now, I can generate the evaluation scores using custom evaluation metrics and compare them with the previously calculated GPT-4o and Claude 3.5 Sonnet scores:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/finetuned_llama-3-8B_baseline.png" class="img-fluid figure-img"></p>
<figcaption>plot-finetuned-model</figcaption>
</figure>
</div>
<p>The initial finetuning of LLaMA-3-8B using 1,000 random examples from the ViGGO dataset, while not surpassing GPT-4o and Claude 3.5 Sonnet, shows promising results and outperforms several models from our previous blog. Notably, the <code>exact_match</code> score is even better than that of the two best-performing models.</p>
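<p>The sketch below shows one way two such scores could be computed from a dictionary shaped like <code>responses_dict</code>. The definitions here are assumptions made for illustration: <code>exact_match</code> is taken as string equality after stripping whitespace, and <code>function_name_match</code> compares the text before the first opening parenthesis. The metrics actually used are defined in the linked post and notebook and may differ. The toy responses are made up.</p>

```python
def function_name(representation):
    """Return the text before the first "(" (assumed to be the function name)."""
    return representation.strip().split("(", 1)[0].strip()


def score(responses):
    """Compute two illustrative metrics over a responses_dict-shaped mapping."""
    total = len(responses)
    exact = sum(
        r["output"].strip() == r["ground_truth"].strip()
        for r in responses.values()
    )
    name = sum(
        function_name(r["output"]) == function_name(r["ground_truth"])
        for r in responses.values()
    )
    return {"exact_match": exact / total, "function_name_match": name / total}


# Made-up responses: the first is an exact match, the second only gets
# the function name right and misses an attribute.
toy_responses = {
    0: {
        "output": "request(release_year[2014], specifier[terrible])",
        "ground_truth": "request(release_year[2014], specifier[terrible])",
        "text": "Were there even any terrible games in 2014?",
    },
    1: {
        "output": "recommend(name[The Wolf Among Us])",
        "ground_truth": "recommend(name[The Wolf Among Us], developer[Telltale Games])",
        "text": "Since we're on the subject of games developed by Telltale Games, "
                "I'm wondering, have you played The Wolf Among Us?",
    },
}

scores = score(toy_responses)
print(scores)
```

<p>Separating the two scores makes it easier to see whether a miss comes from the wrong function or from wrong attributes.</p>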
</section>
</section>
<section id="improved-performance-with-updated-prompt-template" class="level2">
<h2 class="anchored" data-anchor-id="improved-performance-with-updated-prompt-template">Improved Performance with Updated Prompt Template</h2>
<p>A simple yet effective way to enhance the model’s performance is by refining the prompt template. By providing clearer instructions that convey the structure of the functional representation, we can guide the model to produce more accurate outputs.</p>
<p>I updated the prompt template as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">prompt_template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""Given a target sentence construct the underlying meaningful functional representation of the input sentence as a single function with attributes and attribute values.</span></span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">### Target sentence: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{text}</span></span>
<span id="cb6-4"></span>
<span id="cb6-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">### Output Functional representation:</span></span>
<span id="cb6-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span></code></pre></div></div>
<p>After uploading this new dataset, finetuning the model, and evaluating it with the new adapter <code>viggo-finetune-1K/3</code>, the evaluation metrics improve significantly. Notably, the model now surpasses GPT-4o’s scores for <code>exact_match</code> and <code>function_name_match</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2024-07-15/finetuned_llama-3-8B_update1.png" class="img-fluid figure-img"></p>
<figcaption>plot-finetuned-model</figcaption>
</figure>
</div>
<p><strong>This improvement highlights the importance of clear and specific instructions in prompt engineering, even when working with finetuned models.</strong></p>
</section>
<section id="conclusions.." class="level2">
<h2 class="anchored" data-anchor-id="conclusions..">Conclusions..</h2>
<ul>
<li><p><strong>My overall experience with Predibase has been positive, particularly in terms of rapid finetuning of models</strong>. While there are some limitations such as restricted hyperparameter tuning, a standardized dataset format, and the inability to download adapters in the developer tier, it offers a user-friendly platform for fine-tuning (LoRA) large language models. I was able to quickly upload the dataset, set up the job, fine-tune the model and run inference.</p></li>
<li><p>I achieve good out-of-the-box performance using only <code>1K</code> random examples, although the fine-tuned llama-3-8b model doesn’t match the performance of GPT-4o and Claude 3.5 Sonnet on all metrics. <strong>This demonstrates the potential of fine-tuning with limited data, highlighting the efficiency of the approach for task-specific model adaptation.</strong></p></li>
</ul>
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">Next steps…</h2>
<ul>
<li>My next goal is to further enhance the model’s performance on evaluation metrics while maintaining a limit of 1,000 training examples. <em>The <strong><a href="https://arxiv.org/abs/2305.11206">LIMA: Less Is More for Alignment</a></strong> paper has demonstrated that even 1,000 well-curated examples can lead to strong finetuning performance.</em></li>
<li>Careful curation of examples and tuning of hyperparameters should improve the scores on the evaluation metrics.</li>
<li>In addition, I will deep dive into one of my favorite LLM fine-tuning tools, <a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a>.</li>
</ul>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://aayushgarg.dev/2024-07-03-baseline-gpt4o-structured-data/">Part II blog post of series</a></li>
<li><a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Analyze_Viggo_Dataset.ipynb">Notebook I: Analyze_Viggo_Dataset.ipynb</a></li>
<li><a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Finetune_llama-3-8b_Predibase.ipynb">Notebook II: Finetune_llama-3-8b_Predibase.ipynb</a></li>
<li><a href="https://predibase.com/">Predibase Platform</a></li>
<li><a href="https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html">LORA blog I: Finetuning Large Language Models (LLMs)</a></li>
<li><a href="https://magazine.sebastianraschka.com/p/llm-research-insights-instruction">LORA blog II: LLM Research Insights: Instruction Tuning &amp; Training Paradigms</a></li>
<li><a href="https://llama.meta.com/llama3/">Llama 3</a></li>
<li><a href="https://arxiv.org/abs/2305.11206">LIMA: Less Is More for Alignment</a></li>
<li><a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl: A Framework for Fine-tuning LLMs</a></li>
</ul>
<hr>
<p>Thanks for reading! If you have any questions or feedback, please let me know on <a href="https://twitter.com/Aayush_ander">Twitter</a> or <a href="https://www.linkedin.com/in/aayush-garg-8b26a734/">LinkedIn</a>.</p>


</section>

 ]]></description>
  <category>LLM Training</category>
  <guid>https://garg-aayush.github.io/posts/2024-07-15-finetune-llama3-8B-predibase.html</guid>
  <pubDate>Mon, 15 Jul 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Part II: Comparison of Model Performances on Structured Functional Representation Extraction</title>
  <link>https://garg-aayush.github.io/posts/2024-07-09-compare-models-structured-data.html</link>
  <description><![CDATA[ 




<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">Introduction</h3>
<p>In the <a href="https://aayushgarg.dev/2024-07-03-baseline-gpt4o-structured-data/">previous blog post</a>, I established a performance baseline using GPT-4o for generating structured data, particularly functional representations, from text using the <a href="https://huggingface.co/datasets/GEM/viggo">ViGGO Dataset</a>.</p>
<p>Building on that foundation, I expand the experiment to include a broader range of models, both open-source and proprietary. This comparison aims to provide insights on how well these models perform out of the box in structured data extraction tasks, which is quite crucial for RAG applications, knowledge base construction, and reasoning systems.</p>
<p>I evaluate and compare the performance of these six LLMs:</p>
<ol type="1">
<li><p><a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>: OpenAI’s latest iteration of the GPT-4 model, known for its faster generation, advanced natural language understanding and generation capabilities. Currently one of the most popular and capable models.</p></li>
<li><p><a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude Sonnet-3.5</a>: Anthropic’s refined language model with enhanced reasoning abilities. It aims to provide more context-aware outputs compared to earlier versions and has recently outperformed GPT-4o on many benchmarks.</p></li>
<li><p><a href="https://deepmind.google/technologies/gemini/flash/">Gemini-1.5-Flash</a>: Google DeepMind’s streamlined version of the Gemini model, optimized for faster inference and reduced computational requirements.</p></li>
<li><p><a href="https://replicate.com/meta/meta-llama-3-70b-instruct">llama-3-70b-instruct</a>: Meta’s large-scale instruction-tuned language model, part of the latest LLaMA 3 family, with 70 billion parameters. It’s designed to follow complex instructions and generate high-quality text across diverse domains.</p></li>
<li><p><a href="https://replicate.com/mistralai/mixtral-8x7b-instruct-v0.1">mixtral-8x7b-instruct-v0.1</a>: Mistral AI’s instruction-tuned variant of the Mixtral 8x7B model, known for its mixture-of-experts architecture.</p></li>
<li><p><a href="https://replicate.com/meta/meta-llama-3-8b-instruct">llama-3-8b-instruct</a>: A more compact version of Meta’s LLaMA 3 family, with 8 billion parameters, optimized for instruction following.</p></li>
</ol>
<blockquote class="blockquote">
<p><strong>Note</strong>: I’ve included the smaller <code>Llama-3-8B</code> model as I plan to finetune its base 8B model in the coming days. This will give me a reference point: the performance of the general instruction-finetuned 8B model.</p>
</blockquote>
</section>
<section id="dataset-and-prompt-template" class="level3">
<h3 class="anchored" data-anchor-id="dataset-and-prompt-template">Dataset and Prompt Template</h3>
<p>For consistency and fair comparison, I used the same ViGGO validation dataset and the prompt template as in the previous blog post. The prompt template and the few-shot examples are designed to guide the models in generating structured functional representations from the given text input:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">PROMPT_TEMPLATE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb1-2"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. </span></span>
<span id="cb1-3"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']. The order your list the attributes within the function must follow the order listed above. For example the 'name' attribute must always come before the 'exp_release_date' attribute, and so forth.</span></span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">For each attribute, fill in the corresponding value of the attribute within brackets. A couple of examples are below. Note: you are to output the string after "Output: ". Do not include "Output: " in your answer.</span></span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 1)</span></span>
<span id="cb1-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.</span></span>
<span id="cb1-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])</span></span>
<span id="cb1-12"></span>
<span id="cb1-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 2) </span></span>
<span id="cb1-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Were there even any terrible games in 2014?</span></span>
<span id="cb1-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: request(release_year[2014], specifier[terrible])</span></span>
<span id="cb1-16"></span>
<span id="cb1-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 3)</span></span>
<span id="cb1-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Adventure games that combine platforming and puzzles  can be frustrating to play, but the side view perspective is perfect for them. That's why I enjoyed playing Little Nightmares.</span></span>
<span id="cb1-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: give_opinion(name[Little Nightmares], rating[good], genres[adventure, platformer, puzzle], player_perspective[side view])</span></span>
<span id="cb1-20"></span>
<span id="cb1-21"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 4)</span></span>
<span id="cb1-22"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Since we're on the subject of games developed by Telltale Games, I'm wondering, have you played The Wolf Among Us?</span></span>
<span id="cb1-23"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: recommend(name[The Wolf Among Us], developer[Telltale Games])</span></span>
<span id="cb1-24"></span>
<span id="cb1-25"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 5) </span></span>
<span id="cb1-26"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Layers of Fear, the indie first person point-and-click adventure game?</span></span>
<span id="cb1-27"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: confirm(name[Layers of Fear], genres[adventure, indie, point-and-click], player_perspective[first person])  </span></span>
<span id="cb1-28"></span>
<span id="cb1-29"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 6) </span></span>
<span id="cb1-30"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: I bet you like it when you can play games on Steam, like Worms: Reloaded, right?  </span></span>
<span id="cb1-31"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: suggest(name[Worms: Reloaded], available_on_steam[yes])</span></span>
<span id="cb1-32"></span>
<span id="cb1-33"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 7)</span></span>
<span id="cb1-34"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: I recall you saying that you really enjoyed The Legend of Zelda: Ocarina of Time. Are you typically a big fan of games on Nintendo rated E (for Everyone)?    </span></span>
<span id="cb1-35"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: verify_attribute(name[The Legend of Zelda: Ocarina of Time], esrb[E (for Everyone)], rating[excellent], platforms[Nintendo])</span></span>
<span id="cb1-36"></span>
<span id="cb1-37"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 8)</span></span>
<span id="cb1-38"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: So what is it about the games that were released in 2005 that you find so excellent?  </span></span>
<span id="cb1-39"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: request_explanation(release_year[2005], rating[excellent])</span></span>
<span id="cb1-40"></span>
<span id="cb1-41"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Example 9)</span></span>
<span id="cb1-42"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Sentence: Do you think Mac is a better gaming platform than others?</span></span>
<span id="cb1-43"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Output: request_attribute(has_mac_release[])</span></span>
<span id="cb1-44"></span>
<span id="cb1-45"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Give the output for the following sentence:</span></span>
<span id="cb1-46"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{input}</span></span>
<span id="cb1-47"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span></code></pre></div></div>
</section>
<section id="generating-the-responses" class="level3">
<h3 class="anchored" data-anchor-id="generating-the-responses">Generating the responses</h3>
<p>I used the official APIs for the closed models (<code>GPT-4o</code>, <code>Gemini-1.5-flash</code>, <code>Claude-3.5-Sonnet</code>) and the <a href="https://replicate.com/">Replicate</a> API client for the open-source models (<code>Llama-3-70B</code>, <code>Llama-3-8B</code>, <code>Mistral-8x7B</code>).</p>
<p><strong>For detailed information on API endpoints and the process of generating responses for all models, please refer to the <a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Generate_responses_all_llms.ipynb">Generate_responses_all_llms.ipynb</a> notebook.</strong></p>
<p>Generating responses for this experiment had the following associated costs:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model/API</th>
<th>Cost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>GPT-4o API</td>
<td>~ $2.5</td>
</tr>
<tr class="even">
<td>Claude-3.5-Sonnet API</td>
<td>~ $3.5</td>
</tr>
<tr class="odd">
<td>Replicate API</td>
<td>~ $3</td>
</tr>
<tr class="even">
<td>Gemini-1.5-flash API</td>
<td>Free*</td>
</tr>
<tr class="odd">
<td><strong>Total</strong></td>
<td><strong>~$9</strong></td>
</tr>
</tbody>
</table>
<p><em>*for limited usage</em></p>
</section>
<section id="evaluation-strategy" class="level3">
<h3 class="anchored" data-anchor-id="evaluation-strategy">Evaluation Strategy</h3>
<p>To assess the models’ performance, I used the same evaluation criteria as in the <a href="https://aayushgarg.dev/2024-07-03-baseline-gpt4o-structured-data/">previous post</a>:</p>
<ol type="1">
<li><strong>Function Name Match</strong>: The function name must match the ground truth function name.</li>
<li><strong>Function and Attributes Match</strong>: The generated function name and attributes must match the ground truth function name and attributes. The order of the attributes does not matter.</li>
<li><strong>Function, Attributes, and Values Match</strong>: The generated function name, attributes, and values must match the ground truth function name, attributes, and values. The order of the attributes and values does not matter.</li>
<li><strong>Exact Match</strong>: The generated function must exactly match the ground truth function.</li>
</ol>
<p><strong>Note</strong>: I implemented custom Python functions using regex and string manipulation to calculate these metrics, rather than relying on another LLM for evaluation. This approach helps avoid potential biases that might be introduced by using an LLM in the evaluation process.</p>
<p><strong>For the complete evaluation code and functions used, please refer to the <a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Compare_model_performances.ipynb">Compare_model_performances.ipynb</a> notebook.</strong></p>
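<p>As a rough illustration of the four criteria above, here is a minimal sketch of how they could be computed with regex and string manipulation. It is not the notebook's implementation: the names <code>parse</code>, <code>evaluate</code>, <code>FUNC_RE</code> and <code>ATTR_RE</code> are hypothetical, and it assumes every output has the form <code>function(attribute[value], ...)</code> with no square brackets inside a value.</p>

```python
import re
from collections import Counter

# Matches "function_name(...)" and captures the name and the raw argument string.
FUNC_RE = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$", re.DOTALL)
# Matches one "attribute[value]" pair; the value may be empty, as in "has_mac_release[]".
ATTR_RE = re.compile(r"(\w+)\[([^\]]*)\]")


def parse(text):
    """Split a functional representation into (function name, [(attribute, value), ...])."""
    match = FUNC_RE.match(text)
    if match is None:
        # Malformed output: no function name could be recovered.
        return None, []
    name, body = match.groups()
    pairs = [(attr, value.strip()) for attr, value in ATTR_RE.findall(body)]
    return name, pairs


def evaluate(prediction, truth):
    """Return the four match flags, ordered from least to most strict."""
    pred_name, pred_pairs = parse(prediction)
    true_name, true_pairs = parse(truth)
    name_ok = pred_name is not None and pred_name == true_name
    # Comparing Counters ignores order but still respects repeated attributes.
    attrs_ok = name_ok and Counter(a for a, _ in pred_pairs) == Counter(a for a, _ in true_pairs)
    values_ok = name_ok and Counter(pred_pairs) == Counter(true_pairs)
    return {
        "function_name_match": name_ok,
        "function_and_attributes_match": attrs_ok,
        "function_attributes_values_match": values_ok,
        "exact_match": prediction.strip() == truth.strip(),
    }


# Same content as the ground truth, but with the attributes in a different order.
truth = "suggest(name[Worms: Reloaded], available_on_steam[yes])"
prediction = "suggest(available_on_steam[yes], name[Worms: Reloaded])"
print(evaluate(prediction, truth))
```

<p>In this sketch the reordered prediction passes the first three checks and fails only the exact match, which is why the exact-match metric is the most stringent of the four.</p>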
</section>
<section id="comparing-the-models-performance" class="level3">
<h3 class="anchored" data-anchor-id="comparing-the-models-performance">Comparing the models’ performance</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://garg-aayush.github.io/static/img/blog-2024-07-09/all_metrics_comparison.png" class="img-fluid figure-img"></p>
<figcaption>Compare Models Performance</figcaption>
</figure>
</div>
<p><strong>Based on the evaluation metric plot</strong>:</p>
<ol type="1">
<li><code>Claude Sonnet-3.5</code> and <code>GPT-4o</code> consistently outperform other models across all metrics.</li>
<li><code>Claude Sonnet-3.5</code> performs better than <code>GPT-4o</code>, consistent with the chatter on Twitter about its superior performance.</li>
<li><code>Gemini-1.5-Flash</code>, despite its optimization for speed, maintains competitive performance for less stringent metrics but struggles significantly with exact matches.</li>
<li>The performance gap between the top-performing models and others widens sharply for more stringent metrics.</li>
<li>As expected, the smaller <code>Llama-3-8B</code> model shows the lowest performance, highlighting the evident size advantage of larger models.</li>
<li><code>Mistral-8x7B</code>'s performance is lower than anticipated, suggesting weaker instruction-following capabilities.</li>
</ol>
<p><strong>Some key observations</strong></p>
<blockquote class="blockquote">
<p>Based on quickly eyeballing the generated responses:</p>
</blockquote>
<ol type="1">
<li>Almost all models consistently capture straightforward attributes such as player perspective and multiplayer status.</li>
<li>For queries involving multiple attributes or conditions, the models sometimes miss or misinterpret parts of the input.</li>
<li>The models struggle to capture subtle distinctions in opinions and to infer information that is implied but not explicitly stated in the text. For example, they had trouble differentiating between <code>inform</code>, <code>give_opinion</code> and <code>suggest</code>.</li>
</ol>
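<p>The confusion between function names noted above could also be counted rather than eyeballed. The following is a minimal sketch under that assumption; the helpers <code>function_name</code> and <code>confusion_counts</code> are hypothetical, and the two example pairs are made up purely to show the call. They are not results from this experiment.</p>

```python
import re
from collections import Counter


def function_name(text):
    """Return the identifier before the first "(", or None if the output is malformed."""
    match = re.match(r"\s*(\w+)\s*\(", text)
    return match.group(1) if match else None


def confusion_counts(predictions, truths):
    """Count (ground-truth name, predicted name) pairs where the two names differ."""
    counts = Counter()
    for pred, truth in zip(predictions, truths):
        true_name, pred_name = function_name(truth), function_name(pred)
        if true_name != pred_name:
            counts[(true_name, pred_name)] += 1
    return counts


# Toy, made-up pairs used only to demonstrate the call.
truths = [
    "suggest(name[Worms: Reloaded], available_on_steam[yes])",
    "request_explanation(release_year[2005], rating[excellent])",
]
predictions = [
    "inform(name[Worms: Reloaded], available_on_steam[yes])",
    "request_explanation(release_year[2005], rating[excellent])",
]
print(confusion_counts(predictions, truths).most_common())
```

<p>Sorting the counts with <code>most_common()</code> puts the most frequently confused function pairs first, which is a quick way to check whether mix-ups such as <code>suggest</code> versus <code>inform</code> dominate the errors.</p>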
</section>
<section id="conclusions" class="level3">
<h3 class="anchored" data-anchor-id="conclusions">Conclusions</h3>
<p>This comparison provides valuable insights into the capabilities of various LLMs in functional representation extraction. As expected, large proprietary models like <code>Claude Sonnet-3.5</code> and <code>GPT-4o</code> perform best out of the box, with <code>Claude Sonnet-3.5</code> coming out on top.</p>
<p>One can argue that these results may not fully represent the models’ maximum capabilities. Model-specific prompt engineering or dynamic few-shot examples could potentially improve performance further. However, my aim is just to assess how well a model performs without fancy RAG, function calling, or complicated prompt engineering approaches.</p>
</section>
<section id="next-steps." class="level3">
<h3 class="anchored" data-anchor-id="next-steps.">Next steps</h3>
<ul>
<li><p>I plan to fine-tune the base <code>Llama-3-8B</code> and other smaller models (like <code>Phi</code> and <code>Gemma</code>) on the ViGGO dataset to assess whether a fine-tuned smaller model can compete with or surpass the performance of <code>Claude Sonnet-3.5</code> and <code>GPT-4o</code>.</p></li>
<li><p>At the same time, I plan to investigate the trade-offs in inference speed, accuracy, and latency between fine-tuned smaller models and proprietary model API calls.</p></li>
</ul>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://aayushgarg.dev/2024-07-03-baseline-gpt4o-structured-data/">Part I blog post of series</a></li>
<li><a href="https://huggingface.co/datasets/GEM/viggo">ViGGO Dataset</a></li>
<li><a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Generate_responses_all_llms.ipynb">Notebook I: Generate_responses_all_llms.ipynb</a></li>
<li><a href="https://github.com/garg-aayush/llm-warehouse/blob/main/tutorials/Compare_model_performances.ipynb">Notebook II: Compare_model_performances.ipynb</a></li>
</ul>
<hr>
<p>Thanks for reading! If you have any questions or feedback, please let me know on <a href="https://twitter.com/Aayush_ander">Twitter</a> or <a href="https://www.linkedin.com/in/aayush-garg-8b26a734/">LinkedIn</a>.</p>


</section>

 ]]></description>
  <category>LLM Evaluation</category>
  <guid>https://garg-aayush.github.io/posts/2024-07-09-compare-models-structured-data.html</guid>
  <pubDate>Tue, 09 Jul 2024 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
