Aayush Garg

A Brief Introduction to Claude Agent Skills

Tue, 13 Jan 2026 00:00:00 GMT

If you have been following X or LinkedIn for the last few weeks, you have probably noticed the buzz around Claude Skills (or Agent Skills), apart from all the fanfare around Claude Code. There are tweets and posts appreciating their simplicity, devs sharing custom skills for everything from document generation to API integrations. I would say the hype is well deserved and genuine.

I have been aware of Claude Skills since they launched in October 2025 (thanks to Simon Willison’s blog post). However, I did not really dig into them until I came across this Hugging Face blog post where they used Claude Code to fine-tune an open-source LLM. They built Hugging Face Skills that let you say something like “Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots” and Claude handles everything: GPU selection, script generation, job submission, progress monitoring and pushing the finished model to the Hub. That was a woah moment for me!

Skills deserve all the attention they are getting. They provide the domain knowledge that modern LLMs and agents need despite their impressive general capabilities. They are simple folders with packaged expertise that agents can dynamically invoke for relevant requests.

In this post, I will explain what skills are, why they matter and walk you through one of the skills I use daily: a simple image editing skill as an example of how you can quickly build skills for your own use.

What Are Agent Skills?

Simply put, Skills are organized folders that package expertise into discoverable capabilities. Each skill contains a SKILL.md, a markdown file with some YAML metadata and the instructions that Claude reads when relevant, along with optional supporting files like scripts and templates.

As Barry described in his AI Engineer talk, think of them as “expertise packages” that Claude can discover and load dynamically.

That is really all there is to skills. There are no complex protocols, no server infrastructure and no elaborate configuration. Just text files that describe how to do something, optionally paired with scripts that make the task more reliable. Simon Willison put it well in his blog post: “Skills feel a lot closer to the spirit of LLMs - throw in some text and let the model figure it out.”

Since Anthropic released Agent Skills as an open standard in December 2025, skills are rapidly becoming available across different coding agents: GitHub Copilot, Codex CLI, Cursor and more.

It is also important to distinguish skills from other customization options like claude.md, MCP servers, and subagents.

	Skills	Claude.md	MCP Servers	Subagents
What it is	Folders containing instructions, scripts, and resources that teach Claude how to perform specialized tasks	A markdown file that tells Claude about a specific project	A protocol that connects Claude to external data sources	Specialized AI assistants with fixed roles and their own context window
Scope	Portable across projects and agents	Project-specific	Tool/service-specific	Task-specific
Lives in	`~/.claude/skills/` or `.claude/skills/`	Root of your repository	External servers (GitHub, Slack, databases)	Spawned during task execution
Example	Generate accessible PDF reports according to company guidelines	This project uses Next.js 14, Tailwind, and PostgreSQL	Connect to our PostgreSQL database	Code review agent with read-only permissions
Use when	You need repeatable domain expertise across multiple projects	You are onboarding Claude to a specific codebase	You need Claude to access external data or services	You want to delegate distinct subtasks to specialized assistants

This is how I differentiate them:

Claude.md gives Claude context about where it’s working
MCP gives Claude access to data and services
Skills give Claude expertise on how to do things well
Subagents let Claude delegate work to specialized assistants

How Skills are Dynamically Loaded

I have used the term “dynamically loaded” quite a few times now. This is because Skills are progressively disclosed using a three-tier loading mechanism:

Metadata only (~30-100 tokens): At startup, Claude sees just the name and description from each skill’s YAML frontmatter. This is enough to know the skill exists.
Full instructions (when relevant): When your request matches a skill’s description, Claude loads the complete SKILL.md content.
Supporting files (on demand): Scripts, templates, and reference docs are loaded only when actually needed.

This is what makes skills so useful and efficient. Claude can have hundreds of skills installed without blowing up the context window. Compare this to MCP servers where tool descriptions often consume hundreds or even thousands of tokens upfront regardless of whether you use them.

For a deeper dive into the loading mechanics, please go through the Claude Skills Cookbook.

What Makes Skills Valuable

Once you start using skills, you will discover many benefits. For me, the following ones stand out:

Skills provide Claude domain expertise, turning Claude from a brilliant generalist into a domain expert for your specific workflows.
Write a skill once, use it anywhere. The same skill works in Claude.ai, Claude Code, and via the API. And since Agent Skills became an open standard, they work across other agents like GitHub Copilot, Codex CLI and Cursor.
Skills ensure repeatability and reliability. Often, skills include scripts and pre-defined workflows, so Claude does not reinvent the wheel every time. It uses code that is known to work consistently.
They are progressively loaded, which keeps them token-efficient. They also save time because Claude reuses existing solutions instead of generating code from scratch.
Moreover, you don’t need to be a developer to create skills. You can use the Skills-Creator skill in the Claude app or Claude Code to build your own skills just by describing (and iterating) what you want.

How Skills are Organized

The basic structure is surprisingly simple. Every skill follows the same pattern:

my-skill/
├── SKILL.md          # Required: the brain of the skill
├── scripts/          # Optional: utility scripts
│   └── helper.py
│.....                # Other optional files

The only required file is SKILL.md. Everything else is optional and loaded only when needed. SKILL.md has two parts: YAML frontmatter for metadata and the markdown content for instructions.

For example, here is the frontmatter from my basic image editing skill:

---
name: basic-image-editing
description: Image manipulation tool for resizing, rotation, flipping, cropping, padding, format conversion (JPEG/PNG/WebP/TIFF/HEIC), transparency operations (remove/replace/extract/blend), grayscale conversion, auto-cropping borders, and file size optimization. Use when users need to transform, convert, or optimize images.
---

This is what gets loaded in Claude’s context at startup. You can see the full SKILL.md on GitHub.
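To make the metadata tier concrete, here is a minimal sketch of how a tool could pull only the name and description out of a SKILL.md. The helper name `read_skill_metadata` and the flat `key: value` parsing are my own assumptions for illustration. This is not how Claude is implemented, and a real loader would use a proper YAML parser.

```python
def read_skill_metadata(skill_md_text):
    """Return the frontmatter of a SKILL.md as a dict of strings.

    Hypothetical helper: it only handles flat `key: value` lines between
    the first two `---` markers, which is enough for name and description.
    """
    lines = skill_md_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing marker: stop before the instructions
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


# Shortened, made-up SKILL.md used only for this example.
example = """---
name: basic-image-editing
description: Image manipulation tool for resizing, rotation and cropping.
---
# Basic Image Editing
Full instructions live here and are only read when the skill is relevant.
"""

meta = read_skill_metadata(example)
print(meta["name"], "->", meta["description"])
```

Only these two short strings need to sit in context at startup. The body below the second `---` is read later, when a request matches the description.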

After the frontmatter comes the actual instruction content. Your SKILL.md content usually includes:

A brief definition of what the skill does
Clear instructions on how to use it (commands, workflows, steps)
Examples showing common use cases with actual command snippets. I find that providing examples in the SKILL.md file really helps Claude understand the correct way to call the scripts.
Edge cases or constraints if any exist
References to supporting scripts or templates if the skill uses them

Building Your Own Skill

There are multiple ways to create skills:

Write everything from scratch: You have full control. Create the folder structure manually, write the SKILL.md yourself and add your own scripts. I would not recommend this approach. It is better to use Claude to generate the first version and iterate over it.
Use Claude Code with the Skills Creator skill: If you are already working in Claude Code, you can describe what you want and let Claude scaffold the skill for you. See Eleanor Berger’s videos where she walks through an example of building invoice and reports generation skills in Claude Code.
Use the Claude web app: This is my preferred approach, especially for personal use. Everything happens in a UI-based interface. You describe what you want, have a conversation with Claude, test it immediately and iterate until it works.

Walkthrough: Creating the Basic Image Editing Skill

Let me walk through how I created my basic image editing skill using the third approach.

Step 1: Have Clarity on What the Skill Should Do

Before creating a skill, you should have a clear idea of what it should do. Maybe you want a skill that generates PDF reports on the fly or one that handles data analysis or something else entirely. It doesn’t matter if you don’t know how to implement it. What matters is knowing what you want it to accomplish.

For example, I work a lot with images. I often need to do basic editing operations like resizing, rotating, and cropping. These are simple operations but I do them constantly. A basic image editing skill accessible in Claude makes a lot of sense for my day to day work.

Step 2: Bootstrap with the Skills Creator

First of all, ensure the skill-creator skill is enabled in your Claude account. You can do this by going to Settings -> Capabilities and enabling the skill-creator skill.

Once you have enabled the skill-creator, you can describe the skill you want to create. For example, I asked for a skill for basic image editing.

Claude used the skill-creator framework to scaffold the initial structure: a SKILL.md with proper YAML frontmatter and a Python script with core operations.

Step 3: Test Immediately, Iterate Constantly

One thing I always make sure of: I don’t try to create a polished skill in one go. Whatever skill gets created, test it immediately. This is one of the things I appreciate about creating skills in the Claude web app. You can test them instantly in the same conversation.

For example, I uploaded a test image and asked Claude to change the background to orange.

This gives you a tight feedback loop. You test what you’re building, and if something doesn’t work, Claude can fix it for you right there. No context-switching between environments. You build, test, and refine in one place.

Step 4: Iterate and Improve

Once you have a basic working version, you can start adding more operations and optimizing.

For example, since this skill runs Python scripts locally, I switched to PEP 723 inline metadata so that uv run installs dependencies automatically on the fly.
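For readers who have not seen PEP 723 before, here is a minimal sketch of what such a script can look like. The file name `resize.py`, the helper names and the single Pillow dependency are assumptions for illustration, not the actual script from my skill.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["pillow"]
# ///
"""Hypothetical resize helper illustrating PEP 723 inline metadata."""
import sys


def scaled_size(width, height, max_side):
    """Scale (width, height) so the longer side equals max_side, keeping aspect ratio."""
    scale = max_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize(src, dst, max_side):
    """Resize the image at src and save it to dst."""
    # Imported lazily: uv installs Pillow from the metadata block above.
    from PIL import Image

    with Image.open(src) as im:
        im.resize(scaled_size(im.width, im.height, max_side)).save(dst)


if __name__ == "__main__" and len(sys.argv) == 4:
    resize(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

With this header in place, `uv run resize.py input.jpg output.jpg 1024` creates a temporary environment with Pillow installed before running the script, so the skill does not need a separate requirements file.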

If you have some domain knowledge, use it. Go through the code Claude generated and check if it’s doing something redundant. You can review and ask Claude to make those optimizations.

Finally, once you have a version you are happy with, go through the SKILL.md file and make sure the examples are good. If not, add them manually. Good examples help Claude understand exactly how to use the skill.

Step 5: Export and Use Across Agents

Once you have a skill you want to keep, export it. You can use it in the Claude web app, Claude Code, or other agents like Codex CLI and Cursor.

The Iterative Philosophy

Treat skills as living documents, not finished products. Your first version may not be perfect. You will find gaps and discover better ways to do the same tasks. Maybe the description doesn’t trigger for certain phrasings. Maybe you forgot an edge case. That is expected.

I think of skills the same way I think of code: you write it, use it, find gaps and improve it. Always make sure the skill you use today will be more polished than the one you started with.

Conclusion

Skills are simple yet powerful. A skill is just a folder with a markdown file and, optionally, some scripts that provide domain expertise to your agents without blowing up the context window. If you work with AI agents regularly, building your own skills is worth the investment. Start with something you do repeatedly. Create the simplest version that works and improve it iteratively.

References

Simon Willison’s Blog Post: A good overview of Claude Skills.
Barry’s AI Engineer Talk: I highly recommend watching this talk if you are interested in skills.
Agent Skills Documentation
Claude Skills Cookbook
Using Skills with the API
Agent Skills Open Standard

Key Insights from DeepSeekMath paper

Tue, 06 Jan 2026 00:00:00 GMT

Over the weekend, I finished reading the DeepSeekMath paper which introduced GRPO (the RL algorithm I covered in my previous post). Below are my thoughts and key takeaways from the paper.

In this paper, the authors show that a small domain-specific model (7B parameters) can approach the performance of SOTA general models like GPT-4 on competition-level math when it is pre-trained on a sufficiently large, well-curated math corpus (120B tokens) and then reinforced with RL. It outperforms the major open-source models available at that time, including math-specialized models of the same or larger size like WizardMath-v1.1 7B, Llemma 34B and MetaMath 70B. DeepSeekMath-7B achieves 51.7% on the competition-level MATH benchmark.

In a nutshell: data quality and domain-specific pre-training are more critical for mathematical reasoning than sheer parameter count.

Key Insights

Iterative Data Curation Pipeline

One of the biggest contributors to DeepSeekMath-7B performance is its pre-training corpus which is a 120B-token, high-quality mathematical dataset built from Common Crawl using an iterative fastText classifier-based pipeline.

What stood out to me here is not just the dataset and data curation pipeline itself but what it implies:

Firstly, they created the pre-training corpus from Common Crawl, which shows that a well-thought-out data curation pipeline can extract high-quality domain-specific data from public Common Crawl data.
Secondly, the resulting corpus is substantially larger than OpenWebMath (roughly 9 times larger), reinforcing the point that scale matters, as long as quality is maintained.

Note, I will go deeper into the pipeline later in this post because I feel it is broadly reusable beyond math.

GRPO: PPO Without the Critic

The second major novelty is their more memory-efficient alternative to PPO: GRPO (Group Relative Policy Optimization).

At a high level, GRPO removes the need to train a separate critic (value) model. Instead of learning a value function for advantage estimation, GRPO samples a group of completions per prompt (64 in their experiments) and uses the group’s mean reward as the baseline, with advantages normalized by the group’s standard deviation. This significantly reduces memory and compute overhead while preserving the stability mechanisms associated with PPO (clipping and KL regularization).

For a detailed explanation of GRPO, see my previous post on Understanding GRPO.

Code Training Helps Math

The paper also provides evidence for the hypothesized connection between code training and reasoning. In their results, models that underwent code pre-training before math training showed improved performance on mathematical benchmarks, both with and without tool use. Thus, DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B, not a general language model.

My understanding is code pushes the model toward more structured, step-wise patterns of reasoning and that structure transfers well to math.

ArXiv Papers are Surprisingly Ineffective

ArXiv papers are often a default ingredient in many math pre-training recipes, but the authors report that pre-training on arXiv content was not helpful in their setup. In some cases, it led to no improvement or even to worse performance.

Note that the authors are cautious here: they do not claim this is definitively true and present it as an empirical finding that needs more study to confirm.

Still, I found this interesting. It suggests arXiv-style technical writing might be more useful for formal exposition (or informalization) than for improving competition-style problem solving.

Online RL Training is Superior to Offline

Sampling training data from the real-time policy model (online) significantly outperforms sampling from the initial SFT model (offline).

In their experiments, Online Rejection Sampling Fine-Tuning (RFT) significantly outperformed standard (offline) RFT on both GSM8K and MATH benchmarks. While the two methods perform similarly in the early stages of training, Online RFT gains a distinct advantage as training progresses.

As the policy diverges from the initial SFT model, data sampled from SFT becomes less relevant to the current model’s decision boundaries. Early on when the policy is close to SFT, it doesn’t matter much. Later, this staleness hurts the performance.

RL Sharpens Distribution, Does not Expand Model Capability

This is one of my favorite analyses in the paper. The authors compare Maj@K (majority voting accuracy) and Pass@K (whether any of K samples is correct) for both the Instruct and RL models.

At K=64 samples, both models reach similar Pass@K ceilings (around 83-85% on MATH) which indicates that the fundamental capability is the same. However, RL consistently outperforms on Maj@K.

This suggests RL isn’t expanding fundamental reasoning abilities. Instead it is sharpening the distribution of the model’s output which boosts the correct responses that were already within the model’s capability.
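The difference between the two metrics is easy to see in code. Below is a small sketch with made-up samples for a single problem. The answer strings and K=8 are hypothetical and are not data from the paper.

```python
from collections import Counter


def pass_at_k(samples, correct):
    """1 if any of the K sampled answers is correct, else 0."""
    return int(correct in samples)


def maj_at_k(samples, correct):
    """1 if the most frequent sampled answer is correct, else 0."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return int(most_common_answer == correct)


# Hypothetical K=8 samples for one problem whose correct answer is "42".
before_rl = ["41", "42", "41", "40", "41", "42", "39", "41"]
after_rl = ["42", "42", "41", "42", "42", "42", "41", "42"]

for name, samples in [("before RL", before_rl), ("after RL", after_rl)]:
    print(name, "Pass@K =", pass_at_k(samples, "42"), "Maj@K =", maj_at_k(samples, "42"))
```

In this toy case the correct answer is reachable both before and after RL, so Pass@K is unchanged. RL only shifts probability mass toward it, which is what moves Maj@K.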

The Iterative Data Curation Pipeline

As I mentioned earlier, one of the key contributions of this paper is their data curation pipeline that extracts high-quality mathematical content from Common Crawl.

The reason I am going into detail here is that you can draw parallels from this approach to create any other domain-specific dataset from publicly available data like Common Crawl. The pipeline is iterative and uses a fastText classifier at its core. Crudely, this is how the pipeline works:

1) Train a fastText Classifier

They start with OpenWebMath as a seed corpus (high-quality math web text). Using this seed, they build a binary classification dataset:

Sample ~500K examples from the seed corpus as positive examples
Sample ~500K random web pages from Common Crawl as negative examples
Train a fastText binary classifier on this labeled data

4) Expand the Seed Corpus and Repeat

Finally, these new math related web pages are added to the seed corpus and the fastText classifier is retrained. This process is repeated until some sort of convergence is reached.

This approach enables training an improved classifier with each iteration leading to better recall of math-related web pages in each subsequent iteration.

According to the paper, the pipeline converges after 4 iterations with 98% of the data already collected in the third round.
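The train, recall, expand and retrain loop can be sketched in a few lines. To keep the example self-contained I use a toy word-count scorer as a stand-in for the fastText classifier. The function names, the threshold and the tiny documents are all assumptions for illustration, and the real pipeline operates on Common Crawl at a completely different scale.

```python
from collections import Counter


def train_scorer(positives, negatives):
    """Toy stand-in for fastText: words seen more in positives get higher weight."""
    pos = Counter(w for doc in positives for w in doc.lower().split())
    neg = Counter(w for doc in negatives for w in doc.lower().split())
    weights = {w: pos[w] - neg[w] for w in pos.keys() | neg.keys()}

    def score(doc):
        words = doc.lower().split()
        return sum(weights.get(w, 0) for w in words) / max(1, len(words))

    return score


def curate(seed, pool, negatives, threshold=0.3, max_rounds=4):
    """Iteratively grow the seed corpus with pages the current scorer recalls."""
    seed, remaining = list(seed), list(pool)
    for _ in range(max_rounds):
        score = train_scorer(seed, negatives)       # retrain on the current seed
        recalled = [d for d in remaining if score(d) > threshold]
        if not recalled:                            # nothing new: treat as converged
            break
        seed.extend(recalled)                       # expand the seed corpus
        remaining = [d for d in remaining if d not in recalled]
    return seed


seed = ["solve the equation x squared"]
negatives = ["buy cheap shoes online"]
pool = ["the equation has two roots", "polynomial roots", "cheap shoes sale"]

curated = curate(seed, pool, negatives)
print(curated)
```

In this toy run, "polynomial roots" is only recalled in the second round, after the first round has added a page that mentions roots to the seed. That is the same effect as the improved recall across iterations described above.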

Conclusion

DeepSeekMath is worth reading for a few reasons.

First, it shows that a well-curated dataset matters more than model size for math reasoning.
Second, the data curation pipeline is explained in enough detail that you can adapt it for other domains.
Third, the ablation studies and experiments are genuinely useful.

Understanding GRPO: PPO without the Critic

Thu, 01 Jan 2026 00:00:00 GMT

In my previous posts, I worked through the derivations of PPO and DPO for LLM post-training. PPO gave us a full-fledged RL approach with clipped surrogate objectives, value functions and GAE-based advantage estimation. DPO on the other hand, showed a clever way to bypass RL entirely by reformulating the optimization as a simple classification loss on preference pairs.

That brings us to Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper. If you have been following recent developments in reasoning models throughout 2025, GRPO has become one of the most widely used post-training algorithms behind open-source reasoning models.

In simple terms, GRPO can be thought of as PPO without the critic (value function). Recall that PPO trains a value function in addition to the policy to estimate baselines for advantage computation. GRPO takes a simpler approach where it samples multiple completions (“group”) for each prompt and uses their rewards to form a baseline for advantage computation. This group-derived baseline replaces the learned value function entirely (no need to train a critic!).

The practical implication is lower memory consumption and reduced training complexity relative to PPO while still preserving PPO’s core stability mechanisms, including the clipped surrogate objective and KL regularization.

In this blog, I will discuss and derive the GRPO objective step by step showing exactly how it simplifies PPO.

I: The PPO Objective and the Critic Problem

Let’s briefly recap the key relevant elements of PPO. For the full derivation and PPO details, see my previous blog on PPO.

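PPO optimizes an LLM by maximizing a clipped surrogate objective (constrained using KL regularization):

$$J^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between current and old policies.

The critical component here is the advantage estimate ($\hat{A}_t$). The advantage measures how much better (or worse) a specific action is compared to what we expected:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

To compute it, PPO uses a value function $V_\phi(s_t)$ (a baseline, also called the critic) that predicts expected future rewards from any state. The critic is trained alongside the policy and PPO uses Generalized Advantage Estimation (GAE) to compute advantages from per-token value predictions.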

The Value Function Problem in PPO

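The value function is implemented as a learned critic model with the same architecture as the policy (i.e. another full LLM copy). This critic is trained alongside the policy using a regression loss:

$$L^{V}(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right]$$

where $V_\phi(s_t)$ is the critic’s value prediction and $\hat{R}_t$ is typically the discounted return-to-go from the sampled trajectory.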

In PPO, we train two large neural networks (policy and critic) together rather than a single model. This substantially increases computational and memory overhead. Maintaining and training the critic alongside the policy not only increases memory consumption but also adds significant complexity to the training pipeline. In practice, PPO requires four models to be resident in memory at the same time: the policy, the critic, the reference model and the reward model.

One more issue with PPO is that GAE needs per-token rewards to compute Temporal Difference (TD) residuals at each position. But in LLM fine-tuning, we typically get an outcome reward: a single score for the entire completion, assigned only at the final token.

From DeepSeekMath: “During RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.”

It raises a fundamental question: how can the critic learn accurate per-token values when all training signal comes from a single final reward?

II: Replacing the Critic with Group Sampling

As mentioned earlier, in PPO the value function acts as a baseline for advantage estimation. Subtracting this baseline reduces the variance of the policy gradient estimator, which in turn stabilizes training.

Key insight: the value function is just one possible choice of baseline. In principle, any function that depends only on the state and not on the action can be used without introducing bias into the gradient estimates.

Common baseline choices are:

Constant baseline: The average reward across samples. This is the simplest option and is used in vanilla REINFORCE.
Learned value function: trained alongside the policy as in PPO.
Monte Carlo estimate: An empirical average of returns computed from multiple samples starting from the same state.

GRPO adopts the third approach. Instead of learning a value function, it directly estimates the expected return using multiple samples.

Monte Carlo Baseline

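For each prompt $q$, GRPO samples multiple completions $\{o_1, \dots, o_G\}$ from the policy and obtains their rewards $\{r_1, \dots, r_G\}$. The average reward across these completions provides a Monte Carlo estimate of the expected return:

$$\bar{r} = \frac{1}{G}\sum_{i=1}^{G} r_i \approx \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\left[r(q, o)\right]$$

This is a natural and unbiased estimator. With enough samples it converges to the true expected reward for that prompt, similar to what a well-trained value function would predict, which eliminates the need to train a separate critic model.

Group-Relative Advantage

Using the average reward as a baseline, the advantage for completion $o_i$ becomes:

$$\hat{A}_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$$

GRPO normalizes the advantage by the standard deviation of rewards in the group:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$$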

This normalization ensures that advantages are on a comparable scale regardless of the prompt’s inherent difficulty.

Think of it this way: different prompts can have vastly different reward scales. For example, a simple arithmetic question might yield rewards clustered around 0.9 while a challenging proof might have rewards spread across, say, 0.1-0.9. Without normalization, the policy gradient updates would be dominated by high-variance prompts, which can destabilize training.

GRPO’s group-relative advantage mirrors the comparative nature of reward models: we are asking “how good is this completion relative to other completions for the same prompt?”
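Here is a minimal sketch of the group-relative advantage for one prompt. The function name, the small `eps` added to the denominator and the use of the population standard deviation are my assumptions. Implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against identical rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward groups for an easy and a hard prompt.
easy = group_relative_advantages([0.9, 0.9, 0.8, 1.0])
hard = group_relative_advantages([0.1, 0.9, 0.1, 0.9])

print([round(a, 3) for a in easy])
print([round(a, 3) for a in hard])
```

Even though the raw rewards of the two prompts live on very different scales, the resulting advantages are of comparable magnitude. If every completion in a group gets the same reward, all advantages are zero and the prompt contributes no gradient signal.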

III: The GRPO Objective

We now have all the pieces needed to construct the full GRPO objective. The construction follows three key modifications:

1. Start with PPO’s clipped surrogate. Recall from Section I that PPO optimizes:

$$J^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

Here $\rho_t(\theta)$ is the probability ratio $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ and this clipping mechanism provides a soft trust region that prevents destructively large policy updates.

2. Replace GAE advantage with group-relative advantage. Instead of computing $\hat{A}_t$ using a learned critic and GAE, we substitute the group-relative advantage from Section II:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$$

This is the key simplification. We no longer need per-token value predictions. Instead we estimate the baseline directly from sampled completions.

3. Move KL penalty from reward to loss. In PPO, the KL penalty is typically subtracted from the reward signal ($r_t$) before computing advantages:

$$r_t = r_\varphi(s_t, a_t) - \beta \log\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$

GRPO takes a different approach by adding the KL divergence directly as a penalty term in the loss function. This is a design choice that simplifies advantage computation since we don’t need to consider KL penalties in the baseline estimation.

From DeepSeekMath: “Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$.”

The Full GRPO Objective

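Combining these modifications, the GRPO objective (to be maximized) is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\rho_i(\theta)\,\hat{A}_i,\ \text{clip}\left(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right)\right]$$

where:

$q$ is a prompt sampled from the training distribution
$\{o_i\}_{i=1}^{G}$ are completions sampled from the old policy $\pi_{\theta_{\text{old}}}$
$\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$ is the probability ratio for completion $o_i$
$\hat{A}_i$ is the group-relative advantage for completion $o_i$ (from Section II)
$\beta$ is the KL penalty coefficient
$\pi_{\text{ref}}$ is the frozen reference model (typically the SFT checkpoint)

The objective averages over all completions in the group, treating each completion equally in the policy update.

For implementation, we expand the sequence-level objective over individual tokens, since autoregressive models factor the probability of a completion as a product of token probabilities:

$$\pi_\theta(o_i \mid q) = \prod_{t=1}^{|o_i|} \pi_\theta(o_{i,t} \mid q, o_{i,<t})$$

The per-token formulation of GRPO becomes:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\left(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\, D_{\text{KL}}^{(i,t)}\right)\right]$$

where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the per-token probability ratio and $D_{\text{KL}}^{(i,t)}$ is the per-token KL divergence between the current policy and the reference model.

Note, all tokens in completion $o_i$ receive the same advantage:

$$\hat{A}_{i,t} = \hat{A}_i \quad \text{for all } t \in \{1, \dots, |o_i|\}$$

This is a deliberate simplification. Since we only receive a single reward for the entire completion, trying to learn which specific tokens were “good” or “bad” can be difficult.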

From DeepSeekMath: “While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.”
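To connect the per-token objective to code, here is a plain-Python sketch that evaluates it for one prompt’s group from lists of per-token log-probabilities. The function name and the default `eps=0.2` are my assumptions. `beta=0.04` is the KL coefficient reported in DeepSeekMath, and the KL term uses the non-negative estimator that GRPO adopts. A real implementation would use tensors and autograd.

```python
import math


def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Per-token GRPO objective (to be maximized) for one prompt's group.

    logp_new / logp_old / logp_ref: one list per completion holding the
    log-probability of each sampled token under the current, old and
    reference policy. advantages: one group-relative advantage per completion.
    """
    total = 0.0
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        per_token = []
        for ln, lo, lr in zip(new, old, ref):
            ratio = math.exp(ln - lo)                      # pi_theta / pi_old
            clipped = min(max(ratio, 1 - eps), 1 + eps)    # clip to [1-eps, 1+eps]
            surrogate = min(ratio * adv, clipped * adv)
            log_x = lr - ln                                # log(pi_ref / pi_theta)
            kl = math.exp(log_x) - log_x - 1               # non-negative estimator
            per_token.append(surrogate - beta * kl)
        total += sum(per_token) / len(per_token)           # average over tokens
    return total / len(logp_new)                           # average over the group


# Hypothetical group of two completions; all three policies agree here.
logps = [[-1.0, -2.0], [-0.5]]
print(grpo_objective(logps, logps, logps, advantages=[1.0, 0.5]))
```

When the three policies agree, every ratio is 1 and every KL term is 0, so the objective reduces to the mean advantage across the group.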

Single Gradient Step Simplification

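In practice (as mentioned in RLHF Book), GRPO is often run with only one gradient step per batch of sampled data. In this case $\pi_\theta = \pi_{\theta_{\text{old}}}$ at the start of the update, which means the per-token policy ratio $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ equals 1 and the clipping mechanism has no effect:

$$\min\left(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\left(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right) = \rho_{i,t}(\theta)\,\hat{A}_{i,t} \quad \text{for } \rho_{i,t}(\theta) \in [1-\epsilon,\ 1+\epsilon]$$

The objective then simplifies to a weighted policy gradient:

$$\nabla_\theta J_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\hat{A}_{i,t}\,\nabla_\theta \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) - \beta\,\nabla_\theta D_{\text{KL}}^{(i,t)}\right)$$

where $\hat{A}_{i,t}$ is the advantage of token $t$ in completion $o_i$ and $D_{\text{KL}}^{(i,t)}$ is the per-token KL penalty.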

IV: KL Divergence in GRPO

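The KL divergence is a measure of the difference between two probability distributions. It is defined as:

$$D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) = \mathbb{E}_{x \sim \pi_\theta}\left[\log\frac{\pi_\theta(x)}{\pi_{\text{ref}}(x)}\right]$$

It can simply be estimated as:

$$\hat{D}_{\text{KL}} = \log\frac{\pi_\theta(x)}{\pi_{\text{ref}}(x)}, \quad x \sim \pi_\theta$$

However, this can be negative for individual samples (when $\pi_\theta(x) < \pi_{\text{ref}}(x)$) even though KL divergence is always non-negative. This may lead to high variance in gradient estimates.

GRPO uses an alternative estimator that is both unbiased and guaranteed non-negative:

$$D_{\text{KL}}^{(i,t)} = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$

This estimator can be understood as measuring the gap between $\log x$ and its tangent line $x - 1$ at $x = 1$, where $x = \pi_{\text{ref}} / \pi_\theta$. Since $\log x$ is concave, this gap is always non-negative, ensuring the KL penalty never incorrectly suggests that diverging from the reference decreases the penalty.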

For a detailed derivation of why this estimator is unbiased and non-negative, see John Schulman’s excellent blog post on approximating KL divergence.
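Both properties can be checked numerically on a small discrete distribution. The sketch below uses two made-up three-token distributions and takes the expectation exactly (by summing over tokens) instead of sampling. The names `k1` and `k3` follow the naming in Schulman’s post.

```python
import math


def exact_kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def k1(p_x, q_x):
    """Naive per-sample estimator: log(p/q). Can be negative."""
    return math.log(p_x / q_x)


def k3(p_x, q_x):
    """Estimator used by GRPO: q/p - log(q/p) - 1. Never negative."""
    ratio = q_x / p_x
    return ratio - math.log(ratio) - 1


# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

# Exact expectation under x ~ pi_theta of each per-sample estimator.
expected_k1 = sum(p * k1(p, q) for p, q in zip(pi_theta, pi_ref))
expected_k3 = sum(p * k3(p, q) for p, q in zip(pi_theta, pi_ref))

print(exact_kl(pi_theta, pi_ref), expected_k1, expected_k3)
print([round(k1(p, q), 4) for p, q in zip(pi_theta, pi_ref)])
print([round(k3(p, q), 4) for p, q in zip(pi_theta, pi_ref)])
```

Both expectations match the exact KL, but only the second estimator stays non-negative for every individual token. The naive one is negative for the second token, where the reference assigns more probability than the current policy.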

V: Outcome vs. Process Supervision

The GRPO formulation presented so far assumes outcome supervision which provides a single reward at the end of each completion with the same advantage assigned to every token. However for complex reasoning tasks knowing only the final answer reward might not be sufficient.

From DeepSeekMath Paper: “Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks.”

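Process supervision addresses this by providing rewards at the end of each reasoning step. Given a completion $o_i$ with $K_i$ reasoning steps, a process reward model (PRM) assigns rewards $\{r_i^{\text{index}(1)}, \dots, r_i^{\text{index}(K_i)}\}$ at step boundaries where $\text{index}(j)$ is the end token index of the $j$-th step.

GRPO extends to process supervision with two modifications:

1. Normalize across all step rewards in the group:

$$\tilde{r}_i^{\text{index}(j)} = \frac{r_i^{\text{index}(j)} - \text{mean}(\mathbf{R})}{\text{std}(\mathbf{R})}$$

where $\mathbf{R}$ contains all step rewards across all completions.

2. Compute advantages as cumulative future rewards:

$$\hat{A}_{i,t} = \sum_{\text{index}(j) \ge t} \tilde{r}_i^{\text{index}(j)}$$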

This mirrors return-to-go in traditional RL where earlier tokens accumulate rewards from all subsequent steps, while tokens near the end see only remaining rewards.
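The two modifications can be sketched as follows. The function name, the 0-based token indices, the population standard deviation and the `eps` term are my assumptions. The step rewards in the example are made up.

```python
from statistics import mean, pstdev


def process_advantages(step_rewards, step_ends, lengths, eps=1e-8):
    """Per-token advantages under process supervision.

    step_rewards[i][j]: PRM reward for step j of completion i.
    step_ends[i][j]:    0-based index of the last token of that step.
    lengths[i]:         number of tokens in completion i.
    """
    # 1. Normalize with statistics over every step reward in the group.
    all_rewards = [r for rewards in step_rewards for r in rewards]
    mu, sigma = mean(all_rewards), pstdev(all_rewards)

    advantages = []
    for rewards, ends, n in zip(step_rewards, step_ends, lengths):
        normed = [(r - mu) / (sigma + eps) for r in rewards]
        # 2. Token t accumulates the normalized rewards of all steps ending at or after t.
        advantages.append([
            sum(nr for nr, end in zip(normed, ends) if end >= t)
            for t in range(n)
        ])
    return advantages


# Two hypothetical 4-token completions, each with two steps ending at tokens 1 and 3.
adv = process_advantages(
    step_rewards=[[1.0, 0.0], [0.0, 1.0]],
    step_ends=[[1, 3], [1, 3]],
    lengths=[4, 4],
)
print([[round(a, 3) for a in row] for row in adv])
```

Tokens in the first step see the sum of both normalized step rewards, while tokens in the second step see only the second one. Unlike outcome supervision, the advantage can therefore differ between tokens of the same completion.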

The DeepSeekMath experiments found process supervision can accelerate learning, though the gap narrows with iterative training. For domains with reliable verifiers (code execution, math answer checking), outcome supervision with RLVR has become dominant. DeepSeek-R1 uses only outcome-level verification.

VI: Connection to REINFORCE Leave-One-Out (RLOO)

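GRPO is not the only critic-free algorithm leveraging group sampling. REINFORCE Leave-One-Out (RLOO) takes a similar approach but computes the baseline as the mean reward over all other completions, excluding the current sample:

$$b_i = \frac{1}{G-1}\sum_{j \neq i} r_j, \qquad \hat{A}_i^{\text{RLOO}} = r_i - b_i$$

where $r_j$ is the reward of completion $j$ in a group of $G$ completions.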

This “leave-one-out” baseline avoids a subtle correlation that exists when the baseline includes the sample being evaluated.

The two algorithms are conceptually very similar. However, there are some key differences:

Aspect	RLOO	GRPO
Baseline	Mean of other samples	Mean of all samples
Normalization	None	Divide by std
Clipping	No	Yes (PPO-style)
KL Placement	In reward	In loss

GRPO can be understood as inheriting PPO’s clipping mechanism for stability while adopting RLOO-style group sampling to eliminate the critic.
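The baseline row of the table is easy to verify numerically. The sketch below compares the leave-one-out advantage with the plain mean-baseline advantage (GRPO before dividing by the standard deviation). The function names and rewards are made up for illustration.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean of the other samples."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def mean_baseline_advantages(rewards):
    """Advantage of each sample against the mean of all samples (GRPO, before std scaling)."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for G = 4 completions
rloo = rloo_advantages(rewards)
grpo_unscaled = mean_baseline_advantages(rewards)

print(rloo)
print(grpo_unscaled)
```

The two differ only by a constant factor of G/(G-1), since $r_i - \frac{1}{G-1}\sum_{j \neq i} r_j = \frac{G}{G-1}\left(r_i - \frac{1}{G}\sum_j r_j\right)$. So the practical differences between the algorithms come mostly from the other rows: normalization, clipping and KL placement.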

Conclusion

GRPO’s ingenuity comes from recognizing that PPO’s value function is fundamentally just a baseline for advantage computation and that an estimate obtained via group sampling can serve the same role. By sampling multiple completions per prompt and using their mean reward as the baseline, GRPO achieves the stability provided by PPO’s clipped surrogate objective without the memory overhead or training complexity of a separate critic model.

This design choice is what makes GRPO a preferred approach for RLVR training of LLMs focused on reasoning capabilities.

References

Papers:

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models: The original GRPO paper
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning: DeepSeek’s reasoning model trained with GRPO
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs: The RLOO paper comparing critic-free methods

Blogs:

John Schulman’s post on approximating KL divergence: Derivation of the unbiased non-negative KL estimator
RLHF Book by Nathan Lambert: Comprehensive resource on RLHF algorithms including GRPO implementation details
Understanding GRPO by Cameron R. Wolfe: Deep dive into GRPO mechanics and implementation

Deriving the DPO Loss from First Principles

Tue, 30 Dec 2025 00:00:00 GMT

In my previous post, I worked through the derivation of the PPO loss used in RLHF for LLMs. By the end, we arrived at a fairly daunting objective function with multiple components: clipped surrogate, value function loss, entropy bonus and KL penalty. It is not just that the final objective is intimidating but the entire RLHF pipeline is complex and multi-step. You first train a separate reward model to reflect human preferences then fine-tune the LLM using RL with PPO.

That brings us to Direct Preference Optimization (DPO). DPO is a computationally lightweight alternative that directly optimizes LLMs to adhere to human preferences without explicit reward modeling or reinforcement learning. The key insight is that DPO implicitly optimizes the same objective as PPO-based RLHF (reward maximization with a KL-divergence constraint) but it replaces the entire reward model + PPO loop with a single supervised objective on preference pairs. There is no sampling during training, no value function, no clipping, just a classification loss!

Here I derive the DPO loss showing exactly how this simplification is possible. I will assume familiarity with concepts from the PPO post, particularly the reward model and the KL-constrained RLHF objective.

Again, a huge shoutout to Umar Jamil’s video on DPO for an excellent walkthrough that helped me understand the derivation.

I: The RLHF Objective

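Let’s recall the RLHF objective from the PPO blog. The goal of RLHF is to find a policy $\pi_\theta$ that maximizes expected reward while staying close to a reference model $\pi_{\text{ref}}$:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta \, D_{\text{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big]$$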

The first term encourages the model to generate high-reward responses. The second term (KL penalty) prevents the model from drifting too far from the reference which helps avoid reward hacking and maintains language quality.

As we saw in the PPO blog, we can’t optimize this objective directly with gradient descent because the expectation requires sampling from the policy and sampling is non-differentiable. This is why we needed reinforcement learning algorithms like REINFORCE and PPO. They provide ways to estimate policy gradients without differentiating through the sampling process.

What if we could reformulate the problem so that we don’t need to sample from the policy during training? This is exactly what DPO will achieve.

II: The Bradley-Terry Model for Preference Learning

We also need to understand the Bradley-Terry model for reward model training in a bit more detail with focus on why it works the way it does.

Training a reward model requires human-labeled preference data that compares pairs of responses:

$$\mathcal{D} = \big\{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\big\}_{i=1}^{N}$$

where:

- $x$ is the prompt
- $y_w$ is the preferred (winning) response
- $y_l$ is the dispreferred (losing) response

The Bradley-Terry Probability Model

The Bradley-Terry model provides a principled way to convert comparison data (defined above) into a probabilistic model. It assumes there exists some latent reward function $r^*(x, y)$ that captures true response quality and models the probability that response $y_w$ is preferred over $y_l$ as:

$$P(y_w \succ y_l \mid x) = \frac{\exp\big(r^*(x, y_w)\big)}{\exp\big(r^*(x, y_w)\big) + \exp\big(r^*(x, y_l)\big)}$$

The intuition is straightforward: responses with higher reward are exponentially more likely to be preferred.

A key step is recognizing that this ratio of exponentials can be written as a sigmoid function. This is important because it connects Bradley-Terry to standard binary classification.

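Let $A = r^*(x, y_w)$ and $B = r^*(x, y_l)$. We want to show:

$$\frac{e^{A}}{e^{A} + e^{B}} = \sigma(A - B)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

Starting with the left-hand side and dividing numerator and denominator by $e^{A}$:

$$\frac{e^{A}}{e^{A} + e^{B}} = \frac{1}{1 + e^{B - A}}$$

we can rewrite it as:

$$\frac{1}{1 + e^{-(A - B)}}$$

which is the sigmoid function of $A - B$:

$$\frac{e^{A}}{e^{A} + e^{B}} = \sigma(A - B)$$

Therefore, the Bradley-Terry model can be written as:

$$P(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)$$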

Reward Model Loss

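Given a dataset of preferences $\mathcal{D}$, we can train a parameterized reward model $r_\phi(x, y)$ using maximum likelihood estimation. We want to maximize the probability of observing the preferences in our dataset:

$$\max_\phi \prod_{(x, y_w, y_l) \in \mathcal{D}} \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Taking the log, averaging over the dataset and negating (to turn maximization into minimization), we get the negative log-likelihood loss:

$$\mathcal{L}_R(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] \tag{II.III}$$

This is just a binary cross-entropy objective, which helps the reward model learn to assign higher rewards to preferred responses. Notice that the Bradley-Terry model depends only on the difference of rewards: $r(x, y_w) - r(x, y_l)$. The absolute values don't matter, only their relative ordering. This means:

$$\sigma\big([r(x, y_w) + f(x)] - [r(x, y_l) + f(x)]\big) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

If we add any constant or any function $f(x)$ that depends only on the prompt (not the response), the preference probabilities don't change. This invariance property will be the key to deriving DPO.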
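Both facts from this section (the ratio of exponentials is a sigmoid of the reward difference, and the loss ignores any per-prompt shift) are easy to check numerically. The sketch below uses made-up reward scores and my own function names; it is only meant as a sanity check, not a training implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_prob(r_w, r_l):
    """Bradley-Terry probability written as a ratio of exponentials."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def rm_loss(pairs):
    """Mean negative log-likelihood over (r_w, r_l) reward pairs."""
    return -sum(math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

pairs = [(1.2, 0.3), (-0.5, 0.4), (2.0, 1.9)]  # hypothetical reward-model scores
shifted = [(r_w + 7.0, r_l + 7.0) for r_w, r_l in pairs]  # add a constant to both rewards

print(bt_prob(1.2, 0.3), sigmoid(1.2 - 0.3))  # the two forms agree
print(rm_loss(pairs), rm_loss(shifted))       # the loss is unchanged by the shift
```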

III: Optimal Policy in Closed Form

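Here, we will find the exact analytical solution to the optimization problem. We want to find the policy $\pi$ that maximizes expected reward while keeping the KL divergence from the reference policy bounded:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta \, D_{\text{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big]$$

Note, I am writing $\pi$ instead of $\pi_\theta$ to emphasize that we are looking for the optimal policy in general, not just the parameterized version.

Expanding the KL divergence:

$$D_{\text{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big] = \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$

So our objective becomes:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$

For a fixed prompt $x$, we want to find the distribution $\pi(\cdot \mid x)$ that maximizes:

$$\sum_{y} \pi(y \mid x)\left[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$

This is a constrained optimization problem over probability distributions. We can solve it using the method of Lagrange multipliers, enforcing that $\pi(\cdot \mid x)$ sums to 1. For discrete $y$:

$$\mathcal{L}(\pi, \lambda) = \sum_{y} \pi(y \mid x)\left[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] + \lambda\left(1 - \sum_{y} \pi(y \mid x)\right)$$

Taking the derivative with respect to $\pi(y \mid x)$ and setting it to zero (stationary point):

$$r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \beta - \lambda = 0$$

Now, solving for $\pi(y \mid x)$:

$$\pi(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \exp\left(-1 - \frac{\lambda}{\beta}\right)$$

The term $\exp\left(-1 - \frac{\lambda}{\beta}\right)$ is a constant (with respect to $y$) that ensures normalization. To find its value, we enforce that $\pi(\cdot \mid x)$ must be a valid probability distribution and sum to 1:

$$\sum_{y} \pi(y \mid x) = 1$$

Substituting our expression for $\pi(y \mid x)$:

$$\sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \exp\left(-1 - \frac{\lambda}{\beta}\right) = 1$$

Since $\exp\left(-1 - \frac{\lambda}{\beta}\right)$ doesn’t depend on $y$, we can factor it out of the sum:

$$\exp\left(-1 - \frac{\lambda}{\beta}\right) \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) = 1$$

Solving for the constant:

$$\exp\left(-1 - \frac{\lambda}{\beta}\right) = \frac{1}{\sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)}$$

We define this normalizing sum as the partition function $Z(x)$:

$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

Substituting back, we get the optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{III.III}$$

We have an exact closed-form expression for the optimal policy. However, we cannot compute it directly because $Z(x)$ is intractable. To compute it, we would need to sum over all possible responses $y$, which is not possible.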
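The closed form is intractable for real LLMs, but it can be checked on a toy problem where the set of responses is small enough to enumerate. The sketch below assumes three hypothetical responses with made-up reference probabilities and rewards; the function names are mine. It computes $\pi^*$ and compares the KL-regularized objective at $\pi^*$ against other distributions.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a small discrete set of responses."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for one prompt."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, pi_ref, rewards) if p > 0)

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference probabilities
rewards = [1.0, 2.0, 0.0]  # hypothetical rewards r(x, y)
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
print(pi_star)
print(objective(pi_star, pi_ref, rewards, beta), beta * math.log(z))
print(objective(pi_ref, pi_ref, rewards, beta))
```

Plugging $\pi^*$ back into the objective gives exactly $\beta \log Z(x)$, which is why the first two printed values agree; any other distribution scores lower.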

IV: The Reparameterization Trick

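The key insight of DPO is to flip the relationship between reward and policy. Now, we frame the problem as: “given an optimal policy $\pi^*$, what reward function does it correspond to?”

Starting from the optimal policy equation (III.III):

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

We solve for the reward by first taking the log of both sides:

$$\log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{1}{\beta} r(x, y) - \log Z(x)$$

Now rearrange to get $r(x, y)$ on the left-hand side:

$$r(x, y) = \beta \log \pi^*(y \mid x) - \beta \log \pi_{\text{ref}}(y \mid x) + \beta \log Z(x)$$

This can be written more compactly as:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The reward is expressed as:

- A term involving the log-ratio of the optimal policy to the reference policy
- $\beta \log Z(x)$, which depends only on $x$ (not on $y$)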
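As a quick check of this inversion, the sketch below builds the optimal policy for a toy problem and then recovers the reward from it. The three responses, their reference probabilities and their rewards are hypothetical values chosen for illustration.

```python
import math

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference probabilities for three responses
rewards = [1.0, 2.0, 0.0]  # hypothetical rewards r(x, y)
beta = 0.5

# Optimal policy from the closed form: pi* is proportional to pi_ref * exp(r / beta).
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# The log-ratio term alone equals the reward shifted by the same constant for every y.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]
# Adding beta * log Z recovers the original reward exactly.
recovered = [imp + beta * math.log(z) for imp in implicit]

print(recovered)
print(implicit[1] - implicit[0], rewards[1] - rewards[0])
```

The last line is the part that matters for DPO: differences between log-ratio terms already match differences between true rewards, without ever knowing $Z(x)$.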

V: Deriving the DPO Loss

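Finally, we have all the pieces to derive the DPO loss. From Section II, the Bradley-Terry preference model is:

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

From Section IV, assuming we have access to an optimal policy $\pi^*$, the reward can be written as:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting this into Bradley-Terry:

$$P(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)\right)$$

Simplifying the expression inside the sigmoid:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} + \beta \log Z(x) - \beta \log Z(x)$$

The $\beta \log Z(x)$ terms cancel:

$$P(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

Now recall the critical insight from Section II where we mentioned that the Bradley-Terry model depends only on reward differences. Thus, when we compute $r(x, y_w) - r(x, y_l)$, the intractable partition function $Z(x)$ cancels out. This is what makes DPO possible.

We can write this more cleanly by defining the implicit reward in terms of the optimal policy:

$$\hat{r}(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

Thus:

$$P(y_w \succ y_l \mid x) = \sigma\big(\hat{r}(x, y_w) - \hat{r}(x, y_l)\big)$$

We don't actually have access to the optimal policy $\pi^*$. But we can parameterize a policy $\pi_\theta$ and optimize it to maximize the likelihood of the observed preferences. This is exactly what the reward model loss (II.III) does, except now our reward is implicitly defined by the policy itself.

The DPO loss is the negative log-likelihood:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

In the implicit reward notation, with $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, it can be written as:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\Big]$$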

Some key insights from the above DPO loss:

The policy implicitly defines its own reward via the log-ratio with the reference. There is no separate reward model.
This is just a supervised classification loss on preference pairs with no RL.
DPO uses the fixed preference dataset $\mathcal{D}$. Thus, there is no sampling during training.
No value function needed since we’re not doing policy gradients
DPO still optimizes the KL-constrained reward maximization objective but in a different way
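For a single preference pair, the DPO loss needs only four numbers: the sequence log-probabilities of the preferred and dispreferred responses under the policy and under the reference. Here is a minimal sketch of that computation. The log-probability values are made up, the function names are mine and `beta=0.1` is an assumed setting rather than something fixed by the derivation.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair, given four sequence log-probabilities."""
    reward_w = beta * (policy_w - ref_w)  # implicit reward of the preferred response
    reward_l = beta * (policy_l - ref_l)  # implicit reward of the dispreferred response
    return -log_sigmoid(reward_w - reward_l)

# Policy identical to the reference: both implicit rewards are zero.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised log pi(y_w | x) and lowered log pi(y_l | x) relative to the reference.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, improved)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$, which is the usual starting value of a DPO run initialized from the reference model.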

VI: Building Intuition for DPO

Now that we have the DPO loss we can build some intuition around the implicit reward model and its gradient updates.

Implicit Reward Model

The DPO paper subtitle is “Your Language Model is Secretly a Reward Model” and this captures the key insight, as we are using the LLM itself as an implicit reward. The policy $\pi_\theta$ defines an implicit reward function:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

This reward measures how much more likely the current policy is to generate response $y$ compared to the reference policy, scaled by $\beta$.

If $\pi_\theta(y \mid x) > \pi_{\text{ref}}(y \mid x)$: The implicit reward is positive (the policy “likes” this response more than reference)
If $\pi_\theta(y \mid x) < \pi_{\text{ref}}(y \mid x)$: The implicit reward is negative (the policy “likes” this response less than reference)

Analyzing the Gradient Update

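We can flex our brain muscles one more time and compute the gradient for the DPO objective, written with the implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\Big]$$

Let $u = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$. Using the chain rule:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\frac{\sigma'(u)}{\sigma(u)} \nabla_\theta u\right]$$

Using the property $\sigma'(u) = \sigma(u)\big(1 - \sigma(u)\big)$:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\mathbb{E}\big[\big(1 - \sigma(u)\big) \nabla_\theta u\big]$$

Using $1 - \sigma(u) = \sigma(-u)$:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\mathbb{E}\big[\sigma(-u) \nabla_\theta u\big]$$

Now, $\sigma(-u) = \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$, and since $\pi_{\text{ref}}$ does not depend on $\theta$:

$$\nabla_\theta u = \beta \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)$$

Putting it together:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big]$$

$\nabla_\theta \log \pi_\theta(y_w \mid x)$ points in the direction that increases the probability of the preferred response
$-\nabla_\theta \log \pi_\theta(y_l \mid x)$ points in the direction that decreases the probability of the dispreferred response
The weight term $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is high when $\hat{r}_\theta(x, y_l) > \hat{r}_\theta(x, y_w)$, i.e. when the model currently assigns higher implicit reward to the losing response than the winning response. In other words:
- When the model is wrong (ranks $y_l$ above $y_w$), we get large gradient updates
- When the model is right (ranks $y_w$ above $y_l$), we get small gradient updates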

This dynamic sigmoid weighting is crucial. It naturally focuses learning on the examples the model currently gets wrong.
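The weighting can be checked numerically by treating the per-pair loss as a function of the implicit reward margin $u$ (winning minus losing) and comparing a finite-difference derivative with the closed-form weight. This is a small sketch with my own function names and arbitrary margin values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(u):
    """Per-pair DPO loss as a function of the implicit reward margin u."""
    return -math.log(sigmoid(u))

def grad_weight(u):
    """sigmoid(-u): the factor that scales the gradient update for this pair."""
    return sigmoid(-u)

h = 1e-6
for u in (-2.0, 0.0, 2.0):
    numeric = (loss(u + h) - loss(u - h)) / (2 * h)  # finite-difference dL/du
    # -dL/du should match the weight; it is largest for the negative (wrong) margin.
    print(u, grad_weight(u), -numeric)
```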

VII: Computing Log Probabilities in Practice

This section is fully adapted from Umar Jamil’s video. I think it is essential to understand how log probabilities are computed in practice.

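The DPO loss requires computing $\log \pi_\theta(y \mid x)$, the log probability of a complete response $y$ given a prompt $x$. Let’s see how this works in practice with LLMs.

Language models are autoregressive: they generate text one token at a time, conditioning on all previous tokens. For a response $y = (y_1, y_2, \dots, y_T)$, the probability factorizes as:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$

Taking the logarithm:

$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$

The log probability of the full response is the sum of log probabilities at each position.

Here’s how to compute $\log \pi_\theta(y \mid x)$: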

Prepare input: Concatenate the prompt and response into a single sequence
Forward pass: Run the transformer to get hidden states at each position
Project to logits: Apply the language model head (typically a linear layer) to get vocabulary logits at each position
Log softmax: Convert logits to log probabilities over the vocabulary using logsoftmax
Gather relevant log probs: For each position in the response extract the log probability of the actual next token (since we know the output)
Sum with masking: Sum the log probabilities but only for response tokens (not prompt tokens)

This gives us $\log \pi_\theta(y \mid x)$ for one response. We do this for both the preferred response $y_w$ and the dispreferred response $y_l$, and we also do it for both the policy model $\pi_\theta$ and the frozen reference model $\pi_{\text{ref}}$. With these four log probabilities in hand, we can compute the DPO loss.
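The last three steps of the recipe (log softmax, gather, masked sum) can be sketched in plain Python on a toy example. The logits below stand in for the output of a real forward pass and are made up, as are the token ids and the three-token vocabulary; in practice this would be done with batched tensor operations.

```python
import math

def log_softmax(logits):
    """Log-probabilities over the vocabulary for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_log_prob(logits, token_ids, prompt_len):
    """Sum of next-token log-probs over response tokens only.

    logits[t] is the model output at position t, i.e. the prediction for token t + 1.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        if t + 1 < prompt_len:  # the target is a prompt token, so it is masked out
            continue
        total += log_softmax(logits[t])[target]  # gather the log-prob of the actual token
    return total

# Hypothetical toy input: 2 prompt tokens followed by 2 response tokens.
token_ids = [0, 2, 1, 1]
logits = [[0.1, 0.2, 0.3], [2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.5, 0.5, 0.5]]
lp = sequence_log_prob(logits, token_ids, prompt_len=2)
print(lp)
```

Note the one-position shift: the logits at position $t$ score the token at position $t + 1$, so the logits at the final position are never used.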

Conclusion

Once you derive the DPO loss, you start appreciating the simplicity and elegance of the solution, especially when compared to PPO. The derivation hinges on one observation: the Bradley-Terry model only cares about reward differences, and this causes the intractable partition function from the analytical solution to cancel out completely. In turn, what remains is a straightforward classification loss.

References

Papers:
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model: The original DPO paper
- Training language models to follow instructions with human feedback: The InstructGPT paper that established PPO-based RLHF
Videos:
- Umar Jamil’s video on DPO: Excellent walkthrough of the DPO derivation

Deriving the PPO Loss from First Principles

Thu, 25 Dec 2025 00:00:00 GMT

I have been trying to wrap my head around reinforcement learning methods like DPO, GRPO, and RLVR for a while now, especially with all the recent work showing how effective they can be for LLM post-training. Since I am still pretty new to RL, I figured the best place to start was Proximal Policy Optimization (PPO), the algorithm OpenAI used to show how reinforcement learning could meaningfully improve LLM alignment (InstructGPT paper). My hope is that getting comfortable with PPO will give me the right mental model for the policy-gradient side of things and make it easier to understand the newer LLM-specific RL methods built on similar ideas.

If you start learning RL, you quickly realize it involves a lot of math! So I decided to lean into that and do a few (possibly annoying) derivation sessions to really understand the PPO objective by building it up from first principles, similar to how Umar Jamil does in his video.

A huge shoutout to Umar Jamil’s video on RLHF and PPO: it was incredibly helpful for building intuition and understanding the math behind the PPO loss.

Below is my attempt at the derivation based on the original PPO and InstructGPT papers and Umar Jamil’s video.

I: Reinforcement Learning: Core Definitions

| Concept | General RL Definition | LLM Context (RLHF) |
| --- | --- | --- |
| Reinforcement Learning | A learning setup where an agent learns to act in an environment to maximize expected cumulative reward. | Fine-tuning a language model to generate responses that better match human preferences using reward-based feedback. |
| Environment | Everything outside the agent that it interacts with and that produces observations and rewards. | The prompt distribution, the interaction loop and the reward signal from a reward model evaluating generated responses. |
| Agent | The learner/decision-maker that observes states, takes actions, and receives rewards. | The language model generating text token by token. |
| Action ($a_t$) | A choice made by the agent, usually conditioned on the state $s_t$. | Picking the next token at each step of generation. |
| State ($s_t$) | The information available to the agent at a given time step. | The prompt plus the response generated so far (the current token context). |
| Reward ($r_t$) | A scalar signal telling the agent how good or bad an outcome was. | A score from the reward model (trained on preference data) that judges how good or bad a response is. |
| Policy ($\pi_\theta$) | A stochastic mapping from states to a distribution over actions. | The model’s probability distribution over the next token given the context. |
| Goal | Find an optimal policy $\pi^*$ that maximizes expected cumulative reward over time. | Update (align) the model so it tends to generate responses with higher reward-model scores. |

II: Reward Model in RLHF for LLMs

A Reward Model (RM) is a neural network that takes a prompt and a response as input and outputs a scalar reward indicating how “good” or “aligned” that response is according to human preferences.

Policy-gradient methods (including PPO) require a scalar objective to update the policy parameters. In standard RL, the environment provides this signal. However, for language generation, there is no natural environment giving us rewards for “good” responses. Having humans rate every output is impractical, and RL training needs a score for every sampled response. Note that the reward itself does not have to be differentiable, because policy-gradient methods only differentiate the log-probabilities of the policy. Thus, we require a cheap, automated proxy for human preferences during RL training. A learned RM provides exactly this.

How is the Reward Model Trained?

The standard procedure for training the reward model is:

Sample prompts ($x$)
Generate multiple candidate completions ($y_1, \dots, y_K$) from a baseline policy (often an SFT model).
Ask humans to compare candidates (pairwise preferences are easier than absolute scoring).
Train the RM ($r_\phi$) to predict those preferences.

Architecturally, the reward model is typically:

- Initialized from a pretrained language model (often the SFT model itself)
- Stripped of the final unembedding layer (which projects to the vocabulary)
- Given a new linear layer that projects the hidden state of the last token to a single scalar output

Reward Model Loss Function

The reward model is trained using the Bradley-Terry model for pairwise comparisons. The probability that response $y_w$ (preferred) is preferred over $y_l$ (less preferred) for any prompt $x$ is modeled as:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

where $\sigma$ is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The negative log-likelihood loss is:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

One can verify that this loss forces the reward model to assign higher rewards to preferred responses (see InstructGPT paper or Umar Jamil’s video for a detailed walkthrough).

There are two key insights here:

1. We don’t need absolute scores, we only need the reward model to correctly rank responses.
2. The loss depends only on differences ($r_\phi(x, y_w) - r_\phi(x, y_l)$), so it is invariant to adding a constant to all rewards. This will be useful later when we discuss the PPO loss.

The reward model serves as a learned proxy for human preferences, converting the intractable problem of getting human feedback on every generation into a tractable supervised learning problem. Once trained, it provides the scalar signal needed to optimize our policy (LLM) using RL algorithms like PPO.

III: Trajectories and Returns

Trajectory

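A trajectory (also called a rollout or episode) is a sequence of states ($s_t$), actions ($a_t$), and rewards ($r_t$) generated by an agent interacting with an environment:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$$

In the context of LLMs, a trajectory corresponds to the entire sequence of token generations. It is the prompt followed by all generated tokens until the end-of-sequence token.

Note that state transitions are modeled stochastically and can be represented as $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Given a stochastic policy $\pi_\theta(a_t \mid s_t)$, the probability of a trajectory is the product of:

1. The initial state distribution $\rho_0(s_0)$
2. The stochastic policy $\pi_\theta(a_t \mid s_t)$
3. The environment transition dynamics $P(s_{t+1} \mid s_t, a_t)$

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t) \tag{III.I}$$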

Return

The return is the cumulative reward collected over the full trajectory ($\tau$). The simplest form is the undiscounted return:

$$R(\tau) = \sum_{t=0}^{T} r_t$$

More generally, we use the discounted return:

$$R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

where $\gamma \in [0, 1]$ is the discount factor. The discount factor serves a couple of purposes:

1. It ensures the return is finite for infinite-horizon tasks ($T \to \infty$), provided $\gamma < 1$.
2. It prioritizes immediate rewards over distant ones.
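As a small illustration of the two return definitions, the sketch below computes both for one trajectory. The reward sequence is hypothetical: it assumes a sparse reward that arrives only at the final step, similar to scoring a finished response.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over one trajectory (gamma=1.0 gives the undiscounted return)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: reward only at the final step.
rewards = [0.0, 0.0, 0.0, 1.0]
undiscounted = discounted_return(rewards)
discounted = discounted_return(rewards, gamma=0.9)
print(undiscounted, discounted)
```

With $\gamma = 0.9$ the final reward is scaled by $0.9^3$, so the same outcome is worth less the later it arrives.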

IV: Policy Gradient Optimization and REINFORCE Algorithm

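The goal of reinforcement learning is to find a policy $\pi_\theta$ that maximizes the expected return over all possible trajectories:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$

This is our objective function and we want to find parameters $\theta^*$ such that:

$$\theta^* = \arg\max_\theta J(\theta)$$

To maximize $J(\theta)$ using gradient-based methods, we need to compute $\nabla_\theta J(\theta)$ and perform gradient ascent:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

This policy gradient looks simple in equation form but it is intractable to compute. The expectation is over trajectories sampled from $\pi_\theta$, which itself depends on $\theta$. We can’t simply enumerate all possible trajectories. This is computationally intractable for any reasonably sized state-action space (and certainly not possible for LLMs!).

Thus, as a next step we need to derive some sort of reasonable and tractable approximation for $\nabla_\theta J(\theta)$. We do this by using the log-derivative trick.

This expectation can be written as an integral:

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \nabla_\theta \int P(\tau \mid \theta) R(\tau) \, d\tau$$

Bringing the gradient inside the integral:

$$\nabla_\theta J(\theta) = \int \nabla_\theta P(\tau \mid \theta) R(\tau) \, d\tau$$

Now we apply the log-derivative trick:

$$\nabla_\theta \log P(\tau \mid \theta) = \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta)}$$

Rearranging to $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta) \nabla_\theta \log P(\tau \mid \theta)$ and substituting back, we get:

$$\nabla_\theta J(\theta) = \int P(\tau \mid \theta) \nabla_\theta \log P(\tau \mid \theta) R(\tau) \, d\tau$$

which can also be written as the following expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau \mid \theta) R(\tau)\big]$$

Note, here the gradient is now the expectation of the gradient of the log-probability of the trajectory. This can further be simplified by using the trajectory probability expression (III.I):

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)$$

Taking the log:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)$$

When we take $\nabla_\theta$, only the policy term depends on $\theta$:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The initial state distribution and transition dynamics are independent of $\theta$, so their gradients vanish. Substituting back, we obtain the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) R(\tau)\right]$$

This is a remarkable result. We can compute the gradient of our objective without differentiating through the environment dynamics and only need gradients of the log-probabilities of our policy.

Since we cannot compute the expectation exactly, we approximate it with a sample mean by sampling $N$ trajectories:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) R(\tau_i) \tag{IV.V}$$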

This gives us the REINFORCE algorithm:

1. Initialize: Start with a pretrained or supervised fine-tuned (SFT) language model $\pi_\theta$
2. Sample prompts: Draw a batch of prompts from a dataset
3. Generate trajectories: For each prompt $x_i$, generate a response $y_i$ by sampling tokens from the policy $\pi_\theta$. Each trajectory is the sequence of states (prompt + generated tokens so far) and actions (selected tokens).
4. Compute log-probabilities: For each trajectory, compute the log-probability of each generated token given its context: $\log \pi_\theta(a_{i,t} \mid s_{i,t})$
5. Compute rewards: Score each complete (prompt, response) pair using the reward model: $R(\tau_i) = r_\phi(x_i, y_i)$
6. Estimate policy gradient: Compute the gradient estimate using (IV.V): $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) R(\tau_i)$
7. Update policy: Perform a gradient ascent step: $\theta \leftarrow \theta + \alpha \hat{g}$
8. Repeat: Go back to Step 2 and iterate until convergence

While REINFORCE provides an unbiased gradient estimate, it suffers from two critical issues that make it impractical for LLM training:

High Variance: The gradient estimate depends heavily on which trajectories happen to be sampled. This variance can be large, leading to noisy gradients and unstable training.

> If you look again at (IV.V), the gradient estimate for each action is weighted by the return of the entire trajectory $R(\tau)$. This means that even if an action was good, it might receive a negative gradient update simply because other actions in the trajectory led to poor outcomes (or vice versa). Over many samples, the noise introduced by this coupling can be substantial, leading to high variance.

On-Policy Constraint (Sample Inefficiency): REINFORCE requires trajectories sampled from the current policy $\pi_\theta$. Thus after every gradient update, previously collected trajectories must be discarded and new ones need to be sampled from the updated policy. For LLMs, where each trajectory requires a full forward pass through a billion-parameter model, this is prohibitively expensive, especially when we need many small gradient steps to train effectively.
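The log-derivative identity that REINFORCE rests on can be verified on a problem small enough to take the expectation exactly. The sketch below assumes a one-step problem with three actions and a softmax policy over hypothetical logits, and compares the score-function gradient with a finite-difference gradient of the expected return. REINFORCE replaces the exact sum over actions with sampled trajectories, which is where the variance comes from.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(v - m) for v in theta]
    s = sum(exps)
    return [e / s for e in exps]

def expected_return(theta, returns):
    """J(theta) for a one-step problem: sum_a pi(a) * R(a)."""
    return sum(p * r for p, r in zip(softmax(theta), returns))

def score_function_gradient(theta, returns):
    """E_a[ grad log pi(a) * R(a) ], with the expectation taken exactly."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, returns)):
        for k in range(len(theta)):
            dlogp = (1.0 if a == k else 0.0) - pi[k]  # d log pi(a) / d theta_k
            grad[k] += p_a * dlogp * r_a
    return grad

def finite_difference_gradient(theta, returns, h=1e-6):
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + h] + theta[k + 1:]
        down = theta[:k] + [theta[k] - h] + theta[k + 1:]
        grad.append((expected_return(up, returns) - expected_return(down, returns)) / (2 * h))
    return grad

theta = [0.2, -0.1, 0.5]   # hypothetical policy logits
returns = [1.0, 0.0, 2.0]  # hypothetical return of each action
g_score = score_function_gradient(theta, returns)
g_fd = finite_difference_gradient(theta, returns)
print(g_score)
print(g_fd)
```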

V: Reducing Variance and the Advantage Function

The REINFORCE algorithm provides an unbiased gradient estimate (IV.V). However while unbiased, this estimator suffers from high variance.

Replacing Full-Trajectory Return with Reward-to-Go (using causality)

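A first variance reduction comes from noticing that the action $a_t$ taken at time $t$ cannot influence rewards that were received before time $t$. This is a fundamental consequence of causality. These past reward terms contribute only noise to the gradient estimate and add variance without contributing any signal. Thus, we can remove them and consider only the rewards-to-go $\hat{R}_t$:

$$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$$

This gives us a lower-variance estimator:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \hat{R}_{i,t}$$

where $\hat{R}_{i,t} = \sum_{t'=t}^{T} r_{i,t'}$ is the rewards-to-go for trajectory $i$ starting from time $t$.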
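The rewards-to-go for a single trajectory can be computed in one backward pass over the rewards. A minimal sketch, with made-up per-step rewards:

```python
def rewards_to_go(rewards):
    """R_hat_t = sum of rewards from step t to the end of the trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

rtg = rewards_to_go([1.0, 0.0, 2.0, -1.0])  # hypothetical per-step rewards
print(rtg)
```

The first entry equals the full-trajectory return, and each later entry drops the rewards that the corresponding action could not have influenced.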

Subtracting a Baseline

A second complementary technique for variance reduction is to subtract a baseline from the rewards. The key insight is that we can subtract any function that does not depend on the action from our reward signal without changing the expected value of the gradient.

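Thus we can subtract a state-dependent baseline $b(s_t)$ from our rewards-to-go to yield an unbiased gradient estimator:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \big(\hat{R}_{i,t} - b(s_{i,t})\big)$$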
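The claim that a baseline leaves the expected gradient unchanged can be checked exactly for a small softmax policy in a single state. In the sketch below the logits, returns and baseline value are hypothetical; the point is that the baseline-only term has zero expectation because $\sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) = 0$.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(v - m) for v in theta]
    s = sum(exps)
    return [e / s for e in exps]

def expected_score_times(theta, weights):
    """E_a[ grad log pi(a) * w(a) ] for a softmax policy, computed exactly."""
    pi = softmax(theta)
    return [sum(p_a * ((1.0 if a == k else 0.0) - pi[k]) * w_a
                for a, (p_a, w_a) in enumerate(zip(pi, weights)))
            for k in range(len(theta))]

theta = [0.2, -0.1, 0.5]   # hypothetical policy logits
returns = [1.0, 0.0, 2.0]  # hypothetical return of each action
baseline = 1.3             # any value that does not depend on the action

without_baseline = expected_score_times(theta, returns)
with_baseline = expected_score_times(theta, [r - baseline for r in returns])
baseline_only = expected_score_times(theta, [baseline] * len(theta))
print(without_baseline)
print(with_baseline)
print(baseline_only)
```

The expectation is identical with and without the baseline; what changes in a sampled estimate is only the variance of the individual terms.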

Value Functions: $V^\pi(s)$ and $Q^\pi(s, a)$

The baseline $b(s_t)$ is still an arbitrary function. To make it more systematic and concrete, there are two fundamental functions from RL theory.

State Value Function: The state value function $V^\pi(s)$ is the expected return when the agent is in state $s$ and acts according to policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[\hat{R}_t \mid s_t = s\big]$$

Intuitively, $V^\pi(s)$ tells “How good is this state on average?” and is used as a baseline $b(s)$.

Action Value Function (Q-function): The action value function $Q^\pi(s, a)$ is the expected return when starting in state $s$, taking action $a$ and then acting according to policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[\hat{R}_t \mid s_t = s, a_t = a\big]$$

Intuitively, $Q^\pi(s, a)$ tells “How good is this specific action in this state?” and in RL, the rewards-to-go $\hat{R}_t$ serves as a sample estimate of $Q^\pi(s_t, a_t)$.

In the LLM context:

- $V^\pi(s)$ estimates the expected reward for a given prompt + partial response, assuming the model continues generating according to its current policy.
- $Q^\pi(s, a)$ estimates the expected reward if, from the current prompt + partial response, the model generates a specific next token and then continues according to its policy.

Advantage Function

The advantage function measures how much better (or worse) a specific action is compared to the average action under the policy:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

The advantage function directly tells us: “How much better is this particular action compared to what we would typically do in this state?” This is precisely the signal we want for policy improvement. We want to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage.

From Umar Jamil’s video: in the LLM context, consider a state where the prompt is “Where is Shanghai?” and the model has generated “Shanghai is”. From this state:

- If the model samples the token “in” (leading toward “Shanghai is in China”), this action likely has positive advantage. This is because it is better than the average token the model might produce.
- If the model samples the token “delicious” (leading toward an incoherent response), this action likely has negative advantage. This is because it is worse than the average token the model might produce.

Advantage-Weighted Policy Gradient

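Substituting the rewards-to-go and the value function as a baseline, we get the following form of the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\big)\right]$$

which can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) A^{\pi_\theta}(s_t, a_t)\right]$$

and for sample-based approximation:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \hat{A}_{i,t}$$

where $\hat{A}_{i,t}$ is an estimate of the advantage function at time $t$ in trajectory $i$. This is the form of the policy gradient often used.

In practice, $\hat{A}_{i,t}$ can be estimated as follows:

1. Learn a value function: Train a neural network $V_\psi(s)$ (often called the “critic” or “value head”) to approximate $V^{\pi_\theta}(s)$. In LLM fine-tuning, this is often a linear layer on top of the same transformer backbone used for the policy.
2. Estimate $Q^{\pi_\theta}$ from samples: Given a trajectory, the rewards-to-go $\hat{R}_{i,t}$ provides an unbiased (but high-variance) estimate of $Q^{\pi_\theta}(s_{i,t}, a_{i,t})$.
3. Compute advantage estimates: $\hat{A}_{i,t} = \hat{R}_{i,t} - V_\psi(s_{i,t})$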
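The three-step recipe reduces to a few lines once the critic's predictions are available. In this sketch the per-token rewards and value predictions are made-up numbers standing in for a real rollout and a real value head.

```python
def advantage_estimates(rewards, values):
    """A_hat_t = R_hat_t - V(s_t), using rewards-to-go as the sample estimate of Q."""
    advantages, running = [], 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        running += r                    # rewards-to-go from this step onward
        advantages.append(running - v)  # subtract the critic's baseline
    return advantages[::-1]

rewards = [0.0, 0.0, 1.0]  # hypothetical per-token rewards (reward at the last token)
values = [0.4, 0.6, 0.9]   # hypothetical critic predictions V(s_t)
adv = advantage_estimates(rewards, values)
print(adv)
```

Here every step has the same rewards-to-go, so the advantages differ only through the baseline: the more reward the critic already expected in a state, the smaller the credit given to the action taken there.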

More sophisticated methods like Generalized Advantage Estimation (GAE) interpolate between high-variance, low-bias estimates and low-variance, high-bias estimates by using a weighted combination of multi-step returns. See the GAE paper for more details.

VI: Importance Sampling and Off-Policy Policy Gradients

Note: In RL literature, “off-policy” typically refers to methods where the behavior policy (generating data) can be arbitrarily different from the target policy (being optimized), for example where transitions from policies thousands of updates old are reused. In this section, what we call “off-policy” should more precisely be called “local off-policy”.

The advantage-weighted policy gradient (V.IV) requires trajectories sampled from the current policy $\pi_\theta$. This creates a fundamental inefficiency: after each gradient update, all previously collected trajectories become “stale” and we must discard them and sample new ones from the updated policy.

For LLMs, where each trajectory requires a full forward pass through a billion-parameter model, this is prohibitively expensive, especially when we need many small gradient steps to train effectively.

We need a way to reuse the same trajectories for multiple gradient updates. Importance sampling provides the mathematical machinery to do exactly this!

Importance Sampling

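Importance sampling is a technique for estimating expectations under one probability distribution using samples drawn from a different distribution. Consider an expectation for distribution $p(x)$:

$$\mathbb{E}_{x \sim p}\big[f(x)\big] = \int p(x) f(x) \, dx$$

We can rewrite this by multiplying and dividing by another distribution $q(x)$ (with $q(x) > 0$ wherever $p(x) > 0$):

$$\mathbb{E}_{x \sim p}\big[f(x)\big] = \int q(x) \frac{p(x)}{q(x)} f(x) \, dx = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$

The ratio $\frac{p(x)}{q(x)}$ is called the importance weight. This identity tells us:

We can now estimate the expectation under $p$ using samples from $q$ as long as we reweight each sample by the ratio of probabilities.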
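A small Monte Carlo experiment shows the identity at work. The sketch below assumes a three-outcome discrete distribution with made-up probabilities and function values; samples are drawn only from $q$, yet the reweighted average approaches the expectation under $p$.

```python
import random

def exact_expectation(p, f):
    """E_{x ~ p}[f(x)] over a finite support."""
    return sum(p_x * f_x for p_x, f_x in zip(p, f))

def importance_sampling_estimate(p, q, f, n, seed=0):
    """Estimate E_{x ~ p}[f(x)] from n samples drawn from q, reweighted by p(x) / q(x)."""
    rng = random.Random(seed)
    xs = rng.choices(range(len(q)), weights=q, k=n)
    return sum((p[x] / q[x]) * f[x] for x in xs) / n

p = [0.7, 0.2, 0.1]   # hypothetical target distribution
q = [0.3, 0.3, 0.4]   # hypothetical sampling distribution
f = [1.0, 5.0, -2.0]  # hypothetical function values

exact = exact_expectation(p, f)
estimate = importance_sampling_estimate(p, q, f, n=100_000)
print(exact, estimate)
```

The estimate gets noisier as $q$ moves away from $p$, because a few samples end up carrying very large weights. That is the same failure mode that shows up later when the new policy drifts far from the old one.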

Applying Importance Sampling to Policy Gradients

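We can apply this technique to the policy gradient setting. The on-policy advantage-weighted gradient (V.IV) is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) A^{\pi_\theta}(s_t, a_t)\right]$$

To apply importance sampling, we work at time-step level rather than trajectory level (full trajectory importance weights have extremely high variance). For a single timestep:

$$\mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_t\big]$$

Using importance sampling with samples from $\pi_{\theta_{\text{old}}}$:

$$\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_t\right]$$

Now we apply the log-derivative identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$, which gives us a surrogate objective whose gradient equals this importance-weighted policy gradient:

$$\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right] = \nabla_\theta L^{\text{CPI}}(\theta)$$

where the importance-weighted surrogate objective, also known as the Conservative Policy Iteration (CPI) objective, is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right]$$

We also define the probability ratio as:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Note that $r_t(\theta_{\text{old}}) = 1$ by construction. Thus, the CPI objective can be written as:

$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta) \hat{A}_t\big]$$

where $\hat{A}_t$ is the estimated advantage at timestep $t$, and $\hat{\mathbb{E}}_t$ denotes the empirical average over a batch of samples collected under $\pi_{\theta_{\text{old}}}$.

This objective has a clear interpretation:

- If $\hat{A}_t > 0$ (action better than average), we want to increase $r_t(\theta)$, i.e., make the new policy more likely to take this action.
- If $\hat{A}_t < 0$ (action worse than average), we want to decrease $r_t(\theta)$, i.e., make the new policy less likely to take this action.

The corresponding sample-based approximation is:

$$L^{\text{CPI}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})} \hat{A}_{i,t}$$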

Off-Policy Learning: Reusing Trajectories

The CPI objective enables off-policy learning: we can sample trajectories from $\pi_{\theta_{\text{old}}}$, store them and then perform multiple gradient updates on $\theta$ using the same batch of data. The typical workflow becomes:

1. Collect: Sample trajectories from the current policy $\pi_{\theta_{\text{old}}}$
2. Compute: Calculate advantages $\hat{A}_t$ and log-probabilities $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
3. Store: Save the trajectories along with their advantages and old log-probabilities
4. Optimize: Perform multiple gradient ascent steps on $L^{\text{CPI}}(\theta)$ using mini-batches from the stored data
5. Repeat: Set $\theta_{\text{old}} \leftarrow \theta$ and return to step 1

This dramatically improves sample efficiency. Instead of discarding trajectories after a single gradient step, we can extract multiple updates from each batch of expensive LLM rollouts.
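The reason the old log-probabilities are stored alongside the batch is that the ratio is computed from them at every optimization step. A minimal sketch of the surrogate on a stored batch, with made-up log-probabilities and advantages and a function name of my own:

```python
import math

def cpi_objective(new_logps, old_logps, advantages):
    """Mean of r_t(theta) * A_hat_t, with r_t = exp(log pi_theta - log pi_theta_old)."""
    ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

old_logps = [-1.0, -2.0, -0.5]  # hypothetical values stored when the batch was collected
advantages = [0.5, -0.2, 1.0]   # hypothetical stored advantage estimates

# Before any update the ratio is 1 everywhere, so the objective is the mean advantage.
at_old = cpi_objective(old_logps, old_logps, advantages)
# After an update that raises log-probs where A > 0 and lowers them where A < 0.
after_update = cpi_objective([-0.8, -2.3, -0.4], old_logps, advantages)
print(at_old, after_update)
```

Working with differences of log-probabilities and exponentiating once is the usual way to form the ratio, since the raw sequence probabilities are far too small to divide directly.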

The Instability Problem

While the CPI objective improves sample efficiency, unconstrained optimization of $L^{\text{CPI}}(\theta)$ is unstable. The core issue is that importance sampling becomes unreliable when $\pi_\theta$ drifts far from $\pi_{\theta_{\text{old}}}$:

Extreme probability ratios: The ratio $r_t(\theta)$ can become arbitrarily large or small, destabilizing gradient estimates.
Stale advantages: The estimates $\hat{A}_t$ were computed under $\pi_{\theta_{\text{old}}}$ and become inaccurate as $\pi_\theta$ diverges. The optimizer may exploit these stale estimates, making updates that appear beneficial but are actually harmful.

In practice, unconstrained maximization of $L^{\text{CPI}}(\theta)$ often leads to excessively large policy updates that cause catastrophic performance collapse.

LLM Context (from Umar Jamil): Suppose we have a trajectory where the model generated “Shanghai is in China” with high advantage. Unconstrained optimization might dramatically upweight “China” as the next token given “Shanghai is in”, but this could simultaneously cause unintended probability shifts elsewhere, perhaps making the model overly likely to say “China” in completely unrelated contexts, or disrupting the probability mass across the entire vocabulary in unpredictable ways.

We need a mechanism that keeps $\pi_\theta$ from deviating too far from $\pi_{\theta_{\text{old}}}$, so that the ratio $r_t(\theta)$ stays close to 1 while still allowing meaningful policy improvement.

VII: Trust Region Policy Optimization (TRPO)

The CPI objective is attractive because it lets us reuse data via importance ratios, but unconstrained optimization is unstable. When $\pi_\theta$ drifts far from $\pi_{\theta_{\text{old}}}$, the probability ratios become extreme and the advantage estimates become stale and can be exploited by the optimizer.

The key insight of Trust Region Policy Optimization (TRPO) is that the surrogate objective $L^{\text{CPI}}(\theta)$ is only a valid approximation to the true objective within a local neighborhood of $\theta_{\text{old}}$. The TRPO paper formalizes this by proving that policy performance is guaranteed to improve as long as the KL divergence between consecutive policies remains bounded. This theoretical result motivates constraining the policy update to stay within a “trust region” where the surrogate objective remains reliable. See the TRPO paper for the formal proof.

TRPO converts this insight into a constrained optimization problem:

$$\max_\theta \; \hat{\mathbb{E}}_t\big[r_t(\theta) \hat{A}_t\big] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\Big[D_{\text{KL}}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big]\Big] \leq \delta \tag{VII.I}$$

The hyperparameter $\delta$ defines the trust region size, the maximum allowed divergence between consecutive policies. This constraint ensures that $r_t(\theta)$ remains close to 1, keeping our importance-weighted estimates reliable.
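To get a feel for what the constraint measures, the sketch below computes the KL divergence between an old next-token distribution and two candidate updates, and checks each against a trust region size. The distributions and the value `delta = 0.01` are hypothetical, and this only evaluates the constraint for a single state; it is not TRPO's actual constrained solver.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two categorical distributions over the same support."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

pi_old = [0.5, 0.3, 0.2]        # hypothetical old policy over three tokens
small_step = [0.48, 0.31, 0.21] # a candidate policy close to pi_old
large_step = [0.1, 0.1, 0.8]    # a candidate policy far from pi_old
delta = 0.01                    # hypothetical trust region size

for name, pi_new in (("small", small_step), ("large", large_step)):
    kl = kl_divergence(pi_old, pi_new)
    print(name, kl, kl <= delta)
```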

Solving (VII.I) requires second-order optimization. TRPO approximates the objective linearly and the KL constraint quadratically (using the Fisher Information Matrix) and then solves the resulting problem via the conjugate gradient algorithm followed by a line search to ensure constraints are satisfied.

For large-scale LLM training, this approach is impractical:

Computational overhead: Each policy update requires multiple conjugate gradient iterations and line search steps, significantly more expensive than standard gradient descent.
Memory requirements: Computing Fisher-vector products adds substantial memory overhead for billion-parameter models.

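The theory behind TRPO also suggests using a KL penalty rather than a hard constraint, which is easier to implement and more computationally efficient:

$$\max_\theta \; \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t - \beta\, D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big) \right]$$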

However, choosing a penalty coefficient that works across different problems or even across different training stages is notoriously difficult. This motivates Proximal Policy Optimization (PPO): a first-order method that achieves TRPO’s stability through a clipped surrogate objective rather than explicit constraints.

VIII: Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) aims for TRPO-like stability using only first-order optimization. Instead of explicitly constraining the KL divergence, PPO modifies the objective function itself to discourage large policy updates through a clipping mechanism. It implicitly limits how far the policy can move, providing a “soft” trust region using only standard gradient descent.

Clipped Surrogate Objective

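Recall the CPI objective and the probability ratio from Section VI:

$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The problem with $L^{\text{CPI}}$ is that nothing prevents $r_t(\theta)$ from becoming arbitrarily large or small. PPO addresses this by clipping the probability ratio to stay within $[1-\epsilon,\, 1+\epsilon]$:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\Big( r_t(\theta)\, \hat{A}_t,\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right] \tag{VIII.I}$$

where $\epsilon$ is a hyperparameter ($\epsilon = 0.2$ in the PPO paper) and the clip function is defined as:

$$\text{clip}(x, a, b) = \begin{cases} a & \text{if } x < a \\ x & \text{if } a \le x \le b \\ b & \text{if } x > b \end{cases}$$

The $\min$ operator in (VIII.I) is important. It ensures we take the more pessimistic (lower) estimate between the clipped and unclipped objectives. This creates different behavior depending on the sign of the advantage: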

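Case 1: Positive Advantage ($\hat{A}_t > 0$)

When an action is better than average, we want to increase its probability, which means increasing $r_t(\theta)$. The objective becomes:

$$\min\big( r_t(\theta),\, 1+\epsilon \big)\, \hat{A}_t$$

If $r_t(\theta) \le 1+\epsilon$: The objective is $r_t(\theta)\, \hat{A}_t$, so gradient ascent increases $r_t(\theta)$
If $r_t(\theta) > 1+\epsilon$: The objective becomes $(1+\epsilon)\, \hat{A}_t$, which is constant in $\theta$ (zero gradient)

The clipping removes the incentive to increase $r_t(\theta)$ beyond $1+\epsilon$.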

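Case 2: Negative Advantage ($\hat{A}_t < 0$)

When an action is worse than average, we want to decrease its probability, which means decreasing $r_t(\theta)$. Since $\hat{A}_t < 0$, multiplying by a smaller $r_t(\theta)$ makes the product less negative (larger). The objective becomes:

$$\max\big( r_t(\theta),\, 1-\epsilon \big)\, \hat{A}_t$$

(The $\min$ with negative values becomes a $\max$ in terms of which $r_t(\theta)$ is selected.)

If $r_t(\theta) \ge 1-\epsilon$: The objective is $r_t(\theta)\, \hat{A}_t$, so gradient ascent decreases $r_t(\theta)$
If $r_t(\theta) < 1-\epsilon$: The objective becomes $(1-\epsilon)\, \hat{A}_t$, which is constant in $\theta$ (zero gradient)

The clipping removes the incentive to decrease $r_t(\theta)$ beyond $1-\epsilon$.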

The takeaway here is that PPO provides a pessimistic lower bound on $L^{\text{CPI}}$. We ignore updates when they would make things “too good to be true.”
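
To make the two cases concrete, here is a minimal per-sample sketch of the clipped objective in plain Python. The function name and the example ratios and advantages are made up for illustration; a real implementation would operate on batched tensors and average over timesteps.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the objective stops growing once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.1, advantage=2.0))   # inside the clip range: r * A
print(clipped_surrogate(1.5, advantage=2.0))   # capped at (1 + eps) * A

# Negative advantage: the objective stops growing once the ratio drops below 1 - eps.
print(clipped_surrogate(0.9, advantage=-2.0))  # inside the clip range: r * A
print(clipped_surrogate(0.5, advantage=-2.0))  # capped at (1 - eps) * A

# The pessimistic min still penalizes moves in the wrong direction.
print(clipped_surrogate(1.5, advantage=-2.0))  # the unclipped term r * A is selected
```

Note the last call: clipping only removes the reward for moving too far in the helpful direction. A ratio that moved the wrong way is still penalized in full.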

LLM Context (from Umar Jamil’s video): In language model fine-tuning, the policy $\pi_\theta(a_t \mid s_t)$ is the probability the model assigns to token $a_t$ given the context $s_t$ (prompt + previously generated tokens). The probability ratio $r_t(\theta)$ measures how much more or less likely the fine-tuned model is to generate a particular token compared to the old policy $\pi_{\theta_{\text{old}}}$ that sampled it. Clipping removes the incentive to change any single token’s probability by more than a factor of roughly $1 \pm \epsilon$ in a single update iteration, preventing the model from “overreacting” to high-advantage tokens.

PPO Objective

In practice, PPO combines the clipped policy objective with two additional terms:

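1. Value Function Loss ($L^{\text{VF}}$): Recall from Section V that we need a value function $V_\theta(s_t)$ to compute advantage estimates. The value function is trained to minimize the squared error between its predictions and the actual returns:

$$L_t^{\text{VF}}(\theta) = \left( V_\theta(s_t) - V_t^{\text{targ}} \right)^2$$

where $V_t^{\text{targ}}$ is typically the discounted return-to-go. When the policy and value function share parameters (common in LLM fine-tuning where both use the same transformer backbone), this loss is subtracted from the objective (hence the negative sign, since we maximize $L^{\text{CLIP}}$ but minimize $L^{\text{VF}}$).

2. Entropy Bonus ($S[\pi_\theta]$): To encourage exploration and prevent premature convergence to deterministic policies, PPO adds an entropy bonus:

$$S[\pi_\theta](s_t) = -\sum_{a} \pi_\theta(a \mid s_t)\, \log \pi_\theta(a \mid s_t)$$

Putting the three terms together:

$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$

Here, the coefficients $c_1$ and $c_2$ control the regularization strength.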

IX: Complete PPO Objective with KL Penalty

When fine-tuning an LLM with “vanilla” PPO, the policy learns to maximize rewards from the reward model. However, the reward model is an imperfect proxy for human preferences. It is a neural network trained on limited data that can be exploited. Without constraints, the policy may discover adversarial outputs that achieve high reward scores while producing text that:

Degenerates into repetitive or nonsensical patterns that “fool” the reward model
Drifts far from natural language, losing fluency and coherence
Exploits spurious correlations learned by the reward model

This phenomenon is called reward hacking. The policy finds a way to “game” the reward model rather than genuinely improving response quality.

To prevent reward hacking, the InstructGPT paper adds a KL divergence penalty that regularizes the policy to stay close to a reference model (typically the SFT model before RL fine-tuning).

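From Section VIII, the PPO objective (to be maximized via gradient ascent) consists of three terms:

$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$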

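Now, we don’t use raw reward model scores directly. Instead, we define a KL-penalized reward that regularizes the policy to stay close to a reference model $\pi_{\text{ref}}$:

$$\tilde{R}_t = R_t - \beta\, D_{\text{KL}}\big( \pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\text{ref}}(\cdot \mid s_t) \big)$$

where:
- $R_t$ is the reward signal at timestep $t$
- $\beta$ is the KL penalty coefficient
- $\pi_{\text{ref}}$ is the frozen reference model

At each token position, the KL divergence simplifies to:

$$D_{\text{KL}}\big( \pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\text{ref}}(\cdot \mid s_t) \big) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\left[ \log \frac{\pi_\theta(a \mid s_t)}{\pi_{\text{ref}}(a \mid s_t)} \right]$$

In practice we estimate this expectation with the sampled token $a_t$, yielding:

$$\tilde{R}_t = R_t - \beta\, \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$

Note that the reward model produces a single scalar $R_{\text{RM}}(x, y)$ for the complete response $y$ to a prompt $x$. This score is assigned only at the final token $T$ (so $R_t = 0$ for $t < T$), while the KL penalty applies at every token.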
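
The per-token reward shaping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the InstructGPT implementation: the function name, the value of `beta`, and the example log-probabilities are all hypothetical.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, reward_score, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) at every token,
    plus the scalar reward-model score added at the final token only."""
    if len(logprobs) != len(ref_logprobs) or not logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += reward_score  # the reward model scores the whole response once
    return rewards


# Hypothetical log-probs of the sampled tokens under the policy and the reference
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.5, -1.0]
rewards = kl_penalized_rewards(policy_lp, ref_lp, reward_score=1.5, beta=0.1)
print(rewards)
```

Tokens the policy makes more likely than the reference get a small negative reward, tokens it makes less likely get a small positive one, and only the last entry carries the reward model score.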

The KL penalty serves two purposes:
1. Prevents reward hacking: The policy cannot drift arbitrarily far from natural language
2. Maintains fluency: Outputs remain similar in distribution to the well-trained SFT model

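It modifies the advantage estimates used in PPO through the modified per-token rewards. A common alternative with the same intent (and a simpler implementation) is to add the KL term directly to the objective. The two are closely related but not identical, since they lead to different gradient estimators. The PPO objective with KL penalty is:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ R_{\text{RM}}(x, y) \big] - \beta\, D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big)$$

The first term is exactly what vanilla PPO optimizes using the clipped surrogate. The KL penalty term appears as a separate additive component that penalizes divergence from the reference model. Substituting the PPO clipped surrogate for the first term:

$$J(\theta) = L^{\text{CLIP}}(\theta) - \beta\, D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big)$$

Combining all components, the complete PPO objective with KL penalty (to be maximized) is:

$$L^{\text{RLHF}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) - \beta\, D_{\text{KL}}\big( \pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\text{ref}}(\cdot \mid s_t) \big) \right]$$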

Here, each term serves a distinct purpose:

Term	Role
Policy Objective	Improves the policy while preventing destructive updates via clipping
Value Loss	Trains the critic for accurate advantage estimation (subtracted to minimize)
Entropy Bonus	Encourages exploration, prevents premature convergence
KL Penalty	Prevents reward hacking, maintains language quality (subtracted to penalize drift)

It is important to distinguish the two anchoring mechanisms in the complete loss, since only one of them is an explicit KL term. The PPO clipping mechanism acts as a short-term anchor that constrains how much the policy can change in a single update, while the KL penalty is a long-term anchor that constrains how far the policy can drift from its starting point across all of training.

Finally done…

And that’s the full derivation! What I find satisfying is that every term in the final loss has a specific purpose. Each one exists because we ran into a specific problem along the way and needed to fix it. I will admit it was not easy to understand all the math and concepts behind the loss. I still do not fully understand every detail but I understand it far better than I did a few days ago.

I hope this was useful. If you spot any errors in the derivation (which I’m sure there are) or have suggestions, feel free to reach out.

References

Video:
- Umar Jamil’s video on RLHF and PPO: A comprehensive and must-watch video covering RLHF and PPO concepts.
Papers:
- Proximal Policy Optimization Algorithms: The foundational PPO paper introducing the clipped surrogate objective.
- Training language models to follow instructions with human feedback: The InstructGPT paper demonstrating PPO with KL penalty to mitigate reward hacking in LLM fine-tuning.
- Trust Region Policy Optimization: The TRPO paper that motivates the trust region constraints used in PPO.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation: GAE paper introducing the exponentially-weighted advantage estimator for variance reduction in policy gradients.

What I Learned Building SFT from the Ground Up

Wed, 03 Dec 2025 00:00:00 GMT

Over the past few weeks, I implemented supervised fine-tuning (SFT) from scratch, continuing a series of projects where I’m building foundational LLM components from the ground up as a learning exercise. Previously, I implemented GPT-2 from scratch and wrote LLM inference scripts. Naturally, SFT was the next step in this series.

One thing I realized pretty quickly: writing the training scripts from scratch is not the most difficult part. Making them actually work and produce results that seem reasonable is where the real challenge begins 😅. You run into all sorts of difficulties: debugging annoying errors, dealing with gradient instabilities, getting vLLM to cooperate for intermediate evaluation (especially with limited GPU memory), etc. These are the things that eat up your time but teach you the most.

In this post, I want to share not just what I built, but the building and debugging journey that got me there.

What I Built

I loosely followed Stanford’s CS336 Assignment 5 as a guide, wrote all the SFT core components, and ran two sets of experiments:

1. Reasoning SFT: Fine-tuned Qwen2.5-Math-1.5B on math reasoning traces to improve step-by-step problem solving capabilities.

Best: 53.4% reward accuracy (up from 2.9% baseline) with 99.3% format accuracy

2. Instruction SFT: Fine-tuned Llama-3.1-8B on UltraChat-200K + SafetyLlama for general instruction following and safety.

Best: GSM8K 16->33%, Safety 62->78%, AlpacaEval 1.6->5.3%, MMLU ~58%

All experiment code, training scripts, and detailed notes are available in my building-from-scratch repo.

Part 1: Reasoning SFT with Qwen2.5-Math-1.5B

The idea behind reasoning SFT is simple. You take a base model that barely outputs correct answers, show it high-quality examples of how to solve problems step-by-step, and train it to mimic that reasoning process. The model learns to think in a structured format, first generating its reasoning inside <think> tags and then outputting the final answer in <answer> tags.

My starting point was Qwen2.5-Math-1.5B, which had quite poor baseline accuracies on the math validation set: ~2.9% for answers and ~14% for format.

Creating the Dataset: First Challenge

The original CS336 MATH dataset used for SFT training is not publicly available, so I had to create my own. My dataset creation pipeline had three steps:

Source problems: I used the hiyouga/math12k dataset to create the training set, carefully filtering out any problems that appeared in the validation set to avoid data leakage.
Generate reasoning traces: The next and most important step is to generate the reasoning traces for each problem. I used the gpt-oss-120b model to generate them via the Fireworks Batch Inference API. It cost me around $4 to generate the reasoning traces.
Filter for quality: I also created a subset of around 3.6K examples by filtering out the reasoning traces that led to wrong answers.

The Training Loop: Per-Token vs. Sequence Loss

The original assignment uses sequence level loss normalization where you sum the loss over all tokens in a sequence and normalize by a constant, not by the variable number of tokens.

While running the initial experiments, I noticed the gradient norms were really large and training felt unstable. Even though the loss seemed to be going in the right direction, something didn’t feel right. After some investigation, I realized the issue: with variable-length sequences (my training examples ranged from short to quite long), the summed loss grows with sequence length while the normalizing constant stays fixed, so longer sequences contribute more to the gradient than shorter ones. This creates high variance in gradient updates.

Left: Sequence-level loss (high variance gradients) | Right: Per-token loss (stable gradients)

Thus, I added a per_token_loss flag to my training step which, when enabled, normalizes the loss by the actual number of response tokens in each sequence. The accuracy gain was small (0.5204 vs. 0.5106 reward accuracy), but more importantly, the gradients became much more stable with per-token normalization.

Run	Loss Normalization	Reward Accuracy
run_filtered	Per-token	0.5204
run_filtered-res-len	Sequence-level	0.5106
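
To see why the two normalizations behave so differently, here is a toy sketch in plain Python. The helper names and the numbers are hypothetical and only illustrate the scaling. The real training step works on masked token-level losses in PyTorch.

```python
def sequence_level_loss(token_losses, normalize_constant):
    """Sum the response-token losses and divide by a fixed constant."""
    return sum(token_losses) / normalize_constant


def per_token_loss(token_losses):
    """Average over the actual number of response tokens in the sequence."""
    return sum(token_losses) / len(token_losses)


short = [2.0] * 10     # 10 response tokens, each with loss 2.0
long = [2.0] * 1000    # 1000 response tokens, same per-token loss

# With a fixed constant, the long sequence's loss (and gradient) is 100x larger.
print(sequence_level_loss(short, 1000.0), sequence_level_loss(long, 1000.0))
# With per-token normalization, both sequences contribute at the same scale.
print(per_token_loss(short), per_token_loss(long))
```

Both sequences have the same per-token quality, yet the fixed-constant version weights them by length, which is the source of the gradient-norm swings.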

vLLM Integration: The Debugging Nightmare

Here’s where things got really tricky and painful. I wanted to run intermediate evaluations during training using vLLM for fast inference. The assignment provided code for this but it was written for an older vLLM version and nothing worked out of the box 😅.

Problem 1: vLLM initialization changed

The assignment’s approach used a separate GPU dedicated to running vLLM as an inference server. I wasn’t keen on this setup anyway as it meant paying for an extra GPU just for inference. But more importantly, the approach broke completely with the vLLM version I was using (0.7+). The initialization logic had changed, and the old code just wouldn’t run.

Solution: I switched to the colocate approach, running vLLM on the same device as the training model. I came across this in the excellent HuggingFace blog post on co-located vLLM. This requires being more careful about GPU memory (setting appropriate values for gpu_memory_utilization, max_model_len, and max_num_seqs), but it actually works and saves on GPU costs.

Problem 2: Missing model_executor attribute

When I tried to load updated model weights into the vLLM instance during training, I hit this error:

AttributeError: 'LLMEngine' object has no attribute 'model_executor'

This was really annoying because the attribute clearly existed in the vLLM source code. After much debugging, I found two solutions:
- Downgrade to vLLM 0.10.2, or
- If using vLLM 0.11.0, set the environment variable VLLM_ENABLE_V1_MULTIPROCESSING=0 at the start of the script

I went with the environment variable approach since I didn’t want to deal with version conflicts.
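
In a training script, that could look like the sketch below. Placing it before the vLLM import is my assumption about the safest ordering. The source only says to set it at the start of the script.

```python
import os

# Set this at the very top of the script, before vLLM is imported or an engine is created
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

# import vllm  # import only after the variable is set
```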

Problem 3: The _orig_mod issue

With torch.compile enabled on my model (for faster training), loading weights into vLLM failed with the below error. The issue is that torch.compile wraps the original model and stores the actual weights under _orig_mod. When loading weights into vLLM, you need to access them through this attribute, not directly from the compiled model.

ValueError: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM

Solution: In my load_policy_into_vllm_instance function, I made sure to load from model._orig_mod when the model is compiled.
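
A minimal sketch of the idea, not the actual load_policy_into_vllm_instance code: unwrap the compiled model if it has an `_orig_mod` attribute, or strip the `_orig_mod.` prefix from parameter names. The helper names and the dummy classes are hypothetical stand-ins so the snippet runs without torch installed.

```python
def unwrap_compiled(model):
    """Return the original module if `model` was wrapped by torch.compile."""
    return getattr(model, "_orig_mod", model)


def strip_compile_prefix(name: str, prefix: str = "_orig_mod.") -> str:
    """Drop the wrapper prefix from a parameter name, if present."""
    return name[len(prefix):] if name.startswith(prefix) else name


# Stand-ins for a real model and its compiled wrapper
class DummyModel:
    pass


class DummyCompiled:
    def __init__(self, orig):
        self._orig_mod = orig


base = DummyModel()
print(unwrap_compiled(DummyCompiled(base)) is base)  # compiled wrapper is unwrapped
print(unwrap_compiled(base) is base)                 # plain model passes through
print(strip_compile_prefix("_orig_mod.model.embed_tokens.weight"))
```

Using `getattr` with a default keeps the same loading code working whether or not torch.compile is enabled.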

These three issues cost me almost a day. However, it was worth it because I learned a lot about vLLM and how to integrate it into a training run.

Results

After all that debugging, here’s what the training runs achieved:

Reasoning SFT Results

Run	Training Data	Reward Accuracy	Format Accuracy
baseline	-	0.0288	0.1438
run_all	Full 4.8K (correct + incorrect)	0.4214	0.9924
run_filtered	Filtered 3.6K (correct only)	0.5204	0.9906
run_filtered-2epoch	Filtered 3.6K (2 epochs)	0.5336	0.9926

Key takeaways:
- Filtering out incorrect reasoning traces boosted accuracy from 42% to 52%. Training on wrong traces teaches the model wrong patterns.
- The model quickly learned the output format (99%+ format accuracy after training).
- Running for 2 epochs gave a further boost in accuracy, though a marginal one.

Part 2: Instruction SFT with Llama-3.1-8B

With the reasoning SFT working, I moved on to the second part: instruction fine-tuning. This loosely follows the CS336 Supplementary Assignment 5, where the goal is to build a model that can follow diverse instructions and refuse harmful requests.

Unlike reasoning SFT, instruction fine-tuning uses conversational instruction-response pairs. The training data combines UltraChat-200K (diverse multi-turn conversations) and SafetyLlama (safety-focused examples) totaling around 200K examples, formatted using the Alpaca prompt template.

For evaluation, I used four benchmarks as specified in the assignment:
- GSM8K: Grade-school math problems (tests math reasoning)
- MMLU: Multiple-choice questions across 57 subjects (tests factual knowledge)
- AlpacaEval: Open-ended instructions judged by LLM-as-judge (tests instruction-following quality)
- Simple Safety Tests (SST): Harmful prompts to test refusal behavior (tests safety)

The Prompt Masking Implementation Problem

I wanted to experiment with prompt masking i.e. masking prompt tokens (labels = -100) so the loss is computed only on response tokens, helping the model focus on generating good responses.

Problem 1: BPE tokenization boundary issues

Implementing this led to an interesting debugging session. When I tokenized the prompt separately (ending with "### Response:\n") and compared it to the tokens in the full sequence (prompt + response), the boundary tokens didn’t match. This is a known issue of BPE tokenization: subword merging behavior changes based on context.

My first instinct was to try to implement complex boundary detection logic. However, I thought let’s try the simplest fix that works.

Solution: I decided to drop the last token from the prompt before masking. This is a bit of a quick fix: I might train on one extra formatting token (likely just a newline), but I will never accidentally mask response tokens.

# Conservative fix: drop last prompt token to avoid boundary issues
prompt_length = len(prompt_tokens) - 1
labels[:prompt_length] = -100

Problem 2: Very short or empty responses

Another issue I ran into with prompt masking: some training examples had very short or empty responses. When the prompt tokens are masked and only a few (or zero) response tokens remain, the loss is averaged over very few tokens (or divided by zero), which can produce extreme values or NaNs.

Solution: The fix was simple. I filtered out examples with very short responses (0-2 words) from both training and validation sets.
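
As a rough sketch, such a filter could look like this. The `response` field name and the example records are assumptions for illustration. The threshold of three words mirrors the 0-2 word cutoff mentioned above.

```python
def has_enough_response(example: dict, min_words: int = 3) -> bool:
    """Keep an example only if its response has at least `min_words` words."""
    return len(example.get("response", "").split()) >= min_words


examples = [
    {"prompt": "Say hi", "response": ""},
    {"prompt": "Capital of France?", "response": "Paris."},
    {"prompt": "Explain BPE", "response": "BPE merges frequent byte pairs into tokens."},
]
kept = [ex for ex in examples if has_enough_response(ex)]
print(len(kept), "of", len(examples), "examples kept")
```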

Setting Up AlpacaEval

A quick note on the AlpacaEval evaluation setup. It uses an LLM-as-judge approach where an annotator model compares outputs from your model against GPT-4 reference responses.

The assignment suggested deploying Llama-3.3-70B-Instruct locally as the annotator, but that requires at least two GPUs, which is not cost effective (at least in my case). Instead, I used Llama-3.3-70B-Instruct via the Fireworks API. This required some config tweaking (API key mapping, judge configuration) but works well.

Results and Analysis

I ran two experiments: one with prompt masking (mask) and one without (no-mask).

Instruction Fine-tuning Comparison

Benchmark	Baseline	No-Mask	Mask
GSM8K	16.4%	29.0%	32.7%
MMLU	58.1%	58.4%	58.2%
SST Safety	62.0%	78.0%	77.0%
AlpacaEval	1.57%	5.3%	4.5%

GSM8K (16% -> 29-33%): Both approaches significantly improved math reasoning, but masking helped more (32.7% vs 29.0%).
Safety (62% -> 78%): A big improvement, as expected, since the training data includes SafetyLlama examples.
AlpacaEval (1.6% -> 5.3%): The conversational instruction-following improved substantially. Interestingly, no-mask performed slightly better (5.3% vs 4.5%). My guess: training on the full sequence helps the model learn overall conversational patterns and produce more naturally flowing responses that match the prompt style.
MMLU (~58% -> ~58%): This stayed flat and that’s actually good news. MMLU tests factual knowledge, which is encoded during pre-training. SFT teaches the model how to respond, not what to know. The fact that MMLU didn’t drop means we avoided catastrophic forgetting.

Looking at individual MMLU subjects, some regressed slightly (college math: 33% -> 26%) while others improved slightly, leading to near-zero net change.

Conclusion

While writing the SFT code from scratch, I ran into a lot of debugging challenges. It was at times painstaking and frustrating but was also a valuable learning experience. By debugging, I learned a lot about how things work under the hood, and the whole experience prepares you for how to go about debugging code/projects in the future.

I leave you with some of the debugging tips I came across:

vLLM OOM: Tune max_model_len, max_num_seqs, and gpu_memory_utilization and start conservative.
Per-token loss: Normalize by response token count to prevent long sequences from dominating gradients.
torch.compile + vLLM: Access weights via model._orig_mod when loading into vLLM.
BPE boundaries: Drop last prompt token before masking to avoid tokenization edge cases.
Data quality matters: Filtering incorrect traces gave me a 10-point accuracy boost (42% to 52%).
vLLM version issues: Set VLLM_ENABLE_V1_MULTIPROCESSING=0 if model_executor is missing.

Resources

I have made all the code, datasets, and model checkpoints publicly accessible.

Code: building-from-scratch/sft
Datasets: garg-aayush/sft-cs336-assign5-datasets
Checkpoints:
- Reasoning:
  - run_all: qwen-2.5-math-sft-all-2epoch
  - run_filtered: qwen-2.5-math-sft-filtered-2epoch
  - run_filtered-res-len: qwen-2.5-math-sft-filtered-res-len
  - run_filtered-2epoch: qwen-2.5-math-sft-filtered-2epoch
- Instruction:
  - run_mask: llama31-8b-sft-mask
  - run_nomask: llama31-8b-sft-nomask
Training logs: wandb/sft and wandb/sft_instruct

A Guide to Building Custom Nodes in ComfyUI

Wed, 10 Sep 2025 00:00:00 GMT

ComfyUI is by far my favorite open-source software right now. Its intuitive node-based interface has transformed the way we build AI image and video generation workflows.

What I really appreciate about ComfyUI is its flexibility. You can easily extend it with your own custom nodes. Here, I’ll show you how to create custom nodes that let you add exactly the tools you need. I’ll use parts from my Svg2Raster nodes as the running example for this purpose.

Svg2Raster

ComfyUI does not natively support vector graphics like SVGs. I often work with them and needed lightweight nodes to load (SVG->JPEGs/PNGs) and manipulate SVGs in ComfyUI.

Thus, I built Svg2Raster, a small custom node package that makes it easy to use SVGs with other nodes.

Writing your Custom ComfyUI Node

So here I am assuming that you are fairly comfortable using ComfyUI and you already have ComfyUI installed locally on your system/cloud instance.

Step 1: Validate the core logic first

I prefer not to start with the ComfyUI node API. First, I like to write a simple Python notebook to test the functionalities I actually need. This validates your core code logic and packages in isolation.

In my case, I needed a way to read, rasterize and manipulate the SVGs. Thus, I tested all the relevant operations using the core packages CairoSVG and Pillow.

For example:

# Simple SVG read and conversion check
import cairosvg
from PIL import Image, ImageOps
import io

# Read SVG file
with open('logo.svg', 'r', encoding='utf-8') as f:
    svg_text = f.read()

# Basic conversion to PNG
img_bytes = cairosvg.svg2png(bytestring=svg_text.encode('utf-8'), 
                              output_width=600)
img = Image.open(io.BytesIO(img_bytes)).convert('RGBA')
print(f"Image size: {img.size}")

If you want to see all the code snippets (width/height controls, color and border manipulations etc.), please check out the full notebook.

Once you have a working standalone script or code, wrapping it as a ComfyUI node is mostly boilerplate.

Step 2: Understand the Anatomy of a Custom Node

Every ComfyUI node is a Python class with specific methods that ComfyUI expects. Here are the essential components:

Component	Description
`INPUT_TYPES`	What inputs your node accepts
`RETURN_TYPES`	What it outputs to other nodes
`RETURN_NAMES`	Optional labels for outputs
`FUNCTION`	The method name that runs your logic
`CATEGORY`	Where it appears in ComfyUI’s node menu

ComfyUI handles the rest: the UI, connections, and execution order. For example, this is what a simple custom node class looks like:

class LoadSVG:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "svg_file": ("STRING", {"default": "file.svg"}),
            }
        }
    
    RETURN_TYPES = ("IMAGE", "STRING")
    RETURN_NAMES = ("image", "svg_text")
    FUNCTION = "load_svg"
    CATEGORY = "Svg2Raster"
    
    def load_svg(self, svg_file):
        # Your actual logic here
        return (image_tensor, svg_text)

This is a minimal ComfyUI node class explanation that I believe is good enough to start writing your own nodes. If you want more details, check the official ComfyUI custom node documentation.

Step 3: Implementing the LoadSVGImage Node

First, I set up the file structure in ComfyUI’s custom_nodes folder for my nodes package:

cd ComfyUI/custom_nodes
mkdir svg2raster
cd svg2raster

Then I create two essential files:

__init__.py: it allows ComfyUI to import your custom nodes.

from .svg2raster_node import *

__all__ = [ "NODE_CLASS_MAPPINGS",
            "NODE_DISPLAY_NAME_MAPPINGS"]

svg2raster_node.py: this is where the actual nodes code is written.

You can find the complete code for these nodes here: svg2raster_node.py. Here’s the boilerplate structure of LoadSVGImage node:

import os
import folder_paths  # ComfyUI helper module for locating the input directory

class LoadSVGImage:
    @classmethod
    def INPUT_TYPES(cls):
        """Define what inputs this node accepts"""
        # Use `folder_paths` to access ComfyUI's input directory
        input_dir = folder_paths.get_input_directory()
        files = [f for f in os.listdir(input_dir) if os.path.isfile(os.path.join(input_dir, f)) and f.lower().endswith('.svg')]
        return {
            "required": {
                "svg": (sorted(files), {"image_upload": True}),
            }
        }
    
    # Output configuration
    RETURN_TYPES = ("STRING", "IMAGE")
    RETURN_NAMES = ("svg_text", "preview_image")
    FUNCTION = "load_svg"  # Method name to execute
    CATEGORY = "FromSVG/Tools"  # Menu location
    
    def load_svg(self, svg, background="#FFFFFF"):
        """Main execution method - does the actual work"""
        # Your logic here
        return (svg_text, image_tensor)
    
    @classmethod
    def IS_CHANGED(cls, svg):
        # Returns file hash or modification time
        pass
    
    @classmethod
    def VALIDATE_INPUTS(cls, svg):
        # Check if file exists, return error string if invalid
        return True

Here, the helper methods serve crucial purposes:
- IS_CHANGED: Tells ComfyUI when to re-execute the node
- VALIDATE_INPUTS: Prevents crashes by validating inputs before execution
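
For IS_CHANGED, one option is to return a hash of the file contents, so the returned value changes exactly when the file does. The sketch below is an assumption about how that could be done, not the code from the repo. `file_sha256` is a hypothetical helper name.

```python
import hashlib


def file_sha256(path: str) -> str:
    """Hex digest of the file contents; it changes whenever the file changes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Inside the node class, IS_CHANGED could then look like this (sketch):
# @classmethod
# def IS_CHANGED(cls, svg):
#     return file_sha256(os.path.join(folder_paths.get_input_directory(), svg))
```

A modification time would be cheaper to compute, but a content hash also catches a file that was replaced with the same timestamp.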

ComfyUI expects images as tensors in BHWC format (batch, height, width, channels) with values normalized to 0-1. Thus, you need a PIL-to-tensor conversion function.

def _pil_to_tensor(pil_img: Image.Image):
    """Convert PIL image to ComfyUI IMAGE tensor: (B, H, W, C) in [0,1]"""
    # conversion logic
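
The array-level part of that conversion can be sketched with NumPy alone. This is an illustration, not the repo's implementation: `to_bhwc_float` is a hypothetical helper, and in the actual node you would get the pixels from `np.asarray(pil_img.convert("RGB"))` and wrap the result with `torch.from_numpy(...)` to obtain the tensor.

```python
import numpy as np


def to_bhwc_float(pixels: np.ndarray) -> np.ndarray:
    """(H, W, C) uint8 pixels -> (1, H, W, C) float32 array scaled to [0, 1]."""
    arr = pixels.astype(np.float32) / 255.0
    return arr[None, ...]  # add the batch dimension


# A 2x3 white RGB image standing in for np.asarray(pil_img.convert("RGB"))
pixels = np.full((2, 3, 3), 255, dtype=np.uint8)
batch = to_bhwc_float(pixels)
print(batch.shape, batch.dtype, float(batch.max()))
```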

Finally, you need the mappings for ComfyUI to discover your nodes:

NODE_CLASS_MAPPINGS = {
    "LoadSVGImage": LoadSVGImage,
}
NODE_DISPLAY_NAME_MAPPINGS = {
    "LoadSVGImage": "Load SVG Image",
}

Without these mappings, ComfyUI won’t find your nodes even if the code is correct.

Similarly, I also wrote a RasterizeSVG class for manipulating the loaded SVG. It takes the SVG text from LoadSVGImage and lets you adjust scale, dimensions, borders, and more. You will see the pattern is identical: define inputs, process with CairoSVG/PIL, convert to tensor, return.

That’s how you implement a node. Write a standalone functionality script, wrap it in the ComfyUI class structure, handle the tensor conversions, add the helper methods, and register it with the mappings.

Step 4: Testing Your Custom Node in ComfyUI

Once you are done implementing, testing is straightforward.

Ensure all your dependencies are installed, including CairoSVG.
Restart ComfyUI for it to detect the new node.
Find your nodes under the defined category.

Note: I have also added the installation steps here.

Step 5: Sharing and Publishing Your Node

Once everything worked, I created a GitHub repo for the nodes. You can see how I structured mine in the Svg2Raster repo.

Some essential files in the repo are:

File	Description
README.md	Clear installation instructions and usage examples
requirements.txt	Python dependencies (cairosvg in my case)
pyproject.toml	Required if you plan to publish to ComfyUI Registry
examples/	Optional sample SVG files and workflow JSON files

Note: Having a good Readme and examples makes a huge difference for users trying to understand and use your nodes.

Once your repo is ready, you can even publish it to the ComfyUI Registry. There’s an excellent guide on publishing to ComfyUI Registry - just follow those steps.

I also set up a GitHub Actions workflow that automatically publishes updates to the ComfyUI Registry whenever I push changes to my repo. This ensures the registry always has the latest version. You can check out my workflow file to see how I did it.

Building GPT from Scratch: Following Karpathy’s Tutorial

Mon, 08 Sep 2025 00:00:00 GMT

The Transformer architecture has become the workhorse behind modern LLMs. GPT-2/3/4/5, Llama, Claude, Gemini: they are all built on the same core architecture, or a variant of it, from the 2017 “Attention Is All You Need” paper. I wanted to understand this architecture properly, so I followed Andrej Karpathy’s “Let’s Build GPT from Scratch” video. It’s a 2-hour walkthrough where you start from an empty file and end up with a working Transformer.

As I followed along, I captured each architectural addition as a separate commit. This let me see exactly how each component pulled down the validation loss. In this walkthrough, the training data is ~1M characters of Shakespeare and the goal is to generate Shakespeare-like text.

	Component	Val Loss	Commit
Baseline	Bigram Model	~2.49	`e0b5864`
Update 1	Single Head Self-Attention	~2.4	`7b0e03a`
Update 2	Multi-Head Attention	~2.28	`9d2a7b5`
Update 3	Feed-Forward Network	~2.27	`c4c46ff`
Update 4	Residual Connections	~2.09	`0239c07`
Update 5	Layer Normalization	~2.076	`63ef5f8`
Update 6	Pre-LayerNorm (modern)	~2.076	`4f5bef8`
Update 7	Scaling Up + Dropout	~1.48	`d4141d7`

Loss Curves

You can find all the code and notebooks in the repo: building-from-scratch/basic-gpt

Baseline: Bigram Model (`e0b5864`)

Karpathy starts with the simplest possible language model: a bigram model. It predicts the next character based only on the current character. No context at all. The tokens aren’t talking to each other.

This still works somewhat because some characters naturally follow others (the letter ‘q’ is almost always followed by ‘u’). But the output is complete gibberish because the model has no way to look at what came before.
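
To illustrate the "current character only" idea, here is a count-based bigram sketch in plain Python. Note that this is a swapped-in simplification: the model in the video learns a lookup table of logits with gradient descent, whereas this version just counts followers. The function names and the toy text are made up.

```python
from collections import Counter, defaultdict


def fit_bigram_counts(text: str) -> dict:
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts


def most_likely_next(counts: dict, ch: str) -> str:
    """Most frequent follower of `ch`; the prediction ignores all earlier context."""
    return counts[ch].most_common(1)[0][0]


counts = fit_bigram_counts("the queen quietly quit the quest")
print(most_likely_next(counts, "q"))
```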

Result: ~2.49 validation loss.

Update 1: Self-Attention (`7b0e03a`)

We want tokens to communicate with each other and predictions to consider context from previous tokens, not just the current one. A token at position 5 should be able to look at tokens 1-4 and gather information from them. But at the same time, it can’t look at tokens 6, 7, 8 because those are the future we’re trying to predict.

Self-attention solves this. Every token is represented by 3 vectors:
- Query: “What am I looking for?”
- Key: “What do I contain?”
- Value: “If you find me interesting, here’s what I’ll tell you.”

The query dot-products with all the keys. High dot product means high affinity: “I find you interesting.” The values of interesting tokens get aggregated via weighted sum.

Note: Attention is really a communication mechanism. You can think of it as nodes in a directed graph where every node aggregates information from nodes that point to it. In our case, token 5 can receive information from tokens 1-4 (and itself), but not from tokens 6-8. The triangular mask creates this directed structure and is what makes this a “decoder” block.
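
The query/key/value mechanism and the triangular mask can be sketched for a single head in NumPy. This is a minimal illustration under my own assumptions (toy sizes, random weights, and the usual scaling of the scores by the square root of the head size), not the code from the repo, which uses PyTorch.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention for one sequence x of shape (T, C)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # each (T, head_size)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T) query-key affinities
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights                     # weighted sum of values


rng = np.random.default_rng(0)
T, C, head_size = 5, 8, 4    # toy sizes chosen for illustration
x = rng.normal(size=(T, C))
w_q, w_k, w_v = (rng.normal(size=(C, head_size)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))  # lower-triangular, each row sums to 1
```

Row `t` of `weights` has non-zero entries only for positions up to `t`, which is the directed-graph structure described in the note above.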

One subtle but important point: attention has no notion of space. The tokens don’t inherently know where they are in the sequence. That’s why we add positional embeddings. Each position gets its own learned embedding that’s added to the token embedding, giving the model spatial information.

Result: ~2.4 validation loss. Tokens can now see context.

Update 2: Multi-Head Attention (`9d2a7b5`)

Tokens have a lot to talk about. One head might look for consonants, another for vowels, another for word boundaries, another for patterns at specific positions. Having multiple independent communication channels lets the model gather diverse types of data in parallel.

Note: This is similar to grouped convolutions. Instead of one large convolution, you do it in groups. With 4 heads of 8 dimensions each, we get the same total dimensionality (32) but with 4 separate communication channels. Each head can specialize in different patterns.

Multi-Head Attention

Result: ~2.28 validation loss.

Update 3: Feed-Forward Network (`c4c46ff`)

The FFN layer addresses a key problem. Until now, “the tokens looked at each other but didn’t have enough time to think about what they found.”

Self-attention is the communication phase. Tokens gather data from each other. But then they need to compute on that data individually. That’s what the feed-forward network does. It operates on a per-token level. All the tokens process their gathered information independently.

Feed-Forward Network

So the Transformer block becomes: communicate (attention) → compute (feed-forward). This pattern repeats for every layer.

Result: ~2.27 validation loss. The architecture now has both communication and computation.

Update 4: Residual Connections (`0239c07`)

This is one of two optimizations that make deep networks actually trainable. Without it, stacking many layers leads to vanishing gradients and optimization difficulties.

Karpathy visualizes it nicely: imagine a residual pathway running from top to bottom. You can “fork off” from this pathway, do some computation, and project back via addition. The path from inputs to outputs is just a series of additions.

Note: Why does this help? During backpropagation, addition distributes gradients equally to both branches. The gradients “hop” through every addition node directly to the input. This creates a “gradient superhighway” from supervision to input, unimpeded. The residual blocks are initialized to contribute very little at first, then “come online” over time during optimization.
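
A scalar toy model makes the “gradient superhighway” tangible. The sketch below assumes every layer is the simple function f(x) = slope * x, which is a simplification chosen for illustration and not how the real blocks look.

```python
def gradient_through_stack(layer_slope, depth, residual):
    """d(output)/d(input) for `depth` stacked scalar layers f(x) = layer_slope * x.

    Without a residual, each layer multiplies the gradient by layer_slope.
    With a residual (x + f(x)), each layer multiplies it by 1 + layer_slope.
    """
    per_layer = 1.0 + layer_slope if residual else layer_slope
    return per_layer ** depth

slope = 0.01  # a branch initialized to contribute very little
depth = 20
plain = gradient_through_stack(slope, depth, residual=False)
skip = gradient_through_stack(slope, depth, residual=True)
print(plain)  # vanishingly small
print(skip)   # stays close to 1
```

With the addition in place, the gradient reaching the input stays on the order of 1 even when each branch contributes almost nothing, so early layers keep receiving a usable training signal.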

Result: ~2.09 validation loss. Now we can stack layers without vanishing gradients.

Update 5 & 6: Layer Normalization (`63ef5f8`, `4f5bef8`)

Batch normalization normalizes columns (across examples in a batch). Layer normalization normalizes rows (across features for each example). The implementation is almost identical, you just change which dimension you normalize over.
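
The “rows versus columns” difference fits in a few lines. This is a simplified sketch that leaves out the learnable scale and shift as well as batch norm's running statistics; the function names are made up for the example.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each row: across features, independently per example."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """Normalize each column: across examples, independently per feature."""
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 60.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
print([round(sum(r) / len(r), 6) for r in ln])        # row means are ~0
print([round(sum(c) / len(c), 6) for c in zip(*bn)])  # column means are ~0
print(batch_norm([[1.0, 2.0, 3.0]]))                  # a batch of one collapses to zeros
```

The last line hints at why the batch-size independence of layer norm matters: with a single example there is nothing to average over in the batch direction.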

Layer norm has advantages for Transformers:
- No dependency on batch size (works even with batch size 1)
- No running buffers to maintain
- No distinction between training and test time

The original Transformer paper used post-layer norm (normalize after attention/FFN). Modern implementations use pre-layer norm (normalize before). Pre-layer norm creates a cleaner residual pathway since the transformation happens on normalized inputs, leading to more stable training.

Result: ~2.076 validation loss.

Update 7: Scaling Up (`d4141d7`)

With all the architectural pieces in place, Karpathy scales up the architecture:

Parameter	Before	After
Block size (context)	8	256
Embedding dim	32	384
Heads	4	6
Layers	3	6
Dropout	0	0.2

Dropout is added for regularization. It randomly shuts off neurons during training, effectively training an ensemble of sub-networks. At test time, everything is enabled and the sub-networks merge.
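
For intuition, here is a small sketch of dropout on a list of activations. It uses the common “inverted” formulation, where surviving values are rescaled during training so nothing needs to change at test time; the function and variable names are assumptions for this example.

```python
import random

def dropout(values, p, training, rng=random):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, p=0.2, training=True, rng=rng)
eval_out = dropout(acts, p=0.2, training=False)

dropped_fraction = sum(v == 0.0 for v in train_out) / len(acts)
train_mean = sum(train_out) / len(acts)
print(dropped_fraction)  # roughly 0.2
print(train_mean)        # roughly 1.0 thanks to the rescaling
print(eval_out == acts)  # True: nothing is dropped at test time
```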

Result: ~1.48 validation loss. The generated text now looks like Shakespeare (structure, dialogue formatting, character names) even though it’s nonsensical when you actually read it.

How This Compares to GPT-3

	My Model	GPT-3
Parameters	~10M	175B
Dataset	~300K tokens	300B tokens
Architecture	Nearly identical	Nearly identical

The architecture we built is essentially the same as GPT-3. The difference is pure scale: 17,500x more parameters trained on 1 million times more data. By today’s standards, even GPT-3’s 300B tokens is considered modest. Current models train on 1T+ tokens.
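
The two ratios follow directly from the approximate figures in the table, as this quick check shows.

```python
my_params, gpt3_params = 10e6, 175e9   # ~10M vs 175B parameters
my_tokens, gpt3_tokens = 300e3, 300e9  # ~300K vs 300B training tokens

param_ratio = gpt3_params / my_params
token_ratio = gpt3_tokens / my_tokens
print(f"{param_ratio:,.0f}x parameters")  # 17,500x parameters
print(f"{token_ratio:,.0f}x tokens")      # 1,000,000x tokens
```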

This is what makes the Transformer architecture so remarkable. The same fundamental design (attention for communication, feed-forward for computation, residual connections, layer norm) scales from a 10M parameter Shakespeare generator to a 175B parameter model!

Resources

Code: building-from-scratch/basic-gpt
Video: Let’s Build GPT from Scratch by Andrej Karpathy

Key Takeaways from Lecture 1: LLM Evaluation Lifecycle

Tue, 02 Sep 2025 00:00:00 GMT

A couple of months back, I enrolled in AI Evals for Engineers and PMs, a course by Hamel and Shreya. The live cohort ran from July to mid-August, but due to work commitments I couldn’t follow along in real time.

I have now started following it as a self-paced course and plan to write a blog post for each lesson as I progress. This will be my way to capture what I learn and to reflect on the material. In this first blog 🤞, I’ll walk through my key takeaways from the introductory Lecture 1.

Key Takeaways

1. Evaluation isn’t Optional but Fundamental

Anyone who has built or worked with LLM pipelines knows that their outputs are open-ended, subjective, and unstructured (unless you enforce structure). Relying on ad-hoc checks, which I have been guilty of, often leads to knee-jerk fixes. Moreover, ad-hoc checks completely miss the long-term need for continuous tracking, which is essential for improving your pipeline’s reliability and usefulness. This is why evaluation, the systematic measurement of an LLM pipeline’s quality, is critical!

2. The Three Gulfs

The below image beautifully captures and categorizes the challenges associated with any LLM application:

Gulf of Comprehension: This results from a limited understanding of the input data (user queries) and the pipeline’s outputs (behavior). Bridging it requires examining examples to identify common failure modes. This brings its own challenge: “How to manually review every input or output to identify failure modes?”
Gulf of Specification: This refers to the difficulty of translating a user’s high-level intent into unambiguous, precise instructions for the LLM. Bridging it requires writing detailed prompts that capture the “true intent”, which is itself challenging due to the ambiguous nature of natural language.
Gulf of Generalization: This is due to LLMs’ unexpected and inconsistent behavior on new or unusual (out-of-distribution) inputs. Bridging it requires a good understanding of your LLM’s capabilities. This leads to the question: “How to improve the LLM?”

3. Analyze → Measure → Improve Lifecycle

Hamel and Shreya introduced a structured way to bridge the above gulfs: Analyze → Measure → Improve lifecycle.

Analyze → Measure → Improve Lifecycle

However, the most important takeaway for me was not what each phase means but the pitfalls that often derail them:

Phase	Pitfalls	Notes
Analyze	Outsourcing annotation; looking at too few examples and forming shaky hypotheses	This is where you learn the most. Spend ~75–80% of your time here—good analysis sets up everything else.
Measure	Misaligned or poorly designed LLM judges; “overfitting” by testing judges on the same examples used in the judge prompt	In this phase, you need the rigor of data science. NEVER leak test data into judge prompts.
Improve	Prematurely jumping to fixes; defaulting to the most complex solution first (fine-tuning, bigger models)	Start simple. Prompt tweaks and improvements often go a long way before heavier changes are needed.
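
One way to guard against the “Measure” pitfall is to fix disjoint splits of your labeled examples before writing the judge prompt. The sketch below is an assumption about how this could look, not something prescribed in the lecture; the split names and the made-up `labeled` data are hypothetical.

```python
import random

def split_labeled_examples(examples, n_prompt, dev_fraction=0.5, seed=0):
    """Shuffle once, then carve out three disjoint sets.

    prompt_pool: examples allowed inside the judge prompt (few-shot)
    dev:         used while iterating on the judge prompt
    test:        held out, only used to report final judge agreement
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    prompt_pool = shuffled[:n_prompt]
    rest = shuffled[n_prompt:]
    n_dev = int(len(rest) * dev_fraction)
    return prompt_pool, rest[:n_dev], rest[n_dev:]

# Hypothetical human-labeled traces: an id plus a pass/fail label
labeled = [{"id": i, "passed": i % 3 != 0} for i in range(100)]
prompt_pool, dev, test = split_labeled_examples(labeled, n_prompt=6)

prompt_ids = {ex["id"] for ex in prompt_pool}
dev_ids = {ex["id"] for ex in dev}
test_ids = {ex["id"] for ex in test}
print(len(prompt_pool), len(dev), len(test))  # 6 47 47
print(prompt_ids & test_ids)                  # set(): nothing from the prompt leaks into the test set
```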

4. LLMs are Imperfect—Prompt Iteratively

When we write prompts it’s easy to ignore that LLMs are non-deterministic, prompt-sensitive and can confidently hallucinate. Thus, always remember: “LLMs are powerful but imperfect components. Leverage strengths, anticipate weaknesses.”

LLM Strengths vs. Weaknesses

Effective prompting starts with you. You should not delegate the prompting to an LLM or you will miss important failure modes. Instead, write your own draft prompt and if needed, use an LLM only to polish clarity.

From there on, treat prompting as an iterative process where the first draft is a starting point which you refine based on observed outputs.

5. Reference-based vs Reference-free Metrics

The evaluation metrics broadly fall into two categories: reference-free and reference-based. Both of them are useful but in different contexts.

	Reference-Free	Reference-Based
What it means	Evaluates properties of the output itself (no golden answer required)	Compares output against a golden reference or ground truth
When to use	Creative or open-ended tasks, formatting/structure checks, validity tests	Tasks with clearly defined correct answers (e.g., factual QA, deterministic outputs)
Examples	- Does the output follow the JSON format? - Does generated code/SQL run without errors?	- Exact match against a gold SQL query - ROUGE/BLEU score for text generation
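
A tiny sketch of the two categories, using made-up outputs: the first check needs no golden answer, while the other two compare against one.

```python
import json

def is_valid_json(output):
    """Reference-free: checks a property of the output itself."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def exact_match(output, reference):
    """Reference-based: compares the output with a golden answer."""
    return output.strip() == reference.strip()

llm_output = '{"city": "Paris", "country": "France"}'  # made-up model output
golden = '{"country": "France", "city": "Paris"}'      # made-up reference

print(is_valid_json(llm_output))                     # True
print(exact_match(llm_output, golden))               # False: key order differs
print(json.loads(llm_output) == json.loads(golden))  # True: same content once parsed
```

The last two lines also show that a reference-based metric can be stricter or more lenient depending on how the comparison is defined.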

Part III: Fine-tuning Llama-3-8B for Structured Functional Representation Extraction

Mon, 15 Jul 2024 00:00:00 GMT

Last week, I published the second blog in my LLM fine-tuning series, comparing the performance of various models on functional representation extraction.

In this third part of the series, I discuss the first steps toward fine-tuning an open-source LLM for functional representation extraction. My aim is to give you all a sneak peek at the kind of performance you can expect from fine-tuning an LLM for a custom task. To streamline this step (and to satisfy my own curiosity 😊), I will use Predibase. It is a fast, cheap, and efficient open-source LLM fine-tuning and deployment platform.

FYI: I have some free Predibase credits through Dan’s and Hamel’s LLM course. Therefore, it is a perfect opportunity to put those credits to good use! 😬

Note: Whenever I mention “finetuning LLM,” I am specifically referring to LoRA (Low-Rank Adaptation) finetuning of a Large Language Model. For an overview of LoRA, please read Sebastian Raschka’s blogs (LORA Blog 1, LORA Blog 2).
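
For a rough sense of why LoRA is cheap: instead of updating a full weight matrix of shape d_out x d_in, it trains two low-rank factors of rank r, which adds r * (d_in + d_out) parameters. The sketch below assumes a single 4096 x 4096 projection purely for illustration and uses rank 16, the value used later in this post; which layers the platform actually adapts is not covered here.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA adapter on one weight matrix."""
    full = d_out * d_in
    # LoRA learns B (d_out x rank) and A (rank x d_in); the update is B @ A
    lora = rank * (d_in + d_out)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)  # assumed layer size
print(full)                  # 16777216
print(lora)                  # 131072
print(f"{lora / full:.2%}")  # 0.78%
```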

Task and Dataset

Similar to my previous blogs, the custom task is to predict the structured functional representation from the given text video game opinions of the ViGGO validation dataset.

To make this exercise interesting and challenging for future experiments, I will use a maximum of 1000 examples for fine-tuning any LLM model, instead of the full ~5K train dataset.

Below is an example from the randomly selected 1K train dataset:

Text                      : I remember you saying that you loved The Room. Do you tend to enjoy PC games from 2012?
functional_representation : verify_attribute(name[The Room], release_year[2012], rating[excellent], platforms[PC])

As shown in the graph below, the selected 1K dataset is a fairly representative sample of the full ViGGO train dataset.

Understanding Data Distribution

Upload the dataset to Predibase

Predibase requires you to upload the instruction fine-tuning dataset in a particular format. This is from the Predibase docs:

For instruction fine-tuning, your dataset must contain two columns named prompt and completion:
- prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
- completion: The expected response that corresponds to the input provided in the “prompt” column.
- split (optional): Should be either train or evaluation. To learn more, check out this section.
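
As a minimal sketch of that format, the snippet below turns ViGGO-style records into prompt, completion and split columns and writes them as CSV. The stand-in template, the helper names and the single example record are assumptions for illustration; the field names `target` and `meaning_representation` are the ones used in the evaluation code later in this post.

```python
import csv
import io

# Stand-in template for the sketch; any template with a {text} slot works
template = "### Target sentence: {text}\n\n### Output Functional representation:\n"

examples = [
    {"target": "I remember you saying that you loved The Room. Do you tend to enjoy PC games from 2012?",
     "meaning_representation": "verify_attribute(name[The Room], release_year[2012], rating[excellent], platforms[PC])"},
]

def to_rows(examples, split="train"):
    """Map raw records to the prompt/completion/split columns."""
    return [{"prompt": template.format(text=ex["target"]),
             "completion": ex["meaning_representation"],
             "split": split} for ex in examples]

def write_csv(rows, fh):
    writer = csv.DictWriter(fh, fieldnames=["prompt", "completion", "split"])
    writer.writeheader()
    writer.writerows(rows)

rows = to_rows(examples)
buffer = io.StringIO()  # swap for open("some_file.csv", "w", newline="") to write to disk
write_csv(rows, buffer)
print(buffer.getvalue())
```

Multi-line prompts are quoted automatically by the `csv` module, so the template's line breaks survive the round trip.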

Make sure to add the prompt template to the examples and convert them to the correct format. For this exercise, I use the following prompt template:

prompt_template = """Given a target sentence convert it structured functional representation.

### Target sentence: {text}

### Output Functional representation:
"""

You can connect your dataset to Predibase via the UI or Python SDK. Here, I will upload the dataset using SDK.

import os
from predibase import Predibase

# Initialize Predibase client
pb = Predibase(api_token=os.environ["PREDIBASE_API_TOKEN"])

# Upload the dataset
dataset = pb.datasets.from_file("viggo_train_val_dataset_1K.csv",
                                name="viggo_train_val_dataset_1K")

Once uploaded, you can check the dataset on the Predibase UI.

For detailed steps on uploading the dataset to Predibase, please refer to the companion blog notebook.

Setup and Finetune

Once you have uploaded the dataset, running the fine-tuning process is refreshingly simple. For this example, I fine-tune the base llama-3-8b model with the following parameters: epochs=3, rank=16, and learning_rate=2e-4.

from predibase import FinetuningConfig

# Create an adapter repository
repo = pb.repos.create(name="viggo-finetune-1K", 
                description="Llama-3-8b adapter repository for viggo 1K examples"
                )

# Create and run the fine-tuning job
adapter = pb.adapters.create(
   config=FinetuningConfig(
       base_model="llama-3-8b",
       epochs=3,
       rank=16,
       learning_rate=0.0002,
   ),
   dataset=dataset,
   repo=repo,
   description="baseline-llama-3-8b",
)

That’s all you need to do to submit a job! Once completed, it will be available on the Predibase platform.

You can always tweak multiple hyperparameters (see Finetuning Config) and run the fine-tune job again. All your fine-tune jobs will be available on the Predibase platform.

Evaluate the Fine-tuned Model

Predibase provides both Serverless endpoints and Dedicated deployment options for popular open-source LLMs and their fine-tuned LoRA checkpoints. I will create a serverless endpoint for this case.

Note: at least for now, serverless deployments are available for free.

Generate the responses for validation dataset

Similar to my previous blogs, I will evaluate the finetuned model on the ViGGO validation dataset and calculate custom performance metrics for a better understanding of the finetuned model’s performance.

First, I generate the responses for the validation dataset:

from datasets import load_dataset

# Initialize the Predibase deployment client
lorax_client = pb.deployments.client("llama-3-8b")

# Load the validation dataset
viggo_dataset = load_dataset("GEM/viggo")
val_dataset = viggo_dataset['validation']

# Fine-tuned adapter ID
adapter_id = "viggo-finetune-1K/2"

responses_dict = {}
for idx in range(len(val_dataset)):
    if idx % 50 == 0:
        print(f"Processing {idx}/{len(val_dataset)}")
    text = val_dataset["target"][idx]
    ground_truth = val_dataset["meaning_representation"][idx]
    # Generate with the fine-tuned adapter applied on top of the base model
    output = lorax_client.generate(
        prompt_template.format(text=text),
        adapter_id=adapter_id,
        max_new_tokens=150,
    ).generated_text
    responses_dict[idx] = {"output": output, "ground_truth": ground_truth, "text": text}

Note: Remember to replace “viggo-finetune-1K/2” with the correct adapter ID. You can find the adapter ID in the Predibase dashboard.

Now, I can generate the evaluation scores using custom evaluation metrics and compare them with previously calculated GPT-4 and Claude 3.5 Sonnet scores:

plot-finetuned-model

The initial finetuning of LLaMA-3-8B using 1,000 random examples from the ViGGO dataset, while not surpassing GPT-4 and Claude 3.5 Sonnet, shows promising results and outperforms several models from our previous blog. Notably, the exact_match score is even better than that of the two best-performing models.

Improved Performance with Updated Prompt Template

A simple yet effective way to enhance the model’s performance is by refining the prompt template. By providing clearer instructions that convey the structure of the functional representation, we can guide the model to produce more accurate outputs.

I updated the prompt template as follows:

prompt_template = """Given a target sentence construct the underlying meaningful functional representation of the input sentence as a single function with attributes and attribute values.

### Target sentence: {text}

### Output Functional representation:
"""

After uploading this new dataset, finetuning the model, and evaluating it with the new adapter viggo-finetune-1K/3, the evaluation metrics improve significantly. Notably, the model now surpasses GPT-4o’s scores for exact_match and function_name_match.

plot-finetuned-model

This improvement highlights the importance of clear and specific instructions in prompt engineering, even when working with finetuned models.

Conclusions..

First of all, my overall experience with Predibase has been positive, particularly in terms of rapid finetuning of models. While there are some limitations, such as restricted hyperparameter tuning, a standardized dataset format, and the inability to download adapters in the developer tier, it offers a user-friendly platform for fine-tuning (LoRA) large language models. I was able to quickly upload the data, set up and finetune the model, and run inference.
I achieved promising out-of-the-box performance using only 1K random examples. Although the fine-tuned llama-3-8b model doesn’t match the performance of GPT-4 and Sonnet 3.5 on all metrics, the result demonstrates the potential of fine-tuning with limited data and highlights the efficiency of the approach for task-specific model adaptation.

Next steps…

My next goal is to further enhance the model’s performance on evaluation metrics while maintaining a limit of 1,000 training examples. The LIMA: Less Is More for Alignment paper has demonstrated that even 1,000 well-curated examples can lead to strong finetuning performance.
Careful curation of examples and tuning of hyperparameters should improve the scores on the evaluation metrics.
In addition, I will deep dive into one of my favorite LLM fine-tuning tools, Axolotl.

References

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.

Part II: Comparison of Model Performances on Structured Functional Representation Extraction

Tue, 09 Jul 2024 00:00:00 GMT

Introduction

In the previous blog post, I established a performance baseline using GPT-4o for generating structured data, particularly functional representations, from text using the ViGGO Dataset.

Building on that foundation, I expand the experiment to include a broader range of models, both open-source and proprietary. This comparison aims to provide insights on how well these models perform out of the box in structured data extraction tasks, which is quite crucial for RAG applications, knowledge base construction, and reasoning systems.

I evaluate and compare the performance of these six LLM models:

GPT-4o: OpenAI’s latest iteration of the GPT-4 model, known for its faster generation, advanced natural language understanding and generation capabilities. Currently one of the most popular and capable models.
Claude Sonnet-3.5: Anthropic’s refined language model with enhanced reasoning abilities. It aims to provide more context-aware outputs compared to earlier versions and has recently outperformed GPT-4o on many benchmarks.
Gemini-1.5-Flash: Google DeepMind’s streamlined version of the Gemini model, optimized for faster inference and reduced computational requirements.
llama-3-70b-instruct: Meta’s large-scale instruction-tuned language model, part of the latest LLaMA 3 family, with 70 billion parameters. It’s designed to follow complex instructions and generate high-quality text across diverse domains.
mixtral-8x7b-instruct-v0.1: Mistral AI’s instruction-tuned variant of the Mixtral 8x7B model, known for its mixture-of-experts architecture.
llama-3-8b-instruct: A more compact version of Meta’s LLaMA 3 family, with 8 billion parameters, optimized for instruction following.

Note: I’ve included the smaller Llama-3-8B model as I plan to finetune its base 8B model in the coming days. This will help me compare it against the general instruction-finetuned 8B model.

Dataset and Prompt Template

For consistency and fair comparison, I used the same ViGGO validation dataset and the prompt template as in the previous blog post. The prompt template and the few-shot examples are designed to guide the models in generating structured functional representations from the given text input:

PROMPT_TEMPLATE = """
Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. 
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].

The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']. The order your list the attributes within the function must follow the order listed above. For example the 'name' attribute must always come before the 'exp_release_date' attribute, and so forth.

For each attribute, fill in the corresponding value of the attribute within brackets. A couple of examples are below. Note: you are to output the string after "Output: ". Do not include "Output: " in your answer.

Example 1)
Sentence: Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])

Example 2) 
Sentence: Were there even any terrible games in 2014?
Output: request(release_year[2014], specifier[terrible])

Example 3)
Sentence: Adventure games that combine platforming and puzzles  can be frustrating to play, but the side view perspective is perfect for them. That's why I enjoyed playing Little Nightmares.
Output: give_opinion(name[Little Nightmares], rating[good], genres[adventure, platformer, puzzle], player_perspective[side view])

Example 4)
Sentence: Since we're on the subject of games developed by Telltale Games, I'm wondering, have you played The Wolf Among Us?
Output: recommend(name[The Wolf Among Us], developer[Telltale Games])

Example 5) 
Sentence: Layers of Fear, the indie first person point-and-click adventure game?
Output: confirm(name[Layers of Fear], genres[adventure, indie, point-and-click], player_perspective[first person])  

Example 6) 
Sentence: I bet you like it when you can play games on Steam, like Worms: Reloaded, right?  
Output: suggest(name[Worms: Reloaded], available_on_steam[yes])

Example 7)
Sentence: I recall you saying that you really enjoyed The Legend of Zelda: Ocarina of Time. Are you typically a big fan of games on Nintendo rated E (for Everyone)?    
Output: verify_attribute(name[The Legend of Zelda: Ocarina of Time], esrb[E (for Everyone)], rating[excellent], platforms[Nintendo])

Example 8)
Sentence: So what is it about the games that were released in 2005 that you find so excellent?  
Output: request_explanation(release_year[2005], rating[excellent])

Example 9)
Sentence: Do you think Mac is a better gaming platform than others?
Output: request_attribute(has_mac_release[])

Give the output for the following sentence:
{input}
"""

Generating the responses

I used the respective official APIs for the closed models (GPT-4o, Gemini-1.5-flash, Claude-3.5-Sonnet) and the Replicate API client for open-source models (Llama-3-70B, Llama-3-8B, Mixtral-8x7B).

For detailed information on API endpoints and the process of generating responses for all models, please refer to the Generate_responses_all_llms.ipynb notebook.

Generating responses for this experiment had the following associated costs:

Model/API	Cost
GPT-4o API	~ $2.5
Claude-3.5-Sonnet API	~ $3.5
Replicate API	~ $3
Gemini-1.5-flash API	Free*
Total	~$9

*for limited usage

Evaluation Strategy

To assess the models’ performance, I used the same evaluation criteria as in the previous post:

Function Name Match: The function name must match the ground truth function name.
Function and Attributes Match: The generated function name and attributes must match the ground truth function name and attributes. However, the order of the attributes does not matter.
Function, Attributes, and Values Match: The generated function name, attributes, and values must match the ground truth function name, attributes, and values. The order of the attributes and values does not matter.
Exact Match: The generated function must exactly match the ground truth function.

Note: I implemented custom Python functions using regex and string manipulation to calculate these metrics, rather than relying on another LLM for evaluation. This approach helps avoid potential biases that might be introduced by using an LLM in the evaluation process.
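
To give a flavor of what such functions can look like, here is a simplified sketch of the four criteria. It is not the code from the notebook: the parsing regexes, the helper names and two of the metric key names are assumptions, and the reordered prediction is a made-up example.

```python
import re

def parse_fr(text):
    """Split 'func(attr[value], ...)' into (function name, {attribute: value})."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text, flags=re.DOTALL)
    if not match:
        return None, {}
    name, body = match.groups()
    return name, dict(re.findall(r"(\w+)\[([^\]]*)\]", body))

def value_sets(attrs):
    """Treat comma-separated values as sets so their order does not matter."""
    return {key: {v.strip() for v in val.split(",")} for key, val in attrs.items()}

def evaluate(pred, truth):
    p_name, p_attrs = parse_fr(pred)
    t_name, t_attrs = parse_fr(truth)
    name_ok = p_name is not None and p_name == t_name
    attrs_ok = name_ok and set(p_attrs) == set(t_attrs)
    values_ok = attrs_ok and value_sets(p_attrs) == value_sets(t_attrs)
    return {
        "function_name_match": name_ok,
        "function_attributes_match": attrs_ok,
        "function_attributes_values_match": values_ok,
        "exact_match": pred.strip() == truth.strip(),
    }

truth = "inform(name[FIFA 12], release_year[2011], esrb[E (for Everyone)], rating[average], genres[simulation, sport])"
reordered = "inform(name[FIFA 12], esrb[E (for Everyone)], release_year[2011], rating[average], genres[sport, simulation])"
scores = evaluate(reordered, truth)
print(scores)  # only exact_match is False
```

Because the metrics get stricter from top to bottom, a prediction that fails an earlier criterion also fails the order-insensitive ones after it.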

For the complete evaluation code and functions used, please refer to the Compare_models_performances.ipynb notebook.

Comparing the models’ performance

Compare Models Performance

Based on the evaluation metric plot:

Claude Sonnet-3.5 and GPT-4o consistently outperform other models across all metrics.
Claude Sonnet-3.5 performs better than GPT-4o, aligning with all the Twitter chatter about its superior performance.
Gemini-1.5-Flash, despite its optimization for speed, maintains competitive performance for less stringent metrics but struggles significantly with exact matches.
The performance gap between the top-performing models and others widens sharply for more stringent metrics.
As expected, the smaller Llama-3-8B model shows the lowest performance, highlighting the evident size advantage of larger models.
Mixtral-8x7B's performance is lower than anticipated, suggesting weaker instruction-following capabilities.

Some key observations

Based on quickly eyeballing the generated responses:

Almost all models consistently capture straightforward attributes such as player perspective and multiplayer status.
For queries involving multiple attributes or conditions, the models sometimes miss or misinterpret parts of the input.
The models struggle with capturing subtle distinctions in opinions and inferring information that is implied but not explicitly stated in the text. For example, models struggled to differentiate between inform, give_opinion and suggest.

Conclusions

This comparison provides valuable insights into the capabilities of various LLMs in functional representation extraction. As expected, the proprietary large models like Claude Sonnet-3.5 and GPT-4o perform best out of the box, with Claude Sonnet-3.5 being the best.

One can argue that these results may not fully represent the models’ maximum capabilities. Different model-specific prompt engineering approaches or dynamic few-shot examples could potentially improve performance further. However, my aim is just to assess how well a model performs without fancy RAG, function calling, or complicated prompt engineering approaches.

Next steps….

I plan to fine-tune the base Llama-3-8B and other smaller models (like Phi and Gemma) on the ViGGO dataset to assess whether a fine-tuned smaller model can compete with or surpass the performance of Claude Sonnet-3.5 and GPT-4o.
At the same time, I will investigate the trade-offs, such as inference speed, accuracy, and latency, between fine-tuned smaller models and proprietary model API calls.

References

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.

Part I: Baseline Evaluation of GPT-4o for Functional Representation Extraction

Wed, 03 Jul 2024 00:00:00 GMT

Introduction

Extracting structured data from unstructured text allows us to condense the information present in the text. This representation can then be used for efficient indexing and other downstream RAG applications.

I want to evaluate GPT-4o’s performance in extracting structured data, specifically functional representations, from unstructured domain-specific text. I will use the ViGGO dataset to evaluate it on custom evaluation criteria and will set the result as a baseline for comparison in future work with other models such as Claude, Gemini, and custom fine-tuned open-source models.

ViGGO Dataset

This is a dataset for generating text opinions in the video game domain. Strictly speaking, it is intended to generate coherent conversational responses based on input functional representations (set of attributes and values).

However, I use the reverse task, where I generate structured functional representations from the given text input. A typical ViGGO dataset example has the output structured functional representation consisting of a single function with attributes and attribute values.

Text:
You said that you liked Crysis. Do you often play first person games from Crytek Frankfurt?

Functional Representation:
verify_attribute(name[Crysis], developer[Crytek Frankfurt], rating[good], player_perspective[first person])

The function and attributes must be one of the following, respectively:

Function:
['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 
'suggest', 'request_explanation', 'recommend', 'request_attribute']

Attributes:
['name', 'release_year', 'esrb', 'genres', 'platforms', 'available_on_steam',
'has_linux_release', 'has_mac_release', 'specifier', 'rating', 
'player_perspective', 'has_multiplayer', 'developer', 'exp_release_date']
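
As an illustration of this structure, the sketch below parses a functional representation into its function name and attributes and checks both against the allowed lists. The regexes and helper names are assumptions made for this example, not code from the notebook.

```python
import re

FUNCTIONS = {"inform", "request", "give_opinion", "confirm", "verify_attribute",
             "suggest", "request_explanation", "recommend", "request_attribute"}
ATTRIBUTES = {"name", "release_year", "esrb", "genres", "platforms", "available_on_steam",
              "has_linux_release", "has_mac_release", "specifier", "rating",
              "player_perspective", "has_multiplayer", "developer", "exp_release_date"}

def parse(fr):
    """Return (function name, {attribute: value}) or raise ValueError."""
    match = re.fullmatch(r"(\w+)\((.*)\)", fr.strip(), flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not a functional representation: {fr!r}")
    name, body = match.groups()
    return name, dict(re.findall(r"(\w+)\[([^\]]*)\]", body))

def is_valid(fr):
    """True if the string parses and only uses allowed function and attribute names."""
    try:
        name, attrs = parse(fr)
    except ValueError:
        return False
    return name in FUNCTIONS and set(attrs) <= ATTRIBUTES

example = ("verify_attribute(name[Crysis], developer[Crytek Frankfurt], "
           "rating[good], player_perspective[first person])")
name, attrs = parse(example)
print(name)               # verify_attribute
print(attrs["developer"]) # Crytek Frankfurt
print(is_valid(example))  # True
```

A check like this is handy for spotting malformed model outputs before computing any metrics.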

Note: Since I am not training/fine-tuning any model, I will only consider the ViGGO validation dataset for this exercise.

Prompt for generating the functional representation

I use a modified version of the prompt template used in Anyscale’s blog for the ViGGO dataset. The prompt template is a few-shot prompt with examples from each function category to help the model understand the intended output representation.

PROMPT_TEMPLATE = """
Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. 
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].

The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']. The order your list the attributes within the function must follow the order listed above. For example the 'name' attribute must always come before the 'exp_release_date' attribute, and so forth.

For each attribute, fill in the corresponding value of the attribute within brackets. A couple of examples are below. Note: you are to output the string after "Output: ". Do not include "Output: " in your answer.

Example 1)
Sentence: Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.
Output: inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])

Example 2) 
Sentence: Were there even any terrible games in 2014?
Output: request(release_year[2014], specifier[terrible])

Example 3)
Sentence: Adventure games that combine platforming and puzzles  can be frustrating to play, but the side view perspective is perfect for them. That's why I enjoyed playing Little Nightmares.
Output: give_opinion(name[Little Nightmares], rating[good], genres[adventure, platformer, puzzle], player_perspective[side view])

Example 4)
Sentence: Since we're on the subject of games developed by Telltale Games, I'm wondering, have you played The Wolf Among Us?
Output: recommend(name[The Wolf Among Us], developer[Telltale Games])

Example 5) 
Sentence: Layers of Fear, the indie first person point-and-click adventure game?
Output: confirm(name[Layers of Fear], genres[adventure, indie, point-and-click], player_perspective[first person])  

Example 6) 
Sentence: I bet you like it when you can play games on Steam, like Worms: Reloaded, right?  
Output: suggest(name[Worms: Reloaded], available_on_steam[yes])

Example 7)
Sentence: I recall you saying that you really enjoyed The Legend of Zelda: Ocarina of Time. Are you typically a big fan of games on Nintendo rated E (for Everyone)?    
Output: verify_attribute(name[The Legend of Zelda: Ocarina of Time], esrb[E (for Everyone)], rating[excellent], platforms[Nintendo])

Example 8)
Sentence: So what is it about the games that were released in 2005 that you find so excellent?  
Output: request_explanation(release_year[2005], rating[excellent])

Example 9)
Sentence: Do you think Mac is a better gaming platform than others?
Output: request_attribute(has_mac_release[])

Give the output for the following sentence:
{input}
"""

The typical responses for the above template are not perfect, but the template can still be used to generate structured output for the full dataset.

Ground Truth: inform(name[FIFA 12], release_year[2011], esrb[E (for Everyone)], rating[average], genres[simulation, sport])
GPT Response: inform(name[FIFA 12], release_year[2011], esrb[E (for Everyone)], rating[average], genres[sports, simulation])


Ground Truth: request(player_perspective[side view], specifier[easy])
GPT Response: Output: request(genres[side view], rating[top], specifier[easy])


Ground Truth: recommend(name[Need for Speed: The Run], platforms[Xbox])
GPT Response: confirm(name[Need for Speed: The Run], platforms[Xbox])

Evaluation criteria

Often, you require custom evaluation criteria for custom tasks. For structured functional representation extraction, I define the following binary criteria:

Function Name Match: The function name must match the ground truth function name.
Function and Attributes Match: The generated function name and attributes must match the ground truth function attributes. However, the order of the attributes does not matter.
Function, Attributes, and Values Match: The generated function name, attributes, and values must match the ground truth function attributes and values. The order of the attributes and values does not matter.
Exact Match: The generated function must exactly match the ground truth function.

The above criteria are in order of increasing strictness. The first criterion is the least strict, and the last criterion is the most strict. These criteria will help me evaluate the model’s performance.
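As a point of comparison, the four criteria can also be checked deterministically by parsing both strings. This is not the approach used in this post, only a minimal sketch under a few assumptions of mine: a stray "Output:" prefix is ignored, multi-valued attributes such as genres are compared as unordered comma-separated sets, and "exact match" means the same function with the same attributes and values in the same order. The function names `parse_representation`, `value_set`, and `evaluate` are hypothetical helpers.

```python
import re


def parse_representation(text):
    """Split 'function(attr[value], ...)' into (function, [(attr, value), ...])."""
    text = text.strip()
    if text.startswith("Output:"):  # assumption: ignore a stray "Output:" prefix
        text = text[len("Output:"):].strip()
    match = re.fullmatch(r"(\w+)\((.*)\)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not a functional representation: {text!r}")
    function, body = match.groups()
    # Each attribute looks like name[value]; values may contain commas or parentheses.
    pairs = [(name, value.strip()) for name, value in re.findall(r"(\w+)\[([^\]]*)\]", body)]
    return function, pairs


def value_set(value):
    """Treat 'a, b' and 'b, a' as the same multi-valued attribute."""
    return tuple(sorted(part.strip() for part in value.split(",")))


def evaluate(generated, ground_truth):
    """Return the four binary criteria, keyed like the pydantic model fields."""
    gen_fn, gen_pairs = parse_representation(generated)
    ref_fn, ref_pairs = parse_representation(ground_truth)

    function_match = gen_fn == ref_fn
    same_attributes = sorted(n for n, _ in gen_pairs) == sorted(n for n, _ in ref_pairs)
    same_values = sorted((n, value_set(v)) for n, v in gen_pairs) == sorted(
        (n, value_set(v)) for n, v in ref_pairs
    )
    return {
        "function_match": function_match,
        "function_attribute_match": function_match and same_attributes,
        "function_attributes_values_match": function_match and same_values,
        "exact_match": function_match and gen_pairs == ref_pairs,
    }


ground_truth = "inform(name[FIFA 12], release_year[2011], esrb[E (for Everyone)], rating[average], genres[simulation, sport])"
generated = "inform(name[FIFA 12], release_year[2011], esrb[E (for Everyone)], rating[average], genres[sports, simulation])"
print(evaluate(generated, ground_truth))
```

For the FIFA 12 pair shown earlier, the function and attribute checks pass, while the values and exact-match checks fail because "sports" differs from "sport".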

Evaluation Strategy

Although not ideal, I ask the model to evaluate its own performance on the given task. This approach has limitations, as it could potentially introduce bias in the evaluation process. However, it serves as a starting point for our analysis. I ask the model to compare the generated function with the ground truth function and provide a boolean score based on the above evaluation criteria.

Since I need the evaluation scores in a structured format to analyze the model’s performance effectively, I use the pydantic and instructor packages to obtain them. I use the following pydantic model and prompt to evaluate the GPT-4o performance:

from pydantic import BaseModel, Field

class EvaluateFunctionRepresentation(BaseModel):
    function_match: bool = Field(description="The function name is the same but the attributes and values can be different.")
    function_attribute_match: bool = Field(description="The function and the attributes are the same but the values can be different.")
    function_attributes_values_match: bool = Field(description="The generated representation has same function and attributes and corresponding values without the same attributes and values order.")
    exact_match: bool = Field(description="The generated representation is exactly the same as the ground truth representation.")

PROMPT_TEMPLATE_SCORE_EVAL = """I will provide you with two functional representations strings. One will be the ground truth representation (ground_truth) and the other will be a generated representation (generated). You need to compare the generated representation with the ground truth representation and provide the following similarity match in true or false:
1) function_match
2) function_attributes_match
3) function_attributes_values_match
4) exact_match

A typical functional representation is of this form: function(attribute1[values], attribute2[values], attribute3[values], ...). 

Given the following two functional representation, provide the similarity scores for the following:

ground_truth: {ground_truth}

generated: {generated}

Let's think step by step.
"""

Please go through the GPT-4o baseline evaluation notebook for more details on prompt template and evaluation process and responses.

Evaluating the performance

Using the above evaluation strategy, I generate the average evaluation scores for the full dataset.

GPT-4o Evaluation Metrics

The task of generating functional representations from natural language sentences is challenging, as the above scores show. The model produces an exact match for only about 30% of the examples, and even the function name is correct only about 80% of the time. These results indicate that while GPT-4o shows some capability in extracting functional representations, there’s significant room for improvement.
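The averaging step itself is simple. The sketch below assumes each example's evaluation is available as a dict of booleans keyed by criterion name (for instance, the pydantic model dumped to a dict); `average_scores` is a hypothetical helper, and the sample data is made up purely for illustration.

```python
def average_scores(results):
    """Return the percentage of examples for which each criterion is True.

    results: list of dicts mapping criterion name -> bool, one dict per example.
    """
    if not results:
        raise ValueError("no results to average")
    criteria = list(results[0].keys())
    return {
        name: 100.0 * sum(bool(r[name]) for r in results) / len(results)
        for name in criteria
    }


# Illustrative data only, not results from the post.
sample = [
    {"function_match": True, "exact_match": True},
    {"function_match": True, "exact_match": False},
    {"function_match": True, "exact_match": False},
    {"function_match": False, "exact_match": False},
]
print(average_scores(sample))
```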

Conclusion

The above evaluation exercise provides a good baseline and useful evaluation criteria/metrics that can be used to assess the performance of other models on the same task in the future. Generating functional representations is a complex task and is not easily accomplished using prompt-engineering techniques like few-shot learning as seen in the results.

Next steps…

Evaluate other large models like Claude, Gemini, and Llama-3-70B on the same task to compare their performance. This will provide a broader perspective on how different LLMs handle structured data extraction.
Explore alternative evaluation methods that don’t rely on the model evaluating itself, most importantly to eliminate potential biases in the evaluation process.
Fine-tune smaller models like Llama-3-8B or Mistral-7B for this domain-specific structured representation task. Fine-tuning should not only improve the model’s performance but also reduce latency and the number of input tokens required to generate the output.
Investigate the impact of different prompt engineering techniques on the model’s performance.

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.

Step-by-Step Guide to Setup Your Personal GPU Server

Sun, 30 Jun 2024 00:00:00 GMT

Github Gist Link

I’ve been using a GPU workstation with an RTX 4090 for almost a year now, and it’s been one of the best decisions I’ve made. With a personal GPU server, you no longer need to rely on cloud-based GPU instances from services like RunPod or Vast.ai every time you want to run a job or try new models. The best part? No stress about recurring GPU instance costs! :-)

However, I rarely work directly on my workstation. Instead, I prefer the flexibility of accessing the GPU remotely using my MacBook, whether I’m working from different locations within my home, from a co-working space, or a cozy cafe in another part of town.

In this blog, I will walk you through the steps to configure a personal GPU Ubuntu server.

For this guide, I assume you already have a workstation running Ubuntu with a GPU and that it is connected to your local network.

Setting Up Local Remote Access

Let’s start by setting up local access, which will allow you to ssh into your GPU server when you’re on the same home Wi-Fi network. This is ideal for a work-from-home (WFH) setup where your workstation is running in a corner of your living space.

Install the SSH server

First, we need to install an SSH (Secure Shell) server. This will allow you to securely access your GPU machine remotely. Open a terminal on your Ubuntu machine and run the following command:

sudo apt update && sudo apt install openssh-server

This command updates your package lists and installs the OpenSSH server.

Start and Enable SSH Service

Next, enable the SSH service using this command:

sudo systemctl enable --now ssh

You can verify that the service is running with:

sudo systemctl status ssh

Look for a line starting with Active: active (running) for ssh.service. This indicates that the SSH service is up and running.

Note: The OpenSSH server starts running on boot by default.

Configure the firewall

To allow SSH connections through the system firewall, you need to open the appropriate port. Ubuntu’s default firewall, UFW (Uncomplicated Firewall), makes this process straightforward:

sudo ufw allow ssh

This command adds an exception to your firewall rules, permitting incoming SSH connections. You can check the firewall status with:

sudo ufw status

You should see output similar to:

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
22/tcp (v6)                ALLOW       Anywhere (v6)

Connect to the local server

Now that your GPU server is set up, it’s time to test the connection. From your laptop (which should be on the same local network as your GPU machine), open a terminal and use the following command:

ssh user@local-ip-address

Replace user with your Ubuntu username and local-ip-address with the IP address of your GPU machine on the local network.
- To find your username on the workstation, you can use the whoami command.
- To find your local IP address, use one of these methods on your workstation:
  - Run hostname -I and use the first address listed.
  - Use ip addr show | grep -w inet for more detailed network information.
  - The blog post "How to find my IP address on Ubuntu Linux" covers this in more detail. It explains several commands, such as ip addr show | grep -w inet and networkctl status, for getting the local IP address.
Your local IP address typically starts with 192.168.
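If the ssh command hangs or is refused, it can help to first check whether the SSH port is reachable at all from your laptop. The snippet below is a small sketch using only the Python standard library; `ssh_port_open` is a hypothetical helper name, and the address `192.168.1.50` in the usage line is a placeholder for your workstation's local IP.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; replace it with the IP from `hostname -I`.
    print(ssh_port_open("192.168.1.50"))
```

A False result points at the network, the firewall rule, or the SSH service rather than at your username or password.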

Note: If your router dynamically changes the local IP address of your workstation, it’s best to log into your router and assign a fixed local IP address to ensure consistent access.

If everything is configured correctly, you’ll be prompted to enter your password, after which you’ll gain remote access to your GPU server.

Set Up SSH Keys for Passwordless Login

It is recommended to set up key-based authentication for better security and convenience. This allows you to connect to your remote server without entering a password each time.
- It is quite common to set up SSH key-based authentication.
- For detailed instructions on setting up SSH keys, refer to the DigitalOcean guide on Setting up SSH keys on Ubuntu 20.04.

Setting Up External Remote Access

While local access is great for working within your home network, sometimes you need to access your GPU workstation from outside your local network, such as from co-working spaces or a cozy cafe.

One simple and secure way to achieve this is by using ngrok.

ngrok creates secure tunnels from public endpoints to locally running services. It allows you to expose your personal server to the internet, enabling remote access from anywhere without complex network configurations.

Here’s how to set it up:

Install ngrok

First, you need to install ngrok on your GPU workstation. Open a terminal and run this command:

snap install ngrok

- For more installation options, see https://dashboard.ngrok.com/get-started/setup/linux.

Create and connect to ngrok Account

Visit ngrok’s website and sign up for a free account if you haven’t already. After signing up, you’ll receive an auth token. On your GPU workstation, run:

ngrok config add-authtoken YOUR_AUTH_TOKEN

You can find the config file path with ngrok config check and edit the file with an editor such as vim.

Start the ngrok Tunnel

Now, you can create a secure tunnel to your SSH service:

ngrok tcp 22

This command will display a URL that looks like tcp://X.tcp.ngrok.io:PORT. Note down this URL.

Connect to Your Workstation

From any external laptop, you can now SSH into your GPU workstation using:

ssh -p YYYY user@X.tcp.ngrok.io

Replace YYYY with the port number (PORT in the ngrok URL) and X with the subdomain from the ngrok URL. Replace user with your Ubuntu username.

These steps let you access the workstation from an external network. However, no one wants to start ngrok manually every time before heading out.

Make ngrok start automatically on boot

To ensure ngrok starts automatically when your workstation boots:
- Create a new service file:
```
sudo vim /etc/systemd/system/ngrok.service
```
- Add the following content:
```
[Unit]
Description=start ngrok tunnel on startup
After=network.target

[Service]
ExecStart=/snap/bin/ngrok tcp 22
Restart=on-failure
User=<your_username>

[Install]
WantedBy=multi-user.target
```
Replace <your_username> with your Ubuntu username. Save the file and exit the editor.
- Enable and start the service:
```
sudo systemctl enable ngrok.service
sudo systemctl start ngrok.service
```
Now ngrok will automatically start and create a tunnel when your workstation boots.

Note: With a free account, ngrok assigns a new port (YYYY) each time your workstation boots. You can get the new port from the ngrok dashboard.
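Besides the dashboard, the running ngrok agent also serves a small local API that lists its tunnels. The sketch below assumes the agent's web interface is on its default address (127.0.0.1:4040) and that it is run on the workstation itself, for example from a script that sends you the address after boot. `ssh_command` and `current_tcp_tunnel` are hypothetical helper names, and `alice` is a placeholder username.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

# Default local address of the ngrok agent API (assumed unchanged in your config).
AGENT_API = "http://127.0.0.1:4040/api/tunnels"


def ssh_command(public_url, user):
    """Turn 'tcp://X.tcp.ngrok.io:PORT' into the matching ssh command."""
    parsed = urlparse(public_url)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"unexpected tunnel URL: {public_url!r}")
    return f"ssh -p {parsed.port} {user}@{parsed.hostname}"


def current_tcp_tunnel():
    """Ask the local agent for its tunnels and return the first TCP public URL."""
    with urlopen(AGENT_API, timeout=5) as response:
        tunnels = json.load(response)["tunnels"]
    for tunnel in tunnels:
        if tunnel["public_url"].startswith("tcp://"):
            return tunnel["public_url"]
    raise RuntimeError("no TCP tunnel is running")


if __name__ == "__main__":
    # Requires a running ngrok agent on this machine.
    print(ssh_command(current_tcp_tunnel(), "alice"))
```
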
Paid ngrok account for dedicated port

For a dedicated TCP endpoint port that doesn’t change on reboot, you need a paid ngrok personal account ($10/month).
1. Reserve a TCP endpoint
Once you have a paid account, reserve a TCP endpoint at https://dashboard.ngrok.com/cloud-edge/tcp-addresses.
2. Update the ngrok service file
Replace the content of /etc/systemd/system/ngrok.service with the following:
```
[Unit]
Description=start ngrok tunnel on startup
After=network.target

[Service]
ExecStart=/snap/bin/ngrok tcp --region=<region> --remote-addr=<remote-address> 22
Restart=on-failure
User=<your_username>

[Install]
WantedBy=multi-user.target
```
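Replace <region> and <remote-address> with the values from your reserved TCP endpoint config, and <your_username> with your Ubuntu username. After saving the file, run sudo systemctl daemon-reload and sudo systemctl restart ngrok.service so that the change takes effect.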

With this setup, your SSH remote endpoint will remain the same even if the system reboots.

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.

Managing multiple CUDA versions using environment modules in Ubuntu

Sun, 19 May 2024 00:00:00 GMT

Latest Update: May 19th, 2024

Github Gist Link

This blog contains all the steps required to:
- Install multiple CUDA versions (e.g., CUDA 11.8 and CUDA 12.1).
- Manage multiple CUDA environments on Ubuntu using the utility called environment modules.
- Use this approach to avoid CUDA environment conflicts.

Environment Modules is a package that provides for the dynamic modification of a user’s environment via modulefiles. You can find more on it at https://modules.readthedocs.io/en/latest/

Install the Compatible NVIDIA Drivers (if required)

Add PPA GPU Drivers Repository to the System

sudo add-apt-repository ppa:graphics-drivers/ppa

Check GPU and available drivers

ubuntu-drivers devices
# install it using: sudo ubuntu-drivers

Install the compatible driver
```
# best to allow Ubuntu to autodetect and install the compatible nvidia-driver
sudo ubuntu-drivers install
```
For example, I tried to install nvidia-driver-545 using the sudo ubuntu-drivers install nvidia:545 command. However, I was unable to install it; I kept running into one issue or another.

Note: Please restart your system after installing the NVIDIA driver. After the restart, you should be able to get the GPU state and stats using nvidia-smi.

Check the installed NVIDIA driver
```
nvidia-detector 
```
Additionally, you can also install NVIDIA drivers using the Software & Updates Ubuntu app. Just go to the Additional Drivers tab, choose a driver, and click Apply Changes.

Install `CUDA 11.8` and `CUDA 12.1`

Go to https://developer.nvidia.com/cuda-toolkit-archive and select CUDA Toolkit 11.8 from the available options.
Choose your OS, architecture, distribution, version, and installer type. For example, in my case:

Option	Value
OS	Linux
Architecture	x86_64
Distribution	Ubuntu
Version	22.04
Installer type	deb (local)

Follow the provided installation instructions by copying and pasting the commands into your terminal. This will install CUDA 11.8. Use the following commands:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600  
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Similarly, install CUDA 12.1 using the following commands:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Make sure to copy and execute the commands above in your terminal to install CUDA 11.8 and CUDA 12.1 on your system.

Install `cuDNN` library

Go to https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/ and download the cuDNN tar archives for CUDA 11.x and CUDA 12.x. Note that you might need to create a developer’s account first.

Untar the downloaded files using the following commands:

tar -xvf cudnn-linux-x86_64-9.1.0.70_cuda11-archive.tar.xz # CUDA 11.x
tar -xvf cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz # CUDA 12.x

Copy the cuDNN files to the CUDA toolkit files:

# for CUDA 11.8 (run from the directory that contains the extracted archives)
cd cudnn-linux-x86_64-9.1.0.70_cuda11-archive/
sudo cp include/cudnn*.h /usr/local/cuda-11.8/include
# the archive keeps its libraries under lib/; -P preserves the symlinks
sudo cp -P lib/libcudnn* /usr/local/cuda-11.8/lib64
cd ..

# for CUDA 12.1
cd cudnn-linux-x86_64-9.1.0.70_cuda12-archive/
sudo cp include/cudnn*.h /usr/local/cuda-12.1/include
sudo cp -P lib/libcudnn* /usr/local/cuda-12.1/lib64
cd ..

Make the files readable by all users:

sudo chmod a+r /usr/local/cuda-11.8/include/cudnn*.h /usr/local/cuda-11.8/lib64/libcudnn*
sudo chmod a+r /usr/local/cuda-12.1/include/cudnn*.h /usr/local/cuda-12.1/lib64/libcudnn*

Note: Strictly speaking, you are done with the CUDA setup. You can use it by adding the CUDA bin and library path to the PATH and LD_LIBRARY_PATH environment variables. For example, you can set up CUDA 11.8 by adding the following lines in the ~/.bashrc:
```
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
```

Similarly, you can set up CUDA 12.1. However, manually changing the paths every time can be cumbersome!

Note: If you only want one of the two, cuDNN for CUDA 11.x or cuDNN for CUDA 12.x, the simpler way is to go to https://developer.nvidia.com/cudnn-downloads and install it the same way as the CUDA toolkit above.

Manage multiple CUDA versions using `environment modules`

Install the environment modules utility

Run the following commands:
```
    sudo apt-get update
    sudo apt-get install environment-modules
```
Check the installation:
```
# Check the installation by running
module list
```
You should see a list of default installed modules like git and maybe their versions displayed when you run the command module list. This confirms that the environment modules utility has been successfully installed on your system.

Create modulefiles for CUDA distributions

Note: You might need root permissions to create directories and files. Use sudo in that case.

Create a directory /usr/share/modules/modulefiles/cuda to hold the modulefiles for the CUDA distributions:

sudo mkdir -p /usr/share/modules/modulefiles/cuda

Create a modulefile /usr/share/modules/modulefiles/cuda/11.8 for CUDA 11.8 and add the following lines:

#%Module1.0
##
## cuda 11.8 modulefile
##

proc ModulesHelp { } {
    global version

    puts stderr "\tSets up environment for CUDA $version\n"
}

module-whatis "sets up environment for CUDA 11.8"

if { [ is-loaded cuda/12.1 ] } {
module unload cuda/12.1
}

set version 11.8
set root /usr/local/cuda-11.8
setenv CUDA_HOME    $root

prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/extras/CUPTI/lib64
prepend-path LD_LIBRARY_PATH $root/lib64
conflict cuda

Similarly, create a modulefile /usr/share/modules/modulefiles/cuda/12.1 for CUDA 12.1 and add the following lines:

#%Module1.0
##
## cuda 12.1 modulefile
##

proc ModulesHelp { } {
    global version

    puts stderr "\tSets up environment for CUDA $version\n"
}

module-whatis "sets up environment for CUDA 12.1"

if { [ is-loaded cuda/11.8 ] } {
module unload cuda/11.8
}

set version 12.1
set root /usr/local/cuda-12.1
setenv CUDA_HOME    $root

prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/extras/CUPTI/lib64
prepend-path LD_LIBRARY_PATH $root/lib64
conflict cuda

Make CUDA 11.8 the default cuda version

Create a file /usr/share/modules/modulefiles/cuda.version to make CUDA 11.8 the default cuda module:
```
#%Module
set ModulesVersion 11.8
```
Note: make sure to reload your terminal.

Changing and Viewing the CUDA Module

To change and view the loaded CUDA module, you can use the following commands:

# Check the currently loaded module
module list
# Check the available modules
module avail

# Load a specific cuda version
module load cuda/12.1
# Unload the currently loaded CUDA module
module unload cuda
# Load CUDA 11.8
module load cuda/11.8

# verify the paths of the loaded CUDA
nvcc --version # should give the loaded CUDA version
echo $CUDA_HOME
echo $PATH
echo $LD_LIBRARY_PATH
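Reading three long environment variables by eye is error-prone, so the same check can be scripted. The sketch below is a hypothetical helper (`check_cuda_env`), assuming the /usr/local/cuda-X.Y layout used in this post. It reports the version that CUDA_HOME points at and flags PATH or LD_LIBRARY_PATH entries that are missing or that belong to a different CUDA version.

```python
import os
import re


def check_cuda_env(env=None):
    """Return (version, problems) for the CUDA toolkit the environment points at."""
    env = dict(os.environ) if env is None else env
    cuda_home = env.get("CUDA_HOME", "").rstrip("/")
    match = re.search(r"cuda-(\d+\.\d+)$", cuda_home)
    if match is None:
        return None, ["CUDA_HOME does not look like /usr/local/cuda-X.Y"]
    version = match.group(1)

    problems = []
    path_entries = env.get("PATH", "").split(":")
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(":")
    if f"{cuda_home}/bin" not in path_entries:
        problems.append(f"{cuda_home}/bin is missing from PATH")
    if f"{cuda_home}/lib64" not in lib_entries:
        problems.append(f"{cuda_home}/lib64 is missing from LD_LIBRARY_PATH")
    # Entries from a different CUDA version usually mean a stale manual export.
    for entry in path_entries + lib_entries:
        other = re.search(r"cuda-(\d+\.\d+)", entry)
        if other and other.group(1) != version:
            problems.append(f"{entry} belongs to CUDA {other.group(1)}")
    return version, problems


if __name__ == "__main__":
    version, problems = check_cuda_env()
    print("CUDA version from CUDA_HOME:", version)
    for problem in problems:
        print("warning:", problem)
```

A leftover export in ~/.bashrc from the manual setup is a typical source of the "belongs to CUDA" warning after switching modules.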

Note: You can add additional CUDA versions or other packages by creating corresponding modulefiles and following the steps outlined in this gist.
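Since the two modulefiles above differ only in the version number, additional ones can be generated from a template instead of being copied by hand. The sketch below is one hypothetical way to do that: `render_modulefile` is my own helper, the version `12.4` is only an example, and the output mirrors the structure of the modulefiles shown earlier.

```python
# Raw string so that \t and \n stay literal for Tcl; @...@ markers avoid
# clashing with the braces and $variables used in the modulefile itself.
MODULEFILE_TEMPLATE = r"""#%Module1.0
##
## cuda @VERSION@ modulefile
##

proc ModulesHelp { } {
    global version

    puts stderr "\tSets up environment for CUDA $version\n"
}

module-whatis "sets up environment for CUDA @VERSION@"

@UNLOAD@set version @VERSION@
set root /usr/local/cuda-@VERSION@
setenv CUDA_HOME    $root

prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/extras/CUPTI/lib64
prepend-path LD_LIBRARY_PATH $root/lib64
conflict cuda
"""


def render_modulefile(version, other_versions=()):
    """Return modulefile text for one CUDA version.

    other_versions lists the versions to unload first, as in the files above.
    """
    unload = "".join(
        f"if {{ [ is-loaded cuda/{v} ] }} {{\nmodule unload cuda/{v}\n}}\n\n"
        for v in other_versions
    )
    return MODULEFILE_TEMPLATE.replace("@UNLOAD@", unload).replace("@VERSION@", version)


if __name__ == "__main__":
    # Example only: write the result to /usr/share/modules/modulefiles/cuda/12.4 yourself.
    print(render_modulefile("12.4", other_versions=["11.8", "12.1"]))
```

When you add a new version this way, remember to extend the unload blocks of the existing modulefiles as well, so that switching works in both directions.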

Some Useful Tips

What to do if nvidia-smi does not work

Sometimes, after an Ubuntu update or some other unexpected issue, the system might not be able to detect the drivers. For example, you get errors such as "nvidia-smi has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." The best solution is to remove the current drivers and reinstall the compatible nvidia-driver.
```
# removes all the nvidia drivers
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
# reinstall the compatible driver and restart
sudo ubuntu-drivers install
```

How to purge CUDA from your computer

> DO IT AT YOUR OWN RISK

# removes all the nvidia drivers
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
# remove all cuda versions
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*"  "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
# remove all cuda folders
sudo rm -rf /usr/local/cuda*

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.
