
Choosing the Right Training Method: What HuggingFace Skills Reveals About SFT, DPO, and GRPO

Source: huggingface

Looking back at HuggingFace’s December 2025 Skills integration with Codex, most of the coverage focused on the agent orchestration story: Codex drives a pipeline, submits GPU jobs, generates reports. That framing is correct, but it glides over something more practically useful: the training method selection logic embedded in the workflow. The system automatically picks between Supervised Fine-Tuning, Direct Preference Optimization, and Group Relative Policy Optimization based on dataset structure. That decision tree is a compressed form of the post-training knowledge most practitioners build up slowly through trial and error.

Three Methods, Three Different Problems

SFT, DPO, and GRPO are not variations on a theme. They address different problems with different data requirements and different assumptions about what “better” means for a model.

Supervised Fine-Tuning is the baseline. You have input-output pairs: a question and the answer you want the model to produce, an instruction and the response it should learn to generate. TRL’s SFTTrainer handles this case by running standard next-token prediction on the target outputs. The model learns to replicate the patterns in your demonstrations. This is appropriate when you have high-quality examples and want the model to produce outputs that look like those examples. The limitation is that SFT has no mechanism for expressing preference: you can only show the model what to do, not what to avoid.
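The mechanics can be sketched in a few lines. SFT concatenates prompt and response tokens, then masks the prompt positions out of the loss so only the target output is predicted. This is an illustrative sketch with made-up token IDs, not TRL's actual implementation:

```python
# Sketch of SFT label construction: next-token prediction on the
# response only. Prompt positions get label -100, which PyTorch-style
# cross-entropy losses ignore. Token IDs here are illustrative.
IGNORE_INDEX = -100

def build_sft_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt tokens from the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_sft_labels([101, 7, 42], [9, 8, 2])
# input_ids: [101, 7, 42, 9, 8, 2]
# labels:    [-100, -100, -100, 9, 8, 2]
```

The masking is what makes "learn the demonstration" concrete: gradient flows only through the response tokens, so the model is never penalized for the prompt it was given.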

Direct Preference Optimization was introduced in a 2023 paper by Rafailov et al. as a more tractable alternative to reinforcement learning from human feedback. Classic RLHF trains a separate reward model on preference pairs, then uses that reward model to optimize the language model via PPO. DPO collapses the reward model into the fine-tuning objective directly. The math shows that optimizing the language model directly on preference pairs is equivalent to the RLHF objective under certain assumptions, without needing the separate reward model training step.

The data requirement is specific: you need chosen and rejected pairs, two responses to the same prompt where one is preferred over the other. The HuggingFace Skills validator checks for these columns before allocating any GPU time:

DPO: ✗ INCOMPATIBLE
  Missing 'chosen' and 'rejected' columns
  Required: prompt, chosen, rejected

DPO is appropriate when you have comparative judgment data, typically generated from human labelers rating outputs, or from another model serving as a judge. It gives you a mechanism to steer model behavior away from undesirable responses while reinforcing preferred ones.
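The collapsed objective is compact enough to sketch directly. DPO's implicit reward for a response is beta times its log-probability ratio against the frozen reference model, and the loss is a logistic loss on the chosen-versus-rejected reward margin. A minimal scalar version, using summed per-sequence log-probs:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss on one preference pair.

    Each response's implicit reward is beta * (policy log-prob minus
    reference log-prob); the loss is -log sigmoid of the reward margin.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log 2.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

At initialization, when policy and reference agree, the margin is zero and the loss sits at log 2; training pushes the margin positive without needing a separate reward model.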

Group Relative Policy Optimization is the newest of the three and has a different origin. GRPO was introduced by the DeepSeek team and became widely discussed after DeepSeek-R1 demonstrated its effectiveness for training reasoning models. The key insight is that for domains where correctness is verifiable, you do not need human preference labels. You generate multiple candidate responses for each prompt, score them programmatically against a known-correct answer, and use the relative scores within each group to define the reward signal.

For competitive programming problems, you run the generated code against test cases. For math problems, you check whether the final answer matches. The reward signal is clean, scalable, and cheap to compute compared to human annotation. This is what makes GRPO tractable for reasoning domains at a scale where DPO would require an enormous annotation budget.
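The "relative scores within each group" step is the part that replaces a learned value model, and it can be sketched in plain Python. The group mean serves as the baseline; each candidate's advantage is its verifier reward minus that mean, normalized by the group standard deviation. This is a simplified illustration of the idea, not a full GRPO implementation:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize verifier rewards within one group of sampled candidates.

    GRPO uses the group mean as its baseline instead of a learned value
    model: advantage = (reward - group mean) / group std deviation.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary pass/fail rewards from a test-case verifier for 4 candidates:
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Passing candidates get positive advantage, failing ones negative.
```

Because the baseline is computed per group, the signal is meaningful even when all rewards are binary: the model is pushed toward whatever distinguishes the passing samples from the failing ones in that group.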

How the Skills Workflow Handles Selection

The HuggingFace Skills workflow inspects dataset schema during the validation step and routes to the appropriate trainer. The logic is roughly:

  1. If the dataset has prompt/response or messages columns in a conversation format, use SFT.
  2. If the dataset has chosen and rejected columns alongside prompts, DPO is available.
  3. If the dataset has verifiable outputs (math answers, code problems with test cases, structured facts with a checking mechanism), GRPO is viable.
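The routing above can be sketched as a small function. This is my paraphrase of the rules, not the actual Skills code; the function name and the `has_verifier` flag (standing in for "a programmatic checker exists for this domain") are hypothetical:

```python
def select_training_method(columns, has_verifier=False):
    """Route a dataset to a training method based on its schema.

    A sketch of the selection rules, not the Skills implementation:
    preference columns imply DPO, a verifier enables GRPO, and
    demonstration-shaped data falls back to SFT.
    """
    cols = set(columns)
    if {"prompt", "chosen", "rejected"} <= cols:
        return "DPO"
    if has_verifier:
        return "GRPO"
    if "messages" in cols or {"prompt", "response"} <= cols:
        return "SFT"
    raise ValueError(f"No compatible method for columns: {sorted(cols)}")

select_training_method(["prompt", "chosen", "rejected"])  # "DPO"
select_training_method(["messages"])                      # "SFT"
```

The key property is that the function can refuse: a dataset with no recognizable shape raises an error instead of silently training with the wrong objective.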

The agent expresses this in natural language too. If you ask to “train a math reasoning model,” it will look for a dataset with verifiable answers. If you ask to “improve instruction following,” it expects demonstration data. The skills repository encodes these rules in the skill documentation files that the agent reads as context.

This selection matters more than it might appear. Applying DPO to a dataset that only has demonstrations, or applying GRPO to a domain without a verifiable reward signal, does not fail loudly. It trains a model that underperforms compared to what the correct method would have produced, and the failure is only visible in downstream evaluation.

The LoRA Boundary and What It Changes

The Skills workflow introduces LoRA automatically when model size exceeds roughly 3B parameters on the available hardware tiers. This is a practical constraint: full fine-tuning of a 7B model does not fit in the GPU memory available on a single T4 or A10G at the price points the system targets.

Low-Rank Adaptation works by freezing the pretrained weights and injecting small trainable rank-decomposition matrices into the attention layers. The number of trainable parameters drops by several orders of magnitude. A 7B model with LoRA rank 16 on the attention projections has on the order of ten million trainable parameters rather than 7 billion, which fits on a single A100 with memory to spare.
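The arithmetic is simple enough to do by hand. Each adapted weight matrix gets an A matrix of shape (rank, d_in) and a B matrix of shape (d_out, rank). A back-of-envelope count, assuming Llama-7B-style shapes (32 layers, hidden size 4096) with q_proj and v_proj adapted:

```python
def lora_param_count(rank, d_in, d_out, n_layers, n_target_modules):
    """Trainable parameters added by LoRA: each adapted weight matrix
    gains an A matrix (rank x d_in) and a B matrix (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * n_target_modules * n_layers

# Llama-7B-style shapes: 32 layers, 4096 hidden, q_proj + v_proj adapted.
n = lora_param_count(rank=16, d_in=4096, d_out=4096,
                     n_layers=32, n_target_modules=2)
# ~8.4M trainable parameters against ~7B frozen ones.
```

Doubling the rank doubles the adapter budget linearly, which is why rank is the first knob to turn when an adaptation underfits.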

The HuggingFace Skills configuration for a LoRA run on a 3-7B model targets a10g-large or a100-large hardware and generates training code using the peft library:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # adapter rank: capacity of the low-rank update
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,                    # dropout on the adapter path
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

The r parameter controls rank and therefore the parameter budget. Higher rank captures more adaptation capacity at higher memory and compute cost. Rank 16 is a reasonable default for domain adaptation tasks; fine-grained tasks like code style or specific output formats may benefit from higher rank values, while lightweight instruction following can often work at rank 8.

What LoRA does not change is the training method itself. You can run SFT with LoRA, DPO with LoRA, or GRPO with LoRA. The adaptation layer is orthogonal to the optimization objective.

What the December 2025 Benchmark Numbers Mean

The article reports that Qwen3-0.6B fine-tuned on open-r1/codeforces-cots via SFT reached a HumanEval pass@1 score of 0.342, compared to 0.306 for the base model. That improvement came from roughly $1-2 of GPU time on a T4.

A few things are worth contextualizing here. HumanEval measures Python code generation on 164 hand-written programming problems. It is a narrow benchmark, and results on it do not necessarily transfer to other code tasks, other programming languages, or real-world software engineering work. The codeforces-cots dataset contains competitive programming solutions with chain-of-thought annotations, which is specifically the kind of data HumanEval rewards.
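One more piece of context for reading these scores: pass@1 is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021), which estimates the probability that at least one of k samples passes all test cases given n generated samples of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the chance
    that at least one of k samples drawn from n candidates (c of them
    correct) passes all test cases for a problem."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the fraction solved:
# a 0.342 score corresponds to roughly 56 of HumanEval's 164 problems.
```

In the single-sample case the estimator collapses to a plain accuracy, so a 0.036 improvement means roughly six additional problems solved, a reminder of how coarse-grained the metric is at this scale.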

The numbers are plausible but should not be taken as a general claim about code model improvement. They represent performance on a benchmark that is closely aligned with the training distribution. A more rigorous evaluation would include problems from outside the training domain, held-out competitive programming problems from a different contest platform, and ideally evaluation on realistic coding tasks rather than isolated function completion.

This matters when choosing a training method. SFT on HumanEval-adjacent data will improve HumanEval scores. GRPO on the same domain, with test-case verification as the reward signal, would likely produce different results because the optimization target is different: GRPO pushes the model to produce code that passes tests, which is not identical to producing code that looks like the demonstration data.

When to Use Each Method in Practice

The practical guidance here is data-driven rather than model-driven. The training method you should use is determined by what data you have and can collect, not by abstract preference.

SFT makes sense when you have high-quality demonstrations and a clear behavioral target: customer support responses that should sound a certain way, technical documentation that should follow a style guide, structured output formatting for a specific schema. It requires the least data infrastructure but gives you the least control over what the model avoids.

DPO makes sense when you have or can generate comparative labels. Automatically generating chosen/rejected pairs using a larger model as a judge is now a common pattern: generate multiple responses, have a stronger model score them, use the high-low pairs as preference data. This avoids human annotation costs at the expense of being bounded by the judge model’s quality.
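The generate-score-pair pattern is mechanical enough to sketch. The judge here is a stand-in for a stronger model scoring each response; everything below (function names included) is illustrative, not any particular library's API:

```python
def build_preference_pair(prompt, candidates, judge_score):
    """Turn judge-scored candidates into a DPO-style preference row.

    `judge_score` stands in for a stronger model rating each response;
    the highest- and lowest-scored responses become chosen/rejected.
    """
    scored = sorted(candidates, key=judge_score, reverse=True)
    if judge_score(scored[0]) == judge_score(scored[-1]):
        return None  # no usable margin between best and worst
    return {"prompt": prompt, "chosen": scored[0], "rejected": scored[-1]}

pair = build_preference_pair(
    "Explain LoRA in one sentence.",
    ["x", "a medium-length reply", "short"],
    judge_score=len,  # toy judge: longer is scored higher
)
```

The `None` branch matters in practice: prompts where the judge cannot distinguish best from worst contribute no preference signal and are better dropped than forced into pairs.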

GRPO makes sense specifically when your domain has programmatic evaluation. Code with test cases, math with verifiable answers, structured data extraction where you can check field values against a ground truth, logical reasoning with known-correct derivations. Outside those domains, GRPO is not applicable without building custom reward functions, which requires additional engineering work the Skills workflow does not cover automatically.

The HuggingFace Skills system handles all three cases within the same interface. The decision about which method applies is forced by data structure, which is the right place for that decision to live: if your dataset has chosen/rejected columns, you have preference data and DPO is available. If it does not, you fall back to SFT or invest in building verifiable rewards for GRPO. The workflow makes the constraint explicit rather than letting practitioners apply the wrong method by accident.

The open source skills repository is the right starting point if you want to examine exactly how these selection rules are encoded, or if you want to extend them for custom training methods or evaluation frameworks. The skill definitions are plain markdown, which means the logic is readable and modifiable without touching any code.
