SKILL.md as Agent Brain: What HuggingFace's Fine-Tuning Pipeline Gets Right
Source: huggingface
Looking back at HuggingFace’s December 2025 announcement with a few months of perspective, the headline framing, that Claude fine-tuned an open source LLM, does a disservice to the more interesting architectural question underneath it. Claude didn’t fine-tune anything. Claude read a document, wrote a training script, and submitted a job. The training itself ran on HuggingFace’s GPU infrastructure using TRL. What’s worth examining is the pattern that made this delegation work, and what it tells us about how ML expertise should be packaged and consumed.
The SKILL.md Pattern
The mechanism is straightforward but the implications run deep. HuggingFace’s skills repository contains SKILL.md files, structured markdown documents that encode domain knowledge: which training method to use for which task, how to map model size to hardware tier, what dataset columns each trainer requires, how to handle mismatched formats. Claude Code receives these files via MCP, from HuggingFace’s server at https://huggingface.co/mcp?bouquet=skills. Before doing anything else, the agent reads the skill document. Everything that follows, hardware selection, script generation, job submission, is derived from that document plus your plain-language instruction.
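To make the pattern concrete, a hypothetical excerpt of such a skill document might look like this. The structure and names here are invented for illustration; the real files live in HuggingFace's skills repository:

```markdown
# Skill: fine-tune-llm

## Choosing a training method
- Dataset has `chosen` and `rejected` columns → DPO.
- Task has a programmatically verifiable answer → GRPO.
- Otherwise → SFT (requires a `messages` or `text` column).

## Hardware
- Sub-1B parameters → T4 small
- 1–3B → T4 medium or A10G small
- 3–7B → A10G large or A100; apply LoRA above 3B
- 7B+ → not supported
```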
The setup looks like this:
```shell
claude mcp add --transport http hf-skills \
  https://huggingface.co/mcp?bouquet=skills \
  --header "Authorization: Bearer $HF_TOKEN"
```
Then you say something like:
Fine-tune Qwen3-0.6B on open-r1/codeforces-cots for 3 epochs, SFT, checkpoint every 500 steps.
The agent plans and executes the full pipeline. The training report ends up at training_reports/qwen3-0.6b-codeforces-cots-sft.md.
This is a different approach than giving agents a Python API to call or a CLI tool to invoke. The skill is a document, not an interface. Any agent that can read files and reason about structured text can use it. The knowledge lives in markdown, not in compiled artifacts. That means it’s readable, diffable, forkable, and modifiable without touching any code.
The closest prior art is something like CLAUDE.md files for project context, but where those encode project-specific conventions, SKILL.md encodes domain expertise. The distinction matters: the skill file is designed to be reused across contexts, not tied to any particular codebase.
What the Agent Is Actually Deciding
The pipeline has five sequential steps, and understanding what the agent controls at each step clarifies where the value is.
Step 1: Dataset validation. Before any GPU is allocated, the agent checks whether the dataset format matches what the chosen training method expects. For SFT, you need a messages or text column. For DPO, you need chosen and rejected columns, with optional prompt. For GRPO, you need a verifiable reward function alongside prompts. The agent will attempt column mapping if names are close but not exact. This step runs on CPU, costs nothing, and prevents the most common source of wasted compute: malformed data that only surfaces as an error 40 minutes into a training run.
Step 2: Training method selection. If you don’t specify a method, the agent infers it from the dataset structure. chosen/rejected columns suggest DPO. A dataset with ground-truth verifiable answers suggests GRPO. Everything else defaults to SFT. The inference isn’t magic; it’s a lookup against the decision table in the skill document.
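Both the validation and the inference reduce to a small decision table. A minimal sketch in plain Python, assuming the column requirements described above (the function names and exact rules are illustrative, not the skill's actual code):

```python
# Column requirements per training method, per the decision table above.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}],      # either column satisfies SFT
    "dpo": [{"chosen", "rejected"}],      # prompt is optional
    "grpo": [{"prompt"}],                 # plus a verifiable reward function
}

def validate_columns(method: str, columns: set) -> bool:
    """True if the dataset columns satisfy any of the method's column sets."""
    return any(req <= columns for req in REQUIRED_COLUMNS[method])

def infer_method(columns: set, has_verifiable_answers: bool = False) -> str:
    """Infer a training method from dataset structure when none is specified."""
    if {"chosen", "rejected"} <= columns:
        return "dpo"
    if has_verifiable_answers:
        return "grpo"
    return "sft"

print(infer_method({"prompt", "chosen", "rejected"}))   # dpo
print(validate_columns("sft", {"question", "answer"}))  # False: needs column mapping
```

The mapping step the agent attempts for near-miss column names would sit between these two checks: rename, then re-validate.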
Step 3: Hardware selection. The agent consults a hardware guide embedded in the skill, mapping model size to GPU tier:
| Model size | Hardware | Approx. cost/hr |
|---|---|---|
| Sub-1B | T4 small | ~$0.75 |
| 1–3B | T4 medium / A10G small | moderate |
| 3–7B | A10G large / A100 | $15–40 |
| 7B+ | Not supported | — |
For test runs, you can specify a sample size explicitly: “Do a quick test with 100 examples” routes to T4 small at around $0.30 for 20 minutes. This makes iteration cheap enough that running multiple hypotheses in parallel is reasonable.
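The lookup the agent performs against that guide is simple. A sketch, encoding the table above (the flavor names and boundary handling are illustrative; the skill document defines the actual tiers):

```python
def select_hardware(params_billions: float) -> str:
    """Map model size to a GPU tier, per the hardware guide above."""
    if params_billions < 1:
        return "t4-small"
    if params_billions <= 3:
        return "t4-medium"      # or a10g-small
    if params_billions <= 7:
        return "a10g-large"     # or a100; LoRA is applied above 3B
    raise ValueError("models above 7B are not supported by this workflow")

print(select_hardware(0.6))  # t4-small
print(select_hardware(5.0))  # a10g-large
```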
Step 4: Script generation and job submission. The agent writes a TRL-based Python script, then submits it via HuggingFace Jobs. Jobs is a managed GPU compute service, billed per second, accessed via the huggingface_hub SDK. The generated script handles training loop, Trackio monitoring hooks, checkpoint intervals, and Hub push on completion. For models above 3B parameters, the agent applies PEFT LoRA automatically.
Step 5: Monitoring and reporting. The agent polls job status and writes results to the training report markdown file, linking to Trackio dashboards for live loss curves and to checkpoints on the Hub.
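The polling loop itself is generic. A minimal sketch, assuming a callable that wraps the actual Jobs status API (stubbed here for illustration; the real terminal state names may differ):

```python
import time

def poll_until_done(get_status, interval_s: float = 30.0, max_polls: int = 1000) -> str:
    """Poll a job-status callable until it reports a terminal state."""
    terminal = {"COMPLETED", "ERROR", "CANCELLED"}
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not finish within the polling budget")

# Stubbed status source for illustration: RUNNING twice, then COMPLETED.
states = iter(["RUNNING", "RUNNING", "COMPLETED"])
print(poll_until_done(lambda: next(states), interval_s=0.0))  # COMPLETED
```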
The TRL Trainer Selection Logic
Each training method uses a different TRL trainer, and the differences matter for choosing the right approach for your task.
SFTTrainer is the baseline. You have demonstration data, pairs of inputs and desired outputs, and you want the model to replicate that behavior. SFT works for instruction following, format compliance, domain vocabulary, and task specialization where you have clean examples. It’s the lowest-friction starting point and often sufficient for narrow domains.
DPOTrainer handles preference alignment without a separate reward model. Direct Preference Optimization, introduced in 2023, reformulates the RLHF objective as a classification problem over preferred and rejected responses. You need chosen and rejected response pairs for each prompt. The practical workflow is SFT first, then DPO: establish baseline behavior, then refine it toward preferences. DPO is appropriate when you have human or AI feedback about which of two responses is better, and the model already knows how to produce outputs in the right general format.
GRPOTrainer is the most interesting of the three and reflects the influence of DeepSeek-R1’s training approach. Group Relative Policy Optimization works on tasks where you can verify correctness programmatically: math problems with known answers, code that passes or fails tests, logic puzzles with deterministic solutions. The trainer samples multiple responses per prompt, scores them against the reward function, and updates the model relative to the group’s average reward. This is how you get reasoning improvement on structured tasks without needing human annotation.
For something like math reasoning on openai/gsm8k, GRPO is the right call. The reward function checks whether the final numeric answer is correct. The model learns to produce reliable solutions by observing which of its sampled approaches actually worked.
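A reward function of that kind is easy to sketch. This shows only the scoring logic, not the exact signature GRPOTrainer expects; extracting the last number in a completion and comparing it to the gold answer is one common convention for GSM8K-style data:

```python
import re

def reward_exact_answer(completion: str, gold: str) -> float:
    """Score 1.0 if the last number in the completion equals the gold answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not nums:
        return 0.0
    return 1.0 if float(nums[-1]) == float(gold) else 0.0

print(reward_exact_answer("... so the total is 42 apples. Answer: 42", "42"))  # 1.0
print(reward_exact_answer("I think the answer is 41", "42"))                   # 0.0
```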
LoRA in Practice
The automatic LoRA application for models above 3B deserves more than a footnote. LoRA, implemented via PEFT, inserts low-rank decomposition matrices into the attention weight layers. For a weight matrix W of shape (d, k), LoRA adds W + AB where A is (d, r) and B is (r, k), with rank r far smaller than both d and k. Only A and B are trained; the base weights stay frozen.
```python
from peft import get_peft_model, LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                    # rank of the decomposition matrices
    lora_alpha=32,                          # the LoRA update is scaled by lora_alpha / r
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, peft_config)
```
For a 3B model, a full checkpoint is roughly 6GB. The LoRA adapter is typically 20–200MB. Multiple task-specific adapters can coexist on the same base model and be swapped without reloading the base weights. Catastrophic forgetting, the risk that fine-tuning one capability degrades another, is reduced because the base weights are never modified.
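The size difference falls straight out of the rank arithmetic. A back-of-envelope check, with illustrative dimensions (a 2048 x 2048 attention projection; actual shapes vary by model):

```python
# Trainable parameters for one projection: full fine-tuning vs. LoRA at r=8.
d, k, r = 2048, 2048, 8

full = d * k                 # full fine-tuning trains all of W
lora = d * r + r * k         # LoRA trains only A (d x r) and B (r x k)

print(full)                  # 4194304
print(lora)                  # 32768
print(f"{lora / full:.2%}")  # 0.78%
```

Multiply across the targeted layers and the sub-1% ratio per matrix is what turns a multi-gigabyte checkpoint into a megabyte-scale adapter.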
The skill’s automatic LoRA application above 3B is the right default. Full fine-tuning on a single GPU at that scale is impractical, and LoRA’s quality tradeoffs are small for most domain adaptation tasks. For smaller models, the agent leaves the choice open based on available VRAM.
AutoTrain Is Dead and That’s Worth Noting
HuggingFace’s previous no-code training platform, AutoTrain, is now deprecated and no longer maintained. Their recommended path for new projects is Axolotl, TRL directly, or the skills-based agent workflow.
This is a meaningful shift in HuggingFace’s platform strategy. AutoTrain was a UI product: you uploaded data, filled out a form, clicked train. It abstracted away all the decisions by eliminating them, giving you a fixed configuration surface. The agent-based approach does something different: it preserves the full decision space and uses the agent to navigate it on your behalf. Hardware selection, training method, hyperparameters, LoRA configuration, monitoring setup, all of these remain visible and overridable, but the agent handles them by default.
The difference matters for people who actually know what they want. A fixed UI can’t adapt to “I want GRPO with a custom reward function that checks Python syntax validity”. An agent with access to the full TRL API surface and a skill document that explains when to use each approach can at least attempt it, and will write a script you can inspect and modify.
The Division of Labor
What Claude is doing here is orchestration, not training. The agent reads domain knowledge, makes decisions, generates code, and submits infrastructure requests. The actual gradient descent happens on HuggingFace’s hardware, inside a TRL training loop, with no ongoing involvement from the agent once the job is submitted.
This division is worth being precise about because it clarifies what the agent needs to be good at. The agent needs to parse your intent, consult the skill document, map intent to the right trainer configuration, generate syntactically valid TRL code, and submit the job correctly. It doesn’t need to understand the mathematics of DPO or the convergence properties of GRPO. The skill document encodes those as decision rules. The agent applies the rules.
The same division applies to the infrastructure side. The agent doesn’t manage GPU provisioning, it calls HuggingFace Jobs. The agent doesn’t implement monitoring, it adds Trackio hooks to the generated script. The agent’s job is to assemble these pieces correctly based on your instruction and the knowledge in the skill document.
This is a reasonable allocation. Coding agents in 2025–2026 are reliable at structured code generation and API composition. They’re less reliable at novel research decisions or debugging subtle training dynamics. The skill system plays to the former and leaves the latter to you.
What’s Missing
The published article includes no benchmark comparisons between fine-tuned and base models. It’s a workflow demonstration, not a research result. You won’t find numbers showing that Qwen3-0.6B fine-tuned on codeforces-cots actually solves more competitive programming problems than the base model. That evaluation step is outside the scope of what the skill automates, at least for now.
Support also tops out at 7B parameters. Anything larger requires multi-GPU or multi-node training configurations that the current single-job architecture doesn’t cover. For frontier-scale fine-tuning, you’re still writing your own infrastructure.
The skills repository is open source and the SKILL.md format is extensible. Adding support for evaluation benchmarks, larger model configurations, or custom reward functions is a matter of writing markdown and Python, not modifying a platform you don’t control. That extensibility is the most durable thing about this design. The skill is a document you can read, modify, and share, which is more than you can say for most no-code ML platforms.