Fine-Tuning as a Tool Call: How MCP Turns Claude Into an ML Workflow Orchestrator
Source: huggingface
The original HuggingFace blog post, published in December 2025, describes an experiment that sounds simpler than it is: they gave Claude a set of “skills” via MCP and told it to fine-tune an open-source LLM. Looking back at it now, the result is worth picking apart, not because of the fine-tuned model itself, but because of what the architecture reveals about how agent capabilities should be packaged.
What “Skills” Means Here
In most ML contexts, “skills” refers to what a model has internalized through training. A model that can write SQL has a SQL skill baked into its weights. HuggingFace uses the word differently. Their skills are operational knowledge delivered at runtime via MCP, Anthropic’s open standard for connecting tools to language model agents.
The hf-llm-trainer skill bundles GPU selection logic, Hub authentication configuration, training method selection rules, dataset validation logic, and job submission workflows into a single installable unit. None of this is fine-tuned into Claude. Claude receives it fresh on each session, through the MCP transport layer.
Setup looks like this:
claude mcp add --transport http hf-skills \
  https://huggingface.co/mcp?bouquet=skills \
  --header "Authorization: Bearer $HF_TOKEN"
Then, within Claude Code:
/plugin install hf-llm-trainer@huggingface-skills
After that, you can issue natural language instructions like “Fine-tune Qwen3-0.6B on open-r1/codeforces-cots for instruction following” and Claude handles the rest: proposing a hardware config and cost estimate, validating the dataset schema, submitting the job to HuggingFace’s Jobs API, monitoring training via Trackio, and converting the output to GGUF if requested.
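One of the steps in that pipeline, dataset schema validation, is easy to illustrate. The skill's actual validation code isn't published, but a minimal sketch of the kind of check it would need to run before submitting a job might look like this (the function name and field names here are assumptions, not the skill's real API):

```python
def looks_conversational(example: dict) -> bool:
    """Check whether a dataset row matches the conversational format
    TRL trainers expect: a "messages" list of role/content dicts."""
    msgs = example.get("messages")
    return (
        isinstance(msgs, list)
        and len(msgs) >= 2
        and all(isinstance(m, dict) and "role" in m and "content" in m for m in msgs)
    )

# A well-formed conversational row passes; a plain-text row does not.
ok = looks_conversational({"messages": [
    {"role": "user", "content": "Write a sorting function."},
    {"role": "assistant", "content": "def sort(xs): return sorted(xs)"},
]})
bad = looks_conversational({"text": "just a string"})
```

Catching a malformed schema locally, before the job is submitted, is what saves the round-trip cost of a failed training run.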
The Training Method Layer
The more technically interesting part is how the skill handles training method selection. The TRL library provides the underlying trainers, and the skill effectively encodes when to reach for each one.
SFT (Supervised Fine-Tuning via SFTTrainer) is the default for most cases: you have demonstration data in input-output or conversational format, and you want the model to learn from those examples. The loss is token-level cross-entropy over the target tokens. For conversational datasets, you can mask loss on user and system turns with assistant_only_loss=True, so the model only learns to predict assistant responses.
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    args=SFTConfig(assistant_only_loss=True),
)
trainer.train()
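What assistant_only_loss does conceptually is simple: tokens outside assistant turns get the label -100, which PyTorch's cross-entropy ignores, so only assistant tokens contribute to the loss. A toy sketch of that masking idea (not TRL's actual collator code):

```python
# -100 is PyTorch's default ignore_index for cross-entropy loss.
IGNORE_INDEX = -100

def mask_non_assistant(token_ids, is_assistant_flags):
    """Replace labels for non-assistant tokens with IGNORE_INDEX,
    so the loss is computed only over assistant-generated tokens."""
    return [
        tid if is_assistant else IGNORE_INDEX
        for tid, is_assistant in zip(token_ids, is_assistant_flags)
    ]

# First two tokens belong to the user turn, last two to the assistant turn.
labels = mask_non_assistant([11, 12, 13, 14], [False, False, True, True])
```

The masked positions still flow through the forward pass; they just contribute zero gradient.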
DPO (DPOTrainer) is for preference alignment. Rather than demonstration data, you provide pairs of chosen and rejected completions for the same prompt. The DPO paper by Rafailov et al. reformulated RLHF’s reward modeling step into a direct loss on the policy, eliminating the separate reward model and significantly simplifying the alignment pipeline. TRL implements a wide range of loss type variants including IPO, robust DPO, NCA, and others; the default is the original Bradley-Terry sigmoid loss.
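The default sigmoid loss is compact enough to write out. For a prompt with chosen completion y_w and rejected completion y_l, DPO computes an implicit reward margin from policy and reference log-probabilities and pushes it through a Bradley-Terry sigmoid. A minimal numerical sketch (toy scalar log-probs, not TRL's batched implementation):

```python
import math

def dpo_sigmoid_loss(policy_chosen_lp, policy_rejected_lp,
                     ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Original DPO loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy prefers the chosen completion over the rejected
    one, relative to the reference model."""
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is zero and the
# loss is log(2); as the policy favors the chosen completion, loss falls.
baseline = dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_sigmoid_loss(-0.5, -2.0, -1.0, -1.0)
```

The beta parameter controls how strongly the policy is allowed to drift from the reference; TRL's DPOConfig exposes it directly.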
GRPO (GRPOTrainer) is for verifiable tasks: math reasoning, coding problems, anything where you can write a deterministic reward function. The model generates multiple completions per prompt, scores them against the reward function, and the group-relative advantage determines which completions to reinforce. This is the same training regime behind DeepSeek’s R1 and similar reasoning models. No preference data required, no reward model, just a scoring function you write yourself.
from trl import GRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()
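Writing your own scoring function is the whole point of GRPO's flexibility. For standard-format datasets, GRPOTrainer calls each reward function with the batch of completions and expects one float per completion. A deliberately toy example (the reward itself is a made-up heuristic, but the call shape matches what TRL documents):

```python
def brevity_reward(completions, **kwargs):
    """Toy reward function: +1 for completions of 50 words or fewer,
    -1 for longer ones. Real reward functions for math or coding tasks
    would parse the completion and check the answer deterministically."""
    return [1.0 if len(c.split()) <= 50 else -1.0 for c in completions]

short_scores = brevity_reward(["The answer is 42."])
long_scores = brevity_reward([" ".join(["word"] * 80)])
```

This could be passed as reward_funcs in place of accuracy_reward above; TRL also accepts a list of reward functions whose outputs are combined.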
The skill wraps all of this selection logic. Ask for “preference alignment” and it configures DPO. Ask for “math reasoning” and it reaches for GRPO. The user doesn’t need to know the distinction.
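Reduced to its essence, that routing behaves like a keyword dispatch. This sketch is hypothetical, the skill's real selection logic is certainly richer, but it captures the shape of the decision:

```python
def pick_method(task: str) -> str:
    """Hypothetical sketch of the skill's method selection: map a natural
    language task description to a TRL trainer family."""
    t = task.lower()
    if "preference" in t or "alignment" in t:
        return "dpo"    # paired chosen/rejected data
    if "math" in t or "reasoning" in t or "verifiable" in t:
        return "grpo"   # deterministic reward function
    return "sft"        # default: learn from demonstrations

method = pick_method("preference alignment on human feedback pairs")
```

The value of packaging this in a skill is that every session gets the same mapping, instead of the agent re-deriving it from general knowledge each time.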
The Economics
HuggingFace’s Jobs API prices by hardware tier. For a sub-1B model like Qwen3-0.6B, a full fine-tuning run costs around $0.30 on a T4 instance. For 3-7B models with LoRA, costs land in the $15-40 range on an A10G. Models above 3B trigger automatic LoRA configuration; the skill selects peft_config=LoraConfig() rather than full fine-tuning to keep jobs feasible on single GPUs.
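Those tiers imply a small hardware-selection heuristic. The thresholds below mirror the numbers in this paragraph; the function itself and the tier names are illustrative assumptions, not the skill's actual code:

```python
def plan_hardware(num_params_b: float) -> dict:
    """Hypothetical sketch of the skill's hardware/method heuristic:
    sub-1B models fit full fine-tuning on a T4; above 3B, switch to
    LoRA to keep the job feasible on a single GPU."""
    method = "lora" if num_params_b > 3 else "full"
    hardware = "t4" if num_params_b < 1 else "a10g"
    return {"method": method, "hardware": hardware}

small = plan_hardware(0.6)   # e.g. Qwen3-0.6B
mid = plan_hardware(7.0)     # a 7B model
```

In the skill, the LoRA branch corresponds to passing peft_config=LoraConfig() to the trainer instead of fine-tuning all weights.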
This cost structure matters. Running an SFT job on a 1B model for under a dollar removes most of the friction from experimentation. You can iterate on dataset format, learning rate, and packing strategy across multiple runs without meaningful financial exposure.
What This Reveals About Agent Tool Design
The standard approach to giving an agent new capabilities is to add a tool that exposes an API endpoint. The agent calls the endpoint, gets a result, continues. HuggingFace’s skill design is more opinionated than that.
A skill bundles the decision logic alongside the API surface. The hf-llm-trainer skill doesn’t just expose a “submit training job” endpoint; it encodes when to use SFT versus DPO versus GRPO, what hardware to request for a given model size, when to apply LoRA, how to validate dataset schemas before submission, and how to interpret monitoring metrics. The agent isn’t making those decisions from first principles on each run; it’s executing against packaged expertise.
This is a meaningful design choice. Language models are capable of reasoning about training methods from their general knowledge, but they’ll make inconsistent choices without domain-specific grounding. The skill provides that grounding without requiring any fine-tuning of the base model. It’s operational knowledge delivered as context, not capability delivered as weights.
The tradeoff is that the skill author now bears the responsibility of encoding correct decision logic. If the GPU selection heuristics are wrong, or the DPO loss type defaults are suboptimal for a particular use case, every user of the skill inherits those decisions. The centralized packaging cuts both ways.
Practical Limits
Requiring a HuggingFace Pro or Team account for the Jobs API is a real gate. Free tier users can’t run jobs through the skill, which limits its reach among hobbyists and students who are often the most interested in experimenting with fine-tuning. The Pro tier costs $9/month, which is reasonable for anyone doing this regularly, but it adds friction for casual experimentation.
Multi-GPU training isn’t surfaced through the current skill interface. For anything above 7B parameters, you’re either accepting LoRA constraints or stepping outside the skill’s workflow entirely. This will presumably improve as the Jobs API adds multi-GPU support, but for now the skill targets small-to-mid models.
The monitoring integration via Trackio is worth noting: real-time loss curves and validation metrics are available during training, through the same interface where you submitted the job, closing a loop that usually requires separate tooling. You don't have to context-switch to TensorBoard or Weights and Biases mid-session.
The Broader Pattern
Looking back at this from early 2026, the HuggingFace Skills experiment sits in a clear lineage. The MCP ecosystem has expanded substantially since December, and the pattern of packaging domain expertise as agent-ready skill bundles rather than fine-tuned models is one of the more useful design patterns to emerge from that expansion. The alternative, fine-tuning Claude itself to understand HuggingFace’s training infrastructure, would have been slower to iterate on, harder to update as APIs change, and inaccessible to users of other agent frameworks.
The fact that the same skill can run on Claude Code, OpenAI Codex, and Gemini CLI is a direct consequence of building on the MCP standard rather than a proprietary plugin system. A skill authored once works across agents. That’s the kind of leverage that makes the investment in careful packaging worth it.
For anyone working with TRL directly, the TRL documentation covers the full trainer taxonomy including GRPOTrainer with vLLM acceleration, online DPO, and the various reward modeling trainers. The skill abstracts most of this, but understanding what’s underneath makes it easier to know when to step outside the abstraction.