Delegating the Fine-Tuning Loop: What HuggingFace Skills Gets Right About Agent-Driven ML
Source: huggingface
Looking back at this December 2025 announcement from HuggingFace with a few months of perspective, the framing around “Codex training models” undersells what’s actually interesting here. The real thing worth examining is the pattern: a structured AGENTS.md convention that turns ML training pipelines into agent-executable tasks, and what it means when an AI coding agent can take “fine-tune Qwen3-0.6B on codeforces-cots” and produce a running job on GPU infrastructure without you writing a single training script.
The AGENTS.md Convention
The mechanism behind HuggingFace Skills is AGENTS.md, a convention that’s been gaining traction as the “skills manifest” for AI coding agents. You put a structured markdown file in your repository that describes capabilities, tools, and workflows. Agents that support the convention, including OpenAI’s Codex, Claude Code, and Gemini CLI, read these files as context when you give them tasks.
HuggingFace’s skills repository implements this for ML workflows. Clone it, point your agent at it, and you get a set of skills the agent can invoke: training jobs, evaluation runs, model conversion, Hub publishing. The SKILL.md files describe the available operations in enough detail that the agent can select hardware, configure training methods, generate scripts, and submit jobs without asking for clarification on things it can reasonably infer.
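To make the mechanism concrete, here is a hypothetical sketch of what a skill manifest entry might look like. The field names, section layout, and hardware note are illustrative assumptions for this article, not the repository's actual schema:

```markdown
# Skill: sft-training

Fine-tune a model with TRL on HuggingFace Jobs.

## Inputs
- model: a Hub model id (0.5B-7B supported)
- dataset: a Hub dataset id in a chat or text format

## Workflow
1. Validate the dataset format against what SFT expects.
2. Pick a hardware flavor from the model size.
3. Generate a TRL training script with Trackio monitoring hooks.
4. Submit the script via `hf jobs uv run` with the chosen flavor.
```

The point of the format is that an agent can read this as plain context and infer the unstated details, rather than calling a purpose-built API.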
This is a different approach from building a dedicated UI or API wrapper. It works by extending the agent’s context rather than constructing a new tool surface. Whether that’s the right long-term architecture is an open question, but it’s pragmatic: you get to use whatever agent you’re already working with, and the skills are just files you can read, modify, and version.

What the Workflow Actually Looks Like
The concrete capability here is end-to-end fine-tuning delegation. You give the agent a natural language instruction:
Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots
The agent steps through a defined workflow. First, it validates the dataset format against what the selected training method expects. Then it selects hardware based on model size, for example t4-small for sub-1B models at roughly $0.75/hour. It generates a training script using TRL, HuggingFace’s transformer reinforcement learning library, with Trackio monitoring hooks wired in. Then it submits the job through HuggingFace Jobs, their pay-as-you-go compute platform.
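The hardware-selection step can be pictured as a simple size-to-flavor lookup. A minimal sketch: only the sub-1B to t4-small mapping (at roughly $0.75/hour) comes from the article; the larger tier and the function name are illustrative assumptions, not the skill's actual table.

```python
# Hypothetical sketch of the hardware-selection heuristic described above.
# Only sub-1B -> t4-small is from the article; the a10g tier for larger
# models is an assumed placeholder, not the skill's documented mapping.

def select_flavor(model_params_b: float) -> str:
    """Pick a HF Jobs hardware flavor from model size in billions of params."""
    if model_params_b < 1.0:
        return "t4-small"      # ~$0.75/hour per the article
    if model_params_b <= 7.0:
        return "a10g-large"    # assumed tier for 3-7B LoRA runs
    raise ValueError("models above 7B are not supported by the skills workflow")

print(select_flavor(0.6))  # -> t4-small
```

Encoding heuristics like this in skill documentation is what lets the agent proceed "without asking for clarification": the mapping is deterministic once the model size is known.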
The supported training methods cover the current standard toolkit:
- SFT (Supervised Fine-Tuning): the baseline approach for instruction following and behavior shaping
- DPO (Direct Preference Optimization): preference-based training without requiring a separate reward model
- GRPO: reinforcement learning with verifiable rewards, relevant for domains like code where you can programmatically check correctness
Model sizes range from 0.5B up to 7B. For larger models, you’re on your own for now.
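The "verifiable rewards" idea behind GRPO is worth grounding: for code, a reward can literally be "does the generated solution pass its checks". A minimal sketch, where the function name, signature, and 1.0/0.0 reward scheme are illustrative (TRL's GRPO training accepts user-supplied reward functions, but not necessarily of this exact shape):

```python
# Minimal sketch of a verifiable reward for GRPO-style training on code.
# The helper name and the binary pass/fail scoring are illustrative
# assumptions, not TRL's actual reward-function interface.

def code_reward(completion: str, test_cases: list[tuple[str, str]]) -> float:
    """Return 1.0 if the completion passes every (expression, expected) check."""
    namespace: dict = {}
    try:
        exec(completion, namespace)  # define the candidate solution
        for expression, expected in test_cases:
            if repr(eval(expression, namespace)) != expected:
                return 0.0
        return 1.0
    except Exception:  # syntax errors, runtime errors: no reward
        return 0.0

solution = "def add(a, b):\n    return a + b"
print(code_reward(solution, [("add(2, 3)", "5")]))  # -> 1.0
```

This is why the article singles out code as a natural GRPO domain: correctness is programmatically checkable, so the reward needs no learned model.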
The agent also maintains a markdown experiment report at training_reports/&lt;model&gt;-&lt;dataset&gt;-&lt;method&gt;.md. This document tracks training parameters, links to Trackio dashboards for live loss curves, evaluation results against benchmarks like HumanEval, and links to checkpoints published on the Hub. It’s a persistent artifact that documents what ran and what the results were, written automatically as the experiment progresses.
The Infrastructure Layer
HuggingFace Jobs is worth understanding separately from the skills layer. It’s a UV- and Docker-style compute interface: you describe a workload and a hardware tier, and it runs on HuggingFace’s infrastructure. The CLI deliberately mirrors UV’s interface:
```shell
# Standard UV
uv run my_script.py

# HF Jobs equivalent
hf jobs uv run my_script.py --flavor t4-small
```
Hardware options span from CPU-only up through A100s and TPUs, with per-second billing on actual usage. The economics at the small model range are genuinely accessible: a 0.6B model run on a T4 costs around $0.30; a 3-7B model with LoRA on an A10G or A100 runs $15-40 per experiment. Multiple iterations of a fine-tuning hypothesis won’t break the budget, which changes how you approach experimentation.
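Per-second billing makes the arithmetic easy to sanity-check. A back-of-envelope sketch using the article's ~$0.75/hour T4 rate; the 24-minute run time is an illustrative assumption chosen to land on the quoted ~$0.30 figure:

```python
# Back-of-envelope check on per-second billing. The $0.75/hour T4 rate is
# from the article; the 24-minute duration is an illustrative assumption.

def job_cost(hourly_rate: float, seconds: int) -> float:
    """Cost in dollars of a job billed per second of actual usage."""
    return round(hourly_rate / 3600 * seconds, 2)

print(job_cost(0.75, 24 * 60))  # ~24-minute 0.6B run on a T4 -> 0.3
```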
Jobs also supports cron scheduling and webhook triggers, which opens more automated pipelines beyond what the skills layer exposes: trigger a fine-tuning run whenever a dataset is updated on the Hub, for instance, or run nightly evaluations on a checkpoint series.
GGUF Export as the Last Mile
One detail worth noting: after training completes, the agent can convert and quantize the resulting model to GGUF format for local deployment:
```shell
llama-server -hf unsloth/Qwen3-1.7B-GGUF:Q4_K_M
```
Q4_K_M quantization cuts memory requirements substantially while preserving most of the quality of the full-precision model. The fact that this step is included in the skills workflow rather than being left as an exercise matters. The pipeline runs all the way from dataset to a model you can serve locally or embed in an application. That completeness is what separates a useful tool from a demo.
The Abstraction Question
What this setup is really doing is raising the abstraction level for ML experimentation. Historically, running a fine-tuning experiment meant knowing the TRL API in detail, writing a training script from scratch, provisioning hardware, wiring up monitoring, and handling checkpoint evaluation yourself. The skills layer delegates most of that to the agent.
The tradeoff is transparency and debuggability. When the agent selects hardware or configures training hyperparameters, it does so based on heuristics encoded in the skill documentation. If those heuristics are wrong for your specific case, you need to inspect what the agent generated and override it manually. The generated training reports help here, since they capture parameters and results in a durable, reviewable format.
There’s also a question of where agents are actually reliable in this workflow. The mechanical parts (selecting hardware tiers based on model size, formatting dataset paths, submitting job commands, generating boilerplate TRL scripts) are exactly the kind of structured, low-ambiguity tasks where current agents perform consistently. The parts that require genuine judgment, like deciding whether a dataset’s structure actually suits a given training objective, or interpreting evaluation results that land in ambiguous territory, still benefit from a human reviewing what the agent produced before accepting it.
Multi-Agent Compatibility as a Design Choice
The announcement was framed around Codex because Codex had just launched and was generating attention in December 2025, but the skills repository isn’t Codex-specific. Because it’s built on the AGENTS.md convention rather than a proprietary integration, the same skills work with Claude Code and Gemini CLI. You use whatever agent you’re already reaching for.
This matters for adoption. ML engineers and researchers aren’t going to switch agents to access a training workflow. If the skills work with the tool already in their environment, the friction to trying it drops substantially.
What I’d Actually Use This For
This workflow makes the most sense for rapid iteration on small-to-medium models where you want to test a hypothesis without spending an afternoon on infrastructure. If you’re building domain-specific tooling where a 1-3B fine-tuned model might outperform a general-purpose large model on your specific task, the entry cost here is low enough to justify running several experiments.
The repository is open source and the AGENTS.md pattern is extensible. If the default training workflow doesn’t match what you need, you can add or modify skill files directly. The skills are just markdown with structured context, not compiled artifacts. That’s probably where this gets most interesting over time: teams treating ML training pipelines as versioned, shareable skill definitions rather than one-off scripts that live in someone’s home directory.