
From Prompt to Published Model: Codex as an End-to-End ML Engineer

Source: huggingface

HuggingFace’s Skills training integration with OpenAI’s Codex, published in December 2025, was framed as “Codex fine-tuning open source models.” That framing undersells what was being shown: a coding agent managing a complete ML training pipeline from a single natural language instruction, through dataset validation, GPU job submission, checkpoint evaluation, quantization, and model publication. Looking back at it now, it is worth unpacking how the architecture fits together and thinking through what it actually delivers.

Which Codex, and How It Connects

This is not the original OpenAI Codex from 2021, the code-completion model that powered early GitHub Copilot. This is OpenAI’s agentic coding assistant, the one that executes shell commands, iterates on failures, and navigates documentation autonomously. The connection to HuggingFace runs through the Model Context Protocol, a tool-calling standard that lets agents discover and invoke external services through a standardized interface. HuggingFace built an MCP server exposing their Hub API, Jobs API, and Trackio metrics as callable tools.

Configuration is minimal:

# ~/.codex/config.toml
[mcp_servers.huggingface]
command = "npx"
args = ["-y", "mcp-remote", "https://huggingface.co/mcp?login"]

After connecting, Codex can see the full set of HuggingFace tools. The HF Skills repository is open source, so the tool definitions are inspectable and extensible. The integration is not locked to Codex; any agent that speaks MCP can connect to the same server and run the same workflows.

The Pipeline, Step by Step

The workflow the article demonstrates starts with a single prompt: “Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots.” What Codex does with that is a multi-step orchestration, not a single API call.

First, it submits a cheap CPU job to validate the dataset schema. This runs for fractions of a cent and confirms field names, checks for malformed records, and verifies the format matches what the training script expects. Wasted GPU time is expensive; a few seconds of CPU validation is not.
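The validation step amounts to a schema check over the dataset records. A minimal sketch of that kind of check, in plain Python (the field names and the `validate_records` helper are illustrative, not the actual HF Skills tool):

```python
# Sketch of the kind of check a cheap CPU validation job performs.
# The required fields and the helper name are assumptions for illustration,
# not the actual HF Skills implementation.

def validate_records(records, required_fields):
    """Return (index, problem) pairs for records that would break training."""
    problems = []
    for i, record in enumerate(records):
        for field in required_fields:
            value = record.get(field)
            if value is None:
                problems.append((i, f"missing field '{field}'"))
            elif not isinstance(value, str) or not value.strip():
                problems.append((i, f"empty or non-string '{field}'"))
    return problems

sample = [
    {"prompt": "Sort a list.", "completion": "sorted(xs)"},
    {"prompt": "", "completion": "..."},      # malformed: empty prompt
    {"completion": "no prompt field here"},   # malformed: missing prompt
]
issues = validate_records(sample, ["prompt", "completion"])
```

Failing fast here is the whole point: a malformed record surfaces in seconds on CPU rather than minutes into a paid GPU run.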

Second, the agent selects hardware based on model size. The 0.6B Qwen model routes to a t4-small. A model in the 3-7B range would go to an a10g-large with LoRA applied. The HF Jobs infrastructure offers T4, A10G, A100, and TPU v5e options on a pay-as-you-go basis, and Codex surfaces the estimated cost before committing.
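The routing logic can be sketched as a simple size-to-tier mapping. The thresholds below are inferred from the examples in the article (0.6B on a t4-small, 3-7B on an a10g-large with LoRA); the exact heuristic Codex applies, and the routing above 7B, are assumptions:

```python
# Illustrative size-to-hardware routing. Thresholds are inferred from the
# examples described above; the >7B branch is an assumption, since the
# automated flow does not yet support models that large.

def pick_hardware(params_billion: float):
    """Map a model's parameter count to an HF Jobs flavor and training mode."""
    if params_billion <= 1.0:
        return ("t4-small", "full fine-tune")
    if params_billion <= 7.0:
        return ("a10g-large", "LoRA")
    return ("a100-large", "LoRA")
```

Surfacing the estimated cost before committing follows naturally: each tier has a known hourly rate, so the agent can quote a price alongside the flavor it picked.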

Third, it generates a training script using TRL. For supervised fine-tuning the script uses SFTTrainer; for reinforcement learning runs it reaches for GRPOTrainer. The configuration follows sensible defaults: bf16 precision, gradient checkpointing enabled, gradient accumulation to compensate for small per-device batch sizes on cheaper hardware, and Trackio instrumentation for live metrics. A typical SFT config looks like:

from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=1,   # small micro-batches fit on a T4
    gradient_accumulation_steps=8,   # optimizer steps once per 8 micro-batches
    learning_rate=2e-5,
    bf16=True,                       # bfloat16 mixed precision
    gradient_checkpointing=True,     # trade recomputation for memory
    max_seq_length=2048,
    eval_strategy="steps",           # evaluate on a fixed step interval
    eval_steps=500,
)
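The accumulation setting is worth spelling out: the optimizer steps once per eight micro-batches, so a per-device batch of 1 trains with an effective batch of 8, which is how a T4 substitutes for a larger card at the cost of wall-clock time. The arithmetic:

```python
# Gradient accumulation trades memory for time: with a per-device batch of 1
# and 8 accumulation steps, each optimizer update sees an effective batch of 8.

def effective_batch_size(per_device: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    return per_device * accumulation_steps * num_devices
```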

Fourth, the script is submitted via run_uv_job(), which handles dependency installation and GPU provisioning on HuggingFace’s infrastructure. Codex saves the job ID, polls status, and reads streaming logs as the run progresses. If training diverges or a loss spike appears in the Trackio metrics, the agent can modify the script and resubmit without human intervention.
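The monitoring side is a poll-until-terminal loop. A generic sketch, where `fetch_status` stands in for whatever Jobs API call the agent actually makes (the real client interface is not shown in the article):

```python
import time

# Generic poll-until-terminal loop. `fetch_status` is a stand-in for the
# real HF Jobs status call; the terminal states are assumptions.

def wait_for_job(job_id, fetch_status, interval_s=30.0, sleep=time.sleep):
    terminal = {"COMPLETED", "ERROR", "CANCELLED"}
    while True:
        status = fetch_status(job_id)
        if status in terminal:
            return status
        sleep(interval_s)

# Stubbed usage: a job that reports RUNNING twice, then COMPLETED.
statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
result = wait_for_job("job-123", lambda _id: next(statuses),
                      sleep=lambda _s: None)
```

The interesting part is what happens on an ERROR or a bad metric trend: because Codex wrote the training script, it can diagnose the logs, edit the script, and resubmit inside the same loop.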

Fifth, after training completes, Codex evaluates checkpoints against a benchmark, selects the best one, and generates a markdown report. The concrete numbers from the article: Qwen3-0.6B fine-tuned on competitive programming problems reached a HumanEval pass@1 score of 0.342, compared to 0.306 for the base model. That improvement came from a run costing roughly $1-2 on a T4.
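For context on what that 0.342 means: HumanEval's standard pass@k metric is the unbiased estimator from the original HumanEval evaluation protocol, where a problem with n sampled solutions and c correct ones contributes 1 - C(n-c, k)/C(n, k), averaged over problems. For pass@1 this reduces to the mean fraction of correct samples:

```python
from math import comb

# Unbiased pass@k estimator from the HumanEval evaluation protocol:
# per problem, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures left to fill a size-k sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k=1):
    """results: list of (n_samples, n_correct) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

So 0.342 means that, on average, roughly a third of first attempts pass the hidden tests, up from roughly 30% for the base model.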

Finally, the model gets converted to GGUF with Q4_K_M quantization for local inference and pushed to the Hub. The full instruction is just:

Convert my fine-tuned model to GGUF with Q4_K_M quantization.
Push to username/my-model-gguf.

The result is available on the Hub and runnable locally via llama-server -hf username/my-model-gguf:Q4_K_M.

TRL Is Doing the Real Work

The training itself runs through TRL, HuggingFace’s post-training library, which has been the backbone of most open source fine-tuning work since 2023. TRL supports the full spectrum of post-training methods: SFT for supervised instruction following, DPO for preference learning without a reward model, and GRPO for reinforcement learning with verifiable rewards.

GRPO deserves specific attention because it became the dominant method for training reasoning models after DeepSeek-R1 demonstrated its effectiveness at scale. The Codex workflow supports all three approaches and selects between them based on dataset structure. A dataset with preference pairs gets DPO. A dataset with verifiable outputs, such as code problems with executable test cases or math problems with checkable answers, makes GRPO viable. This automatic selection is only possible because the agent inspects the dataset schema before committing to a training method.
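The schema-based routing described above can be sketched as a column check. The column names below follow conventional TRL dataset formats (`chosen`/`rejected` for preference pairs), but the actual heuristic Codex applies is not published, so treat this as an illustration:

```python
# Illustrative schema-based routing. Column names follow common TRL dataset
# conventions; the verifiable-reward columns and the precedence order are
# assumptions, not the published Codex heuristic.

def select_training_method(columns):
    cols = set(columns)
    if {"chosen", "rejected"} <= cols:
        return "DPO"       # preference pairs, no reward model needed
    if "test_cases" in cols or "answer" in cols:
        return "GRPO"      # verifiable rewards: executable tests or checkable answers
    return "SFT"           # default: supervised fine-tuning
```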

TRL also recently added OpenEnv integration, Meta’s framework for defining reinforcement learning environments, which expands what GRPO can train on beyond standard scalar reward functions. The December 2025 Skills workflow predates that integration, but the architecture accommodates it.

What Changes, What Stays the Same

The honest framing here is that this is primarily a friction reduction, not a capability expansion. ML engineers who already know TRL, HF Jobs, and Trackio could build this pipeline themselves in a few hours. The fundamentals are unchanged: you still need a good dataset, you still need to understand which benchmark matters for your use case, and you still need to interpret the results critically. A HumanEval pass@1 score is a narrow proxy; real-world performance on your specific domain may look very different.

What changes is the entry cost. The gap between “I have a dataset and a base model” and “I have a fine-tuned model on the Hub” previously required knowing Python well enough to write training scripts, understanding the TRL API, having a paid HF account, knowing which hardware tier to pick, writing evaluation harnesses, and managing checkpoints by hand. The Codex integration collapses most of that into a prompt and some patience while the job runs.

For developers who work adjacent to ML — building tools that call models, writing bots that embed inference, constructing data pipelines that process model outputs — this matters. Running a small domain-specific fine-tune to improve performance on your actual task, rather than relying entirely on a general-purpose base model, is now accessible without becoming a TRL specialist first.

The limitations are worth stating clearly. HF Jobs requires a Pro plan ($9/month) or Team/Enterprise access. Models above 7B are not well-supported in the automated flow yet. GRPO requires datasets with verifiable reward signals, which most people do not have prepared. HumanEval is narrow as a benchmark. And the workflow as described routes through OpenAI’s Codex, which requires ChatGPT Plus or above.

The MCP Angle

The piece of this integration that seems most durable is the MCP layer itself. The protocol has been adopted widely enough that the pattern of “AI agent calls external service through standardized tool interface” is becoming standard infrastructure. HuggingFace exposing their Jobs and Hub APIs through MCP means any agent that supports the protocol can orchestrate ML training workflows.

This composability is genuinely different from closed integrations. The skills server is open source and extensible. You could add tools for custom evaluation frameworks, for private compute clusters, or for experiment tracking systems beyond Trackio. The Codex integration is one instantiation of the pattern; the pattern itself transfers to any agent that speaks MCP.

For the open source ML community, the more significant shift here may be that training infrastructure is becoming a set of tool-callable APIs rather than scripts requiring local setup, CUDA configuration, and environment management. Whether that shift lands well depends on how the underlying compute costs and reliability evolve, but the direction is clear: the ML training pipeline is becoming something an agent can drive, not just a human with a terminal.

The December 2025 announcement was a demonstration of that end-to-end connection working. Small experiments are often where the durable tools start, and a $2 fine-tune that runs without manual intervention is small in the best sense.
