HuggingFace's upskill Solves a Model Selection Problem, Not Just a Knowledge Transfer Problem

The headline story from HuggingFace’s January 2026 upskill experiment was knowledge transfer: Claude Opus builds H100 CUDA kernels, the agent trace gets compressed into roughly 520 tokens, and smaller models improve significantly when that document is injected as context. The 45-percentage-point lift on a locally run quantized model is real and meaningful. But the same experiment produces a more immediately useful output: the evaluation table, which tells you which model to deploy for this class of task at each cost tier.

This matters because CUDA kernel writing is not an abstract benchmark. Teams building AI-heavy workflows on top of GPU infrastructure regularly need to automate parts of kernel development: targeting new architectures, adapting existing kernels to different compute capabilities, generating build configurations for new hardware. Every one of those tasks has a model selection question attached. Do you pay for Opus on every call, run Sonnet with a compact skill document in context, or route to a local model for near-zero marginal cost? upskill’s evaluation framework answers that question with actual pass rate data rather than intuition.

What the Evaluation Produces

When you run upskill generate, the tool does not just output a SKILL.md file. It evaluates the skill automatically before saving it, running the target task against the generating model with and without the skill in context, and reporting both pass rates and average token consumption per interaction.

upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder"

The output:

baseline   ████████████                60%
with skill ███████████████████    95%  (+35%)

Saved to ./skills/kernel-builder-cuda-kernels
SKILL.md              ~520 tokens

From there, you can evaluate any additional model targets:

upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model haiku --model sonnet --runs 5

# For local models
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_0
upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --base-url http://localhost:8080/v1

The output table includes pass rate, average assertions satisfied, and average tokens per interaction. For the H100 kernel-builder skill, the results sketch out a clear tier structure.
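To make the three columns concrete, here is a minimal sketch of how per-run results could be collapsed into pass rate, average assertions satisfied, and average tokens. The RunResult record and its field names are hypothetical and are not upskill's actual schema. The rule that a run passes only when every assertion holds is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One evaluation run. Field names are illustrative, not upskill's schema."""
    assertions_passed: int
    assertions_total: int
    tokens: int

    @property
    def passed(self) -> bool:
        # Assumption: a run passes only if every assertion is satisfied.
        return self.assertions_passed == self.assertions_total


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Collapse a batch of runs into the three columns of the table."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "avg_assertions": mean(r.assertions_passed for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
    }
```

Calling summarize once on the runs without the skill and once on the runs with it gives the two rows to compare for a model.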

Reading the Tier Structure

Claude Opus 4.5 sits at the top of the capability ladder. On the kernel-building test suite, it achieves a high pass rate without any skill in context. Feeding it the skill document produces no accuracy gain and increases token consumption. The tool surfaces this explicitly: the 520-token skill adds tokens to every request without improving the output. For Opus, the document is noise rather than information.

Claude Sonnet, without the skill, passes around 60% of test cases. With the skill in context, that rises to 95%. The pass rate is nearly as high as Opus's, but the cost per call is substantially lower, and the skill itself was generated once by Opus rather than on every Sonnet invocation.

Claude Haiku reaches around 80% with the skill. Local GLM-4.7-Flash-GGUF Q4_0 starts at 40% and reaches 85%, a 45-point lift. At zero marginal API cost, that is a capable enough tier for high-volume automated kernel generation tasks that do not require production-grade reliability on every call.

The practical read is this: Sonnet plus the skill is the default deployment for high-reliability kernel generation. The local GLM option covers high-volume, lower-stakes automation. Opus is reserved for generating new skills or handling tasks outside the skill’s scope, not for routine execution of tasks it has already documented how to do.
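That routing rule can be written down as a small lookup. The sketch below uses the with-skill pass rates quoted in this article. The tier labels and the cheapest-first cost ordering are assumptions made for illustration, not output from upskill.

```python
# With-skill pass rates quoted in this article. The cost ordering
# (cheapest first) is an assumption, not something upskill reports.
TIERS_CHEAPEST_FIRST = [
    ("local-glm-4.7-flash", 0.85),
    ("haiku", 0.80),
    ("sonnet", 0.95),
]

# Opus runs without the skill: the document adds tokens but no accuracy.
FALLBACK = "opus"


def pick_model(min_pass_rate: float) -> str:
    """Return the cheapest tier whose with-skill pass rate meets the bar."""
    for name, pass_rate in TIERS_CHEAPEST_FIRST:
        if pass_rate >= min_pass_rate:
            return name
    return FALLBACK
```

With these numbers, a floor of 0.80 routes to the local model, a floor of 0.90 routes to Sonnet, and anything stricter than 0.95 falls through to Opus.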

Why Token Efficiency Is the Second Axis

The evaluation tracks tokens alongside accuracy, and the Kimi-K2-Thinking result is instructive here. That model improved on both dimensions when the skill was in context: a higher pass rate and fewer tokens consumed per successful interaction. This pattern makes sense for reasoning-oriented models. When a model lacks domain-specific knowledge, it explores more paths before converging on a correct answer, generating more tokens as it reconsiders and self-corrects. A skill document that provides the right framework upfront eliminates much of that exploration. The model knows to set compute capability to 9.0 for H100 targets, knows the 128-byte shared memory alignment requirement for the async copy engine, knows the __CUDA_ARCH__ >= 900 guard for Hopper-only asynchronous copy paths, and produces a direct answer rather than working toward those constraints iteratively.
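Reading the two axes together amounts to a four-way classification. The function below is a hypothetical helper, not part of upskill. It takes the pass rate and average token count with and without the skill and names the outcome.

```python
def skill_effect(base_pass: float, skill_pass: float,
                 base_tokens: float, skill_tokens: float) -> str:
    """Classify a skill's effect on one model along both axes."""
    accuracy_up = skill_pass > base_pass
    tokens_down = skill_tokens < base_tokens
    if accuracy_up and tokens_down:
        return "helps on both axes"            # the Kimi-K2-Thinking pattern
    if accuracy_up:
        return "helps accuracy, costs tokens"
    if tokens_down:
        return "saves tokens only"
    return "no benefit"                        # the Opus pattern
```

A model whose pass rate is unchanged while its token count grows lands in the last branch, which is the signal to leave the skill out of that model's context.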

The test cases the tool generates to measure this are deliberately minimal:

{
  "cases": [
    {
      "input": "Create a build.toml for a CUDA kernel targeting H100",
      "expected": {"contains": "9.0"}
    },
    {
      "input": "Write a CUDA kernel with async memory copy for H100",
      "expected": {"contains": "__CUDA_ARCH__ >= 900"}
    }
  ]
}
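A minimal sketch of how cases in this shape could be scored follows. It assumes that "contains" means a plain substring match and that the model call is any function from prompt to text. Neither assumption is confirmed by the tool's documentation, and passes, pass_rate, and the generate parameter are hypothetical names.

```python
import json
from typing import Callable

# The two cases shown above, loaded from the same JSON shape.
CASES = json.loads("""
{
  "cases": [
    {"input": "Create a build.toml for a CUDA kernel targeting H100",
     "expected": {"contains": "9.0"}},
    {"input": "Write a CUDA kernel with async memory copy for H100",
     "expected": {"contains": "__CUDA_ARCH__ >= 900"}}
  ]
}
""")["cases"]


def passes(output: str, expected: dict) -> bool:
    """Assumed semantics: 'contains' is a plain substring match."""
    return expected["contains"] in output


def pass_rate(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Run every case through `generate` and return the fraction that pass."""
    results = [passes(generate(c["input"]), c["expected"]) for c in cases]
    return sum(results) / len(results)
```

Passing a different generate callable for each model, with and without the skill text prepended to the prompt, would yield the baseline and with-skill rates for that model.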

The test cases check for architecture-specific markers that models without H100-specific training data reliably get wrong. KernelBench, the Stanford benchmark for LLM-written GPU kernels, puts the broader challenge in context: even the strongest models reach only around 54% on Level 1 kernel tasks as of early 2025, without any skill document. Level 3 tasks involving full architecture blocks remain largely unsolved by single-pass generation. The upskill evaluation is narrower than KernelBench, but it tests the right failure modes for H100-targeted workflows: architecture-version confusion, missing capability guards, and incorrect project structure.

The Update Problem That Fine-Tuning Cannot Solve

One reason the context injection approach holds up particularly well for CUDA is the pace of GPU architecture change. H100 (Hopper, compute capability 9.0) added wgmma warpgroup matrix multiply instructions, the Tensor Memory Accelerator for asynchronous bulk copies, and larger shared memory relative to A100. The next architecture shifts the picture again. Fine-tuned weights that bake in H100-specific knowledge start losing relevance as soon as the deployment target changes.

A skill file is plaintext. Updating it means re-running an expert agent trace against the new target, which takes minutes. The skill published at hf-skills/h100-diffusers-kernel-builder on the HuggingFace Hub represents a snapshot of H100 conventions; when the production target changes, you generate a new skill rather than scheduling a fine-tuning run. The result works across any tool that implements the Agent Skills specification: Claude Code, Cursor, GitHub Copilot, OpenAI Codex, Gemini CLI, OpenHands, and others. One generation step, portable across deployment environments.

The Evaluation-First Design

What makes upskill worth paying attention to is that it builds evaluation into the artifact generation step, not as an optional check but as the output that justifies the artifact. You do not publish a skill until you know its lift per model. You do not put the skill in front of Opus, because the evaluation tells you the injection is counterproductive there. For a field that mostly treats context injection as additively positive by default, a tool that actively discourages context when measurement shows it does not help represents a different design philosophy. The deployment decision (which model, at what cost, with what expected pass rate) follows from the evaluation table. Producing that table is what the tool is actually for.