· 6 min read ·

Context Over Weights: What HuggingFace's Upskill Teaches About CUDA Knowledge Transfer

Source: huggingface

Back in January, HuggingFace published a post about their upskill tool that got filed under “interesting but niche” by most people who noticed it. Looking back at it now, I think the framing undersold what they actually built. The headline was “Claude builds CUDA kernels and teaches open models.” The more interesting story is the architectural choice they made to skip fine-tuning entirely.

What the Tool Does

The workflow is three steps. First, you run Claude Opus 4.5 through an H100 CUDA kernel-building task, recording the full agent trace. Second, you run upskill generate against that trace, which produces a SKILL.md file of roughly 520 tokens plus a skill_meta.json with auto-generated test cases. Third, you inject that skill document as context when running cheaper or open models on the same task class, and measure the lift.

# Generate a skill from an existing trace
upskill generate "write nvidia kernels" --from ./trace.md

# Evaluate smaller models with and without the skill
upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model haiku --model sonnet \
  --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --eval-base-url http://localhost:8080/v1

The benchmark results were meaningful. GLM-4.7-Flash-GGUF (Q4_0, running locally) went from 40% to 85% pass rate on the kernel-writing test suite, a 45-point gain. Claude Sonnet went from around 60% to 95%. Claude Haiku reached 80%. The teacher model, Opus 4.5, saw no meaningful gain from being fed its own distilled trace back, which is the expected result.

The kernels in question were real: a fused LayerNorm + GELU kernel targeting H100 (CUDA compute capability 9.0), with a project layout involving build.toml for architecture specification, .cu source files, and PyTorch C++ bindings. Not trivial template exercises.

Why CUDA Is a Hard Target for LLMs

The gap between generating syntactically valid CUDA and generating performant CUDA is enormous, and it explains why this domain is a useful testbed for knowledge transfer experiments.

Public code datasets contain very little CUDA. The Stack v2 covers over 600 languages and hundreds of billions of tokens; CUDA represents a fraction of a percent of that. Most production kernel code lives in closed NVIDIA libraries, internal ML infrastructure repositories, and codebases that were never publicly committed. Models trained on broad code corpora are working from an unusually thin slice of ground truth.

Compounding this is the fact that CUDA expertise is hardware-generation-specific in ways that most programming domains are not. The skills that matter for an H100 (Hopper, compute capability 9.0) do not all transfer to an A100 (Ampere, compute capability 8.0) or to next-generation hardware. Specifically:

  • cp.async for asynchronous bulk copies is only available at compute capability 9.0 or higher
  • Warpgroup matrix multiply-accumulate (wgmma) instructions are Hopper-only
  • Shared memory alignment for async copies on H100 must be 128 bytes; other targets have different requirements
  • The Tensor Memory Accelerator (TMA) for bulk asynchronous data movement is a Hopper-and-later feature

A model generating kernels for an H100 that writes __CUDA_ARCH__ >= 800 guards where it needs >= 900 will produce code that compiles, runs, and gives wrong results on some paths. A model that does not know about 128-byte alignment for async copies will produce code that is correct but measurably slower.

The KernelBench benchmark, from Cognition AI and Stanford (arxiv.org/abs/2502.10517), quantifies how far current models are from solving this domain. Even the strongest models reach roughly 54% pass rate on Level 1 kernels, which are single-operation tasks like softmax or LayerNorm in isolation. Level 3, which requires fused multi-operation kernels for full architecture blocks, is essentially unsolved by single-pass generation. The ceiling on current models is not syntax; it is hardware-specific optimization knowledge.

The Architectural Choice: Context Instead of Weights

The standard response to domain knowledge gaps is to fine-tune. Collect synthetic training data generated by a frontier model, filter for correctness, fine-tune a smaller model on the filtered set. This is the pipeline behind WizardCoder, Magicoder, and the DeepSeek distilled reasoning models, among others. It works, and the results are well-documented.

Upskill takes a different route. No weights are updated. The SKILL.md is injected as context at inference time. This is closer to structured few-shot prompting than to knowledge distillation in the ML sense, though the framing of “skills” adds structure that generic few-shot prompting lacks: the skill captures not just examples but the reasoning process, the architectural decisions, the tooling invocations, and the test validation loop.

For general-purpose coding tasks, the fine-tuning approach probably has a higher ceiling. A model that has internalized a domain through thousands of training examples can draw on that knowledge in any context, not just when a skill document is explicitly provided. Inference-time injection puts a hard cap on how much the model can adapt.

For CUDA specifically, though, the argument flips. GPU architectures change on a roughly two-year cadence. A fine-tuned model with H100 knowledge baked into its weights starts decaying the moment Blackwell ships with different async copy semantics, different warpgroup instructions, and a different memory hierarchy. A skill document generated from an expert trace on B200 hardware can be produced in minutes and immediately distributed to anyone using the skill. No retraining, no cluster time, no versioned checkpoint management.

The SKILL.md format follows the Agent Skills specification, which means the resulting artifacts are compatible with Claude Code, Cursor, and any other tool that supports the spec. That portability matters for something as tooling-specific as kernel development, where the workflow involves compilers, build systems, and profiling tooling that vary across environments.

The Benchmark Limitation Worth Taking Seriously

The pass rates reported by upskill measure structural correctness, not runtime GPU performance. A kernel passes the test suite if it generates the right project structure, includes the right configuration values, and produces code containing the expected strings. Whether the resulting kernel achieves good memory bandwidth utilization, avoids warp divergence, or actually saturates H100 tensor cores is a separate measurement that requires running on hardware with a profiler.

To give a sense of the gap: on an H100 SXM, peak FP16 tensor core throughput is approximately 1,979 TFLOPS and peak memory bandwidth is roughly 3.35 TB/s. A naive matrix multiplication kernel written without knowledge of tiling or shared memory staging typically achieves 5 to 10 percent of peak. A well-tuned implementation with proper warp-level primitives reaches 70 to 80 percent. Those are not small differences. A kernel that “passes” a structural test but achieves 8% of peak has not succeeded at the actual engineering goal.

Upskill does not claim otherwise; the post acknowledges the limitation. But it is worth being explicit about when reading the headline numbers. The benchmark is a reasonable proxy for “can the model navigate the domain at all,” not a measurement of whether the generated kernels belong in a production ML pipeline.

What the Experiment Is Actually Measuring

The more defensible claim from the upskill results is this: a 520-token document, constructed by compressing an expert agent trace, transfers enough domain knowledge to take a model from “frequently confused by this domain” to “usually navigates it correctly.” That is not the same as producing optimal code, but it is a real and useful threshold for a different class of use case, specifically for teams who need to write maintainable, correct CUDA extensions for PyTorch without employing a GPU performance engineering specialist full-time.

The tool is installable now via pip install upskill or uvx upskill, and the kernel-builder skill is open source. Testing it against your own model setup is straightforward. Whether the inference-time approach or the training-time approach turns out to be the right abstraction for CUDA expertise at scale is a question that the next GPU generation will help answer, since we will see then whether skill documents stay current faster than fine-tuned weights do.

Was this interesting?