CUDA kernel writing sits at an uncomfortable intersection of GPU architecture knowledge, systems programming discipline, and hardware-specific optimization. Most developers avoid it entirely, relying on cuBLAS, cuDNN, or Triton abstractions. The abstractions hold for common cases, but they leak when you need something the library authors did not anticipate: a fused operation, a non-standard memory layout, a kernel tuned for a specific hardware generation.
When you do write a kernel directly, the failure modes are numerous. Global memory reads must be coalesced, meaning adjacent threads in a warp must access adjacent memory addresses so the hardware can satisfy the request in a single 128-byte transaction. Shared memory is divided into 32 banks; two threads accessing different addresses in the same bank in the same cycle create a conflict that serializes execution. Warp divergence, where threads within a 32-wide SIMT unit take different branches, forces the hardware to execute each branch path serially while the threads on the other path sit idle. And all of this shifts substantially across GPU generations. The H100 introduced the Tensor Memory Accelerator (TMA) for asynchronous bulk copies and warpgroup-level matrix multiply instructions (wgmma) that simply do not exist on the A100.
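The coalescing and bank-conflict rules above reduce to simple address arithmetic. The following toy model (illustrative only; real hardware adds caches, sector granularity, and broadcast rules) counts the transactions and conflicts a 32-thread warp would generate for a given access stride:

```python
# Toy model of warp memory behavior: how many 128-byte global-memory
# transactions and shared-memory bank conflicts a given stride produces.
WARP_SIZE = 32
TRANSACTION_BYTES = 128
NUM_BANKS = 32
WORD_BYTES = 4  # 32-bit words

def global_transactions(stride_words: int) -> int:
    """Count 128-byte transactions when thread i reads word i*stride."""
    addrs = [i * stride_words * WORD_BYTES for i in range(WARP_SIZE)]
    return len({a // TRANSACTION_BYTES for a in addrs})

def bank_conflict_degree(stride_words: int) -> int:
    """Worst-case serialization factor for shared-memory word accesses."""
    banks = [(i * stride_words) % NUM_BANKS for i in range(WARP_SIZE)]
    return max(banks.count(b) for b in set(banks))

print(global_transactions(1))   # 1  -> fully coalesced
print(global_transactions(32))  # 32 -> one transaction per thread
print(bank_conflict_degree(1))  # 1  -> conflict-free
print(bank_conflict_degree(2))  # 2  -> two-way conflicts
```

A stride of one word is the ideal case in both metrics; a stride of 32 words (128 bytes) is the pathological one for global memory, turning a single warp read into 32 separate transactions.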
The cognitive overhead is the real barrier. A junior engineer can write a syntactically valid kernel in an afternoon. That kernel will probably be correct. It will almost certainly be slower than PyTorch’s default operators, which call into hand-tuned library routines. KernelBench, a benchmark from Cognition AI and Stanford that measures LLMs’ ability to write faster-than-PyTorch CUDA kernels, makes this gap concrete: even the strongest models achieve only around 54% on Level 1 tasks (single operations like softmax and LayerNorm) as of early 2025. Level 3 tasks, involving fused multi-operation kernels for full architecture blocks, remain essentially unsolved by single-pass generation.
This is the context for a January 2026 post from HuggingFace describing a tool called upskill and the outcome of using Claude Opus to build a real H100 CUDA kernel, then packaging that expert trace into a reusable artifact that boosts smaller models on the same task class.
What upskill Actually Does
The mechanism is simpler than it sounds. You start with an expert agent, in this case Claude Opus 4.5, working through a CUDA kernel-building task interactively while a trace is captured. The trace records the decisions made: which shared memory alignment to use (128 bytes on H100), when to use cp.async for asynchronous memory copies (available since compute capability 8.0, with the bulk TMA variants requiring 9.0), how to structure PyTorch C++ bindings, what the build.toml configuration requires. The expert does not just produce the final kernel; it navigates the same obstacles a human expert would encounter, and that navigation is preserved.
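What a captured trace might contain can be sketched as plain data. This is a hypothetical structure for illustration, not upskill's actual trace format:

```python
# Hypothetical shape of a captured decision trace (illustrative only;
# upskill's real on-disk format is not documented here).
trace = [
    {"step": "memory", "decision": "align shared memory to 128 bytes",
     "reason": "H100 bulk copies and vectorized loads expect 128B alignment"},
    {"step": "copies", "decision": "use cp.async for global->shared staging",
     "reason": "overlaps memory traffic with compute on sm_80 and newer"},
    {"step": "bindings", "decision": "expose the kernel via torch-ext/ C++ bindings",
     "reason": "matches the kernel-builder project layout"},
]

# Compressing a trace into a skill document keeps the decision/reason
# pairs and discards the dead ends explored along the way.
skill_lines = [f"- {d['decision']} ({d['reason']})" for d in trace]
skill_doc = "\n".join(skill_lines)
print(len(skill_doc.splitlines()))  # 3
```

The point of the compression step is that the dead ends matter during the expert run but not afterward; only the resolved decisions need to survive into the ~520-token document.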
From that trace, a skill document is generated. The HuggingFace kernel-builder skill (hf-skills/h100-diffusers-kernel-builder) compresses the essential domain knowledge into roughly 520 tokens. That document is then injected as context for smaller or cheaper models when they attempt the same class of task. The project structure it encodes looks like this:
project/
├── kernel_src/ # .cu CUDA source files
├── torch-ext/ # PyTorch C++ bindings
└── build.toml # build configuration
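A build.toml for this layout might look like the following sketch. The field names here are illustrative assumptions; the authoritative schema is defined by HuggingFace's kernel-builder tooling:

```toml
# Hypothetical build.toml sketch; key names follow the general shape of
# HuggingFace kernel-builder configs and may not match the current schema.
[general]
name = "my_fused_kernel"

[torch]
src = ["torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h"]

[kernel.my_fused_kernel]
cuda-capabilities = ["9.0"]              # H100 (Hopper) target
src = ["kernel_src/my_fused_kernel.cu"]
depends = ["torch"]
```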
The results are substantial. Claude Sonnet’s pass rate on a three-assertion test suite moved from 60% to 95% with the skill in context. A locally-running GLM-4.7-Flash-GGUF Q4_0 model moved from 40% to 85%. Claude Opus 4.5 itself showed no meaningful gain from the skill, which makes sense: the skill was derived from Opus’s own expert trace, so injecting it back is redundant rather than informative.
The skill is not magic. It acts as a dense reference that spares the model from re-deriving architecture-specific facts from scratch, or worse, confidently applying conventions from an older GPU generation. The H100’s compute capability 9.0 feature set is a common source of errors for models trained primarily on pre-Hopper CUDA examples.
The tool is installable via pip install upskill or uvx upskill, supports Anthropic, OpenAI, and OpenAI-compatible local endpoints, and follows the Agent Skills specification format with a SKILL.md and skill_meta.json structure compatible with Claude Code, Cursor, and similar environments.
Inference-Time Distillation vs. the Training-Time Alternatives
The upskill approach occupies an interesting position relative to the broader literature on transferring frontier model capability to smaller models. The dominant paradigm since 2023 has been training-time distillation: generate high-quality synthetic data from a frontier model, fine-tune a smaller model on it, and ship a model with baked-in capability.
The Phi series from Microsoft demonstrated how much synthetic data quality matters. Phi-1 (1.3B parameters) was trained on around 7 billion tokens of GPT-4-generated Python textbooks and matched Codex on HumanEval despite being far smaller. Phi-3-mini (3.8B) matched Mixtral 8x7B on many benchmarks using the same principle of quality over quantity.
Orca went further by distilling not just answers but reasoning traces. A 13B Llama model fine-tuned on 5 million GPT-4 chain-of-thought examples matched GPT-4-0314 on BigBench Hard at 42.3% versus 42.9%. The lesson was that teaching a model to reason through a problem transfers capability more efficiently than teaching it to produce correct outputs directly. DeepSeek’s January 2025 distilled R1 models followed the same logic at scale: a 7B model distilled from R1 reasoning traces reached 55.5% on AIME 2024, approaching o1-mini’s 63.0%, with no reinforcement learning applied to the student model at all.
The limitation of all these approaches is that they require a training run. You need the data, compute, and infrastructure to fine-tune, and the resulting capability is frozen into the weights. When GPU architecture advances and H100-specific code patterns need to be revised, you need to generate new training data and re-train.
Skill documents as context sidestep this. The knowledge lives outside the model weights, can be updated by re-running an expert trace, and can be swapped between model providers without any training infrastructure. The tradeoff is that the knowledge is only available when the skill is in context; the model has not internalized it. For niche, rapidly evolving technical domains like GPU kernel development, that tradeoff looks favorable. GPU generations change, and fine-tuned weights that bake in H100-specific knowledge start decaying the moment the next architecture ships with a different feature set.
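Mechanically, injecting a skill is just prepending the document to the prompt. A minimal sketch, assuming a generic chat-completions message format; the function name and prompt wording are hypothetical, not upskill's implementation:

```python
# Inference-time skill injection: prepend the skill document to the
# system prompt of a chat request. Prompt wording is illustrative.
def build_messages(skill_doc: str, task: str) -> list:
    system = (
        "You are writing an H100 CUDA kernel. "
        "Follow the domain notes below exactly.\n\n" + skill_doc
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task},
    ]

skill = "- Align shared memory to 128 bytes\n- Target compute capability 9.0"
msgs = build_messages(skill, "Fuse LayerNorm and GELU into one kernel.")
print(msgs[0]["role"])  # system
```

Because the knowledge travels in the request rather than the weights, the same two-line change works against Anthropic, OpenAI, or a local OpenAI-compatible endpoint, which is exactly the portability claim made above.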
The CUDA Data Scarcity Problem
There is a practical reason why skill documents make sense for CUDA specifically, beyond the updatability argument. Training data scarcity is a real constraint.
Public code datasets like The Stack v2, which underlies StarCoder2 and covers roughly 675 billion tokens across 600+ programming languages, contain minimal CUDA content. CUDA is well under 1% of the corpus. Most production kernel code lives in closed NVIDIA libraries, proprietary ML infrastructure repositories, and internal codebases that have never been publicly committed. The models that exist are working from a thin slice of available GPU programming examples.
This is part of why KernelBench results look the way they do. Models can write syntactically valid CUDA; the language itself is not particularly complex. The difficulty is knowing which optimization decisions matter for a given hardware target, and that knowledge is sparse in the training distribution. Injecting a dense skill document derived from an expert trace is a direct fix for a distribution gap that fine-tuning on more synthetic data would only partially address, since the synthetic data would face the same ground-truth scarcity problem during generation.
Triton, OpenAI’s Python-embedded DSL that compiles to PTX via MLIR, is a partial workaround. Models achieve around 60-70% correctness on simple Triton operations, higher than raw CUDA, because the abstraction level is closer to what appears in Python training data. The FlashAttention repository ships a Triton implementation of FlashAttention-2 alongside the CUDA reference, so at least that pattern is publicly represented. But Triton does not expose every H100 feature, and for kernels that need TMA or wgmma directly, you are back to raw CUDA.
The Broader Pattern
What the upskill workflow actually demonstrates is a decomposition of the expertise problem. Frontier models are good at navigating novel, complex tasks; smaller models are good at executing known patterns efficiently. The skill document is the handoff artifact. An expert run produces it once; lighter models use it repeatedly.
This pattern generalizes beyond CUDA. Any domain where the critical knowledge fits in a few hundred tokens, changes faster than training cycles, and requires hardware or environment-specific facts underrepresented in training data is a candidate. Embedded systems programming with vendor-specific HALs, newer API surfaces that postdate a model’s training cutoff, compiler intrinsics for specific architectures, custom DSLs in niche industries.
The WizardCoder results already showed that small models can exceed much larger ones on coding tasks when given the right training signal. The upskill approach suggests that you do not always need to train; sometimes you just need to document the expert knowledge in a form the model can use at inference time. For a domain as hardware-coupled and fast-moving as GPU kernel development, that is a more maintainable path than chasing the training data pipeline. The expert knowledge is perishable. The skill document can be regenerated in minutes. Fine-tuned weights cannot.