
Prompt-Level Distillation: How Hugging Face Compressed CUDA Expertise Into 520 Tokens

Source: huggingface

Looking back at the Hugging Face Upskill project, published in January 2026, the most interesting thing about it is not that Claude wrote CUDA kernels. It’s that the team found a way to compress the process of writing those kernels into 520 tokens and make that knowledge portable across entirely different models.

The distinction matters. Knowledge distillation in the traditional ML sense means training a smaller student model to mimic a larger teacher model through gradient updates. What Upskill does is different: it captures an expert agent’s reasoning trace, structures it into a dense instruction file, and injects that file as context at inference time. No weight updates. No dataset. No compute budget beyond the original run. Just a markdown file that tells a less capable model how to think about a specialized domain.

What the CUDA Task Actually Entails

Before getting into the mechanism, it helps to understand why CUDA kernel writing is a meaningful benchmark for this kind of capability transfer.

CUDA programming requires holding several overlapping mental models simultaneously: GPU memory hierarchy (registers, L1/L2 cache, shared memory, global memory), warp-level execution semantics, compute capability targeting, and the specific API requirements of whatever framework you’re binding to. Getting any of these wrong doesn’t produce a merely slower kernel. It produces one that either fails to compile or silently returns wrong results.

The Upskill team focused on kernels for the H100, which targets CUDA compute capability 9.0. At this capability level you gain access to features like asynchronous memory copies via cp.async and Hopper-specific wgmma (warpgroup matrix multiply-accumulate) instructions, but those features must sit behind explicit __CUDA_ARCH__ >= 900 preprocessor guards so the same source still compiles when targeting earlier architectures. The 128-byte shared memory alignment requirement for H100’s async copy engine is another non-obvious constraint that most models without specific training on Hopper documentation will get wrong.

The kernels they built included a fused LayerNorm + GELU kernel and attention kernels, wired up through PyTorch C++ bindings. The project structure they generated looks like this:

project/
├── build.toml
├── kernel_src/
│   ├── attention.cu
│   ├── layernorm.cu
│   └── geglu.cu
└── torch-ext/
    └── torch_binding.cpp

The build.toml specifies backend and capability targeting:

[general]
name = "diffuser_kernels"
backends = ["cuda"]

[general.cuda]
capabilities = ["9.0"]

This is not trivial GPU code. Fusing LayerNorm and GELU into a single kernel requires careful handling of reduction operations across thread blocks, and doing it correctly for H100 means taking advantage of shared memory bandwidth that the hardware provides but that naive implementations won’t use. Claude Opus 4.5 handled this through an interactive agentic session where the trace was recorded in full.

The Skill Distillation Mechanism

The Upskill workflow has three stages. First, you run Claude Opus 4.5 through the task and export the full agent trace as trace.md. Second, you run upskill generate to convert that trace into a structured skill file:

upskill generate "write nvidia kernels" --from ./trace.md

This produces a SKILL.md file of approximately 520 tokens and a skill_meta.json with test cases. Third, you run upskill eval to benchmark how well other models perform the skill with that file injected as context:

upskill eval ./skills/my-skill/ --model haiku --model sonnet

The test cases in skill_meta.json encode specific behavioral expectations:

{
  "cases": [
    {
      "input": "Create a build.toml for a CUDA kernel targeting H100",
      "expected": {"contains": "9.0"}
    },
    {
      "input": "Write a basic CUDA kernel template with proper includes",
      "expected": {"contains": "cuda_runtime.h"}
    }
  ]
}

The skill file is designed to be portable across agentic coding environments: Claude Code, Cursor, Codex. The Agent Skills specification it follows treats the skill file as a drop-in context injection, not a tool or plugin, so it works wherever you can inject a system prompt or context block.

What the Benchmarks Reveal

The accuracy improvements are where the methodology shows its character most clearly.

The local open-source model, unsloth/GLM-4.7-Flash-GGUF running quantized at Q4_0, improved accuracy by 45% with the skill applied; Claude Sonnet improved by 35%. With the skill injected, Haiku reached an 80% pass rate and Sonnet reached 100%. Claude Opus 4.5, the teacher model, showed no improvement and actually used more tokens.

That last result is the most informative. Opus 4.5 generated the trace that became the skill file. It already encodes the knowledge the file represents. Giving Opus its own distilled expertise back as context doesn’t teach it anything; it just adds noise. The skill is not information that improves the teacher, it is information from the teacher compressed for consumption by models that lack that background.

The pattern across the other models suggests the skill compensates most where base capability is lowest. GLM-4.7-Flash benefits more than Sonnet because Sonnet already knows more CUDA. This is roughly what you’d expect from context injection doing real work: it supplies missing domain knowledge rather than reinforcing existing competence.

Why This Is Not Fine-Tuning (And Why That Matters)

Traditional knowledge distillation, going back to Hinton et al.’s 2015 paper, works by having a student model learn from soft probability distributions produced by the teacher. More recent approaches like LLM distillation via chain-of-thought imitation extend this by having the student learn to reproduce the teacher’s reasoning traces directly. All of these require compute for training.

What Upskill does is closer to few-shot prompting with carefully structured domain context than to distillation in the ML sense. The SKILL.md file functions as a highly compressed domain brief. The model reading it is not learning in any weight-update sense; it is reasoning from additional context it would not otherwise have.

This has specific tradeoffs. The gains are real but bounded. A model whose weights do not encode GPU architecture knowledge at all will not become an expert from 520 tokens of context, regardless of how well those tokens are structured. The GLM-4.7-Flash +45% result is impressive, but that model is presumably not outperforming Sonnet on raw CUDA capability after the skill injection; it is just substantially better than its unaided baseline.

The advantage over fine-tuning is the absence of infrastructure cost. You can capture a skill from one Opus session, package it, and deploy it to any compatible model or tool in minutes. You can update the skill by running a new trace. You can share it via the kernel-builder repository without shipping model weights. For teams that need specialized capability on a budget, this is a real lever.

Situating This in the Broader GPU Programming Ecosystem

The choice of CUDA as the domain for this demo is not arbitrary. GPU programming remains one of the steeper specialization cliffs in software. Triton, OpenAI’s compiler for GPU kernels, has lowered the floor somewhat by letting you write kernel logic in Python with automatic memory management and tiling. But Triton’s abstractions don’t cover everything H100 can do, and for maximum performance on specific hardware, hand-written CUDA is still necessary.

There have been other attempts to use LLMs for kernel generation. NVIDIA’s own CUDA-Q platform focuses on quantum computing simulation rather than classical GPU kernels, but the company has invested in LLM tooling for GPU programming through its developer ecosystem. KernelBench from Scaling Intelligence provides a benchmark suite for evaluating exactly this capability, measuring whether LLM-generated kernels are both correct and faster than PyTorch baselines.

The Upskill project takes a different angle than benchmarking raw model capability. It asks: given that a capable model can solve a hard GPU programming problem, how do you make that solution reusable and transferable to cheaper inference?

The Practical Takeaway

For developers building on top of language models, Upskill represents a pattern worth understanding. The tool is available on PyPI, the methodology is documented, and the kernel-builder skill is a concrete example of what a well-structured skill file looks like.

The limitations are real. Pass/fail accuracy on test cases is not the same as runtime GPU performance. The benchmarks tell you whether a model can generate syntactically and structurally correct kernel code; they don’t tell you whether that kernel achieves good memory bandwidth utilization or avoids warp divergence. Evaluating actual kernel performance requires running the code on hardware and measuring it against a baseline like cuBLAS, which the project doesn’t do.

But as a mechanism for capability transfer, it is both simple and effective. If you have an expert model trace on any specialized domain, Upskill gives you a straightforward path to making that expertise available to smaller models without the infrastructure overhead of fine-tuning. For CUDA specifically, that’s useful. GPU programming expertise is scarce and the demand from teams training and serving models keeps growing.

The compression ratio is what sticks with me. A full agent session reasoning through H100 kernel design, capturing the build system configuration, the compute capability guards, the shared memory alignment requirements, all of it down to 520 tokens. That’s a dense document.
