500 Tokens of GPU Expertise: What Upskilling Open Models Actually Teaches Us

Back in January 2026, the HuggingFace team published an experiment that framed itself as a story about CUDA kernels, but was really a story about knowledge compression. The headline angle, using Claude to build optimized GPU kernels, is interesting on its own. The more durable finding is what happened after: a 520-token text file capturing domain expertise improved a local GLM-4.7-Flash model’s performance on kernel-writing tasks by 45 percentage points, with zero fine-tuning.

That number deserves more attention than it got.

What the Experiment Actually Did

The workflow has three steps. First, Claude Opus 4.5 works through the task interactively, building CUDA kernels using HuggingFace’s kernel-builder toolchain. The agent trace from that session is captured. Second, the trace is distilled into a SKILL.md file, a structured markdown document that encodes the domain-specific knowledge the agent needed: GPU architecture conventions, project layout, build configuration, PyTorch binding patterns, and H100-specific requirements like compute capability 9.0 and async memory copy semantics. Third, that skill file is injected as context for smaller, cheaper models, and their performance is benchmarked against the baseline.

The upskill CLI tool handles all three stages:

# Generate a skill from a task description (Opus 4.5 does the work)
upskill generate "build optimized CUDA kernels for PyTorch using HuggingFace kernel-builder"

# Evaluate how well the skill transfers to cheaper models
upskill eval ./skills/kernel-builder-cuda-kernels/ --model haiku --model sonnet

# Works with local models too
upskill eval ./skills/kernel-builder-cuda-kernels/ \
  --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
  --base-url http://localhost:8080/v1

The skill itself sits in a directory following the Agent Skills specification:

./skills/kernel-builder-cuda-kernels/
├── SKILL.md           # ~520 tokens of domain expertise
└── skill_meta.json    # test cases for evaluation

The skill_meta.json contains structured test cases with assertions, which is what makes the evaluation pipeline work. You can actually measure whether a model benefits from a given skill or not.

Why CUDA Kernels Are a Good Test Case

Writing correct, performant CUDA kernels is one of those tasks where having the right context matters enormously. It is not primarily about general reasoning ability. You need to know specific things: which compute capability corresponds to which hardware generation, how shared memory alignment affects throughput, what the async memory copy (cp.async) instruction requires on H100, how PyTorch’s C++ extension API expects bindings to be structured.

None of this is derivable from first principles in a single session. A model without this context will often produce kernel code that compiles but runs slowly, targets the wrong architecture, or uses PyTorch binding conventions from an older API version. These are exactly the failure modes where injecting 500 tokens of precise, structured context should help, and the results confirm that intuition.

For reference, the kernel-builder project structure they worked with looks like this:

project/
├── build.toml
├── kernel_src/
│   ├── attention.cu
│   ├── layernorm.cu
│   └── geglu.cu
└── torch-ext/
    └── torch_binding.cpp

The build configuration alone has sharp specificity:

[general]
name = "diffuser_kernels"
backends = ["cuda"]

[general.cuda]
capabilities = ["9.0"]  # H100 only

Getting this wrong means your kernels silently fall back to suboptimal code paths or fail to compile entirely. The skill file encodes exactly this kind of brittle, architecture-specific knowledge.

The Numbers and What They Mean

Claude Opus 4.5 went from 60% to 95% accuracy with the skill applied. GLM-4.7-Flash, running locally, went from 40% to 85%. Those are large gains for what amounts to a structured prompt prepended to context.

But the more interesting finding is the non-uniformity. Moonshotai’s Kimi-K2-Thinking model showed improved accuracy and reduced token consumption when given the skill, meaning it stopped exploring dead ends and got to the right answer faster. Claude Opus 4.5, by contrast, showed increased token usage without proportional accuracy gain, suggesting the skill’s information was largely redundant for a model already carrying relevant priors.

This is not a flaw in the method. It is a useful signal. The evaluation framework exists precisely to surface these cases, so you can decide where deploying a skill actually saves cost and where it just adds overhead. The underlying question, whether a given model needs this particular context, turns out to have a non-obvious answer that varies by model.

Context Engineering vs. Fine-Tuning

The framing of “upskilling” a model without training it is deliberate. Fine-tuning would be the obvious alternative: gather kernel-writing examples, train a LoRA adapter, deploy it. That approach has real advantages for models that will handle the same task at high volume. But it has costs: data curation, training runs, adapter management, and the risk of forgetting or degrading on other tasks.

What this experiment demonstrates is that for sufficiently constrained tasks, structured in-context knowledge can capture most of the benefit at a fraction of the overhead. The 520-token skill file is the distillation artifact: it takes the implicit knowledge embedded in Opus 4.5’s agent trace and makes it explicit, portable, and model-agnostic.

This connects to a pattern that has been building quietly across the tooling ecosystem. Claude Code uses CLAUDE.md. Cursor uses .cursorrules. Codex supports custom instruction files. Every serious AI coding tool has converged on some version of a structured knowledge document that shapes model behavior without modifying model weights. The Agent Skills spec is an attempt to standardize this across tools, which would mean a skill file written for Claude Code could be evaluated and deployed in Cursor or any other compliant host without modification.

The Economics Are the Point

The cost structure here matters. Opus 4.5 is expensive per token. You run it once to generate and validate the skill. After that, every subsequent invocation uses Haiku, Sonnet, or a local model that costs a fraction as much. If you are running kernel-writing workflows repeatedly, the one-time investment in generating a high-quality skill file pays for itself quickly.

This is a different economic model than the standard “use a bigger model for better results” approach. It treats expensive frontier model capacity as a knowledge-extraction mechanism rather than a per-query reasoning engine. The Opus call is an investment, not an operating cost.

For teams running open models locally, the calculus is even more direct. A local GLM-4.7-Flash instance at near-zero marginal cost, combined with a well-crafted skill, approaches the accuracy of a cloud frontier model on this specific task. The 85% pass rate on kernel-building, up from 40% without the skill, is the kind of result that makes it plausible to keep GPU-intensive workloads off remote APIs entirely.

What Generalizes

CUDA kernel development is a narrow domain, but the pattern is broadly applicable. Any task with a substantial gap between what a model knows by default and what it needs to know for your specific environment is a candidate. Build system conventions, internal API patterns, organization-specific coding standards, deployment pipeline specifics, these are all cases where 500 tokens of precise context could matter more than model size.

The evaluation framework is the part that tends to get underbuilt in practice. Most context engineering happens by intuition: someone writes a system prompt, it seems to help, the team uses it. The upskill approach adds automated pass/fail test cases to that process, which lets you measure transfer across models and catch regressions when you update the skill. That is a small but meaningful shift toward treating context engineering with the same rigor applied to software.

The broader takeaway is less about CUDA specifically and more about the gap between what a model knows in the abstract and what it can do in a constrained, well-defined environment when given the right structured context. That gap is often larger than expected, and 520 tokens can close a surprising amount of it.