Teaching Open Models CUDA Kernel Writing by Capturing What Claude Actually Does

Back in late January 2026, Hugging Face published an experiment that sits at an interesting intersection of agent tooling and model capability transfer. The article documents using Claude Opus 4.5 to build CUDA kernels targeting H100 GPUs for the diffusers library, then packaging the agent’s accumulated knowledge into a portable skill file, and finally measuring how much that skill improves smaller and open-source models on the same task. The numbers are meaningful, but the more interesting part is the mechanism: you are not fine-tuning anything. You are capturing procedural knowledge from an agent trace and making it reusable across any model that understands a simple Markdown-based format.

What the Agent Skills Format Actually Is

Before getting into the CUDA specifics, it is worth understanding what Agent Skills is, because it underpins the whole transfer approach. The format originated at Anthropic, was released as an open standard, and has since been adopted by a wide range of coding agents: Cursor, GitHub Copilot, Claude Code, OpenAI Codex, Gemini CLI, OpenHands, Goose, and others. A skill is a directory with a SKILL.md file at its root. That file has YAML frontmatter with a name and description field, followed by freeform Markdown instructions. The frontmatter description is what agents load at startup for every available skill, roughly 100 tokens per skill, so they can decide whether any given skill is relevant to the current task. When they decide to activate a skill, they load the full body, which the spec recommends keeping under 5000 tokens.

skill-name/
├── SKILL.md           # Required: YAML frontmatter + Markdown instructions
├── scripts/           # Optional: executable code the agent can run
├── references/        # Optional: detailed docs loaded on demand
└── assets/            # Optional: templates, configuration files

The progressive disclosure design is deliberate. You do not want every skill fully loaded into every context window. Agents skim descriptions, activate what is relevant, and only pull in deeper reference files when the task requires it. The format validation is strict on the name field (lowercase, hyphens, 64 character max, must match the directory name) but completely open on the Markdown body. You write whatever instructions help the agent do the task reliably.

This format being adopted across many tools is significant for what Hugging Face is doing here. A skill written once to capture CUDA kernel conventions can, in theory, run inside any compatible agent, not just Claude.

The CUDA Kernel Problem

Writing CUDA kernels for modern Nvidia hardware is not a task where generic model knowledge is sufficient. For H100 GPUs, which are compute capability 9.0, there are specific requirements that differ materially from older architectures. Shared memory needs to be aligned to 128-byte boundaries. Asynchronous memory copies using cp.async require __CUDA_ARCH__ >= 900. The diffusers kernel builder expects a specific project layout:

project/
├── build.toml              # Build configuration
├── kernel_src/             # CUDA kernel implementations
│   ├── attention.cu
│   ├── layernorm.cu
│   └── geglu.cu
└── torch-ext/              # PyTorch C++ bindings
    └── torch_binding.cpp

The build.toml configuration needs to explicitly declare the target compute capability:

[general]
name = "diffuser_kernels"
backends = ["cuda"]

[general.cuda]
capabilities = ["9.0"]

This kind of specificity, architecture-aware memory alignment, correct async copy guards, the build system’s expectations around PyTorch C++ bindings, is exactly what a general-purpose model may not get right consistently. The configuration format is tool-specific; the hardware constraints are non-obvious if you have not read the H100 architecture guide. A model that has seen a lot of general CUDA code will still make mistakes on H100-specific idioms.

What the upskill experiment does is have Claude Opus 4.5 work through this problem interactively, build the kernels correctly, and then extract the knowledge the agent accumulated into a skill file. The resulting skill encodes about 520 tokens of instructions covering GPU architecture targeting, shared memory alignment requirements, async copy guards, project structure conventions, and the PyTorch binding setup.

The Knowledge Transfer Pipeline

The upskill tool, installable via pip install upskill or uvx upskill, provides a three-stage workflow:

# Generate a skill from an agent trace
upskill generate "write nvidia kernels" --from ./trace.md

# Evaluate multiple models against the skill
upskill eval ./skills/kernel-builder-cuda-kernels/ \
    --model haiku --model kimi --runs 5

# Evaluate a local model
upskill generate "parse YAML" \
    --model opus \
    --eval-model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
    --eval-base-url http://localhost:8080/v1

The evaluation uses test cases defined in skill_meta.json, input/output pairs where the expected output is checked using assertion conditions like {"contains": "9.0"}. The teacher model (Opus) generates both the skill and the test cases. Each student model is then evaluated with and without the skill loaded, and the difference is the skill lift.

This is structurally different from knowledge distillation in the traditional sense. Classic distillation, in the vein of Hinton’s original 2015 paper, involves training a smaller student model on the soft output distributions of a larger teacher. What upskill does is closer to what you might call trace-to-instruction extraction: the teacher’s successful agent trajectory gets summarized into a set of procedural instructions that any model can follow at inference time, no weight updates required. The cost is that it only works for tasks where the knowledge can be expressed as instructions. You cannot encode statistical pattern recognition this way. But for tasks with explicit conventions and non-obvious configuration requirements like CUDA kernel development, instructions are exactly what is missing.

What the Benchmarks Show

The results from the Hugging Face experiment are worth looking at carefully:

Claude Opus 4.5 improved from 60% to 95% accuracy on CUDA kernel tasks with the skill loaded (+35 percentage points)
unsloth/GLM-4.7-Flash-GGUF (Q4_0 quantization, local inference) improved from 40% to 85% (+45 points)
Claude Haiku achieved 4/5 pass rate (80%) at roughly 1,250 tokens per evaluation
Claude Sonnet reached 5/5 (100%) with the skill loaded

The GLM-4.7-Flash result is the interesting one. A heavily quantized local model improving by 45 percentage points on a specialized GPU kernel writing task is not a marginal gain. It brings a much cheaper and fully local model within range of a general-purpose frontier model on a specific domain. If you are running a coding agent workflow and CUDA kernel generation is part of it, you may not need Opus-level inference for that step.

The benchmarks also surface something the article is honest about: skills are not universally beneficial. For moonshotai/Kimi-K2-Thinking, the skill improved both accuracy and token efficiency. For Claude Opus 4.5 itself, the skill increased token usage without a clear accuracy gain, which the tool correctly flags as a sign the skill is not worth using for that particular model on that task. A capable model that already has the requisite knowledge in its weights does not need a skill file explaining it; the instructions just add noise and cost.

Why This Pattern Matters Beyond This Experiment

The broader pattern here is that Agent Skills is becoming a meaningful unit of knowledge distribution for agentic workflows. The adoption across Cursor, GitHub Copilot, OpenHands, Databricks, and others means a skill written for one tool can, with minimal modification, benefit users of many different products. The format is simple enough that writing a skill is not much harder than writing a good README. The upskill tool lowers that further by automating the extraction from agent traces.

For teams running their own LLM deployments, this is a legitimate alternative to fine-tuning for narrow domain adaptation. Fine-tuning is expensive, requires curated training data, and produces a model that you have to maintain and redeploy whenever the base model updates. A skill file is a text document. It version-controls cleanly, requires no infrastructure beyond loading it into a context window, and can be updated whenever conventions change. The tradeoff is that it only works for tasks where explicit instructions help; you cannot skill your way past fundamental model capability gaps on reasoning-heavy work.

The CUDA kernel case is a good illustration of where skills are strong: the task involves specific conventions, configuration formats, and architectural constraints that are well-documented but rarely seen together in training data at sufficient density. That is exactly the kind of knowledge that transfers well through instructions rather than through weight updates.

For those interested in the kernel builder infrastructure itself, Hugging Face’s kernel-builder project is a broader effort to make custom CUDA kernel development more accessible for the diffusers ecosystem. The upskill experiment is one slice of that: making the tool itself teachable to agents, so the workflow can be automated end-to-end.

The skill produced by the experiment is publicly available on the Hugging Face Hub, which is another piece of the picture. Skills as shareable artifacts, hosted the same way models and datasets are, fits cleanly into a world where model selection and capability extension are increasingly decoupled. You pick your model based on cost and latency constraints, then extend it with skills for the specific domains you care about. The separation makes sense, and the upskill experiment with CUDA kernels is a clean demonstration of what that looks like in practice.