The Capability Diffusion Problem: When Small Models Can Find What Mythos Found

The security community spent much of 2024 processing what large language models could do when pointed at real codebases and given enough scaffolding. The results were uncomfortable. Researchers at UIUC showed that GPT-4 could autonomously exploit one-day vulnerabilities given only a CVE description and access to a target system, succeeding on 87% of a set of 15 real-world vulnerabilities. Google’s Big Sleep, a collaboration between Project Zero and Google DeepMind, found a previously unknown stack buffer underflow in SQLite. These felt like frontier-model achievements, the kind of thing that required frontier-model scale.

The Aisle blog post on AI cybersecurity after Mythos argues that this assumption was wrong: small models are finding the same vulnerabilities. The jagged frontier, it turns out, cuts differently than expected.

What the Jagged Frontier Actually Means

Ethan Mollick popularized the jagged frontier framing to describe how AI capability is uneven in ways that don’t map to intuitive difficulty. A model might write a competent legal brief and fail at basic arithmetic in the same session. The frontier isn’t a smooth line where hard things are beyond the boundary and easy things are within it. It’s jagged, and the jaggedness is hard to predict without empirical testing.

Cybersecurity research has been one of the better stress tests of this idea. The assumption going in was that vulnerability discovery would sit firmly outside what smaller models could do, because it requires combining deep domain knowledge, multi-step reasoning across large codebases, and the ability to hold complex state about program semantics while tracking potential attack paths. These feel like hard things.

But “hard” in human terms doesn’t translate cleanly to “requires a 70B+ parameter model” in practice. Recognizing patterns across code, matching known vulnerability classes to structural signatures, and following data flow through a bounded call graph are all tasks that can be learned from training data, and performance on them doesn’t necessarily depend on model size alone.

The Mythos Benchmark and What Came After

Mythos, as described in the source article, demonstrated that AI systems could identify real, exploitable vulnerabilities at a level of reliability that changed how practitioners were thinking about AI-assisted offensive security. The findings were presented as a frontier-scale achievement, something that required heavy compute and large models to pull off.

The uncomfortable follow-on result is that smaller models, run at a fraction of the cost and at greater scale, are replicating those same findings. This isn’t surprising to researchers who have been watching the general pattern of capability diffusion in AI, where a result that initially requires a frontier model tends to become reproducible with smaller, fine-tuned models within 6 to 18 months. It is alarming in the security domain specifically, because the asymmetry between attack and defense matters so much.

Why Model Size Matters Less Than People Thought for This Task

There are a few structural reasons why vulnerability discovery might be more amenable to small models than other complex reasoning tasks.

First, the vocabulary of vulnerability classes is constrained. Buffer overflows, use-after-free, SQL injection, type confusion, integer overflows, and path traversal form a bounded set of patterns. A model fine-tuned on CVE writeups, proof-of-concept exploits, and vulnerable code examples can develop strong pattern-matching for these classes without needing the general reasoning depth of a frontier model. The task is closer to specialized classification than it is to open-ended reasoning.
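
To make the “bounded set of patterns” point concrete, here is a deliberately crude sketch in Python: a handful of regular-expression signatures, each tied to one vulnerability class, applied to a source snippet. The signature table and the `classify_snippet` helper are hypothetical names invented for this illustration. A fine-tuned model learns far richer features than these regexes, but the shape of the task is the same: map a piece of code to one of a small set of labels.

```python
import re

# Hypothetical, deliberately tiny signature table: one regex per vulnerability class.
# Real detectors (and fine-tuned models) use far richer features than this.
SIGNATURES = {
    "buffer-overflow": re.compile(r"\b(strcpy|strcat|gets|sprintf)\s*\("),
    "sql-injection": re.compile(
        r"(SELECT|INSERT|UPDATE|DELETE)\b.*[\"']\s*\+", re.IGNORECASE
    ),
    "path-traversal": re.compile(r"\.\./"),
}


def classify_snippet(code: str) -> list[str]:
    """Return the sorted names of every class whose signature matches the snippet."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern.search(code))
```

Calling `classify_snippet` on a line that copies user input with `strcpy` returns the buffer-overflow label, and a snippet with no matching signature returns an empty list. The label set is closed, which is what makes the problem look like classification.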

Second, the tooling has improved dramatically. Models don’t operate in isolation; they’re embedded in pipelines with static analysis tools, symbolic execution engines, fuzzing harnesses, and structured code representations like tree-sitter ASTs or LLVM IR. A smaller model guiding a well-designed pipeline can outperform a larger model working with less scaffolding. The system capability often matters more than the model capability.
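
The “small model inside a pipeline” idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular product’s design: a cheap static pass over a structured representation (here Python’s built-in `ast` module rather than tree-sitter or LLVM IR) narrows the code down to a few candidate call sites, and only those are handed to a scoring function. `Candidate`, `static_pass`, `triage`, and `stub_score` are hypothetical names, and `stub_score` is a placeholder standing in for a call to a small fine-tuned model.

```python
import ast
from dataclasses import dataclass
from typing import Callable

# Calls the cheap static pass treats as risky sinks (illustrative, not exhaustive).
RISKY_CALLS = {"eval", "exec", "os.system", "pickle.loads"}


@dataclass
class Candidate:
    call: str     # dotted name of the call, e.g. "os.system"
    line: int     # 1-based line number in the source
    snippet: str  # source text of the call expression


def dotted_name(node: ast.expr) -> str:
    """Render Name/Attribute chains such as os.system; return "" for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def static_pass(source: str) -> list[Candidate]:
    """Stage 1: parse to an AST and collect calls to known risky sinks, in line order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                snippet = ast.get_source_segment(source, node) or ""
                found.append(Candidate(name, node.lineno, snippet))
    return sorted(found, key=lambda c: c.line)


def triage(source: str, score: Callable[[Candidate], float]) -> list[Candidate]:
    """Stage 2: rank only what stage 1 surfaced, highest score first."""
    return sorted(static_pass(source), key=score, reverse=True)


def stub_score(candidate: Candidate) -> float:
    """Placeholder for a model call: sinks fed only literal arguments score low."""
    if not candidate.snippet:
        return 0.9
    call = ast.parse(candidate.snippet, mode="eval").body
    literal_only = isinstance(call, ast.Call) and all(
        isinstance(arg, ast.Constant) for arg in call.args
    )
    return 0.1 if literal_only else 0.9
```

The design point is that the scorer never sees the whole codebase. The static stage does the broad search, so the model only has to make a narrow judgment about each candidate, which is the kind of judgment a smaller model can plausibly make well.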

Third, the training signal for fine-tuning is rich and available. Public CVE databases, bug bounty disclosures, open-source vulnerability patches, and CTF writeups provide structured examples, so the model doesn’t have to generalize to truly novel domains. You’re teaching it to recognize new instances of known patterns, not to invent new ones.

Put together, this looks like a tool such as Semgrep made AI-native: a smaller model with deep specialization and good tooling, not a frontier model doing freeform reasoning.

The Threat Model Shift

The concerning part isn’t that small models can find vulnerabilities. It’s what becomes possible when that capability is cheap, fast, and scalable.

Large frontier models are expensive to run at scale. API costs, rate limits, and inference latency impose natural constraints on how much volume an attacker can process. A security researcher using GPT-4 to audit a codebase pays a real price for every query. A fine-tuned 7B or 13B model running locally on commodity hardware faces none of those external limits, and its marginal cost is mostly hardware time. You can run it against thousands of open-source repositories in bulk. You can build automated pipelines that triage findings, prioritize by exploitability, and generate proof-of-concept code without human review at each step.
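
The cost comparison above is simple arithmetic, and it can help to write it down. The sketch below is a back-of-the-envelope model in Python, not a measurement: both function names are hypothetical, every input (token counts, price per million tokens, throughput, hourly hardware cost) is a parameter you would have to supply from your own setup, and rate limits and latency are not modeled at all.

```python
def api_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of pushing `tokens` through a metered API at a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def local_cost_usd(
    tokens: int, tokens_per_second: float, usd_per_gpu_hour: float
) -> float:
    """Cost of processing `tokens` on self-hosted hardware, counted as GPU time."""
    gpu_hours = tokens / tokens_per_second / 3600
    return gpu_hours * usd_per_gpu_hour
```

The structural difference is visible in the signatures: the API cost scales with a price someone else sets, while the local cost scales with throughput and hardware time that the operator controls. Multiply either by the number of repositories scanned to get the volume picture.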

This is capability diffusion in the form that matters for threat modeling: not whether the capability exists somewhere, but whether the cost structure allows it to be deployed at volume by actors without significant resources.

Defense has always been harder than offense because defenders have to protect everything while attackers only need to find one path. AI-assisted vulnerability discovery at low cost and high volume tilts that asymmetry further.

What the Security Community Should Take From This

The instinct to track what frontier models can do is correct but incomplete. The Mythos result and its replication by smaller models together tell a more useful story: the relevant question is not “what can the biggest model do” but “how quickly does that capability become commodity.”

For practitioners, this means a few things. Dependency scanning and SAST tooling need to be treated as a baseline, not a differentiator, because attackers have equivalent or better automated analysis. Vulnerability disclosure timelines need to account for the possibility that AI-assisted discovery is shrinking the window between patch availability and exploit development. And the security research community needs better empirical benchmarks for tracking capability diffusion across model sizes, not just frontier capability.

The DARPA AI Cyber Challenge is one serious attempt to create rigorous evaluation infrastructure for this space. The CTF community has been another, more ad hoc, testing ground. But systematic tracking of which vulnerability classes are reproducible at which model scales remains largely undone.
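
One way to picture that tracking is as a table of observations keyed by vulnerability class and model size. The Python sketch below is an assumption about how such a record might be summarized, not a description of any existing benchmark: `Observation` and `smallest_sufficient_scale` are hypothetical names, and the caller chooses the success-rate threshold.

```python
from typing import Iterable, Optional

# One benchmark observation:
# (vulnerability class, model size in billions of parameters, success rate in [0, 1]).
Observation = tuple[str, float, float]


def smallest_sufficient_scale(
    observations: Iterable[Observation], threshold: float
) -> dict[str, Optional[float]]:
    """Map each vulnerability class to the smallest tested model size whose
    success rate meets the threshold, or None if no tested size does."""
    result: dict[str, Optional[float]] = {}
    for vuln_class, size_b, success_rate in observations:
        current = result.setdefault(vuln_class, None)
        if success_rate >= threshold and (current is None or size_b < current):
            result[vuln_class] = size_b
    return result
```

Re-running the same summary on results collected at different dates would show diffusion directly: a class whose entry drops from a large size to a small one, or from `None` to any size at all, has moved down the scale.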

The jagged frontier framing is useful precisely because it resists the temptation to draw a clean line. Security capabilities in AI don’t live at a fixed point on a capability curve that only frontier models cross. They’re distributed unevenly across task types and vulnerability classes, and the distribution is shifting as fine-tuning data and tooling improve. Assuming that what required GPT-4 in January will still require GPT-4 in September is the kind of assumption that gets infrastructure compromised.

The Mythos findings were significant. The follow-on finding that smaller models replicate them is more significant, because it changes who can run the attack.