
The Model Size Assumption in AI Security Is Starting to Break

Source: hackernews

There is a version of the AI cybersecurity story that goes like this: the bigger the model, the more dangerous. Frontier models find novel vulnerabilities; smaller models cannot. Security researchers need GPT-4-class systems to do anything meaningful. That version has been quietly falling apart for about a year, and the findings discussed in the Aisle blog’s analysis of what came after Mythos are a concrete data point for why.

The short version of the finding: small models found the same vulnerabilities that Mythos found. If you have been watching this space, that is worth pausing on.

What Mythos Represented

Mythos sits in the lineage of work that started getting serious attention in 2024, when researchers at the University of Illinois Urbana-Champaign published results showing that GPT-4 could autonomously exploit one-day vulnerabilities given only a CVE description and access to a target system. The exploit success rate was 87% across their test set of 15 real-world vulnerabilities, while every other model they tested, GPT-3.5 and several open-source models included, exploited none of them. GPT-4 could reason through multi-step exploitation chains; in those tests, GPT-3.5 could not.

That paper, and the broader class of results it belongs to, established that frontier models had crossed some threshold. They were not just pattern-matching on known vulnerability signatures. They were doing something closer to actual security reasoning: reading documentation, forming hypotheses, running commands, revising based on feedback.

Mythos extended this. The system demonstrated capability at scale, across a broader and more realistic set of targets than previous research had managed. It made the case that AI-assisted vulnerability discovery was not a lab curiosity but something approaching a usable capability.

The uncomfortable follow-on question was always: what happens when models one-tenth the size can do the same thing?

The Jagged Frontier, Applied to Security

The “jagged frontier” concept comes from research by Fabrizio Dell’Acqua, Ethan Mollick, and colleagues studying how consultants used GPT-4. The core observation is that AI performance is not a smooth gradient. Models are dramatically better than humans at some tasks and surprisingly worse at others that seem simpler. The performance surface is jagged, not a ramp.

This applies to security tasks in ways that are not obvious from headlines. Consider what “finding a vulnerability” actually involves:

  • Pattern recognition over known vulnerability classes: heap overflows, format string bugs, integer overflow conditions, injection points. This is largely syntactic and is something models with modest capability do reasonably well.
  • Dataflow analysis across a codebase: understanding that user input from function A reaches a dangerous sink in function C, possibly through three intermediate transformations. This requires maintaining a coherent mental model of the program across context.
  • Novel vulnerability discovery: identifying a class of bug that does not fit existing patterns, requires reasoning about timing or concurrent state, or depends on subtle protocol semantics.

The jagged frontier prediction would be: small models do surprisingly well at the first category, fail more often at the second, and mostly cannot do the third. That maps well onto what the research is showing.
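To make the gap between the first two categories concrete, here is a toy sketch. The two code samples and the `naive_scan` helper are invented for illustration and are not taken from Mythos or any benchmark. A purely syntactic check catches the bug when the tainted input and the dangerous sink share a line, and misses the same bug once the input travels through a few helper functions.

```python
import re

# Category 1: tainted input and dangerous sink sit on the same line.
DIRECT = '''
def handler(request):
    os.system("ping " + request.args["host"])
'''

# Category 2: the same bug, but the input passes through helper
# functions before it reaches the sink.
INDIRECT = '''
def handler(request):
    run_check(normalize(request.args["host"]))

def normalize(value):
    return value.strip().lower()

def run_check(target):
    cmd = build_command(target)
    os.system(cmd)

def build_command(target):
    return "ping " + target
'''

# A deliberately naive rule: flag os.system(...) only when request data
# appears inside the call on the same line.
SINK_WITH_INPUT = re.compile(r"os\.system\(.*request\.")


def naive_scan(source: str) -> list[int]:
    """Return 1-based line numbers where the naive rule fires."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SINK_WITH_INPUT.search(line)
    ]


print(naive_scan(DIRECT))    # the direct case is flagged
print(naive_scan(INDIRECT))  # the indirect case slips through
```

Catching the second sample requires tracking the value from `handler` through `normalize`, `run_check`, and `build_command`, which is the kind of cross-function bookkeeping the second category describes.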

When Mythos’s vulnerabilities turn out to be findable by small models, it is worth asking which category those vulnerabilities fell into. The answer probably matters more than the headline.

What Small Models Are Actually Doing

Models like Phi-3 Mini, Mistral 7B, and CodeLlama 13B have been benchmarked on security tasks with results that would have seemed implausible two years ago. They can:

  • Identify SQL injection and command injection patterns in code snippets with high accuracy
  • Spot obvious memory safety issues (unchecked malloc return values, off-by-one conditions in buffer copies) when the vulnerable code is in context
  • Generate working proof-of-concept exploits for well-documented vulnerability classes when given a description and a code sample

What they struggle with is anything requiring multi-hop reasoning across large contexts, novel exploitation of logic errors, or understanding stateful systems where the vulnerability requires a specific sequence of operations to trigger.

The tooling around these models matters as much as the models themselves. A 7B parameter model augmented with a static analyzer, a symbolic executor, or a fuzzer is a fundamentally different thing than the same model in isolation. Tools like Semgrep and CodeQL have been doing pattern-based vulnerability detection for years; coupling them with an LLM that can triage findings, generate reproduction cases, and explain root causes creates a system whose combined capability exceeds what either component does alone.
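A minimal sketch of that coupling might look like the following. Everything named here is an assumption made for illustration: the `Finding` record is a hypothetical shape rather than the real output schema of Semgrep or CodeQL, and `ask_model` stands in for whatever function sends a prompt to your model and returns its text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    """Hypothetical record for one static-analysis hit (not a real tool schema)."""
    rule_id: str
    path: str
    line: int
    snippet: str


def build_prompt(finding: Finding) -> str:
    """Turn one finding into a narrow yes/no triage question."""
    return (
        "You are triaging a static analysis finding.\n"
        f"Rule: {finding.rule_id}\n"
        f"Location: {finding.path}:{finding.line}\n"
        f"Code:\n{finding.snippet}\n"
        "Answer with exactly one word: TRUE_POSITIVE or FALSE_POSITIVE."
    )


def triage(
    findings: Iterable[Finding],
    ask_model: Callable[[str], str],
) -> list[Finding]:
    """Keep every finding the model does not explicitly dismiss."""
    kept = []
    for finding in findings:
        answer = ask_model(build_prompt(finding)).strip().upper()
        # Fail closed: an unclear answer stays in the queue for a human.
        if answer != "FALSE_POSITIVE":
            kept.append(finding)
    return kept


# Usage with a stub in place of a real model call.
sample = [
    Finding("sql-injection", "app/db.py", 42,
            'cursor.execute("SELECT * FROM t WHERE id=" + uid)'),
    Finding("sql-injection", "tests/test_db.py", 7,
            'cursor.execute("SELECT 1")'),
]


def stub_model(prompt: str) -> str:
    return "FALSE_POSITIVE" if "tests/" in prompt else "TRUE_POSITIVE"


print([f.path for f in triage(sample, stub_model)])
```

The design point is that the analyzer does the finding and the model only answers a constrained question about each hit, which is a much easier job than discovering the bug unaided.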

This is part of why the small-model finding is not as surprising in retrospect as it sounds. If Mythos’s architecture used a large model to drive a tool-augmented pipeline, and if a significant fraction of its findings came from the tool layer rather than pure LLM reasoning, then substituting a smaller model in the driver’s seat loses less than you might expect.

The Democratization Implication

This is where the finding gets uncomfortable in a practical sense.

Running a frontier model against a target system at scale is expensive. API costs alone make it impractical for most individual actors. A well-resourced security team or a well-funded threat actor can afford it; most cannot.

Small models change this arithmetic. A fine-tuned Mistral 7B running locally on consumer hardware has near-zero marginal cost per query. It can run continuously, against large codebases, with no rate limits and no API bill. If it can find a meaningful fraction of what a frontier model finds, the economic barrier to AI-assisted vulnerability research drops dramatically.
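The arithmetic can be sketched in a few lines. Every number below is a placeholder chosen to make the calculation easy to follow. None of them is a quote of any provider's pricing or a measurement of any real workload, and the local figure covers electricity only, ignoring the cost of the hardware itself.

```python
def api_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of pushing `tokens` through a metered API."""
    return tokens / 1_000_000 * usd_per_million_tokens


def local_power_cost(hours: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of running a local machine for `hours`."""
    return hours * watts / 1000 * usd_per_kwh


# Placeholder scenario (all values are assumptions):
# scanning 2 billion tokens of code and tool output.
metered = api_cost(tokens=2_000_000_000, usd_per_million_tokens=10.0)

# Placeholder scenario: the same job taking 100 hours on a 400 W machine
# at 0.20 USD per kWh.
local = local_power_cost(hours=100, watts=400, usd_per_kwh=0.20)

print(f"metered API: ${metered:,.2f}")
print(f"local power: ${local:,.2f}")
```

With these placeholder inputs the metered run costs thousands of dollars while the local run costs a few dollars of electricity. The exact ratio depends entirely on the inputs, so plug in your own before drawing conclusions.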

Defenders benefit from this too. A security team that could not afford to run GPT-4 against every pull request can potentially run a small model on every commit. The latency is lower, the cost is lower, and the integration story is simpler. Projects like Ollama and LM Studio have made local model deployment a Saturday afternoon project rather than an infrastructure initiative.
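As a rough sketch of what "a small model on every commit" could look like, the snippet below sends the latest commit's diff to a locally running Ollama server through its `/api/generate` endpoint. The model name `mistral`, the prompt wording, and the helper names are assumptions made for illustration. It also assumes Ollama is listening on its default port and that the model has already been pulled.

```python
import json
import subprocess
import urllib.request

# Ollama's default local endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(diff: str, model: str = "mistral") -> dict:
    """Build the JSON body for a single non-streaming generate request."""
    prompt = (
        "Review this diff for security issues such as injection, "
        "unsafe deserialization, or missing bounds checks. "
        "Reply NONE if you see nothing.\n\n" + diff
    )
    return {"model": model, "prompt": prompt, "stream": False}


def latest_commit_diff() -> str:
    """Return the diff between the previous commit and HEAD."""
    result = subprocess.run(
        ["git", "diff", "HEAD~1", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def review(diff: str) -> str:
    """Send the diff to the local model and return its text reply."""
    body = json.dumps(build_payload(diff)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Calling `review(latest_commit_diff())` from a post-commit hook or a CI step would be one way to wire this in. Large diffs may exceed a small model's context window, so a real setup would probably need to split the diff per file.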

But the threat model shifts in the same direction. Script kiddies with a GPU and a fine-tuned model are a different problem than script kiddies without one.

Where the Parity Does Not Hold

The danger of the “small models can do it too” framing is that it flattens a distinction that matters.

The UIUC research that established GPT-4’s capability on CVEs was explicit about this: smaller models did not just perform worse, they failed at a qualitatively different set of tasks. They could not execute multi-step exploitation chains. They lost track of state across long tool-use sequences. They hallucinated about system behavior in ways that broke their exploit attempts.

For the hardest security problems, the capability gap is real and it is not closing as fast as the headline benchmarks suggest. Finding a use-after-free in a 500-line C file is a different task than finding a logic vulnerability in a distributed system’s consensus protocol. Frontier models are better at both; the gap is larger for the second.

This is the jagged frontier in practice: the easy-to-medium vulnerabilities are increasingly accessible to small models, and the hard ones remain out of reach for everyone except the most capable systems and the most skilled humans.

What Researchers Should Do With This

A few things follow from this pattern.

First, the “we need frontier models for security work” assumption should be challenged on a task-by-task basis rather than accepted as a blanket rule. For many practical security engineering tasks, a small local model is sufficient and has better operational properties.

Second, the research community needs better benchmarks that distinguish between vulnerability classes by difficulty and reasoning depth required. Current benchmarks often aggregate these, which makes it hard to see where the capability cliff actually is.
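A small sketch shows why aggregation hides the cliff. The results below are synthetic, invented purely to illustrate the reporting problem, and the category labels simply mirror the three kinds of task described earlier in this piece.

```python
from collections import defaultdict


def success_rates(results):
    """Map each category to its fraction of solved tasks."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        solved[category] += int(ok)
    return {category: solved[category] / totals[category] for category in totals}


def aggregate_rate(results):
    """Single headline number across all tasks, ignoring category."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Synthetic results: (category, solved) pairs, ten tasks per category.
toy_results = (
    [("pattern", True)] * 9 + [("pattern", False)] * 1
    + [("dataflow", True)] * 5 + [("dataflow", False)] * 5
    + [("novel", True)] * 1 + [("novel", False)] * 9
)

print("aggregate:", aggregate_rate(toy_results))
for category, rate in success_rates(toy_results).items():
    print(category, rate)
```

In this made-up data the headline score of 50% describes none of the three categories well: the model is strong on one, middling on another, and nearly useless on the third. Per-category reporting is what makes that visible.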

Third, the tooling integration story is probably more important than model selection for most practitioners. A small model with a well-designed tool harness is likely to outperform a frontier model with no tool access on most real security workflows.

The jagged frontier framing is useful here because it counsels against both excessive optimism and excessive dismissal. AI systems are genuinely capable at a meaningful subset of security tasks, the subset is larger than most people assumed a year ago, and small models cover more of it than expected. The hard problems at the frontier of offensive security research are still hard, and probably will be for a while.

What has changed is the floor.
