Restricted Access for Security-Capable AI Is the Right Default, Not a Compromise

Source: simonwillison

Simon Willison wrote briefly about Anthropic’s Project Glasswing, which restricts Claude Mythos, a security-focused model variant, to vetted security researchers. His reaction, that this sounds necessary, is the right one. But the interesting question isn’t whether to restrict this kind of model. It’s how to run a vetting program that actually means something.

The Dual-Use Problem Is Not New

Security tooling has navigated this tension for decades. Metasploit has been open source since its first release in 2003 and remains the canonical example of a tool that serves both penetration testers and attackers with equal efficiency. Cobalt Strike took a different approach: commercial licensing, vetting of buyers, and terms of service that technically prohibit malicious use. In practice, cracked copies of Cobalt Strike have shown up in ransomware operations, nation-state intrusions, and commodity malware campaigns for years. The vetting didn’t hold.

But that failure doesn’t mean vetting is pointless. It means vetting is one layer of a defense-in-depth strategy, not a complete solution. Cobalt Strike still matters to legitimate red teams. The fact that attackers also got access doesn’t retroactively erase the value it provides to defenders. The same logic applies to Claude Mythos.

What makes AI different from a tool like Cobalt Strike is the nature of the uplift. A framework like Cobalt Strike requires the attacker to already understand what they’re doing. The tool operationalizes existing knowledge; it doesn’t generate new capability from nothing. A sufficiently capable language model can do something closer to the latter. Someone with a partial understanding of a vulnerability class can potentially bridge that gap with a well-prompted model in a way they couldn’t with a traditional toolkit. That’s the specific risk Anthropic is trying to address.

Where Anthropic’s Safety Policy Framework Fits

Anthropic’s Responsible Scaling Policy defines a set of AI Safety Levels, modeled loosely on biosafety levels, that describe thresholds of capability and the deployment constraints that come with them. ASL-2 covers current production Claude models, which already have mitigations against providing serious uplift in cyberattack planning. ASL-3 would apply to models that could provide “meaningful uplift” to attacks on critical infrastructure or could assist a sophisticated actor in ways that substantially increase harm.

A model specifically trained or fine-tuned for security research, one that understands exploit development, vulnerability analysis, and offensive tooling at a level useful to defenders, almost by definition approaches ASL-3 territory in the cyber domain. The responsible response isn’t to refrain from building it. Defenders need these tools more than anyone, and falling behind on AI-assisted security research while attackers experiment freely with whatever models they can access would be a worse outcome. The responsible response is to build it with access controls that match the risk profile.

Project Glasswing sounds like Anthropic’s operationalization of exactly that logic.

What Makes a Vetting Process Worth Anything

The weakness in every gated-access security program is the vetting layer. Coordinated vulnerability disclosure (CVD) programs, commercial security tool licenses, government contractor clearances: all of these involve some process of deciding who counts as a legitimate actor. All of them can be gamed, and all of them leak.

For Project Glasswing to be meaningful rather than performative, the vetting process needs a few properties that are harder to achieve than they sound.

First, the criteria need to be specific enough to actually screen for risk. “Security researcher” is a broad category. A vulnerability researcher at a major enterprise security firm, an independent bug bounty hunter, a graduate student studying offensive security, and someone who calls themselves a security researcher on a LinkedIn profile are not equivalent. The access controls need to be calibrated to the actual risk model, not just to a job title.

Second, the terms of access need to be auditable after the fact. Issuing access and then having no visibility into how the model is being used means the vetting was a one-time gate with nothing behind it. API access logging, usage pattern monitoring, and clear terms that allow Anthropic to revoke access based on observed behavior are the difference between a program with teeth and one without.

Third, the vetting process needs to be fast enough that it doesn’t push researchers toward workarounds. If getting access to Claude Mythos requires a six-week approval process, researchers will build their workflows around whatever they can actually get, which may be a less safe alternative from a different provider with no vetting at all. The program competes with uncontrolled alternatives, so the friction cost matters.
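To make the first two properties concrete, here is a minimal sketch in Python of what risk-calibrated access plus after-the-fact auditing could look like. Everything in it is hypothetical: the `Tier` levels, the `AccessGate` class, and the rule of revoking access after three denied requests are assumptions chosen for illustration, not a description of how Project Glasswing or any Anthropic API actually works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical access tiers, ordered from least to most capability."""
    NONE = 0
    ANALYSIS = 1      # e.g. vulnerability triage and patch review
    EXPLOIT_DEV = 2   # e.g. proof-of-concept exploit development


@dataclass
class Researcher:
    researcher_id: str
    tier: Tier
    revoked: bool = False


@dataclass
class AccessGate:
    """Toy gate: per-request tier check, audit log, and revocation."""
    researchers: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)
    denial_limit: int = 3  # assumed threshold, purely illustrative

    def enroll(self, researcher_id: str, tier: Tier) -> None:
        self.researchers[researcher_id] = Researcher(researcher_id, tier)

    def authorize(self, researcher_id: str, required: Tier) -> bool:
        researcher = self.researchers.get(researcher_id)
        allowed = (
            researcher is not None
            and not researcher.revoked
            and researcher.tier >= required
        )
        # Every decision is recorded, so usage stays visible after the
        # initial vetting decision has been made.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "researcher_id": researcher_id,
            "required": required.name,
            "allowed": allowed,
        })
        # Revoke based on observed behavior: repeated out-of-tier requests.
        if researcher is not None and not allowed:
            denials = sum(
                1 for entry in self.audit_log
                if entry["researcher_id"] == researcher_id
                and not entry["allowed"]
            )
            if denials >= self.denial_limit:
                researcher.revoked = True
        return allowed
```

The point of the sketch is the shape, not the policy: the tier check runs on every request rather than once at enrollment, and the log that feeds revocation is the same record an auditor would read later. A real program would need far richer signals than a denial counter.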

The Counterfactual That Matters

The strongest objection to restricted-access programs is that they don’t work. Models get leaked, jailbroken, or replicated. Competitors build equivalent capabilities without the access controls. The restrictions inconvenience legitimate researchers more than they impede bad actors, who are more motivated to circumvent them.

There’s real force to this argument, but it proves too much. By that logic, no access control on any dual-use tool is worth implementing, and that’s clearly not right. The point is not to achieve perfect prevention. It’s to raise the cost and complexity of misuse while keeping the cost to legitimate users low. A program that successfully screens out casual misuse and adds friction for serious misuse is valuable even if it doesn’t stop determined state-level actors.

Anthropic also has a specific advantage that pure software tool vendors don’t: the model runs through their API. They don’t have to prevent someone from running a copy of Claude Mythos on their own hardware, because there is no local copy to run. The vetting layer and the usage monitoring layer are the same infrastructure. That’s meaningfully different from trying to prevent a cracked copy of Cobalt Strike from being distributed on forums.

What I’d Want to See

From the perspective of someone who builds software that interacts with AI APIs and cares about both security and safety, what would make Project Glasswing credible isn’t the existence of the vetting process. It’s transparency about how that process works.

Publishing the criteria for access, the obligations researchers accept, and the general shape of what kinds of use are and aren’t permitted would let the security community evaluate whether the program is serious. It would also create a public record that Anthropic has to live up to, which is a meaningful accountability mechanism.

The security research community has hard-won experience with institutional gatekeeping that protects institutions more than it protects researchers or the public. If Anthropic wants Project Glasswing to be a genuine contribution to defensive security rather than a liability management exercise, the proof will be in how the program handles edge cases: the independent researcher without institutional affiliation, the international applicant, the person whose work lives in a grey area between research and tooling.

Getting those cases right is the real test. The easy part was deciding that restriction was necessary.
