
Gating Security AI: Why Project Glasswing Is the Right Kind of Restriction

Source: simonwillison

The security research community has always operated in a gray zone. Understanding how attacks work is a prerequisite for building defenses, but that same knowledge is exactly what an attacker needs. This tension predates AI by decades. It explains why the community developed responsible disclosure norms, why tools like Metasploit exist as dual-use frameworks distributed under open-source licenses, and why academic security conferences run review processes for particularly sensitive findings. Now that same tension has arrived squarely at AI’s doorstep, and Anthropic’s Project Glasswing, which restricts a model called Claude Mythos to vetted security researchers, is one of the first serious structural responses to it.

Simon Willison noted that this kind of restriction sounds necessary to him. I agree, and I think it’s worth spelling out exactly why, because the reasoning matters as much as the conclusion.

What a Security-Research Model Actually Needs

A general-purpose AI assistant is trained and prompted to decline helping with things like writing shellcode, explaining specific exploitation techniques for unpatched vulnerabilities, or analyzing malware samples for functional components. Those refusals make sense for a model deployed broadly. The marginal attacker who gets a little help from Claude is a real risk worth constraining.

But a security researcher working on a penetration test, a malware analyst at a threat intelligence firm, a red team operator doing authorized adversarial simulation, or a vulnerability researcher preparing a CVE writeup needs a model that engages with exactly those topics. The refusal that protects against misuse in one context actively obstructs legitimate work in another.

This is not a hypothetical gap. Researchers at firms like Mandiant, academic groups studying ransomware ecosystems, and government-affiliated security teams have real workflows where AI assistance would be valuable: summarizing disassembled binaries, reasoning about exploitation primitive chains, drafting technical advisories, or helping understand obfuscated malware. A model trained to refuse those tasks is not a safety win in their hands; it is just a useless tool that gets circumvented.

Claude Mythos, from what Anthropic has described around Project Glasswing, appears to be a model configured or fine-tuned to engage with those security-specific topics more fully. Restricting it to verified researchers is the mechanism that makes that permissiveness defensible.

The Uplift Problem

Anthropic’s Responsible Scaling Policy, first published in 2023 and revised since, introduced the concept of AI Safety Levels (ASL). One of the core questions the framework asks is whether a model provides meaningful “uplift” to actors seeking to cause mass harm. For cybersecurity, the question is whether a capable AI assistant meaningfully lowers the barrier for a moderately skilled attacker to conduct damaging intrusions or develop novel offensive capabilities.

The answer is not obvious. A lot of attack knowledge is already freely available. Exploit code circulates on GitHub. Vulnerability databases are public. Archives like Exploit-DB have existed for years. The marginal contribution of an AI that can fluently synthesize and apply that knowledge is real, but it is not the same as handing someone a bioweapons synthesis route where the limiting factor is genuinely information rather than materials and expertise.

Still, the marginal uplift concern is legitimate at the capability frontier. A model that can reason coherently about novel vulnerability classes, chain exploitation primitives across a complex environment, or help automate parts of an intrusion that previously required senior operator skill is meaningfully different from a searchable database of known CVEs. Keeping that class of capability out of general availability is a reasonable precaution even if the knowledge components themselves are not secret.

Project Glasswing’s approach, gating Claude Mythos behind researcher verification, is a bet that the benefit to the defender community outweighs the risk of access being abused or leaked. That is a judgment call, but it is the right framing for making it.

How Security Knowledge Has Always Been Gated

The restriction model Anthropic is implementing has deep precedent in how the security community itself has managed sensitive information.

Coordinated vulnerability disclosure did not spring up as the default. The community spent years debating responsible disclosure norms before converging on a rough consensus: notify the vendor, give them a fixed window (commonly 90 days, the deadline Google Project Zero established as standard practice), then publish whether or not a patch has shipped. The goal was to pressure vendors to patch while giving defenders time to deploy fixes before full exploitation details became public.

Conference programs at DEF CON and Black Hat have review processes that sometimes ask presenters to delay certain details or coordinate with affected vendors before publishing. The Pwn2Own competition requires participants to hand over full exploit details to the organizing body before vulnerabilities are disclosed publicly.

This is not censorship. It is a recognition that the same information has different risk profiles depending on who has it and when. Claude Mythos is essentially the AI equivalent of a researcher-access program: the capability exists and Anthropic is not pretending otherwise, but access is conditioned on accountability.

The Verification Problem

The hardest part of any restricted-access program is defining and enforcing who qualifies. “Security researcher” is not a licensed profession. There is no equivalent of a medical board or bar exam. The term covers everyone from staff engineers at major security firms to independent bug bounty hunters to graduate students with a GitHub repository and a Hack The Box account.

Anthropic’s verification process for Project Glasswing matters a lot here, and the details will determine whether the restriction is meaningful or mostly theater. A few patterns from analogous programs are worth noting:

  • Institutional affiliation is the coarsest filter. Employees of known security firms or academic researchers at accredited institutions are easy to verify but exclude the large population of independent researchers who often do significant work.
  • Track record is a stronger signal. Prior CVE disclosures, published research, bug bounty payouts, or conference talks demonstrate genuine engagement with the field. This is closer to how programs like Apple’s Security Research Device Program operate.
  • Agreement to terms creates legal accountability. Most restricted security programs require participants to agree that they will not use access for offensive operations against unauthorized targets. This does not prevent misuse, but it creates consequences.

A combination of those filters, with a preference for track record over affiliation alone, would be the right approach. The goal is not to keep the tool away from all but a credentialed elite; it is to ensure that access comes with accountability and that the population using Claude Mythos has demonstrated they understand and respect the norms of the field.
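To make the combination concrete, here is a minimal sketch of how the three filters might be composed. Nothing in it reflects Anthropic’s actual process: the `Applicant` fields, the `track_record_score` helper, and the thresholds are all hypothetical, chosen only to illustrate "terms are mandatory, and track record counts for more than affiliation alone."

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    """Hypothetical applicant record; every field is an illustrative assumption."""
    verified_affiliation: bool = False  # employer or institution confirmed
    cve_disclosures: int = 0            # prior credited CVEs
    published_research: int = 0         # papers or conference talks
    bounty_payouts: int = 0             # paid bug bounty reports
    accepted_terms: bool = False        # agreed to the usage terms


def track_record_score(a: Applicant) -> int:
    """Count the distinct kinds of track-record evidence (0 to 3)."""
    return sum([
        a.cve_disclosures > 0,
        a.published_research > 0,
        a.bounty_payouts > 0,
    ])


def eligible(a: Applicant) -> bool:
    """Assumed policy: terms are required, and affiliation never suffices alone."""
    if not a.accepted_terms:
        return False
    score = track_record_score(a)
    if score >= 2:
        # Independent researchers qualify on track record by itself.
        return True
    # Affiliation only helps alongside at least one piece of track record.
    return a.verified_affiliation and score >= 1
```

Under these assumed thresholds, an independent bug bounty hunter with credited CVEs and paid reports qualifies, while an employee of a known firm with no public work does not. A real program would weigh evidence with far more nuance, and with human review.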

The Precedent This Sets

Project Glasswing is notable not just for what it does but for what it normalizes. The implicit assumption in most AI deployment has been binary: a model either has certain capabilities available broadly or does not have them at all. Tiered access, where a model can engage with sensitive topics for some users and not others based on verified context, is a meaningfully different architecture.
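One way to picture the difference is as a lookup from topic to the minimum access tier required, rather than a single on/off switch. The sketch below is purely illustrative: the `Tier` names, the topic labels, and the `may_engage` helper are assumptions, not a description of how any deployed system is built.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical access tiers, ordered from least to most trusted."""
    GENERAL = 0
    VERIFIED_RESEARCHER = 1


# Hypothetical mapping from topic to the minimum tier allowed to engage with it.
REQUIRED_TIER = {
    "general_security_concepts": Tier.GENERAL,
    "exploit_development": Tier.VERIFIED_RESEARCHER,
    "malware_analysis": Tier.VERIFIED_RESEARCHER,
}


def may_engage(user_tier: Tier, topic: str) -> bool:
    """Allow a topic only if the user's tier meets the topic's minimum."""
    # Unknown topics fall back to the most restrictive tier.
    required = REQUIRED_TIER.get(topic, max(Tier))
    return user_tier >= required
```

The binary model is the special case where every topic maps to the same tier. Adding a new tier or a new domain changes the table, not the decision logic.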

This has implications for other sensitive domains. Medical professionals who need a model that can engage in detail with drug interactions, off-label prescribing, or clinical edge cases have the same basic problem as security researchers. Legal professionals, forensic investigators, and researchers studying extremist content all have legitimate needs for AI assistance that would be inappropriate to offer broadly.

The infrastructure that makes Project Glasswing work (the verification process, the access control layer, the terms governing use) is reusable. If Anthropic has built it well, it becomes a template for other restricted-access programs across other domains. That would be a more significant contribution than any individual model capability.

Where the Risk Lives

None of this means the program is risk-free. The realistic failure modes are worth naming.

Verified researchers can have their credentials compromised. An access token tied to a legitimate researcher account can be exfiltrated and used by someone else. This is the same threat model as any credentialed API, and the mitigations are the same: rate limiting, anomaly detection on usage patterns, and revocation infrastructure.
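Those three mitigations fit together in a small amount of code. The following is a minimal sketch, assuming an in-memory, single-process service: `TokenGuard` and its parameters are hypothetical, and "repeated rate-limit violations" is only a crude stand-in for real anomaly detection on usage patterns.

```python
import time
from collections import deque


class TokenGuard:
    """Hypothetical per-token guard: sliding-window rate limit plus revocation."""

    def __init__(self, max_requests, window_seconds, max_violations):
        self.max_requests = max_requests      # requests allowed per window
        self.window_seconds = window_seconds  # sliding window length
        self.max_violations = max_violations  # denials tolerated before revoking
        self._history = {}     # token -> deque of accepted request timestamps
        self._violations = {}  # token -> count of rate-limit denials
        self._revoked = set()

    def revoke(self, token):
        """Permanently block a token, e.g. after suspected compromise."""
        self._revoked.add(token)

    def allow(self, token, now=None):
        """Return True if the request may proceed, False otherwise."""
        if token in self._revoked:
            return False
        now = time.monotonic() if now is None else now
        window = self._history.setdefault(token, deque())
        # Discard timestamps that have aged out of the sliding window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) < self.max_requests:
            window.append(now)
            return True
        # Over the limit: record the violation and revoke on repeated abuse.
        self._violations[token] = self._violations.get(token, 0) + 1
        if self._violations[token] >= self.max_violations:
            self.revoke(token)
        return False
```

A production system would persist this state, look at richer signals (new networks, unusual hours, shifts in query content), and route revocations through human review. The point is only that revocation has to be a first-class operation, not an afterthought.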

Verified researchers can also act in bad faith. Someone can legitimately qualify for access and then use it outside the intended scope. The terms of service create legal exposure, but legal exposure is a deterrent, not a prevention.

The model itself can be leaked. If Claude Mythos is deployed widely enough, the weights or a fine-tuned version could eventually appear outside Anthropic’s control. This is a longer-term risk for any deployed model, and it is one reason to see the restricted-access approach as buying time and accountability rather than permanent containment.

These are not arguments against Project Glasswing. They are arguments for building it carefully, auditing it regularly, and treating it as one layer of a defense-in-depth approach rather than a complete solution.

The security community built its current norms over decades of debate, incidents, and iteration. Anthropic is making a reasonable structural bet here, and the instinct to gate rather than refuse outright is the right one. The alternative, a model that refuses to engage with security topics at all, does not make the world safer. It just means defenders use less capable tools while attackers find other paths.
