Why Anthropic's Decision to Gate Claude Mythos Makes Sense, and What the Hard Part Is
Source: simonwillison
The dual-use problem in security tooling is not new, but AI systems make it considerably more acute. When Anthropic launched what Simon Willison describes as Project Glasswing, restricting a version of Claude called Mythos to vetted security researchers, it put a modern frame on a question the security community has been answering imperfectly for decades: how do you give defenders access to powerful tools without handing identical capabilities to attackers?
The Dual-Use Problem Has Always Been Here
Security research requires capabilities that look, from the outside, nearly identical to attack preparation. Writing a fuzzer for a production service requires understanding exactly how the service can be made to fail. Reverse engineering malware requires knowing how malware behaves at a binary level. Analyzing a novel exploit requires reproducing the exploit conditions in a controlled environment. These are the jobs of penetration testers, malware analysts, and vulnerability researchers, and they have always needed tooling that cannot be easily sanitized.
The industry developed several responses to this. Metasploit Pro gates certain automation, post-exploitation chaining, and reporting features behind a license that implies legitimate enterprise use. Shodan restricts its more aggressive scanning APIs to paid accounts with some accountability attached. Bug bounty platforms like HackerOne and Bugcrowd build vetting into their model by making researchers accountable to specific programs under specific scoped terms.
None of these are perfect. A Metasploit Pro license does not prevent misuse; it creates friction and accountability. Shodan’s paid tier limits throughput, not capability. The bug bounty model works because scope is explicitly defined, liability is contractually allocated, and the programs themselves have organizational reputation at stake.
AI language models do not fit neatly into any of these frameworks.
What Makes a Model Like Mythos Different
A tool like Metasploit is a finite artifact. You can enumerate its modules, read its source code, and understand its capabilities precisely. An AI model is different: its behavior emerges from interaction, and its capabilities extend across nearly any domain. Ask Claude to explain heap spray techniques, analyze a use-after-free vulnerability class, or reason about a stripped binary, and the same model that answers a security researcher’s legitimate question will answer a malicious actor’s question in essentially the same way.
The standard response to this has been either blanket restriction, avoiding offensive technique discussion entirely, or broad permissiveness backed by content policies. Blanket restriction cripples legitimate security research and pushes it toward less safe alternatives. Broad permissiveness with content policies relies on pattern-matching that determined users can circumvent.
Restricting Claude Mythos to security researchers is an attempt at a third path: selective availability based on identity verification. The capability exists, but access is gated at the identity layer rather than excised from the model entirely.
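To make the distinction concrete, here is a minimal sketch of gating at the identity layer. Everything in it is hypothetical: the tier names, the `Caller` type, and the `select_tier` function are illustrations, not anything Anthropic has published. The point is only that the decision depends on who is asking, not on what the prompt says.

```python
from dataclasses import dataclass

# Hypothetical tier names; nothing here reflects Anthropic's actual API.
STANDARD_TIER = "standard"
RESTRICTED_TIER = "restricted-security"


@dataclass(frozen=True)
class Caller:
    account_id: str
    verified_researcher: bool  # assumed to be set by a separate vetting process


def select_tier(caller: Caller, requested_tier: str) -> str:
    """Return the tier this caller is allowed to use.

    The restricted capability still exists. The check looks only at the
    caller's verified status, never at the content of the request.
    """
    if requested_tier == RESTRICTED_TIER and not caller.verified_researcher:
        return STANDARD_TIER  # fall back to the general model
    return requested_tier
```

Under this arrangement an unverified caller asking for the restricted tier is simply served the standard one, which is also why the guarantee is only as strong as the process that sets `verified_researcher`.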
The Verification Problem
Restricting access to security researchers requires being able to identify security researchers. That is harder than it sounds.
Employment at a security firm is a starting point, but self-taught independent researchers have found some of the most significant vulnerabilities in recent history. A record of published CVEs through a numbering authority like MITRE would be a reasonable signal, but new researchers have to start somewhere. Certifications carry weight selectively: the Offensive Security Certified Professional (OSCP) exam requires demonstrating live exploitation rather than just theoretical knowledge, which gives it more signal value than most multiple-choice certifications, but many legitimate researchers hold no formal certification at all.
Established programs have developed heuristics. HackerOne has a reputation system that tracks disclosed vulnerability impact over time. CERT/CC and institutional vulnerability researchers typically have long-standing relationships with vendors and the broader coordination community. Academic researchers have institutional affiliation as a signal, with its own set of limitations.
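One way to read these heuristics is that no single signal should be mandatory. The sketch below is an illustration under that assumption, not a description of any real vetting process: the field names and the reputation threshold are invented, and a real program would weigh evidence far more carefully.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # All fields are hypothetical stand-ins for the signals discussed above.
    security_employer: bool = False
    published_cves: int = 0
    holds_oscp: bool = False
    bounty_reputation: int = 0
    academic_affiliation: bool = False


MIN_BOUNTY_REPUTATION = 100  # assumed threshold, chosen only for illustration


def matched_signals(s: Signals) -> list[str]:
    """List which independent signals this applicant satisfies."""
    matched = []
    if s.security_employer:
        matched.append("employer")
    if s.published_cves > 0:
        matched.append("cve_record")
    if s.holds_oscp:
        matched.append("certification")
    if s.bounty_reputation >= MIN_BOUNTY_REPUTATION:
        matched.append("bounty_reputation")
    if s.academic_affiliation:
        matched.append("academic")
    return matched


def needs_manual_review(s: Signals) -> bool:
    """Applicants with no matching signal go to a human, not to rejection."""
    return not matched_signals(s)
```

An independent researcher with a CVE record and no employer or certification still matches here, while a newcomer with no track record is routed to manual review rather than turned away automatically.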
Anthropic will presumably build some version of this verification stack, either independently or by leveraging existing frameworks. The difficulty is that verification is a starting state, not an ongoing guarantee. A researcher who passes vetting at enrollment can have their circumstances change. And the vetting process itself becomes a target: if access to Mythos is valuable, bad actors have an incentive to obtain that access fraudulently or to social-engineer it from legitimate holders.
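A common way to handle the "starting state" problem is to treat verification as something that expires and can be revoked. The following is a rough sketch of that idea only. The one-year lifetime, the `Enrollment` type, and the `has_access` function are assumptions made for illustration, and nothing public says Project Glasswing works this way.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed policy: a vetting decision is trusted for one year, then re-checked.
VERIFICATION_TTL = timedelta(days=365)


@dataclass
class Enrollment:
    researcher_id: str
    verified_on: date
    revoked: bool = False  # set when misuse is reported or credentials leak


def has_access(enrollment: Enrollment, today: date) -> bool:
    """Access holds only while the verification is unrevoked and still fresh."""
    if enrollment.revoked:
        return False
    return today - enrollment.verified_on <= VERIFICATION_TTL
```

Expiry does not stop a vetted account from being misused inside the window, but it bounds how long a stale or fraudulently obtained verification remains useful.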
Why Blanket Restriction Was Never the Answer
The alternative, simply not providing a more capable security-oriented model, carries costs that are easy to understate.
Defenders are structurally outgunned. Modern malware campaigns use sophisticated evasion techniques, polymorphic shellcode, and increasingly AI-assisted variant generation. Vulnerability researchers working against these threats need tools that can reason across large codebases and synthesize information from multiple sources simultaneously. Suppose Anthropic’s competitors provide equivalent capabilities without restrictions, or the open-source model community closes the capability gap as it is doing across many domains. In either case, the practical effect of withholding a model like Mythos is to redirect legitimate researchers elsewhere while doing nothing to limit bad actors, who will use whatever tool works.
The security tooling community understood this logic well before AI models entered the picture. The Metasploit Framework is open source for this reason. Ghidra, the NSA-developed reverse engineering tool released publicly in 2019, is freely available. The argument then and now: defenders need these tools, making them freely available levels the playing field, and the offense was already using equivalent tooling anyway.
Claude Mythos presumably offers something qualitatively different from what those open tools provide: natural language reasoning over large contexts, synthesis across documentation and code simultaneously, and the ability to generate proof-of-concept code while explaining vulnerability mechanics in plain language. That capability profile is harder to replicate from first principles and is plausibly worth protecting with access controls.
What Capabilities Are Being Gated
This is where more specificity from Anthropic would be useful. “Restricted to security researchers” describes access policy. It does not describe what the model does differently.
There are several architecturally distinct possibilities. Mythos could be a fine-tuned model trained on security research corpora, with adjusted outputs that make it more fluent in vulnerability analysis and exploit development. It could be a standard Claude model served with a permissive system prompt that removes guardrails around specific vulnerability classes. It could have integrated tool access to sandboxed execution environments, disassemblers, or the National Vulnerability Database.
The distinction matters because the failure modes differ substantially. A fine-tuned model with genuinely different capability requires protecting the weights; if those weights leak, the restriction collapses. A permissive system prompt arrangement requires protecting the API credentials and the prompt itself, which is harder to guarantee at scale. System prompt leakage through prompt injection is a well-documented attack class, and shared API keys have a poor security track record in enterprise environments. If the gating is entirely at the access control layer rather than the capability layer, the security guarantee is only as strong as the access control implementation.
There is also a question of scope creep over time. Access programs that begin with tight vetting tend to broaden as the administrative cost of maintaining narrow definitions becomes apparent. The history of vulnerability disclosure programs shows this pattern clearly: what starts as coordinated disclosure with careful vetting often evolves into broader researcher communities with lighter-touch verification as programs scale.
Where This Lands
Willison’s read that this approach sounds necessary is the right one. Security research needs these capabilities. Blanket restriction pushes legitimate work toward less accountable alternatives. A tiered access model, even with its verification challenges and failure modes, is a better starting point than either extreme.
What will determine whether Project Glasswing holds up is the quality of the implementation details that are not yet public. Specifically: how rigorous is the verification process, and does it accommodate independent researchers rather than only credentialed institutional ones? How clearly is the scope of Mythos’s additional capabilities defined, so that users and Anthropic alike can be held to what was actually promised? And is there a misuse reporting path that the security community can engage with meaningfully, rather than a black box that generates access revocations with no transparency?
The security community has built trust infrastructure around dangerous information sharing before. Full disclosure debates in the 1990s and early 2000s produced coordinated disclosure norms that most of the industry now operates under. Bug bounty economics went from controversial to standard practice in under a decade. If Anthropic is genuinely engaging the security community on how these trust systems work, rather than building an access form and treating compliance as done, Project Glasswing could contribute usefully to how the industry handles AI capability restrictions more broadly. The approach is sound. The execution is what matters.