
Tiered Access for Dangerous AI Capabilities Is the Right Call

Source: simonwillison

Anthropic quietly announced Project Glasswing, a program that restricts access to Claude Mythos (a version of Claude with capabilities specifically oriented toward security research) to a small group of vetted practitioners. Simon Willison covered it with the understated verdict that it “sounds necessary,” and he is right. But the reasoning behind that verdict is worth unpacking, because this is a pattern the security industry has wrestled with for decades, and AI is just the latest surface where the same tensions appear.

The Dual-Use Problem Has Always Been Hard

Security tooling has never been purely defensive. Metasploit is the canonical example: an open-source framework that makes writing and chaining exploits tractable, used constantly by both penetration testers and attackers. The Metasploit Project made a principled choice to publish it openly anyway, on the theory that defenders benefit more from shared knowledge than attackers do from exclusive access. That reasoning has held up reasonably well for exploit frameworks, where the barrier to exploitation without the tool is still high.

Cobalt Strike took a different path. Fortra (formerly HelpSystems) licenses it only to vetted organizations and charges accordingly. The goal was to prevent commodity use by less sophisticated threat actors. That goal has failed spectacularly: cracked versions of Cobalt Strike are ubiquitous in ransomware campaigns, and the licensing model mainly succeeded at making defenders pay for the same tool attackers get for free. The lesson from Cobalt Strike is not that access control is bad, but that access control without a durable technical enforcement mechanism is essentially theater. A license check inside a binary that ships to the customer can be patched out, and once it is, the vetting process no longer constrains anyone.

AI models introduce a different dynamic. A language model is not a static binary you can crack and redistribute. The capability lives in the weights and in the inference infrastructure. Access control can be meaningfully enforced at the API level in a way that was never possible with a downloadable executable. This changes the calculus considerably.
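To make the contrast with a crackable binary concrete, here is a minimal sketch of what server-side gating could look like. Everything in it is an assumption made for illustration: the `Tier` levels, the `ApiKey` record, the model names, and the `authorize` function are hypothetical and do not describe Anthropic's actual API or infrastructure. The point is only that the check runs on the provider's side, where the caller cannot patch it out.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical access tiers, ordered from least to most privileged."""
    GENERAL = 0
    VETTED_RESEARCHER = 1


@dataclass(frozen=True)
class ApiKey:
    """Hypothetical server-side record for an issued API key."""
    key_id: str
    tier: Tier
    revoked: bool = False


# Hypothetical model names mapped to the minimum tier allowed to call them.
MODEL_MIN_TIER = {
    "general-model": Tier.GENERAL,
    "restricted-security-model": Tier.VETTED_RESEARCHER,
}


def authorize(key: ApiKey, model: str) -> bool:
    """Return True only if the key is live and its tier covers the model."""
    if key.revoked:
        return False
    min_tier = MODEL_MIN_TIER.get(model)
    if min_tier is None:
        return False  # unknown models are denied by default
    return key.tier >= min_tier
```

Under these assumptions, a general-tier key calling `restricted-security-model` is simply denied, and revoking a key takes effect on the next request. There is no client-side artifact left to crack, which is the property the Cobalt Strike licensing model lacked.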

What Claude Mythos Presumably Does Differently

Standard Claude, like most production AI assistants, is trained to decline certain requests. Ask it to write a working exploit for a specific CVE, explain how to bypass a specific EDR product’s behavioral detection, or generate convincing spearphishing content, and it will refuse or produce something so hedged as to be useless. This is deliberate. For a general-purpose assistant deployed to millions of users, those guardrails are appropriate.

But legitimate security researchers have genuine professional need for exactly these capabilities. A red team operator who needs to simulate a realistic phishing campaign for a client engagement is not helped by a model that produces warnings about social engineering. A vulnerability researcher auditing firmware needs a model that will engage with low-level memory corruption patterns without constant interruption. The safety training that protects general users actively impedes professional security work.

The traditional response to this friction has been jailbreaking: users finding prompt formulations that get around refusals. This is worse than controlled access in every respect. It turns safety mechanisms into an obstacle course rather than a genuine control, it produces inconsistent results, and it gives Anthropic no visibility into how the capabilities are being used. A formal research program with a vetted cohort is a meaningfully better outcome for everyone.

What Vetting Actually Requires

This is where the details matter enormously, and where similar programs have struggled. Google’s Project Zero maintains a team of vetted researchers with access to pre-patch vulnerability details under strict disclosure timelines. The vetting there is essentially employment. MITRE’s CVE Program distributes CNA (CVE Numbering Authority) status to organizations that demonstrate responsible disclosure practices over time. Neither of these is easily portable to an AI access program.

For something like Project Glasswing, the meaningful questions are: who qualifies, how is that determination made, what ongoing monitoring exists, and what happens when someone misuses access. Security researchers are not a monolithic professional class. There is a spectrum from full-time employees at large security vendors with established reputations, to independent bug bounty hunters with no institutional affiliation, to researchers in jurisdictions where the legal status of offensive security work is ambiguous.

A program that restricts access to employees of named security firms is relatively simple to administer but excludes a significant portion of the independent research community that produces much of the important work. A program that tries to vet individuals based on demonstrated publication history and professional reputation is more inclusive but harder to operate at scale and more susceptible to credential misrepresentation.

The HackerOne and Bugcrowd platforms have developed reputation systems for this exact problem over the past decade. Researchers build track records through disclosed vulnerabilities, and those records are portable across programs. Anthropic would be reinventing a solved problem if it built its own vetting system from scratch rather than leveraging existing trust frameworks.

The Risk of Access Creep

Even well-designed restricted access programs leak over time. Credential sharing among trusted colleagues is routine in research communities where the formal vetting process is slow and researchers collaborate closely. A researcher who loses their job at a security firm may retain access credentials tied to their former employer. A vetted researcher may take on consulting work that crosses into territory the program did not anticipate.

The technical mitigation here is logging and anomaly detection. If Claude Mythos access is tied to API keys with per-request logging, Anthropic can detect usage patterns that diverge from expected research behavior: bulk generation of phishing content targeting specific organizations, systematic probing of exploit chains for a single target, requests that follow the pattern of active operational use rather than research and tool development. This is the kind of monitoring that makes restricted access meaningful rather than nominal.
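One of the signals described above, sustained focus on a single target, can be sketched in a few lines. This is an illustrative heuristic under stated assumptions, not a description of any monitoring Anthropic runs: the `RequestLog` fields, the `flag_keys` function, and both thresholds are hypothetical, and a real system would need far richer features and human review before acting on a flag.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    """Hypothetical per-request log entry tied to an API key."""
    key_id: str
    target: str  # organization or host named in the request, "" if none


def flag_keys(logs, min_requests=20, max_target_share=0.8):
    """Return key_ids whose traffic concentrates on one named target.

    A key is flagged when it has made at least `min_requests` requests and
    at least `max_target_share` of them name the same target. Both
    thresholds are arbitrary placeholders.
    """
    per_key_targets = defaultdict(Counter)
    totals = Counter()
    for entry in logs:
        totals[entry.key_id] += 1
        if entry.target:
            per_key_targets[entry.key_id][entry.target] += 1

    flagged = set()
    for key_id, targets in per_key_targets.items():
        total = totals[key_id]
        if total < min_requests:
            continue  # too little traffic to judge
        _, count = targets.most_common(1)[0]
        if count / total >= max_target_share:
            flagged.add(key_id)
    return flagged
```

A key whose requests are spread across many targets, or that has made only a handful of requests, passes this check. A key that sends dozens of requests about one organization does not, and that is the point at which a human reviewer would look at whether the work is an authorized engagement or something else.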

The legal and policy layer matters too. Clear terms of service with specific prohibited uses, combined with the ability to revoke access and pursue legal remedies, create a deterrent that pure technical controls cannot provide. Cobalt Strike’s failure was partly that its technical enforcement mechanism (license checks on a binary shipped to the customer) was breakable. Claude Mythos, as a hosted service, cannot be cracked and redistributed in the same way, although stolen or shared credentials remain a risk that monitoring and revocation have to cover.

Why Not Just Refuse the Capability Entirely

The argument for not building a security-oriented model variant at all is that it creates a target: a capability that threat actors will attempt to gain access to through social engineering, credential theft, or compromise of vetted researchers’ systems. This risk is real but not decisive.

Security research already happens. Researchers already build and use specialized tools. If Claude Mythos did not exist, researchers would use less capable general models with jailbreaks, use locally run open-weight models with no safety training, or simply not use AI for these tasks and do the work more slowly. None of these alternatives is better from a safety perspective. The capability exists in the ecosystem regardless; the question is whether it exists in a form that Anthropic can observe, audit, and control.

There is also a defensive argument. Understanding how AI can be used offensively is prerequisite to building defenses against AI-assisted attacks. Anthropic employing or partnering with researchers who stress-test these capabilities in controlled settings is how the industry gets ahead of threat actors who are already experimenting with the same techniques without any oversight.

The Broader Signal

Project Glasswing is a concrete expression of a position that has been mostly theoretical in AI safety discussions: that different user populations warrant different capability profiles, and that meaningful access control on those profiles is technically feasible. This is not a novel idea. It is how every other dual-use technology domain operates. The novelty is that an AI company is doing it with intentionality and structure rather than either blanket restriction or blanket permission.

Anthropic’s Responsible Scaling Policy established the conceptual framework for capability thresholds and access tiers. Project Glasswing is what that framework looks like when instantiated against a specific use case with real operational constraints. The execution details will determine whether it works. The principle is sound.
