Tiered Access for Dangerous AI Capabilities Is the Right Call, and the Hard Part Is Making It Stick
Source: simonwillison
Anthropic has been quietly building something called Project Glasswing: a program that gates access to Claude Mythos, a variant of Claude tuned for security research, behind a vetting process for qualified practitioners. Simon Willison called it necessary, and I think that framing undersells what is interesting about the approach.
The policy logic is not novel. Tiered access to dangerous capabilities has been a recurring idea across security tooling, biosafety, and controlled substances for decades. What is worth examining is whether the underlying enforcement mechanism is structurally different this time, and whether that difference is durable.
What Claude Mythos Is and What It Unlocks
Standard Claude refuses to engage with a predictable cluster of security topics: working exploit code, EDR bypass techniques, shellcode generation, phishing payload construction, and detailed vulnerability analysis. For most users, those refusals are correct behavior. For a practicing penetration tester, a malware analyst, or someone writing a CVE disclosure, they are a daily obstacle that means defaulting to less capable, less safe tooling.
Claude Mythos, as described under Project Glasswing, is configured to serve that professional population. The capability delta between Mythos and standard Claude is not some extraordinary leap; it is the removal of blanket refusals in favor of context-aware responses appropriate for the field. A security researcher asking why a particular ROP chain works should get a technically complete answer. A standard user asking the same question without demonstrated professional context should not.
This is not a new problem in the security tools ecosystem. The industry has been arguing about it since at least 2004, and the historical track record should inform how seriously we take any given access-control scheme.
The Cobalt Strike Lesson
Cobalt Strike was built as a legitimate red team platform. The licensing model required purchase and implicitly screened for some level of professional context. It failed completely as an access-control mechanism. Cracked versions spread through criminal networks so effectively that by the early 2020s, Cobalt Strike beacons were showing up in the majority of ransomware campaigns. The tool became essentially public domain for anyone motivated to find it.
Metasploit took the opposite approach: open source everything on the theory that defenders benefit more from shared knowledge than attackers gain from exclusive access. That argument has merit for knowledge, but it does not cleanly transfer to capability. A framework that automates exploitation still provides uplift regardless of whether the underlying technique is documented somewhere.
Both approaches failed to maintain the distinction between defender access and attacker access because neither had a technical mechanism to enforce it. Distribution happened through files on disk. Once a file exists, distribution is just copying.
Why API Delivery Changes the Enforcement Problem
Claude Mythos is not a downloadable binary. It is an API endpoint. Anthropic can enforce access control technically in a way that was never possible with tools shipped as software. If a Glasswing account is found to be misused, it can be revoked. If credential sharing is detected, it can be flagged. The vetting process gates initial access, and the API layer maintains ongoing visibility.
This is not a complete solution. Determined adversaries can still misuse legitimate credentials, share sessions, or reconstruct capability through other means. But there is a structural difference between a cracked Cobalt Strike binary circulating on forums and an API that requires live authentication on every request. The marginal cost of obtaining and misusing the tool remains higher, permanently, not just at the point of initial distribution.
The DURC framework in biosafety, dual-use research of concern, is the closest structural analogy. Institutional biosafety committees and select agent programs use a combination of vetting, ongoing oversight, and physical access control to manage research that would be catastrophic if misused. The enforcement there is physical containment plus accountability structures. What Glasswing is building is the information-security equivalent: authentication as containment, plus an institutional accountability layer through the vetting process.
Anthropic’s Responsible Scaling Policy and Where Mythos Fits
Anthropics’s Responsible Scaling Policy introduced AI Safety Levels, a tiered framework modeled structurally on biosafety levels. ASL-2 is roughly current general-purpose models. ASL-3 is the threshold where a model could provide “meaningful uplift” to attacks on critical infrastructure or to actors seeking weapons with potential for mass casualties.
Claude Mythos almost certainly sits near or at the ASL-3 boundary in the cybersecurity domain. A model that can reason fluently about exploit chains, generate functional shellcode, and analyze defensive tooling crosses from “could help someone who already knows what they are doing” into “could meaningfully help someone who does not.” The latter is the definition of uplift that triggers additional controls under the RSP framework.
The Glasswing vetting process is, in practice, Anthropic’s implementation of the ASL-3 deployment condition for this specific capability slice. You do not release a model with meaningful uplift to the general public; you gate it behind controls that establish accountability and raise the cost of misuse.
The Harder Problem: Vetting at Scale
Tiered access programs succeed or fail on the quality of the vetting process. Institutional biosafety review works because the population of researchers who need BSL-3 or BSL-4 access is small, physically locatable, embedded in institutions with their own accountability structures, and subject to significant career consequences for violations.
Security researchers are a much larger, less institutionally legible population. Plenty of legitimate practitioners work independently, operate pseudonymously, or are based in jurisdictions where professional certification is not standardized. A vetting process that requires, say, a CISM certification and a corporate email address will correctly exclude a large fraction of bad actors and also incorrectly exclude a nontrivial fraction of legitimate researchers.
The inverse error is also possible. A determined attacker with the patience to construct a plausible professional identity can probably pass a vetting process that relies primarily on credentials and stated purpose. The institutional biosafety model works partly because physical presence at a registered institution is hard to fake at scale; the security researcher identity is not.
Anthropics is not going to solve this on launch. What they can do is iterate on the vetting criteria based on observed misuse patterns, revoke access for violations, and improve the signal over time. The API delivery model makes iteration possible in a way that a software distribution model does not. You cannot recall a binary once it has spread; you can tighten an API access policy next Tuesday.
The Refusal Cost That Gets Underweighted
Most public discussion of AI safety and dual-use capability focuses on the misuse side of the ledger. The cost of over-restriction gets less attention, but it is real.
A security researcher who cannot get useful responses from a frontier model does not stop doing security research. They switch to a less capable model, use a jailbroken version, or use a locally-run open-weight model with no safety training and no oversight. The access-control value of standard Claude’s security topic refusals is close to zero, because the marginal cost of working around them is low for anyone sufficiently motivated.
Project Glasswing is attempting something more honest: acknowledge that security-oriented capability has legitimate use cases, gate it behind accountability structures that are actually meaningful, and stop pretending that blanket refusals constitute a security policy. The refusals were never keeping capable adversaries out. They were mostly just annoying defenders.
What Success Looks Like
A successful Glasswing program would look like this in a few years: a substantial population of security researchers using Claude Mythos as a regular part of their workflow, a documented process for handling credential misuse cases, observable absence of Mythos-specific capability in attacker toolchains, and some evidence that the vetting criteria are being refined based on actual misuse patterns rather than just initial assumptions.
Failure looks like the Cobalt Strike outcome: the vetting process gets gamed, capability leaks through shared credentials or API proxies, and the tiered access model provides accountability theater without actual access control.
The structural difference, API delivery with ongoing authentication, gives Anthropic tools that the Cobalt Strike distributor never had. Whether they use those tools well is an operational question that only time will answer. The underlying policy logic is sound, and the enforcement mechanism is meaningfully better than prior attempts. That is about as much as can be said at launch.