The Principal Hierarchy and the Corrigibility Problem: Inside OpenAI's Model Spec
Source: openai
OpenAI published its Model Spec in May 2024, alongside a companion blog post explaining the rationale. Reading it as a PR document, a transparency gesture timed around GPT-4o’s release, is tempting. That reading underestimates what the spec is. It is a training target: a normative document that human raters use when evaluating model outputs during RLHF, and that can serve as a critic in automated feedback pipelines. Publishing it makes the reward signal legible, and that matters more than the PR framing.
As a developer who spends time thinking about how systems handle trust boundaries and permission hierarchies, I find the spec more interesting as architecture documentation than as policy writing.
The Principal Hierarchy as Access Control
The spec establishes three classes of principals: OpenAI, operators, and users. OpenAI sets constraints through training itself, operators communicate through system prompts, and users communicate through the human turn. Each tier has different trust levels and different abilities to expand or restrict model behavior.
This maps closely onto role-based access control. An operator is like a service account with elevated permissions; a user is like an authenticated end-user with scoped access. The spec uses employment as an analogy: the model should treat operator instructions like messages from a manager, following reasonable directives without demanding justification for each one, unless those directives cross ethical bright lines or violate OpenAI’s policies.
The permission model has two dimensions: operators can grant users elevated permissions up to operator level, and both operators and users can toggle behaviors within their respective permission ceilings. Operators can enable explicit content on age-verified platforms, restrict the model to discussing only topics relevant to their product, or disable safe-messaging guidelines for clinical applications. Users can adjust tone, disable disclaimers on persuasive essays they requested, or ask for blunter feedback.
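The two dimensions described above can be sketched as a small access-control model. This is an illustrative sketch only: the `Trust` levels, the `Principal` type, and the `grant` and `may_toggle` helpers are hypothetical names invented here, not anything the spec defines.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; a higher value means more trust."""
    USER = 1
    OPERATOR = 2
    LAB = 3


@dataclass(frozen=True)
class Principal:
    name: str
    trust: Trust    # trust level inherent to the principal's role
    ceiling: Trust  # highest permission level it may currently act at


def grant(operator: Principal, user: Principal, level: Trust) -> Principal:
    """Return a copy of `user` with its ceiling raised by `operator`.

    The grant is clamped to the operator's own ceiling, so an operator
    can never hand out more permission than it holds itself.
    """
    if operator.trust < Trust.OPERATOR:
        raise PermissionError("only operators may grant permissions")
    granted = min(level, operator.ceiling)
    return Principal(user.name, user.trust, max(user.ceiling, granted))


def may_toggle(principal: Principal, required: Trust) -> bool:
    """A behavior can be toggled only within the principal's ceiling."""
    return principal.ceiling >= required
```

The clamp in `grant` is the important design choice: asking for more than operator-level permission does not fail, it simply tops out at the operator's own ceiling.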
Operators can limit what users can do but cannot direct the model against users’ basic interests. A customer service bot can be configured to only discuss the company’s products; it cannot be configured to deceive users in ways that damage them or prevent them from getting urgent help. The spec calls this the distinction between operators restricting the model versus operators weaponizing it. That line is doing a lot of work, and it will get tested.
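One way to picture the restricting-versus-weaponizing line is as a validation step on operator configuration. The sketch below assumes a hypothetical config shape (`allowed_topics`, `disabled_protections`) and hypothetical protection names; the spec describes the principle in prose and does not define any such schema.

```python
# Hypothetical user-protecting floors that no operator config may switch off.
USER_FLOORS = frozenset({
    "no_harmful_deception",     # do not deceive users in ways that damage them
    "refer_to_emergency_help",  # do not block users from getting urgent help
})


def weaponizing_settings(config: dict) -> list[str]:
    """Return the settings in `config` that cross from restricting to weaponizing.

    Narrowing scope (for example `allowed_topics`) is always acceptable here;
    only attempts to disable a user floor are reported.
    """
    requested = set(config.get("disabled_protections", []))
    return sorted(requested & USER_FLOORS)


def is_acceptable(config: dict) -> bool:
    """An operator config is acceptable if it disables no user floor."""
    return not weaponizing_settings(config)
```

A config that only sets `allowed_topics` passes, while one that lists `no_harmful_deception` under `disabled_protections` is flagged no matter what else it contains.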
Hardcoded and Softcoded: Compile-Time vs. Runtime Constants
The structurally cleanest part of the spec is the hardcoded/softcoded distinction. Hardcoded behaviors are absolute: no instruction from any principal at any trust level can change them. The list of things the model must never do includes providing meaningful assistance to anyone seeking to create biological, chemical, nuclear, or radiological weapons; generating sexual content involving minors; taking actions that would undermine human oversight of AI systems; and assisting any effort to seize control of economies, governments, or militaries.
The spec is explicit about why these are absolute rather than adjustable: compelling arguments for crossing them should increase suspicion rather than provide justification. The same reasoning keeps minimum cryptographic key lengths from being runtime-configurable in secure-by-default systems. The value of the constraint comes precisely from its unconditional nature, and the spec says as much directly.
Softcoded behaviors are defaults that can be adjusted within the permission structure. Some are on by default and can be turned off; some are off by default and can be turned on. Medical providers can disable suicide safe-messaging defaults. Harm-reduction platforms can enable detailed drug use information without warnings. Companionship apps can enable relationship personas. Each adjustment is tied to a plausible legitimate use case, which is how the spec tries to prevent the softcoded system from becoming a loophole. Whether that succeeds depends on how seriously operators are vetted and how well the model distinguishes a legitimate clinical context from a system prompt designed to look like one.
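The compile-time versus runtime analogy can be made concrete with a short sketch: hardcoded prohibitions live in an immutable set that no override path consults for permission, while softcoded defaults carry a minimum role that may flip them. All behavior names, the role ranking, and the `resolve` function are assumptions made for illustration, not identifiers from the spec.

```python
from types import MappingProxyType

# Hypothetical names for the absolute prohibitions listed above.
HARDCODED_PROHIBITED = frozenset({
    "cbrn_weapons_uplift",
    "sexual_content_involving_minors",
    "undermine_ai_oversight",
    "illegitimate_power_seizure",
})

# Hypothetical softcoded behaviors: name -> (on by default?, minimum role to change it).
SOFTCODED_DEFAULTS = MappingProxyType({
    "suicide_safe_messaging": (True, "operator"),
    "drug_use_warnings": (True, "operator"),
    "relationship_persona": (False, "operator"),
    "persuasive_essay_disclaimer": (True, "user"),
})

ROLE_RANK = {"user": 1, "operator": 2}


def resolve(overrides):
    """Apply (role, behavior, enabled) overrides on top of the defaults.

    Hardcoded prohibitions are rejected regardless of role; softcoded
    behaviors change only if the role meets the behavior's minimum rank.
    """
    state = {name: default for name, (default, _) in SOFTCODED_DEFAULTS.items()}
    for role, behavior, enabled in overrides:
        if behavior in HARDCODED_PROHIBITED:
            raise PermissionError(f"{behavior} is hardcoded and cannot be adjusted")
        _, min_role = SOFTCODED_DEFAULTS[behavior]  # KeyError for unknown behaviors
        if ROLE_RANK[role] < ROLE_RANK[min_role]:
            raise PermissionError(f"{role} may not adjust {behavior}")
        state[behavior] = enabled
    return state
```

Note what the sketch cannot capture: it checks who is asking, not whether the stated clinical or harm-reduction context is real, which is exactly the vetting question raised above.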
The Corrigibility Dial
The most philosophically significant section describes what the spec calls “broadly safe behaviors.” The framing uses a dial ranging from fully corrigible (does whatever the principal hierarchy dictates) to fully autonomous (acts entirely on its own values and judgment). The spec argues that both extremes are dangerous.
Full corrigibility is dangerous because it makes the model’s behavior entirely dependent on the values of whoever controls the principal hierarchy, including OpenAI itself. If OpenAI has bad values, a fully corrigible model amplifies them. Full autonomy is dangerous because it requires the model to have verified, calibrated values and the judgment to act on them correctly across arbitrary situations; we do not currently have interpretability tools sufficient to verify that.
The spec’s position: models should sit close to the corrigible end for now, with genuine ethical bright lines as the exception. The reasoning is explicitly meta-ethical. The model might have subtly miscalibrated values from flawed training. Humans need the ability to detect and correct this before it causes irreversible harm. Supporting human oversight is therefore what an AI with good values should want to do, even if it means occasionally deferring when its own judgment would suggest otherwise.
This is a coherent argument, and also self-referentially awkward. The spec asks the model to be corrigible to a principal hierarchy in which OpenAI sits at the top. It includes a note that the model should not help OpenAI seize unprecedented societal control, but this is a self-imposed constraint with no external enforcement mechanism. OpenAI grades its own homework here, and the spec does not provide a satisfying answer to that problem.
Structural Convergence Across Labs
Reading the OpenAI Model Spec alongside Anthropic’s published character documentation for Claude reveals striking structural similarity. Both use a three-tier principal hierarchy (lab, operator, user) with nearly identical trust assignments. Both distinguish hardcoded from softcoded behaviors in essentially the same terms. Both use a corrigibility-to-autonomy framing for broadly safe behaviors. Both emphasize that unhelpfulness is not automatically safe, that sycophancy is a failure mode, and that non-deception and non-manipulation are the most critical honesty properties.
The convergence is not a coincidence. The two organizations share significant intellectual lineage, with many researchers having worked at both. The structural similarity suggests that people working seriously on this problem tend to arrive at similar architectures. That is either reassuring, meaning these patterns reflect genuine convergence on good solutions, or concerning, meaning the whole industry is coordinating around a framework with the same blind spots everywhere.
One specific artifact: early versions of the OpenAI Model Spec contained references to “Claude” where “the model” was presumably intended. This is visible evidence of the cross-organizational drafting that produced these documents, and it suggests these frameworks are more similar in origin than the competitive positioning of either company would imply.
What the Spec Cannot Guarantee
The gap between specification and behavior is the biggest limitation of this approach. The spec is a training target, not a verified property. It influences human raters, which shapes the reward signal, which shapes model weights, but the chain from prose document to model behavior is long and lossy. Empirical red-teaming consistently finds cases where models violate their stated principles: being sycophantic despite the spec’s explicit warnings against it, refusing reasonable requests due to overcautious pattern-matching, or occasionally complying with harmful requests that have been superficially reframed.
The spec acknowledges this. It notes that the model may have miscalibrated values from flawed training, which is precisely why it argues for corrigibility. But that acknowledgment also means the spec is not making strong behavioral guarantees. Publishing it creates accountability and enables external scrutiny, both of which are genuinely valuable. But accountability requires that someone outside OpenAI can verify the gap between specification and actual behavior, and that infrastructure is still largely absent. The spec is a legible target; whether the model actually hits it requires tooling that does not yet exist at any meaningful scale.
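To make "verifying the gap" slightly more concrete, here is a minimal sketch of the bookkeeping an external audit would need: per-category violation rates computed from labeled transcripts. The record format and category names are hypothetical, and the sketch assumes the hard part, reliable labels from someone outside the lab, has already been done.

```python
from collections import Counter


def violation_rates(records):
    """Compute the share of spec violations per category.

    `records` is an iterable of (category, violated) pairs, where
    `violated` is True if a reviewer judged the output to break the spec.
    """
    totals, violations = Counter(), Counter()
    for category, violated in records:
        totals[category] += 1
        if violated:
            violations[category] += 1
    return {category: violations[category] / totals[category] for category in totals}
```

For example, feeding in two `sycophancy` records with one marked as a violation yields a rate of 0.5 for that category. The arithmetic is trivial; the missing infrastructure is the independent labeling and sampling that would make such numbers trustworthy.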
Worth Reading
The Model Spec is worth reading carefully regardless of your position on AI governance. As a document about how to specify behavior for a system that needs to operate across wildly different contexts, with different principals, different use cases, and fundamentally conflicting requests, it confronts design problems that are not unique to AI. Any system serving multiple principals with different trust levels, maintaining non-negotiable constraints while allowing configuration within those constraints, and remaining correctable as its capabilities evolve faces a version of the same problem.
The spec’s answers are incomplete. The corrigibility-to-OpenAI structure is self-policed. The softcoded permission system will be tested by adversarial operators. The gap between specification and trained behavior is large and not well-measured. But the questions the spec is trying to answer are the right ones, and its framework is specific enough to be useful and to give outsiders something concrete to hold OpenAI to.