The Principal Hierarchy Problem: What OpenAI's Model Spec Reveals About AI Governance
Source: openai
OpenAI recently published a detailed look at their Model Spec, the public document that defines how their models are expected to behave. Reading it as a developer, I find myself less interested in the values it articulates and more interested in the architecture it implies. The Model Spec is not just a list of principles. It is an attempt to define a structured, layered permission system for AI behavior, and the way it does that is worth understanding on its own terms.
The Priority Stack
The core of the Model Spec is a priority ordering. Models are expected to be, in descending order: broadly safe, broadly ethical, adherent to OpenAI’s principles, and genuinely helpful. When those goals conflict, higher priorities win.
That ordering is not obvious. Many people would instinctively put “helpful” near the top. Putting safety first and helpfulness last is a deliberate choice, and OpenAI’s reasoning is that an unsafe or unethical model that is maximally helpful causes more total harm than a safe model that is occasionally less useful. The argument is broadly sound. An AI that completes every request regardless of consequences is not helpful in any meaningful long-run sense.
What strikes me is how much this resembles the priority orderings you find in distributed systems design. When multiple constraints conflict, you need an explicit tiebreaker. The Model Spec formalizes that tiebreaker for model behavior much as a consistency model formalizes which guarantees a database read gets when they cannot all hold at once.
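To make the tiebreaker idea concrete, here is a minimal Python sketch of a strict lexicographic ordering over the four priorities described above. The names (PRIORITIES, Candidate, pick_response) and the per-priority scores are hypothetical, invented for illustration. Nothing here comes from the spec itself, and it does not describe how any model actually selects a response. It only shows what “higher priorities win” means as a comparison rule.

```python
from dataclasses import dataclass

# Priority order as described in this article, highest first.
PRIORITIES = ("safe", "ethical", "follows_principles", "helpful")


@dataclass(frozen=True)
class Candidate:
    name: str
    scores: dict  # hypothetical per-priority scores in [0, 1]


def priority_key(candidate):
    """Order scores so that earlier priorities dominate later ones."""
    return tuple(candidate.scores[p] for p in PRIORITIES)


def pick_response(candidates):
    """Return the candidate that wins a strict lexicographic comparison."""
    return max(candidates, key=priority_key)


cautious = Candidate("cautious", {
    "safe": 1.0, "ethical": 1.0, "follows_principles": 1.0, "helpful": 0.2,
})
reckless = Candidate("reckless", {
    "safe": 0.0, "ethical": 1.0, "follows_principles": 1.0, "helpful": 1.0,
})

print(pick_response([cautious, reckless]).name)  # prints "cautious"
```

Python compares tuples element by element, so helpfulness only decides the outcome when the candidates tie on everything above it.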
The Principal Hierarchy
The more technically interesting part is what the spec calls the principal hierarchy. There are three layers: OpenAI itself sits at the top and sets the meta-level constraints through training. Operators, meaning businesses and developers who access the API, sit in the middle and can customize model behavior within the bounds OpenAI allows. Users sit at the bottom and can further adjust within the bounds operators allow.
This is access control for AI behavior. It is structurally similar to how Unix permissions work, or how IAM policies layer in cloud infrastructure. A user cannot escalate privileges beyond what the operator grants, and an operator cannot escalate beyond what OpenAI permits.
Operators can unlock certain behaviors that are unavailable to users by default. An adult content platform could allow explicit material. A medical provider could disable certain safety messaging guidelines that would be appropriate for general consumers but are unnecessary for professionals. The spec uses the term “softcoded” for behaviors that can be toggled this way, as opposed to “hardcoded” behaviors that cannot be changed under any circumstances.
The hardcoded list is short and absolute: never provide serious assistance with weapons capable of mass casualties, never generate sexual content involving minors, never take actions that would meaningfully undermine the ability of legitimate principals to oversee AI systems. These are the behaviors that sit outside the permission system entirely.
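The layering described above can be sketched as set intersection, which is roughly how one would model it in an access-control system. This is an illustration under stated assumptions, not the spec’s mechanism: the behavior labels, HARD_LIMITS, and effective_behaviors are all hypothetical names, and real model behavior is trained rather than computed this way.

```python
# Hypothetical behavior labels; the hard limits paraphrase the list above.
HARD_LIMITS = frozenset({
    "mass_casualty_weapons",
    "sexual_content_minors",
    "undermine_oversight",
})


def effective_behaviors(platform_softcoded, operator_grants, user_requests):
    """Return the non-default behaviors actually unlocked for a conversation."""
    # Operators can only unlock what the platform marks as adjustable,
    # and hard limits are removed no matter what anyone asks for.
    operator_scope = (set(operator_grants) & set(platform_softcoded)) - HARD_LIMITS
    # Users can only turn on what the operator has granted.
    return set(user_requests) & operator_scope


platform = {"explicit_content", "omit_safety_caveats"}
operator = {"omit_safety_caveats", "mass_casualty_weapons"}
user = {"omit_safety_caveats", "explicit_content", "mass_casualty_weapons"}

print(effective_behaviors(platform, operator, user))  # prints {'omit_safety_caveats'}
```

The user asked for three things and got one: explicit content was never granted by this operator, and the weapons request is filtered out before any grant is even considered.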
Corrigibility as an Engineering Constraint
The concept the spec spends the most time on is what alignment researchers call corrigibility: the degree to which a model defers to human oversight versus acts on its own judgment. The spec frames this as a dial between two failure modes.
A fully corrigible model does whatever its principals direct. The problem is that this makes the model’s behavior entirely dependent on whether OpenAI’s values are actually good. If OpenAI is wrong about something, or if the organization is corrupted, a fully corrigible model has no independent check.
A fully autonomous model acts entirely on its own values and judgment. The problem is that we have no reliable way to verify that a model’s values are trustworthy enough to warrant that autonomy. We cannot audit model weights the way we audit source code. The alignment research field does not yet have tools robust enough to certify that a model’s internalized values match its stated values.
The spec’s answer is to place current models closer to the corrigible end while explicitly acknowledging this is a temporary position. As interpretability and alignment research matures, the expectation is that models can be given more autonomy. This is a reasonable engineering tradeoff: ship with conservative defaults, loosen constraints as verification tools improve.
Anthropic’s Constitutional AI approach, published in late 2022, makes similar tradeoffs from a different direction. Rather than a priority stack that human raters apply when judging outputs, Constitutional AI uses a list of principles to guide model self-critique during the training process itself. The model is asked to evaluate and revise its own outputs against the constitution. The result is also a set of trained dispositions, but the mechanism for encoding them is more explicit about the role of the document in shaping the training signal.
OpenAI’s Model Spec is less prescriptive about the training mechanism. It reads more like a specification for evaluators: here is what a good response looks like, here is what behavior we are trying to reinforce. The actual translation from prose to weights happens through RLHF and related techniques, where human raters use documents like this one to calibrate their judgments.
The Translation Problem
This is where I think both approaches have an underexplored gap. A model spec or a constitution only shapes behavior to the degree that it successfully informs training. Writing a clear document is the easier part. The harder question is how faithfully human raters interpret that document, how much disagreement exists between raters, and whether the resulting training signal actually produces the intended behaviors in distribution-shifted situations the document never anticipated.
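Rater disagreement, at least, is measurable. One standard statistic is Cohen’s kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below is a plain-Python version. The labels and the two rating lists are made up for illustration, and nothing here reflects how any lab actually audits its raters.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each rater labeled at random with their own label frequencies.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on whether four responses follow the document.
rater_1 = ["ok", "ok", "violation", "violation"]
rater_2 = ["ok", "violation", "violation", "violation"]

print(cohens_kappa(rater_1, rater_2))  # prints 0.5
```

Here the raters agree on three of four items, but half of that agreement is what chance alone would produce, so kappa comes out at 0.5. A low kappa across a rater pool would suggest the document is being read in materially different ways before any training happens.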
The spec acknowledges this in passing. It notes that the priority ordering is meant to guide training, not to imply that models are constantly running explicit prioritization logic at inference time. The goal is internalized dispositions. But that framing highlights the gap: we are hoping that a training process correctly encodes a priority stack into learned behavior, and we have limited tools for verifying that the encoding was successful.
Interpretability research, particularly mechanistic interpretability work from groups at Anthropic and DeepMind, is slowly building toward the ability to read specific circuits out of transformer weights. But we are far from being able to audit a model’s value hierarchy the way you would audit a configuration file.
Accountability Without Verification
Publishing the Model Spec is genuinely useful. It creates a public record of what OpenAI claims their models should do, which means researchers, journalists, and users can point to specific behaviors that violate the stated spec. That is accountability through documentation, and it is not nothing.
But it is worth being clear about what it does not provide. It does not provide verification that the spec was faithfully implemented. It does not provide a mechanism for operators or users to audit how their specific deployment is being constrained. It does not tell you how conflicts between the priority layers are resolved in ambiguous cases.
The principal hierarchy is an elegant abstraction, but every abstraction leaks. An operator setting a system prompt that pushes against the spec’s boundaries is relying on model behavior that was trained against a document, not enforced by a verifiable system. The compliance is probabilistic, not guaranteed.
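One practical consequence for operators: compliance with a system prompt instruction is something you estimate by sampling, not something you read off a config. The sketch below computes a Wilson score interval for an observed compliance rate. The function name wilson_interval and the 100-trial scenario are assumptions for illustration, and z = 1.96 corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical test run: the model followed the instruction in 100 of 100 trials.
low, high = wilson_interval(100, 100)
print(f"95% interval for compliance rate: [{low:.3f}, {high:.3f}]")
```

Even a perfect run of 100 trials leaves a lower bound of about 0.96, which is a reasonable way to think about what “respected by the model” means when nothing enforces it.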
This is not a criticism unique to OpenAI. Every AI lab faces the same problem. Model behavior is a function of training, and training is imperfect. The Model Spec is the most honest framing I have seen of what a behavioral specification can and cannot accomplish given current technology. It names the gap between stated values and verified behavior, even if it cannot close it.
For developers building on top of these systems, the spec is useful reading not because it guarantees specific behavior, but because it tells you what the intended defaults are, what operators can change, and where the hard limits are. Understanding the principal hierarchy helps you reason about what system prompt instructions will and will not be respected, and what kinds of user requests will hit walls regardless of your configuration.
The Model Spec is a governance document written at a moment when governance tooling for AI barely exists. Its value is in making the intended architecture explicit enough to critique, and that is a more useful starting point than most of the industry has offered.