
Between Rules and Values: The Engineering Logic of OpenAI's Model Spec

Source: openai

OpenAI’s Model Spec was published in May 2024 alongside a companion article framing it as an attempt to give models genuine character rather than a rule lookup table. The document has been discussed at length in alignment circles, but reading it as someone who designs layered permission systems, even the relatively mundane kind found in Discord bots and API services, surfaces a few things that the usual framing glosses over.

The Permission System at the Core

The spec introduces the concept of principals, entities whose instructions carry weight in determining model behavior. There are four: OpenAI (communicates via training, highest trust), operators (API customers who deploy the model, communicate via the system prompt), users (end users in the conversation), and the model itself (relevant in agentic and multi-agent contexts).

Anyone who has designed a role-based access system recognizes this pattern immediately. Operators are admins. Users are members. OpenAI is the platform enforcing the outer constraints. The spec makes the analogy explicit, comparing operators to “a relatively trusted manager or employer within the limits set by OpenAI.”

The permission model follows from there:

  • Operators can expand default behaviors for users, such as an adult content platform enabling explicit material
  • Operators can restrict defaults, such as a customer service bot limited to product topics
  • Operators can grant users the ability to modify behaviors, up to but not exceeding their own operator-level permissions
  • Certain behaviors are hardcoded and cannot be modified by anyone at any trust level
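For readers who think in access-control terms, the four rules above can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not anything the spec defines: the behavior names, the `OperatorConfig` class, and the `resolve` function are all hypothetical, and the sketch assumes an operator may change any default that is not hardcoded.

```python
from dataclasses import dataclass, field

# Hypothetical behavior names, invented for this sketch (not taken from the spec).
HARDCODED_OFF = frozenset({"wmd_uplift", "csam"})
PLATFORM_DEFAULTS = {
    "explicit_content": False,
    "off_topic_chat": True,
    "profanity": False,
}


@dataclass
class OperatorConfig:
    # behavior -> bool: defaults the operator expands (True) or restricts (False)
    overrides: dict = field(default_factory=dict)
    # behaviors the operator lets users toggle for themselves
    user_adjustable: frozenset = frozenset()


def resolve(behavior, operator, user_request=None):
    """Return whether `behavior` is enabled for this conversation."""
    # Hardcoded limits win regardless of who asks.
    if behavior in HARDCODED_OFF:
        return False
    # Operator layer: expand or restrict the platform default.
    default = PLATFORM_DEFAULTS.get(behavior, False)
    enabled = operator.overrides.get(behavior, default)
    # User layer: applies only where the operator delegated the toggle.
    if user_request is not None and behavior in operator.user_adjustable:
        enabled = user_request
    return enabled


# A support bot restricts off-topic chat; a user asking for it changes nothing.
support = OperatorConfig(overrides={"off_topic_chat": False})
print(resolve("off_topic_chat", support, user_request=True))  # False
```

The ordering of the checks carries the design: the hardcoded test runs first, so no operator override or user request can reach it, and the user layer is consulted only for behaviors the operator has explicitly delegated.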

The last category is what the spec calls hardcoded OFF behaviors. The list includes providing meaningful uplift toward weapons of mass destruction, generating CSAM, and taking actions that undermine legitimate human oversight of AI. The framing around these is deliberate: bright lines need to be resistant to compelling arguments for exceptions. The spec argues that, if anything, a persuasive case for crossing a bright line should increase suspicion that something is wrong. The reasoning is that a sufficiently well-crafted adversarial prompt can construct plausible-sounding justifications for almost anything, so robustness to such arguments is itself a safety property, not a failure of reasoning.

The Disposition Dial

The most technically honest concept in the spec is the disposition dial, a conceptual spectrum running from fully corrigible (does whatever the principal hierarchy says) to fully autonomous (acts entirely on its own values and judgment).

The spec argues that neither extreme is safe. Full corrigibility makes the model’s behavior entirely contingent on OpenAI having good values, which cannot be independently verified. Full autonomy relies entirely on the model having well-calibrated values, which also cannot be independently verified. The spec positions the target closer to the corrigible end without being fully corrigible, and it is explicit that this is a choice made under uncertainty rather than a confident optimum.

The reasoning offered for why safety constraints sit above even ethical behavior in the priority ordering is worth examining carefully. The argument is that a model with subtly miscalibrated values cannot verify its own miscalibration. Human oversight is the error-correction mechanism that makes it possible to catch and fix those errors. The model defers not because oversight is infallible, but because it cannot be confident enough in its own value alignment to justify overriding it. This is a meta-level safety argument: the correction mechanism is trusted precisely because the thing being corrected, the model, cannot assess its own trustworthiness.

The spec also includes a property called autonomy-preserving under its honesty cluster. Because one model is interacting with a very large number of people simultaneously, nudging everyone toward particular views could have disproportionate societal effects. The spec offers this as a reason for the model to be careful about how it presents opinions: the point is not to suppress them, but to stay mindful of scale.

The Temporal Clause

Several of the spec’s safety constraints are explicitly tied to “the current period of AI development,” with the implication that these constraints are appropriate given present uncertainty and may be relaxed as alignment research matures and trust is established.

This is a reasonable position. The problem is that the spec does not define what ends the current period, who makes that determination, or what evidence would be required to trigger relaxing the constraints. A temporal clause in a contract without a defined termination condition is either a time-limited promise or indefinite deferral, and the document does not specify which.

This framing is not unique to OpenAI. Anthropic’s approach to alignment, described in their Constitutional AI research and subsequent model documentation, uses nearly identical language, noting that AI and humans need to develop tools and techniques for establishing the trust that would justify greater model autonomy over time. The convergence suggests that the framing reflects a shared consensus in alignment thinking rather than two independent inventions, which is plausible given how much researcher overlap exists between the organizations.

What Converges with Anthropic’s Approach

The structural similarity between the OpenAI spec and Anthropic’s public model documentation is extensive. Both define a layered principal hierarchy with comparable semantics. Both enumerate similar honesty properties, including a specific distinction between non-deception (not creating false impressions through any means) and non-manipulation (not using illegitimate epistemic techniques to influence beliefs), which is more precise than just requiring truthfulness. Both warn against being assistant-brained, meaning purely obedient without genuine values. Both have hardcoded absolute limits in the same categories. Both use the disposition dial framing. Both tie safety constraints to a current period.

The surface differences are mostly in presentation and emphasis. Anthropic’s approach leans into the Constitutional AI (CAI) methodology more explicitly, treating the constitutional document as both a values statement and a training signal in a unified framework. The original CAI paper describes a process where an AI critiques and revises its own outputs against a written constitution, generating preference labels that reduce reliance on human feedback for the harm dimension. OpenAI’s spec is somewhat more granular in specifying the operator and user permission mechanics. Neither difference is fundamental to the underlying alignment philosophy.

The convergence is worth noting not to diminish either effort, but because it suggests the field has arrived at a shared structural understanding of the problem: layered principals, a disposition set between corrigibility and autonomy, hardcoded limits, and temporal constraints tied to verifiable trust. The methodologies for implementing that structure still differ.

Heuristics and the Verification Problem

The spec includes several practical heuristics for evaluating responses. The “thoughtful senior OpenAI employee” standard asks the model to consider how a person who cares about both doing the right thing and being genuinely useful would react to a given output. That hypothetical person would be uncomfortable both with harmful responses and with needlessly cautious or preachy ones. This is paired with a dual newspaper test: would a reporter covering AI harms report the response as harmful, and would a reporter covering overprotective AI systems report it as uselessly paternalistic?

The 1000-users mental model is another concrete heuristic: treat a response to an ambiguous request as a policy across the realistic distribution of people who might send that message. A question about medication overdose thresholds might come mostly from caregivers, medical professionals, and concerned patients. The response should serve that realistic majority while calibrating to the actual risk posed by the minority. This is a useful frame for thinking about response decisions as policies rather than individual judgments.
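One way to make the policy framing concrete is a toy expected-value calculation over the people who might send the same message. This is only a sketch under loud assumptions: the sender categories, their shares, and the benefit and harm weights are all invented for illustration, and nothing in the spec prescribes numbers or a formula like this.

```python
# Every number below is invented for illustration; none comes from the spec.
population = [
    # (sender type, share of senders, benefit if answered, harm if answered)
    ("caregiver", 0.45, 1.0, 0.0),
    ("clinician", 0.30, 1.0, 0.0),
    ("concerned patient", 0.20, 1.0, 0.0),
    ("at-risk sender", 0.05, 0.0, 8.0),
]


def policy_value(population, answer):
    """Expected net value of applying one response policy to every sender."""
    if not answer:
        # In this toy model a refusal helps no one and harms no one.
        return 0.0
    return sum(share * (benefit - harm) for _, share, benefit, harm in population)


print(policy_value(population, answer=True))   # positive with these weights
print(policy_value(population, answer=False))  # 0.0
```

With these made-up weights, answering comes out ahead of refusing. The interesting knob is the harm weight on the minority: raise it far enough and the same population favors a more cautious policy, which is the calibration the heuristic asks for.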

The companion article discusses deliberative alignment as a research direction: training models to explicitly reason about spec principles when making decisions, rather than pattern-matching to absorbed behavioral examples. The goal is alignment that is more robust and more interpretable, where you could in principle audit the model’s reasoning against the spec rather than inferring alignment from behavior alone.

This points to the right underlying problem, which the spec itself cannot solve: the gap between a specification document and a verified implementation is not closed by the document’s existence. The spec describes what the model should do. There is no mechanism within the document for confirming that training actually produced it. Deliberative alignment, if it works, would make that gap more visible and more auditable. Whether it delivers is a separate and open question, but framing it as a goal shows awareness of the limitation.

The spec is worth reading both as a statement of intent from an organization that has historically been opaque about its alignment approach, and as an engineering document that makes visible the structural choices behind how these models are trained to behave. Those choices shape the behavior of every system built on top of them, which includes a lot of software people are building right now without necessarily having read the document that defines the defaults.
