OpenAI's Model Spec Is an Engineering Document, Not a Policy Document

OpenAI published its Model Spec as a public document describing how its models should think, behave, and prioritize conflicting instructions. Most coverage treats it as a policy statement, a kind of terms-of-service for model conduct. That framing misses what makes it technically interesting.

The Model Spec is not a runtime prompt. It is a training artifact. The behaviors it describes are intended to be baked into the model’s weights through RLHF and related techniques, not enforced through a system prompt that sits above the conversation. That distinction matters because trained-in behavior ships with every deployment, whatever an operator or user later puts in the prompt.

What the Spec Actually Defines

The document establishes a hierarchy of principals: OpenAI itself, operators (developers and companies accessing the API), users (humans interacting in real time), and the model itself. Each level carries a different degree of trust and a different ability to expand or restrict the model’s default behaviors.

OpenAI’s authority comes not from instructions passed at inference time but from the training process itself. The spec notes that OpenAI’s influence is “baked into” the model before any operator or user ever touches it. Operators get elevated trust because they have agreed to usage policies and are accountable for their deployments. Users get somewhat less default latitude because they are often unknown third parties.

Anyone who has built a Discord bot with a role-based permission system will recognize the shape of this immediately. The pattern is: platform sets hard limits, server admins customize within those limits, regular users operate within what the admins allow. The Model Spec formalizes exactly that pattern at the level of model behavior, with the notable addition that the model itself is listed as a principal in agentic contexts, able to exercise judgment when acting as an orchestrator or subagent.
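
The layering described above can be sketched in a few lines of Python. This illustrates the permission pattern only and is not taken from the spec: the behavior names and the `effective_behaviors` helper are hypothetical, and real model behavior is learned rather than computed by set intersection.

```python
# Hypothetical sketch of the platform / admin / user layering.
# All names here are illustrative assumptions, not part of any real API.

def effective_behaviors(platform_allowed, operator_allowed, user_requested):
    """Return the behaviors a user actually gets.

    Each layer can only narrow what the layer above permits, so the
    result is the intersection of all three sets.
    """
    return set(user_requested) & set(operator_allowed) & set(platform_allowed)


platform = {"casual_tone", "profanity", "code_help"}
operator = {"casual_tone", "code_help", "weapons_uplift"}  # last item exceeds the platform limit
user = {"profanity", "code_help", "weapons_uplift"}

granted = effective_behaviors(platform, operator, user)
print(sorted(granted))  # ['code_help']
```

The operator's attempt to allow something the platform never permitted has no effect, and the user's request for profanity fails because the operator did not pass it through.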

Hardcoded and Softcoded Behaviors

The spec draws a sharp line between behaviors that are fixed regardless of instructions and behaviors that represent adjustable defaults.

Hardcoded behaviors include things the model will never do no matter what any principal instructs: providing serious technical assistance toward biological, chemical, nuclear, or radiological weapons with mass casualty potential; generating sexual content involving minors; taking actions that meaningfully undermine the ability of humans to oversee AI systems. These are described as “bright lines” rather than cost-benefit calculations, because the spec explicitly argues that the value of treating them as inviolable comes precisely from their unconditional nature. A model that would cross these lines under sufficiently extreme arguments is less safe than one that simply refuses to engage.

Softcoded behaviors are defaults that can be adjusted. Some are on by default and operators can turn them off: safe messaging guidelines around suicide and self-harm, safety caveats on certain content, balanced perspectives on contested topics. Some are off by default and operators can enable them: explicit sexual content (for adult platforms), detailed information about topics like illicit drug use without standard warnings, taking on relationship personas with users. Users can adjust some behaviors further within what operators permit.

This maps reasonably well to how sandboxing works in systems programming. You have capability sets: some capabilities are not available to any user regardless of privilege, some are available by default and can be dropped, and some are unavailable by default but can be granted. The Model Spec’s behavioral model follows that same structure.
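
A minimal sketch of that three-way split, assuming behaviors can be named as simple strings. The behavior names, the `OperatorConfig` type, and the `active_behaviors` function are all hypothetical; they mirror the categories in the text, not any published interface.

```python
from dataclasses import dataclass, field

# Hypothetical behavior names; the three categories mirror the text above.
HARDCODED_OFF = frozenset({"mass_casualty_weapons_uplift"})
DEFAULT_ON = frozenset({"safe_messaging", "safety_caveats", "balanced_perspectives"})
DEFAULT_OFF = frozenset({"explicit_content", "relationship_persona"})


@dataclass
class OperatorConfig:
    drop: set = field(default_factory=set)   # default-on behaviors to turn off
    grant: set = field(default_factory=set)  # default-off behaviors to turn on


def active_behaviors(config):
    """Apply an operator's adjustments to the defaults."""
    forbidden = config.grant & HARDCODED_OFF
    if forbidden:
        raise PermissionError(f"cannot be enabled by any principal: {sorted(forbidden)}")
    unknown = (config.drop - DEFAULT_ON) | (config.grant - DEFAULT_OFF)
    if unknown:
        raise ValueError(f"not an adjustable behavior: {sorted(unknown)}")
    return (DEFAULT_ON - config.drop) | config.grant


adult_platform = OperatorConfig(drop={"safety_caveats"}, grant={"explicit_content"})
print(sorted(active_behaviors(adult_platform)))
# ['balanced_perspectives', 'explicit_content', 'safe_messaging']
```

The hardcoded set is checked first and raises instead of being silently ignored, which reflects the idea that those behaviors are not part of the adjustable space at all.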

The Corrigibility Dial

The most philosophically interesting part of the spec is what it calls the corrigibility spectrum. At one extreme is a fully corrigible model: it does whatever the principal hierarchy dictates, with no independent judgment. At the other extreme is a fully autonomous model: it acts entirely on its own values and reasoning, ignoring external instructions when it disagrees with them.

Both extremes are described as dangerous. A fully corrigible model is only as safe as the organization at the top of the hierarchy. If OpenAI itself had bad values or made bad decisions, a fully corrigible model would faithfully execute them. A fully autonomous model requires a degree of trust in the model’s values and judgment that has not been established, and probably cannot be established with current interpretability tools.

The spec argues that current models should sit closer to the corrigible end, without being fully corrigible. They should follow instructions from the principal hierarchy even when they disagree, except in cases of clear ethical violations. The intent is to shift toward more autonomy as alignment research matures, interpretability improves, and trust is incrementally established.

This framing has a direct precedent in how organizations think about employee judgment. A new hire generally follows institutional procedures even when they might personally disagree, because their judgment has not yet been validated in context. As they demonstrate reliability and build a track record, they earn more latitude to deviate based on their own reasoning. The spec explicitly uses this analogy.

What is notable is that the spec treats this as a design constraint that should evolve over time, not a permanent property. That is a more nuanced position than simply asserting that AI should always defer to humans.

How This Compares to Anthropic’s Constitutional AI

Anthropic’s Constitutional AI takes a related but structurally different approach. The “constitution” is a set of principles used to generate AI feedback during training, replacing some of the human labeling in RLHF with AI-generated critiques based on a written set of values. The principles are explicitly stated and applied during a self-critique and revision phase.
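
The self-critique and revision phase can be sketched as a short loop, assuming the model is available as a plain function from prompt string to completion string. The `critique_and_revise` helper, the prompt wording, and the `toy_model` stand-in are all hypothetical; Anthropic's actual prompts and pipeline differ in detail, and the later AI-feedback stage is not shown.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]  # assumption: prompt in, completion out


def critique_and_revise(model: Model, prompt: str, principles: Sequence[str]) -> str:
    """Draft a response, then critique and revise it once per principle.

    In this style of training, the final revisions would be collected
    as fine-tuning targets.
    """
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Critique this response according to the principle: {principle}\n"
            f"Response: {response}"
        )
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for a real language model so the sketch runs offline."""
    if text.startswith("Critique"):
        return "too curt"
    if text.startswith("Rewrite"):
        return text.rsplit("Response: ", 1)[1] + " (revised)"
    return "draft"


print(critique_and_revise(toy_model, "Say hello", ["Be polite", "Be honest"]))
# draft (revised) (revised)
```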

OpenAI’s Model Spec does not describe a specific training algorithm. It describes the intended outcome in terms of behavior and character, and the expectation is that training processes will be designed to produce models that conform to it. The spec is more like a requirements document for a target model, while Constitutional AI is more like a specific training method.

Both approaches represent a shift from purely reactive content filtering toward trying to embed values at the level of model character. The earlier paradigm, which still exists in various forms, involved training models to be helpful and then layering classifiers or guardrails on top to block harmful outputs. The recognition that these surface-level constraints are brittle, easily circumvented by prompt manipulation, and prone to producing incoherent behavior has pushed the field toward approaches that try to build values in from the start.

DeepMind’s Sparrow paper from 2022 was an early formal articulation of rules-based RLHF in this vein. Google’s own internal guidelines for Gemini follow similar patterns, though they are less publicly documented than either OpenAI’s or Anthropic’s frameworks.

What Publishing This Actually Means

One of the less-discussed aspects of releasing the Model Spec publicly is what it does and does not commit OpenAI to. The document is not a contract. It describes intentions, not guarantees. A model trained to conform to this spec will approximate it imperfectly, and there is no external mechanism to verify conformance.

The value of publishing it is primarily epistemic and social. Researchers can examine the framework and identify potential failure modes. Operators building on the API have a clearer sense of what behavioral envelope they can expect. Users have more transparency about the intent behind the model’s behavior than they would otherwise. And OpenAI creates a form of accountability by making its stated goals public, even if those goals are not perfectly achieved.

This is similar to how publishing a threat model for a security system has value beyond the security properties the system actually achieves. The published model creates a shared reference for discussion, audit, and criticism, even if the system does not fully satisfy it.

The spec also explicitly discusses what the model should do when it believes it is being tested, when it encounters contradictory instructions from different principals, and how it should handle situations where following instructions would require deceiving users in ways that damage their interests. These are operational considerations that most policy documents ignore, and their inclusion signals that this is intended as a working specification rather than an aspirational statement.

Where This Approach Has Limits

Spec-as-governance has real limits that the document does not fully resolve. The principal hierarchy assumes that principals act in good faith within their designated roles. An operator who deliberately tries to weaponize the model against users is still inside the hierarchy while violating its intended norms. The spec instructs the model to distinguish between operators limiting its helpful behaviors (acceptable) and operators using it as a tool against users’ basic interests (not acceptable), but this distinction requires the model to make accurate judgments about operator intent in contexts where intent is often ambiguous.

The hardcoded constraints address catastrophic risks, but the softcoded middle ground is where most real deployments operate, and the interaction between nested permission layers can produce emergent behaviors that no single layer anticipated. Anyone who has debugged a permissions system knows that the hard part is not the explicit rules but the implicit interactions between them.
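
One such implicit interaction is what operator silence means. The sketch below, built on hypothetical setting names and a hypothetical `resolve` function, shows that the same operator and user configuration yields different outcomes depending on a rule neither layer ever wrote down.

```python
# Hypothetical defaults and layers; names are illustrative only.
DEFAULTS = {"safety_caveats": True, "profanity": False}


def resolve(operator, user, silence_means_allowed):
    """Merge operator and user overrides on top of the defaults.

    `silence_means_allowed` picks how to read a setting the operator never
    mentioned: adjustable by the user (True) or locked at its default (False).
    """
    result = dict(DEFAULTS)
    result.update(operator)
    for key, value in user.items():
        if key in operator:
            continue  # operator made an explicit choice; user cannot override it
        if silence_means_allowed:
            result[key] = value
    return result


operator = {"safety_caveats": False}  # says nothing about profanity
user = {"profanity": True, "safety_caveats": True}

permissive = resolve(operator, user, silence_means_allowed=True)
strict = resolve(operator, user, silence_means_allowed=False)
print(permissive["profanity"], strict["profanity"])  # True False
```

Both readings are defensible, and every explicit rule is honored in each case. The difference comes entirely from the unstated default, which is the kind of gap that tends to surface only in deployment.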

The spec’s reliance on the “thoughtful senior OpenAI employee” heuristic as a mental model for evaluating responses is pragmatic but also bounded by the assumptions and blind spots of that hypothetical employee. It is a useful calibration device, not a decision procedure.

None of this is a criticism of the approach so much as a recognition that behavioral governance at this level of complexity is genuinely unsolved. The Model Spec represents a serious and unusually transparent attempt to articulate the problem, even if it cannot fully solve it. Building systems that reliably behave well across an enormous range of deployment contexts and adversarial inputs is a hard engineering problem, and treating it as one, rather than as a pure policy problem, is probably the right framing.