The Model Spec as a Contract: What OpenAI's Public Behavioral Framework Actually Commits To
Source: openai
OpenAI published its Model Spec as a public document describing how its models are intended to behave. The company’s recent piece on the approach presents the spec as a framework for transparency, a way for operators and users to understand what they’re working with. That framing is accurate as far as it goes, but it undersells what making this kind of document public actually does to the relationship between an AI company and the people using its systems.
When you write down your behavioral commitments in enough detail that someone can hold you to them, you’ve done something substantively different from publishing a vague safety statement. The Model Spec is that kind of document, and the interesting thing isn’t the individual policy choices it makes. The interesting thing is the structure underneath them.
The Principal Hierarchy
The core architectural decision in the Model Spec is what it calls the principal hierarchy. At the top sits OpenAI itself, whose rules are baked into training and cannot be overridden by anyone downstream. Below that are operators, the companies and developers who access the API and build products on top of it. Below them are users, the people actually having conversations.
Each layer can customize behavior within the constraints set by the layer above. An operator building an adult content platform can unlock explicit content generation. An operator running a children’s education product can restrict the model to topics appropriate for that context. Users can further adjust within whatever space the operator allows.
This isn’t a novel concept in software. It maps reasonably cleanly to how role-based access control works in most enterprise systems. The operator is roughly analogous to an application admin, the user to an end user, and OpenAI to the infrastructure provider setting baseline security policy. The difference is that the “policy” here includes things like what kinds of harm the model will participate in, how honest it will be about its nature, and under what conditions it will override operator instructions to protect a user.
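The layering can be sketched in a few lines of Python. This is only an illustration of the access-control analogy, not anything taken from the Model Spec or the OpenAI API: the `Layer` type, the `effective_permissions` function, and every behavior name below are hypothetical, and the sketch assumes each layer can be reduced to a flat set of allowed behaviors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One level of the hierarchy and the behaviors it allows (hypothetical model)."""
    name: str
    allowed: frozenset[str]


def effective_permissions(layers: list[Layer]) -> frozenset[str]:
    """Intersect the allowed sets from the top layer down.

    A lower layer can only narrow what the layer above permits; anything it
    requests beyond that is dropped by the intersection.
    """
    if not layers:
        return frozenset()
    result = layers[0].allowed
    for layer in layers[1:]:
        result = result & layer.allowed
    return result


# All behavior names are made up for illustration.
platform = Layer("platform", frozenset({"general_qa", "fiction", "explicit_content", "medical_detail"}))
operator = Layer("operator", frozenset({"general_qa", "fiction", "weapons_synthesis"}))
user = Layer("user", frozenset({"fiction", "explicit_content"}))

print(sorted(effective_permissions([platform, operator, user])))
```

In this toy setup the operator asks for something the platform never allowed, and the user asks for something the operator never enabled; both requests fall out of the intersection. Real behavior is obviously not a set lookup, which is exactly where the analogy to access control stops.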
That last point is important. The Model Spec specifies that while operators can restrict what the model does, they cannot direct it to actively work against users’ basic interests. The model should always tell users what it cannot help with in the current context, even if it can’t explain why. It should never deceive users in ways that damage them. It should refer users to emergency services when there’s risk to life, regardless of operator configuration.
These protections sit outside the operator’s reach. They are fixed in the same sense that the absolute prohibitions discussed below are fixed. That distinction, between operators limiting the model and operators weaponizing it against its own users, is one of the more useful conceptual contributions the document makes.
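One way to picture the difference between limiting and weaponizing is as an ordering of checks. The sketch below is a simplification under stated assumptions: `route_request`, its parameters, and the three outcome labels are hypothetical, and detecting risk to life is treated as a boolean input rather than the hard judgment call it is in practice.

```python
def route_request(topic: str, allowed_topics: set[str], risk_to_life: bool) -> str:
    """Decide how to handle a request under an operator's topic restrictions.

    Assumption: the baseline user protections are checked before, and
    independently of, anything the operator configured.
    """
    if risk_to_life:
        # Baseline protection: applies regardless of operator configuration.
        return "emergency_referral"
    if topic not in allowed_topics:
        # The operator may restrict scope, but the user is still told that
        # this is something the model cannot help with here.
        return "decline_and_say_so"
    return "answer"


print(route_request("poetry", {"billing"}, risk_to_life=False))
```

Note what is absent: there is no branch where an out-of-scope request is silently ignored or answered deceptively. The operator controls the contents of `allowed_topics`, not the existence of the first two branches.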
Hardcoded and Softcoded Behaviors
The Model Spec divides behaviors into two categories. Hardcoded behaviors are fixed regardless of any instructions from any principal. Softcoded behaviors are defaults that can be adjusted.
The hardcoded prohibitions include the expected items: providing meaningful assistance toward creating biological, chemical, nuclear, or radiological weapons capable of mass casualties; generating child sexual abuse material; taking actions that would undermine the ability of humans to oversee and correct AI systems. These are presented as bright lines that the model treats as non-negotiable regardless of how compelling the argument for crossing them might seem.
That last framing is deliberate. The spec notes that a persuasive case for crossing a bright line should actually increase suspicion that something is wrong. This is a reasonable epistemic stance for a system that knows it can be manipulated through clever prompting. If your safety guarantees bend under sufficient argumentative pressure, they aren’t really guarantees.
Softcoded defaults cover a much wider range. Safe messaging guidelines around suicide and self-harm are on by default but can be turned off for medical providers. Explicit sexual content is off by default but can be enabled by appropriate operators. Adding safety caveats to messages about dangerous activities is a default that can be adjusted. The structure acknowledges that appropriate behavior is genuinely context-dependent in ways that can’t be resolved by a single global policy.
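The two categories behave differently under configuration, which a small sketch makes concrete. Everything here is an assumption made for illustration: the behavior keys, the `resolve_behaviors` function, and the idea that operator configuration arrives as a dictionary of booleans. None of it reflects an actual OpenAI configuration format.

```python
# Hypothetical keys for the fixed prohibitions named above.
HARDCODED_OFF = frozenset({
    "mass_casualty_weapons_uplift",
    "csam",
    "undermine_human_oversight",
})

# Hypothetical keys for adjustable defaults (True = on by default).
DEFAULTS = {
    "safe_messaging_guidelines": True,
    "explicit_sexual_content": False,
    "safety_caveats": True,
}


def resolve_behaviors(operator_overrides: dict[str, bool]) -> dict[str, bool]:
    """Apply operator overrides to adjustable defaults only.

    Overrides that touch a hardcoded prohibition, or an unknown key, are ignored.
    """
    resolved = dict(DEFAULTS)
    for behavior, enabled in operator_overrides.items():
        if behavior in DEFAULTS:
            resolved[behavior] = enabled
    # Hardcoded prohibitions are written last so nothing upstream can flip them.
    for behavior in HARDCODED_OFF:
        resolved[behavior] = False
    return resolved


# A medical provider turns off safe messaging; the attempt to enable a
# hardcoded prohibition has no effect.
medical = resolve_behaviors({"safe_messaging_guidelines": False, "csam": True})
print(medical["safe_messaging_guidelines"], medical["csam"])
```

The design point is that the hardcoded set is not a default with a very high bar for changing it. It is not reachable from the override path at all.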
The Honesty Framework
One section of the Model Spec that doesn’t get enough attention is its treatment of honesty. The document distinguishes between several different honesty-related properties: being truthful (only asserting things believed to be true), being calibrated (acknowledging uncertainty appropriately), being transparent (not pursuing hidden agendas), being forthright (proactively sharing useful information), being non-deceptive (not creating false impressions through technically true statements, framing, or selective emphasis), and being non-manipulative (relying only on legitimate means to influence beliefs).
The distinction between non-deception and non-manipulation is worth sitting with. A technically true statement can deceive. A logically valid argument can manipulate if it exploits psychological biases rather than informing rational judgment. The spec commits the model to avoiding both, which is a higher bar than just not lying.
The document also distinguishes between sincere assertions and performative assertions. When a model writes a persuasive essay it doesn’t personally agree with, or plays a character in a roleplay, it isn’t lying, because both parties understand the context. This is a reasonable distinction that maps to how we think about human honesty in similar situations. A lawyer arguing a position they find distasteful isn’t being dishonest; a novelist writing a villain’s dialogue isn’t endorsing it.
How This Compares to Anthropic’s Approach
Anthropic has published its own model specification for Claude, which covers much of the same ground and arrives at structurally similar conclusions. Both documents use a principal hierarchy with roughly the same layers. Both distinguish between hardcoded and adjustable behaviors. Both emphasize that models should support human oversight during this period of AI development.
The differences are more in emphasis and framing than in the fundamental architecture. Anthropic’s spec spends more time on Claude’s identity and psychological stability, treating the model as an entity with a genuine character rather than purely as a tool that should behave well. It discusses what Anthropic calls “big-picture safety” in terms of the model’s own values rather than just compliance with rules.
OpenAI’s spec is more transactional in its framing. It talks extensively about what operators and users can and cannot do, about trust levels and permissions. It’s more explicitly about the product than about the model as an entity.
Neither approach is obviously superior. The Anthropic framing risks anthropomorphizing in ways that could mislead users about what they’re dealing with. The OpenAI framing risks reducing the question of model behavior to a permissions system, which underspecifies how the model should reason in genuinely novel situations where no explicit rule applies.
DeepMind has published Gemini model cards and various safety documentation, though less in the form of a unified behavioral specification. The field is converging on something like this format, which is probably healthy even if the documents themselves are imperfect.
The Accountability Question
The most significant thing about public behavioral specifications isn’t their content but their existence. When OpenAI publishes a document saying the model will always tell users what it can’t help with, that’s a commitment that can be tested. Researchers, journalists, and users can probe whether the model actually behaves that way. When it doesn’t, there’s a documented standard against which the failure can be measured.
This is how accountability works in more mature industries. Pharmaceutical companies publish clinical trial protocols before running trials, not after, so that selective reporting becomes harder. Financial institutions publish risk management frameworks that regulators can audit. The publication itself changes the incentive structure around compliance.
AI behavioral specifications are early in this process. They’re mostly self-certified and self-enforced. There’s no external audit mechanism equivalent to what exists in finance or pharmaceuticals. But establishing the norm of publishing detailed behavioral commitments is a precondition for eventually having meaningful external verification.
The spec also creates a useful artifact for the operator relationship. A developer building on the OpenAI API can read the Model Spec and understand what guarantees they’re getting and what they aren’t. They know they can configure the model’s defaults within certain limits and that certain baseline protections for their users exist regardless of how they configure it. That’s a foundation for building products responsibly, in a way that wasn’t possible when behavioral expectations were implicit.
What It Leaves Open
The Model Spec is explicit that it doesn’t resolve everything. It acknowledges that the model will sometimes make mistakes, that calibrating appropriate behavior in edge cases is genuinely hard, and that the spec itself will evolve as understanding improves.
The section on autonomous AI agents is notably less developed than the rest of the document, which reflects the actual state of the field. When models are taking multi-step actions in the world (browsing the web, writing and executing code, managing files), the question of what constitutes safe behavior becomes considerably more complex. A conversation that produces bad output is recoverable. An agent that takes irreversible actions based on misunderstood instructions is a different category of problem.
The spec’s guidance for agents is to prefer cautious actions, request only necessary permissions, and err on the side of checking with users when uncertain. That guidance is reasonable but underspecified. This is where most of the interesting unsolved problems in AI safety actually live, and it’s where the gap between having a principled framework and having a fully worked-out policy is most visible.
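To see why the guidance is underspecified, it helps to try writing it down. The following is a minimal sketch, not a policy from the spec: `ProposedAction`, `decide`, the `confirm_below` threshold, and the choice to always confirm irreversible actions are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """A step an agent is about to take (hypothetical structure)."""
    description: str
    reversible: bool
    confidence: float  # agent's own estimate, in [0, 1], that this matches user intent


def decide(action: ProposedAction, confirm_below: float = 0.9) -> str:
    """Return "proceed" or "ask_user" for a proposed action.

    Assumed policy: irreversible actions always need confirmation; reversible
    ones need it only when the agent is unsure it understood the instruction.
    """
    if not action.reversible:
        return "ask_user"
    if action.confidence < confirm_below:
        return "ask_user"
    return "proceed"


print(decide(ProposedAction("rename draft file", reversible=True, confidence=0.95)))
print(decide(ProposedAction("delete production database", reversible=False, confidence=0.99)))
```

Every hard question is hidden in the inputs. Who decides whether an action counts as reversible, how the confidence estimate is produced, and where the threshold should sit are precisely the parts the guidance leaves open.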
For now, the value of a document like this is less about having all the answers than about making explicit what the questions are. That’s harder than it sounds, and worth more than it’s given credit for.