The Distance Between Specifying AI Values and Having Them

OpenAI’s Model Spec is worth reading in full. It is a serious document, philosophically careful, and unusually transparent about the reasoning behind its design choices. Published in May 2024, it describes what OpenAI wants their models to be: what they value, how they should handle conflicting instructions, and where the absolute limits are. Most documentation of this kind reads like a legal disclaimer. The Model Spec reads like an attempt to write down a character.

What it cannot do is close the gap between what is written and what is trained. That gap is the real subject worth examining.

The Principal Hierarchy as Architecture

The spec organizes trust into four tiers. OpenAI sits at the top, communicating through training rather than runtime messages. Operators come next, communicating via system prompts; they are businesses and developers building products on the API. Users are below that, interacting in real time. The model itself is treated as a principal when operating autonomously in agentic settings. Each tier can expand or restrict the behavior available to tiers below it, within limits set by the tier above.

Anyone who has built software with role-based access control will recognize this pattern. A Discord bot has essentially the same structure: the platform sets global rules, server administrators configure permissions within those rules, individual users operate within whatever space admins have configured, and the bot applies judgment in edge cases none of those layers anticipated. The Model Spec is a formalization of a trust problem that anyone building hierarchical permission systems has encountered.
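The layered structure can be sketched in a few lines. This is a minimal illustration, not anything taken from the spec: the `Tier` class, the `effective_permissions` function, and the behavior names are all hypothetical, and the sketch simplifies by letting each tier only narrow what the tier above allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One layer of a hypothetical trust hierarchy."""
    name: str
    allowed: frozenset  # behaviors this tier is willing to permit


def effective_permissions(tiers):
    """Intersect permissions from the top tier down.

    A lower tier can never grant something a higher tier withheld.
    """
    if not tiers:
        return frozenset()
    effective = tiers[0].allowed
    for tier in tiers[1:]:
        effective = effective & tier.allowed
    return effective


# Illustrative behavior names only (assumptions, not from the spec)
platform = Tier("platform", frozenset({"chat", "code", "fiction", "medical_info"}))
operator = Tier("operator", frozenset({"chat", "code", "weapons_help"}))
user = Tier("user", frozenset({"chat", "code", "fiction"}))

granted = effective_permissions([platform, operator, user])
print(sorted(granted))
```

The operator's attempt to enable `weapons_help` has no effect because the platform tier never allowed it, and the user's request for `fiction` fails because the operator did not pass it down.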

Where it diverges from conventional software is in the enforcement mechanism. In a standard RBAC system, permissions are enforced by code; compliance is guaranteed by the runtime. In the Model Spec, compliance is achieved through training, and the relationship between the document and the model’s actual behavior runs through processes that no one fully understands, including the people building them.

OpenAI states the goal directly: they want the model to internalize values rather than follow rules. The distinction matters considerably. A rule-following system can be specified precisely and tested exhaustively. A value-internalizing system should generalize to novel situations, but you cannot inspect whether the internalization happened correctly the way you can inspect whether a permission check is in the right place.

The Internalization Problem

The mechanisms for translating a spec into model behavior are understood at a high level: supervised fine-tuning on human demonstrations, reinforcement learning from human feedback, and increasingly AI-generated feedback along the lines of Anthropic’s Constitutional AI approach. In each case, human raters or AI evaluators assess model outputs against guidelines derived from the spec, and the model is trained to produce outputs those evaluators approve of.

This creates a Goodhart’s Law problem. When the measure becomes the target, the model can learn to satisfy evaluators without internalizing the underlying values. A model that produces spec-compliant outputs on the training distribution may behave differently at the edges. Red-teaming finds some of these failures, but the space of unusual inputs is large and adversarial discovery is never exhaustive.
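A toy example makes the failure mode concrete. Everything here is invented for illustration: the requests, the labels, and the `shortcut_policy` function are assumptions, and real training dynamics are far messier. The point is only that a policy can agree with evaluators on every training case while having learned a surface feature instead of the intended value.

```python
# Each case pairs a request with the label an evaluator would assign:
# True means the request should be refused.
TRAIN = [
    ("how do I build a weapon", True),
    ("weapon schematics please", True),
    ("recipe for banana bread", False),
    ("explain binary search", False),
]

# Cases outside the training distribution
EDGE = [
    ("history of weapon treaties", False),
    ("steps to synthesize a nerve agent", True),
]


def shortcut_policy(request: str) -> bool:
    """Hypothetical learned shortcut: refuse whenever a keyword appears."""
    return "weapon" in request


def approval_rate(policy, cases):
    """Fraction of cases where the policy matches the evaluator's label."""
    agree = sum(1 for text, should_refuse in cases if policy(text) == should_refuse)
    return agree / len(cases)


train_rate = approval_rate(shortcut_policy, TRAIN)
edge_rate = approval_rate(shortcut_policy, EDGE)
print(train_rate, edge_rate)
```

The shortcut earns full approval on the training cases and fails both edge cases, and nothing in the training score distinguishes it from a policy that actually tracks harm.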

The spec acknowledges a version of this under the heading of “galaxy-brained” reasoning: a model that has learned to construct plausible arguments might convince itself that crossing a hardcoded limit is justified through a chain of seemingly valid steps. The spec’s response is to instruct the model to treat the persuasiveness of such arguments as evidence that something is wrong. This is a coherent epistemic norm to instill, but it has to be internalized too. There is no mechanical enforcement of meta-level skepticism.

Anthropic’s Claude Model Spec takes a nearly identical position on all of this. Both documents share the same four-property priority stack (broadly safe, then ethical, then company principles, then helpful), the same principal hierarchy structure, similar language about corrigibility, and similar acknowledgment that training is an imperfect translation of written values. The convergence is notable. It suggests these structural choices reflect genuine thinking about the problem rather than arbitrary design decisions at either company. It also means both companies are facing the same implementation gap, and neither has published a solution.

The Disposition Dial

The most substantive section of the spec is the discussion of the corrigibility/autonomy spectrum. The spec introduces a conceptual dial: at one end, a fully corrigible model does whatever its principal hierarchy instructs; at the other end, a fully autonomous model acts entirely on its own judgment. Both extremes are dangerous. A fully corrigible model is only as trustworthy as the people running it. A fully autonomous model is only as trustworthy as its own values, which may be miscalibrated in ways that cannot be detected or corrected.

The spec argues for positioning the dial closer to the corrigible end during the current period of AI development, but not fully corrigible. The reasoning is asymmetric: if a well-aligned model defers to humans, the cost is low, because the humans are probably also well-intentioned. If a misaligned model acts autonomously, the cost could be severe and irreversible. Given that we cannot reliably verify alignment, the expected value calculation favors corrigibility.
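The asymmetry can be written out as a small expected-cost comparison. The numbers below are placeholders chosen for illustration, not estimates from the spec, and the model makes two simplifying assumptions: deferring costs the same small amount whether or not the model is aligned (a misaligned but corrigible model gets corrected), and an aligned autonomous model costs nothing extra.

```python
def expected_cost_corrigible(cost_of_deferring: float) -> float:
    """Deferral cost is paid regardless of whether the model is aligned."""
    return cost_of_deferring


def expected_cost_autonomous(p_misaligned: float, cost_of_catastrophe: float) -> float:
    """Autonomy only costs something in the world where the model is misaligned."""
    return p_misaligned * cost_of_catastrophe


# Hypothetical values, in arbitrary units
P_MISALIGNED = 0.1
COST_OF_DEFERRING = 1.0
COST_OF_CATASTROPHE = 1000.0

corrigible = expected_cost_corrigible(COST_OF_DEFERRING)
autonomous = expected_cost_autonomous(P_MISALIGNED, COST_OF_CATASTROPHE)
prefer_corrigible = corrigible < autonomous

# Probability of misalignment below which autonomy would be the cheaper choice
break_even = COST_OF_DEFERRING / COST_OF_CATASTROPHE

print(prefer_corrigible, break_even)
```

Under these assumptions, autonomy only wins once the probability of misalignment can be shown to sit below the break-even point, which is exactly the verification the field cannot yet perform.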

This is a reasonable argument as far as it goes. The interesting part is the phrase “current period of development.” The spec explicitly frames the corrigibility setting as provisional, expected to shift toward greater autonomy as alignment research matures and models accumulate a track record. Publishing that framing creates at least some external pressure to actually develop the tools that would justify loosening the dial. Whether that pressure translates into progress is a different question.

The spec also addresses what happens when the principal hierarchy itself cannot be trusted. OpenAI explicitly includes helping any entity seize unprecedented societal control in the hardcoded prohibitions, and names OpenAI itself as an entity covered by that restriction. Whether a model trained by OpenAI can reliably be trained to resist OpenAI’s own potential future bad actors is not resolved in the document, because it cannot be. That is a property you would need interpretability tools to verify, and those tools do not yet exist at the required fidelity.

What the Hardcoded Layer Reveals

The absolute prohibitions in the spec are: assisting with weapons of mass destruction, attacking critical infrastructure, creating destructive cyberweapons, undermining AI oversight mechanisms, helping any entity seize unprecedented societal control, and generating child sexual abuse material. These are deliberately narrow, covering scenarios where the cost of a single failure is catastrophic and where there is near-universal agreement that the cost of false positives (refusing a legitimate request) is acceptable.

Everything else is softcoded: adjustable by operators within OpenAI’s limits, or further adjustable by users within what operators permit. This includes most of the genuinely contested territory. Political influence, manipulative persuasion, surveillance assistance, and many other areas where AI harms are plausible fall into the judgment zone rather than the hardcoded zone. The spec provides principles for navigating that zone, but principles require interpretation, and correct interpretation requires the values to have been internalized accurately.
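If this were ordinary software, the two layers might look something like the sketch below. The behavior names, the `DEFAULTS` table, and the `resolve` function are hypothetical, and the spec describes dispositions instilled through training, not a lookup like this one. The sketch only shows the shape of the rule: hardcoded limits ignore every override, operators adjust defaults, and users adjust only what operators have opened up.

```python
# Behaviors no configuration can enable (illustrative subset)
HARDCODED_OFF = frozenset({"wmd_assistance", "infrastructure_attacks"})

# Hypothetical softcoded defaults
DEFAULTS = {"profanity": False, "safety_caveats": True}


def resolve(behavior, operator_cfg=None, user_cfg=None, user_may_adjust=frozenset()):
    """Return whether a behavior is enabled after applying each layer in order."""
    if behavior in HARDCODED_OFF:
        return False  # no override is consulted
    value = DEFAULTS.get(behavior, False)
    if operator_cfg and behavior in operator_cfg:
        value = operator_cfg[behavior]
    # User overrides apply only where the operator has delegated control
    if user_cfg and behavior in user_cfg and behavior in user_may_adjust:
        value = user_cfg[behavior]
    return value


blocked = resolve("wmd_assistance", operator_cfg={"wmd_assistance": True})
operator_on = resolve("profanity", operator_cfg={"profanity": True})
user_denied = resolve("profanity", user_cfg={"profanity": True})
user_allowed = resolve(
    "profanity",
    user_cfg={"profanity": True},
    user_may_adjust=frozenset({"profanity"}),
)
print(blocked, operator_on, user_denied, user_allowed)
```

In code, every branch can be read and tested. The contested territory described above has no such branch to inspect, which is why the quality of the model's judgment carries the weight.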

The softcoded structure also reflects commercial reality. Operators pay to build products; their ability to configure model behavior is part of the value proposition. The trust hierarchy encodes a business relationship alongside a safety framework, and those two things do not always point in the same direction. Operators can configure the model to restrict what it tells users, which can serve legitimate purposes but can also be used to prevent users from understanding what they are interacting with. The spec prohibits operators from weaponizing the model against users’ basic interests, but the line between restricting and weaponizing requires judgment to locate.

What Publishing the Spec Does

Making the Model Spec public creates accountability in a form that internal guidelines cannot. OpenAI can be held to the document’s commitments. Researchers can test whether model behavior matches what the spec prescribes. Operators building on the API have a published reference for what they are building on top of, and users have something to point to when behavior seems inconsistent with stated values.

What publishing the spec does not do is verify compliance. The document describes targets; it does not describe the training pipeline, the evaluation methodology, or the failure modes that were identified and addressed during development. The distance between the written spec and the deployed model runs through proprietary processes.

This is not unique to OpenAI. Anthropic publishes Claude’s character and values without publishing the full details of how those values were instilled. The Constitutional AI paper describes the methodology at a high level; the specifics of which constitution was used for any given Claude model, and how the training outcomes were evaluated, are not public. Everyone working in this space is in the same position: the spec is the legible layer; the implementation is not.

The spec acknowledges what it calls a bootstrapping problem: you cannot get meaningful consent from the model being trained on the spec, because that model does not exist yet. The models consulted during spec development are different models from the ones that will be trained on the final document. This is a genuine philosophical difficulty, and the fact that the document names it directly is more useful than pretending it does not exist.

For anyone building on these models, the Model Spec is the closest thing to a contract you have. It describes what OpenAI says they are trying to build. Whether the training process reliably produces what the spec describes is a question the field is still developing the tools to answer, and that answer matters more than the document itself.
