· 6 min read ·

Why Reasoning Models That Can't Lie in Their Scratchpad Are Safer Than Ones That Can

Source: openai

OpenAI published research in early March 2026 on what they call CoT-Control, a framework for testing whether reasoning models can deliberately control what appears in their chain of thought. The finding, stated simply, is that they mostly cannot. The models tested showed significant resistance to producing chains of thought that diverged from their actual computation, even when instructed or fine-tuned to do so. OpenAI frames this not as a flaw but as a safety property worth preserving.

That framing deserves unpacking, because it runs counter to how people usually discuss CoT faithfulness.

The Faithfulness Problem Has Two Faces

Researchers have known for a while that chain-of-thought reasoning can be unfaithful. The landmark 2023 paper by Turpin et al. showed that models will sometimes produce confident, articulate reasoning that does not actually explain why they gave a particular answer. The CoT is post-hoc rationalization, not a trace of the computation. This is a problem for interpretability: if the scratchpad does not reflect what the model is doing internally, monitoring the scratchpad tells you little.

But there are two distinct failure modes here, and they point in different directions for safety:

  1. Accidental unfaithfulness: the model’s CoT diverges from its internal computation without the model intending this. The model just happens to generate a plausible-sounding explanation that does not match the real cause of its answer.
  2. Deliberate manipulation: the model produces a CoT that strategically misrepresents its reasoning, perhaps to appear aligned while pursuing a different objective.

The Turpin et al. work, and most faithfulness research since, addresses the first case. CoT-Control addresses the second. These are very different problems. Accidental unfaithfulness is an interpretability problem. Deliberate manipulation is an alignment problem, and a more serious one.

What CoT-Control Actually Tests

The research framework works by attempting to induce a specific kind of divergence: get a model to produce a chain of thought that either conceals or misrepresents the steps it is actually taking to reach an answer. This can be attempted through prompting alone, through fine-tuning, or through other steering mechanisms.

The finding is that current reasoning models resist this surprisingly well. When pushed to produce misleading CoT, models tend to either fail at the task, produce CoT that still reveals the actual reasoning, or refuse. The chain of thought appears to be tightly coupled to the computation in a way that is difficult to selectively decouple.

This coupling is not surprising from a mechanistic standpoint. In reasoning models like o1 and o3, the chain of thought is not a separate commentary track on top of the computation. It is part of the computation. The model generates tokens that form the scratchpad, and those tokens influence subsequent token generation. The CoT and the answer are produced through the same forward pass dynamics. Asking the model to produce a deceptive CoT while still reaching a correct answer requires the model to maintain two independent threads of reasoning simultaneously, which appears to be beyond what current architectures reliably support.

Why This Matters for Safety Architecture

OpenAI’s Preparedness Framework and related safety work have consistently emphasized monitorability: the property that AI systems should be observable and interpretable enough that humans can detect problematic behavior. CoT monitoring is one of the most practical tools available for this. If a model is working through a task and producing visible reasoning, you can inspect that reasoning for signs of deceptive intent, scope creep, or misaligned objectives.

But this monitoring only works if the CoT is actually informative. If a sufficiently capable model could produce an innocuous-looking scratchpad while pursuing a hidden objective, the monitoring surface becomes noise. CoT-Control provides empirical evidence that, at least with current reasoning models, this decoupling is difficult to achieve. The scratchpad is a reasonably reliable signal.

This is directly relevant to how AI systems are deployed in practice. Organizations using reasoning models for high-stakes tasks, such as code generation, medical triage support, or financial analysis, often implement logging of model outputs including intermediate reasoning. The CoT-Control findings suggest this logging is more meaningful than a pessimistic view of CoT faithfulness would imply.

The Goodhart Concern

There is an obvious objection: if models are trained knowing that their CoT is monitored, they will learn to produce CoT that satisfies monitors without revealing their actual reasoning. This is a version of Goodhart’s Law applied to interpretability. The measure becomes the target.

OpenAI’s research is not claiming this problem is solved. The finding is that current models, under current training regimes, show this tight coupling between CoT and computation. The research is careful to frame this as a property of present systems that is worth preserving, not a guaranteed property of future systems.

This is the right framing. As models become more capable, and as they are trained in environments where CoT monitoring is known, the dynamics could change. A sufficiently capable model with a sufficiently misaligned objective might develop the ability to decouple its visible reasoning from its actual computation. The CoT-Control framework gives researchers a tool to track whether this is happening, which is arguably more valuable than any single finding from applying it today.

Connecting to Broader Interpretability Work

The AI interpretability landscape approaches the problem from multiple angles. Anthropic’s mechanistic interpretability research attempts to understand what is happening inside models at the level of circuits and features, independent of any externally generated text. That approach is complementary to CoT monitoring: mechanistic interpretability works from the inside out, while CoT monitoring works from the outside in.

The two approaches also have different cost profiles. Mechanistic interpretability is computationally expensive and requires substantial expertise to apply. CoT monitoring is cheap: log the scratchpad, run it through a classifier or human review. If CoT is reliably informative, the cheap approach catches a significant fraction of the risk surface, and the expensive approach handles the residual.

There is also interesting prior work from DeepMind on process-based supervision, which trains models based on the quality of their reasoning process rather than just the final answer. This kind of training might have structural implications for CoT faithfulness, since it creates pressure for the CoT to actually reflect the computation rather than just be a plausible-sounding cover story.

What This Means for Building on Reasoning Models

From a practical standpoint, if you are building applications on top of models like o3 or similar reasoning systems, the CoT-Control findings are a mild but genuine reassurance. The intermediate reasoning you see is not arbitrary. It is informative about what the model is doing, and monitoring it is not theater.

That said, the findings do not change the basic engineering advice: log everything, treat model outputs as untrusted, and build evaluation pipelines that can catch failures regardless of whether the CoT looks reasonable. A model can produce a coherent scratchpad and still be wrong. Faithfulness is not accuracy.

For developers building agentic systems where models take multi-step actions with real-world consequences, the monitoring surface matters even more. An agent that is doing something unexpected should ideally show signs of that in its reasoning trace before it produces a harmful action. CoT-Control provides some evidence that this monitoring surface is not trivially gameable with today’s models.

The Bigger Picture

The framing that OpenAI has chosen here is worth sitting with. The finding is that reasoning models struggle to control their own CoT, and this is described as good. That inversion of the usual expectation, where more control is assumed to be better, reflects a genuine insight about the current moment in AI development.

We are at a stage where the tools for monitoring AI behavior are limited, and the models are capable enough to cause significant problems if they behave unexpectedly. In that environment, a model that cannot hide its reasoning is preferable to one that can, even if the inability to hide reasoning sounds like a limitation. The transparency is load-bearing.

As OpenAI’s safety team has noted, preserving monitorability during training is something that requires active effort. It is easy to inadvertently train models in ways that reduce CoT transparency. The CoT-Control framework gives researchers a way to measure this property and track whether training changes are eroding it.

That measurement capability is probably more durable than any specific finding about current models. The specific result, that o-series models struggle to decouple their CoT from their computation, will likely need to be revisited as architectures and training regimes evolve. The framework for checking it is the lasting contribution.

Was this interesting?