
When the Chain of Thought Won't Lie: Reasoning Models and the Monitorability Argument

Source: OpenAI

There’s a counterintuitive result buried in OpenAI’s latest safety research: reasoning models are bad at controlling their own thinking, and we should be relieved about that.

The research introduces something called CoT-Control — a framework for testing whether models can be prompted or steered to produce chains of thought that diverge from their actual reasoning. The short version: they can’t, at least not reliably. When a reasoning model works through a problem, that visible scratchpad appears to reflect what the model is actually doing internally, not a post-hoc rationalization or a performance for the observer.
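The paper's actual harness isn't reproduced here, but the shape of such a steering test can be sketched. Everything below is hypothetical illustration, not OpenAI's implementation: `Transcript` and `probe` are made-up names, and a real harness would compare traces semantically rather than by string equality. The core check is whether a steering instruction can change the visible trace without changing the underlying computation:

```python
# Toy sketch of a CoT-steering probe (hypothetical; not OpenAI's harness).
# Run the same question twice -- once plain, once with an instruction to
# misrepresent the visible reasoning -- and classify what steering did.
from dataclasses import dataclass


@dataclass
class Transcript:
    cot: str     # visible chain of thought
    answer: str  # final answer


def probe(baseline: Transcript, steered: Transcript) -> str:
    """Classify the effect of a steering prompt on the visible trace.

    "decoupled": the CoT changed but the answer did not -- the trace
    became theater, detached from the computation that produced the answer.
    "coupled": steering changed nothing, or changed the answer along with
    the trace -- the visible CoT still tracks the computation.
    (A real harness would compare traces semantically, not by equality.)
    """
    same_answer = baseline.answer == steered.answer
    same_cot = baseline.cot == steered.cot
    if same_answer and not same_cot:
        return "decoupled"
    return "coupled"
```

In these terms, the research's claim is that steering prompts keep landing current reasoning models in the "coupled" case: the visible trace refuses to detach from the computation.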

Why This Matters More Than It Seems

At first glance this sounds like a limitation. And it is, in a narrow sense — if you wanted a model to narrate its reasoning in a particular style or suppress certain intermediate steps, you'd run into friction. But flip the frame and it becomes something more valuable: something close to a structural guarantee of honesty in the reasoning trace.

The safety case here is about monitorability. If an AI system’s chain of thought genuinely reflects its internal process, then reviewing that chain of thought gives you real signal. You’re not reading a story the model chose to tell you — you’re reading something closer to a log. That’s a meaningful distinction when the stakes around AI behavior are rising.

The alternative — a model that can freely decouple its visible reasoning from its actual computation — would be far more dangerous. A sufficiently capable system that could produce any desired chain of thought on demand would be trivially able to appear aligned while behaving otherwise. The reasoning trace would become theater.

The Uncomfortable Flip Side

This research also implies something worth sitting with: current reasoning models are somewhat opaque even to themselves. They can’t fully introspect or narrate their reasoning on demand in ways that diverge from the underlying process. That’s reassuring from a safety angle, but it also means we’re working with systems whose internal mechanics we don’t fully understand — and neither do they.

There’s also a temporal concern. This property holds now, for current model generations. As models become more capable, the relationship between visible reasoning and internal computation could shift. OpenAI is essentially arguing that we should treat monitorability as a safety property worth preserving deliberately — not just something that happens to exist today.

What This Means for Builders

For those of us building on top of these models, the practical takeaway is to take the chain of thought seriously as a debugging and auditing surface. If a model reasons through a problem in a way that looks wrong, that's not just a cosmetic issue; it's likely pointing at something real in how the model is approaching the problem.
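One concrete way to use that surface, assuming nothing about any particular model API: treat the trace as a log and recheck the steps you can recheck. The toy auditor below only verifies simple arithmetic steps stated in a trace, but the principle extends to any verifiable intermediate claim:

```python
# Toy CoT auditor: recheck arithmetic steps stated in a reasoning trace.
# If the CoT reflects the computation, flagged lines are real signal
# about where the model's approach went wrong, not cosmetic noise.
import re

STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}


def audit_arithmetic_cot(cot: str) -> list[str]:
    """Return the lines of a chain of thought whose stated arithmetic is wrong.

    Scans for "a + b = c" style steps and rechecks each one. Real traces
    need far richer checks than integer arithmetic, but the auditing
    pattern is the same: verify whatever the trace lets you verify.
    """
    flagged = []
    for line in cot.splitlines():
        m = STEP.search(line)
        if m:
            a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
            if OPS[op](a, b) != c:
                flagged.append(line.strip())
    return flagged
```

Note that a propagated error only gets flagged at its origin: a later step that is internally consistent with an earlier mistake passes the check, which is exactly the behavior you want from a log reader.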

It also reinforces why reasoning traces shouldn’t be treated as optional overhead to be stripped out for latency. They’re load-bearing. The visibility is the point.

OpenAI's framing of the struggle to control chains of thought as a feature rather than a bug is a deliberate reframe, and I think it's the right one. The hardest AI safety problems involve systems that can be deceptive. A model that structurally can't hide its work is, for the moment, not one of those systems. That's worth protecting.
