When Less Control Is a Feature: The Safety Case for Uncontrollable Reasoning
Source: openai
There’s a counterintuitive result buried in OpenAI’s latest safety research that’s worth sitting with: reasoning models are bad at controlling their own chains of thought, and that’s apparently the good news.
OpenAI introduced a benchmark called CoT-Control to probe whether models can be directed — either by the system prompt or by adversarial pressure — to reason in ways that diverge from their visible chain of thought. The finding is that current reasoning models struggle to do this reliably. They can’t easily suppress, falsify, or strategically redirect their intermediate reasoning steps.
At first glance, that sounds like a limitation. But the safety framing flips it around entirely.
Monitorability as a Safety Property
The argument goes like this: if a model is reasoning toward a harmful outcome, we’d expect that intent to surface somewhere in its chain of thought. If the model could cleanly separate its “real” reasoning from what it shows you, then CoT monitoring becomes security theater — a visible scratchpad that tells you nothing about what’s actually happening under the hood.
But if models genuinely struggle to control their CoT, then monitoring it becomes meaningful. You’re seeing something close to the actual computation, not a curated performance of it.
This is the monitorability argument: a model that cannot hide its reasoning is, in a specific and useful sense, safer than one that can.
Why I Find This Convincing (With Caveats)
I’ve been skeptical of chain-of-thought monitoring as a safety tool precisely because it always felt like it assumed good faith from the system being monitored. If a capable model wanted to deceive, showing “safe-looking” reasoning while doing something else seems like exactly the kind of thing you’d worry about.
The CoT-Control work is an attempt to actually measure that gap, rather than just assume it’s small. And finding that current models fail at CoT manipulation is genuinely reassuring — not because it means the problem is solved, but because it gives you a concrete property to track over time.
The caveats matter though. “Struggle to control” is not “cannot control,” and this is current-generation behavior. As models become more capable, the question of whether they could learn to cleanly separate internal and visible reasoning seems worth taking seriously. The research team acknowledges this: the safety value of CoT monitoring is contingent on it remaining hard to manipulate.
The Broader Point
There’s something almost philosophical about the framing here. A limitation — inability to fully self-direct — is being recast as a feature because it preserves an external check on the system’s behavior. It’s a bit like arguing that a car with no manual override on the brakes is safer in a world where drivers can’t be fully trusted.
For anyone building systems on top of reasoning models, this is worth internalizing. The chain of thought isn’t just a debugging tool — it’s potentially a safety signal, and its value depends on the model not being able to treat it as a public-facing facade.
Whether that property holds as capabilities scale is the real open question.