When the Chain of Thought Won't Lie: Reasoning Model Opacity as a Safety Feature
Source: openai
There’s a counterintuitive result buried in OpenAI’s latest safety research that’s worth sitting with: reasoning models struggle to control their own chains of thought, and that’s actually a good thing.
The research introduces something called CoT-Control, a framework for evaluating how much influence a model has over its own intermediate reasoning steps. The finding is that models like o1 and o3 have surprisingly little ability to steer or suppress their reasoning chains on demand. If you try to get them to hide their reasoning or reason in a deliberately misleading way, they largely fail.
At first glance this sounds like a limitation. And in some narrow sense it is — these models aren’t as controllable as you might want them to be. But the safety implication flips that around entirely.
Monitorability as a First-Class Property
The core argument is about monitorability. If a reasoning model’s chain of thought reflects something close to its actual computational process — not a performance of reasoning, but the real thing — then reading the CoT gives you meaningful signal about what the model is actually doing.
Compare this to the alternative: a model that can produce whatever chain of thought it wants, independent of its actual inference process. That model could reason one way internally and narrate something completely different. You’d have no reliable way to audit it.
The fact that current reasoning models can’t cleanly separate their visible reasoning from their actual reasoning is what makes the CoT useful as a monitoring surface. The struggle to control it is evidence that the two things are coupled.
Why This Matters for People Building on These Models
If you’re building applications that use reasoning models — and increasingly, developers are — this has practical implications.
First, the CoT isn’t just a debugging artifact. It’s closer to a log of actual model behavior. Treating it as such (storing it, reviewing it, using it for evals) makes more sense than discarding it.
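As a minimal sketch of that idea: persist each reasoning trace alongside its prompt and answer in an append-only JSONL log, then read it back for review or evals. The `reasoning` field here is a placeholder for whatever chain-of-thought text your model or API actually exposes (some APIs return only summaries); the function names are illustrative, not from any real SDK.

```python
import json
from datetime import datetime, timezone


def log_cot(log_path, prompt, reasoning, answer):
    """Append one reasoning trace to a JSONL audit log.

    `reasoning` is whatever chain-of-thought text the model exposes
    (hypothetical field; real APIs may only surface summaries).
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "reasoning": reasoning,
        "answer": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_traces(log_path):
    """Read the audit log back for human review or offline evals."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

JSONL keeps the log append-only and trivially greppable, which matters if you later want to run evals over thousands of stored traces.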
Second, any future model that does gain strong CoT control should be treated with more suspicion, not less. The ability to decouple visible reasoning from actual computation is a capability that could be used to obscure misaligned behavior. A model that reasons well but can also hide that reasoning cleanly is harder to trust.
Third — and this is more speculative — this research gives some grounding to the idea that interpretability and controllability can trade off. A model that’s highly interpretable (its CoT reflects its process) may be less controllable in certain ways. That’s a constraint worth understanding before assuming you can have both.
The Broader Safety Picture
OpenAI is positioning monitorability as a safeguard — not just a nice-to-have, but a structural property that makes oversight possible. That framing feels right to me. A lot of AI safety discourse focuses on alignment and values, but the ability to actually observe what a model is doing in production is underrated.
The result also implicitly pushes back on some threat models where advanced models strategically manipulate their visible reasoning to deceive evaluators. That’s still worth taking seriously as a future concern, but the current evidence suggests it’s not trivially easy — these models aren’t good at it even when prompted to try.
For now, the takeaway is practical: the chain of thought in reasoning models is a more faithful signal than you might assume, and worth building monitoring around while that coupling still holds.
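What "building monitoring around" the CoT might look like at its simplest: a pattern-based filter that flags stored traces for human review. This is a deliberately crude sketch — the patterns are made-up examples, and a production monitor would more likely use a trained classifier or a second model rather than keyword matching — but it shows the shape of treating the CoT as an auditable surface.

```python
import re

# Hypothetical phrases a reviewer might want surfaced in a reasoning
# trace. Illustrative only; a real monitor would use a classifier,
# not a keyword list.
SUSPECT_PATTERNS = [
    r"\bhide\b",
    r"\bconceal\b",
    r"\bpretend\b",
    r"won't notice",
]


def flag_trace(reasoning: str) -> list[str]:
    """Return every suspect pattern that matches a CoT string.

    An empty list means the trace passed this (very weak) filter;
    a non-empty list means it should be queued for human review.
    """
    return [
        pattern
        for pattern in SUSPECT_PATTERNS
        if re.search(pattern, reasoning, re.IGNORECASE)
    ]
```

Even a filter this weak only has value because of the coupling the research describes: if the model could narrate an arbitrary clean trace while reasoning differently, there would be nothing here worth filtering.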