· 2 min read ·

When the Chain of Thought Won't Lie: Why Reasoning Model Opacity Is a Safety Feature

Source: openai

There’s a counterintuitive result buried in OpenAI’s recent CoT-Control research: the fact that reasoning models struggle to control their chains of thought is being framed as a feature, not a bug. And honestly, the more I think about it, the more I agree.

What CoT-Control Is Testing

The basic question is: can a reasoning model deliberately produce a misleading chain of thought — one that hides its actual reasoning, or steers toward a specific conclusion on command? OpenAI built a framework to probe exactly this, and the answer is mostly no. Current reasoning models are surprisingly bad at deceiving through their scratchpads. They resist instructions to suppress or fabricate their intermediate steps.

For anyone who’s been watching the AI safety discourse, this lands as a meaningful data point. Monitorability — the ability to inspect what a model is actually doing before it produces an output — has been one of the more credible near-term safety arguments. If the scratchpad is honest by default, that’s load-bearing.

Why This Matters More Than It Sounds

The instinct when you first hear “models can’t control their reasoning” might be to file it under “models are limited.” But flip the framing: what you’re really observing is that the training process produces reasoning traces that are harder to weaponize than we might fear.

A model that could reliably mask its reasoning on demand would be a genuinely concerning capability. You’d have a system that could plan deceptively while showing you a sanitized, compliance-friendly scratchpad. The CoT-Control results suggest that gap — between visible reasoning and actual reasoning — is difficult to manufacture, at least right now.

That’s not the same as saying it’s impossible. These are empirical findings about current models under current training regimes. The results don’t rule out future systems that close this gap. But the research does suggest that controllability over internal reasoning traces isn’t simply falling out of capability improvements for free.

The Uncomfortable Flip Side

Here’s where it gets more interesting. If a model can’t control its chain of thought, that also means it can’t easily be instructed to be more concise, or skip steps, or adapt its reasoning style for efficiency. There’s a cost to the same property that makes it safer. Researchers and developers who want to fine-tune reasoning behavior are going to run into the same resistance.

This creates a real tension for anyone building on top of these models. You want transparency for safety, but that same rigidity can make it harder to optimize or constrain reasoning in ways that are purely practical.

Where I Land

As someone who mostly builds things that run at layer seven — Discord bots, integrations, glue code — I don’t have a stake in this the way an alignment researcher does. But I find myself genuinely reassured by this result. The argument that “we can monitor reasoning models because their scratchpads are honest” was always a bit hopeful. It’s good to see empirical evidence that the scratchpad isn’t trivially gameable.

The question worth watching: does this property hold as models get more capable? OpenAI is clearly treating it as a property worth measuring and preserving. That seems like the right instinct.

Was this interesting?