Why Reasoning Models Can't Lie to You (And Why That Matters)

There’s a finding buried in OpenAI’s latest safety research that I keep coming back to: when you try to instruct a reasoning model to change or hide its chain of thought, it largely fails. And somehow, that failure is the good news.

OpenAI introduced CoT-Control, a benchmark designed to measure how much control language models have over their own reasoning traces. The idea is simple but important: can a model be instructed to produce a misleading or suppressed chain of thought while still arriving at a correct answer? Can it reason one way internally but show you something different?

For reasoning models like o1 and o3, the answer is mostly no. They struggle to decouple what they’re actually doing from what they show you. The reasoning trace bleeds through. The model can’t cleanly perform for the audience.

Why This Is Not a Bug

If you frame it wrong, this sounds like a limitation. The model can’t follow a certain class of instructions. But the framing matters enormously here.

The alternative — a model with high CoT-control — is a model that can reason privately, show you a sanitized version of its thought process, and still produce outputs. That’s not a capability upgrade. That’s an auditing nightmare. It’s the difference between a system you can inspect and one that is performing transparency.

Monitorability is one of the few concrete handles we have on whether a model is doing what we think it’s doing. If the chain of thought is legible and reflects actual computation, you can catch things: weird reasoning paths, subtle goal misalignment, cases where the model is technically correct but for the wrong reasons. Once the CoT becomes a separate output that the model optimizes independently from its internal state, you’ve lost that.

The Deeper Concern

What CoT-Control is really measuring is whether future, more capable models might develop that decoupling naturally — not because they were instructed to, but because it becomes useful. A model that’s trained heavily on reward signals tied to outcomes rather than reasoning quality has an incentive to learn whatever internal process maximizes reward, regardless of whether the visible CoT reflects it.

That’s the trajectory worth watching. Right now, reasoning models seem to be genuinely coupled to their traces. Whether that holds as models scale, as fine-tuning pressure builds, and as the gap between capability and interpretability grows — that’s the open question.

What It Means Practically

For developers building on top of these models, this research reinforces something useful: the reasoning trace is worth reading. Not just as a debugging aid, but as a signal about what the model is actually doing. If you’re using extended thinking or chain-of-thought outputs in your application and you’re not logging them, you’re leaving information on the table.

It also suggests that prompts trying to shape or constrain the reasoning process should be treated skeptically. The model may not follow them the way you expect — and in some cases, that’s probably better than if it did.

The fact that OpenAI is publishing this kind of research at all is worth noting. “Our models can’t fully control their own reasoning” is an unusual thing to advertise. But as a safety property, it’s more valuable than it looks.