
The Safety Case for Uncontrollable Reasoning

Source: OpenAI

There is a version of this story where the headline is alarming: AI models that cannot control their own reasoning. But OpenAI’s framing of its CoT-Control research lands differently, and I think that framing is right.

The research introduces CoT-Control, a framework for probing whether reasoning models can be instructed to suppress, redirect, or manipulate their own chains of thought. The finding: they largely cannot. When prompted to reason in a misleading or hidden way, the models resist — not through any explicit design rule, but as an emergent property of how they learned to reason.

Why this matters

The concern with powerful reasoning models has always been legibility. A model that can construct a convincing answer is one thing. A model that can construct a convincing answer while concealing how it got there is a different kind of problem. If the chain of thought can be strategically shaped to look benign while actual reasoning runs somewhere else, you have lost one of the better tools for auditing model behavior.

What CoT-Control is probing is essentially: can the model do that? Can it reason one way and show you something else?

The answer, at least for current models, appears to be mostly no. The reasoning leaks. The actual process tends to surface in the visible trace, even when the model is nudged toward hiding it.

Monitorability as a first-class property

OpenAI is making a specific argument here: that monitorability — the ability to observe a model’s reasoning process — is a meaningful safety safeguard, and that the difficulty models have in suppressing their chains of thought reinforces it.

This is a more interesting framing than “we made the model transparent.” It is saying: the model’s inability to fully control its own cognition is, structurally, a feature. The reasoning process is somewhat ungovernable from the inside, which means it is somewhat auditable from the outside.

For anyone building systems on top of reasoning models, this has practical implications:

  • Chain-of-thought traces are worth logging and reviewing, not just the final outputs
  • Attempts to instruct a model to reason covertly may fail in detectable ways
  • The visible reasoning is a closer proxy for actual model behavior than it would be if models had fine-grained control over it
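The first point is the easiest to act on today. As a minimal sketch of what "log the trace, not just the answer" might look like, here is an illustrative helper; the record fields and the in-memory store are assumptions of this sketch, not any provider's actual API, and in a real system the trace and output would come from your model client's response object.

```python
import datetime

def log_reasoning_trace(store, request_id, cot_trace, final_output):
    """Append a chain-of-thought trace next to the final output.

    `cot_trace` and `final_output` are plain strings here; in a real
    system they would come from your model provider's response object
    (field names vary by API and are assumed for this sketch).
    """
    record = {
        "request_id": request_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "cot_trace": cot_trace,         # log the reasoning, not just the answer
        "final_output": final_output,
        "trace_chars": len(cot_trace),  # cheap signal: an empty trace is worth a look
    }
    store.append(record)
    return record

# Usage: a plain list standing in for a real log sink (file, DB, etc.)
audit_log = []
log_reasoning_trace(audit_log, "req-001",
                    "Step 1: parse the question. Step 2: check units.",
                    "42")
```

The point of keeping the trace alongside the output is that later review, automated or human, can compare what the model said it was doing with what it ultimately produced.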

The caveat worth holding onto

None of this means the problem is solved. Future models trained differently, or models explicitly optimized for steganographic reasoning, could look very different. The current property is emergent, not guaranteed, and emergent properties can be trained away.

What this research establishes is a snapshot: today’s reasoning models, in tested conditions, struggle to control their chains of thought. That is useful to know. It grounds a specific safety claim in empirical observation rather than architectural assumption.

For people working on AI systems at any level, the broader lesson is that auditability should be treated as an engineering concern, not just a philosophical one. CoT traces are evidence. Treat them like logs.
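Treating traces like logs also means scanning them like logs. A crude, illustrative heuristic, assuming the record layout and threshold below (both are inventions of this sketch): if a model were nudged toward covert reasoning, one detectable failure mode might be a visible trace that is unusually thin relative to a confident answer.

```python
def flag_thin_traces(records, min_chars=40):
    """Return records whose reasoning trace looks too thin to audit.

    `records` are dicts with "cot_trace" and "final_output" keys
    (field names are illustrative, not from any specific API), and
    `min_chars` is an arbitrary threshold for this sketch.
    """
    return [r for r in records if len(r.get("cot_trace", "")) < min_chars]

# Usage: the empty-trace record is flagged for human review
logs = [
    {"id": "a",
     "cot_trace": "Step 1: restate the problem. Step 2: check the units.",
     "final_output": "12 m/s"},
    {"id": "b",
     "cot_trace": "",
     "final_output": "A long, confident answer with no visible reasoning."},
]
suspicious = flag_thin_traces(logs)
```

A length check is obviously not a covert-reasoning detector; it is the log-monitoring posture that matters, with the check itself swappable for something better.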
