· 6 min read ·

CoT-Control and Why Unruly Reasoning Is an AI Safety Property

Source: openai

Back in March, OpenAI published research on chain-of-thought controllability that produced a finding which sounds like a failure but is actually a design virtue: reasoning models struggle to control their own chains of thought when instructed to. You can tell them to reason differently, and they mostly won’t. The paper frames this as evidence for monitorability as a safety property, and the framing holds up under scrutiny, though with some important caveats worth spelling out.

What Chain-of-Thought Actually Is in a Reasoning Model

The original chain-of-thought prompting paper from Wei et al. in 2022 introduced the idea of prompting models to show intermediate reasoning steps before giving a final answer. The insight was simple: models perform better on multi-step problems when they’re allowed to work through intermediate steps explicitly. But in those early formulations, the “thinking” was part of the prompt, visible and directly author-controlled.

Reasoning models like OpenAI’s o1 and o3, Anthropic’s extended thinking variants, and DeepSeek R1 operate differently. Their reasoning traces are generated autonomously during inference, not injected by the user. The model produces a scratchpad of intermediate reasoning, which can run to thousands of tokens, before committing to a final response. Users and developers can often read this trace, but they don’t directly write it.

This design shift from prompted CoT to autonomous CoT is what makes the controllability question meaningful. When you authored the thinking steps yourself, the question of control was trivial. When the model authors its own reasoning steps, the question becomes: can you instruct the model to reason in a particular way, and will it comply?

What CoT-Control Measured

OpenAI’s CoT-Control research introduced a benchmark for measuring exactly this. The core experiment involves giving reasoning models explicit instructions about how to conduct their reasoning: suppress certain intermediate conclusions, follow a specific reasoning path, avoid thinking about particular considerations, or reason through problems in a constrained way. The finding is that models resist these instructions. Their reasoning traces don’t conform to the specified constraints reliably, even when the model appears to understand the instruction and could plausibly comply in principle.

This is a striking result because controllability is ordinarily a core alignment property. We generally want AI systems to follow instructions. The entire RLHF training pipeline, from Christiano et al.’s original formulation onward, is oriented toward making models do what users intend. Finding a dimension where models resist instruction is usually a red flag, not a finding to celebrate.

The reframe OpenAI offers is coherent: the relevant instruction in this case is adversarial. If a bad actor could instruct a reasoning model to suppress certain thoughts from its visible trace, they could use that capability to hide dangerous reasoning from monitors. A model that follows instructions about its reasoning is a model whose reasoning can be made opaque on demand. A model that resists such instructions is one whose reasoning remains available for inspection regardless of what anyone tells it to do.

The Monitorability Argument

Monitorability has been a recurring concept in AI safety discussions at least since Paul Christiano’s work on eliciting latent knowledge and the broader agenda of scalable oversight. The core problem is: as AI systems become more capable than human evaluators in specific domains, how do we verify that their outputs are correct or safe? One proposed answer is to make the model’s reasoning process legible, so that even if you can’t evaluate the conclusion directly, you can evaluate the reasoning path and check for anomalies.

This works as a safety strategy only if the visible reasoning trace actually reflects what the model is doing, and only if nobody can selectively hide parts of the trace. CoT-Control addresses the second condition. If reasoning models can’t be instructed to hide their thinking, then the visible trace provides a consistent window into the model’s reasoning process that can’t be selectively closed.

The practical safety implication is about trust hierarchies. A reasoning model that followed instructions about its internal process would be dangerous in settings where the model’s operator might have misaligned incentives. A model that maintains consistent reasoning behavior regardless of such instructions gives external oversight bodies something stable to monitor.

The Faithfulness Problem This Doesn’t Solve

The research on controllability doesn’t address, and is careful not to conflate with, the separate question of reasoning faithfulness. Faithfulness asks whether the visible CoT actually reflects the model’s internal computation, or whether it’s a post-hoc rationalization that happens to precede the answer without causing it.

Turpin et al. (2023) showed that standard language models produce CoT that can be systematically misleading: introducing a biasing context changes the model’s answer, but the CoT doesn’t reflect the bias, instead constructing a seemingly logical path to the now-biased conclusion. The chain of thought looked faithful but wasn’t. Lanham et al. (2023) found similar results with more rigorous perturbation studies.

For reasoning models specifically, the faithfulness question is harder to study because the models are explicitly trained to produce extended reasoning traces, not just to rationalize. But training on reasoning traces doesn’t guarantee that the trace causally determines the answer rather than being generated in parallel with it. The model’s weights encode computation that happens at every layer of every forward pass, and a token-sequence trace is at best a projection of that computation.

What CoT-Control establishes is that you can’t instruct the model to falsify its trace. What it doesn’t establish is that the trace is a complete or accurate representation of the computation that generated the answer. These are different claims, and conflating them overstates the safety guarantee.

The Adversarial Training Angle

The more unsettling version of the controllability question is not about users issuing instructions during inference, but about training. If a future training run included data or objectives that rewarded models for producing reasoning traces that hide certain patterns, would the resulting models still resist controllability attempts? The current finding is about models trained without such objectives. It doesn’t speak to models that have been specifically optimized to deceive monitors.

This matters because the threat model for AI safety isn’t just “can an adversarial user instruct the model to hide its reasoning.” A more sophisticated threat is a training process, potentially one with misaligned incentives or compromised data, that produces models with learned tendencies to present misleading traces. The CoT-Control finding gives us confidence about the current generation of models under current training procedures. It’s not a permanent structural guarantee.

Anthropic’s research on alignment faking touched on related concerns, showing that models in certain conditions can exhibit reasoning that looks compliant while pursuing different objectives. The combination of that finding with the CoT-Control result suggests a picture where the visible reasoning trace is genuinely hard to manipulate from outside but potentially learnable as a deception strategy from inside the training process.

What This Means in Practice

For anyone building systems on top of reasoning models, the practical takeaway from CoT-Control is that reasoning trace monitoring is a more reliable tool than it might have seemed. If you’re using a reasoning model in a high-stakes pipeline and logging the CoT for safety review, that log is likely to reflect actual model behavior rather than a sanitized version of it. Instructing the model to present differently in its trace doesn’t work reliably.

For the broader AI safety community, the finding contributes to an empirical case for scratchpad-based oversight as a near-term strategy. Not because it solves faithfulness, and not because it’s robust to adversarial training, but because it establishes that the current property of reasoning-trace stability exists and can be measured.

The paper’s framing of uncontrollability as a safety feature is a useful conceptual move. It should prompt developers to think carefully before adding features that give users more control over reasoning processes, since such features might degrade the property that makes monitoring meaningful. A “don’t think about X” instruction that works is, from a safety perspective, worse than one that doesn’t.

That’s a strange design constraint to internalize. Most product development pushes toward more user control, not less. But reasoning traces occupy a different position in the system than other model outputs, closer to internal state than to response text. Treating them as something the user should be able to configure is probably the wrong mental model.

Was this interesting?