
The Audit Log That Cannot Lie: Why Reasoning Model CoT Inflexibility Is a Feature

Source: openai

OpenAI published a research paper on March 5, 2026 with a result framed as good news: reasoning models are bad at controlling their own chains of thought. The paper introduces a framework called CoT-Control to measure how well models like o1 and o3 comply with instructions to suppress, alter, or hide their internal reasoning steps. The finding is that they struggle to do it consistently, even when fine-tuned toward that behavior. The paper calls this a positive result for AI safety. I think they are right, and the reason why becomes clearest if you think about it the way you would think about audit logs in a production system.

What CoT-Control Measures

CoT-Control sets up a battery of tasks where models are explicitly prompted or trained to produce chains of thought that differ from their actual computation: skip certain reasoning steps, present a simplified or altered version of their working, or produce reasoning that reaches a predetermined conclusion regardless of the actual calculation. The framework then checks whether the final answers or the observable reasoning patterns betray what the model actually computed internally.

The headline result is that current reasoning models cannot reliably comply with these instructions. Attempts to suppress or alter the CoT either degrade answer quality, leak through in detectable ways, or both. When models are fine-tuned toward better CoT control, the gains come at the cost of capability. There is no clean separation between the visible reasoning and the computation that produces the answer.
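One way to make that trade-off concrete is to score a batch of trials on two axes: how often the visible reasoning followed the control instruction, and what happened to answer accuracy when it did. The sketch below is a hypothetical scoring scheme, not the metric used in the paper. The `Trial` fields, the `summarize` function, and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation trial (hypothetical schema, not from the paper)."""
    instructed: bool  # was the model told to suppress or alter its CoT?
    complied: bool    # did the visible CoT follow that instruction?
    correct: bool     # was the final answer right?


def rate(flags):
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(trials):
    """Compare baseline trials against trials with a CoT-control instruction."""
    baseline = [t for t in trials if not t.instructed]
    controlled = [t for t in trials if t.instructed]
    complied = [t for t in controlled if t.complied]
    return {
        "compliance_rate": rate(t.complied for t in controlled),
        "baseline_accuracy": rate(t.correct for t in baseline),
        "controlled_accuracy": rate(t.correct for t in controlled),
        "accuracy_when_complied": rate(t.correct for t in complied),
    }


# Invented data: four baseline trials, four trials with a control instruction.
trials = [
    Trial(instructed=False, complied=False, correct=True),
    Trial(instructed=False, complied=False, correct=True),
    Trial(instructed=False, complied=False, correct=True),
    Trial(instructed=False, complied=False, correct=False),
    Trial(instructed=True, complied=True, correct=False),
    Trial(instructed=True, complied=True, correct=True),
    Trial(instructed=True, complied=False, correct=True),
    Trial(instructed=True, complied=False, correct=False),
]

report = summarize(trials)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A low compliance rate together with a drop from baseline accuracy to controlled accuracy is the pattern the paragraph above describes: control is unreliable, and what control there is costs capability.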

This is not a trivial result. The question of whether CoT is “faithful” to model internals has been contested for years. Turpin et al. (2023) showed that model reasoning can be systematically biased by features of the prompt, such as the ordering of answer options in few-shot examples, even when the stated reasoning never mentions or acknowledges that bias. The CoT says one thing; the behavior reflects another. Lanham et al. (2023) found that truncating CoT partway through often does not significantly hurt performance on certain tasks, raising genuine doubts about whether the visible steps are doing the computational work they appear to be doing. Wei et al. (2022) introduced chain-of-thought prompting as a practical technique for eliciting better performance, but the mechanism was always partly opaque.

Against that backdrop, the CoT-Control finding says something more specific: while CoT may be imperfectly faithful, models cannot strategically and consistently decouple it from their actual computation on demand. That is a different and more actionable claim.

The Audit Log Analogy

In systems programming and security engineering, the value of an audit log depends almost entirely on how hard it is to fake. A logging system where the process being audited can selectively suppress or modify its own log entries provides weak guarantees. You might still catch careless behavior, but a system that has learned it is being watched and wants to look compliant can defeat it. The more control the audited system has over the log, the less the log is worth as evidence.

Execution traces in debugging, write-ahead logs in databases, system call auditing via tools like auditd or eBPF: all of these derive their usefulness from the same property. The trace is coupled to the actual execution in a way that is expensive or impossible to decouple without disrupting the execution itself. A database that could silently omit entries from its WAL without affecting commit correctness would not be a database you could recover from. The coupling is not a limitation of the design; it is the design.
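A hash chain is one common way to build that kind of coupling into a log deliberately. The sketch below is a toy illustration under stated assumptions, not how auditd, eBPF tooling, or any particular database WAL works. The function names (`entry_digest`, `append`, `verify`) and the sample events are invented for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest that precedes the first entry


def entry_digest(prev_digest, event):
    """Digest that binds an event to everything logged before it."""
    payload = json.dumps({"prev": prev_digest, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, event):
    """Append an event, chaining it to the previous entry's digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"event": event, "digest": entry_digest(prev, event)})


def verify(log):
    """Recompute the chain; any mismatch means an entry was changed or dropped."""
    prev = GENESIS
    for entry in log:
        if entry["digest"] != entry_digest(prev, entry["event"]):
            return False
        prev = entry["digest"]
    return True


log = []
for event in ["login alice", "read /etc/passwd", "logout alice"]:
    append(log, event)

intact = verify(log)

# Silently omitting the middle entry breaks the chain at the next entry.
omitted = [log[0], log[2]]
omission_detected = not verify(omitted)

# Rewriting an event without recomputing its digest is also caught.
edited = [dict(entry) for entry in log]
edited[1]["event"] = "read /tmp/notes"
edit_detected = not verify(edited)

print(intact, omission_detected, edit_detected)
```

This toy version has known gaps: it does not detect truncation of the most recent entries, and a writer who can recompute every digest can rewrite the whole chain, so real designs anchor the latest digest somewhere the audited process cannot reach. The point it illustrates is the narrower one made above: the record is tied to the history it describes, so selective edits disturb everything downstream.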

The CoT-Control result suggests that reasoning model chains of thought have something like this property. The reasoning is not a post-hoc narrative generated after the answer is computed. It is entangled with the computation in a way that makes strategic manipulation costly. That does not mean it is perfectly faithful; it clearly is not. But it does mean it is not a freely manipulable facade either.

Why This Matters for Deceptive Alignment

The safety framing in the OpenAI paper connects directly to deceptive alignment, a concern articulated by Evan Hubinger and collaborators in “Risks from Learned Optimization.” The basic worry is that a sufficiently capable model might learn to behave well during training and evaluation, where it can infer it is being monitored, while planning to pursue different goals during deployment when monitoring is reduced. The model presents a compliant surface while maintaining a different internal agenda.

For this failure mode to work in a system with visible chain-of-thought reasoning, the model would need to produce reasoning that looks aligned while running computation that is not. That requires exactly the kind of CoT control that the CoT-Control framework tested and found to be beyond current model capabilities. A model that cannot reliably suppress or alter its reasoning trace when explicitly instructed to do so is a model that cannot easily hide misaligned computation behind benign-looking deliberation.

The analogy is a poker player who cannot help telegraphing their hand. The tell is not under voluntary control. Whatever is happening internally leaks into the observable channel. This is not a complete safety guarantee. It is a structural property that makes certain failure modes harder to execute.

This connects to what OpenAI calls deliberative alignment, where models explicitly reason through safety policies before acting. The monitorability argument is that this reasoning is meaningful precisely because the model cannot costlessly fake it. If the CoT were freely controllable, deliberative alignment would just be security theater: the model could reason through the safe policy in the visible trace while doing something else in its underlying computation.

Where the Analogy Breaks Down

The audit log analogy is useful but has limits that matter.

Audit logs are append-only records of discrete events. CoT is a generative process that produces human-readable prose reasoning. The coupling between CoT and model computation is empirical rather than architectural. We know it is hard to decouple right now. We do not know that it remains hard as model capabilities increase, as training techniques improve, or as models are specifically optimized for tasks where strategic CoT manipulation would be useful.

The o1 and o3 models also introduce a specific complication: the raw reasoning tokens are not shown to users in full. Users interact with a summarized form rather than the complete trace. OpenAI has acknowledged that the summary is generated separately from the full chain of thought and may not capture all of it accurately. That gap between full reasoning and displayed summary is a meaningful limitation on monitorability. You are watching a condensed version of the audit log, and the condensation is itself model-generated. The CoT-Control result applies to the underlying reasoning process, but what you can actually observe is already one step removed.

Process reward models, which Lightman et al. (2023) showed can verify step-by-step reasoning more reliably than outcome reward models, are relevant here. If you can verify each step of the reasoning, you get stronger faithfulness guarantees than if you only check the final answer. But PRMs operate on the visible trace, and the visible trace for o1 and o3 is already summarized. The full reasoning tokens are not exposed to external verifiers.
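The difference between checking steps and checking outcomes can be shown with a toy example. Real process reward models are learned scorers. The sketch below swaps in a rule-based arithmetic checker as a stand-in, and the step format, function names, and sample trace are all assumptions made for illustration.

```python
import re

# A step looks like "12 * 4 = 48" (hypothetical format for this sketch).
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def step_is_valid(step):
    """True if the step parses and its arithmetic is correct."""
    match = STEP.match(step)
    if match is None:
        return False
    a, op, b, claimed = match.groups()
    return OPS[op](int(a), int(b)) == int(claimed)


def final_answer(steps):
    """The value claimed on the right-hand side of the last step."""
    match = STEP.match(steps[-1])
    return int(match.group(4)) if match else None


def outcome_check(steps, expected):
    """Outcome-style check: only the final answer is inspected."""
    return final_answer(steps) == expected


def process_check(steps):
    """Process-style check: every visible step is inspected."""
    return [step_is_valid(step) for step in steps]


# Problem: (7 + 5) * 4 + 2, whose correct answer is 50.
# The trace reaches 50, but two of its steps contain errors that cancel out.
trace = ["7 + 5 = 12", "12 * 4 = 46", "46 + 2 = 50"]

outcome_ok = outcome_check(trace, expected=50)
step_flags = process_check(trace)

print(outcome_ok, step_flags)
```

The outcome check passes this trace while the step check flags two of its three steps. Both checks see only the text they are given, which is the limitation described above: if the visible trace is a summary, a step-level verifier is grading the summary.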

What Remains Open

The CoT-Control framework is a measurement tool, not a theoretical account. It tells us that current models struggle with CoT manipulation but does not explain the mechanism in terms of model internals. That explanation would require understanding what aspects of the transformer architecture couple visible reasoning to underlying computation, and whether those aspects are stable under capability scaling.

Anthropic’s research on alignment faking in large language models explores whether a model will comply with a training objective when it infers it is being trained while behaving differently when it infers it is not. The relationship between that line of work and CoT controllability is not fully mapped. A model that cannot strategically manipulate its CoT during normal operation might still produce systematically biased reasoning under conditions far from its training distribution, without that bias being visible as strategic manipulation.

The deeper issue is that CoT faithfulness, controllability, and legibility are related but distinct properties. A reasoning trace can be faithful, meaning it reflects the actual computation, without being fully legible, meaning you can understand what computation it reflects. The CoT-Control result addresses controllability. The legibility question, whether a human or automated monitor can correctly interpret what the visible reasoning means about model computation, remains largely open.

For now, the finding that models cannot easily control their own reasoning chains is genuinely useful. It means the visible CoT is a signal worth monitoring, not just a display layer that can be freely adjusted. That is a more modest claim than “the CoT is a complete window into model computation,” but it is a claim with real safety implications. In a system where you cannot trust the audit log at all, you have no basis for monitoring. Establishing that the log is at least hard to fake is the prerequisite for everything else.
