The Correctness Trap: What Coding Agents Do to Your Internal Code Quality
Source: martinfowler.com
Back in January 2026, Erik Doernenburg published a careful, grounded study on martinfowler.com that most AI-and-coding discourse skips right past. He did not run a vibe check or count how fast he typed. He took a real production codebase, CCMenu, used a coding agent to add a feature, and then actually looked at what happened to the code structure afterward. The result was not a horror story and not a triumph; it was something more instructive.
Doernenburg is the author and maintainer of CCMenu, a macOS menu bar application that monitors CI/CD build pipelines by reading CruiseControl-style XML feeds from CI servers and showing build status as a traffic-light icon. It is written in Swift, it has a real test suite, and Doernenburg knows the codebase deeply enough to notice when something is off. That combination of production code, an expert author, and meaningful test coverage makes it a far better test case than most of the “I asked GPT to write a todo app” experiments that flood the conversation.
The feature he added was support for a new pipeline provider. The agent produced working code. Tests passed. The feature shipped. From the outside, every measurable outcome was positive.
The inside was a different matter.
What Internal Quality Actually Means
The distinction between external and internal quality is foundational to how thoughtful engineers think about code. External quality is what users see: correctness, performance, reliability. Internal quality is what engineers see: the structure, the coupling, the naming, the degree of duplication, the ease of writing tests. Martin Fowler’s essay “Is High Quality Software Worth the Cost?” is worth reading if you have not already, because it makes a precise economic argument that internal quality is not a luxury. Poor internal quality accumulates as cruft that slows future development in a compounding way. His “design stamina hypothesis” holds that investing in internal quality pays back within weeks, not months, because it keeps velocity from degrading as the codebase grows.
This framing matters here because coding agents are optimized almost entirely for external quality. They are trained on code that works, evaluated against tests, and rewarded for producing correct output. Nothing in that feedback loop penalizes tightly coupled classes, duplicated logic, or a new file that ignores the dependency direction the rest of the codebase was built around.
What Doernenburg Found
The findings break down across four dimensions that are standard when assessing internal quality.
Coupling. The new code introduced dependencies between components that had previously been loosely coupled. The agent did not understand or respect the architectural boundaries Doernenburg had established. It reached across layers in ways that were convenient to the local task but wrong for the system as a whole.
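To make the coupling problem concrete, here is a minimal sketch, in Python rather than Swift and with entirely hypothetical names (none of them come from CCMenu). Both widget variants below pass the same tests; only one respects the layer boundary.

```python
class BuildStatusStore:
    """Low-level persistence layer. By convention, UI-level code is not
    supposed to touch this directly."""
    def __init__(self):
        self._statuses = {}

    def save(self, pipeline_id, status):
        self._statuses[pipeline_id] = status

    def load(self, pipeline_id):
        return self._statuses.get(pipeline_id, "unknown")


class StatusService:
    """The sanctioned boundary: UI code depends on this, not the store."""
    def __init__(self, store):
        self._store = store

    def current_status(self, pipeline_id):
        return self._store.load(pipeline_id)


class NewProviderWidgetCoupled:
    """The agent-style version: reaches straight through the service
    layer into the store, coupling UI code to persistence details."""
    def __init__(self, store: BuildStatusStore):
        self._store = store  # wrong layer: depends on the store

    def icon_color(self, pipeline_id):
        return "green" if self._store.load(pipeline_id) == "success" else "red"


class NewProviderWidget:
    """The structure the rest of the codebase follows: depend on the
    service boundary, never on the store."""
    def __init__(self, service: StatusService):
        self._service = service  # right layer: depends on the boundary

    def icon_color(self, pipeline_id):
        return "green" if self._service.current_status(pipeline_id) == "success" else "red"
```

The two widgets are behaviorally identical, which is exactly why no test flags the difference: only an architectural review sees it.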
Cohesion. Responsibilities that should have been separated ended up bundled together. The agent placed code in the most obvious or immediately accessible location rather than the architecturally appropriate one. The new structs and classes took on more concerns than they should have.
Duplication. Logic that already existed elsewhere in the codebase was reimplemented rather than reused. The agent either did not discover or did not use existing helper methods and shared utilities. This is one of the more quietly damaging outcomes, because duplicated logic means future changes have to be made in multiple places, and divergence is nearly guaranteed over time.
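A hedged illustration of the duplication pattern, again in Python with invented names rather than anything from CCMenu: the codebase already has a shared timestamp helper, and the agent-style version re-derives the same logic inline.

```python
from datetime import datetime


def parse_build_time(raw: str) -> datetime:
    """The existing shared utility: the canonical place to parse feed
    timestamps, handling one known format."""
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")


def new_provider_build_time(payload: dict) -> datetime:
    """What reuse looks like: the new provider calls the shared helper."""
    return parse_build_time(payload["lastBuildTime"])


def new_provider_build_time_duplicated(payload: dict) -> datetime:
    """What the agent produced instead: the same parsing logic,
    re-derived inline. It works today, but a future format change now
    has to be made in two places, and the copies will drift."""
    raw = payload["lastBuildTime"]
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
```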
Testability. The new code was harder to test in isolation. The existing codebase had established patterns for dependency injection that made unit testing tractable. The agent did not follow those patterns, making the new code more resistant to the kind of isolated testing that the rest of the project supported.
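The dependency injection point is easiest to see in code. The sketch below uses Python and hypothetical names, not CCMenu's actual types: the injected version has a seam where a test can substitute a fake; the hardwired version does not.

```python
class FeedClient:
    """In production this would do network I/O against a CI server."""
    def fetch(self, url: str) -> str:
        raise NotImplementedError("network call in production")


class PipelineMonitor:
    """Follows the established pattern: the client is injected, so a
    test can substitute a fake and exercise the logic in isolation."""
    def __init__(self, client: FeedClient):
        self._client = client

    def status(self, url: str) -> str:
        body = self._client.fetch(url)
        return "success" if "Success" in body else "failure"


class MonitorHardwired:
    """The agent-style version: it constructs its own real client
    internally, so there is no seam for a test double."""
    def __init__(self):
        self._client = FeedClient()  # cannot be replaced from a test

    def status(self, url: str) -> str:
        body = self._client.fetch(url)
        return "success" if "Success" in body else "failure"


class FakeClient(FeedClient):
    """A trivial test double for the injected version."""
    def fetch(self, url: str) -> str:
        return '<Project activity="Sleeping" lastBuildStatus="Success"/>'
```

Testing `PipelineMonitor` is one line with `FakeClient`; testing `MonitorHardwired` in isolation is impossible without monkey-patching, which is exactly the resistance Doernenburg describes.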
None of these problems showed up as compiler errors or failing tests. The code was, by every automated measure, correct. The technical debt was introduced silently.
Why This Is Structurally Inevitable
The reason this happens is not a bug in any particular model or tool; it is a property of how coding agents operate.
A coding agent sees a context window. The context window contains some of the codebase, the task description, recent conversation, and whatever the agent or the tooling chose to retrieve. It does not contain the architecture as a mental model, the history of why certain decisions were made, the implicit conventions the team has agreed on, or the long-range dependency graph that took years of careful attention to maintain. Even with retrieval-augmented approaches that pull in relevant files, the agent is reasoning locally. It produces code that is coherent with what it can see, not necessarily with what it cannot.
This is meaningfully different from how an experienced developer works. When Doernenburg adds a feature to CCMenu, he carries a mental model of the entire system. He knows which layer a new piece of logic belongs in, which utilities already solve adjacent problems, which design decisions were deliberate trade-offs worth preserving. That knowledge does not live in any file. It lives in his head, accumulated over years of working on the project. The agent has none of it.
The training data problem compounds this. Agents are trained on enormous corpora of code from public repositories. Most of that code is not particularly well-structured; it is just code that was written and committed. The model learns patterns that produce working software, but “working” and “well-structured” are weakly correlated at the corpus level. There is no strong training signal that rewards maintaining the architectural coherence of a specific codebase.
The GitClear Data Corroborates This
Doernenburg’s case study is qualitative and deep; it is one project observed carefully by its author. But the pattern he describes is visible at scale too. GitClear’s 2024 analysis of millions of lines of AI-assisted code found increased rates of code churn, meaning code that is written and then reverted or rewritten within two weeks, and markedly higher duplication rates compared to pre-AI baselines. The duplication finding in particular maps directly onto what Doernenburg observed: agents reproduce logic rather than reuse abstractions.
The churn finding is worth sitting with. When code is written and then has to be rewritten shortly afterward, that is often a sign that the code passed tests but was structurally wrong in ways that only became apparent when someone tried to extend or integrate it. The agent produces output that looks like the destination but is not the journey. The next developer who touches that code has to figure out where it really belongs.
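GitClear's exact methodology is proprietary, but the idea behind the metric can be sketched. The function below approximates two-week churn over a simplified per-line change log; treat it as an illustration of the definition, not a reimplementation of their analysis.

```python
from datetime import datetime, timedelta


def two_week_churn(changes):
    """Approximate churn: the fraction of added lines that are deleted
    again within 14 days.

    changes: iterable of (line_id, action, date) tuples, where action
    is 'add' or 'del'. Returns churned_adds / total_adds."""
    added_at = {}
    total_adds = 0
    churned = 0
    for line_id, action, date in sorted(changes, key=lambda c: c[2]):
        if action == "add":
            added_at[line_id] = date
            total_adds += 1
        elif action == "del" and line_id in added_at:
            if date - added_at.pop(line_id) <= timedelta(days=14):
                churned += 1
    return churned / total_adds if total_adds else 0.0


# Example: three lines added; one of them rewritten ten days later.
log = [
    ("a", "add", datetime(2026, 1, 1)),
    ("b", "add", datetime(2026, 1, 1)),
    ("c", "add", datetime(2026, 1, 2)),
    ("b", "del", datetime(2026, 1, 11)),
]
```

On the sample log, one of three added lines churns, giving a rate of one third. A real measurement would walk `git log` with line-level diffs, but the shape of the metric is this simple.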
What Refactoring Looks Like After the Fact
One implication Doernenburg draws out is that the usual argument for using agents, that they save time, becomes more complicated when you account for the cleanup work. Adding a feature with an agent may be faster in the short term, but reviewing that feature carefully, identifying the structural problems, and deciding whether to refactor immediately or accept the debt all take time. If you defer the refactoring, it compounds. The coupling introduced by one agent-assisted feature makes the next feature harder to implement cleanly, whether or not you use an agent for it.
This suggests a workflow question that does not get enough attention: when should you prompt the agent to refactor its own output, and when does the agent lack enough architectural context to do that usefully? Asking an agent to “clean up this code” or “reduce coupling” is more tractable than asking it to “align this with our architectural conventions” because the former is local and the latter requires understanding the system. Context-stuffing approaches, where you feed the agent architectural documents or examples of well-structured code from the same project, can help, but they are manual and imperfect. The agent may still miss the dependencies it cannot see.
A more reliable approach is treating agent output the way you would treat a pull request from a skilled contractor who is new to the codebase: the code may work, but it needs architectural review from someone who carries the mental model of the system. That review has to be intentional. If you let agent output go straight through CI because the tests pass, you are accumulating exactly the kind of silent debt Doernenburg describes.
The Measurement Problem
One thing Doernenburg’s study highlights that deserves more attention is that most teams do not measure internal quality at all. Coupling and cohesion metrics exist; tools like SonarQube, CodeClimate, and language-specific analyzers can surface them. But most CI pipelines run tests and linters and call it done. If your quality gates only check for correctness and style, you will not notice when structural quality declines, whether the cause is an agent or a tired human engineer.
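As a sketch of why such metrics are cheap to automate, here is a toy duplication check in the spirit of what SonarQube and similar tools do: hash normalized windows of consecutive lines and flag any window that appears more than once. Real tools use token-based comparison and tuned thresholds; this only shows that the check is inexpensive enough to run on every pull request.

```python
from collections import defaultdict


def duplicated_windows(source: str, window: int = 4):
    """Return {window_of_lines: [start_indices]} for every run of
    `window` consecutive non-blank lines that occurs more than once.
    Lines are normalized by stripping surrounding whitespace."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    seen = defaultdict(list)
    for i in range(len(lines) - window + 1):
        seen[tuple(lines[i:i + window])].append(i)
    return {block: locs for block, locs in seen.items() if len(locs) > 1}
```

A quality gate could fail, or at least warn, when the count of duplicated windows grows between the base branch and the pull request, which is precisely the signal that tests and linters never emit.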
This is not a new problem, but the speed at which agents produce code makes it more acute. A developer writing code manually accumulates technical debt at roughly the pace they can write. An agent can accumulate structural debt much faster, and because the output looks polished, it is less likely to trigger the usual instinct to slow down and review.
What This Changes About How to Use Agents Well
None of this argues against using coding agents; it argues for being clear-eyed about where they are strong and where they are not. Agents are good at producing working implementations of well-defined, locally scoped tasks. They are weaker at maintaining architectural coherence across a codebase they do not fully see.
The practical adjustments follow from that. Giving the agent more context, including architectural decision records, examples of the patterns in use, and explicit guidance about which layer new code belongs in, narrows the gap. Treating the output as a draft that requires structural review, not just correctness review, is probably unavoidable on any codebase where internal quality has been a deliberate investment. And running code quality metrics as part of your review process, so that coupling and duplication increases are visible rather than invisible, gives you the data to make informed decisions rather than discovering the debt when it has already compounded.
Doernenburg’s study is a retrospective published a couple of months after the work was done, which gives it a useful calmness. He is not alarmed; he is precise. The coding agent did not break his codebase. It made it measurably worse in ways that matter for long-term maintainability, and it did so in a way that no automated gate caught. That is the honest shape of the current state of agent-assisted development, and working with it productively starts with seeing it clearly.