The Code That Passes Tests but Rots in Place: AI Agents and Internal Quality

Source: martinfowler

Martin Fowler’s ongoing “Exploring Generative AI” series has been one of the more grounded empirical accounts of what actually happens when developers use AI coding tools on real production codebases. The installment on CCMenu and internal code quality, written by Erik Doernenburg in January 2026, is worth looking at carefully — not just for what it found, but for the question it asked.

Doernenburg is the maintainer of CCMenu, a macOS menu bar application that polls CI/CD systems and displays build status as an icon in the menu bar. The project is a good candidate for this kind of experiment: it has a real domain model, a non-trivial networking layer, and a SwiftUI interface, and its original author knows the codebase deeply enough to evaluate AI-generated changes with genuine critical judgement. The CCMenu2 rewrite, available at github.com/erikdoe/ccmenu2, is pure Swift targeting macOS 12+, with clear separation between model objects, feed parsers, and view components.

The experiment was straightforward: add a non-trivial feature using a coding agent, then assess what happened to internal quality. Not “did the tests pass” and not “does the feature work” — those are external quality concerns. The question was whether the structure of the code got worse.

External Versus Internal Quality

Fowler has written about the internal/external quality distinction for years, and it is central to understanding why this experiment matters. External quality is what users see: the application behaves correctly, error cases are handled, the interface works. Internal quality is what developers see: the code is structured coherently, abstractions are at the right level, components are loosely coupled and individually coherent, duplication is minimal.

The problem with coding agents is that they are optimized, implicitly, for external quality signals. An agent receives a prompt, generates code, sees whether the tests pass or whether the human approves the output, and iterates. What it does not see is whether the change increases coupling, whether it has introduced a second copy of logic that already exists, or whether a new method belongs in the class it was added to.

This is not a criticism of the models per se. It is a consequence of what the feedback signal is. Tests passing is measurable and immediate. Cohesion degrading is diffuse and slow.

What Agents Do to Structure

Several patterns appear consistently when coding agents work on established codebases.

The first is addition rather than reshaping. A human developer implementing a feature on a codebase they know well will often restructure existing code to accommodate the new behavior cleanly. An agent typically adds. It finds the closest existing class or function, appends the necessary logic, and stops. The result works. The codebase is also a little worse than it was before, in ways that accumulate.

The second is boundary violation. In a layered architecture, the model layer should not know about the UI, the networking layer should not contain business logic, and so on. Agents regularly violate these boundaries because the shortest path to a working implementation often crosses them. Putting a formatting decision in a model object saves two steps. The agent takes the shortcut; the developer reviewing the output might not notice.
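A minimal sketch of this kind of shortcut, using a hypothetical build-status model (the names here are illustrative, not taken from CCMenu's actual code):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BuildStatus:
    """Hypothetical model object in the model layer."""
    project: str
    succeeded: bool
    finished_at: datetime

    # The shortcut: a presentation decision placed in the model layer,
    # because this is the closest class to the data it needs.
    def menu_label(self) -> str:
        mark = "OK" if self.succeeded else "FAIL"
        return f"{mark} {self.project} ({self.finished_at:%H:%M})"

# The layered alternative: the same decision, kept out of the model
# so the model layer stays free of display concerns.
def format_menu_label(status: BuildStatus) -> str:
    mark = "OK" if status.succeeded else "FAIL"
    return f"{mark} {status.project} ({status.finished_at:%H:%M})"
```

Both versions produce the same string; the difference is invisible to tests, which is exactly why this pattern slips through review.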

The third is duplication. Large language models generate text that resembles training data. When they need to parse a date, they write a date parser. They do not search the existing codebase for the date utility that is already there. Code review catches some of this, but not all of it, and it accumulates into a codebase that has multiple slightly-different implementations of similar logic scattered across files.
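Detecting exact structural duplicates of this kind can be partially automated. A rough sketch, which normalizes each function body to an AST dump and buckets identical ones (real clone detection is far more tolerant of renamed variables and reordered statements):

```python
import ast
from collections import defaultdict

def duplicate_functions(source: str) -> list[list[str]]:
    """Group function names whose bodies are structurally identical.

    Normalizes each body to its AST dump, ignoring the function's own
    name, then reports any bucket with more than one member.
    """
    tree = ast.parse(source)
    buckets: dict[str, list[str]] = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            body_dump = ast.dump(ast.Module(body=node.body, type_ignores=[]))
            buckets[body_dump].append(node.name)
    return [names for names in buckets.values() if len(names) > 1]

# Example input: two date parsers that an agent might write independently.
SRC = '''
def parse_date_a(s):
    return s.split("-")

def parse_date_b(s):
    return s.split("-")
'''
```

Running `duplicate_functions(SRC)` groups the two parsers together; a check like this in CI would not catch near-duplicates, but it makes the easy cases visible.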

The Measurement Gap

The harder problem is that standard CI tooling is not designed to catch these issues. A test suite tells you if behavior changed. A linter catches style violations and obvious code smells. Static analysis tools like SonarQube flag long methods and high cyclomatic complexity. None of these reliably detect that a class now has two responsibilities where it previously had one, or that coupling between the networking layer and the view layer has increased.

The GitClear report from 2024 analyzed over 150 million lines of code changes and found that code churn — lines changed or reverted shortly after being written — increased substantially in the years corresponding with AI coding tool adoption. Copy-paste and moved code patterns increased. Refactoring as a deliberate activity, measured by commits that restructure without adding net functionality, decreased. These are proxy signals for internal quality, not direct measurements, but they point in a consistent direction.
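These proxy signals can be approximated locally. A crude sketch, assuming `git log --numstat` output and a much simpler definition of churn than GitClear's (deleted lines as a fraction of all changed lines, across a window of commits):

```python
def churn_ratio(numstat_lines: list[str]) -> float:
    """Rough churn proxy over `git log --numstat` output.

    Each numstat line looks like "<added>\t<deleted>\t<path>";
    binary files show "-" in place of the counts and are skipped.
    Returns deleted / (added + deleted), a crude stand-in for
    "lines changed or reverted shortly after being written".
    """
    added = deleted = 0
    for line in numstat_lines:
        parts = line.split("\t")
        if len(parts) != 3 or parts[0] == "-":
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    total = added + deleted
    return deleted / total if total else 0.0
```

A rising ratio over successive release windows would be a signal worth investigating, not a verdict; like GitClear's numbers, it measures a proxy, not internal quality itself.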

The specific challenge with CCMenu, and with any mature codebase used for this kind of evaluation, is that internal quality degradation is most visible to someone who already has a mental model of the correct structure. Doernenburg, as the original author, is positioned to see when a new method is in the wrong place in a way that a reviewer unfamiliar with the codebase would not be.

The Refactoring Problem

There is a discipline in professional software development — associated with Kent Beck, Fowler himself, and the XP tradition — of treating refactoring as a continuous activity that runs alongside feature delivery, not a separate phase. You add behavior in small steps, and you reshape the code to accommodate each step cleanly before moving on. The code is always left a little better than you found it.

Coding agents do not do this. They do not have an internalized standard of what the codebase’s structure should look like, and they do not apply incremental improvement as a background activity. They produce working code. The developer reviewing and accepting that code is responsible for the reshaping, and under time pressure, or when the output looks clean enough, that reshaping often does not happen.

This is probably the most important practical implication of the CCMenu experiment. The question is not really “does the agent write bad code.” In isolation, agent-generated code is often reasonable. The question is whether the workflow that teams build around coding agents preserves the practices that keep internal quality stable over time. If the agent accelerates output but eliminates the space where refactoring used to happen, the codebase gets worse at an accelerated rate.

What This Means in Practice

For teams using coding agents seriously, a few things follow from this.

Code review needs to include structural assessment, not just behavioral verification. Reviewers should ask whether new code belongs where it was placed, whether it duplicates existing logic, and whether it degrades the layering of the system. This requires reviewers to actually understand the existing structure, which is a non-trivial investment on large codebases.

Metrics worth tracking include things agents tend to degrade: test-to-implementation coupling (tests that reach into implementation details rather than testing through public interfaces), class responsibility counts, and inter-layer dependencies. Some of this can be automated; most of it requires deliberate attention.
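The inter-layer dependency check is one of the more automatable items. A sketch, assuming a hypothetical three-layer scheme where lower layers must never import from higher ones (dedicated tools such as import-linter do this properly for Python projects):

```python
# Hypothetical layering, ordered low to high: the model layer may not
# import from network or view; network may not import from view.
LAYERS = ["model", "network", "view"]

def layer_violations(imports: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs where a lower layer reaches up.

    `imports` is a list of module edges, e.g. ("model.status", "view.menu").
    A module's layer is taken from its top-level package name.
    """
    def layer_of(module: str) -> int:
        return LAYERS.index(module.split(".")[0])
    return [(a, b) for a, b in imports if layer_of(a) < layer_of(b)]
```

Run against the import graph on every pull request, a check like this turns "the networking layer should not know about the UI" from a review-time judgement call into a failing build.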

The most useful counterbalance is explicit refactoring time. If the agent accelerates feature implementation, some of that time can be reinvested in structural improvement. This requires a team culture that treats refactoring as legitimate work rather than a luxury, which is easier to say than to maintain under delivery pressure.

The CCMenu experiment is valuable precisely because it is specific and grounded. It does not claim that AI coding tools are net harmful. It asks a focused question — what happens to internal quality — and takes that question seriously. That is the right frame. The answer depends on what practices surround the tool, and Doernenburg’s careful assessment of his own codebase is an example of the kind of judgement that tooling alone cannot replace.
