
The Internal Quality Problem That AI Coding Agents Don't Solve

Source: martinfowler

The question most practitioners ask first about AI coding tools is whether the code works. Does it pass tests? Does the feature behave as described? Does it ship? That framing captures external quality. What gets far less attention is internal quality: the structural health of the code itself, the properties that determine how expensive it will be to change six months from now.

Erik Doernenburg’s experiment with CCMenu, part of the Exploring Gen AI series on Martin Fowler’s site, takes the less-traveled path. Published in January 2026, and worth a close read now that both the tools and the discourse around them have matured, it records what happened when he used a coding agent to add a feature to a real open source application he maintains, then examined what the agent did to the code’s internal quality. It is one of the more specific practitioner assessments of this question available.

What Internal Quality Actually Measures

Internal quality is the part of software quality that users never see but developers live with. Martin Fowler has a useful framing: internal quality refers to the structure and design of the code, the things that determine how easy it is to change. External quality is correctness from the user’s point of view.

The metrics that capture internal quality include cyclomatic complexity, which measures how many independent paths exist through a function and correlates with how difficult it is to test and reason about; coupling, which measures how much a component depends on the internals of other components and predicts how far changes will ripple; cohesion, which describes how closely related the responsibilities of a module are; and code duplication, which tracks how often similar logic appears in multiple places without abstraction. These properties are things a developer who cares about maintainability pays attention to almost reflexively when writing or reviewing code. The question with AI coding agents is whether they pay attention to them too.
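The first of those metrics can be made concrete. As a rough sketch, cyclomatic complexity is one plus the number of decision points in a function; production tools like SonarQube or radon apply more nuanced rules, but the counting idea can be illustrated in a few lines of Python:

```python
import ast

# Branch-introducing AST nodes. This is a simplification: real tools also
# weigh try/except arms, match statements, and language-specific constructs.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                ast.And, ast.Or, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

simple = "def f(x):\n    return x + 1\n"
branchy = (
    "def g(x):\n"
    "    if x > 0:\n"
    "        return 'pos'\n"
    "    elif x < 0:\n"
    "        return 'neg'\n"
    "    return 'zero'\n"
)

print(cyclomatic_complexity(simple))   # 1: a single path through the function
print(cyclomatic_complexity(branchy))  # 3: the if and elif each add a path
```

Higher numbers mean more independent paths to test and hold in your head at once, which is why the metric correlates with maintenance cost.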

What the Experiment Found

CCMenu is a macOS menu bar application, written in Swift, that monitors CI/CD pipeline statuses and displays build results in the Mac menu bar. It has been actively maintained by Doernenburg for years and carries deliberate architectural decisions built up over that time. It is not a greenfield toy project, and that matters: the baseline for comparison is a real codebase with real conventions, not a blank slate.

Doernenburg used an agent to implement a concrete feature, then assessed the result against those quality dimensions. The agent shipped working code. The feature functioned. But the agent’s approach to fitting the feature into the existing codebase was less attentive than a developer fluent in the codebase’s conventions would have been. It introduced duplication where an abstraction was available and appropriate. It was not fully coherent with the existing architectural approach. The code got longer than it needed to be, carrying logic that a refactoring pass would have consolidated.

This pattern is familiar to practitioners who use these tools regularly. AI agents are optimized to produce code that satisfies the immediate specification. They are not optimized to minimize the long-term cost of maintaining that code in context.

What the Data Shows

Doernenburg’s observation is not isolated. GitClear published an analysis in early 2024 examining AI-assisted commits across a large corpus and found significant increases in code duplication correlated with the adoption of AI coding tools. Copy-paste code nearly doubled year over year in their dataset. Code churn, defined as lines added and then removed or revised within two weeks, also rose measurably.

These are leading indicators for technical debt. Duplication compounds: every time you need to change duplicated logic, you pay the cost of finding and updating all copies, and the risk of missing one. Churn suggests code that was not well understood when written, consistent with agent-generated code that is correct in isolation but misfit to context.

The mechanism makes sense. When a large language model generates code in response to a prompt, its reference for what good code looks like is statistical: patterns that appear frequently in training data. It does not have access to the specific design intentions embedded in your codebase. It cannot distinguish between a duplication that reflects an intentional separation of concerns and one that reflects a missed abstraction opportunity.

The Real Risk: Slow Degradation

The worst-case scenario here is not a single bad AI-generated function. It is the gradual degradation of a codebase’s internal quality across many agent-assisted changes.

Internal quality tends to follow a broken-windows dynamic: a small amount of disorder attracts more disorder. When the first agent-generated change introduces some duplication and slightly weakens cohesion, and that code is accepted because it works, subsequent changes build on a slightly degraded foundation. The agent’s next contribution does not have a clean codebase to work with. And further violations become harder to notice because the existing code already contains some.

This is distinct from the sudden failure mode that developers fear from AI tools. It is a slow, quiet rot that does not surface in tests or sprint reviews. It surfaces six months later when a change that should take a day takes three, and the reasons are diffuse and hard to point to.

Systematic Weaknesses Worth Naming

Looking at AI coding agents specifically, a few recurring patterns affect internal quality:

Preference for addition over modification. When a new feature requires extending or refactoring existing code, agents typically add new code rather than adapt old code. This is safe from the agent’s perspective, but worse from a design perspective: it produces duplication and coherence violations.

Naming that fits the immediate context but not the system. Good naming communicates meaning across the codebase. Agents generate plausible-sounding names, but those names emerge from the immediate prompt context, not from the system’s broader vocabulary. The result is terminology drift that makes code harder to read over time.

Missed abstraction opportunities. A developer who notices the same three-line pattern appearing for the fourth time will usually extract it into a named function. Agents do not track pattern recurrence across a session unless explicitly prompted to, and so duplication accumulates.
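A hypothetical example of that dynamic, in Python for illustration (the function and field names are invented, not from Doernenburg’s codebase): the inline version repeats the same validate-and-strip pattern per field, the way an agent often emits it one field at a time, while the refactored version names the recurring pattern once.

```python
# The pattern repeated inline, once per field, as an agent might emit it:
def register(form):
    name = form.get("name", "").strip()
    if not name:
        raise ValueError("name is required")
    email = form.get("email", "").strip()
    if not email:
        raise ValueError("email is required")
    city = form.get("city", "").strip()
    if not city:
        raise ValueError("city is required")
    return name, email, city

# The recurring pattern extracted into one named helper, as a maintainer
# noticing the repetition would typically do:
def required(form, field):
    value = form.get(field, "").strip()
    if not value:
        raise ValueError(f"{field} is required")
    return value

def register_refactored(form):
    return tuple(required(form, f) for f in ("name", "email", "city"))

form = {"name": "Ada", "email": "ada@example.com", "city": "London"}
print(register(form) == register_refactored(form))  # True: same behavior
```

The two versions are behaviorally identical; the difference is entirely internal quality, which is exactly why a review focused on correctness will not catch it.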

Accidental complexity. A developer solving a simple problem typically reaches for a simple solution. Agents have been exposed to many patterns and may apply unnecessary ones. Over-engineered solutions that are technically correct but carry more machinery than the problem requires are a consistent output.
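A contrived but representative sketch of that over-engineering, again with invented names: a strategy hierarchy and factory wrapped around a problem that has exactly one case, next to the function the problem actually requires.

```python
# Over-engineered: an abstract strategy, a concrete subclass, and a
# registry-backed factory, for a problem with a single behavior.
class GreeterStrategy:
    def greet(self, name: str) -> str:
        raise NotImplementedError

class DefaultGreeterStrategy(GreeterStrategy):
    def greet(self, name: str) -> str:
        return f"Hello, {name}!"

class GreeterFactory:
    _registry = {"default": DefaultGreeterStrategy}

    @classmethod
    def create(cls, kind: str = "default") -> GreeterStrategy:
        return cls._registry[kind]()

# The simple solution the problem requires:
def greet(name: str) -> str:
    return f"Hello, {name}!"

print(GreeterFactory.create().greet("Ada") == greet("Ada"))  # True
```

Both pass the same tests. The extra machinery is pure carrying cost until a second strategy actually exists, and often it never does.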

What Helps

None of this means AI coding agents do not belong in a software development workflow. The productivity gains are real. The question is how to capture those gains without accruing structural debt that you will be paying down for months.

Running static analysis on agent-generated code gives you an objective signal rather than a vague impression. Tools like SonarQube, CodeClimate, or language-specific linters catch measurable quality violations. Making this a routine part of reviewing agent output, not an occasional audit, prevents violations from accumulating.
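As a minimal sketch of what such a routine check can look like, the snippet below flags any four-line block that appears more than once across a set of sources. Real duplication detectors (SonarQube, PMD CPD, jscpd) match at the token level and handle renames; this line-level version only illustrates the shape of an automated, objective signal.

```python
import hashlib

WINDOW = 4  # flag any run of 4 identical non-blank lines seen twice

def duplicated_windows(sources):
    """Return (first_location, second_location) pairs of duplicated blocks.

    `sources` maps a name (e.g. a filename) to its text.
    """
    seen, dupes = {}, []
    for name, text in sources.items():
        lines = [l.strip() for l in text.splitlines() if l.strip()]
        for i in range(len(lines) - WINDOW + 1):
            key = hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
            if key in seen:
                dupes.append((seen[key], (name, i)))
            else:
                seen[key] = (name, i)
    return dupes

block = "a = load()\nb = validate(a)\nc = transform(b)\nsave(c)\n"
sources = {"old.py": block, "new.py": "setup()\n" + block}
print(len(duplicated_windows(sources)))  # 1: the shared four-line block
```

Run against every agent-generated change, a check like this turns "the new code feels repetitive" into a number that can gate a merge.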

Reviewing for architecture rather than just correctness changes the frame of the review. The natural instinct is to check whether the code does what was asked. The more important question is whether it fits the design: does it introduce a concept that should be named and shared? Does it duplicate something that already exists? Does it violate module boundaries?

Prompting explicitly for quality can shift agent output. Telling the agent not to introduce duplication, to look for existing utilities before writing new ones, or to prefer modifying existing code over adding parallel code does not guarantee good results, but it moves the distribution in the right direction. Agents respond to constraints in the prompt.

Treating internal quality as an acceptance criterion, rather than an afterthought, is the structural fix. If a feature is done when it works correctly, agents will routinely produce done code with poor internal structure. Adding explicit internal quality criteria to the definition of done makes the cost visible at the point where it is cheapest to address.

The Broader Point

Doernenburg’s experiment is the careful work of measuring something rather than asserting it. The conclusion is specific: AI coding agents introduce identifiable, systematic pressures on internal quality that require deliberate countermeasures. Whether that changes how you use these tools, or just how carefully you review their output, is a judgment call that depends on your codebase and your team.

When I use agents for features in the Discord bots I maintain, the agent ships the feature but the architecture remains my responsibility. Keeping that distinction clear matters because the agent will not maintain it. The code that is easy to understand, easy to change, and straightforward to delete in six months is not automatically the code that satisfies the prompt. That distance between working and well-structured is where developer judgment stays indispensable.
