Erik Doernenburg’s experiment on CCMenu, published in January 2026, is one of the more careful pieces of empirical work we have on what AI coding agents actually do to a codebase over time. It looks not at a single method or a single test, but at the internal structure of a real project with a real author who has ground truth on every design decision.
The answer, measured with concrete metrics rather than impressions, is that coding agents are good at the wrong level of quality.
The Two Kinds of Quality That Don’t Move Together
External quality is what tests measure: does the software behave as specified? Internal quality is structural: is the code organized so that future changes stay cheap? These are independent dimensions. A class can be correct and cohesionless. A module can pass every test and still have coupling that makes it a maintenance hazard.
The distinction matters because AI agents have very different capabilities against each dimension. A model trained on billions of lines of code has seen every common algorithm and can produce correct implementations. It has also seen every common structural problem, and it reproduces those too, because it was never trained to optimize against them.
What makes the structural problems dangerous is that standard review processes mostly cannot see them. A pull request reviewer checking whether a feature works will often approve code that is subtly degrading the architecture. The individual method might be clean. The class it was added to might now have six responsibilities where it had three.
CCMenu2 as a Test Subject
CCMenu is a macOS application that has been showing CI/CD build status in the menu bar since the early CruiseControl era. CCMenu2 is the current Swift/SwiftUI rewrite. It is a good test subject for exactly the reasons that make most AI benchmarks poor ones: it is small enough to analyze exhaustively, real enough to have genuine design pressures, and Doernenburg is the sole author, so he can identify when AI suggestions diverge from intended design without needing to reconstruct intent from history.
Doernenburg used a coding agent to add a feature, then applied Robert C. Martin’s package-level metrics to measure what changed structurally.
The Metrics That Make Structure Legible
Martin’s metrics, formalized in Agile Software Development: Principles, Patterns, and Practices, give quantitative handles on package-level design. The core ones:
- Afferent Coupling (Ca): classes outside a package that depend on classes inside it. Measures how much responsibility the package has to others.
- Efferent Coupling (Ce): classes inside a package that depend on classes outside. Measures how dependent the package is on the world.
- Instability (I): I = Ce / (Ca + Ce). Ranges 0 to 1. High instability means easy to change but unreliable; low instability means stable but hard to modify.
- Abstractness (A): ratio of abstract types to total types in the package.
- Distance from Main Sequence (D): D = |A + I - 1|. Good packages sit near zero, on the diagonal between purely abstract/stable and purely concrete/unstable.
Packages that drift off this diagonal fall into two failure modes. The zone of uselessness is highly abstract and unstable: abstractions that exist but that nothing depends on. The zone of pain, the pattern AI agents induce, is highly concrete and stable. The package is full of implementation details that other code depends on directly. It is almost impossible to change safely, because anything touching it risks breaking downstream consumers.
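The formulas and the two zones can be sketched in a few lines of Swift. This is a minimal illustration, not the tooling Doernenburg used: the `PackageMetrics` type, the sample counts, the convention that an uncoupled package has instability 0, and the 0.3 tolerance are all assumptions made for the example.

```swift
// Hypothetical per-package counts; how they are collected is outside this sketch.
struct PackageMetrics {
    let afferent: Int       // Ca: outside classes that depend on this package
    let efferent: Int       // Ce: classes in this package that depend on outside classes
    let abstractTypes: Int  // protocols and abstract classes
    let totalTypes: Int

    // I = Ce / (Ca + Ce). Assumption: treated as 0 when there is no coupling at all.
    var instability: Double {
        let total = afferent + efferent
        return total == 0 ? 0 : Double(efferent) / Double(total)
    }

    // A = abstract types / total types
    var abstractness: Double {
        return totalTypes == 0 ? 0 : Double(abstractTypes) / Double(totalTypes)
    }

    // D = |A + I - 1|
    var distance: Double {
        return abs(abstractness + instability - 1)
    }
}

enum Zone { case mainSequence, pain, uselessness }

// The tolerance is an arbitrary illustrative threshold, not a standard value.
func zone(of package: PackageMetrics, tolerance: Double = 0.3) -> Zone {
    if package.distance <= tolerance { return .mainSequence }
    // Below the diagonal (A + I < 1): concrete and stable. Above it: abstract and unstable.
    return package.abstractness + package.instability < 1 ? .pain : .uselessness
}

// Made-up numbers: nine outside dependents, one outgoing dependency, no abstract types.
let monitoring = PackageMetrics(afferent: 9, efferent: 1, abstractTypes: 0, totalTypes: 6)
print(monitoring.instability, monitoring.distance, zone(of: monitoring))
```

With these sample counts, I = 1 / 10 = 0.1 and A = 0, so D = |0 + 0.1 - 1| = 0.9, and the package lands in the zone of pain.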
Why Agents Push Toward the Zone of Pain
The failure mode is not accidental. It emerges from a specific optimization the agent is performing.
When adding a feature, the agent looks for the most relevant existing class and extends it. This is locally rational: the class already has the imports, the domain context, and the test file. The agent minimizes its own reasoning cost by adding rather than decomposing. Over multiple sessions, a central class accumulates responsibilities.
Consider what this looks like in a Swift codebase over time:
    // After initial design
    class BuildMonitor {
        func checkStatus(for pipeline: Pipeline) -> BuildStatus { ... }
        func scheduleNextCheck(for pipeline: Pipeline) { ... }
    }

    // After two AI-assisted feature additions
    class BuildMonitor {
        func checkStatus(for pipeline: Pipeline) -> BuildStatus { ... }
        func scheduleNextCheck(for pipeline: Pipeline) { ... }
        func parseGitHubActionsResponse(_ data: Data) -> [Pipeline] { ... }
        func formatNotificationTitle(for build: Build) -> String { ... }
        func shouldShowBadge(for status: BuildStatus) -> Bool { ... }
        func persistLastKnownStatus(_ status: BuildStatus, for pipeline: Pipeline) { ... }
    }
None of these additions is wrong in isolation. Each one passes tests. The method-level cyclomatic complexity stays low, because agents do write small methods. But the class now owns parsing, formatting, notification logic, badge display, and persistence alongside its original monitoring responsibility. Its afferent coupling grows as other code starts calling the new methods. As Ca rises relative to Ce, its instability falls toward zero. It has become a load-bearing wall.
This is what Doernenburg measured: individual method complexity staying acceptable while class-level and package-level metrics degraded.
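For contrast, here is one way the same responsibilities might be split so that the monitor stays narrow. It is a sketch under stated assumptions, not CCMenu2's actual design: `StatusStore`, `InMemoryStatusStore`, `BadgePolicy`, the injected `fetch` closure, and the rule that only failures show a badge are all hypothetical.

```swift
// All types here are hypothetical stand-ins, not CCMenu2's real code.
enum BuildStatus { case success, failure, unknown }

struct Pipeline: Hashable { let name: String }

// Persistence sits behind a protocol, so the monitor depends on an abstraction.
protocol StatusStore {
    func save(_ status: BuildStatus, for pipeline: Pipeline)
    func lastKnownStatus(for pipeline: Pipeline) -> BuildStatus?
}

final class InMemoryStatusStore: StatusStore {
    private var statuses: [Pipeline: BuildStatus] = [:]

    func save(_ status: BuildStatus, for pipeline: Pipeline) {
        statuses[pipeline] = status
    }

    func lastKnownStatus(for pipeline: Pipeline) -> BuildStatus? {
        return statuses[pipeline]
    }
}

// Presentation rules live in their own type. Assumption: only failures show a badge.
struct BadgePolicy {
    func shouldShowBadge(for status: BuildStatus) -> Bool {
        return status == .failure
    }
}

// The monitor keeps its monitoring responsibility and delegates the rest.
final class BuildMonitor {
    private let store: StatusStore
    private let fetch: (Pipeline) -> BuildStatus

    init(store: StatusStore, fetch: @escaping (Pipeline) -> BuildStatus) {
        self.store = store
        self.fetch = fetch
    }

    func checkStatus(for pipeline: Pipeline) -> BuildStatus {
        let status = fetch(pipeline)
        store.save(status, for: pipeline)
        return status
    }
}

// Usage with a stubbed fetch that always reports a failing build.
let store = InMemoryStatusStore()
let monitor = BuildMonitor(store: store) { _ in BuildStatus.failure }
let status = monitor.checkStatus(for: Pipeline(name: "main"))
print(BadgePolicy().shouldShowBadge(for: status))
```

In this arrangement, callers that only care about badges or stored statuses depend on `BadgePolicy` or `StatusStore` rather than on `BuildMonitor`, which keeps the monitor's afferent coupling from growing with every feature.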
The Structural Reasoning That LLMs Don’t Have
Architectural judgment requires reasoning about what a system should become, not just what it currently is. It requires understanding which dependencies are acceptable and which represent design debt. It requires the kind of intentionality that comes from having made the original design decisions, or having spent time understanding why they were made.
An LLM’s context window contains the existing code, the task description, and any retrieved examples. What it does not contain is the coupling budget for a module, the implicit rules about which responsibilities belong where, or the design decisions made six months ago that are not written down anywhere. CCMenu2 is a good test subject precisely because Doernenburg holds this context and the agent does not. When he reviews the diff, he can see that a new method is in the wrong class, not because it fails to compile, but because it belongs to a responsibility that should be separate.
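One way to make part of that unwritten context explicit is to record the permitted dependency directions as data and check observed dependencies against them. The sketch below only illustrates the idea: the module names, the rule table, and the `observed` edges are invented, and a real project would have to derive the observed dependencies from its own build graph or import statements.

```swift
// Hypothetical modules and rules: each key may depend only on the listed modules.
let allowedDependencies: [String: Set<String>] = [
    "Monitoring": ["Model"],
    "Parsing": ["Model"],
    "Notifications": ["Model", "Monitoring"],
    "Model": [],
]

struct Violation: Equatable {
    let from: String
    let to: String
}

// Returns every observed dependency that the rule table does not permit.
// Modules missing from the table are treated as having no permitted dependencies.
func violations(in observed: [(from: String, to: String)],
                allowed: [String: Set<String>]) -> [Violation] {
    return observed.compactMap { edge -> Violation? in
        let permitted = allowed[edge.from] ?? []
        return permitted.contains(edge.to) ? nil : Violation(from: edge.from, to: edge.to)
    }
}

// Invented example: a feature change makes Monitoring reach into Notifications.
let observed = [
    (from: "Monitoring", to: "Model"),
    (from: "Monitoring", to: "Notifications"),
]
let found = violations(in: observed, allowed: allowedDependencies)
print(found)
```

A table like this captures only dependency direction. It says nothing about which responsibility belongs in which class, so it supplements the author's judgment rather than replacing it.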
The GitClear 2024 analysis of GitHub code changes found corroborating evidence at scale: after widespread AI coding assistant adoption, code churn increased and code moves (refactoring) decreased. Developers were producing more code and restructuring less. The structural implications of AI-generated additions were accumulating faster than teams were processing them.
Measuring What Standard Review Misses
The practical implication of the CCMenu experiment is not that you should stop using coding agents. It is that the standard review process was already inadequate for catching structural degradation, and AI agents make that inadequacy more consequential.
A few things change when you take this seriously:
Separate correctness review from structural review. These are different cognitive tasks. A reviewer checking functional behavior is not doing a structural review. Structural review requires explicitly thinking about coupling, cohesion, and dependency direction, and it requires someone who understands the intended design well enough to evaluate whether additions respect it.
Use metric tools that surface drift. SonarQube tracks cyclomatic complexity, coupling between objects, and weighted methods per class. Structure101 gives package-level coupling analysis and can visualize dependency tangles before they become severe. Code Climate tracks maintainability and duplication over time. None of these tools replaces architectural judgment, but they make degradation measurable rather than impressionistic.
Watch for class growth, not just method complexity. The AI-induced signature is low method-level complexity alongside growing class size and rising afferent coupling on central classes. If a class is gaining new methods across AI-assisted sessions while its Ca climbs, it is moving toward the zone of pain.
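That signature is simple enough to check mechanically once per-class numbers are recorded over time. The following is a rough sketch: `ClassSnapshot`, the sample history, and the complexity ceiling of 10 are assumptions for illustration, and the numbers would come from whatever metrics tool the project already runs.

```swift
// Hypothetical snapshot of one class at one point in its history.
struct ClassSnapshot {
    let methodCount: Int
    let afferentCoupling: Int      // Ca
    let maxMethodComplexity: Int   // highest cyclomatic complexity of any method
}

// Flags the pattern described above: method count and Ca both rise between the
// first and last snapshot while every snapshot keeps per-method complexity at
// or below the ceiling. The ceiling of 10 is illustrative, not a recommendation.
func showsDrift(_ history: [ClassSnapshot], complexityCeiling: Int = 10) -> Bool {
    guard history.count > 1,
          let first = history.first,
          let last = history.last else {
        return false
    }
    let methodsGrew = last.methodCount > first.methodCount
    let couplingGrew = last.afferentCoupling > first.afferentCoupling
    let methodsStaySimple = history.allSatisfy { $0.maxMethodComplexity <= complexityCeiling }
    return methodsGrew && couplingGrew && methodsStaySimple
}

// Made-up history for a central class across three AI-assisted sessions.
let buildMonitorHistory = [
    ClassSnapshot(methodCount: 2, afferentCoupling: 3, maxMethodComplexity: 4),
    ClassSnapshot(methodCount: 4, afferentCoupling: 5, maxMethodComplexity: 4),
    ClassSnapshot(methodCount: 6, afferentCoupling: 9, maxMethodComplexity: 5),
]
print(showsDrift(buildMonitorHistory))
```

A check like this only compares the endpoints of the history, so it is a prompt for a human structural review, not a verdict on the design.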
Build in structural checkpoints. A refactoring session after every few AI-assisted features is not overhead. It is the maintenance cost of using a tool that optimizes at the wrong level of abstraction. The speed gains from coding agents are real, but they come partly at the expense of architectural work the agent is not equipped to do.
The Underlying Tension
Martin Fowler has argued for decades that internal quality is the primary driver of whether a codebase can be changed cheaply over time. Teams that treat it as a luxury trade short-term velocity for compounding long-term cost.
AI coding agents do not change this dynamic. They accelerate the production of code without improving, and often while degrading, the structural properties that determine how expensive future changes will be. The CCMenu experiment makes this concrete with specific metrics on a specific codebase, which is more useful than the usual vague warnings about technical debt.
The experiment also illustrates why experience matters more, not less, when using these tools. A developer who cannot evaluate structural quality is in a poor position to catch what the agent is silently getting wrong. The output looks fine. The methods are short. The tests pass. The architecture is drifting.