The Internal Quality Test That Coding Agent Benchmarks Don't Run
Source: martinfowler
Most evaluations of AI coding agents measure whether the code works, framing success as: does it compile, do the tests pass, does the feature behave as described. These are reasonable first-order questions, but they miss a category of quality that matters just as much for long-lived software: internal quality.
Martin Fowler’s writing on this distinction is worth revisiting. As he lays out in Is High Quality Software Worth the Cost?, external quality is visible to users and stakeholders; internal quality is visible only to developers reading and modifying the code. High internal quality means low cyclomatic complexity, clear separation of concerns, minimal coupling between modules, and the kind of structure that lets you confidently change one thing without breaking five others. It’s the difference between code that works today and code that can be changed next quarter without a rewrite.
Erik Doernenburg, the maintainer of CCMenu, set out to get a ground-level answer to how coding agents affect this dimension of quality. CCMenu is a Mac menu bar application that polls CI/CD services and shows build status at a glance. It has been around since the CruiseControl era, making it a legitimate long-lived production codebase rather than a scaffolded demo app. Doernenburg added a feature using a coding agent and then inspected what had changed in the code’s structural properties. The article, originally published in January 2026, is a retrospective worth reading in full, but the questions it raises deserve more unpacking than the format allows.
What Internal Quality Means in Measurable Terms
Internal quality is harder to measure than external quality, but it is not unmeasurable. A few concrete indicators:
Cyclomatic complexity counts the number of independent paths through a function. A function with no branches has a complexity of 1; every conditional or loop adds 1. The standard threshold for “too complex to safely modify” is around 10. Functions above this tend to resist refactoring and accumulate bugs at edge cases.
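To make the count concrete, here is a small Swift function annotated with its complexity. The function and its cases are invented for illustration; nothing here is from CCMenu:

```swift
// Complexity starts at 1 for the single entry path;
// each `if` adds one independent path through the function.
func statusColor(failed: Bool, running: Bool, stale: Bool) -> String {
    if failed { return "red" }       // +1
    if running { return "yellow" }   // +1
    if stale { return "gray" }       // +1
    return "green"                   // total complexity: 4
}
```

A function like this is trivially safe to modify; the trouble starts when agents (or people) keep appending branches until the count drifts past 10.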
Coupling describes how many other components a module depends on. In an object-oriented codebase, high coupling between classes means that a change in one ripples unpredictably through others. Tools like SonarQube track metrics like coupling between objects and lack of cohesion in methods because they reliably predict defect rates over time.
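The standard way to keep coupling in check in Swift is to depend on a protocol rather than a concrete type. A minimal sketch with hypothetical names (CCMenu's actual types differ):

```swift
// The view model depends only on this protocol, not on any
// concrete service, so changes to a provider don't ripple into it.
protocol StatusProviding {
    func currentStatus() -> String
}

// One concrete provider; tests or other services can supply their own.
struct ServerMonitor: StatusProviding {
    func currentStatus() -> String { "green" }
}

struct StatusViewModel {
    let provider: StatusProviding
    var label: String { provider.currentStatus() }
}
```

Had StatusViewModel constructed a ServerMonitor directly, every change to the monitor's interface would force a change in the view model; the protocol boundary is exactly the kind of structure coupling metrics reward.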
Code duplication is straightforward to measure but easy to ignore. Duplicated logic is fine until the underlying behavior needs to change, at which point you discover you’ve updated it in two of three places.
These metrics are well-established. Measuring agent output against them reveals some consistent patterns.
How Coding Agents Approach Structure
Coding agents are optimized on outcomes: does the code compile, do tests pass, does the stated requirement get satisfied. They are not optimized on the internal structural properties that experienced developers cultivate through feedback that takes months or years to accumulate. An agent doesn’t carry a felt sense of “this method is getting too long” or “this class is doing too many things.”
The result is code that tends toward correctness without tending toward cleanliness. Agents will often add branches to existing functions rather than extract new abstractions. They inline logic that a human would factor out. They duplicate patterns across files when a shared utility would serve better, because the local context of the task doesn’t surface the distant parallel.
In Swift codebases like CCMenu, these tendencies express themselves in specific ways. View controllers accumulate logic that should live in model or service layers. Closures grow long rather than being extracted to named functions. Protocol conformances get satisfied with boilerplate that obscures intent. None of this is unique to agents; junior developers do the same things. The difference is that agents do it very fast, very consistently, and without the self-correction that comes from peer review and accumulated context.
Consider a simple example of the pattern. A human developer adding a network feature to a Swift app might write:
    // Extract to a dedicated service
    struct FeedParser {
        func parse(_ data: Data) throws -> [BuildStatus] {
            // Single responsibility: turn raw feed data into build statuses.
            // Decoding details elided; BuildStatus is the app's status model.
        }
    }
An agent asked to “add support for parsing build feeds” often produces the parsing logic inline inside a view model or even a view controller, because that’s where the data flow starts in the context it was given. The code works. The coupling is invisible until something else needs to use that parser.
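For contrast, here is a sketch of the inlined shape. The names and the line-based feed format are invented for illustration; the point is only where the parsing logic lands:

```swift
import Foundation

struct BuildStatus {
    let name: String
    let passing: Bool
}

// Anti-pattern sketch: parsing logic lives inside the view model
// because that's where the data flow starts. It works, but nothing
// else in the codebase can reuse or test the parser in isolation.
final class FeedViewModel {
    var statuses: [BuildStatus] = []

    func handle(_ data: Data) {
        // Hypothetical feed format: one "name:result" entry per line.
        guard let text = String(data: data, encoding: .utf8) else { return }
        statuses = text.split(separator: "\n").map { line in
            let parts = line.split(separator: ":")
            return BuildStatus(name: String(parts[0]),
                               passing: parts.count > 1 && String(parts[1]) == "pass")
        }
    }
}
```

Extracting the closure's body into a dedicated parser type is a five-minute refactoring for a human reviewer, but it only happens if someone notices the coupling before the next feature builds on it.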
Why This Matters for Real-World Adoption
The argument for using coding agents is usually framed in terms of velocity: features ship faster, boilerplate disappears, and developers can focus on higher-order problems. This argument is partly correct, but it defers a cost rather than eliminating it. Code written without structural care accumulates complexity, and that complexity compounds. A codebase with a dozen small structural debts is not twelve times harder to work in than a clean one; it tends to be significantly harder, because the debts interact in ways that aren’t predictable from inspecting them individually.
Doernenburg’s approach with CCMenu is instructive precisely because it takes a codebase with real history and measures what one agent-assisted feature does to it. The finding matters more in a real project than it would in a synthetic benchmark, because real codebases already carry their own structural baggage. Adding agent-generated complexity on top of existing complexity has non-linear effects that you won’t see in a benchmark suite where each task starts from a clean slate.
What You Can Do About It
The practical response is not to avoid coding agents; it’s to be explicit about what they need guidance on. A few approaches that have emerged from teams working with agents regularly:
Treat agent output as a first draft. Review it the same way you’d review output from a developer unfamiliar with the codebase, because that’s functionally what you’re getting. The agent doesn’t know your abstractions, your conventions, or the patterns you’ve been careful to maintain.
Write prompts with structural constraints, not just functional ones. Instead of “add a feature that does X,” try “add a feature that does X, extracting the parsing logic into a new struct and keeping the existing view model under 150 lines.” Agents respond to constraints; they just don’t generate them on their own.
Run static analysis before and after agent sessions. Tools like SwiftLint with complexity rules, or lizard for cross-language complexity analysis, can surface structural regressions before they accumulate. A diff in metrics is easier to act on than a vague sense that the code feels harder to follow.
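As a concrete starting point, SwiftLint's complexity-related rules can be tuned in a project's .swiftlint.yml. The thresholds below are illustrative, not recommendations; a team should pick numbers its existing code mostly satisfies and ratchet them down over time:

```yaml
# .swiftlint.yml (illustrative thresholds)
cyclomatic_complexity:
  warning: 10
  error: 15
function_body_length:
  warning: 50
type_body_length:
  warning: 250
file_length:
  warning: 400
```

For cross-language measurement, lizard offers the same kind of gate via its complexity threshold flag, e.g. `lizard -l swift --CCN 10 Sources/`, which reports only functions above the given cyclomatic complexity.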
Pair agent sessions with refactoring passes. Let the agent produce working code, accept it, then immediately spend time on the structural properties. This separates the “does it work” problem from the “is it maintainable” problem, which helps clarify where the real effort lives and makes the cost of agent-generated debt visible rather than hidden.
The Broader Point
CCMenu is a good test case because Doernenburg has maintained it long enough to know its structure well and evaluate changes against a clear baseline. Most teams don’t have that baseline, which makes the problem harder to see and easier to dismiss. The code looks reasonable, the tests pass, the feature ships. It’s only later, when a different feature requires touching the same area, that the debt becomes visible, and by then it’s usually attributed to the complexity of the domain rather than to the decisions made months earlier.
The challenge for teams integrating coding agents is that the feedback loops for internal quality are long and the signals are diffuse. Velocity metrics improve immediately; maintainability costs arrive later and get attributed to other causes. Building structural review in as a first-class part of the agent-assisted workflow, rather than treating it as optional cleanup, is the practical way to keep the long-term trajectory of a codebase from slowly diverging from its short-term productivity gains. Doernenburg's experiment with CCMenu is a useful reminder that those two things can diverge, and that noticing the divergence requires intentional measurement rather than just shipping features and hoping for the best.