
The Part of Code Quality That AI Coding Benchmarks Keep Missing

Source: martinfowler

Most evaluations of AI coding assistants measure the same things: does the code compile, do the tests pass, does the feature behave correctly. These are worth measuring, but they only capture one dimension of quality. Erik Doernenburg’s experiment using a coding agent to add a feature to CCMenu, published in January 2026 on Martin Fowler’s site, takes a different angle: what does the agent do to the internal quality of the code, the part that doesn’t show up in test results?

This is the right question, and the reason more benchmark work hasn’t tackled it is that it’s substantially harder to measure.

Internal Quality Is Not a Soft Concern

Martin Fowler has written at length about the distinction between external and internal software quality. External quality is what users observe: does the feature work, is the application fast, does it handle errors gracefully. Internal quality is what engineers work with every day: is the code readable, are the abstractions coherent, are module boundaries respected, is the naming consistent with the domain.

Internal quality is what determines the long-term economics of a codebase. A codebase with good internal structure makes each successive change cheaper to implement; a codebase where internal quality has been ignored makes each change increasingly expensive as engineers work around accumulated inconsistencies and unclear ownership. The effect compounds in both directions.

The standard metrics for internal quality (cyclomatic complexity, afferent and efferent coupling, lack of cohesion in methods, code duplication ratios) capture gross violations but not the subtler patterns that experienced engineers recognize. A function can have acceptable complexity scores while still being named wrong, placed in the wrong module, and structured in a way that will confuse anyone who reads it six months later.
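To make the limits of these metrics concrete, here is a minimal sketch of how afferent coupling, efferent coupling, and the derived instability ratio (efferent divided by the sum of both) can be computed from a dependency map. The module names and the map itself are hypothetical and are not taken from CCMenu. The numbers describe the shape of the dependency graph and nothing else: they say nothing about whether a type is named well or lives in the right module.

```python
from collections import defaultdict

# Hypothetical dependency map: module -> modules it imports.
DEPENDENCIES = {
    "views": {"model", "network"},
    "network": {"model"},
    "model": set(),
}


def coupling(dependencies):
    """Return {module: (afferent, efferent, instability)}.

    afferent  = number of modules that depend on this module
    efferent  = number of modules this module depends on
    instability = efferent / (afferent + efferent), 0.0 if isolated
    """
    afferent = defaultdict(int)
    for targets in dependencies.values():
        for target in targets:
            afferent[target] += 1

    result = {}
    for module, targets in dependencies.items():
        ca = afferent[module]
        ce = len(targets)
        total = ca + ce
        result[module] = (ca, ce, ce / total if total else 0.0)
    return result
```

In this toy map, `model` comes out fully stable and `views` fully unstable, which is what a layered design would want. A misnamed or misplaced type inside any of the three modules would leave every number unchanged.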

Why CCMenu Is a Good Test Subject

CCMenu is a macOS menu bar application that shows the status of CI/CD pipelines. It connects to build servers and pipeline APIs, including GitHub Actions, Jenkins, CircleCI, and others, and surfaces their build states directly in the menu bar without requiring a browser. Doernenburg has maintained it for years, and CCMenu2, the current Swift rewrite, has a deliberate architecture that reflects decisions accumulated over that maintenance period.

The choice of subject matters here. Most AI coding evaluations use either synthetic benchmarks designed to have clean answers, or greenfield projects where the agent is writing code without having to fit into an existing structure. Neither reflects the conditions under which most professional software development happens. Most real development is additive: you’re adding a feature to a codebase that already has opinions about how things should be structured.

Adding a feature to CCMenu is not just an exercise in generating correct Swift. It’s an exercise in generating Swift that fits: that uses the existing networking abstractions rather than reinventing them, that places new types in the modules that own the relevant domain concepts, that follows the naming conventions already established in the codebase. An agent that produces code which passes tests but violates these structural conventions has done half the job.

What Agents Optimize For

Current coding agents, including Claude, GitHub Copilot, Cursor, and Aider, are very good at producing code that works. Given a clear feature specification and relevant context, they’ll generate implementations that compile and satisfy the described behavior. What they aren’t directly trained to optimize for is structural coherence with an existing codebase.

This manifests in a few predictable patterns:

Duplication over abstraction. If a new feature requires logic similar to existing logic the agent has seen in context, the agent may duplicate rather than extract. The feature works; the tests pass; the codebase now has two places that need to be updated when the underlying logic changes.

Vocabulary drift. A mature codebase uses domain vocabulary deliberately. The type names, method names, and variable names encode assumptions about the domain that were worked out over time. An agent may produce code that uses slightly different vocabulary, locally consistent but globally misaligned with the conventions it didn’t see in context.

Misplaced ownership. An agent implementing a feature will place the code somewhere. That somewhere is often the module most immediately visible in the context window, not necessarily the module that conceptually owns the relevant abstraction. The feature works from the outside; the module boundaries have quietly become less meaningful.

Layer leakage. A codebase may have a strict separation between, say, its network layer, its domain model, and its view layer. An agent might produce an implementation that technically functions but introduces dependencies that violate these boundaries, because nothing in the test suite enforces architectural constraints.

None of these produce failing tests. All of them degrade the internal quality of the codebase over time, and the degradation accelerates because each new piece of agent-written code is working in a codebase that has already drifted from its intended structure.
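Of the four patterns, vocabulary drift is probably the one most amenable to a cheap mechanical check. The sketch below assumes a hand-maintained glossary that maps each established domain term to synonyms that would signal drift. The glossary entries and the function names are hypothetical and do not reflect CCMenu's actual vocabulary. It only catches synonyms someone thought to list, so it supplements a maintainer's review and does not replace it.

```python
import re

# Hypothetical glossary: established domain term -> synonyms that signal drift.
GLOSSARY = {
    "pipeline": {"workflow", "job"},
    "feed": {"endpoint"},
}


def split_identifier(name):
    """Split camelCase, PascalCase, or snake_case into lowercase words."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [word.lower() for word in words]


def vocabulary_drift(identifiers, glossary=GLOSSARY):
    """Map each drifting identifier to (found word, preferred term) pairs."""
    preferred = {
        synonym: term
        for term, synonyms in glossary.items()
        for synonym in synonyms
    }
    findings = {}
    for identifier in identifiers:
        hits = [
            (word, preferred[word])
            for word in split_identifier(identifier)
            if word in preferred
        ]
        if hits:
            findings[identifier] = hits
    return findings
```

Run over the identifiers introduced by a change, this would flag a name like `fetchWorkflowStatus` in a codebase that has settled on "pipeline", while leaving `pipelineName` alone.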

The Context Window as an Architectural Horizon

There’s a structural reason agents tend to produce locally coherent but globally inconsistent code. Codebases are large; context windows are finite. When an agent adds a feature to CCMenu, it’s working with the files most immediately relevant to the task. It may not see the abstraction defined in a module it wasn’t pointed at, the naming convention established in a file it hasn’t read, or the architectural principle documented in a design note that nobody thought to include in the prompt.

Some agent scaffolding tries to address this through retrieval: before generating code, the agent searches the codebase for patterns relevant to the task. This helps, but it’s a partial solution. The patterns that most need to be respected are often the least obviously retrievable. They’re implicit in the structure of the codebase, in what types exist and how they relate, in what the module structure says about conceptual ownership. Keyword search finds explicit text; it doesn’t find implicit architectural intent.

The practical consequence is that using an agent to add a feature to a mature codebase requires more upfront specification than using an agent on a fresh project. You need to tell the agent which abstractions to use, which modules own which concepts, and which boundaries must be respected. This adds real overhead to the workflow. The overhead buys you a better chance at structural coherence, but it partially erodes the productivity gain the agent was supposed to provide.

The Maintainer’s Review as the Missing Tool

Doernenburg’s methodology, adding a feature with an agent and then assessing what happened to the code, is the right one for this kind of quality assessment. The maintainer brings exactly what automated tools lack: contextual knowledge of what the code was trying to say before the agent touched it.

This is also why such experiments are difficult to generalize. The result depends on how idiosyncratic the codebase is, how successfully the agent inferred existing conventions from the context it was given, and how precisely the feature was specified. A maintainer with deep knowledge of their codebase will notice structural drift that a tool like SonarQube or even a diligent reviewer unfamiliar with the project would miss.

For a project like CCMenu, this points toward a workflow where agent-generated code is treated as a draft rather than a commit. The agent’s output is useful, sometimes very useful, but it goes through a review step where the maintainer specifically asks not just whether the feature works but whether the code fits. This review is cheaper than writing the feature from scratch, but it’s not free.

The Tooling Gap

What’s largely absent from current AI coding workflows is tooling designed specifically to assess structural fit. We have good tooling for external quality: test runners, type checkers, fuzzers. We have partial tooling for internal quality: linters, complexity analyzers, duplication detectors. We don’t have good automated tooling for what might be called architectural conformance: does this new code respect the conventions and boundaries of the existing codebase?

Fitness functions, from the evolutionary architecture literature, are one partial answer. You can write automated checks that enforce specific architectural properties, like preventing imports between layers that shouldn’t communicate, and run them in CI. But fitness functions require knowing in advance which properties matter, and they catch violations after the fact rather than guiding the agent during generation.
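A fitness function of the kind described here can be very small. The sketch below uses Python's standard `ast` module to list the top-level modules a source file imports and compares them against a table of forbidden dependencies. The layer names and the rule table are assumptions made for illustration. The check works on Python source only, so a Swift project such as CCMenu would need a Swift-aware equivalent.

```python
import ast

# Hypothetical layering rule: layer -> layers it must not import.
FORBIDDEN = {
    "model": {"network", "views"},
    "network": {"views"},
}


def imported_top_level_modules(source):
    """Return the top-level module names imported by a piece of source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they stay inside the package.
            if node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names


def layer_violations(layer, source):
    """Return the forbidden layers that code in `layer` imports, sorted."""
    forbidden = FORBIDDEN.get(layer, set())
    return sorted(imported_top_level_modules(source) & forbidden)
```

In CI, a wrapper would read each file, decide which layer it belongs to from its path, and fail the build if `layer_violations` returns anything. The limitation noted above still holds: the rule table has to exist before the agent runs, and the check only reports a violation after the code has been written.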

A more promising direction is probably tighter integration between agents and architecture-aware analysis tools. Tools like CodeScene, which analyze structural patterns and identify hotspots, could in principle feed into an agent’s context in a way that makes it more likely to produce conformant output. The agent could be given explicit information about module ownership, dependency conventions, and naming patterns derived from the existing codebase before it generates anything.

This is still largely unexplored territory in production tooling. Most agent integrations provide file access and search; they don’t provide architectural context in a structured, queryable form.
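As a rough illustration of what "structured, queryable" could mean, the sketch below models module ownership and allowed dependencies as plain data, with one lookup method and one method that renders the constraints as text for an agent's context. All class names, fields, and the example layout are hypothetical. This is not an existing tool's API, and it is not CCMenu's actual module structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleInfo:
    name: str
    owns: tuple           # domain concepts this module is responsible for
    may_depend_on: tuple  # modules it is allowed to import


@dataclass(frozen=True)
class ArchitectureContext:
    modules: tuple

    def owner_of(self, concept):
        """Return the name of the module that owns a concept, or None."""
        for module in self.modules:
            if concept in module.owns:
                return module.name
        return None

    def to_prompt(self):
        """Render the constraints as plain text for an agent's context."""
        lines = ["Architectural constraints:"]
        for module in self.modules:
            owns = ", ".join(module.owns) or "nothing"
            deps = ", ".join(module.may_depend_on) or "no other module"
            lines.append(f"- {module.name} owns {owns}; may depend on {deps}.")
        return "\n".join(lines)


# Hypothetical layout used only for this example.
CONTEXT = ArchitectureContext(modules=(
    ModuleInfo("model", ("pipeline", "build status"), ()),
    ModuleInfo("network", ("feed reader",), ("model",)),
    ModuleInfo("views", ("menu item",), ("model",)),
))
```

The same data could serve both purposes discussed in this section: `to_prompt()` gives the agent the conventions before it generates anything, and `owner_of()` gives a reviewer or a CI check a way to ask where a new type was supposed to go.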

What the Experiment Reveals About the Longer Arc

There’s a compounding dynamic worth noting. Internal quality is what makes a codebase hospitable to future work, whether that future work is done by humans or agents. A codebase with clear abstractions, consistent naming, and well-enforced module boundaries is a codebase that agents can reason about more effectively when the next feature comes along. A codebase where internal quality has been allowed to drift is one where the agent’s context is noisier, the conventions are harder to infer, and the structural mistakes from previous sessions compound with new ones.

This means that how carefully internal quality is maintained under agent-assisted development has implications not just for human maintainability but for the effectiveness of future agent assistance. Doernenburg’s experiment is asking a narrow question about a specific feature in a specific codebase, but the question it’s really probing is whether AI-assisted development is a practice that sustains itself or one that degrades the conditions for its own future effectiveness.

That’s worth knowing, and it’s a question that benchmark suites, however comprehensive, aren’t set up to answer.
