
What a Mature Codebase Reveals When an Agent Writes the Feature

Source: martinfowler

In January 2026, Erik Doernenburg published his assessment of what happens to internal code quality when you let a coding agent implement a feature, using his own project as the test subject. CCMenu is a Mac application he maintains that displays CI/CD build status in the menu bar, reading the cctray.xml format that originated with CruiseControl and is now supported by most CI systems. It is a relatively small but genuinely mature codebase, one with real history, real conventions, and real architectural decisions behind it. That makes it a much better test environment than a greenfield project or a toy app.

The central question Doernenburg was investigating is one that gets less attention than it deserves: not whether the coding agent could produce working code, but what that code looks like from the inside.

External versus Internal Quality

The distinction between external and internal code quality is one of those things that sounds obvious once stated but gets collapsed constantly in practice. External quality is what users experience: does the feature work, is it correct, does it handle edge cases. Internal quality is what developers experience: is the code readable, does it follow the conventions of the surrounding codebase, is it well-structured, is it testable, does it respect existing abstractions.

AI coding agents are, at this point, reasonably good at external quality. They can implement a feature that passes tests and handles the specified cases. The concern Doernenburg investigated is whether they are comparably good at internal quality, specifically in the context of a codebase they are navigating rather than one they generated from scratch.

This matters because internal quality is what makes a codebase maintainable over time. A feature that works but is coupled to the wrong abstractions, ignores existing conventions, or bypasses the patterns the rest of the codebase uses does not just create localized debt. It creates divergence. Each subsequent change now has two paths to consider: the path that follows the original conventions and the path that follows whatever the agent introduced. That compounds.

What Quality Looks Like in a Swift Codebase

CCMenu is written in Swift with SwiftUI, which gives you a specific set of internal quality concerns. Swift has its own idioms around value types versus reference types, protocol-oriented design, and the use of property wrappers and the @Observable macro in modern code. A codebase that uses these consistently has a certain shape. Code that ignores them does not look obviously broken, but it looks foreign.

Some of the internal quality markers worth watching in a Swift project:

  • Protocol conformance patterns: does new code define its own abstractions or use existing protocols?
  • Optionality handling: are optional chaining and guard let used consistently, or does new code introduce force unwraps where the codebase convention avoids them?
  • Test structure: does new code come with tests that follow the existing XCTest conventions, or are tests missing or structured differently?
  • Type design: does new code introduce new types where existing types should serve, or add properties to existing types where a separate concern would be cleaner?
  • Error propagation: does new code use throws and Result consistently with the surrounding code?

These are not issues that break anything immediately. They are issues that accumulate. A codebase where some code uses one pattern and other code uses another is harder to reason about than one that is internally consistent, even if neither pattern is inherently better.
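
To make the optionality and error-propagation markers concrete, here is a minimal sketch of the same lookup written in two styles. The BuildStatus and StatusParseError types and both function names are hypothetical, invented for illustration and not taken from CCMenu. Neither style is wrong in isolation; the point is how the first would read if it landed in a codebase that consistently uses the second.

```swift
// Hypothetical types for illustration only; not taken from CCMenu.
enum BuildStatus: String {
    case success = "Success"
    case failure = "Failure"
}

enum StatusParseError: Error, Equatable {
    case unknownStatus(String)
}

// Style A: force unwrap. Works for known input, traps at runtime on anything else.
func parseStatusForceUnwrapped(_ raw: String) -> BuildStatus {
    BuildStatus(rawValue: raw)!
}

// Style B: guard let plus throws. Failure becomes a value the caller must handle.
func parseStatus(_ raw: String) throws -> BuildStatus {
    guard let status = BuildStatus(rawValue: raw) else {
        throw StatusParseError.unknownStatus(raw)
    }
    return status
}

// Usage: the misspelled input takes the catch branch instead of crashing.
do {
    let status = try parseStatus("Sucess")
    print("Parsed:", status)
} catch {
    print("Could not parse status:", error)
}
```

Both functions compile and both pass a happy-path test, which is why this kind of divergence tends to surface in review rather than in CI.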

The Pattern Agents Tend Toward

Coding agents navigate a codebase through retrieval. They read relevant files, identify patterns, and generate code that fits within what they saw. The quality of that fit depends heavily on what the agent saw and how it weighted it. In a well-structured codebase with clear conventions, a capable agent will often produce code that looks like it belongs. In a codebase with mixed patterns, the agent may pick up the wrong one. And there is a more subtle failure mode: the agent may produce code that is coherent locally but a poor fit globally, following patterns from the files it read without accounting for the broader architectural intent those patterns were serving.

One specific concern that comes up in real experiments is test quality. Agents will generate tests, but those tests tend to exercise implementation details rather than behavior. They assert on internal state, mock at the wrong layer, or test things that are better covered at a different level. The tests pass, the coverage number goes up, but the tests are not actually doing what a thoughtful developer would want them to do. They do not document intent. They break on refactoring. They create the appearance of quality without the substance.
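
The difference between the two kinds of test can be sketched in a few lines. StatusHistory below is a hypothetical type made up for this example, not something from CCMenu. Plain assert calls stand in for the XCTest assertions a real project would use, so the snippet runs on its own.

```swift
// Hypothetical type for illustration only; not taken from CCMenu.
struct StatusHistory {
    // Internal storage: an implementation detail that may change later.
    private(set) var entries: [Bool] = []

    mutating func record(success: Bool) {
        entries.append(success)
    }

    // Public behavior: number of consecutive failures at the end of the history.
    var failureStreak: Int {
        entries.reversed().prefix(while: { !$0 }).count
    }
}

var history = StatusHistory()
history.record(success: true)
history.record(success: false)
history.record(success: false)

// Implementation-coupled check: pins the exact storage layout.
// It fails if entries is later replaced by, say, a running counter.
assert(history.entries == [true, false, false])

// Behavior check: states what a caller relies on and survives that refactoring.
assert(history.failureStreak == 2)
```

Both assertions pass today. Only the second one would still pass, and still say something useful, after the storage is redesigned.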

Another concern is coupling. When an agent implements a feature, it frequently takes the path of least resistance through the codebase, passing objects or reaching for globals where a cleaner solution would introduce an abstraction. The code works, but it increases coupling in ways that are not visible in the immediate diff. That coupling is debt, and it is the kind of debt that makes future changes harder without being traceable to any single bad decision.
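
As a rough illustration of that trade-off, the sketch below shows the same small feature written twice: once reaching for a global, once depending on a narrow protocol that is passed in. Every name here (AppDefaults, LabelSettings, MenuTitleFormatter, FixedSettings) is hypothetical and chosen for the example, not drawn from CCMenu.

```swift
// Hypothetical names throughout; none are taken from CCMenu.

// Path of least resistance: a global that any code can reach for.
enum AppDefaults {
    static let showsLabels = true
}

func menuTitleUsingGlobal(for pipeline: String) -> String {
    // Hidden dependency: nothing in the signature reveals it.
    AppDefaults.showsLabels ? pipeline : ""
}

// Alternative: a narrow abstraction that is handed to the code that needs it.
protocol LabelSettings {
    var showsLabels: Bool { get }
}

struct MenuTitleFormatter {
    let settings: any LabelSettings

    func title(for pipeline: String) -> String {
        settings.showsLabels ? pipeline : ""
    }
}

// A fixed implementation, convenient for tests and previews.
struct FixedSettings: LabelSettings {
    let showsLabels: Bool
}

let formatter = MenuTitleFormatter(settings: FixedSettings(showsLabels: false))
print(formatter.title(for: "main").isEmpty)
```

The two versions produce the same diff size and the same behavior with default settings. The difference only shows up later, when a test or a second call site needs a different setting and the global version has no seam to provide one.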

Why a Mature Codebase Is the Right Place to Look

One reason Doernenburg’s experiment is more informative than most AI code quality assessments is that CCMenu is old enough to have a real identity. It has gone through multiple rewrites, has known conventions, and reflects a set of deliberate design decisions accumulated over years. When an agent adds something that violates those decisions, the violation is legible. You can see it.

Greenfield projects obscure this problem because there are no conventions yet. Whatever the agent produces becomes the convention, and the quality assessment reduces to whether the output is reasonable in isolation, which is a different and easier question. The interesting question is how well the agent reads and respects an existing codebase, because that is what most real development looks like.

The finding that emerges from experiments like this one is not that agents produce bad code. It is that agents produce code that is optimized for correctness and completion rather than for fit with an existing system. Those are different objectives, and the difference becomes visible in a mature codebase more quickly than in a new one.

What This Means in Practice

For developers using coding agents on maintained projects, Doernenburg’s experiment points toward a few practical conclusions.

First, code review of an AI-generated change cannot be only a correctness review. It has to include the question of whether the code fits, whether it follows conventions, and whether it respects existing abstractions. That requires the reviewer to have enough context to see misfit, which means the review is doing substantive work that the agent cannot do.

Second, the investment in context documents pays off here in a specific way. The more clearly the codebase’s conventions are documented, whether in a CLAUDE.md, a CONTRIBUTING guide, or inline comments explaining why things are done a certain way, the better the agent’s output fits. Not perfectly, but meaningfully better. Conventions that live only in the heads of the maintainers are invisible to the agent.

Third, technical debt from AI-generated code is a different kind of debt than the debt most developers are used to. It is not obvious debt from shortcuts. It is subtle misfit debt that accumulates through individually reasonable decisions that collectively diverge from the codebase’s shape. Recognizing that pattern is the first step to managing it.

Doernenburg’s work on this in the context of CCMenu is a relatively rare thing: a maintainer with genuine context about a real codebase making the effort to assess quality rigorously rather than just asking whether the feature works. The answer to that narrower question is almost always yes. The more important question is what the feature cost, and that takes longer to see.
