Architecture Fitness Functions Are the Missing Safeguard for AI-Assisted Codebases
Source: martinfowler
Erik Doernenburg’s experiment with CCMenu, documented in January 2026 on Martin Fowler’s site, adds a specific data point to a recurring debate. He used a coding agent to add a feature to CCMenu, a macOS menu bar application he maintains that monitors CI/CD build statuses, then assessed what happened to the code’s internal structure. The agent shipped working code that degraded internal quality: it introduced duplication where abstractions existed and produced code less coherent with the existing architecture than a developer fluent in the codebase would have written.
The finding is consistent with what practitioners who use these tools regularly report. What makes the experiment worth examining closely is that it sits in a longer history, and that history clarifies both what is familiar and what is genuinely new.
Code Generators Have Been Here Before
The tension between productivity tooling and internal code quality predates large language models by decades. The tools changed; the pattern did not.
In the mid-2000s, RAD tools and IDE wizards generated boilerplate that worked but violated structural principles. Visual Basic’s drag-and-drop database widgets generated code that mixed data access with presentation logic. The developer who understood the problem would hand-edit or rewrite; the developer who did not would accumulate what Brian Foote and Joseph Yoder called a Big Ball of Mud: a large, tangled system that grows by accretion without design intent.
Rails generators addressed this by generating code that followed opinionated conventions. rails generate scaffold produces a coherent, conventional resource, but generated code is a starting point, not a finished design. ActiveRecord is the canonical example of the tradeoff: it accelerates data access code substantially, but its affordances pull business logic into model objects, eroding the separation of concerns that a disciplined architecture would maintain. Teams that used it without understanding its structural tendencies accumulated that erosion silently, across many features, until the model objects became unmanageable.
The recurring pattern: tools that accelerate code production create systematic quality pressures in directions specific to their affordances. The developer who understands the tool’s biases can compensate. The developer who does not accumulates the consequences.
What Changes With LLMs
Template-based code generators produce predictable output. A developer reviewing Rails-scaffolded code knows what patterns to look for and can develop a checklist calibrated to the specific structural issues that scaffolding introduces. The quality violations are consistent in character because they derive from fixed templates.
Large language models do not generate from templates. They generate from statistical patterns over training data, which produces two differences that matter for code review.
The first is unpredictability. An agent might structure one class cleanly and then add a tightly coupled method in the next response. The quality violations are not systematic in the way a code generator’s violations are; they are context-dependent and variable, which makes calibrated checklists less reliable.
The second is fluency. Rails-generated code looks like Rails-generated code: particular naming conventions, particular boilerplate, a recognizable shape. AI-generated code looks like ordinary code written by a competent developer. This is a significant difference in practice. GitClear’s 2024 analysis of over 150 million lines of code changes found that code churn and duplication increased substantially alongside AI tool adoption, while refactoring activity declined. These are proxy signals for internal quality degradation, consistent with code that passes review because it looks reasonable even when its structure is not.
Doernenburg’s position as the original author of CCMenu is what makes the experiment credible. He has the design intent of the codebase in his head and can detect when a new method is in the wrong place, when a concept has been duplicated rather than shared, when the layering has been violated. A reviewer without that depth of context would likely approve the code without noticing the violations. Most team reviewers, on most AI-assisted changes, are in that position.
The Feedback Loop Agents Are Missing
The deeper problem is about what signals agents receive. When a developer writes code that violates architecture, they eventually receive feedback: a reviewer pushes back, a subsequent change becomes unexpectedly difficult, a colleague asks why the approach works this way. These signals are slow and diffuse, but over time they create an incentive to understand and follow the existing design.
Agents receive feedback on external quality: tests pass or fail, the human approves or rejects. They do not receive feedback on internal quality, because internal quality signals are slow, diffuse, and not automatable with a standard test suite. An agent will produce similar statistical patterns regardless of whether a previous response degraded the codebase’s structure.
The developer in the loop must compensate for this missing feedback. Under time pressure, with AI-generated code that looks reasonable on the surface, that compensation often does not happen fully. The result is what Doernenburg observed: working code that accumulates structural debt in ways invisible to standard tooling and visible only to someone with deep familiarity with the codebase.
Fitness Functions as a Structural Response
The historical solution to code-generator quality problems was not better reviewer attention but better automated enforcement. This is the principle behind architecture fitness functions, developed by Neal Ford, Rebecca Parsons, and Pat Kua and documented in Building Evolutionary Architectures.
A fitness function is an automated test for an architectural property. Instead of relying on reviewers to catch coupling violations, you write a test that fails when coupling metrics exceed a defined threshold. Instead of hoping someone notices that a class has accumulated multiple responsibilities, you write a test that fails when method count or file complexity exceeds a defined limit. These checks run in CI alongside behavioral tests, creating objective signals rather than depending on reviewer attention.
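To make the idea concrete, here is a minimal sketch of one such check in Python. The technique is language-agnostic; the threshold, function names, and inline sample are illustrative, not taken from CCMenu or Building Evolutionary Architectures.

```python
import ast

# Illustrative threshold: fail CI when any class exceeds this many methods.
MAX_METHODS = 10

def classes_over_limit(source: str, limit: int = MAX_METHODS):
    """Return (class_name, method_count) pairs exceeding the limit."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            methods = [n for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            if len(methods) > limit:
                offenders.append((node.name, len(methods)))
    return offenders

def test_no_class_exceeds_method_limit():
    # In a real CI job this would glob the project's source files;
    # an inline sample keeps the sketch self-contained.
    sample = "class Small:\n" + "".join(
        f"    def m{i}(self): pass\n" for i in range(3))
    assert classes_over_limit(sample) == []
```

The point is not the specific metric but that the check is objective and runs on every change: an agent that splits a bloated class sees the same red-to-green signal it gets from a behavioral test.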
For Swift codebases, the language’s module system provides a natural enforcement layer. Code in one Swift package cannot access another package’s internal types; architectural boundaries enforced by module boundaries become compile errors rather than code review findings. A CCMenu-style codebase that places the feed parser, model layer, and networking layer in separate Swift packages makes cross-layer coupling impossible by construction, not by convention. Conventions are advisory and require human enforcement; module boundaries are mandatory and require no human attention.
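Where the codebase cannot be split into separate packages, the same layering rule can still be enforced as a fitness function. A minimal sketch, assuming a conventional layout: the directory name and the list of forbidden UI frameworks are hypothetical, not CCMenu’s actual structure.

```python
import re
from pathlib import Path

# Layering rule: model-layer sources must not import UI frameworks.
# Framework names and directory layout are illustrative assumptions.
FORBIDDEN_IMPORT = re.compile(r"^\s*import\s+(AppKit|SwiftUI|UIKit)\b",
                              re.MULTILINE)

def layering_violations(model_dir: str) -> list[str]:
    """Return model-layer Swift files that import a forbidden UI framework."""
    return sorted(str(p) for p in Path(model_dir).rglob("*.swift")
                  if FORBIDDEN_IMPORT.search(p.read_text()))
```

Run in CI, an empty result is a pass; any entry fails the build with the offending file named, turning a convention into a mandatory check.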
This approach scales with AI coding agents in a way that code review does not. Agents generate code quickly; reviewers review slowly. If quality enforcement depends on reviewer attention, quality degrades as generation speed increases. If quality enforcement is automated, the agent can iterate toward code that satisfies both behavioral and structural requirements.
Some teams have started encoding architectural constraints directly in agent prompts: “do not import UIKit in model layer files,” “prefer extending existing protocols over creating new classes,” “search for existing utilities before writing date or string parsing logic.” These soft fitness functions encode design intent that would otherwise live only in the reviewer’s head. They are imprecise and the agent will not always follow them, but they shift the output distribution in the right direction and cost almost nothing to add.
What the Methodology Demonstrates
Doernenburg’s experiment is, in one sense, a confirmation of a predictable outcome: an agent optimized for external quality, operating on a codebase it cannot fully understand, in the absence of automated structural enforcement, produces code that passes tests and violates structure.
What is more valuable than the specific finding is the methodology. He measured internal quality explicitly, using a codebase he understood deeply enough to evaluate. That combination is uncommon in practitioner assessments of AI coding tools, most of which focus on speed, test pass rates, or developer satisfaction.
For teams using agents seriously, the experiment offers a useful reframing. The question worth asking is not whether your agent writes good code in isolation, but whether your workflow makes internal quality a signal that agents and reviewers both receive. The teams that struggled with Rails model bloat in 2010 were not the ones who reviewed more carefully; they were the ones who never made the structural constraints explicit enough to enforce. The parallel for AI-assisted development is direct, and the solution is the same: automate the enforcement of the properties you care about, rather than relying on human attention to catch their degradation in review.