How to Systematically Assess Internal Code Quality After Using a Coding Agent
Source: martinfowler
Erik Doernenburg’s assessment of coding agent output on CCMenu, part of Martin Fowler’s ongoing Exploring Gen AI series, is notable less for what it found than for what it did: it treated internal quality assessment as a methodology rather than an impression. Published in January 2026, it’s a retrospective worth examining now precisely because the tooling and discourse around AI-assisted development have matured enough to act on its findings.
Most post-agent code review amounts to confirming that tests still pass and the feature works as described. That’s external quality verification. Internal quality, the part that determines how expensive the codebase will be to maintain and extend, requires a different approach, and that approach is rarely discussed in concrete terms.
What Internal Quality Actually Consists Of
Before you can assess internal quality systematically, you need a specific vocabulary for what you’re measuring. Martin Fowler has a useful framing: internal quality is what gives you design stamina, the capacity to keep adding features at a sustainable rate. The specific attributes that compose it are measurable, not just impressions.
Cyclomatic complexity counts the number of independent control-flow paths through a function. A function with no branching has a complexity of 1; each conditional or loop adds 1. Values above 10 correlate strongly with difficulty in testing and reasoning about code. Above 20, functions become genuinely hard to maintain. Agents tend to produce functions at the high end of this range because they optimize for correctness in a single pass rather than decomposing into simpler sub-problems.
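To make the counting concrete, here is a small hypothetical Swift function with the path count annotated inline. The function and its thresholds are illustrative, not from CCMenu; the counting convention (each decision point and each boolean operator adds one to a baseline of 1) follows the classic McCabe definition, which linters approximate closely.

```swift
// Hypothetical example: each branch and each `&&`/`||` adds one
// independent path on top of the baseline of 1.
func statusColor(failures: Int, isBuilding: Bool, isPaused: Bool) -> String {
    // baseline path: 1
    if isPaused { return "gray" }                        // +1 → 2
    if isBuilding { return "yellow" }                    // +1 → 3
    if failures > 0 && failures < 5 { return "orange" }  // +2 (branch plus &&) → 5
    if failures >= 5 { return "red" }                    // +1 → 6
    return "green"
}
// Cyclomatic complexity ≈ 6: already more paths than a quick read
// suggests. A single-pass agent implementation of a real feature
// routinely lands above 10 unless it is decomposed.
```

Even this toy function is over half way to the warning zone, which is why decomposition into smaller functions matters more than it looks.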
Coupling measures how much a module depends on the internals of other modules, as opposed to their public interfaces. High coupling means changes ripple unpredictably through the codebase. Agents frequently violate architectural boundaries to take the shortest path to a working implementation, placing view logic in model objects or adding networking knowledge to business components.
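A minimal Swift sketch of the kind of boundary violation described above, with hypothetical type names: presentation logic placed in a model type because it was the nearest container.

```swift
import Foundation

// Hypothetical model type from a layered architecture.
struct Pipeline {
    var name: String
    var lastBuildFailed: Bool

    // This computed property belongs in a view or presentation
    // layer. Placing it here couples the model to display
    // decisions, so a purely cosmetic UI change now forces a
    // change to the model layer.
    var displayLabel: String {
        lastBuildFailed ? "❌ \(name)" : "✅ \(name)"
    }
}
```

Every test on displayLabel passes, which is exactly why this class of problem survives functional review.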
Cohesion describes whether a class or module has a single, well-defined responsibility. Low cohesion manifests as classes that grow to handle multiple concerns because the agent added behavior to the nearest available container rather than creating a new one.
Code duplication tracks how often similar logic appears without abstraction. This is where agents fail most consistently: they generate code that satisfies the prompt without scanning the existing codebase for patterns they could reuse. A 2024 GitClear analysis of over 150 million changed lines of code found that copy-paste patterns nearly doubled over a period coinciding with AI tool adoption.

The Tooling Layer
For Swift projects like CCMenu2, the tool landscape for automated quality measurement is more complete than many developers realize.
SwiftLint enforces over 200 rules, including several that catch internal quality violations directly: function_body_length flags functions over configurable thresholds, cyclomatic_complexity computes path counts, file_length flags bloated files, and type_body_length catches classes that have grown beyond a coherent scope. Most projects have SwiftLint configured for style; fewer configure the complexity and coupling rules aggressively enough to catch AI-generated structural problems.
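The rule identifiers above are SwiftLint's own; a sketch of a .swiftlint.yml that tightens them is below. The specific threshold values are illustrative assumptions, deliberately stricter than SwiftLint's defaults, not recommendations from the original article.

```yaml
# .swiftlint.yml — illustrative thresholds, tightened below the
# defaults so structural problems in agent output fail the build
# instead of slipping through.
cyclomatic_complexity:
  warning: 8
  error: 12
function_body_length:
  warning: 40
  error: 60
type_body_length:
  warning: 200
  error: 300
file_length:
  warning: 400
  error: 500
```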
Periphery detects unused declarations across a Swift project. This is a relevant signal for agent output because agents sometimes introduce new types or protocols to solve a sub-problem, then take a different approach in a later step, leaving the earlier declaration dead. Tests still pass; the code is just carrying dead weight.
For cross-language structural analysis, SonarQube provides coupling metrics, cognitive complexity (a more nuanced version of cyclomatic complexity that accounts for nesting depth and recursion), and duplication detection. The free Community Edition handles Swift through community plugins; the commercial editions have native support. The key configuration for AI-assisted codebases is tightening the quality gates: the default thresholds were calibrated for human-written code, and agent output regularly slips through them at default settings.
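For orientation, a minimal sonar-project.properties sketch is below. The property keys are standard SonarQube scanner conventions, but the project key and source paths are assumptions for this example; note that the tightened quality-gate thresholds discussed above live in the SonarQube server configuration, not in this file.

```properties
# sonar-project.properties — illustrative; quality-gate thresholds
# are configured server-side, this file only identifies the project
# and its sources to the scanner.
sonar.projectKey=ccmenu2
sonar.sources=CCMenu
sonar.tests=CCMenuTests
```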
CodeClimate provides maintainability scores and duplication detection as a service, with GitHub integration that makes it practical to review agent-generated pull requests against a baseline.
None of these tools replaces judgment. They surface signals; a developer who understands the codebase’s design intentions still has to interpret them.
The Manual Assessment Layer
Automated tools catch measurable violations. The subtler quality problems that Doernenburg’s experiment surfaced require reading the code with a specific frame of mind.
The first question in any manual internal quality review is whether the new code belongs where it was placed. In a layered architecture, the right question is not just whether the code works but which layer its logic belongs to. An agent implementing a feature in CCMenu’s networking layer might correctly parse a feed format while quietly including formatting logic that belongs in a presentation layer. The tests pass because formatting is correct; the architecture degrades because the boundary moved.
The second question is whether the new code introduces a concept that the codebase already has a name for. Doernenburg’s codebase has conventions built up over years. A new class named FeedChecker that duplicates the behavior of an existing PipelineMonitor represents a quality violation that no linter will catch: both names are syntactically valid, both implementations work, but now the codebase has two terms for the same concept and the cognitive load of understanding it has increased.
The third question is whether the agent’s solution is proportionate to the problem. AI-generated code frequently applies more machinery than the problem requires, not because the model is wrong about the technique, but because it has been trained on codebases where that technique appears in similar contexts and cannot evaluate whether simpler alternatives would serve better here. Over-engineered solutions are technically correct and structurally costly.
Building a Review Workflow Around This
The gap between how most teams review agent output and how Doernenburg reviewed CCMenu is a workflow gap, not a knowledge gap. The practices exist; they just need to be applied consistently.
Running SwiftLint or your language’s equivalent static analysis on every agent-assisted commit, with complexity and coupling rules enabled, moves automated quality checks from occasional audits to routine gates. This gives reviewers a baseline metric to compare against, rather than relying on impressions of whether the code “looks fine.”
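One way to make that gate routine is a CI job that runs on every push and pull request. The sketch below uses GitHub Actions syntax and assumes SwiftLint is available on the macOS runner; the workflow name and structure are illustrative.

```yaml
# Illustrative GitHub Actions workflow: SwiftLint runs as a gate
# on every push and pull request, not as an occasional audit.
name: lint
on: [push, pull_request]
jobs:
  swiftlint:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run SwiftLint strictly
        # --strict promotes warnings to errors, so the tightened
        # complexity and length rules actually fail the build.
        run: swiftlint lint --strict
```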
For significant agent-assisted features, a separate structural review pass, distinct from the functional review, creates the space to ask the architectural questions without conflating them with correctness questions. Mixing them in a single review tends to resolve in favor of correctness, because working code is easier to defend than a structural objection is to press.
Prompting explicitly for quality constraints shifts the distribution of agent output before it reaches review. Instructions like “do not duplicate logic that exists elsewhere in the codebase,” “prefer modifying existing classes over creating new ones when the responsibility fits,” and “use the same terminology this codebase already uses for similar concepts” do not guarantee good structural output, but they reduce the frequency of the worst violations.
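One way to apply these instructions consistently rather than retyping them is a project-level instruction file that most coding agents read at the start of a session; the filename varies by tool, and the wording below is a sketch assembled from the constraints quoted above.

```markdown
<!-- Project instructions for a coding agent; filename varies by tool. -->
## Internal quality constraints

- Before writing new code, search the codebase for existing logic
  that solves the same sub-problem, and reuse it.
- Prefer modifying an existing class over creating a new one when
  the responsibility fits.
- Use the terminology this codebase already uses for similar
  concepts; do not introduce a second name for an existing concept.
- Keep functions within the thresholds enforced by .swiftlint.yml.
```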
Sourcery for Swift, and equivalent metaprogramming tools in other languages, can automate structural checks that go beyond what linters typically handle, including verifying that protocol conformances follow expected patterns and that no types in a restricted layer have dependencies on types in another layer.
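As a sketch of what such a check can look like, the Stencil template below uses Sourcery's types and annotations API to fail the build when a type annotated as model-layer conforms to a view-layer protocol. The annotation key, the ObservableView protocol name, and the overall policy are hypothetical assumptions for this example; only the template mechanics (types.all, annotations, based) are Sourcery's.

```stencil
// layering_check.stencil — illustrative Sourcery template.
// Emits a Swift #error for any type annotated `layer:model`
// that also conforms to the (hypothetical) view-layer protocol.
{% for type in types.all %}
{% if type.annotations.layer == "model" and type.based.ObservableView %}
#error("{{ type.name }} is annotated layer:model but conforms to ObservableView")
{% endif %}
{% endfor %}
```

Because the generated file is compiled with the rest of the project, the violation surfaces as a build failure rather than a review comment.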
What Doernenburg’s Experiment Actually Shows
The CCMenu experiment is valuable because Doernenburg was positioned to assess his own codebase in a way that a reviewer unfamiliar with it could not be. He knew what the architecture was supposed to look like, which made violations visible to him that automated tools would miss and that a reviewer without that context would not recognize as violations.
This is an honest acknowledgment of what systematic internal quality assessment requires: not just tools, but a developer with a sufficiently detailed mental model of the intended design to evaluate whether new code fits or degrades it. That model can be built through reading and working with a codebase, through explicit documentation of architectural decisions, and through code reviews that focus on structure rather than correctness.
The tools extend that assessment to the measurable dimensions. The judgment fills in what tools cannot reach. Neither alone is sufficient, and the workflow that combines both is what the agent’s own output cannot provide for itself.