The Quality Violations That Pass Every Gate: What the CCMenu Experiment Actually Shows
Source: martinfowler
Erik Doernenburg published his assessment of coding agent impact on CCMenu in January 2026, and most of the discussion around it has focused on a familiar conclusion: AI agents produce code that works but isn’t well structured. That conclusion is true, but the more important finding is buried in how the violations were detected. They weren’t caught by tests. They weren’t caught by static analysis. They weren’t caught by standard code review. They were caught because Doernenburg built the project himself, over more than a decade, and recognized the violations by comparing agent output to a mental model no tool had access to.
That’s the part worth sitting with.
What Standard Gates Actually Test
When a team adds a coding agent to their workflow, they typically apply the same quality gates they’ve always used: a test suite, a linter, maybe a CI pipeline with coverage thresholds, and code review. These gates are calibrated for a specific failure mode: code that doesn’t work. They’re not calibrated for the failure mode agents actually produce.
The violations Doernenburg found in CCMenu were of a different kind. The feature worked. The tests passed. But the agent had:
- Introduced logic into architectural layers it didn’t belong in, crossing coupling boundaries that had been deliberately maintained
- Placed responsibilities in existing classes rather than creating new ones, degrading cohesion in ways that compound over time
- Reimplemented utility logic that already existed elsewhere in the codebase, creating duplication that would require coordinated changes in the future
- Used structural patterns that made the new code harder to test in isolation, despite the test suite passing on the feature as written
None of these are test failures. None are compiler errors. A linter running at default configuration would pass all of them. And a code reviewer who isn’t deeply familiar with the project’s intended architecture would have no frame of reference to identify them as violations rather than stylistic choices.
Why Linters Miss Most of the Real Violations
Static analysis tools for internal quality are genuinely useful, and for a Swift project like CCMenu, the tooling is more comprehensive than many developers realize. SwiftLint supports cyclomatic complexity thresholds, type body length limits, and file length gates. Periphery detects unused declarations. SonarQube provides cognitive complexity scores and duplication detection across the codebase.
The problem is that every one of these tools measures what is present in the code, not whether what is present belongs where it was placed. A function that crosses an architectural boundary can be short, simple, uncomplicated, and perfectly within every threshold. The boundary violation isn’t a property of the code itself; it’s a property of the relationship between the code and the system’s intended design.
Swift Package Manager gets closer than most ecosystems by enforcing module dependencies at the compilation level. If you structure a Swift project as distinct SPM targets, you can make certain boundary violations literally impossible to compile: a Model target that doesn’t declare a dependency on Networking cannot import from it, period. No agent can violate that boundary without also modifying the package manifest, which is the kind of visible, anomalous change that does get caught in review.
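A minimal `Package.swift` sketch of what compiler-enforced layering looks like (the target names here are hypothetical illustrations, not CCMenu's actual structure):

```swift
// swift-tools-version:5.9
// Hypothetical manifest: target names illustrate the pattern only.
import PackageDescription

let package = Package(
    name: "ExampleApp",
    targets: [
        // Model declares no dependency on Networking, so `import Networking`
        // inside Model sources is a compile error, not a review comment.
        .target(name: "Model"),
        .target(name: "Networking", dependencies: ["Model"]),
        .target(name: "ViewModels", dependencies: ["Model", "Networking"]),
        .target(name: "Views", dependencies: ["ViewModels"]),
    ]
)
```

An agent that wants Model to talk to Networking has to edit this manifest, and a one-line dependency addition in `Package.swift` is far easier to spot in review than a buried import.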
But even SPM target separation only enforces the boundaries you’ve modeled as package boundaries. Most codebases have subtler layering than that. The CCMenu violations Doernenburg found were within layers, not between separately compiled modules. Within a single package target, cohesion violations and inappropriate coupling are invisible to the compiler.
SwiftLint custom rules can go further by encoding project-specific conventions as regex patterns:
```yaml
custom_rules:
  model_no_uikit:
    name: "Model layer should not import UIKit"
    regex: "^import UIKit"
    included: ".*/Model/.*\\.swift"
    message: "Model types must not depend on UIKit; use protocol abstractions instead"
    severity: error
  view_no_network:
    name: "View layer should not import Networking"
    regex: "^import.*Networking"
    included: ".*/Views/.*\\.swift"
    message: "Views should receive data from view models, not networking directly"
    severity: error
```
These rules can catch the specific violations the project’s architecture is designed to prevent. But writing them requires first knowing what violations to prevent, which means the architectural knowledge has to be extracted from the maintainer’s head and encoded as a rule before the agent shows up.
The Time-Sensitive Nature of Tacit Knowledge
The GitClear analysis of AI-assisted codebases found that code duplication nearly doubled year-over-year in repositories during periods of heavy AI tool adoption. That’s a signal that agents are generating code without discovering existing utilities to reuse, producing parallel implementations that diverge as the codebase evolves.
But duplication is actually one of the more detectable problems. You can measure it, track it over time, and catch it before it becomes structurally embedded. The harder problem is cohesion degradation: the slow accumulation of responsibilities in classes that started with a coherent purpose, as agents add behavior to the nearest available container rather than the right one.
Cohesion degradation doesn’t register as a metric spike. It shows up gradually, as the ratio of what a class does to what its name implies grows, as the class becomes harder to reason about in isolation, as the test file for it accumulates setup code for concerns the class shouldn’t have. By the time a code complexity alert fires, the structural problem has already been there for months and is already the foundation for subsequent work.
Doernenburg could identify these problems immediately because he had the reference architecture in his head. The architectural knowledge that makes violations visible is time-sensitive in a specific way: it’s strongest in the people who built the system, and it degrades as those people move on, get promoted, change focus, or simply forget the reasoning behind decisions made years ago. Every team using agents on an established codebase is in a race between knowledge extraction and knowledge loss, and most teams aren’t aware they’re running it.
What Has to Change Before the Agent Arrives
The implication isn’t that agents shouldn’t be used on established codebases. It’s that the codebase needs preparation that most projects haven’t done, and that preparation has to happen before the tacit knowledge becomes harder to recover.
For a project like CCMenu, this would mean:
Documenting architectural intent explicitly. Not what the code does, but why the layers are structured as they are and what violations would look like. Architecture Decision Records are a standard format for this; summarizing the relevant conventions in a CLAUDE.md file or similar is enough for agent context injection.
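A hypothetical CLAUDE.md fragment showing the level of specificity that makes this useful (the layer and directory names are invented for illustration):

```markdown
## Architecture constraints

- Model types must not import UIKit or Networking; they are pure data and logic.
- Views receive data only from view models; never call networking code directly.
- Before adding a utility function, check whether an equivalent already exists
  in the shared utilities before writing a new one.
- New responsibilities get new types; do not extend an existing class beyond
  the purpose stated in its header comment.
```

The point is that each line names a violation an agent could otherwise commit silently, in terms concrete enough to act on.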
Encoding boundary enforcement as automated checks. SwiftLint custom rules for project-specific conventions. SPM target separation where the layering matters enough to enforce at compilation. Fitness functions that run in CI and fail on structural violations. The goal is to turn the violations Doernenburg recognized manually into failures that surface automatically.
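A fitness function of this kind can be a small pure function run from a CI test over every source file. This sketch assumes invented layer names and rules; the core idea is mapping each layer to the imports its sources must not contain:

```swift
import Foundation

// Fitness-function sketch: layer names and rules are illustrative, not CCMenu's.
// Maps each architectural layer to the modules its sources must not import.
let forbiddenImports: [String: Set<String>] = [
    "Model": ["UIKit", "Networking"],
    "Views": ["Networking"],
]

// Returns the forbidden modules a source file imports, given its layer.
func boundaryViolations(layer: String, source: String) -> [String] {
    guard let forbidden = forbiddenImports[layer] else { return [] }
    return source
        .split(separator: "\n")
        .compactMap { line -> String? in
            let trimmed = line.trimmingCharacters(in: .whitespaces)
            guard trimmed.hasPrefix("import ") else { return nil }
            let module = String(trimmed.dropFirst("import ".count))
            return forbidden.contains(module) ? module : nil
        }
}

// In CI, a test would walk Sources/<Layer>/, run this over each file,
// and fail the build if any violations come back.
```

Unlike a regex-only lint rule, this lives alongside the test suite, so the architectural constraint fails the same pipeline stage that a broken feature would.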
Running agents with quality constraints explicit in the prompt. Instructions like “do not duplicate logic that exists elsewhere in this codebase” and “if you’re adding behavior to an existing class, verify that this responsibility fits its stated purpose” change the distribution of agent output without guaranteeing it. They’re not substitutes for automated gates, but they reduce the rate at which violations require correction.
Establishing a baseline before the first agent contribution. Running your complexity, coupling, and duplication measurements on the current codebase gives you something to compare against. When those metrics start drifting after agent-assisted development begins, you have signal. Without a baseline, you’re comparing against an impression of what the codebase used to be like.
The Scaling Problem Nobody Is Solving
The CCMenu experiment worked as an assessment precisely because Doernenburg is the maintainer. He has the tacit architectural knowledge that makes violations visible. Most teams cannot replicate this condition: the developers using agents most heavily are often not the original maintainers of the codebases they’re working in, and the original maintainers are often not the ones doing the quality review.
The tools partially compensate for this. Good automated gates, tight linting rules, and explicitly documented conventions give a reviewer without deep context something to anchor their judgment to. But tools enforce what you encode, not what you know. The boundaries that matter most are exactly the ones that feel too obvious to document until an agent violates them and you realize they weren’t obvious to anyone but you.
The real lesson from the CCMenu experiment isn’t that agents produce bad code. It’s that the badness lives in a specific register, internal structural coherence, which standard quality infrastructure doesn’t measure and which only a certain kind of knowledge can detect. Understanding which kind of knowledge that is, and finding ways to encode it before it’s needed, is the preparation work that determines whether agent-assisted development accelerates a codebase forward or slowly makes it harder to change.