Building the Harness: Why AI-Assisted Code Quality Is an Infrastructure Problem
Source: martinfowler
The framing Birgitta Böckeler lays out in her harness engineering piece on martinfowler.com, published back in February, is worth sitting with for a while. The term comes from OpenAI’s own internal writing on AI-assisted development, and it names something the industry has been doing inconsistently and mostly by accident.
A harness, in this context, is everything around the AI that makes its output trustworthy and useful: the test infrastructure that validates generated code, the architectural patterns that let the model reason about complete units of logic, the curated context assembled before each request, and the ongoing cleanup that ensures the model is not imitating outdated or contradictory patterns. Three components drive most of the value here: context engineering, architectural constraints, and what Böckeler calls garbage collection of the codebase.
Context Engineering Is Not Prompt Engineering
The distinction matters. Prompt engineering is about the instruction you give the model: what task to perform, what constraints to honor, what format to return. Context engineering is about the surrounding information: which files are in scope, which conventions apply, which type definitions constrain the solution space. A well-crafted prompt with poor context produces plausible-looking wrong code. A sparse prompt with the right context often produces exactly the right code with no further guidance.
This is why tools that handle context management well tend to outperform tools that do not, regardless of which underlying model they use. Aider’s repository map is a concrete example: rather than dumping entire files into context, it generates a compressed, AST-derived summary of the codebase (function signatures, class definitions, import relationships) that gives the model a navigational view without consuming the full context budget on implementation details. The model can ask for specific files based on this map, enabling progressive expansion rather than upfront saturation.
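Conceptually, the idea looks something like the following. This is an illustrative sketch, not Aider’s actual output format, and the file and function names are hypothetical:

```text
src/billing/invoice.ts:
  class Invoice
    constructor(customerId: string, items: LineItem[])
    total(): number
  function renderInvoice(inv: Invoice): string

src/billing/tax.ts:
  function taxFor(region: string, amount: number): number
```

A map like this costs a few dozen tokens per file instead of a few thousand, which is what makes progressive expansion affordable.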
Cursor indexes the codebase with embeddings and pulls in semantically relevant chunks at query time. Claude Code uses CLAUDE.md files as a persistent, project-specific instruction layer that gets injected into every session automatically. These are all implementations of the same basic insight: context should be curated, not accumulated.
The .cursorignore and .aiderignore patterns represent the negative side of context engineering. Build artifacts, generated files, vendor directories, test fixtures with thousands of lines of synthetic data: all of this degrades model output by consuming context budget and introducing noise. Maintaining these exclusion lists is infrastructure work, not housekeeping.
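As a rough sketch, an exclusion file might look like this. The patterns are gitignore-style, which both tools broadly follow; the specific paths are placeholders for your own project layout:

```text
# .cursorignore — hypothetical example; adapt paths to your project
dist/
build/
coverage/
node_modules/
vendor/
*.generated.ts
src/test/fixtures/
```

The test-fixtures entry is often the highest-value line: large synthetic data files look like real code to the model but carry almost no signal about your conventions.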
Architectural Constraints as LLM-Friendliness
There is a useful reframing here that Böckeler’s piece surfaces: the properties of a codebase that make it easy for humans to read and reason about are largely the same properties that make it effective for LLMs to reason about. Small, focused files with single responsibilities fit entirely within a context window; the model can reason about a complete unit without working from partial information. Typed interfaces give the model explicit contracts instead of requiring it to infer behavior from dynamic patterns. Consistent naming reduces the probability of the model generating multiple implementations of the same logical operation under different names.
This is not a new argument. The case for small modules and strong typing has been made on human-readability grounds for decades. What is new is the feedback loop: teams that invest in these properties now get compounding returns because their AI tools work substantially better. The architectural quality gap between teams shows up directly in the quality of their AI-assisted output.
One concrete area where this manifests is TypeScript’s strict mode. Enabling strict null checks, noImplicitAny, and strict function types gives the model precise contracts for every function boundary. When the model generates a function call, it has enough type information to produce a correct signature on the first attempt rather than generating something plausible that fails the type checker. The tests still catch problems; the type system narrows the search space before the test even runs.
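As a minimal sketch of the kind of contract strict mode enforces (the `User` type and lookup function here are hypothetical):

```typescript
// Under strictNullChecks, the `User | undefined` return type forces
// every caller to handle the missing-user case explicitly.
interface User {
  id: number;
  name: string;
}

const users: User[] = [{ id: 1, name: "Ada" }];

function findUser(id: number): User | undefined {
  return users.find((u) => u.id === id);
}

function greet(id: number): string {
  const user = findUser(id);
  // Writing `user.name` here without the check would be a compile
  // error in strict mode, not a runtime surprise.
  return user ? `Hello, ${user.name}` : "Hello, guest";
}
```

When a model generates a caller of `findUser`, the return type in context tells it the undefined branch must exist, so the correct shape is the most probable completion.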
The same principle applies to consistent error handling conventions, standardized module boundaries, and predictable file layout. The model learns by example from the context you provide. If your codebase has five different ways to handle async errors, the model will pick one at random.
Garbage Collection Is Not Optional
The garbage collection framing is the most actionable part of harness engineering, and also the most commonly neglected. Dead code is not neutral. Commented-out blocks, unused utility functions, deprecated API clients, feature flags from completed migrations: the model will reason about all of it. It will imitate deprecated patterns because those patterns appear in context. It will generate calls to deleted functions because those functions are still referenced in comments. It will follow outdated style conventions because the old code has more examples than the new.
This is the part that most teams underestimate because the cost is invisible. The dead code does not cause test failures. It does not break the build. It quietly degrades the model’s probabilistic inference about what your codebase intends.
Regular dead code elimination, using tools like ts-prune for TypeScript, vulture for Python, or the Go compiler’s built-in unused-variable errors, is infrastructure work for AI-assisted teams, not optional cleanup. The same applies to dependency hygiene: a package.json carrying 40 unused dependencies, or a requirements.txt with overlapping pins for the same library, tells the model confusing things about what tech stack you are actually using.
The recurring recommendation to keep CLAUDE.md or .cursorrules files current is a variation of the same principle. Stale instructions are worse than no instructions; the model reads them and tries to follow them.
This Is a Team Problem, Not an Individual One
The deeper shift that harness engineering represents is organizational. Most developer conversations about AI tools focus on individual productivity: how to write better prompts, which tool to subscribe to, how to configure your editor. The harness framing moves the conversation to team infrastructure.
The analogy I find useful here is the shift from “every developer writes their own tests” to CI/CD as shared infrastructure. Individual developers writing tests is good. A shared pipeline that enforces test coverage, runs tests on every commit, and blocks merges on failures is infrastructure. It is a team artifact that compounds across every contribution. The harness is the same category of investment.
This means someone on the team needs to own the CLAUDE.md file, the .cursorignore patterns, the dead code elimination runs, and the module boundary conventions. It means architectural decisions should now include consideration of how a change affects LLM-assisted work on that module. It means code review should catch patterns that degrade context quality, not just patterns that affect runtime behavior.
Teams that invest in this infrastructure get better AI output per session, less correction overhead, and less variation in output quality across developers on the team. Teams that do not will continue to get inconsistent results and attribute the inconsistency to the model rather than to the environment.
Where to Start
If you are looking at your codebase and want to improve the harness without a large upfront commitment, a few things have high leverage:
- Write a CLAUDE.md or .cursorrules file that documents your actual conventions: naming patterns, error handling approach, test structure, key architectural decisions. Keep it under 500 lines so it fits comfortably in context without crowding out task-specific information.
- Add .cursorignore or .aiderignore to exclude build outputs, generated code, and large fixture files from the default context.
- Run a dead code detection pass and delete the output. Start with functions that have zero call sites.
- Enable TypeScript strict mode or add type annotations to the most frequently referenced modules in your codebase.
- Split any file over 400 lines that carries more than one logical responsibility.
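For the first item on that list, a skeleton might look like this. Every convention shown is a hypothetical placeholder; the value comes from documenting what your team actually does, not from copying a template:

```markdown
# CLAUDE.md — hypothetical skeleton; replace with your real conventions

## Naming
- Components: PascalCase. Hooks: useXxx. Utilities: camelCase.

## Error handling
- Async functions return a Result type; never throw across module boundaries.

## Tests
- One *.test.ts file per module, colocated with the source.

## Architecture
- src/domain must not import from src/api or src/ui.
```

Short, declarative statements like these work better than prose explanations: the model treats each line as a rule to follow rather than background to summarize.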
None of these require new tools or changes to your deployment pipeline. They are the same investments you would make for long-term human maintainability, just prioritized through the additional lens of what makes AI assistance reliable rather than inconsistent. The harness does not replace good engineering judgment; it gives AI tools enough structural information to work within the constraints that judgment has established.