The Harness Around the Model Is the Real Engineering Work

The concept of a test harness has been part of software development vocabulary for decades. You build the scaffolding, wire up the fixtures, and the tests run in a controlled environment. OpenAI’s framing of “harness engineering” for AI-enabled software development borrows that same intuition, and Birgitta Böckeler’s commentary on Martin Fowler’s blog makes the case that this framing is one of the more useful ones to emerge from the current wave of AI tooling discourse. Her piece was published in February 2026, and it is worth revisiting now that teams have had time to test these ideas against real projects.

The prompt engineering framing has always been slightly off. It focuses attention on a single message exchange, as if the key to getting useful output from a coding agent is choosing the right words at the right moment. The harness framing shifts focus to something more structural: the environment you build around the model, which determines what it knows and what it can safely do.

That environment has three layers according to the OpenAI framing that Böckeler highlights: context engineering, architectural constraints, and codebase garbage collection.

Context engineering is not prompt engineering

Prompt engineering asks: what should I say to the model? Context engineering asks: what information should the model have access to when I say it?

The distinction matters because most of what makes a coding agent useful or frustrating has nothing to do with the quality of individual prompts. It has to do with whether the model understands your project’s conventions, knows which files are relevant, has access to your API docs, understands the testing patterns you use, and knows what you already tried.

Tools have formalized this in different ways. Claude Code reads CLAUDE.md files at the repository root. Cursor uses .cursor/rules. GitHub Copilot supports repository-level instructions in .github/copilot-instructions.md and leans on indexed codebases. All of these are implementations of context engineering: structured ways of putting project knowledge in front of the model before it starts working. A well-maintained CLAUDE.md might look like:

# Project conventions
- All database queries go through `src/db/queries.ts`, never raw SQL in handlers
- Error handling uses the `AppError` class from `src/errors.ts`
- Tests use Vitest with MSW for HTTP mocking; no manual fetch mocking
- Commands follow the builder pattern in `src/commands/base.ts`

That file costs maybe thirty minutes to write and saves hours of correction across every session anyone on the team runs. The return on investment is unusually clear.
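
A context file like the one above only pays off while it still matches the repository. As a minimal sketch of keeping it honest, the script below pulls backticked paths out of the file and reports any that no longer exist. The function names and the rule "a backticked span with a slash and a file extension is a path" are assumptions made for illustration, not part of any tool's API.

```python
import re
from pathlib import Path

# Matches inline code spans such as `src/errors.ts`.
BACKTICK = re.compile(r"`([^`\n]+)`")


def referenced_paths(markdown: str) -> list[str]:
    """Return backticked spans that look like file paths (assumed heuristic)."""
    spans = BACKTICK.findall(markdown)
    # Keep spans that contain a slash and whose last segment has an extension.
    return [s for s in spans if "/" in s and "." in s.rsplit("/", 1)[-1]]


def missing_paths(markdown: str, repo_root: str) -> list[str]:
    """Return referenced paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in referenced_paths(markdown) if not (root / p).exists()]
```

Run against the repository root in CI, a non-empty result from `missing_paths` is a cheap signal that the context file has drifted from the code it describes.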

The deeper engineering challenge is that context is finite. A model’s context window has a limit, and the naive approach of dumping everything into it degrades quality as the context grows noisier. Real context engineering involves curation and retrieval: knowing what to include, what to summarize, and what to leave out. This is closer to library science than it is to copywriting. Teams building serious AI-assisted workflows end up implementing retrieval-augmented approaches, pulling in relevant file summaries or API references on demand rather than loading the whole repository.
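
The curation step can be sketched in a few lines. The example below ranks candidate snippets against a task description and keeps only those that fit a token budget. Everything in it is an assumption for illustration: the `Snippet` type, the keyword-overlap scoring, and the four-characters-per-token estimate are stand-ins for the embedding search and real tokenizer a production setup would more likely use.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def relevance(task: str, snippet: Snippet) -> int:
    # Hypothetical scoring: count task words (longer than three
    # characters) that occur in the snippet's path or text.
    words = {w.lower() for w in task.split() if len(w) > 3}
    haystack = (snippet.path + " " + snippet.text).lower()
    return sum(1 for w in words if w in haystack)


def select_context(task: str, snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Pick the most relevant snippets that fit within the token budget."""
    scored = [(relevance(task, s), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    chosen, used = [], 0
    for score, snippet in scored:
        cost = estimate_tokens(snippet.text)
        # Skip irrelevant snippets and anything that would exceed the budget.
        if score == 0 or used + cost > budget:
            continue
        chosen.append(snippet)
        used += cost
    return chosen
```

The point of the sketch is the shape of the decision, not the scoring: irrelevant material is excluded even when there is room for it, because unused budget is cheaper than noise.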

For work like building Discord bots, the difference between a useful AI session and a frustrating one is almost entirely determined by what context is set up beforehand. When the model has a clear description of the command structure, the database schema, and the existing error-handling patterns, the code it writes fits. When you just start typing, it writes generic boilerplate that needs reshaping. The prompt matters far less than the harness.

Architectural constraints as a new design consideration

The second component is less obvious and more interesting from a software design standpoint. In the harness sense, architectural constraints mean structuring your code in ways that make it legible to AI tools.

This is not identical to writing code for humans to understand, though there is overlap. Some specific constraints that improve AI tool performance: keeping files small and focused, using explicit naming that makes function purposes clear without context, minimizing global state (since state is hard to track across a context window), and preferring declarative patterns over imperative ones where the domain allows it.

These are all good software engineering practices as well, but the motivation is different. The traditional justification for small files is cognitive load for developers. The harness engineering justification is that coding agents tend to struggle when they have to hold more than a few hundred lines of context to understand what a file is doing. It is the same code structure, justified by a different constraint.
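
A constraint like file size is easy to enforce mechanically. The following is a minimal sketch of such a check. The 300-line threshold and the list of suffixes are assumptions chosen for illustration, so tune both to your own codebase.

```python
from pathlib import Path

MAX_LINES = 300  # assumed threshold, not a recommendation from the source


def oversized_files(root: str, suffixes=(".ts", ".py"), max_lines: int = MAX_LINES):
    """Return (relative_path, line_count) for source files over max_lines."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Count lines without loading the whole file into memory.
        with path.open(encoding="utf-8", errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count > max_lines:
            results.append((path.relative_to(root).as_posix(), count))
    return results
```

Wired into a pre-commit hook or CI job, a check like this turns the constraint from a convention people remember into one the repository enforces.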

Where this gets more interesting is in areas where AI legibility and human cognitive preferences diverge. Long, well-commented procedural code can be very readable to humans but hard for a model to work with incrementally, since changing one part requires understanding the entire flow. Heavily abstracted, type-rich code can be opaque to junior developers but navigable to a language model that has seen millions of similar patterns in its training data.

There is no settled answer here, but the harness framing makes the question explicit. You are now designing code for two audiences: humans and AI tools. Their preferences are not always aligned, and pretending otherwise leads to architectures that serve neither well.

Codebase garbage collection

The third component is where the framing is most provocative. Böckeler discusses garbage collection of the codebase as part of harness engineering, and this deserves careful unpacking.

The argument is roughly: AI tools are confused by dead code, by redundant abstractions, by modules that no longer serve a purpose, and by naming that has drifted from reality over years of changes. The technical debt cleanup that teams perpetually defer becomes a direct impediment to AI-assisted development, not just an abstract quality concern.

This gives technical debt remediation a new economic rationale that is more legible to stakeholders than the usual appeals to maintainability. The cost of confusion used to fall on human developers reading the code. Now it falls on every AI-assisted session, every code completion, every refactoring suggestion, every agent run. A stale processData function that was replaced under a new name three years ago but never removed used to cost one human five seconds of confusion. Now it costs that five seconds multiplied across every AI interaction that touches that module.

The garbage collection metaphor implies an ongoing process rather than a one-time cleanup, which is apt. Just as a garbage collector runs continuously to reclaim unused memory, codebase GC in this framing is a continuous practice: removing code that no longer serves its purpose, clarifying names that have drifted, and eliminating abstractions that outlived the problems they solved.

A practical form of this practice: before starting an AI-assisted feature sprint, do a focused cleanup pass on the modules the feature will touch. Remove dead branches, rename unclear variables, consolidate duplicated logic. The AI sessions that follow will be more accurate and require fewer corrections. The cleanup pays for itself within the sprint.
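
A cleanup pass like this can start from a rough list of candidates. The sketch below flags functions and classes whose names appear only once across the given files, meaning nowhere except their own definition. It is a heuristic under stated assumptions: the regular expressions are illustrative, it takes file contents as a plain dictionary, and it will miss dynamic references and names that only appear in comments or strings, so treat its output as leads to review rather than code to delete.

```python
import re
from collections import Counter

# Assumed patterns: `function name`, `def name`, or `class name` definitions.
DEFINITION = re.compile(r"\b(?:function|def|class)\s+([A-Za-z_]\w*)")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def unused_candidates(files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (path, name) pairs for definitions never referenced elsewhere."""
    # Count every identifier occurrence across all files.
    counts = Counter()
    for text in files.values():
        counts.update(IDENTIFIER.findall(text))

    candidates = []
    for path, text in sorted(files.items()):
        for name in DEFINITION.findall(text):
            # A count of one means the only occurrence is the definition itself.
            if counts[name] == 1:
                candidates.append((path, name))
    return candidates
```

Restricting `files` to the modules the upcoming feature will touch keeps the list short enough to review in a single sitting.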

Why the framing shift matters

The older framing of prompt engineering positioned AI tool quality as a skill held by individual developers. Write better prompts, get better output. This led to prompt engineering as a job title, prompt template repositories on GitHub, and libraries of “magic phrases” that supposedly improved outputs.

The harness framing repositions the work at the team and codebase level. Context engineering is an ongoing team practice, not an individual skill. Architectural constraints affect how the whole team designs code. Codebase garbage collection is a shared maintenance discipline.

This is not a minor reframing. It changes where investment goes. Instead of training each developer to write better prompts, you invest in repository-level context files that every session benefits from. Instead of leaving each developer to figure out how to describe the codebase to the AI from scratch, you engineer that description once and keep it maintained.

The harness is also durable in a way that prompt craft is not. Models change. Interfaces change. But a well-structured codebase with clear naming, maintained architecture documentation, and minimal dead code will serve you regardless of which AI tool the team adopts next year.

Böckeler’s point, following OpenAI’s framing, is that harness engineering represents a serious engineering activity, not a configuration task or a soft skill. That framing is correct. The teams getting consistent value from AI coding tools are the ones who treat the environment around the model as something to be built and maintained, with the same rigor they apply to their test infrastructure, their CI pipelines, and their deployment systems. The harness is not something you set up once and forget. It is the ongoing engineering work that makes AI assistance reliable at scale.