
Context Is Infrastructure: What Harness Engineering Gets Right About AI Coding

Source: martinfowler

Back in February, Birgitta Böckeler published a piece on Martin Fowler’s blog responding to OpenAI’s internal framing of what they call “harness engineering.” It’s worth returning to now, a month on, because the concept keeps coming up in practice and the name finally gives teams something to point at.

The core claim is that there is a distinct category of engineering work involved in making AI coding tools perform well, and it is not prompt engineering. Prompt engineering is session-level work: you are tuning the text you send to the model right now. Harness engineering is sustained infrastructure work: you are maintaining the environment the model operates in across every session, for every developer, indefinitely.

The harness, as Böckeler describes it while drawing on OpenAI’s framing, has three components: context engineering, architectural constraints, and garbage collection of the codebase. Each is worth unpacking separately because they behave differently and create different kinds of team obligations.

Context Engineering

Context engineering is the practice of deliberately shaping the persistent information the AI sees before it reads a single line of your code. The most visible artifact of this is whatever your tool calls its project-level instruction file. In Claude Code that is CLAUDE.md. In Cursor it has traditionally been .cursorrules, with newer versions reading rule files under .cursor/rules/. GitHub Copilot has .github/copilot-instructions.md. The name varies; the function is the same.

A minimal but effective CLAUDE.md might look like this:

# Project Context

## Architecture
This is a Discord bot using discord.py. Commands live in `cogs/`, each cog
responsible for one domain. The bot uses a PostgreSQL database via asyncpg;
all DB access goes through `db/queries.py`, never inline.

## Conventions
- All command handlers are async
- Error handling uses the `@handle_errors` decorator from `utils/errors.py`
- User-facing strings live in `locales/en.json`, never hardcoded
- Tests use pytest-asyncio; mock the DB at the repository layer

## Things to avoid
- Do not instantiate discord.py's `commands.Bot` directly; we use `commands.AutoShardedBot`
- Do not add synchronous DB calls anywhere in the cog layer

That is well under 200 tokens with most tokenizers. Without it, every AI session starts from zero knowledge of your module boundaries, your error handling approach, and your localization strategy. The model will guess, and it will guess inconsistently across sessions and across developers.

The non-obvious part of context engineering is that these files have to be maintained. The architectural decisions they describe drift from reality if nobody keeps them current. A context file that describes the old database layer after a migration teaches the model confidently wrong things. This is the familiar problem of stale documentation, but with a sharper cost: a human reader may notice that a README is out of date, while the model acts on whatever the file says in every session.
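The most mechanical kind of drift, a context file naming paths that no longer exist, can be caught automatically. The following is a minimal sketch rather than an established tool: it assumes the context file quotes paths in backticks the way the example above does, and the function names `referenced_paths` and `missing_paths` are invented for illustration.

```python
import re
from pathlib import Path

# Backtick-quoted tokens made of path-like characters, e.g. `cogs/` or `db/queries.py`.
PATH_PATTERN = re.compile(r"`([\w.\-]+(?:/[\w.\-]+)*/?)`")


def referenced_paths(context_text: str) -> list[str]:
    """Return backtick-quoted tokens that contain a slash, in order of appearance."""
    return [m for m in PATH_PATTERN.findall(context_text) if "/" in m]


def missing_paths(context_text: str, repo_root: Path) -> list[str]:
    """Return referenced paths that do not exist under repo_root."""
    return [p for p in referenced_paths(context_text) if not (repo_root / p).exists()]


if __name__ == "__main__":
    root = Path(".")
    context_file = root / "CLAUDE.md"  # assumed location: repository root
    if context_file.exists():
        text = context_file.read_text(encoding="utf-8")
        for path in missing_paths(text, root):
            print(f"CLAUDE.md mentions {path}, which does not exist")
```

A check like this only catches renamed or deleted paths. It cannot tell you that a described convention has quietly stopped being true, which is why the human maintenance work remains.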

Architectural Constraints

The second component is about code structure. Some architectural shapes are more legible to AI models than others, and this is not mysterious: models predict well on patterns they have seen often, and the patterns they have seen often in training data are the same ones that human readers find clear.

Small, cohesive modules with explicit interfaces perform better than large files with implicit dependencies. Strong type annotations reduce ambiguity about what a function expects and returns. Consistent naming conventions let the model reason by analogy across similar functions. Avoiding clever tricks, whether creative use of metaclasses or elaborate DSLs, keeps the model from having to infer what the code does from unusual patterns.
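To make that concrete, here is a small sketch of what an explicit, typed boundary can look like in Python. Everything in it is hypothetical (the `Reminder` type, the `ReminderStore` protocol, and the in-memory implementation are invented for illustration), and it is kept synchronous so that it runs without a database or event loop.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Reminder:
    user_id: int
    message: str
    due_in_seconds: int


class ReminderStore(Protocol):
    """Explicit interface: the only storage operations reminder code relies on."""

    def add(self, reminder: Reminder) -> int: ...

    def pending_for(self, user_id: int) -> list[Reminder]: ...


class InMemoryReminderStore:
    """Toy implementation so the sketch runs without a database."""

    def __init__(self) -> None:
        self._rows: list[Reminder] = []

    def add(self, reminder: Reminder) -> int:
        self._rows.append(reminder)
        return len(self._rows)  # 1-based row id

    def pending_for(self, user_id: int) -> list[Reminder]:
        return [r for r in self._rows if r.user_id == user_id]


def schedule_reminder(store: ReminderStore, user_id: int, message: str, minutes: int) -> int:
    """Validate the input and store a reminder; returns the new row id."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    reminder = Reminder(user_id=user_id, message=message, due_in_seconds=minutes * 60)
    return store.add(reminder)
```

The signature of `schedule_reminder` states what goes in and what comes out, and the protocol states what the function is allowed to depend on. A reader, human or model, does not need to open the storage module to edit the caller correctly.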

This is not a new insight. Every piece of advice about writing maintainable code for human readers applies here too. What changes is the feedback loop. When a human reader encounters a tangled module, the feedback is slow: a PR review comment, a confused new hire, a bug from someone who misunderstood the structure. When an AI encounters a tangled module, the feedback is immediate: it produces a confusing or wrong edit, right now, in this session. The economic case for clean architecture stays the same; the payback just arrives sooner.

What this means practically is that architectural refactoring to improve AI output is often identical to refactoring you would have wanted to do anyway. The bottleneck is often just prioritization: cleaning up module boundaries never felt urgent because the cost was diffuse. That calculation shifts when every AI session is paying that cost in degraded output quality.

Codebase Garbage Collection

The third component is the one that surprised me when I first thought through its implications. Böckeler uses the phrase “garbage collection of the codebase,” and it captures something specific: dead code, unused dependencies, and stale comments are not just technical debt in the traditional sense. They are active noise in the model’s context.

A human reader encountering dead code can often infer it is dead from surrounding context, git blame, or asking a colleague. A model reading the same file has no reliable way to distinguish active code paths from inactive ones unless you tell it. Contradictory patterns, where one module does things one way and another module does the same thing differently because of historical accident, are similarly confusing. The model has no way to know which pattern is the current one.

The practical consequence is that codebase cleanup that would normally sit in a backlog becomes harness maintenance with a direct quality cost. Removing a dead code path is not just housekeeping; it is narrowing the search space the model operates in. Deleting an unused dependency removes a source of plausible-but-wrong suggestions. Fixing inconsistent naming conventions across modules reduces the chance the model generates something that looks right but uses the wrong convention for this part of the codebase.
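A rough first pass at finding candidates for removal can be built on Python’s standard `ast` module. The sketch below is a heuristic, not a replacement for a real dead-code tool: it assumes that a top-level function whose name never appears anywhere else in the given sources is worth a look, and the function name `unreferenced_functions` is invented for illustration.

```python
import ast


def unreferenced_functions(sources: dict[str, str]) -> list[str]:
    """Return 'file:function' for top-level functions whose name is never used.

    sources maps a file name to its Python source text.
    """
    defined: list[tuple[str, str]] = []  # (file name, function name)
    used: set[str] = set()

    for filename, text in sources.items():
        tree = ast.parse(text, filename=filename)
        # Collect top-level function definitions only.
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.append((filename, node.name))
        # Collect every name that is referenced anywhere in the file.
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                used.add(node.id)  # plain reference: foo()
            elif isinstance(node, ast.Attribute):
                used.add(node.attr)  # attribute reference: module.foo()
            elif isinstance(node, ast.alias):
                used.add(node.name)  # import: from module import foo

    return sorted(f"{file}:{name}" for file, name in defined if name not in used)
```

Entry points that a framework calls by convention, such as event handlers, are never referenced by name and will show up in the result. Treat the output as a list to review by hand, not a list to delete.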

None of this requires exotic tooling. It requires treating cleanup work as engineering infrastructure rather than optional polish.

The Team Dimension

What makes the harness engineering framing valuable is that it surfaces something individual prompt crafting habits obscure: the harness is a shared resource. If one developer keeps CLAUDE.md current for the areas they touch and another lets it drift, every session that reads the file inherits the stale parts. The same goes for the code itself: dead code in a module you do not own still confuses the model when it is editing code you do own.

This creates coordination requirements that did not exist before. Teams need shared conventions for what goes in context files and who is responsible for keeping them current. Code review needs to consider whether a change creates new inconsistencies that will mislead the model. Architectural decisions need to account for AI legibility as a first-class concern, not an afterthought.

Some teams are handling this by treating context file updates as part of the definition of done for any significant change. If you add a new module boundary, you update CLAUDE.md to describe it. If you change a database access pattern, you update the constraints section. This is not much overhead per change, but it requires that the convention exists and is enforced.
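Enforcement can be as light as a check in CI. The sketch below is one possible shape for it, under stated assumptions: its input is the line format printed by `git diff --name-status` (a status letter, a tab, then the path, with renames listing the old and new path), the watched directories are the hypothetical `cogs/` and `db/` from the earlier example, and only added, deleted, or renamed files count as structural changes.

```python
# Hypothetical rule set: structural changes in these directories
# should come with an update to the context file.
WATCHED_DIRS = ("cogs", "db")
CONTEXT_FILE = "CLAUDE.md"
STRUCTURAL_STATUSES = {"A", "D", "R"}  # added, deleted, renamed


def structural_changes_without_context_update(name_status_lines: list[str]) -> list[str]:
    """Return watched paths that were added, deleted, or renamed
    in a change set that did not also touch the context file."""
    flagged: list[str] = []
    context_touched = False

    for line in name_status_lines:
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status = parts[0][:1]  # "R100" -> "R"
        path = parts[-1]  # for renames this is the new path
        if path == CONTEXT_FILE:
            context_touched = True
        elif status in STRUCTURAL_STATUSES and path.split("/")[0] in WATCHED_DIRS:
            flagged.append(path)

    return [] if context_touched else flagged
```

A CI job could fail, or just post a reminder, when the returned list is not empty. The check cannot judge whether the update to CLAUDE.md is accurate, only whether someone looked at the file.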

Why the Name Matters

Software teams have understood maintainability and code hygiene for decades. Nothing in harness engineering is strictly new. What is new is that the feedback is now fast and visible rather than slow and diffuse. A poorly maintained harness degrades AI output today, in a way that developers can observe directly. That immediacy changes what teams are willing to prioritize.

Naming the discipline helps because it creates a shared vocabulary for the work. “We need to do some harness engineering” is a more actionable statement than “we should clean up the codebase.” It also correctly frames the work as infrastructure maintenance, not feature development, which means it belongs in the same prioritization conversation as CI reliability or test coverage, not in the backlog of nice-to-haves.

Böckeler’s engagement with OpenAI’s framing is a good example of the kind of conceptual work the field needs right now: taking practices that are emerging organically across teams and giving them enough structure to reason about and communicate clearly. Whether “harness engineering” is the name that sticks matters less than recognizing that the category of work is real, that it is distinct from prompt engineering, and that it belongs in how teams plan and prioritize their engineering work.
