Harness Engineering: Why the Leverage Is in the Infrastructure, Not the Model
Source: martinfowler.com
Birgitta Böckeler’s February 2026 piece on martinfowler.com revisits OpenAI’s framing of harness engineering as a core activity in AI-enabled software development. A few weeks on, it holds up better than most takes on AI tooling, and the reason is the name. “Harness engineering” does something that “prompt engineering” never quite managed: it positions the work as infrastructure, not incantation.
In software testing, a harness is the scaffolding around the system under test. It controls inputs, captures outputs, and defines what success looks like. The model being tested doesn’t know the harness exists; the harness is the reason the model can do useful work at all. Applying this metaphor to AI-assisted development, where the language model is the system being harnessed, reframes a set of practices that previously lived under vague labels like “AI workflow” or “LLM integration” into something with a clear boundary and clear responsibilities.
The harness, as described in OpenAI’s write-up that Böckeler discusses, has three components: context engineering, architectural constraints, and garbage collection of the codebase. Each of those names is a deliberate borrow from existing computer science vocabulary, and unpacking each one reveals why the framing is productive.
Context Engineering: It Was Never About the Prompt
Most practitioners who work with AI coding tools treat “prompt” and “context” as roughly synonymous, but they describe fundamentally different things. The prompt is what you type in the chat window. The context is everything the model sees, including the system prompt, the conversation history, injected files, tool call results, retrieved documents, and whatever the agent framework has assembled. The prompt is a small and often late-added component of a much larger constructed input.
Context engineering is the discipline of deliberately managing that full construction. It involves deciding which files to include and which to leave out, how to structure retrieved content so the model encounters it in a useful order, how to allocate the token budget across competing concerns, and when to summarize or discard prior context to make room for what is currently relevant.
Token budget management is worth dwelling on. Current frontier models offer context windows of 128k to 200k tokens, enough that teams often assume they can include everything. But context quality degrades as quantity increases. A model given an entire repository sees more noise than signal, and its outputs reflect that. The better approach is to treat context assembly the way you would a database query: retrieve the minimum set of information that makes the task solvable, structure it with precision, and place the most relevant content where the model's attention is strongest.
One concrete pattern that emerges from this is layered system prompts. Instead of a single monolithic system message, split the context into layers managed independently:
Layer 1 (stable): Role definition, output format expectations
Layer 2 (project): Codebase conventions, architectural patterns, key abstractions
Layer 3 (task): The specific file or module in scope, relevant interfaces
Layer 4 (dynamic): Tool call results, retrieved documentation, current state
Each layer has a different update frequency and a different token budget cap. When a layer fills, you summarize rather than truncate blindly. The model gets a coherent picture at each level of abstraction rather than a fragmented dump of everything that seemed potentially relevant.
This connects directly to work in information retrieval and retrieval-augmented generation, where the central problem has always been relevance: how do you identify the right documents, rank them correctly, and inject them without overwhelming the consumer? Context engineering for AI coding tools is the same problem applied to a different kind of consumer. RAG is one technique within context engineering, not a synonym for it.
Architectural Constraints: Designing for AI Legibility
There is a growing body of practice around writing code that human collaborators can read quickly. Clean architecture, good naming, separation of concerns: all of it fundamentally reduces the cognitive load of understanding a system. AI coding tools apply a different but overlapping set of reading heuristics, and a codebase that is legible to humans is not automatically legible to models.
Architectural constraints, in the harness engineering framing, are the deliberate design decisions that make the codebase tractable for a model working within a bounded context window. Several patterns stand out.
Small, focused modules with descriptive names allow the model to select relevant files without reading every file to determine relevance. A 2,000-line utils.ts file containing a hundred loosely related functions is not just bad for human maintainability; it is actively hostile to context engineering because any task that touches it requires injecting the entire file.
Strong static types give the model structural information without requiring it to trace execution paths. TypeScript codebases routinely outperform equivalent JavaScript codebases in AI coding tasks, not because the model prefers TypeScript syntax but because interfaces and type signatures compress architectural information that would otherwise require reading five files of implementation detail. The model can reason about a UserRepository interface without seeing the PostgreSQL implementation behind it.
Standard patterns reduce uncertainty. The repository pattern, the command pattern, conventional REST resource naming: models have seen enormous quantities of code that follows these conventions and can navigate them without needing the full implementation in context. Bespoke patterns require more context to explain, cost more tokens, and produce less reliable outputs.
This is where the connection to existing CS literature is clearest. Architectural fitness functions, as described in the evolutionary architecture literature, are automated checks that verify an architecture stays within defined constraints over time. Harness-aware fitness functions would check module size limits, interface completeness, naming conventions, and coupling metrics — not just for human code quality reasons but specifically because violations degrade AI tool performance. Measuring this is feasible now; very few teams are doing it.
Garbage Collection: The Codebase as a Managed Resource
The third component has the most unusual name for a software quality practice, and it is also the most insightful. “Garbage collection” in this context means actively removing the semantic noise that accumulates in a codebase over time: dead code, stale comments, outdated documentation, deprecated functions left for backward compatibility, and TODO comments that refer to decisions made years ago.
Human developers navigate this noise using git history and institutional knowledge. They read a comment like // temporary fix until we migrate off the old auth service (2022) and mentally discount it. Models lack that disambiguation mechanism. They treat all text in the context window as signal. Stale comments become competing hypotheses about what the code does. Deprecated functions suggest alternative approaches. Accumulated context noise degrades output quality in ways that are difficult to attribute to any single cause.
```typescript
// Semantic garbage that degrades model output:
// OLD: returned UserProfile before the 2023 auth migration
// TODO: remove once all callers migrated (see JIRA-4892)
// @deprecated - prefer getUserById for new code
async function fetchUser(id: string): Promise<any> {
  return legacyClient.get(id);
}
```
The problem here is not the deprecated function itself. Deprecated code sometimes exists for valid reasons. The problem is the accumulated commentary that introduces ambiguity about what the function does, whether it should be used, and what its relationship is to the current system. The harness engineering response is to treat this as a resource management problem: semantic context, like memory, is finite and must be actively reclaimed.
The analogy to garbage collection in memory management is precise in a way that earns the name. In runtime memory management, garbage is memory that is no longer reachable from any live reference. In a codebase managed for AI context, garbage is text that will actively mislead the model rather than inform it. Reachability in the CS sense becomes something like semantic relevance to current system behavior. The discipline of identifying and removing it is not just cleanup; it is active management of the model’s ability to reason about the system.
The same principle applies to conversation history in agentic workflows. Long-running agents accumulate tool call results, intermediate reasoning traces, and failed attempts in their context window. Without deliberate pruning or summarization, the model spends increasing attention on past state that is no longer relevant to the current step. Compressing resolved sub-tasks, discarding superseded results, and summarizing completed reasoning chains is how agentic systems stay coherent across many steps.
Why Naming This Matters
Böckeler’s central observation, building on OpenAI’s framing, is that harness engineering is a distinct and learnable discipline. That observation does real work. It separates out practices that are currently scattered across roles, spanning DevOps, software architecture, prompt engineering, and developer tooling, and gives them a unified identity. It suggests that teams should have people who own the harness the way they own CI infrastructure or the test suite.
The three component names are well-chosen precisely because they are borrowed. Context engineering inherits years of thinking about information retrieval and search relevance. Architectural constraints inherits decades of software design literature. Garbage collection inherits an entire field of systems programming concerned with automatic memory management. Practitioners do not need to start from scratch; they can apply existing intuitions and adapt existing techniques.
The maturity trajectory here looks familiar. AI-assisted development is moving through the same arc as earlier computing paradigms: first, the raw capability exists; then, people develop ad-hoc practices around it; then, the practices get named and codified; then, they become teachable disciplines with tooling support. Harness engineering is the naming-and-codifying step. The tooling and the teachable curriculum will follow, and teams that treat the harness as infrastructure now will be ahead when they do.