Most explanations of coding agents collapse to: the model sees a task, calls tools, observes results, repeats. Simon Willison’s guide on agentic engineering patterns covers the structural overview well. The more interesting territory is the design decisions inside each component and how those decisions shape what the agent can and cannot accomplish.
The Core Loop
Every coding agent runs a variant of the ReAct pattern (Reason + Act), described by Yao et al. in 2022. The core finding was that interleaving reasoning traces with tool invocations outperforms doing all the reasoning upfront. Instead of planning the full solution and executing it, the model reasons, acts, observes, and reasons again. Each observation informs the next reasoning step.
In practice:
- Model receives task, context, and tool definitions
- Model outputs either a tool call or a final response
- If a tool call, the runtime executes it and appends the result to the conversation
- Model receives the updated conversation and repeats
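The loop above can be sketched in a few lines. Note that `call_model` and `execute_tool` are hypothetical stand-ins for a real model API and tool runtime; the point is the control flow, not the plumbing.

```python
def call_model(messages):
    # Hypothetical model call. Here it requests a tool once, then
    # answers after it has seen a tool observation.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "run_tests", "args": {}}
    return {"type": "final", "content": "All tests pass."}

def execute_tool(name, args):
    # Hypothetical tool runtime.
    return "2 passed, 0 failed"

def agent_loop(task, max_turns=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Execute the tool and append the observation; the next model
        # call sees the updated conversation. This append-then-repeat
        # step is why turns are serialized.
        result = execute_tool(reply["name"], reply["args"])
        messages.append({"role": "tool", "content": result})
    return None
```

The `max_turns` cap is the other piece every real runtime has: without it, a model that never emits a final response loops forever.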
One practical consequence of this structure is that the loop is serialized by default. Each step depends on the previous observation, so you cannot parallelize across turns. Some systems enable parallel tool calls within a single turn (Anthropic’s API supports requesting multiple tools simultaneously in one response), but the fundamental dependency chain across turns remains. A long-running agent task spends most of its wall time waiting for tool execution results, not generating tokens.
File Editing: The Design Space
File editing is where most of the engineering complexity lives in coding agents. There are three broad approaches, each with real trade-offs.
Full file replacement: The agent reads a file into context, modifies the content, and writes the entire file back. The implementation is trivial. The cost is tokens: a 2,000-line file consumes a significant fraction of your context budget to change one function signature. For agents with large codebases and tight context budgets, this becomes untenable quickly.
Search and replace: The agent specifies an exact old string and a new string. The runtime finds the first matching occurrence and replaces it. This is the approach Claude Code’s Edit tool uses. The agent only needs to output the affected lines, not the entire file. The failure mode is ambiguity: if the target string appears more than once, the replacement becomes unpredictable. Good implementations enforce uniqueness, requiring the model to provide enough surrounding context to guarantee a single match. This adds a constraint on the model but makes the operation safe to execute without human confirmation.
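The uniqueness check is small enough to show in full. This is a minimal sketch of the idea, not any particular tool's implementation: refuse the edit unless the old string matches exactly once.

```python
def safe_replace(text: str, old: str, new: str) -> str:
    # Enforce uniqueness: reject the edit if the target string is
    # absent or ambiguous, so the operation either does exactly what
    # the model intended or fails loudly.
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found in file")
    if count > 1:
        raise ValueError(
            f"old string matches {count} times; "
            "include more surrounding context to disambiguate"
        )
    return text.replace(old, new)
```

When the check fails, the error message goes back to the model as a tool result, and the model retries with a longer, unambiguous snippet.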
Unified diff application: The agent outputs a standard unified diff, which the runtime applies. This is the most compact representation but also the most fragile. LLMs generate syntactically incorrect diffs more often than expected: off-by-one errors in line numbers, confusion between context lines and deletion lines, whitespace mismatches. Without a validation layer that rejects malformed diffs before execution, this approach produces unreliable edits. Systems that use it successfully wrap it in retry logic or AST-aware validation.
In practice, search-and-replace dominates in production agents because it balances compactness against reliability. It requires the model to produce only what changes, enforces a testable correctness criterion, and fails loudly when that criterion is not met.
Shell Access: The Capability Trade-off
Shell access is the most consequential tool in any coding agent’s repertoire. With it, the agent can run tests, invoke compilers, install dependencies, query a running database, or observe the output of a web server. Without it, the agent can only edit files and guess whether the edits work.
SWE-agent from Princeton NLP, one of the early systems to achieve strong results on the SWE-bench benchmark, made shell access central to its design. The agent receives a failing GitHub issue, clones the repository, and works through a bash session: reading tracebacks, running failing tests, modifying source files, re-running tests. The tight feedback loop between edit and execution is what makes it effective at real bug fixing. A model that cannot run the code cannot verify its own changes.
The risk profile is significant. Shell access inside a well-sandboxed container with a cloned repo and no production credentials is relatively safe. Shell access in an environment with network access, cloud provider credentials, and write permissions to production systems is a different category of risk. Most serious deployments run agents in isolated containers or virtual machines, which constrains the blast radius of a bad tool call but also limits what the agent can accomplish: no live database inspection, no external API calls, no actual dependency installation.
Systems like OpenHands (the open-source agent framework from All Hands AI) let operators configure sandbox policies, but there is no clean answer to how much capability to grant. The more hermetic the sandbox, the safer the agent is to run, and the less it can actually do.

Context Window Management
The context window fills up faster than most people expect. A typical coding session accumulates: the system prompt with tool definitions and instructions (often 10,000 to 30,000 tokens for a fully-equipped agent), the task description, every tool call and its result, all reasoning traces, and the growing conversation history. On a long debugging session against a large codebase, you can exhaust the context of a 200,000-token model without trying.
Different systems handle this differently:
Summarization compresses older conversation turns into a compact narrative when the context approaches its limit. Claude Code does this automatically. The risk is information loss: a summarization pass may omit a constraint the model stated four turns ago that still applies.
Sliding windows drop the oldest turns entirely. Simple to implement, but the agent loses memory of early decisions. An agent that has forgotten it already modified a particular file may modify it again, overwriting its own work.
Structured recall extracts key facts (which files were modified, what errors were encountered, what constraints were established) into a separate block that persists across compaction. This is more reliable than narrative summarization but requires careful thought about what counts as important enough to preserve.
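A sketch of the structured-recall idea follows. The fact categories and class shape are illustrative, not taken from any real system; the essential property is that these facts are re-rendered into the prompt after every compaction pass, so they survive even when the turns that produced them are gone.

```python
class SessionMemory:
    # Persistent facts that survive context compaction.
    def __init__(self):
        self.files_modified = set()
        self.errors_seen = []
        self.constraints = []

    def record_edit(self, path: str):
        self.files_modified.add(path)

    def record_error(self, message: str):
        self.errors_seen.append(message)

    def record_constraint(self, text: str):
        self.constraints.append(text)

    def render(self) -> str:
        # Emitted into the prompt after each compaction pass, in place
        # of the dropped or summarized turns.
        return "\n".join([
            "Files modified: " + ", ".join(sorted(self.files_modified)),
            "Errors seen: " + "; ".join(self.errors_seen),
            "Constraints: " + "; ".join(self.constraints),
        ])
```

An agent with this block in context cannot forget that it already edited a file, which directly addresses the overwrite failure mode of sliding windows.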
The compaction problem matters for correctness, not just resource management. An agent operating over a long session without proper context management will exhibit inconsistent behavior: confident early, confused later, occasionally contradicting decisions it made an hour before.
Tool Definitions Shape Behavior
The tools you give an agent, and how you describe them, influence what the agent does at least as much as the underlying model capability.
An agent with only file read/write tools approaches problems as text manipulation. The same agent with shell access starts running test suites and reading compiler output. An agent with a “search codebase” tool will use exact string search or semantic search depending entirely on how the description communicates what the tool does. The descriptions are, in effect, part of the agent’s instructions for that session.
Model Context Protocol (MCP), Anthropic’s open standard for tool definitions, attempts to standardize this layer so that tool servers can be built once and consumed by multiple agent runtimes. The protocol specifies how tools declare their name, description, and input schema in a structured format that the model’s tool-selection logic can interpret. A well-designed MCP server for reading Kubernetes pod logs can in principle be dropped into any MCP-compatible agent without modification.
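The shape of such a declaration, shown here as a Python dict (the pod-logs tool and its parameter names are hypothetical; the three fields follow MCP's name/description/input-schema structure):

```python
# Illustrative MCP-style tool declaration. The description does double
# duty: it is documentation for humans and instructions for the model
# about when the tool is appropriate.
pod_logs_tool = {
    "name": "get_pod_logs",
    "description": (
        "Fetch recent log lines from a Kubernetes pod. Use this to "
        "diagnose crashes or inspect application output."
    ),
    "inputSchema": {  # JSON Schema for the tool's arguments
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "pod": {"type": "string"},
            "tail_lines": {"type": "integer", "default": 100},
        },
        "required": ["namespace", "pod"],
    },
}
```

Everything in this structure reaches the model verbatim, which is why wording the description carefully matters as much as implementing the tool correctly.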
What Separates Useful Agents from Frustrating Ones
The loop architecture is table stakes at this point. What distinguishes useful agents from frustrating ones comes down to a few concrete properties.
Observation quality. An agent that receives truncated tool output cannot reason about what it cannot see. A test runner that cuts stderr at 500 characters prevents the model from reading the part of the traceback that contains the actual failure.
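One simple mitigation, sketched here: when tool output must be truncated, keep both the head and the tail rather than cutting from the front, since in a traceback the actual failure usually appears at the end.

```python
def truncate_output(text: str, limit: int = 4000) -> str:
    # Preserve the start and end of the output; the middle is the
    # least informative part of most long tool results.
    if len(text) <= limit:
        return text
    half = limit // 2
    omitted = len(text) - 2 * half
    return text[:half] + f"\n... [{omitted} chars omitted] ...\n" + text[-half:]
```

The marker line also tells the model that truncation happened, so it can re-run the tool with a narrower query instead of reasoning over an invisibly incomplete observation.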
Error recovery. When a tool call fails or returns unexpected output, does the agent recognize that and try a different approach, or does it repeat the same call? The ReAct pattern enables recovery, but only if the model attends to its observations and updates its strategy accordingly.
Scope discipline. The tendency for agents to modify code they were not asked to touch is a reliability problem in practice. Unexpected changes to adjacent files create surprising diffs, complicate code review, and make the agent’s output harder to trust over time.
Most current research in this space, visible in work on SWE-bench variants and in systems like OpenHands, focuses on these behavioral properties rather than the core loop mechanics. The loop is understood. How reliably the model uses it is not.