
Inside the Loop: The Engineering Behind Coding Agents

Source: Simon Willison

The basic idea behind a coding agent is not complicated. Give a language model access to tools, let it run them, read the output, and keep going until the task is done. Simon Willison’s guide on agentic engineering patterns lays out this core loop clearly. But the gap between understanding the loop and building a system that reliably writes, tests, and ships working code is where most of the interesting engineering lives.

The Loop Itself

The foundation is what the 2022 ReAct paper (Reason + Act) formalized: interleave reasoning traces with action execution. The model generates a thought about what to do, calls a tool, receives the result, thinks again, and continues. In practice, modern coding agents don’t use free-form text parsing to extract tool calls. They rely on native function calling APIs, where the model returns a structured object specifying which tool to call and with what arguments.

{
  "type": "tool_use",
  "name": "read_file",
  "input": {
    "path": "src/parser/lexer.py"
  }
}

The scaffolding code executes the tool, captures stdout, stderr, and exit codes, then appends the result back into the conversation as a tool result message. The model sees a growing transcript of actions and observations, and at each step decides whether to keep working or declare the task complete.
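
The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: `call_model` and `run_tool` are hypothetical stand-ins for a real API client and a tool dispatcher.

```python
# Minimal sketch of the agent loop. call_model() and run_tool() are
# hypothetical helpers standing in for a real API client and tool handlers.

def agent_loop(task, call_model, run_tool, max_steps=20):
    """Run the reason/act loop until the model stops requesting tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)          # the model's next turn
        messages.append(reply)
        if reply.get("type") != "tool_use":   # no tool call: task is done
            return reply["content"], messages
        # Execute the requested tool and append the observation.
        result = run_tool(reply["name"], reply["input"])
        messages.append({"role": "tool_result", "content": result})
    return None, messages                     # step budget exhausted
```

The `max_steps` cap is not an afterthought: without it, nothing bounds how long the loop runs.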

At the API level, this is almost embarrassingly simple. The complexity is everything else.

Tool Design Is Architecture

Which tools you expose to the model, and how you describe them, shapes agent behavior as much as the underlying model. SWE-agent, the research system from Princeton that helped define how agents approach GitHub issues, introduced the concept of an Agent-Computer Interface (ACI) as a deliberate design layer. The analogy to HCI is intentional: just as human-computer interfaces need affordances that match human cognition, agent-computer interfaces need affordances that match model cognition.

Concretely, this means:

  • File viewing tools that show line numbers, so the model can reference specific locations
  • Search tools that return surrounding context rather than just file paths
  • Edit tools that work on line ranges rather than requiring full file rewrites
  • Shell execution that captures output even when commands fail
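
A file viewer of the first kind might look like the sketch below. The function name, header format, and windowing are illustrative, not SWE-agent's actual implementation; the point is that every line carries a number the model can cite back.

```python
# Sketch of an ACI-style file viewer: returns a numbered window of a file
# so the model can reference exact locations in later edits.

def view_file(path, start=1, window=100):
    """Return up to `window` lines of `path`, each prefixed with its number."""
    with open(path) as f:
        lines = f.readlines()
    end = min(start + window - 1, len(lines))
    numbered = [f"{i:>5} {lines[i - 1].rstrip()}" for i in range(start, end + 1)]
    header = f"[{path}: lines {start}-{end} of {len(lines)}]"
    return "\n".join([header] + numbered)
```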

SWE-agent found that seemingly minor differences in tool design produced large swings in performance on SWE-bench, the benchmark derived from 2,294 real GitHub issues. Their ACI-optimized tools improved resolve rates by several percentage points over naive bash access.

Claude Code takes a different approach: it leans heavily on direct bash execution rather than specialized tools, trusting that a capable model in a real Unix environment can figure out how to navigate a codebase. Both philosophies work. The bash-heavy approach is simpler to build; the ACI approach gives more structured affordances. The tradeoff is between flexibility and predictability.

The Context Window Is the Real Bottleneck

A fresh checkout of a moderately sized project might have 100,000 lines of code. Even the largest context windows available today cannot hold all of that plus conversation history plus tool outputs. Coding agents have to solve a retrieval problem before they can solve the coding problem.

The strategies break down into a few categories.

Selective reading. The agent reads directory listings, finds relevant files through search, and loads only what it needs. This works well for well-structured codebases with clear module boundaries. It breaks down on tangled code where understanding one file requires understanding five others.

Semantic search. Some systems embed the codebase and retrieve chunks by similarity to the current task. This helps when the relevant code is not in an obvious location. The downside is that embedding-based retrieval misses syntactic specifics: a search for “user authentication” might miss a function called validate_credentials.
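
The retrieval mechanics can be illustrated with a toy sketch. Real systems use learned embeddings from a neural model; here a bag-of-words vector stands in, which also makes the lexical-mismatch failure easy to see.

```python
# Toy retrieval sketch: rank code chunks by cosine similarity to a query.
# A bag-of-words "embedding" stands in for a learned one; the mechanics
# (embed everything, rank by similarity, take top-k) are the same.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().replace("_", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

With this toy version, a query for "user authentication" ranks a comment mentioning those words above a `validate_credentials` definition that never uses them, which is exactly the mismatch described above; learned embeddings narrow that gap but do not close it entirely.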

Tree-sitter parsing. Using a real parser to extract function signatures, class definitions, and import graphs lets the agent build a structural map of the codebase without reading every line. The agent can then navigate to relevant definitions on demand. This is how Aider implements its repository mapping feature, using tree-sitter grammars to produce a compact representation of code structure that fits in context.
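
A simplified sketch of the idea, using Python's stdlib `ast` module as a stand-in for tree-sitter (which works across languages and does not require valid Python): extract one compact line per top-level definition and discard the bodies.

```python
# Repository-map sketch in the spirit of Aider's tree-sitter approach.
# Python's ast module stands in for tree-sitter: it extracts top-level
# function and class signatures without keeping any of the bodies.
import ast

def map_source(source, filename="<module>"):
    """Return one compact line per top-level definition in `source`."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"{filename}:{node.lineno} def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"{filename}:{node.lineno} class {node.name}")
    return entries
```

Run over every file in a repository, this produces a map that is orders of magnitude smaller than the code itself, with file-and-line anchors the agent can follow on demand.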

Conversation pruning. As the conversation grows, old tool outputs get summarized or dropped. The model loses access to things it read earlier, which can cause it to re-read files or forget decisions it already made. Managing this gracefully is one of the harder engineering problems in building reliable agents.
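
One simple pruning policy can be sketched as follows. The message shape and the "keep the last N tool results" cutoff are illustrative, not any particular system's strategy; real systems often summarize rather than truncate.

```python
# Pruning sketch: keep recent tool results verbatim and collapse older
# ones to a short stub so the transcript stops growing without bound.

def prune_transcript(messages, keep_recent=3):
    """Replace all but the most recent tool results with short stubs."""
    tool_indices = [i for i, m in enumerate(messages)
                    if m.get("role") == "tool_result"]
    to_stub = set(tool_indices[:-keep_recent]) if keep_recent else set(tool_indices)
    pruned = []
    for i, m in enumerate(messages):
        if i in to_stub:
            stub = m["content"][:40]
            pruned.append({"role": "tool_result",
                           "content": f"[pruned: {stub}...]"})
        else:
            pruned.append(m)
    return pruned
```

The stub keeps a trace of what was there, so the model at least knows a file was read even if the contents are gone, and can re-read it if needed.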

Error Recovery and Self-Correction

A coding agent that can only succeed when every step works is not useful. Real codebases have tests that fail for reasons unrelated to the change, linters that complain about style, and build systems that behave differently across machines.

The agent loop handles this naturally to a point: when a bash command returns a non-zero exit code, the output goes back into the context as an observation, and the model can try a different approach. What matters is whether the model recognizes a recoverable error versus an unrecoverable one, and whether it knows when to stop trying.

Unconstrained retry loops are a real failure mode. An agent that keeps attempting variations on the same wrong approach can burn through a lot of tokens and time without making progress. Systems like Claude Code address this by keeping humans in the loop for certain decisions. When the agent is uncertain, it asks. This sounds obvious but represents a genuine design choice: full autonomy versus interactive autonomy. The agents that work best in practice tend toward the interactive end, at least for now.
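
The structural fix for runaway retries can be sketched directly. The error classification below is a deliberately crude stand-in (a real agent lets the model judge recoverability); the attempt cap is the part that matters.

```python
# Bounded-retry sketch. The RECOVERABLE markers are illustrative stand-ins
# for the model's own judgment; the cap on attempts is what prevents the
# runaway-retry failure mode described above.

RECOVERABLE = ("timeout", "flaky", "lock held")

def run_with_retries(step, max_attempts=3):
    """Run `step` until it succeeds, the error looks fatal, or we hit the cap."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        ok, output = step(attempt)      # step returns (succeeded, output)
        if ok:
            return {"status": "ok", "attempts": attempt, "output": output}
        errors.append(output)
        if not any(marker in output for marker in RECOVERABLE):
            return {"status": "fatal", "attempts": attempt, "errors": errors}
    return {"status": "gave_up", "attempts": max_attempts, "errors": errors}
```

The `gave_up` outcome is where interactive systems hand control back to the human rather than trying a fourth variation on the same wrong idea.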

The research framing here is the distinction between oracle and retrieval settings. The original SWE-bench evaluation included an oracle condition that hands the model the exact files the reference patch modified; a deployed agent has to find the relevant files, and decide on its own when it is done. Real deployments always face the messier case.

What the Scaffolding Actually Does

The model is one component of a coding agent. The scaffolding around it handles:

  • Process management. Running bash commands, managing working directories, handling timeouts for commands that hang or loop.
  • State management. Tracking which files have been modified, maintaining a diff of changes made so far, potentially managing git operations like staging and committing.
  • Tool routing. Parsing tool call responses, dispatching to the right handler, formatting results back into the conversation in a way the model can use.
  • Interruption handling. Letting the user interrupt a long-running agent, inspect the current state, and decide whether to continue or redirect.
  • Cost tracking. Counting tokens, estimating costs, enforcing budgets.
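
The first of these, process management, can be sketched with the stdlib. `subprocess.run` with `timeout` and `capture_output` is a real Python API; the result shape is illustrative. The key property is that failures and hangs become observations the model can read, not exceptions that crash the agent.

```python
# Process-management sketch: run a shell command with a timeout, always
# returning stdout, stderr, and the exit code as a structured observation.
import subprocess

def run_command(cmd, cwd=None, timeout=60):
    """Run `cmd` in a shell and return a structured observation either way."""
    try:
        proc = subprocess.run(cmd, shell=True, cwd=cwd, timeout=timeout,
                              capture_output=True, text=True)
        return {"exit_code": proc.returncode,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # A hung command becomes an observation too, not a crashed agent.
        return {"exit_code": None, "stdout": "",
                "stderr": f"timed out after {timeout}s"}
```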

Open source systems like SWE-agent and Aider make this scaffolding visible. Looking at their source code shows how much engineering sits between “the model has tools” and “the agent works reliably.” The code that actually calls the model is maybe 20% of the system; the scaffolding is the rest.

Where the Real Limits Are

Model capability is obviously a ceiling. Agents built on weaker models fail more often, make more logical errors in multi-step reasoning, and produce lower-quality code. The SWE-bench Verified leaderboard makes this clear: the gap between frontier models and models from a year prior is substantial. As of early 2026, top systems resolve around 50 to 60 percent of benchmark issues, which sounds impressive until you consider that these are curated, well-specified issues from well-maintained Python repositories.

Real software projects are messier. The issues are underspecified. The codebases have unusual conventions. The tests are flaky. An agent that performs well on SWE-bench still needs significant supervision on production code.

The more fundamental limit is that coding agents operate by reading and writing text. They do not execute the code to understand it; they read it. When behavior is difficult to infer from static analysis, which is often, the agent has to use execution as a probe. Running a minimal test case to observe behavior before writing a fix is a strategy that the best systems employ deliberately. Less sophisticated systems skip this step and guess.

What This Means for Building With Agents

If you are building on top of coding agent infrastructure rather than building the agents themselves, the practical implications are straightforward.

The quality of the task description matters more than almost anything else. An agent given a vague task will make assumptions, and some of those assumptions will be wrong. “Fix the login bug” will produce something different from “The authenticate_user function in auth/session.py raises a KeyError when the session token is expired rather than returning None. Fix it to match the return type documented in the docstring.”

Smaller, well-scoped tasks succeed more reliably than large open-ended ones. “Add error handling to this function” works better than “improve the robustness of this module.”

The choice of which tools to expose affects which strategies the agent can pursue. A sandboxed environment where the agent cannot run arbitrary commands will be safer but less capable. That tradeoff is real and worth making deliberately rather than accidentally.

Simon Willison’s framing of coding agents as a tool-use loop is the right mental model to start with. The interesting questions are all about what fills in around that loop: which tools, how the context is managed, how errors are handled, and how much the human stays in the picture. Those choices determine whether an agent is a useful collaborator or an expensive way to generate plausible-looking wrong answers.
