
The Tool Loop Is Deterministic, the Decision Layer Is Not

Source: simonwillison

The core of a coding agent is almost disappointingly simple once you see it. An LLM receives a context window containing instructions, conversation history, available tools, and observations from previous steps. It decides what to do. It either calls a tool or produces a final response. The result of any tool call gets appended to the context window, and the loop continues.
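The loop described above can be sketched in a few lines of Python. Everything here is hypothetical scaffolding, not any specific framework’s API: `decide` stands in for the LLM call, and tools are plain functions keyed by name.

```python
def run_agent(decide, user_task, tools, max_steps=20):
    """decide(context, tools) -> action dict; in a real agent this is the LLM call."""
    # The context window is just an ordered list of messages.
    context = [{"role": "user", "content": user_task}]
    for _ in range(max_steps):
        action = decide(context, tools)  # the probabilistic decision step
        if action["type"] == "final":
            return action["content"]
        # The deterministic step: execute the named tool with its arguments.
        result = tools[action["tool"]](**action.get("arguments", {}))
        # The observation is appended to context, and the loop continues.
        context.append({"role": "assistant", "content": action})
        context.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

Note that the loop itself contains no intelligence: all of the interesting behavior lives in `decide` and in what the tool results put back into `context`.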

Simon Willison’s guide on how coding agents work lays this out clearly. What that framing surfaces, though doesn’t belabor, is that this architectural simplicity is both the strength of these agents and the main source of difficulty in building reliable systems with them.

The Loop as an Architectural Primitive

The observe-decide-act loop isn’t novel. Control systems, game AI, and robotics have used similar patterns for decades. What’s different with LLM-based coding agents is that the “decide” step is probabilistic and operates over natural language, which changes the engineering constraints substantially.

In a traditional automated system, you test a loop by enumerating states. Transitions are deterministic. With an LLM agent, the decision step is shaped by the full contents of the context window, and two runs with identical inputs can produce different tool call sequences. The loop is the same; the decision quality varies.

This asymmetry between deterministic tool execution and non-deterministic decision-making is worth sitting with. Every tool result is exact. read_file returns the file contents. run_bash returns stdout and exit code. The LLM’s interpretation of those results, and its choice of next action, is not exact. Engineering a reliable agent means controlling what the model sees and structuring tools so that the gap between “model interprets tool output” and “model takes next correct action” is as small as possible.

Context Window as Working Memory

The context window is the agent’s only working memory during a run. There’s no heap, no persistent variables, no mutable state outside of what gets written to the file system or other external systems through tool calls. Everything the agent “knows” at any point is text sitting in a sequence of tokens.

This shapes failure modes in a specific way. As a run progresses and tool results accumulate, the context fills. At some point, the model is reasoning over thousands of tokens of prior observations. The practical effect is that early context matters less; models in long contexts attend preferentially to recent information and to whatever was anchored in the original system prompt. An agent that correctly read a file fifty tool calls ago may “forget” a key constraint from that file if subsequent steps have pushed it far enough back in the sequence.

Frameworks like Claude Code and Cursor handle this by periodically summarizing or truncating context. The tradeoff is lossy: a summary of twenty tool calls is smaller than the raw observations but necessarily drops details. Knowing which details to drop requires the model to predict what will matter later, which it cannot do with certainty.
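A compaction step of this kind might look like the following sketch. The message format and the `summarize` callable (which in practice would be another model call) are assumptions, not how any particular framework implements it.

```python
def compact(context, summarize, keep_recent=10):
    # Keep the first message (system prompt / anchored instructions) and the
    # most recent messages verbatim; collapse everything in between into a
    # single lossy summary message.
    if len(context) <= keep_recent + 1:
        return context
    head = context[0]
    middle = context[1:-keep_recent]
    tail = context[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"[Summary of {len(middle)} earlier messages: {summarize(middle)}]",
    }
    return [head, summary] + tail
```

The lossiness is explicit in the code: whatever `summarize` drops from `middle` is gone for the remainder of the run.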

The engineering response to this problem is usually to keep individual tool results compact. If list_directory returns a flat list of 400 file paths, those 400 entries consume context the model will attend to for the rest of the run. Better to return the 10 most relevant paths with a note that the search was filtered. The output format of a tool is a design decision, not just a convenience.
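As a concrete sketch of that design decision, a directory-listing tool can cap its output and say so, rather than dumping every path into context. The filtering here (substring match on the filename, first N results) is an illustrative assumption; real relevance ranking would be more involved.

```python
import os

def list_directory(root, query, limit=10):
    # Collect matching file paths under root.
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if query in name:
                matches.append(os.path.join(dirpath, name))
    # Return at most `limit` entries, with an explicit note that the
    # listing was filtered, so the model knows the result is partial.
    lines = matches[:limit]
    if len(matches) > limit:
        lines = lines + [f"({len(matches) - limit} more matches omitted; narrow the query)"]
    return "\n".join(lines)
```

The omission note is as important as the cap: without it, the model has no way to distinguish “only 10 files exist” from “only 10 files were shown.”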

Tool Design as the Primary Engineering Surface

The system prompt configures intent. The tools configure capability. But tool descriptions do something more subtle: they shape how the model reasons about what it’s doing and when to use each tool.

A poorly described tool creates ambiguity. If edit_file and write_file both exist and their descriptions don’t clearly differentiate when each is appropriate, the model will sometimes choose wrong. The wrong choice isn’t random; it correlates with whatever patterns appear in training data. The model might prefer write_file because it resembles patterns it has seen before, even when edit_file would be safer and more targeted.

Well-designed tools have descriptions that:

  • State the purpose without ambiguity
  • Describe the expected input format with concrete examples
  • Specify what the output will look like under both success and failure conditions
  • Note failure modes explicitly and suggest corrective actions
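A tool definition that follows all four points might look like this. The schema shape is illustrative, loosely in the JSON Schema style many providers use, not any specific vendor’s exact format.

```python
# Hypothetical tool definition following the checklist above.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file and return its full contents. "            # purpose
        "Input: a path relative to the project root, e.g. 'src/app.py'. "  # input format
        "Output on success: the file contents as plain text. "             # success output
        "Output on failure: a line starting with 'ERROR:', e.g. "          # failure output
        "'ERROR: file not found: src/app.py'. If you see a not-found "
        "error, list the directory to check the actual path before "       # corrective action
        "retrying."
    ),
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
```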

The last point matters for error recovery. If a tool’s description mentions that it can return a “file not found” error and suggests what to do in that case, the model is substantially more likely to handle that error correctly than if it encounters the error cold. The model doesn’t have access to runtime error handling logic the way a program does; its error recovery is shaped by what it was told to expect.

This is one of the less obvious ways that tool design affects agent behavior. A tool that silently returns an empty list on failure will cause different downstream behavior than one that returns a structured error message. Both are valid designs for a library consumed by a human programmer; for an LLM, the structured error message is almost always superior because it gives the model something to reason about rather than an absence to notice.
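The difference can be made concrete with a search tool. A minimal sketch, assuming a plain substring search over text files; the point is the final branch, which returns a structured message instead of an empty result.

```python
import os

def search_code(pattern, root="."):
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    for lineno, line in enumerate(f, 1):
                        if pattern in line:
                            matches.append(f"{path}:{lineno}: {line.rstrip()}")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    if not matches:
        # Structured failure message instead of a silent empty list: it gives
        # the model something to reason about and a suggested next move.
        return (f"No matches for {pattern!r} under {root}. "
                "Try a shorter pattern or a different directory.")
    return "\n".join(matches)
```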

Debugging Agent Runs

Debugging a coding agent run differs from debugging a program. There’s no debugger to attach. You can’t set a breakpoint after the third tool call and inspect live state, though you can log the full context at each step.

The useful artifact for debugging is the message trace: the full sequence of model decisions and tool results, ordered chronologically. Every failure mode appears in the trace as either a bad tool call (wrong tool, wrong arguments, wrong timing) or a bad interpretation of a correct tool result. Most failures fall into a small number of categories:

  • The model misread a tool output and drew the wrong conclusion
  • The context grew large enough that a key earlier observation was no longer influencing decisions
  • The model entered a retry loop because an error message wasn’t informative enough to suggest a different approach
  • The model completed a subtask correctly but lost track of the overall goal
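Some of these categories can be detected mechanically from the trace. For instance, the retry-loop category shows up as identical consecutive tool calls. A small detector over an assumed trace format (one dict per tool call, with `tool` and `arguments` keys) might look like:

```python
def find_retry_loops(trace, min_repeats=3):
    # Scan a message trace for runs of identical consecutive tool calls:
    # the signature of a model retrying the same command against an
    # uninformative error (failure category three above).
    loops, streak = [], 1
    for prev, cur in zip(trace, trace[1:]):
        same = (prev["tool"], prev["arguments"]) == (cur["tool"], cur["arguments"])
        streak = streak + 1 if same else 1
        if streak == min_repeats:
            loops.append(cur["tool"])
    return loops
```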

The third category is particularly common with shell operations. If bash returns a non-zero exit code with a terse error like ENOENT, the model may retry the exact same command multiple times before giving up or trying something different. Making error messages directive, “file not found: check that the path is relative to the project root,” is not just user experience polish; it directly affects agent reliability by giving the model a path forward rather than a dead end.
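In code, making an error directive is often just a matter of catching the terse exception at the tool boundary and rewriting it. A sketch (the wording of the message is an illustrative choice, not a prescribed format):

```python
def run_read(path):
    # Translate a terse OS-level error into a directive message that
    # suggests a concrete next step, rather than surfacing ENOENT verbatim.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return (f"ERROR: file not found: {path}. Paths are relative to the "
                "project root; list the directory to locate the file "
                "before retrying.")
```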

What This Architecture Is Optimized For

Coding agents work well on tasks that decompose into discrete, verifiable steps where each step’s output can be evaluated before proceeding. Reading a file, running a test, checking a diff: these produce concrete outputs. The model can observe the output, determine whether it matches expectations, and adjust.

They work less well on tasks that require sustained global reasoning over large amounts of context, or tasks where intermediate steps are hard to verify independently. Refactoring a large codebase across fifty files is harder not just because it’s more work, but because the model has to maintain a coherent picture of the refactoring goal while navigating a context window filling with per-file observations. Each file edit is locally correct; the global consistency degrades as context pressure increases.

This suggests a practical design heuristic: tasks given to a coding agent should be structured so that the global goal can be restated compactly and the intermediate verifications are cheap. A well-framed task keeps the relevant context small, gives the model concrete intermediate checkpoints, and surfaces failures early in the run rather than at the end when the context window is full and the damage is done.
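That heuristic can be expressed as a harness shape: re-anchor the compact goal before each step and verify each checkpoint as soon as it can be checked, so failures surface mid-run rather than at the end. The `agent_step` callable and checkpoint format are assumptions for illustration.

```python
def run_with_checkpoints(agent_step, goal, checkpoints):
    # Restate the goal before every step so it stays recent in context,
    # and verify each checkpoint immediately, surfacing failures early.
    for check in checkpoints:
        agent_step(f"Goal: {goal}\nNext checkpoint: {check['description']}")
        if not check["verify"]():
            return f"failed at checkpoint: {check['description']}"
    return "all checkpoints passed"
```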

The architecture, as Willison’s framing makes clear, is fundamentally a loop. The quality of an agent run depends on how well each iteration of that loop moves toward the goal. The model decides; the tools execute; the context grows. What you control as an engineer is the quality of the tools, the structure of what gets put into context, and the information density of every tool result. That’s the actual surface worth optimizing, and it’s more tractable than it looks once you see the loop clearly.
