
When Coding Agents Spawn More Coding Agents

Source: simonwillison

The tool loop at the center of every coding agent is well-documented at this point. The model emits a tool call, the scaffolding executes it, the result returns as context, the model decides what to do next. Simon Willison’s guide on how coding agents work covers this loop in useful detail. What the loop-centric view underrepresents is what happens when one agent running one loop is not sufficient: when context fills before the work finishes, when a task has independent subtasks that could run in parallel, or when isolation between sub-tasks is worth the coordination overhead.

These are multi-agent problems, and the design space is more nuanced than “add more agents.”

The Single-Agent Ceiling

A coding agent working in a single context window hits two concrete limits that define when the single-loop model breaks down.

Context exhaustion is the first. Claude 3.7 Sonnet’s 200k-token context window sounds generous until you run a moderately complex debugging session: read the failing test file, run the test suite, read the stack trace, read two files it references, look up the relevant function in a third file, read the test again, make an edit, run tests again. That sequence can consume 50,000 to 100,000 tokens before the problem is resolved. A task touching twenty files, with intermediate test runs and multiple rounds of error reading, can fill the window before completion. The agent either receives truncated context and loses track of earlier observations, or hits a hard error and stops.
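The arithmetic is easy to sketch. The per-step token costs below are illustrative assumptions, not measured values, but they show how quickly a routine debugging sequence eats into a 200k window:

```python
# Back-of-envelope estimate of context consumption for one debugging
# session. Per-step costs are illustrative assumptions only.
STEP_COSTS = {
    "read_test_file": 3_000,
    "run_test_suite": 5_000,
    "read_stack_trace": 2_000,
    "read_source_file": 4_000,
    "make_edit": 1_500,
}

def session_cost(steps: list[str]) -> int:
    """Sum the assumed token cost of each step in a session."""
    return sum(STEP_COSTS[s] for s in steps)

# The sequence from the text: read test, run suite, read trace,
# read three referenced files, reread test, edit, rerun suite.
debug_session = (
    ["read_test_file", "run_test_suite", "read_stack_trace"]
    + ["read_source_file"] * 3
    + ["read_test_file", "make_edit", "run_test_suite"]
)

CONTEXT_WINDOW = 200_000
used = session_cost(debug_session)
# How many such rounds fit before the window is full.
rounds_until_full = CONTEXT_WINDOW // used
```

Under these assumptions a single debugging round costs around 30k tokens, so a handful of rounds exhausts the window even before accounting for system prompts and tool schemas.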

Serial execution is the second. A single agent works sequentially: read, decide, call tool, receive result, repeat. For structurally parallel tasks, such as writing tests for ten independent modules or updating documentation across ten separate service directories, the sequential loop is the bottleneck: ten iterations in a row, where ten parallel workers could finish the same work in roughly the time of one.
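The gap is the familiar serial-versus-parallel one. A minimal sketch, using a sleep as a stand-in for independent per-module work:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def update_module(name: str) -> str:
    """Stand-in for one independent subtask (e.g. writing tests for
    one module). The sleep simulates the work taking wall-clock time."""
    time.sleep(0.1)
    return f"{name}: done"

modules = [f"module_{i}" for i in range(10)]

# Serial: ten iterations back to back, roughly 10 x 0.1 s.
start = time.perf_counter()
serial = [update_module(m) for m in modules]
serial_elapsed = time.perf_counter() - start

# Parallel: ten workers, roughly the time of one iteration.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(update_module, modules))
parallel_elapsed = time.perf_counter() - start
```

The same results come back either way; only the wall-clock profile differs, which is exactly the property that makes these tasks candidates for fan-out.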

Subagents as the Structural Answer

Claude Code’s Task tool is the clearest production implementation of the multi-agent response to these limits. The tool takes a task description and an optional prompt, spawns a new agent instance in a separate context window, and returns the subagent’s final output as a tool result in the parent’s context. All of the subagent’s intermediate work (file reads, test runs, edits, and reasoning steps) happens in isolation; none of it consumes the parent’s context budget.

A parent agent managing a large refactor can distribute work like this:

Parent agent:
  - Read architecture overview to understand module boundaries
  - Spawn Task: "Refactor the auth module to use the new token format.
    Run auth tests to verify. Return the list of changed files."
  - Spawn Task: "Update the user service integration tests for the new
    token format. Run them to confirm they pass. Return test output."
  - Spawn Task: "Update the API gateway validation middleware.
    Run the gateway tests. Return the diff."
  - Collect results, run full integration test suite
  - Handle any cross-module conflicts that surface

Each subagent works within its own context, completes its task, and the parent receives a compact summary. The parent’s context window tracks the overall picture without containing the detail of each subtask’s execution.
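The steps above can be sketched in code. `spawn_task` here is a hypothetical stub standing in for Claude Code's Task tool; a real call would run a subagent in its own context window, but the stub makes the context accounting visible:

```python
# Hypothetical sketch of the delegation pattern. `spawn_task` stands in
# for a real Task-tool call; it is a stub, not Claude Code's actual API.
def spawn_task(description: str) -> str:
    # The subagent's full transcript (reads, edits, test runs) stays in
    # its own context and is discarded; only the summary returns.
    transcript = ["read file", "edit", "run tests"] * 20  # never reaches the parent
    return f"done ({len(transcript)} hidden steps): {description}"

# The parent's context holds only the overview plus one compact summary
# per subtask, not any subagent's working detail.
parent_context = ["architecture overview"]
for task in [
    "Refactor the auth module to use the new token format.",
    "Update the user service integration tests.",
    "Update the API gateway validation middleware.",
]:
    parent_context.append(spawn_task(task))
```

The point of the sketch is the asymmetry: sixty internal steps per subtask, one line per subtask in the parent.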

This mirrors how a senior engineer delegates: assign a scoped task with clear success criteria, receive a result, verify it against those criteria. Claude Code currently runs subagents sequentially rather than concurrently by default, but the critical benefit is context isolation regardless of execution order. The parent stays oriented while each subagent does deep work.

The Principal-Agent Problem, Applied

Multi-agent coordination introduces a version of the principal-agent problem: the parent (principal) delegates to subagents (agents) with limited visibility into how they worked. The parent assumes the subagent fulfilled the task, but it cannot inspect intermediate steps without reading the full subagent output log.

This creates a specific failure mode: the subagent can report completion while having misunderstood the scope. A subagent asked to “update the auth module for the new token format” might update the authentication flow and miss the authorization checks that live in the same module. If the parent does not run tests covering both paths, the gap goes undetected.

The mitigation lies in how subagent tasks are specified. Effective subagent prompts include explicit success criteria, typically commands to run and outputs to verify; narrow scope definitions that specify which files or directories are in bounds; and explicit prohibitions that prevent the subagent from modifying shared infrastructure that other subagents depend on. This is the same discipline that makes good unit tests and good code review requests work. Ambiguity in the specification produces ambiguity in the execution, and with subagents there is no real-time course correction available once the task is running.
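One way to enforce that discipline is to make the contract a data structure rather than free-form prose. `SubagentTask` below is a hypothetical spec for illustration, not part of any real API; the file paths and commands are invented:

```python
from dataclasses import dataclass, field

@dataclass
class SubagentTask:
    """Hypothetical delegation contract: scope, success criteria, and
    prohibitions made explicit before the task is handed off."""
    description: str
    success_commands: list[str]   # commands to run and verify
    in_scope: list[str]           # files/directories the subagent may touch
    prohibited: list[str] = field(default_factory=list)  # shared infra, off-limits

    def to_prompt(self) -> str:
        lines = [
            self.description,
            "Success criteria: run " + "; ".join(self.success_commands),
            "Only modify: " + ", ".join(self.in_scope),
        ]
        if self.prohibited:
            lines.append("Do NOT modify: " + ", ".join(self.prohibited))
        return "\n".join(lines)

task = SubagentTask(
    description="Update the auth module for the new token format.",
    success_commands=["pytest tests/auth/ -x"],
    in_scope=["src/auth/"],
    prohibited=["src/shared/tokens.py"],
)
prompt = task.to_prompt()
```

Whether the spec lives in a dataclass or a prompt template matters less than that every field is filled in before the subagent starts, since there is no mid-task course correction.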

When Isolation Is the Point

Context isolation in subagents has a benefit that is easy to overlook. A subagent that goes off track, making incorrect edits or spinning through repeated failed attempts, does so in its own context window. The parent receives the final output and runs verification steps before incorporating it. If verification fails, the subagent’s wrong turns are discarded without affecting the parent’s working state.

This is the coding agent equivalent of running risky work in a child process: the blast radius of a failure is bounded. For tasks with programmatically verifiable outcomes (tests passing, a diff matching expected structure, a command returning specific output), this isolation is a genuine architectural advantage. The parent can be conservative about accepting subagent results: run the full test suite, inspect the diff for unexpected side effects. A subagent that produced a wrong result costs one failed verification cycle. Without isolation, that same wrong result embeds in the parent’s context and can influence subsequent decisions in ways that are difficult to undo.

The pattern is most reliable when the deliverable is concrete. “Write a test file for functions A, B, and C, and confirm all tests pass” is a well-specified subagent task because success is unambiguous. “Improve the error handling in this module” is harder to verify programmatically, which makes the trust problem more significant.

Where It Breaks Down

Multi-agent designs trade coordination overhead for isolation and potential parallelism. The coordination cost is real.

When subagent tasks are not truly independent, ordering becomes a dependency. A subagent updating a shared type definition must complete before subagents that use that type can start. Naive parallelization of dependent edits produces incompatible changes to the same code (the agent equivalent of a Git merge conflict) that the parent has to reconcile after the fact.
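Python's standard-library `graphlib` can compute which tasks are safe to dispatch together. The dependency map below is a hypothetical version of the refactor described earlier, with the shared type change as the prerequisite:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: the shared type definition must land
# before any subtask that consumes it can start.
deps = {
    "update shared token type": set(),
    "refactor auth module": {"update shared token type"},
    "update user service tests": {"update shared token type"},
    "update gateway middleware": {"update shared token type"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks safe to run in parallel now
    waves.append(ready)
    ts.done(*ready)
```

Each wave is a set of subagents the parent can spawn concurrently; the next wave unblocks only when the previous one completes and is verified.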

Token expenditure also changes shape. Even when total token consumption is comparable to a single-agent run, peak rate is higher when subagents run concurrently. This affects API rate limits and billing in ways that sequential single-agent runs do not. For teams with tight token budgets or rate-limited API access, the multi-agent pattern can hit infrastructure constraints that the single-agent pattern avoids.
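A concurrency cap is the usual mitigation for the peak-rate problem. A sketch using `asyncio.Semaphore`, with a sleep standing in for the hypothetical subagent API call:

```python
import asyncio

async def call_subagent(task: str, sem: asyncio.Semaphore) -> str:
    """Hypothetical subagent call. The semaphore caps how many calls are
    in flight at once, bounding peak token rate against API limits."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the actual API call
        return f"done: {task}"

async def run_all(tasks: list[str], max_concurrent: int = 2) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(call_subagent(t, sem) for t in tasks))

results = asyncio.run(run_all([f"task_{i}" for i in range(6)]))
```

With a cap of two, six tasks run in three waves: total work is unchanged, but the peak request rate is a third of the fully parallel version.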

The debugging story changes too. A single-agent run has a linear execution history. A multi-agent run has a tree of histories, one per subagent plus the parent, and tracing a failure requires reconstructing the relevant subagent’s context separately. This is manageable but it is more cognitive overhead than reading a flat conversation log. In practice, this means that when something goes subtly wrong in a multi-agent run, the failure investigation is harder than in a single-agent run, even if the task itself was completed faster.

The Right Granularity

The tasks that benefit most from multi-agent patterns share a profile: large enough to risk context exhaustion in a single agent, divisible into subtasks with clear boundaries and programmatic success criteria, and independent enough that ordering does not become a blocking dependency chain.

Large refactors across many modules, parallel test generation for independent components, documentation updates following a repeatable template across many services: these fit. Small iterative tasks where each observation informs the next decision (debugging a specific test failure, implementing a feature with a tight read-edit-verify loop) are better handled in a single continuous context where the model retains the full picture of what it has seen and done.

The single-agent tool loop is well-understood. The coordination layer above it is where the design space is still being developed. Getting task granularity right, keeping success criteria explicit, and defining scope narrowly enough to prevent overlap are the skills that determine whether multi-agent patterns extend capability or add complexity without proportional benefit. The loop is the easy part. Knowing when to multiply it is the harder question.
