
The Concurrency Model Every Coding Agent Has to Get Right

Source: simonwillison

When a coding agent reads a file, searches for a pattern, and runs the test suite at the same time, it is doing something more interesting than saving a few seconds. Claude 3.5 Sonnet and later models, alongside OpenAI’s function calling interface, support multiple tool calls in a single response. Simon Willison’s guide on coding agent architecture touches on this, but the mechanics and implications are worth examining more carefully, especially for anyone building custom scaffolding.

What Parallel Tool Calls Look Like on the Wire

The Anthropic Messages API surfaces parallel tool calls as a list of tool_use content blocks in a single response. A model investigating an authentication bug might return:

{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "text",
      "text": "Let me look at the auth module and check what tests exist."
    },
    {
      "type": "tool_use",
      "id": "toolu_01abc",
      "name": "read_file",
      "input": {"path": "/src/auth/middleware.py"}
    },
    {
      "type": "tool_use",
      "id": "toolu_02def",
      "name": "glob",
      "input": {"pattern": "**/test_auth*"}
    },
    {
      "type": "tool_use",
      "id": "toolu_03ghi",
      "name": "bash",
      "input": {"command": "python -m pytest tests/test_auth.py -x 2>&1 | tail -30"}
    }
  ]
}

The host application receives all three in one response, executes them concurrently, and returns all three results before the model’s next turn. Each result carries a tool_result block matched to its originating call by tool_use_id:

{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01abc",
      "content": "class AuthMiddleware:\n    def __init__(self, ...)..."
    },
    {
      "type": "tool_result",
      "tool_use_id": "toolu_02def",
      "content": "tests/test_auth.py\ntests/integration/test_auth_flow.py"
    },
    {
      "type": "tool_result",
      "tool_use_id": "toolu_03ghi",
      "content": "FAILED tests/test_auth.py::test_token_expiry - AssertionError: expected 401..."
    }
  ]
}

The model receives all three simultaneously on its next turn and reasons about them jointly.

The Latency Case

Sequential execution of those three calls would take T_read + T_glob + T_bash, where T_bash likely dominates if the test suite takes a few seconds. Parallel execution takes max(T_read, T_glob, T_bash). For I/O-bound operations like file reads and network requests, this compounds: agents regularly make 10 to 30 tool calls per task, and sequential execution can mean tens of seconds of mostly-waiting. Parallel batching of independent calls cuts this wall-clock time substantially.
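A toy sketch makes the arithmetic concrete, using asyncio.sleep as a stand-in for tool latency (the durations here are invented, not measurements):

```python
import asyncio
import time

# toy stand-ins for real tools; the point is that sequential wall-clock
# time is sum(durations) while parallel is max(durations)
async def fake_tool(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return name

async def compare() -> tuple[float, float]:
    calls = [("read_file", 0.05), ("glob", 0.10), ("bash", 0.40)]

    t0 = time.perf_counter()
    for name, s in calls:  # sequential: each call waits for the previous one
        await fake_tool(name, s)
    sequential = time.perf_counter() - t0

    t0 = time.perf_counter()
    await asyncio.gather(*(fake_tool(n, s) for n, s in calls))
    parallel = time.perf_counter() - t0  # bounded by the slowest call

    return sequential, parallel

sequential, parallel = asyncio.run(compare())
```

With these made-up durations, the sequential pass takes roughly 0.55 seconds and the parallel pass roughly 0.40, the duration of the slowest call.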

The Anthropic documentation notes that Claude “will sometimes request multiple tools to be called in parallel when it’s appropriate.” That “sometimes” is real: the model decides when operations are independent enough to parallelize, based on what it can infer from the task and from the tool descriptions. If the second call logically depends on the result of the first, the model sequences them. If they are independent, it batches them.

What Parallelism Reveals About Model Reasoning

The more interesting implication is that parallel tool calls reflect the model’s belief about independence. When it batches three reads together, it is asserting: knowing the content of file A does not change which information I need from file B or C. When it sequences calls, it is asserting: I need to see what the first call returns before I can decide what to do next.

This is observable in agent transcripts. An agent exploring an unfamiliar codebase tends to batch reads aggressively at the start of a task, then sequence more carefully as it converges on specific files to modify. A pattern of three parallel reads followed by an edit followed by a test run is a characteristic signature of a well-structured agent on a focused task.

Agents that do not use parallel calls well tend to explore sequentially: read one file, decide to read another based on what they found, then another. This is correct when each observation genuinely constrains the next, but often the reads are actually independent and the sequential execution is unnecessary latency with no reasoning benefit.

The Matching Problem and Partial Failure

The mechanical requirement is that each tool_result must include the tool_use_id of its corresponding tool_use call. Mismatching IDs is a silent failure: the model receives what it believes is the result of reading auth/middleware.py but is actually the output from the glob search. Subsequent reasoning based on that mislabeled result is corrupted in ways that may not surface until several steps later.
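One defensive pattern, sketched here and not specific to any SDK, is to key results by id rather than rely on list ordering, and to fail loudly when a result is missing instead of letting a mismatch through silently (the ToolUse dataclass is a minimal stand-in for the API's tool_use block):

```python
from dataclasses import dataclass

@dataclass
class ToolUse:
    # minimal stand-in for an Anthropic tool_use content block
    id: str
    name: str
    input: dict

def match_results(tool_calls, results_by_id):
    # build tool_result blocks keyed by id, failing loudly on a mismatch
    # instead of silently mislabeling a result
    missing = [c.id for c in tool_calls if c.id not in results_by_id]
    if missing:
        raise KeyError(f"no results for tool_use_ids: {missing}")
    return [
        {"type": "tool_result", "tool_use_id": c.id, "content": results_by_id[c.id]}
        for c in tool_calls
    ]
```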

Correct implementation requires care. The host receives a list of tool_use blocks, dispatches each, and maps results back by ID:

import asyncio

async def execute_tools_parallel(tool_calls, tool_executor):
    """Dispatch every tool_use block concurrently, then map each result
    back to a tool_result block by tool_use_id."""
    tasks = [
        tool_executor(call.name, call.input)
        for call in tool_calls
    ]
    # return_exceptions=True: one failed call must not abort the batch
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # zip preserves order, so each result pairs with its originating call
    return [
        {
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": str(result) if not isinstance(result, Exception)
                        else f"Error: {result}",
            "is_error": isinstance(result, Exception)
        }
        for call, result in zip(tool_calls, results)
    ]

The return_exceptions=True flag in asyncio.gather ensures that a failure in one tool call does not abort the others. The model receives the error as the tool_result for the failed call and has all the successful results available to reason about. Aborting the entire parallel batch on a single failure is worse: the model receives nothing and must restart the exploration from scratch.

The is_error flag is worth setting correctly. It changes how the model treats the result in subsequent reasoning, framing it as an observation about failure rather than an observation about the world. The distinction matters when the model is trying to decide whether to retry, escalate, or proceed with partial information.
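The partial-failure path is easy to exercise with a stub executor. This sketch repeats the dispatch function so it runs standalone; the stub executor and tool names are illustrative:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ToolUse:
    # minimal stand-in for an Anthropic tool_use content block
    id: str
    name: str
    input: dict = field(default_factory=dict)

async def execute_tools_parallel(tool_calls, tool_executor):
    # same logic as the function above, repeated so this snippet runs standalone
    results = await asyncio.gather(
        *(tool_executor(c.name, c.input) for c in tool_calls),
        return_exceptions=True,
    )
    return [
        {
            "type": "tool_result",
            "tool_use_id": c.id,
            "content": str(r) if not isinstance(r, Exception) else f"Error: {r}",
            "is_error": isinstance(r, Exception),
        }
        for c, r in zip(tool_calls, results)
    ]

async def stub_executor(name, args):
    # hypothetical executor: the bash call fails, the read succeeds
    if name == "bash":
        raise RuntimeError("command timed out")
    return f"contents for {name}"

calls = [ToolUse("toolu_01", "read_file"), ToolUse("toolu_02", "bash")]
results = asyncio.run(execute_tools_parallel(calls, stub_executor))
```

The read result comes back clean, the bash result comes back with is_error set, and both carry the right tool_use_id, which is exactly the shape the model needs to reason about partial information.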

Context Accumulation and Batching

Parallel tool calls have a second effect that is less obvious. Sequential reads accumulate context incrementally as the model sees each result and adjusts its next call accordingly. Parallel reads dump a batch of context into the history at once, without any intervening model reasoning. The model then has to reason about a larger set of information in one turn.

For three independent file reads totaling 15K tokens, sequential execution gives the model three chances to reason incrementally, each adding around 5K tokens. Parallel execution gives it one chance to reason over the full 15K. The parallel version is faster and uses fewer API calls, but the sequential version may produce better-informed exploration when each file genuinely informs which one to read next.

This is a real tradeoff, not an optimization to push uniformly. Exploration batching makes sense at the start of a task when the agent is building its initial model of an unfamiliar codebase. Targeted reads near the end of a task, when the agent is converging on a specific fix, may benefit from sequential execution where each observation informs the next.

Paul Gauthier, author of Aider, has documented in Aider's benchmark methodology that test-driven agentic workflows, where the agent runs tests after each change and responds to the specific failure output, consistently outperform write-only workflows on coding benchmarks. The test execution is the key sequential element: run the tests, observe the specific failure, decide what to read next based on that specific failure. Parallel speculation about what might be wrong before running the tests skips the grounded feedback that makes this approach reliable.

This pattern is also supported by failure analysis: research from the OpenEnv Calendar Gym benchmark, which tested agents on calendar management tasks, found that over half of agent failures came from malformed arguments or incorrect operation sequencing, not from selecting the wrong tool. The model knew which tool to call; it failed at forming valid inputs or ordering dependent calls correctly. Sequencing discipline, knowing when to batch and when to chain, is most of the difference.

Designing Around the Concurrency Model

For anyone building custom scaffolding on the Anthropic API, the concurrency model is an engineering input, not just a performance detail. Tool implementations that are safe to run concurrently (file reads, searches, directory listings, web fetches) can be dispatched in parallel without coordination. Tool implementations that modify shared state (file writes, database modifications) need the same care as any concurrent write operation.
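One way to get that care cheaply is a per-path lock, so concurrent writes to the same file serialize while everything else stays parallel. This is a sketch under the assumption of asyncio-based scaffolding; the lock table and tool name are illustrative:

```python
import asyncio
from collections import defaultdict

# hypothetical per-path locks: concurrent writes to the same file serialize,
# while reads and writes to other paths stay fully parallel
_path_locks = defaultdict(asyncio.Lock)

async def write_file(path: str, content: str) -> str:
    async with _path_locks[path]:
        # the write itself; a real tool would offload blocking I/O
        # (e.g. via asyncio.to_thread) and handle encoding and missing dirs
        with open(path, "w") as f:
            f.write(content)
    return f"wrote {len(content)} bytes to {path}"
```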

The model uses tool descriptions to decide whether to batch calls. Being explicit about concurrency safety in the description field changes model behavior:

{
  "name": "write_file",
  "description": "Write content to a file, replacing all existing content. Not safe to call concurrently with other write_file calls targeting the same path."
}

Claude Code’s tool design reflects this: Read, Glob, and Grep are naturally parallel-safe and the model batches them freely. The Edit tool, which uses old-string/new-string replacement, depends on the original file content being stable between the read step and the edit step. If two edits target the same file, sequential execution is safer, and the tool description communicates this constraint.
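The stability dependency is easy to see in a minimal edit-by-replacement sketch (illustrative only, not Claude Code's implementation): if a concurrent edit changes the file between read and edit, the old string may no longer match exactly once, and failing loudly beats editing the wrong spot.

```python
def edit_file(path: str, old: str, new: str) -> None:
    # old-string/new-string replacement, which only works if the file
    # content is unchanged since the agent last read it
    text = open(path).read()
    count = text.count(old)
    if count != 1:
        # a concurrent write may make old match zero times or several;
        # refusing to guess is the safe behavior
        raise ValueError(f"expected exactly one match for old string, found {count}")
    with open(path, "w") as f:
        f.write(text.replace(old, new, 1))
```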

A detail from Stanford and UC Berkeley research on long-context attention is relevant here too: information placed in the middle of long contexts receives less model attention than information near the beginning or end. When parallel tool results arrive as a batch, they all enter the context at the same position. If one result in a parallel batch of five is the critical piece of evidence, it has no positional advantage. In sequential execution, the critical result arrives at the end of the sequence, in a relatively high-attention position. For high-stakes exploration where one observation is likely to be the key finding, sequential positioning can matter.

The concurrency model is one place where scaffolding decisions have direct, measurable impact on agent performance. Getting the batching heuristics right (aggressive parallelism for independent exploration, careful sequencing for dependent operations) produces agents that complete tasks faster with fewer accumulated tokens. The loop is forty lines. The concurrency model is where the work is.
