Subagent Invocation Is Distributed RPC, and Frameworks Are Pretending Otherwise
Source: simonwillison
When you spawn a subagent in a multi-agent system, you are making an RPC call. Not metaphorically. The orchestrating model emits a tool call, the scaffolding builds a new API request, sends it to the same or a different model endpoint, and waits for a response. That response is returned as a tool result in the orchestrator’s conversation context. This is structurally identical to a microservice calling a downstream API: issue a request, wait, handle the result.
Simon Willison’s guide on agentic engineering patterns covers subagents from the perspective of when to use them and how to scope their work. That framing is correct as far as it goes. What the agentic engineering literature generally does not address is what happens when the system around the subagent fails mid-execution, and what the orchestrator is supposed to do about it. These are distributed systems questions, and agent frameworks have not built the primitives to answer them.
What Spawning a Subagent Actually Does
In the Anthropic Python SDK, there is no special “spawn subagent” primitive. What you write is a tool handler that makes another client.messages.create() call with a fresh context. If the subagent needs tools, you run the agentic loop inside the handler, processing tool calls until the model returns a final answer.
import anthropic

client = anthropic.Anthropic()

def run_subagent(task: str, tools: list, system: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=8096,
            system=system,
            messages=messages,
            tools=tools,
        )
        if response.stop_reason == "end_turn":
            return response.content[-1].text
        messages = handle_tool_calls(response, messages, tools)
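The handle_tool_calls helper is left undefined above. A minimal sketch of what it has to do, assuming a local tool_registry dict mapping tool names to Python callables (the registry is hypothetical, not part of the SDK; the message shapes follow the Messages API's tool_use / tool_result convention):

```python
# Hypothetical local registry mapping tool names to Python callables.
tool_registry = {"add": lambda a, b: a + b}

def handle_tool_calls(response, messages, tools):
    # Record the assistant turn that contains the tool_use blocks, so the
    # model sees its own tool calls on the next iteration.
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            # Dispatch to a local handler; the `tools` schemas passed to the
            # API are ignored here because dispatch happens via the registry.
            output = tool_registry[block.name](**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            })
    # Tool results go back to the model as a user turn.
    messages.append({"role": "user", "content": results})
    return messages
```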
The subagent’s entire working context (its file reads, tool calls, and intermediate reasoning) is contained in that nested call stack. The orchestrator sees only the return value of run_subagent. Context isolation is the advertised benefit, and it is real. But notice what is also happening: the orchestrator is blocked waiting on an external call that takes an indeterminate amount of time, may fail at any point, and may have already performed side effects before it fails.
That is the definition of a non-idempotent distributed operation.
The Failure Modes
In distributed systems, the fundamental problem with RPC is not latency. It is partial failure. Your service calls downstream. The network drops. You do not know whether the downstream service received and executed the request, or whether it never arrived. This ambiguity drives most of the complexity in distributed system design.
Subagent orchestration has exactly this problem, and additionally has no established vocabulary for discussing it.
Consider a subagent tasked with implementing a user authentication module and opening a pull request when done. The subagent does its work: writes the files, runs the tests, calls a create_pull_request tool. The pull request is created. Then, before the subagent can return its final message, it hits a context limit or a tool error and cannot produce a clean exit. The orchestrator receives either an error result or a malformed response.
The orchestrator now faces the RPC dilemma: retry or not? If it retries, the subagent runs again, determines the authentication module needs implementing, writes similar or identical files, and tries to create another pull request. Depending on your tooling, it either creates a duplicate PR, gets an error about a conflicting branch, or silently overwrites the previous one. If it does not retry, the orchestrator cannot tell whether the task succeeded.
Neither answer is right, because the framework has not given the orchestrator the information it needs to make the decision.
Idempotency by Action Type
The severity of this problem varies significantly by what the subagent actually does. It helps to think about subagent actions the same way distributed systems engineers think about HTTP operations.
File writes are generally safe on retry. Writing the same content to the same path produces the same result. Overwriting with slightly different content because the model’s output is not deterministic is a real risk but usually a recoverable one.
Read operations are trivially safe, and calls to external read-only APIs are similarly so.
External write operations are the dangerous category. Creating a GitHub pull request, sending a Slack notification, inserting a database record, calling a payment API: these actions are not idempotent by default. Stripe’s solution for payments is idempotency keys, a client-provided token that makes retries safe. Most agent tool implementations do not have an equivalent.
Git commits are an interesting middle case. Running git commit twice produces two commits. The second commit might contain no changes if the first committed everything, in which case git fails cleanly. Or the subagent regenerated slightly different code, producing confusing history. Checking git state before committing prevents this, but most agent-scaffolded scripts do not bother.
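The state check described above is cheap to implement. A sketch of a commit guard that inspects git status --porcelain and only commits when the working tree actually has changes, so a retry after a successful commit becomes a no-op (the function name and the stage-everything policy are illustrative assumptions):

```python
import subprocess

def commit_if_changes(message: str) -> bool:
    """Commit only when the working tree has something to record."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not status.strip():
        # Clean tree: a retry after a successful commit lands here,
        # instead of producing an empty commit or a git error.
        return False
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    return True
```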
What Distributed Systems Actually Do
The problem is not new. The distributed systems community has addressed it through several patterns, each with different trade-offs.
Idempotency keys are the simplest approach. Before calling a downstream operation, generate a unique key. Pass it with the request. The downstream service deduplicates based on the key, executing the action once regardless of how many times it receives the request. This requires cooperation from every downstream API the subagent touches, which in agentic contexts often means tools you wrote yourself.
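For tools you wrote yourself, the deduplication side is small. A minimal in-process sketch, with the key derived deterministically from the tool call so that a retried identical call collides with the original (the in-memory store stands in for what would need to be persistent in real use; all names here are hypothetical):

```python
import hashlib
import json

# In-memory dedup store; a real deployment would persist this across restarts.
_executed: dict = {}

def derive_key(tool_name: str, tool_input: dict) -> str:
    """Deterministic key from the tool call itself, so retries collide."""
    payload = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def idempotent_call(key: str, action, *args, **kwargs):
    """Run `action` at most once per key, replaying the stored result on retry."""
    if key in _executed:
        return _executed[key]
    result = action(*args, **kwargs)
    _executed[key] = result
    return result
```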
The saga pattern addresses long-running distributed transactions through compensating actions. Each step in a process registers a rollback action. If the process fails partway through, the system executes rollbacks in reverse order. Porting this to subagent orchestration would mean: the subagent records a compensation function alongside each side-effectful action, and the orchestrator can call these compensations if the subagent’s final result is unacceptable. Frameworks like LangGraph include checkpointing mechanisms that partially support this, though the compensation design is left entirely to the developer.
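The bookkeeping a saga needs is modest. A sketch of the orchestrator-side structure, assuming the subagent's tool layer registers a compensation alongside each side effect (the Saga class and its method names are illustrative, not any framework's API):

```python
class Saga:
    """Record a compensation per side effect; roll back in reverse on failure."""

    def __init__(self):
        self._compensations = []

    def perform(self, action, compensation):
        result = action()
        # Register the undo only after the action has actually happened.
        self._compensations.append(compensation)
        return result

    def rollback(self):
        # Undo in reverse order, mirroring classic saga semantics.
        for compensation in reversed(self._compensations):
            compensation()
        self._compensations.clear()
```

If the subagent's final result is unacceptable, the orchestrator calls rollback() to close the PR, delete the branch, and so on, in the opposite order they were created.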
Two-phase commit is too costly for most agent scenarios. The coordination overhead would dwarf the cost of the work itself.
The most practical pattern for agent orchestration today is a variation on write-ahead logging: the subagent records a progress checkpoint before each side-effectful action, and on retry the orchestrator checks this log to determine what has already been done. This is essentially what Anthropic’s guidance on long-running agents gestures at through checkpoint-based state management, without using that distributed systems vocabulary explicitly.
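A sketch of that progress log, using an intent record written before each side effect and a done record written after, so a retry can skip completed work and flag the ambiguous case where intent exists but completion was never recorded (the JSONL format, class, and verify hook are assumptions, not any framework's API):

```python
import json
from pathlib import Path

class ProgressLog:
    """Append-only intent/done log so retries can skip completed side effects."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.records = set()
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                rec = json.loads(line)
                self.records.add((rec["phase"], rec["step"]))

    def _append(self, phase: str, step: str):
        with self.path.open("a") as f:
            f.write(json.dumps({"phase": phase, "step": step}) + "\n")
        self.records.add((phase, step))

    def run_once(self, step: str, action, verify=None):
        if ("done", step) in self.records:
            return  # completed in a previous attempt; skip on retry
        if ("intent", step) in self.records and verify is not None and verify():
            # The effect happened but the done record was never written:
            # this is exactly the partial-failure window, resolved here by
            # out-of-band verification rather than blind re-execution.
            self._append("done", step)
            return
        self._append("intent", step)  # checkpoint before the side effect
        action()
        self._append("done", step)
```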
Designing Around the Problem Today
Until frameworks build explicit retry semantics, you can reduce the blast radius of partial failures through subagent design choices.
Return a structured completion manifest. Instead of having subagents return natural language summaries, require them to return a structured list of what they accomplished: files modified, external actions taken, verification results. An orchestrator can parse this to detect partial completion and determine whether a retry is safe.
# Better: subagent returns structured result
{
  "files_modified": ["src/auth.py", "tests/test_auth.py"],
  "external_actions": [{"type": "pull_request", "url": "https://...", "number": 42}],
  "tests_passed": true,
  "verification_output": "All 23 tests passed"
}
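With a manifest in that shape, the orchestrator's retry decision becomes a mechanical check. A sketch of one conservative policy, using the field names from the example above (the policy itself is an assumption: no manifest means the subagent died before reporting, so side effects cannot be ruled out):

```python
def retry_is_safe(manifest) -> bool:
    """A retry is safe only if no non-idempotent external action occurred."""
    if manifest is None:
        # No manifest at all: the subagent failed before reporting anything,
        # so external side effects cannot be ruled out. Escalate or verify
        # out-of-band instead of blindly retrying.
        return False
    # File writes are treated as retriable; external writes are not.
    return len(manifest.get("external_actions", [])) == 0
```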
Separate pure from effectful tasks. A subagent that only reads code and produces a diff is safely retriable. A subagent that reads, writes, and pushes is not. Structuring work so that effectful actions happen in narrow, well-scoped subagent calls makes failures cheaper and recovery clearer.
Use programmatic verification as a proxy for receipt confirmation. If a subagent created a PR, the orchestrator can verify it exists before accepting the result as successful. This does not prevent duplicate actions, but it does prevent the orchestrator from treating partial execution as success and moving on.
Design tool implementations with idempotency in mind. A create_or_update_pr tool that searches for an existing PR on the same branch before creating a new one is trivially safe to retry. The tool layer is the right place to absorb this complexity, because it is the point of contact between model outputs and external systems.
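The lookup-before-create shape is the whole trick. A sketch of such a tool against a hypothetical GitHub client (gh and its list_pulls / update_pull / create_pull methods are assumptions; any real client library will differ in detail, but the structure carries over):

```python
def create_or_update_pr(gh, repo: str, branch: str, title: str, body: str) -> dict:
    """Idempotent PR tool: reuse the open PR for this branch if one exists.

    `gh` is a hypothetical GitHub API client; the method names here are
    illustrative, not a real library's API.
    """
    existing = gh.list_pulls(repo, head=branch, state="open")
    if existing:
        # Retry path: refresh the existing PR instead of creating a duplicate.
        pr = existing[0]
        gh.update_pull(repo, pr["number"], title=title, body=body)
        return pr
    return gh.create_pull(repo, head=branch, title=title, body=body)
```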
The Vocabulary Gap
What stands out reading through the agentic engineering literature is how consistently it reaches for software engineering analogies (subagents as employees, tasks as sprints) rather than distributed systems analogies. The distributed systems analogies are more precise for the failure cases that actually matter. An employee who loses track of a task is a management problem. An RPC call that fails after mutating state is a correctness problem with a well-studied solution space.
Frameworks are starting to converge on some of these solutions. LangGraph’s checkpointing, Anthropic’s guidance on structured agent handoffs, the tooling in CrewAI for defining agent roles and expected outputs: these are pieces of a coherent failure model that no single framework has assembled completely yet.
The useful reframe is: every subagent invocation that involves side effects should be designed as if you are implementing a distributed transaction. You need to know which actions have occurred, which have not, and what you will do if the process stops in the middle. A highly capable model does not change this. Capability does not eliminate partial failure; it just makes partial failure look more convincing.