
The Context Economy of Subagent Calls

Source: simonwillison

The context economy of subagent calls is a real engineering concern that most multi-agent demos obscure. When Simon Willison’s guide on agentic engineering patterns describes subagents as a core building block, it covers the what and the why: decompose a large task, delegate pieces, run independent pieces in parallel. The how of context design at the agent boundary gets less attention, and it matters more than it looks.

This is the claim: the interface you design between an orchestrator and its subagents, specifically what context you include in each subagent call and what structure you require back, determines whether your multi-agent system is efficient or expensive, debuggable or opaque, robust or brittle. It is an API design problem, and it should be treated like one.

The Naive Approach and Its Costs

In early prototypes, the typical approach is to send the subagent everything the orchestrator knows. The full conversation history, the original user request, all prior tool results, the task at hand. This is easy to implement because you are just splicing the orchestrator’s messages into the subagent’s context. And it works: the subagent has all the information it could possibly need.

The problem is the word “possibly.” A subagent tasked with writing a summary of three documents does not need to know that the orchestrator earlier tried a different file path and failed. A subagent doing a web search does not need the 4,000-token output from a code execution step that ran ten minutes before it was spawned. But if you pass the full context, it pays for all of that.

The Anthropic API prices per input token, and those costs compound when you add subagent layers. A single orchestrator call with a 20,000-token context spawning three subagents that each receive that same context means 60,000 input tokens before any subagent has done anything. If those subagents themselves spawn further agents, the multiplication continues. The original context weight propagates through every level.

This is not a theoretical concern. Production multi-agent pipelines where the orchestrator naively passes its full context to subagents frequently run costs per task that are several times higher than necessary, because most of the context at every subagent boundary is irrelevant to the subagent’s specific job.
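The compounding is straightforward to model. A quick sketch, using the 20,000-token figure above and treating fan-out and depth as parameters (the numbers are illustrative, not real pricing data):

```python
# Illustrative arithmetic: how orchestrator context weight compounds
# across subagent layers when every agent receives the full context.

def total_input_tokens(context_tokens: int, fanout: int, depth: int) -> int:
    """Input tokens paid if every agent at every level receives the
    full orchestrator context and spawns `fanout` children."""
    total = context_tokens  # the orchestrator's own call
    agents_at_level = 1
    for _ in range(depth):
        agents_at_level *= fanout
        total += agents_at_level * context_tokens
    return total

# One orchestrator with a 20,000-token context, three subagents, one level:
print(total_input_tokens(20_000, fanout=3, depth=1))  # 80000
# One more level of fan-out with the same context everywhere:
print(total_input_tokens(20_000, fanout=3, depth=2))  # 260000
```

The growth is geometric in depth, which is why the waste is invisible in a demo with one orchestrator and three subagents but dominant in a pipeline with nested delegation.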

What a Subagent Actually Needs

Reducing that waste requires thinking clearly about what a subagent needs to do its job. In most cases, that is three things.

First, a task specification: a precise description of what the subagent is supposed to produce. Not the full history of what led to this task. Not the orchestrator’s reasoning. Just: here is your job, here is the form of a correct output.

Second, relevant input data: the specific documents, strings, search results, or structured records the subagent needs to operate on. Not all the data the orchestrator has accumulated. Only the data for this task.

Third, an output schema: an explicit definition of what the subagent should return, either as a description in natural language or, better, as a forced tool call that structures the output.

Everything else is noise. Orchestrator reasoning, prior failed attempts, context from other subagents running in parallel, metadata about the user’s original intent: none of that should be in the subagent’s context unless it directly affects how the subagent should do its work. The test is simple: if the subagent cannot make a better decision with a piece of context than without it, do not send it.
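One way to make that contract concrete is to define the boundary as a typed structure the orchestrator must fill in; anything without a field simply cannot cross. A sketch, with illustrative field names:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class SubagentCall:
    """Everything a subagent receives; nothing else crosses the boundary.
    Field names are illustrative, not from any particular framework."""
    task_spec: str                 # what to produce, and what correct looks like
    input_data: list[str]          # only the data for this task
    output_schema: dict[str, Any]  # the structure the subagent must return

def build_call(task_spec: str, input_data: list[str],
               output_schema: dict[str, Any]) -> SubagentCall:
    # Deliberately no parameter for orchestrator history or reasoning:
    # if a piece of context is not one of the three fields, it is not sent.
    return SubagentCall(task_spec, input_data, output_schema)

call = build_call(
    "Summarize each document in under 100 words.",
    ["doc one text", "doc two text"],
    {"type": "object", "properties": {"summary": {"type": "string"}}},
)
```

The point of the dataclass is not the type checking itself but the constraint: there is no slot where the full conversation history could leak through.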

Structured Inputs Over Prose

One concrete improvement over passing prose context is sending structured inputs. Rather than writing a natural language briefing that the subagent has to parse, you pass a defined payload that makes the task and its parameters explicit.

import anthropic
import json

client = anthropic.Anthropic()

def run_summarization_subagent(
    documents: list[str],
    max_length: int,
    focus_topic: str,
) -> str:
    task_payload = {
        "task": "summarize",
        "documents": documents,
        "constraints": {
            "max_words": max_length,
            "focus": focus_topic,
        }
    }

    messages = [{
        "role": "user",
        "content": json.dumps(task_payload)
    }]

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system="You summarize documents. Input is JSON with keys: task, documents, constraints. Return only the summary text.",
        messages=messages,
    )
    return response.content[0].text

This is a narrower interface than “here is everything, figure it out.” It makes the inputs explicit and machine-readable. It also makes the context smaller, since you are passing a tight JSON payload rather than a paragraph of prose that embeds the same information more verbosely.

The system prompt here is also deliberately minimal. It tells the subagent its role and the format of its input. It does not include general knowledge about the project, organizational context, or reasons the summarization task exists. That information is irrelevant to execution.

Forcing Structured Outputs

The return direction is equally important. If a subagent returns freeform text, the orchestrator has to parse it. Natural language parsing is unreliable; the orchestrator may misread what the subagent produced, especially on edge cases, long outputs, or outputs that deviate slightly from the expected formatting.

The more robust pattern is to require subagents to return structured output via a tool call. In the Anthropic API, you can define a tool with no functional implementation and use tool_choice to force the model to call it with its answer. This gives you a typed, validated response that requires no parsing.

output_schema = {
    "name": "submit_summary",
    "description": "Submit the completed summary",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {
                "type": "string",
                "description": "The completed summary text"
            },
            "word_count": {
                "type": "integer",
                "description": "Number of words in the summary"
            },
            "key_points": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Three to five main points extracted"
            }
        },
        "required": ["summary", "word_count", "key_points"]
    }
}

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    system="Summarize the provided documents.",
    messages=messages,
    tools=[output_schema],
    tool_choice={"type": "tool", "name": "submit_summary"}
)

result = response.content[0].input
# result is a dict with guaranteed keys: summary, word_count, key_points

The orchestrator receives a dictionary, not a string to parse. The schema enforces structure at the API boundary. If the subagent cannot produce output that fits the schema, the API call fails clearly rather than returning ambiguously formatted text that the orchestrator misinterprets silently.

The Compression Decision

Sometimes a subagent genuinely needs substantial context. A code review subagent might need a large diff plus the surrounding file. A research subagent might need prior search results to avoid redundant queries. In these cases, the question is not whether to pass context but how to pass it efficiently.

Pre-compression before handoff is one option. Before spawning the subagent, the orchestrator makes a separate, cheap call to compress the relevant context into a shorter form. This adds latency but reduces subagent input size. The tradeoff is worth it when the subagent will be called repeatedly with similar context, or when the full context is substantially larger than its compressed version.
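The pre-compression option might look like the following sketch. The helper names are illustrative; `client` is an `anthropic.Anthropic()` instance constructed by the caller, and the break-even check uses rough token estimates:

```python
# Sketch of pre-compression before handoff (illustrative, not a
# prescribed pattern). `client` is an anthropic.Anthropic() instance.

def compress_context(client, full_context: str,
                     model: str = "claude-haiku-4-5-20251001") -> str:
    """One cheap call that compresses context before any subagent sees it."""
    response = client.messages.create(
        model=model,
        max_tokens=512,
        system=(
            "Compress the following context to only the facts a downstream "
            "agent needs. Return only the compressed text."
        ),
        messages=[{"role": "user", "content": full_context}],
    )
    return response.content[0].text

def worth_compressing(full_tokens: int, compressed_tokens: int,
                      reuse_count: int, compression_call_tokens: int) -> bool:
    # Rough break-even estimate: pay for the compression call once, then
    # save the token difference on every subsequent subagent call.
    saved = (full_tokens - compressed_tokens) * reuse_count
    return saved > compression_call_tokens
```

The break-even function makes the tradeoff in the paragraph above explicit: compression pays for itself quickly when the compressed context is reused, and not at all for a single small handoff.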

Another option is retrieval rather than inclusion. If the full context is large but only portions of it are relevant to a given subagent, giving the subagent a tool that lets it query the full context by topic or section is more efficient than including everything upfront. The subagent pays for what it actually reads, not what it might need.

The key question is whether the subagent needs the full context to start, or whether it can begin with a summary and retrieve specifics when it encounters them. Most tasks are the latter.
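A minimal sketch of the retrieval option, assuming the orchestrator's accumulated context can be split into named sections. The tool name and section scheme are illustrative:

```python
# Sketch: expose the orchestrator's context as a lookup tool rather than
# inlining it. Tool name and section keys are illustrative.

context_sections = {
    "search_results": "earlier web search output, stored verbatim",
    "code_output": "earlier code execution logs, stored verbatim",
}

lookup_tool = {
    "name": "lookup_context",
    "description": (
        "Fetch one section of the orchestrator's context by name. "
        "Available sections: " + ", ".join(context_sections)
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "section": {"type": "string",
                        "description": "Name of the section to fetch"},
        },
        "required": ["section"],
    },
}

def handle_lookup(section: str) -> str:
    # Executed by the orchestrator when the subagent calls lookup_context;
    # the subagent pays input tokens only for sections it actually reads.
    return context_sections.get(section, f"No section named {section!r}")
```

The subagent starts with the section names alone, a few dozen tokens, and the full context stays on the orchestrator's side until a specific piece of it is requested.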

Parallelism and Context Independence

One reason to design tight subagent interfaces is that tight interfaces enable safe parallelism. If each subagent call receives only the data it needs for its specific task, then subagents handling independent tasks can run concurrently without any risk of their inputs or outputs interfering.

import asyncio

async def run_parallel_subagents(
    tasks: list[dict],
) -> list[dict]:
    # run_subagent is the synchronous single-call wrapper for one task
    # (in the shape of run_summarization_subagent above); asyncio.to_thread
    # moves each blocking API call onto a worker thread so they overlap.
    return await asyncio.gather(
        *[asyncio.to_thread(run_subagent, t) for t in tasks]
    )

If instead each subagent received the full orchestrator context, parallel execution would still work mechanically, but you would be paying for N copies of that full context simultaneously. With scoped contexts, the cost per subagent is proportional to its actual information need, not the orchestrator’s total accumulated state.

Making the Boundary Observable

Well-designed subagent interfaces have a practical observability benefit. When every subagent call has a defined input schema and a defined output schema, you can log at those boundaries and reconstruct exactly what each subagent was asked to do and what it produced. That is much easier than instrumenting the interior of an agent run, where tool calls and model responses interleave in a single long conversation context.

A simple logging wrapper at the subagent invocation point gives you a complete audit trail: input payload, output payload, token counts in and out, latency, model used. From those logs you can measure cost per subagent type, catch unexpected input sizes, and identify subagents that are receiving context they do not appear to use.
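A sketch of that wrapper, using standard-library logging and a stand-in invoke function. The names and the character-count proxy for token counts are assumptions:

```python
import json
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("subagent")

def logged_subagent_call(
    agent_type: str,
    payload: dict[str, Any],
    invoke: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Wrap a subagent invocation so the boundary is fully observable.
    `invoke` is whatever actually calls the model (stand-in here)."""
    start = time.monotonic()
    result = invoke(payload)
    logger.info(json.dumps({
        "agent_type": agent_type,
        "input_chars": len(json.dumps(payload)),   # crude proxy for tokens;
        "output_chars": len(json.dumps(result)),   # use API usage counts in practice
        "latency_s": round(time.monotonic() - start, 3),
    }))
    return result

# Usage with a stand-in invoke function:
out = logged_subagent_call(
    "summarizer",
    {"task": "summarize", "documents": ["doc text"]},
    invoke=lambda p: {"summary": "stub output"},
)
```

Because both sides of the boundary are JSON-serializable, every log line is a complete, replayable record of one subagent call.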

This observability is only achievable if the interface is clean. A subagent that receives a prose blob with no defined structure and returns freeform text produces logs that require interpretation, not inspection. The interface design choice upstream determines whether debugging is mechanical or laborious.

The Design Work That Compounds Over Scale

The instinct in early development is to define subagent interfaces loosely and clean them up later. The subagents work, the prototype runs, tightening the interface feels like polish rather than substance. That calculus reverses under scale.

A pipeline running a few tasks per day can tolerate wasteful context passing. A pipeline running thousands of tasks pays for every redundant token on every call, and the cost of those loose interfaces accumulates in the billing dashboard. More importantly, loose interfaces make expansion harder. Adding a new subagent type into a system where every subagent gets the full orchestrator context is straightforward until you need to debug something unexpected or optimize a cost spike.

The framing in Simon’s agentic engineering patterns guide is right: subagents are a structural tool for capability scoping and parallelization. The interface between orchestrator and subagent is where that structure becomes either solid or soft. The context budget is not a performance detail; it is part of the architecture.
