OpenAI’s Codex recently gained support for subagents and custom agents, built on top of the OpenAI Agents SDK. The feature lets an orchestrating Codex instance spawn specialized subordinate agents, each running in its own context window, each defined by the user with custom instructions, model choices, and tool access.
The mechanism is straightforward. The Agents SDK exposes an .as_tool() method that wraps any Agent object into a callable tool for another agent:
```python
from agents import Agent, Runner

test_writer = Agent(
    name="test-writer",
    instructions="Write comprehensive pytest tests for the provided code.",
    model="gpt-4o",
)

orchestrator = Agent(
    name="orchestrator",
    instructions="Implement features and coordinate quality checks.",
    tools=[
        test_writer.as_tool(
            tool_name="write_tests",
            tool_description="Generate a pytest test file for a piece of code",
        )
    ],
)

result = Runner.run_sync(orchestrator, "Add a user registration endpoint...")
```
When the orchestrator calls write_tests, the SDK spins up a fresh test_writer invocation with its own context window. The parent sees only the tool’s return value; the subagent’s intermediate reasoning, file reads, and intermediate tool calls are invisible to it.
That invisibility is the design decision worth examining.
Context isolation is a contract, not a convenience
The clean boundary between parent and child context is not primarily about privacy or security, though it helps with both. It is about controlling what the orchestrator treats as ground truth.
In a monolithic agent, every file read, every intermediate hypothesis, every failed attempt accumulates in the context window. That accumulation is often useful, because the model’s next step benefits from seeing what it tried. But it creates a problem at scale: context windows are bounded, attention is not uniformly distributed across them, and a 40k-token history of incremental file exploration is worse input for decision-making than a 500-token structured summary. Research on lost-in-the-middle attention patterns has consistently shown that models perform worse when relevant information is buried in long contexts.
The subagent boundary forces that compression to happen explicitly. The orchestrator does not get the subagent’s working memory; it gets a result. If the result is insufficient, the orchestrator has to ask again with more explicit requirements. This is uncomfortable when you are used to monolithic agents, but it pushes task specification upstream to where it belongs.
The token economy problem
Naive subagent design creates a specific failure mode: context explosion at the handoff points.
If you pass the orchestrator’s full conversation history to each subagent as context, the costs multiply quickly. Three subagents receiving a 20k-token history means 60k input tokens before any subagent does work. This also tends to produce worse results, because the subagent now has to filter the orchestrator’s reasoning from the data it actually needs.
The correct pattern is to pass only three things: a precise task specification, the relevant input data, and an explicit output schema. Structured outputs via forced tool calls let you enforce the schema at the API level:
```python
from agents import Agent
from pydantic import BaseModel

class TestFileOutput(BaseModel):
    file_path: str
    content: str
    test_count: int
    coverage_targets: list[str]

test_writer = Agent(
    name="test-writer",
    instructions="Write comprehensive pytest tests. Return structured output.",
    model="gpt-4o",
    output_type=TestFileOutput,
)
```
This eliminates parsing errors at the orchestrator level and makes the interface between agents explicit rather than implied by prose instructions.
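On the orchestrator side, the same schema doubles as a validation gate. A minimal sketch of what that enforcement buys you, using plain pydantic outside the SDK (the sample values are illustrative):

```python
from pydantic import BaseModel, ValidationError

class TestFileOutput(BaseModel):
    file_path: str
    content: str
    test_count: int
    coverage_targets: list[str]

# A well-formed subagent result parses cleanly into a typed object.
ok = TestFileOutput.model_validate({
    "file_path": "tests/test_registration.py",
    "content": "def test_register(): ...",
    "test_count": 1,
    "coverage_targets": ["register"],
})

# A prose-shaped or partial reply fails loudly instead of silently
# propagating a malformed result into the orchestrator's next step.
try:
    TestFileOutput.model_validate({"file_path": "tests/test_registration.py"})
    missing = set()
except ValidationError as exc:
    missing = {err["loc"][0] for err in exc.errors()}
```

The failure surfaces at the boundary, where it is cheap to handle, rather than three delegation hops later.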
The description field is routing logic
Custom agents are registered with a name, a description, and an input schema. The orchestrator reads those descriptions at routing time and decides when to delegate. This is function calling, except the “function” is a multi-step reasoning process:
```json
{
  "name": "migration-writer",
  "description": "Generates and validates SQL migration scripts. Use when database schema changes are required.",
  "capabilities": ["file-read", "file-write", "shell-exec"],
  "input_schema": {
    "task": "string",
    "schema_context": "string"
  }
}
```
The description field is load-bearing in a way that code comments are not. A vague description causes mis-routing. Overlapping descriptions between two agents produce inconsistent routing decisions. A description that says what the agent does but not when to use it creates an agent that gets called in the wrong situations and produces locally coherent but contextually wrong results.
This mirrors the problem with OpenAPI spec descriptions in tool-calling systems. The model uses the description as a signal in its routing decision, so the description functions less like documentation and more like an interface definition language. The tool description is the API contract, and good agent tool design treats it that way.
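One way to make that contract checkable is a lint over registered descriptions. This is a hypothetical helper, not an SDK feature: it flags descriptions that say what the agent does but never when to call it.

```python
# Hypothetical lint (not part of the Agents SDK): flag agent descriptions
# that state capability but give the router no usage condition.
USAGE_CUES = ("use when", "use for", "call when", "only when", "prefer when")

def missing_usage_cue(description: str) -> bool:
    """True if the description never tells the orchestrator when to delegate."""
    lowered = description.lower()
    return not any(cue in lowered for cue in USAGE_CUES)

good = ("Generates and validates SQL migration scripts. "
        "Use when database schema changes are required.")
vague = "Generates SQL migration scripts."
```

A check like this runs at registration time, which is far cheaper than discovering mis-routing from production traces.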
Failure modes unique to multi-agent systems
Single-agent errors are usually visible: the agent says something wrong, the context shows its reasoning, and you can trace the mistake. Multi-agent errors are subtler.
Wrong delegation: Two agents with overlapping capabilities cause the orchestrator to route inconsistently. Neither agent throws an error; both return plausible output. The error is in the routing decision, which does not surface in any log.
Incomplete context handoff: The orchestrator summarizes rather than passing raw data, and the subagent works on an abstraction. The response looks correct and internally consistent but is factually wrong about the actual code the subagent never read.
Compounding plausible errors: Each agent in a chain produces locally coherent output. A subtle misunderstanding in one agent propagates silently because the next agent treats prior output as authoritative. The final result can be significantly wrong despite no individual agent making an obvious mistake.
The InjecAgent benchmark demonstrated the propagation problem concretely for prompt injection: isolation boundaries limit propagation but do not eliminate it if the orchestrator acts on compromised subagent output without verification. Full context isolation is a meaningful defense; it is not a complete one.
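Verification can be mechanical. A sketch of an orchestrator-side guard, with hypothetical names and assuming a test-writer output shaped like the earlier schema, that cross-checks a subagent's claims against its own payload before acting on them:

```python
# Hypothetical guard: never treat a subagent's self-reported metadata as
# ground truth. Cross-check its claims against the payload before acting.

def verify_test_output(file_path: str, content: str, test_count: int) -> list[str]:
    problems = []
    if not file_path.endswith(".py"):
        problems.append(f"unexpected file extension: {file_path}")
    # Count test functions actually present in the returned file.
    actual = content.count("def test_")
    if actual != test_count:
        problems.append(f"claimed {test_count} tests, found {actual}")
    return problems

clean = verify_test_output(
    "tests/test_auth.py",
    "def test_login(): ...\ndef test_lockout(): ...",
    2,
)
suspect = verify_test_output("tests/test_auth.py", "def test_login(): ...", 3)
```

Checks this shallow will not catch every compounding error, but they convert some silent propagation into visible failures at the handoff point.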
Observability closes the gap
The Agents SDK includes a trace() context manager that captures full execution trees, with spans for each agent invocation, its input, its output, and its children:
```python
from agents import Runner, trace

with trace(workflow_name="auth-update") as t:
    result = Runner.run_sync(orchestrator, "Add login rate limiting to auth.py")

for span in t.spans:
    print(span.agent_name, span.input, span.output)
    for child in span.children:
        print("  ", child.agent_name, child.input, child.output)
```
This is the distributed tracing pattern applied to agent workflows, analogous to what Google’s Dapper established for distributed systems. The analogy holds in both directions: like distributed tracing, it is opt-in, asynchronous, and only as useful as the spans you instrument.
LangSmith provides similar hierarchical trace visualization for LangChain and LangGraph pipelines. Anthropic’s Claude Code has its own internal tracing infrastructure. The difference with the Agents SDK is that the trace API is exposed directly to users, so you can build your own analysis on top of it rather than depending on a platform dashboard.
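As a sketch of what "build your own analysis" can mean, here is a token-cost rollup over an exported span tree. The span shape used here (dicts with `agent_name`, `input_tokens`, `output_tokens`, `children`) is an assumption for illustration, not the SDK's actual export schema:

```python
# Assumed span shape for illustration; the real SDK export schema may differ.
def tokens_by_agent(spans, totals=None):
    """Recursively sum input + output tokens per agent across a span tree."""
    totals = {} if totals is None else totals
    for span in spans:
        name = span["agent_name"]
        totals[name] = totals.get(name, 0) + span["input_tokens"] + span["output_tokens"]
        tokens_by_agent(span.get("children", []), totals)
    return totals

spans = [{
    "agent_name": "orchestrator", "input_tokens": 1200, "output_tokens": 300,
    "children": [
        {"agent_name": "test-writer", "input_tokens": 800,
         "output_tokens": 2500, "children": []},
    ],
}]
```

A rollup like this answers the question dashboards often bury: which agent in the tree is actually spending the tokens.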
When subagents pay off
The overhead of context isolation, structured handoffs, and coordinated routing only makes sense for specific task shapes:
- Tasks that would exhaust a single context window across many iterations: large refactors, multi-file analysis, extended test-fix cycles
- Tasks that decompose into genuinely independent subtasks with stable interfaces between them
- Tasks where different subtasks benefit from different models, for example a more capable model for security review and a faster one for reading files
For simple bug fixes, single-function changes, or any task where each observation tightly informs the next decision, a monolithic agent is faster and less error-prone. The subagent boundary adds coordination overhead that only pays off when the task is large enough to need it. The right move is to profile the single-agent version first and identify where it actually breaks down before introducing delegation.
The broader picture
Compared to Claude Code’s built-in agent types, Codex’s custom agent system gives users direct access to the agent definition layer. Claude Code’s specialized agents are defined by Anthropic and are not user-extensible at the same level of granularity. Codex’s approach means more flexibility but also means the quality of specialization depends entirely on how well users write agent instructions and routing descriptions.
If custom agents expose MCP-compatible interfaces, the same agent definition could theoretically be invoked from Codex, Claude Code, or any other MCP-aware orchestrator. That composability is the more interesting long-term implication: not that Codex has subagents now, but that the primitives for defining specialized agents are becoming standardized enough to be portable across toolchains. The near-term value is better task decomposition for large coding tasks. The longer-term value is an ecosystem where specialized agents are reusable components rather than internals of a single tool.