Codex Grows a Delegation Layer: What Subagents and Custom Agents Actually Change
Source: simonwillison
OpenAI quietly shipped subagent support for Codex CLI, and Simon Willison covered the announcement. The feature summary is short: Codex can now spawn subagents using the OpenAI Agents SDK, and you can register your own custom agents for it to delegate to. That’s the what. The more interesting part is the why behind the specific design choices, and what they tell you about the problems hierarchical agent systems keep running into.
The Mechanism
Subagent support in Codex is built on the .as_tool() method from the OpenAI Agents SDK. The pattern wraps any Agent object into a callable tool that another agent can invoke:
from agents import Agent, Runner
test_writer = Agent(
name="test-writer",
instructions="Write comprehensive pytest tests for the provided code.",
model="gpt-4o"
)
orchestrator = Agent(
name="orchestrator",
instructions="Implement features and coordinate quality checks.",
tools=[
test_writer.as_tool(
"write_tests",
"Generate a pytest test file for a piece of code"
)
]
)
result = Runner.run_sync(orchestrator, "Add a user registration endpoint...")
When the orchestrator calls write_tests, the SDK spins up a fresh test_writer invocation with its own context window. The parent sees only the tool’s return value. All intermediate reasoning, file reads, and tool calls inside the subagent are invisible to the parent. That invisibility is not a limitation; it is the point.
Custom agents extend this further. You register an agent with a name, description, capability set, and input schema:
{
"name": "migration-writer",
"description": "Generates and validates SQL migration scripts. Use when database schema changes are required.",
"capabilities": ["file-read", "file-write", "shell-exec"],
"input_schema": {
"task": "string",
"schema_context": "string"
}
}
The description field is the routing mechanism. The orchestrating model reads it at decision time and uses it to decide whether to delegate. This means description quality directly affects routing accuracy. A vague description produces mis-routing; overlapping descriptions between agents produce inconsistent behavior. It functions more like a function signature in an OpenAPI spec than like a code comment.
The Agents SDK also supports structured output types via Pydantic, which lets you enforce a schema on what a subagent returns:
from pydantic import BaseModel
class TestFileOutput(BaseModel):
file_path: str
content: str
test_count: int
coverage_targets: list[str]
test_writer = Agent(
name="test-writer",
instructions="Write pytest tests. Return structured output.",
model="gpt-4o",
output_type=TestFileOutput
)
This matters more than it might seem. Unstructured string returns from subagents require the parent to parse natural language to determine what happened, which reintroduces the ambiguity you were trying to eliminate by delegating in the first place.
Why Context Isolation Is the Core Design Decision
Every subagent invocation starts with a clean context window. The parent’s full conversation history does not automatically flow into the subagent. You pass what you want to pass, explicitly.
This is the right default, and the research supports it. The “Lost in the Middle” paper demonstrated that language model performance degrades when relevant information is buried in the middle of long contexts. Models attend well to the start and end of their context window, and much worse to content in between. A subagent receiving a 40,000-token orchestrator history to answer a specific 200-token question is structurally disadvantaged before it starts.
Context isolation also has a token economy dimension. Naive multi-agent design passes the full orchestrator context to every subagent. Three subagents each receiving a 20,000-token history means 60,000 input tokens before any work begins, on top of each subagent’s own processing. At current API pricing this accumulates fast, and it accumulates fastest on the tasks you run most often. The correct pattern is to pass only three things: the precise task specification, the relevant input data, and the explicit output schema.
There is a third reason for isolation that is harder to see until something goes wrong: security. The InjecAgent benchmark found roughly 24% attack success rates for prompt injection against GPT-4-turbo in single-agent settings. In multi-agent systems, injected instructions can propagate upward through the call chain. A subagent that reads an attacker-controlled file and then returns its contents as part of a tool result can poison the orchestrator’s context. Context isolation limits this blast radius: a compromised subagent can only affect its own output, not the orchestrator’s full working memory.
The minimal footprint principle follows from this: give each subagent only the tools it actually needs. A subagent that writes test files does not need shell-exec. A subagent that reads documentation does not need file-write. The constraint is both a security property and an architectural signal. If you cannot enumerate a subagent’s required tools, the task is probably not scoped tightly enough to delegate cleanly.
Observability: The Trace API
Subagent calls are difficult to debug in the conventional sense. You cannot set a breakpoint inside a subagent invocation the way you would in a function call, because the invocation is a separate model call, not a stack frame. The Agents SDK exposes a trace() API for this:
from agents import Runner, trace
with trace(workflow_name="auth-update") as t:
result = Runner.run_sync(orchestrator, "Add login rate limiting to auth.py")
for span in t.spans:
print(span.agent_name, span.input, span.output)
for child in span.children:
print(" ", child.agent_name, child.input, child.output)
This is the distributed tracing pattern, applied to agent workflows. The conceptual lineage runs back to Google’s Dapper paper and forward through Zipkin, Jaeger, and OpenTelemetry. The problem is structurally the same: you have a request that fans out across multiple execution units, and you need causality preserved across those boundaries to diagnose failures. The difference is that LLM agent spans include natural language inputs and outputs rather than structured RPC payloads, which makes automated analysis harder but human inspection more informative.
Tooling like LangSmith offers similar capabilities for LangChain-based systems. What the Agents SDK trace API adds is a first-party, framework-native surface that does not require a separate observability backend to get started.
The Distributed Systems Frame
Every subagent invocation is structurally an RPC call with potential for non-idempotent side effects. This is the framing that most agent framework documentation avoids, because it immediately raises uncomfortable questions.
Consider a subagent that creates a GitHub pull request and then fails to return cleanly. The orchestrator faces a retry dilemma: retry risks creating a duplicate PR; skipping the retry leaves the workflow in an unknown state. This is not a new problem. Stripe has published extensively on idempotency key design for exactly this class of issue in payment APIs. The saga pattern from distributed systems offers compensating transactions as a recovery mechanism. These patterns exist because distributed state mutations are hard, and subagent calls that touch external state are distributed state mutations.
Parallel subagent execution adds another constraint. Two conditions must hold for safe concurrency: no write-access overlap (two agents cannot write the same file simultaneously) and no input-output dependency (Agent B cannot consume Agent A’s output if they run in the same batch). Most real-world refactors end up two-phase in practice: update shared abstractions first, then fan out parallel downstream changes. Frameworks like LangGraph make these dependencies explicit by modeling workflows as DAGs with checkpointing, which allows automatic scheduling of independent nodes. Codex’s approach relies more on the orchestrating model’s judgment, which is faster to set up and less reliable under complex dependency structures.
How This Compares to Other Frameworks
The multi-agent space has converged on a small set of patterns, implemented with varying levels of ceremony:
AutoGen uses GroupChat with nested agent hierarchies, with more permissive tool access defaults and class-defined agent objects. CrewAI structures delegation around roles in a “crew,” with a manager agent and structured expected outputs for each worker. LangGraph is the most explicit about dependencies, requiring graph structure to be defined upfront.
Claude Code’s Task tool runs subagents sequentially by default and provides a fixed set of built-in agent types (general-purpose, Plan, Explore). These are defined by Anthropic and not user-extensible at the same granularity as Codex’s custom agents. The tradeoff is predictability: built-in agent types have well-documented capabilities, which makes reasoning about their behavior easier.
Codex’s approach sits closer to the low-ceremony end of the spectrum. The AGENTS.md convention, which Codex has used since its initial release for project-level instructions, extends to custom agents: you can scope an agent’s configuration to a specific directory by placing its AGENTS.md there. This makes agent definitions human-readable and auditable in a way that code-only configuration is not.
One longer-term implication worth noting: if custom Codex agents expose MCP-compatible interfaces, the same agent definition could be invoked from Codex, Claude Code, or any MCP-aware orchestrator. The agent becomes a portable unit of capability rather than a framework-specific artifact. That portability is not guaranteed by the current design, but the direction of travel in the ecosystem points toward it.
What Changes in Practice
The most immediate change is scope management. Single-agent Codex runs accumulate context through a long task. Subagents let you offload bounded, well-defined subtasks to fresh contexts, which keeps the orchestrator’s working memory focused on coordination rather than the accumulated details of each subtask.
The second change is reuse. A well-scoped custom agent for writing migration scripts, generating API documentation, or running security checks can be defined once and reused across projects via the AGENTS.md convention. This is the team-level abstraction that single-agent setups make awkward: everyone reimplements the same task-specific logic in their own prompts, inconsistently.
The third change is auditing. Explicit task boundaries, structured inputs and outputs, and the trace API together make it possible to inspect what each subagent was asked, what it returned, and where failures occurred. That audit trail does not exist when a single agent handles everything in one unbroken conversation thread.
None of this eliminates the fundamental challenges: prompt injection across trust boundaries, retry semantics for stateful side effects, context budgeting for nested calls. What it provides is a cleaner set of primitives for addressing those challenges systematically, rather than working around them in each individual agent prompt.