Simon Willison noted the announcement briefly, but the technical substance beneath it is worth unpacking. OpenAI has added subagent support and user-definable custom agents to Codex CLI, built on top of the OpenAI Agents SDK. On the surface it looks like a convenience feature. Underneath, it makes a specific set of architectural choices that separate this approach from what frameworks like LangGraph or AutoGen provide.
The .as_tool() pattern
The load-bearing API is a single method: .as_tool() on any Agent object. It wraps an agent as a callable tool for a parent orchestrator:
from agents import Agent, Runner

test_writer = Agent(
    name="test-writer",
    instructions="Write comprehensive pytest tests for the provided code.",
    model="gpt-4o",
)

orchestrator = Agent(
    name="orchestrator",
    instructions="Implement features and coordinate quality checks.",
    tools=[
        test_writer.as_tool(
            tool_name="write_tests",
            tool_description="Generate a pytest test file for a piece of code",
        )
    ],
)

result = Runner.run_sync(orchestrator, "Add a user registration endpoint with rate limiting")
When the orchestrator calls write_tests, the SDK spins up a fresh test_writer invocation with its own isolated context window. The parent sees only the tool’s return value. All intermediate reasoning, file reads, and internal tool calls inside the subagent are invisible to the orchestrator. That invisibility is not an accident; it is the central design decision.
Why context isolation matters
The “Lost in the Middle” paper established what practitioners had already observed: language models attend poorly to content buried in long contexts. A subagent receiving a 40,000-token orchestrator conversation history to answer a 200-token question is structurally disadvantaged before it processes a single relevant token.
Context isolation sidesteps this by forcing explicit information handoff. You pass only what the subagent needs: the precise task specification, the relevant input data, and an explicit output schema. Nothing else bleeds through. This also limits the blast radius of prompt injection attacks. A malicious payload injected into a web-fetching subagent cannot propagate into the orchestrator’s full conversation history if the orchestrator sees only the subagent’s final output.
The tradeoff is real: implicit context that accumulates naturally in single-agent conversations, the kind of “by the way, this API endpoint is deprecated” knowledge that builds up over turns, must now be made explicit at every delegation boundary. If you forget to pass it, the subagent works with incomplete information and confidently returns a plausible but wrong result.
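One way to keep that discipline is to make the handoff fail loudly when context is missing. A minimal stdlib sketch, where `DelegationPayload` and `build_payload` are my own scaffolding rather than anything in the Codex or Agents SDK API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegationPayload:
    task: str                        # the precise task specification
    input_data: str                  # only the data the subagent needs
    known_caveats: tuple[str, ...]   # implicit knowledge made explicit


def build_payload(task: str, input_data: str,
                  known_caveats: tuple[str, ...] = ()) -> DelegationPayload:
    # Fail at the delegation boundary instead of letting the subagent
    # proceed with incomplete information and return a plausible wrong answer.
    if not task.strip() or not input_data.strip():
        raise ValueError("delegation requires an explicit task and input data")
    return DelegationPayload(task, input_data, known_caveats)
```

The `known_caveats` field is where the "by the way, this endpoint is deprecated" knowledge goes; forcing it into the payload type makes the omission visible in code review rather than in a wrong result three turns later.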
The description field is the API contract
Custom agents are registered with a JSON definition that includes a description field:
{
  "name": "migration-writer",
  "description": "Generates and validates SQL migration scripts. Use when database schema changes are required.",
  "capabilities": ["file-read", "file-write", "shell-exec"],
  "input_schema": {
    "task": "string",
    "schema_context": "string"
  }
}
The orchestrating model reads that description at decision time to determine whether to invoke this agent, which makes the description behave less like a comment and more like an operation signature in an OpenAPI spec. A vague description produces mis-routing. Two agents with overlapping descriptions produce inconsistent behavior that generates no error in your logs; the orchestrator just picks one, unpredictably.
This is a qualitatively different failure mode from traditional software. In a typed system, calling the wrong function produces a type error at compile time or a runtime exception. Here it produces a subtly wrong result that looks plausible. The discipline required for writing agent descriptions is more like writing good SQL index hints or cache invalidation logic than writing function signatures: correctness is not enforced, it is maintained.
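As an illustration of the failure mode (these agents are hypothetical, not from the Codex documentation), compare a pair of descriptions the orchestrator cannot reliably distinguish with a pair it can:

```json
{
  "ambiguous": [
    {"name": "sql-helper", "description": "Helps with SQL."},
    {"name": "migration-writer", "description": "Works with database scripts."}
  ],
  "disambiguated": [
    {"name": "sql-helper", "description": "Optimizes and explains existing SELECT queries. Never modifies schema."},
    {"name": "migration-writer", "description": "Generates and validates SQL migration scripts. Use only when the database schema must change."}
  ]
}
```

The second pair carves the task space into non-overlapping regions and states the negative case ("never modifies schema") explicitly, which is the property a router can act on.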
Per-agent model selection as a cost lever
Each agent in the hierarchy accepts its own model parameter. A documentation writer does not need the same model as a security reviewer:
doc_writer = Agent(
    name="doc-writer",
    instructions="Generate inline documentation for the provided function.",
    model="gpt-4.1-mini",
    tools=[read_file],
)

security_reviewer = Agent(
    name="security-reviewer",
    instructions="Review code for injection vulnerabilities and privilege escalation.",
    model="o4-mini",
    tools=[read_file],
)

orchestrator = Agent(
    name="orchestrator",
    model="gpt-4.1",
    tools=[
        doc_writer.as_tool(
            tool_name="write_docs",
            tool_description="Generate inline documentation",
        ),
        security_reviewer.as_tool(
            tool_name="review_security",
            tool_description="Review code for security vulnerabilities. Use when code handles user input, authentication, or file access.",
        ),
    ],
)
Pattern-completion tasks go to cheaper, faster models. Security review or complex reasoning tasks go to a more capable reasoning model. The orchestrator itself sits somewhere in between, since its job is coordination rather than deep domain analysis.
Temperature is also per-agent. A brainstorming agent generating architectural alternatives can run at higher temperature; a security reviewer should run at zero. This kind of per-task tuning was always possible in principle but required custom infrastructure to implement. Codex surfaces it as a first-class configuration option.
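The resulting tuning matrix can be sketched as a simple table; the values here are illustrative defaults, and the dictionary is my own scaffolding rather than a Codex structure (in the Agents SDK itself, temperature lives in per-agent model settings):

```python
# Illustrative per-agent tuning: cheap, low-temperature settings for rote
# pattern-completion; a reasoning model at temperature zero for review.
AGENT_TUNING = {
    "doc-writer":        {"model": "gpt-4.1-mini", "temperature": 0.3},
    "brainstormer":      {"model": "gpt-4.1",      "temperature": 1.0},
    "security-reviewer": {"model": "o4-mini",      "temperature": 0.0},
}


def tuning_for(name: str) -> dict:
    # Unknown agents fall back to conservative orchestrator-grade settings.
    return AGENT_TUNING.get(name, {"model": "gpt-4.1", "temperature": 0.2})
```

Keeping the matrix in one place also makes the cost structure reviewable: a pull request that moves an agent onto a more expensive model is a one-line diff.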
Structured output prevents silent failures
The Agents SDK supports forced output schemas via Pydantic:
from pydantic import BaseModel

class TestFileOutput(BaseModel):
    file_path: str
    content: str
    test_count: int
    coverage_targets: list[str]

test_writer = Agent(
    name="test-writer",
    instructions="Write pytest tests. Return structured output.",
    model="gpt-4o",
    output_type=TestFileOutput,
)
When the orchestrator receives a TestFileOutput object rather than raw text, it can programmatically check whether the subagent produced a meaningful result before treating that output as authoritative for the next step. This connects to OpenAI’s function calling infrastructure at the API level, where schema enforcement happens in the model call itself rather than as post-processing on the client.
How this compares to LangGraph and AutoGen
LangGraph takes the opposite philosophical stance. It requires you to define an explicit graph of agent nodes and edges, with state and checkpointing declared upfront (and, unlike a pure DAG pipeline, it permits cycles for retry loops). This gives you a complete picture of the workflow’s dependency structure before any execution begins. The cost is ceremony: you write more code to define more explicit structure.
AutoGen and CrewAI sit somewhere in between, with class-based agent definitions and role framing; both are still more explicit about topology than Codex’s description-routing approach.
Codex’s approach is lower ceremony. The routing logic lives inside the orchestrator model’s interpretation of your descriptions rather than in an explicit graph definition. For small agent hierarchies with clear task boundaries, this works well and requires less code. For complex workflows where two tasks have a strict dependency ordering, the description-routing approach provides weaker guarantees: you can write the description to hint at ordering, but you cannot enforce it structurally.
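When ordering must hold, you can recover the guarantee by sequencing the calls in your own code rather than in descriptions. A sketch of that pattern, with `run_agent` standing in for the SDK's runner (the function name and signature are hypothetical):

```python
def pipeline(task: str, run_agent) -> str:
    # Explicit control flow: tests and review always see the finished
    # implementation, an ordering that description routing alone
    # cannot structurally guarantee.
    implementation = run_agent("implementer", task)
    run_agent("test-writer", f"Write tests for:\n{implementation}")
    return run_agent("security-reviewer", f"Review:\n{implementation}")
```

This trades the orchestrator's flexibility for a hard ordering constraint, which is the same trade LangGraph makes globally; here you make it locally, only for the steps that need it.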
The AGENTS.md integration preserves auditability. Custom agent configurations live as human-readable markdown files scoped to project directories, analogous to how Claude Code uses CLAUDE.md. This makes the agent library discoverable by reading the repository rather than tracing through Python configuration code.
Observability via trace()
The Agents SDK exposes a trace() context manager for grouping a run under a named workflow, plus a TracingProcessor hook for observing every span in the agent execution tree as it completes:

from agents import Runner, add_trace_processor, trace
from agents.tracing import TracingProcessor

class SpanLogger(TracingProcessor):
    def on_trace_start(self, trace): pass
    def on_trace_end(self, trace): pass
    def on_span_start(self, span): pass
    def on_span_end(self, span):
        # span_data carries agent, generation, and tool-call details
        print(type(span.span_data).__name__, span.span_data.export())
    def shutdown(self): pass
    def force_flush(self): pass

add_trace_processor(SpanLogger())
with trace(workflow_name="auth-update"):
    result = Runner.run_sync(orchestrator, "Add login rate limiting to auth.py")
This is the distributed tracing pattern, drawing conceptual lineage from Google’s Dapper paper and modern OpenTelemetry tooling, applied to agent call trees. Unlike LangSmith, which requires a separate observability platform, the trace API here is exposed directly. You can write your own analysis against the span tree without routing data through an external dashboard. That is the right default for teams that care about data handling or want to integrate agent traces into existing monitoring infrastructure.
What this means for how you structure agent work
The principal-agent problem from economics, where a principal delegates to an agent who may have different information or incentives, maps onto this architecture in a literal way. The orchestrator is the principal; each subagent is an agent. The orchestrator cannot observe the subagent’s intermediate work, only its final output. This makes output schema design and description quality the two places where most workflow failures will originate.
For developers building on Codex, the practical implication is that the engineering work shifts from writing prompts to designing interfaces: what exactly does each agent receive, what exactly does it return, and how does the orchestrator route to it. That is less like prompt engineering and more like writing a service contract.
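Concretely, a service contract for the migration-writer example above might be typed like this; the field names are my own illustration of the shape such a contract takes, not part of the Codex schema:

```python
from typing import TypedDict


class MigrationRequest(TypedDict):
    # Mirrors the input_schema of the migration-writer definition above.
    task: str
    schema_context: str


class MigrationResult(TypedDict):
    # What the orchestrator is entitled to rely on, and nothing more.
    sql: str
    rollback_sql: str
    validated: bool
```

Everything outside these two types (the subagent's reasoning, its file reads, its retries) is deliberately unobservable, which is the principal-agent boundary made literal in the type system.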
The Model Context Protocol direction suggests a natural extension: if custom Codex agents expose MCP-compatible interfaces, the same agent definition becomes portable across Codex, Claude Code, or any MCP-aware orchestrator. An agent library built for one tool becomes infrastructure usable by others. Whether that portability materializes depends on how much the frameworks converge on MCP as a common substrate, but the architectural groundwork is being laid for it.