
Codex Custom Agents and the Trade-offs in Description-Driven Routing

Source: simonwillison

OpenAI shipped subagent support and user-definable custom agents for Codex CLI, as noted by Simon Willison. The feature lets an orchestrating Codex instance spawn specialized subordinate agents, each with its own context window, model selection, and tool access restrictions. On the surface this looks like a nice ergonomic improvement. One layer down, it surfaces a genuine architectural choice that every multi-agent framework has to make, and the decision Codex made has consequences worth thinking through.

The Core Mechanism

The load-bearing API is .as_tool(), a method on any Agent object from the OpenAI Agents SDK. It wraps an agent as a callable tool that an orchestrator can invoke:

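from agents import Agent, Runner

# Specialist agent: runs in its own context when invoked as a tool.
test_writer = Agent(
    name="test-writer",
    instructions="Write comprehensive pytest tests for the provided code.",
    model="gpt-4o",
)

orchestrator = Agent(
    name="orchestrator",
    instructions="Implement features and coordinate quality checks.",
    tools=[
        # Expose the specialist to the orchestrator as a callable tool.
        test_writer.as_tool(
            tool_name="write_tests",
            tool_description="Generate a pytest test file for a piece of code",
        )
    ],
)

# Requires OPENAI_API_KEY in the environment.
result = Runner.run_sync(orchestrator, "Add a user registration endpoint")
print(result.final_output)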

When the orchestrator calls write_tests, the SDK spins up a fresh test_writer invocation with its own isolated context window. The parent sees only the tool’s return value. All intermediate reasoning, file reads, and internal tool calls inside the subagent are invisible to the orchestrator.

Custom agents in Codex are registered via AGENTS.md files scoped to project directories, following the same convention that CLAUDE.md established for per-project instructions. The format is a JSON block with a name, description, capability list, and input schema:

{
  "name": "migration-writer",
  "description": "Generates and validates SQL migration scripts. Use when database schema changes are required.",
  "capabilities": ["file-read", "file-write", "shell-exec"],
  "input_schema": {
    "task": "string",
    "schema_context": "string"
  }
}

This makes agent libraries human-readable and auditable in version control. A new contributor can open AGENTS.md and understand what domain-specific agents exist before running anything.

Context Isolation Is the Design Decision That Matters

Every subagent invocation starts with a clean context window. The parent’s full conversation history does not flow into the subagent automatically. This is presented as a feature, and it genuinely is, but it is worth being explicit about why.

The “Lost in the Middle” paper documented that language models perform worse on information buried in the middle of long contexts compared to information at the beginning or end. If three subagents each naively receive a 20,000-token conversation history, you have 60,000 input tokens before any actual work happens, and the content that actually matters for each agent’s subtask is scattered through a large context that the model will partially ignore.
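The arithmetic above can be made concrete with a back-of-the-envelope sketch. The function name and the 1,500-token brief size are illustrative assumptions, not measurements from Codex.

```python
def fanout_input_tokens(n_subagents: int, tokens_per_call: int) -> int:
    """Total input tokens consumed before any subagent produces output."""
    return n_subagents * tokens_per_call

# Naive: every subagent receives the full 20,000-token parent history.
naive = fanout_input_tokens(3, 20_000)

# Isolated: every subagent receives only a focused brief.
# The 1,500-token brief size is an assumed figure for illustration.
isolated = fanout_input_tokens(3, 1_500)

print(naive, isolated)  # 60000 4500
```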

There is also a security dimension. The InjecAgent benchmark found that a ReAct-prompted GPT-4 agent was vulnerable to indirect prompt injection roughly 24% of the time. An isolated context limits the blast radius: a prompt injected through one subagent’s tool results cannot reach the orchestrator’s reasoning directly; it can only influence the value returned across the tool call boundary.

The correct calling pattern follows from this. Each subagent call should receive a precise task specification, the specific input data it needs, and an explicit output schema. Nothing more. A subagent that requires 15 pieces of dispersed context to function is doing too much and should be decomposed further.
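One way to hold a call to that discipline is to make the payload a small typed object with exactly those three parts. This is a minimal sketch in plain Python; SubagentCall and to_prompt are hypothetical names, not part of Codex or the Agents SDK.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SubagentCall:
    """Everything a subagent receives; nothing from the parent history."""
    task: str                      # precise task specification
    inputs: dict[str, str]         # only the data this task needs
    output_schema: dict[str, str]  # field name -> expected type name

    def to_prompt(self) -> str:
        # Serialize with sorted keys so identical calls yield identical prompts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

call = SubagentCall(
    task="Write pytest tests for the register_user function.",
    inputs={"source": "def register_user(email, password): ..."},
    output_schema={"file_path": "str", "content": "str", "test_count": "int"},
)
prompt = call.to_prompt()
```

If a call cannot be expressed in this shape without a long list of extra inputs, that is a hint the task should be split.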

The Agents SDK supports enforcing output schemas via Pydantic, which eliminates a whole class of parsing errors:

from agents import Agent
from pydantic import BaseModel

# Typed result the subagent must return.
class TestFileOutput(BaseModel):
    file_path: str
    content: str
    test_count: int
    coverage_targets: list[str]

test_writer = Agent(
    name="test-writer",
    instructions="Write pytest tests. Return structured output.",
    model="gpt-4o",
    output_type=TestFileOutput,  # final output is parsed into this model
)

Schema enforcement happens at the model call level via OpenAI’s structured output support, not as post-processing. What reaches the orchestrator is either a validated typed object or an explicit error, which makes failure recovery tractable.

The Routing Trade-off

This is where the Codex design makes a choice that differs from the main alternatives, and the choice has real consequences.

Codex uses description-driven routing. The orchestrating model reads each tool’s description at runtime and decides which agent to delegate to based on semantic matching. The description field in an agent definition is not documentation; it is the routing mechanism. Vague descriptions produce mis-routing. Overlapping descriptions between agents produce inconsistent behavior without any logged error.

The explicit alternative is graph-defined routing, which LangGraph implements. You define edges between agent nodes in code, and the framework enforces them:

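from langgraph.graph import StateGraph, END

# AgentState (the state schema) and route_to_reviewer (a function returning
# "security" or "continue") are assumed to be defined elsewhere.
workflow = StateGraph(AgentState)
workflow.add_conditional_edges(
    "orchestrator",
    route_to_reviewer,
    {"security": "security_reviewer", "continue": END},
)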

With LangGraph, routing failures produce deterministic code errors, not silent misbehavior. Dependencies between agents are encoded as explicit graph edges, so independent nodes can be scheduled automatically. The cost is ceremony: you define the full graph structure upfront, which becomes a maintenance burden as the agent topology evolves.
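The difference in failure mode can be illustrated without any framework. The sketch below is plain Python, not LangGraph’s API; ROUTES and next_node are hypothetical names used only to show what a deterministic routing error looks like.

```python
# Explicit routing table: every legal transition is written down in code.
ROUTES = {
    ("orchestrator", "security"): "security_reviewer",
    ("orchestrator", "continue"): "END",
}

def next_node(current: str, decision: str) -> str:
    """Return the next node, failing loudly on an undeclared transition."""
    try:
        return ROUTES[(current, decision)]
    except KeyError:
        raise ValueError(
            f"no edge from {current!r} for decision {decision!r}"
        ) from None

assert next_node("orchestrator", "security") == "security_reviewer"
```

An undeclared decision such as "performance" raises immediately, whereas a description-matched router would pick whichever agent looked closest.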

The comparison across major frameworks looks roughly like this:

Tool | Routing Model | User-Extensible Agents | Explicit Dependency Graph
Codex | Description-driven | Yes, via AGENTS.md | No
LangGraph | Code-defined edges | Yes, via graph nodes | Yes
AutoGen | GroupChat manager | Yes, via class definitions | Limited
Claude Code | Built-in types plus description-driven custom subagents | Yes, via .claude/agents/ | Sequential by default

Claude Code ships with a set of built-in agent types (Explore, Plan, general-purpose) defined by Anthropic, and it also accepts user-defined subagents: Markdown files with YAML frontmatter under .claude/agents/, each carrying a name, a description, an optional tool list, and an optional model. Delegation to those subagents is driven by the description as well, so on the routing axis Claude Code sits with Codex rather than with LangGraph.

Codex’s description-driven approach has prior art in a different domain. JADE, the Java Agent Development Framework popular in the early 2000s, used FIPA’s Agent Communication Language with formal ontology-based capability registration. That approach collapsed in open domains because every new capability required updating a shared ontology. Description-driven routing solves this by offloading semantic matching to the language model itself, removing the formal ontology requirement at the cost of making routing correctness probabilistic.

A practical defense against mis-routing: write negative exclusions in descriptions. “Do not use for security vulnerabilities or performance analysis” actively changes routing behavior in a way that positive-only descriptions do not.
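Overlap and missing exclusions can be caught before the orchestrator ever sees the descriptions. The following is a crude lint sketch, not a Codex feature: the function names, the word-overlap (Jaccard) heuristic, the 0.5 threshold, and the example agents are all assumptions chosen for illustration.

```python
import re
from itertools import combinations

def _words(text: str) -> set[str]:
    """Lowercased words of four or more letters, a rough proxy for key terms."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two descriptions."""
    wa, wb = _words(a), _words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def lint(descriptions: dict[str, str], threshold: float = 0.5) -> list[str]:
    """Flag agent pairs with similar descriptions and agents lacking an exclusion."""
    warnings = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptions.items()), 2):
        if overlap(d1, d2) >= threshold:
            warnings.append(f"overlap: {n1} / {n2}")
    for name, desc in sorted(descriptions.items()):
        if "do not use" not in desc.lower():
            warnings.append(f"no exclusion: {name}")
    return warnings

agents = {
    "code-reviewer": "Reviews code changes for style and correctness. "
                     "Do not use for security vulnerabilities or performance analysis.",
    "style-checker": "Reviews code changes for style and correctness.",
    "migration-writer": "Generates and validates SQL migration scripts. "
                        "Do not use for application code.",
}
warnings = lint(agents)
print(warnings)
```

Word overlap is only a proxy for how the model will read the descriptions, so a clean lint result still needs to be backed by routing tests against real prompts.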

Per-Agent Model Selection as a Cost Lever

Each agent accepts its own model parameter, and sampling settings such as temperature are set per agent through model_settings:

from agents import Agent

# read_file is assumed to be a function tool defined elsewhere.
doc_writer = Agent(name="doc-writer", model="gpt-4.1-mini", tools=[read_file])
security_reviewer = Agent(name="security-reviewer", model="o4-mini", tools=[read_file])
orchestrator = Agent(name="orchestrator", model="gpt-4.1", tools=[...])

Pattern-completion tasks such as documentation and boilerplate generation get cheaper and faster models. Security review and complex reasoning go to reasoning models. Temperature is also per-agent, which matters: on models that accept the parameter, a security reviewer should run at 0 and a brainstorming agent at a higher temperature.

This is the primary cost optimization lever in the system. A realistic agent topology might spend 80% of its budget on orchestration and reasoning while the bulk of the token volume (file reads and documentation generation) runs on cheaper models.
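A rough calculation shows how much model assignment can move the total. Every number below is a hypothetical placeholder chosen for round arithmetic: the prices are not real rate cards, the token volumes are invented, and the model labels are not real model names.

```python
# Hypothetical prices in dollars per million input tokens; not real rate cards.
PRICE_PER_MTOK = {"cheap-model": 0.5, "mid-model": 2.0, "reasoning-model": 10.0}

# Hypothetical token volume per agent for one run of the pipeline.
USAGE = [
    ("doc-writer", "cheap-model", 400_000),
    ("orchestrator", "mid-model", 80_000),
    ("security-reviewer", "reasoning-model", 20_000),
]

def cost(model: str, tokens: int) -> float:
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

# Each agent on the model assigned to it.
mixed = sum(cost(model, tokens) for _, model, tokens in USAGE)

# Same token volume, but everything sent to the most expensive model.
all_reasoning = sum(cost("reasoning-model", tokens) for _, _, tokens in USAGE)

print(round(mixed, 2), round(all_reasoning, 2))
```

Under these assumed numbers the mixed assignment costs roughly a ninth of the all-reasoning one, almost entirely because the high-volume agent sits on the cheap model.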

Failure Semantics Are Under-Discussed

Subagent invocations are structurally equivalent to RPC calls with non-idempotent side effects. This creates failure modes that are familiar from distributed systems but rarely discussed in the agent tooling context.

A subagent that creates a GitHub pull request and then fails before returning cleanly presents a retry dilemma: rerunning the subagent may create a duplicate PR. File writes are generally retry-safe. External writes such as GitHub PRs, Slack messages, database inserts, and payment API calls are not.

Two patterns from the distributed systems literature apply: idempotency keys for external writes, and the saga pattern with compensating transactions for multi-step operations. Structured completion manifests from subagents make safe retry logic possible:

{
  "files_modified": ["src/auth.py", "tests/test_auth.py"],
  "external_actions": [{"type": "pull_request", "url": "...", "number": 42}],
  "tests_passed": true,
  "verification_output": "All 23 tests passed"
}

With a manifest like this, an orchestrator can determine exactly what work was completed before a failure and avoid re-executing the parts that succeeded.
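A minimal sketch of that retry decision, combined with an idempotency key for whatever still has to run. The helper names are hypothetical, and matching completed actions by their type field alone is a simplifying assumption; a real pipeline would match on something more specific.

```python
import hashlib
import json

def idempotency_key(task_id: str, action: dict) -> str:
    """Stable key for an external write: same task and action, same key."""
    payload = json.dumps({"task": task_id, "action": action}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

def actions_to_retry(planned: list[dict], manifest: dict) -> list[dict]:
    """Return planned external actions the manifest does not record as done."""
    done_types = {a["type"] for a in manifest.get("external_actions", [])}
    return [a for a in planned if a["type"] not in done_types]

manifest = {
    "files_modified": ["src/auth.py", "tests/test_auth.py"],
    "external_actions": [{"type": "pull_request", "url": "...", "number": 42}],
    "tests_passed": True,
}
planned = [{"type": "pull_request"}, {"type": "slack_message"}]

# The pull request is already recorded, so only the Slack message remains.
remaining = actions_to_retry(planned, manifest)
key = idempotency_key("auth-update", remaining[0])
```

Sending the key along with the external write lets the receiving service, where it supports such keys, discard a duplicate if the retry itself is repeated.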

The Agents SDK provides a trace() API for observability:

from agents import Runner, trace

# Group every run inside this block under one named trace.
with trace(workflow_name="auth-update"):
    result = Runner.run_sync(orchestrator, "Add login rate limiting to auth.py")

# The SDK records a span for each agent run, tool call, and model call and
# hands them to its trace processors; by default they are exported to the
# OpenAI Traces dashboard, where the orchestrator-to-subagent tree can be inspected.

This is the distributed tracing pattern, with a conceptual lineage running from Google’s Dapper paper through OpenTelemetry, applied to agent call trees. Unlike LangSmith, it is first-party and does not require a separate observability platform.

The Atom Model vs. Thread Model

The OpenAI Assistants API uses a thread model: a persistent conversation accumulating state turn-by-turn. This works well for interactive conversations and poorly for orchestration, because reproducing or debugging a specific decision requires replaying the full conversation thread.

Subagents in Codex follow what could be called the atom model: each invocation receives everything it needs at call time, produces a result, and terminates. No state persists between calls. Every call is independently reproducible from its logged inputs. Debugging reduces to reading the log entry for the failing call and resubmitting it with the same inputs.
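The log-and-resubmit loop can be sketched in a few lines. The function names are hypothetical and count_lines is a deterministic stand-in for a subagent; with a real model call, replay reproduces the inputs exactly, while identical outputs additionally depend on the sampling settings.

```python
import json

def run_logged(agent_fn, call_input: dict, log: list[str]) -> dict:
    """Invoke a stateless subagent and record the exact input and output."""
    output = agent_fn(call_input)
    log.append(json.dumps({"input": call_input, "output": output}, sort_keys=True))
    return output

def replay(agent_fn, log_entry: str) -> dict:
    """Re-run one logged call using only what the log entry contains."""
    return agent_fn(json.loads(log_entry)["input"])

def count_lines(call_input: dict) -> dict:
    # Deterministic stand-in for a subagent; a real model call would go here.
    return {"line_count": len(call_input["source"].splitlines())}

log: list[str] = []
first = run_logged(count_lines, {"source": "a = 1\nb = 2\n"}, log)
again = replay(count_lines, log[0])
```

Nothing outside the log entry is needed for the replay, which is the property a thread-based design lacks.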

The atom model also exposes poor task decomposition. When a call cannot be specified without pulling in context from all over the parent conversation, the task boundary is wrong. The architecture creates pressure toward better-scoped agents.

Parallel Execution Has Prerequisites

Running subagents in parallel requires two conditions to hold simultaneously: no write-access overlap between agents, and no input-output dependency between them. Two agents cannot safely write the same file concurrently. An agent whose input is another agent’s output must run sequentially.

Most large refactors are two-phase in practice: update shared abstractions first, then fan out parallel downstream changes. Building an access matrix mapping each agent to its reads and writes before scheduling is not optional if you want safe parallelism.
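An access matrix and the two conditions translate directly into a small scheduler. This is a sketch under assumptions: the function names and file paths are hypothetical, and agents are assumed to be listed in their intended order, with producers before consumers.

```python
def can_run_in_parallel(a: dict, b: dict) -> bool:
    """True if the agents share no written file and neither reads the other's writes."""
    write_overlap = a["writes"] & b["writes"]
    dependency = (a["writes"] & b["reads"]) | (b["writes"] & a["reads"])
    return not write_overlap and not dependency

def schedule(agents: dict[str, dict]) -> list[list[str]]:
    """Place each agent in the batch right after the last batch it conflicts with."""
    batches: list[list[str]] = []
    for name, access in agents.items():
        last_conflict = -1
        for i, batch in enumerate(batches):
            if any(not can_run_in_parallel(access, agents[o]) for o in batch):
                last_conflict = i
        target = last_conflict + 1
        if target == len(batches):
            batches.append([])
        batches[target].append(name)
    return batches

matrix = {
    "update-base-class": {"reads": {"src/base.py"}, "writes": {"src/base.py"}},
    "update-users": {"reads": {"src/base.py", "src/users.py"}, "writes": {"src/users.py"}},
    "update-orders": {"reads": {"src/base.py", "src/orders.py"}, "writes": {"src/orders.py"}},
}
batches = schedule(matrix)
print(batches)
```

The result is the two-phase shape described above: the shared abstraction is updated alone in the first batch, and the two downstream agents fan out together in the second.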

LangGraph encodes dependencies as explicit graph edges and can schedule independent nodes automatically. Codex relies on the orchestrator model’s judgment, which returns us to the description-driven routing trade-off: flexible and low-ceremony, but probabilistic rather than deterministic.

Where This Is Going

Codex CLI exposes its agent loop via a bidirectional JSON-RPC 2.0 socket called the App Server, and a codex-mcp crate wraps that as a Model Context Protocol server. If custom Codex agents expose MCP-compatible interfaces, the same agent definition becomes portable across Codex, Claude Code, or any MCP-aware orchestrator. The agent library becomes infrastructure rather than a framework-specific artifact.

The Anthropic documentation on agent architecture converges on similar principles around context isolation and structured hand-offs. The ecosystem is moving toward a shared understanding of what good agent composition looks like, even as frameworks differ on routing strategy and ceremony level.

Codex’s approach optimizes for fast iteration and low setup cost. You write agent descriptions in AGENTS.md, wire them up with .as_tool(), and the orchestrator figures out the routing. The cost is that routing correctness is a property you verify through testing, not through the compiler or framework enforcement. For teams building robust production pipelines where agent routing failures have real consequences, the explicit graph approaches may be worth the ceremony. For exploration and tooling where fast iteration matters more, Codex’s approach removes a real source of friction. Both are legitimate trade-offs; the important thing is understanding which one you are making.
