OpenAI’s Codex picked up subagent support this week, as covered by Simon Willison. The feature lets you define custom agents using the OpenAI Agents SDK, wrap them with .as_tool(), and have an orchestrating Codex instance delegate to them during a coding run. Each subagent gets its own isolated context window, its own model choice, and its own set of capabilities.
The mechanics are well documented, but one design choice is easy to overlook: Codex routes to subagents by description, not by explicit dispatch logic.
## How the Routing Works
When you define a custom agent with .as_tool(), you pass two arguments: a name and a description. The orchestrating model reads that description at runtime to decide whether to delegate:
```python
from agents import Agent

security_reviewer = Agent(
    name="security-reviewer",
    instructions="Review code for OWASP Top 10 vulnerabilities, injection patterns, and authentication issues.",
    model="o4",
)

orchestrator = Agent(
    name="orchestrator",
    model="gpt-4.1",
    tools=[
        security_reviewer.as_tool(
            "review_security",
            "Review a code file for security vulnerabilities. Use when the task "
            "involves authentication, input validation, SQL queries, or any "
            "user-facing data handling.",
        )
    ],
)
```
The orchestrator never sees the instructions field of security_reviewer. The routing decision is made entirely from the description you provide in .as_tool(), interpreted in the context of the current task. There is no dispatch table, no explicit graph, no conditional logic you write. The orchestrator reasons about which tool to call the same way it reasons about everything else: by reading the descriptions as natural language.
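Concretely, the only routing-relevant artifact is the tool schema the orchestrator's model receives. A simplified sketch of what that schema plausibly looks like follows; the exact wire format is an SDK internal, so treat the field layout as an assumption:

```python
# Simplified sketch of the tool schema the orchestrator's model sees.
# Note what is absent: the subagent's `instructions` field. Only the
# as_tool() description is visible to the router.
tool_schema = {
    "type": "function",
    "function": {
        "name": "review_security",
        "description": (
            "Review a code file for security vulnerabilities. Use when the "
            "task involves authentication, input validation, SQL queries, "
            "or any user-facing data handling."
        ),
        "parameters": {
            "type": "object",
            "properties": {"input": {"type": "string"}},
            "required": ["input"],
        },
    },
}
```

The "OWASP Top 10" phrasing in the subagent's instructions never appears here, which is why the quality of the description alone determines routing.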
## The Alternative: Explicit Graphs
The contrast with LangGraph makes the trade-off concrete. In LangGraph, you define an explicit graph where nodes are agents and edges are transitions. The routing logic is code you write:
```python
from langgraph.graph import StateGraph, END

workflow = StateGraph(AgentState)
workflow.add_node("orchestrator", orchestrator_agent)
workflow.add_node("security_reviewer", security_agent)
workflow.add_conditional_edges(
    "orchestrator",
    route_to_reviewer,  # your function, deterministic
    {"security": "security_reviewer", "continue": END},
)
```
LangGraph’s approach requires more upfront specification. You enumerate the conditions explicitly and write routing functions as code. The trade-off is that the routing behavior is fully deterministic: you can read the code and know exactly when security_reviewer will be called, and you can test that routing logic in isolation.
Codex’s description-driven approach trades that certainty for developer ergonomics. You write a description, and it mostly works. You do not have to think about edges or conditions or what happens when a task falls between two categories.
## Where Description-Driven Routing Breaks Down
The failure mode appears when descriptions overlap. If you have a code-reviewer agent and a security-reviewer agent, and both descriptions mention reviewing code for issues, the orchestrator has to infer which one to use for any given task. That inference is influenced by phrasing, model version, and context in ways that are genuinely hard to predict.
The resulting routing can be inconsistent. Security-sensitive tasks get routed to the general code reviewer on some runs; general code review tasks get escalated to the expensive security model on others. No exception is thrown, the logs show normal tool calls, and you only notice if you are watching closely enough to see the pattern accumulate across multiple runs.
This failure mode is different in kind from a bug in explicit routing code. A wrong conditional will misbehave consistently and you will find it in tests. Probabilistic routing can misbehave intermittently, and the variance can look indistinguishable from normal LLM output variation until you have enough data points to see it clearly.
The practical defense is to write descriptions that are jointly coherent, not just individually accurate. Each description should actively discriminate between agents. The exclusion pattern is genuinely useful here: adding “Do not use for security vulnerabilities or performance analysis” to your general code reviewer’s description gives the orchestrator a concrete negative signal when a security task arrives. Negative exclusions change routing behavior; they are not documentation formalities.
## Why This Trade-Off Exists
Description-driven routing fits naturally with how LLMs already work with function calls. OpenAI’s function calling API uses the same model: you describe a function, the model decides when to call it. Codex extends that pattern to agent invocations. The developer experience is consistent with existing tool-calling conventions, and you do not have to learn a separate abstraction for agent orchestration versus tool use.
There is also a flexibility argument. Explicit routing graphs handle only the transitions you defined at graph-design time. A description-driven system can handle task variants and edge cases that were not anticipated when the agent library was designed, because the orchestrator can reason about novel situations using the descriptions as guidance rather than failing to match against a fixed set of conditions.
The cost is that you are delegating routing correctness to the model. For well-separated agent responsibilities with clear discriminating signals in the descriptions, this works reliably. For adjacent capabilities where the boundary is genuinely fuzzy, the routing becomes inconsistent.
## Model Selection and Cost
Each agent in the Agents SDK accepts its own model parameter, which creates a meaningful cost lever:
```python
doc_writer = Agent(name="doc-writer", model="gpt-4o-mini", tools=[read_file])
security_reviewer = Agent(name="security-reviewer", model="o4", tools=[read_file])

orchestrator = Agent(
    name="orchestrator",
    model="gpt-4.1",
    tools=[
        doc_writer.as_tool(
            "write_docs",
            "Generate inline documentation for a function or module. "
            "Do not use for security review or architectural analysis.",
        ),
        security_reviewer.as_tool(
            "review_security",
            "Review code for security vulnerabilities, injection risks, and "
            "authentication issues. Use when the task involves user-facing "
            "data handling or access control.",
        ),
    ],
)
```
Pattern-completion tasks like generating documentation do not require a reasoning model. Security review, architecture analysis, and tasks requiring multi-step constraint reasoning benefit from one. Routing tasks to the appropriate model tier is a cost lever: a routine documentation task sent to o4 costs substantially more than the same task sent to gpt-4o-mini, with no quality benefit.
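The size of the lever is easy to quantify with back-of-envelope arithmetic. The per-token prices below are hypothetical placeholders for illustration only, not current pricing:

```python
# Hypothetical per-1M-input-token prices, for illustration only.
# Check current pricing before relying on these numbers.
PRICE_PER_M_INPUT_TOKENS = {"gpt-4o-mini": 0.15, "o4": 10.00}

def task_cost(model: str, input_tokens: int) -> float:
    """Input cost of a single task at the illustrative rates above."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS[model]

doc_task_tokens = 8_000  # assumed size of a routine documentation task
cheap = task_cost("gpt-4o-mini", doc_task_tokens)
expensive = task_cost("o4", doc_task_tokens)
ratio = expensive / cheap  # how much a misroute multiplies the cost
```

At these assumed rates every misrouted documentation task costs tens of times more than it needed to, which is why misrouting compounds quietly across many runs.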
The connection back to routing accuracy: description precision determines whether those cost savings materialize in practice. If the security reviewer is being called for tasks that should route to the doc writer, you are paying reasoning-model prices for documentation generation. Writing discriminating descriptions is as much an economics concern as a correctness concern.
## Context Isolation
Separate from the routing question, each subagent invocation starts with a clean context window. The orchestrator’s full conversation history does not flow automatically into subagents; you pass them only what they need.
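The principle is ordinary explicit argument passing. A framework-free sketch of the idea, where the helper is illustrative rather than SDK API:

```python
def build_subagent_input(task: str, relevant_files: dict[str, str]) -> list[dict]:
    """Construct a fresh context for one subagent invocation: the task plus
    only the files it needs, with no inherited conversation history."""
    context = [{"role": "user", "content": task}]
    for path, contents in relevant_files.items():
        context.append({"role": "user", "content": f"# {path}\n{contents}"})
    return context

# The orchestrator's history (secrets, unrelated discussion, untrusted web
# content) never reaches the subagent unless passed explicitly here.
subagent_context = build_subagent_input(
    "Review for injection risks.",
    {"auth_handler.py": "def login(user, pw): ..."},
)
```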
This matters on two dimensions. The Lost in the Middle paper found that language models perform significantly worse when relevant information is buried in long contexts. Starting subagents with only the task-relevant information they need keeps them working with the part of the context where attention is most reliable.
The security implication is also real. The InjecAgent benchmark measured roughly 24% attack success rates against single agents processing untrusted content. Context isolation means injected instructions cannot propagate automatically from one subagent to the orchestrator or to other agents; each invocation is a contained scope.
## Comparing the Landscape
Codex is not the only multi-agent coding framework, and the comparison table is informative:
| Tool | Routing Model | User-Extensible Agents | Explicit Dependency Graph |
|---|---|---|---|
| Codex | Description-driven | Yes, via AGENTS.md | No |
| LangGraph | Code-defined edges | Yes, via graph nodes | Yes, enforced by framework |
| AutoGen | GroupChat with manager | Yes, via class definitions | Limited |
| Claude Code | Predefined agent types | No | Sequential by default |
| Cursor / Aider | Single-agent | No native delegation | N/A |
Codex sits in an interesting position: user-extensible like LangGraph and AutoGen, but without the explicit graph structure that makes routing in those frameworks testable and deterministic. The bet is that description quality can carry the routing burden that graph structure carries elsewhere. For many real-world coding tasks with reasonably distinct agent responsibilities, this bet pays off. For complex agent libraries where capabilities blur together, it requires careful description design.
## The AGENTS.md Convention
Codex extends its existing AGENTS.md project instruction convention to support custom agent definitions. This means your agent library lives in a human-readable file tracked in version control alongside the code it supports.
This is the right call. A Makefile works the same way. A GitHub Actions workflow works the same way. Making the tooling configuration readable without running anything is a property worth having, both for new contributors and for code review. When someone adds a new agent definition to AGENTS.md, reviewing the description for overlap with existing agents is straightforward; the entire routing surface is visible in one file.
The limitation is that AGENTS.md has no enforcement mechanism. A Makefile with a misconfigured target fails loudly. An AGENTS.md with overlapping descriptions just routes inconsistently. The documentation and the routing behavior can diverge without any signal that they have diverged.
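Nothing stops you from adding a lightweight check of your own, though. A hypothetical lint that flags heavy vocabulary overlap between description pairs; the similarity measure, threshold, and description set are all assumptions, not a Codex feature:

```python
from itertools import combinations

def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets, ignoring short words."""
    words_a = {w for w in a.lower().split() if len(w) > 3}
    words_b = {w for w in b.lower().split() if len(w) > 3}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical agent descriptions, as they might appear in AGENTS.md.
descriptions = {
    "write_docs": "Generate inline documentation for a function or module.",
    "review_code": "Review code for readability and maintainability issues.",
    "review_security": "Review code for security vulnerabilities and injection risks.",
}

OVERLAP_THRESHOLD = 0.2  # assumed cutoff; tune against your own agent library
warnings = [
    (a, b, overlap)
    for (a, da), (b, db) in combinations(descriptions.items(), 2)
    if (overlap := description_overlap(da, db)) > OVERLAP_THRESHOLD
]
```

Run in CI, a check like this at least turns silent routing ambiguity into a visible review comment, even though it cannot prove the descriptions discriminate well.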
## What to Actually Do With This
The Agents SDK is well-designed and the feature is worth using. The description fields deserve the same review discipline you would give to an API design: check for interface conflicts when you add new agents, write explicit exclusions where capabilities could overlap, and treat the full set of descriptions as a routing surface that needs to be coherent as a whole, not just individually sensible.
Description-driven routing lowers the barrier to building agent libraries significantly relative to explicit graph frameworks. That is a real advantage. The reliability boundary is determined by how discriminating your descriptions are, and that boundary is worth knowing before you ship something that depends on consistent routing behavior.