One Task, Many Models: The Cost-Performance Case for Custom Agents in Codex
Source: simonwillison
Simon Willison noted last week that OpenAI’s Codex CLI now supports subagents and custom agents. The coverage since has focused on the architectural shift: hierarchical delegation, context isolation, how the orchestrator routes to specialists. All of that is worth understanding. But there is a quieter part of the custom agent design that I keep coming back to, and it is not about architecture at all. It is about model selection.
In a single-agent coding workflow, you pick one model and it handles everything. Every file read, every edit, every test invocation, every security consideration runs through the same weights at the same configuration. The model choice is a global trade-off: you are paying for a capable model on tasks that might not need it, or you are settling for a cheap model on tasks where the difference in reasoning quality is material.
Custom agents in Codex break that constraint. The OpenAI Agents SDK, which underlies Codex’s agent system, accepts a model parameter per agent definition. You can construct a hierarchy where different agents in the same pipeline run on entirely different models.
```python
from agents import Agent, Runner

# read_file is assumed to be a tool defined elsewhere in the project
doc_writer = Agent(
    name="doc-writer",
    instructions="Generate inline documentation for the provided function. Be concise and accurate.",
    model="gpt-4.1-mini",
    tools=[read_file],
)

security_reviewer = Agent(
    name="security-reviewer",
    instructions=(
        "Review the provided code for injection vulnerabilities, insecure "
        "deserialization, and privilege escalation risks. Be thorough. "
        "Flag anything uncertain."
    ),
    model="o4",
    tools=[read_file],
)

orchestrator = Agent(
    name="orchestrator",
    instructions="Coordinate documentation and security review tasks.",
    model="gpt-4.1",
    tools=[
        doc_writer.as_tool(
            tool_name="write_docs",
            tool_description="Generate inline documentation for a function",
        ),
        security_reviewer.as_tool(
            tool_name="review_security",
            tool_description=(
                "Review code for security vulnerabilities. Use when code "
                "handles user input, authentication, or file access."
            ),
        ),
    ],
)
```
The doc_writer runs on gpt-4.1-mini. The security_reviewer runs on o4. The orchestrator coordinating between them sits in the middle.
This is a real cost lever. Documentation generation is largely pattern completion: function signature in, formatted comment out. A mini model handles it well. Security review requires reasoning about indirect data flows, trust boundaries, and edge cases that pattern completion cannot catch reliably. A reasoning model earns its cost there. Running both tasks on o4 would be accurate but expensive. Running both on a mini model would be cheap but uneven in quality where it matters most.
The Task Classification Problem
Using model heterogeneity well requires being honest about which tasks actually need deep reasoning. This is not always obvious. Consider test generation. Writing basic happy-path unit tests for a pure function is pattern completion; a mini model is fine. Writing tests that cover boundary conditions in a stateful system with side effects, or that verify that error paths propagate correctly through a middleware stack, benefits from a model that can reason about the space of possible states rather than just imitate testing patterns.
The practical framework I find useful: if you could describe the task as “transform input X into output Y according to a clear schema,” a mini model will do. If the task requires reasoning about what could go wrong, what invariants should hold, or what the code is actually doing as opposed to what it looks like it is doing, a more capable model is justified.
This maps loosely onto the difference between synthesis and analysis. Documentation synthesis, boilerplate generation, comment formatting, import organization: pattern completion is sufficient. Code review, architecture assessment, security analysis, test coverage design: you want a model that will actually reason about the code rather than recognize its surface structure.
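The synthesis-versus-analysis split can be made explicit as a routing table. This is a hypothetical sketch, not part of the Codex or Agents SDK API; the task names and the model tiers they map to are illustrative:

```python
# Hypothetical routing table: synthesis-style tasks go to a mini model,
# analysis-style tasks to a reasoning model. Names are illustrative.
SYNTHESIS_TASKS = {"write_docs", "generate_boilerplate", "format_comments", "organize_imports"}
ANALYSIS_TASKS = {"review_code", "assess_architecture", "analyze_security", "design_test_coverage"}

def model_for_task(task: str) -> str:
    if task in SYNTHESIS_TASKS:
        return "gpt-4.1-mini"  # pattern completion: transform X into Y per a clear schema
    if task in ANALYSIS_TASKS:
        return "o4"            # reasoning about invariants, failure modes, data flows
    return "gpt-4.1"           # default: a mid-tier generalist

print(model_for_task("write_docs"))        # gpt-4.1-mini
print(model_for_task("analyze_security"))  # o4
```

In practice the orchestrating model makes this decision implicitly from agent descriptions, but writing it out as a table is a useful exercise for auditing whether each task class actually needs the tier it is getting.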
Temperature as a Second Axis
Model selection is the more visible dimension, but temperature configuration per agent is worth noting too. A brainstorming agent tasked with generating multiple implementation approaches benefits from a higher temperature: you want diverse options, not the single most likely completion. A security reviewer should run at temperature 0 or close to it. You want deterministic reasoning and full coverage of known vulnerability patterns, not creative reinterpretation.
In a single-agent system, temperature is again a global setting. You end up with a compromise that is slightly too deterministic for creative tasks and slightly too variable for analytical ones. Per-agent temperature configuration, like per-agent model selection, trades away simplicity for fitness to purpose.
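In Agents SDK terms, this is exposed through per-agent model settings. The sketch below assumes the SDK's `ModelSettings` class, which carries a `temperature` field alongside the model choice:

```python
from agents import Agent, ModelSettings

# Higher temperature for divergent idea generation...
brainstormer = Agent(
    name="brainstormer",
    instructions="Propose several distinct implementation approaches for the given task.",
    model="gpt-4.1",
    model_settings=ModelSettings(temperature=1.0),
)

# ...near-zero temperature for deterministic analytical review.
security_reviewer = Agent(
    name="security-reviewer",
    instructions="Review the provided code for security vulnerabilities.",
    model="gpt-4.1",
    model_settings=ModelSettings(temperature=0.0),
)
```

Note that sampling parameters like temperature generally do not apply to the reasoning model families, which is one more reason the two axes are worth configuring independently.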
How This Compares to Other Approaches
Claude Code’s built-in agent types take a different stance. The Explore, Plan, and general-purpose agents are predefined with fixed capabilities. You select from Anthropic’s taxonomy rather than defining your own. The models powering each type are not something you configure; they are determined by Anthropic based on the task profile.
This design is simpler to use and opinionated in a useful way for the common case. But it does not expose model selection as a tunable. If you have a specific workflow where 80% of the agent’s work is something that could run on a lighter model, you cannot optimize for that.
LangGraph lets you assign models per node in a graph, which is architecturally similar to what Codex custom agents enable. The surface area is different (LangGraph is a general orchestration framework; Codex is a coding-specific tool), but the underlying capability to route different tasks to different models has been available in the LangChain ecosystem for a while. What Codex adds is accessibility within a coding-native workflow, without requiring teams to build and maintain a LangGraph pipeline from scratch.
AutoGen also supports per-agent model configuration, and its AssistantAgent class accepts an llm_config dictionary that can specify model, temperature, and other parameters independently for each agent in a conversation. The pattern is not new; it is just arriving in the more mainstream coding tool layer now.
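For comparison, a minimal AutoGen sketch using the classic `pyautogen` API; the config values are placeholders:

```python
from autogen import AssistantAgent

# Per-agent llm_config: each agent carries its own model and sampling settings.
doc_writer = AssistantAgent(
    name="doc-writer",
    system_message="Generate concise inline documentation.",
    llm_config={"config_list": [{"model": "gpt-4.1-mini"}], "temperature": 0.7},
)

security_reviewer = AssistantAgent(
    name="security-reviewer",
    system_message="Review code for security vulnerabilities.",
    llm_config={"config_list": [{"model": "o4"}], "temperature": 0},
)
```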
The Infrastructure Right-Sizing Analogy
Systems teams have a familiar version of this problem. Not every service deserves a c5.4xlarge. You right-size instances to workloads: high-memory instances for cache layers, compute-optimized instances for processing pipelines, burstable instances for lightweight services. Running everything on the same instance type is operationally simple but economically wasteful.
Model selection in an agent hierarchy is the same decision at the AI layer. The costs are different from infrastructure (model API calls are pay-per-token, not pay-per-hour), but the principle is identical. A well-structured agent hierarchy fits the capability to the task, and over a large volume of agent invocations, the savings from running routine work on cheaper models compound meaningfully.
For teams running Codex against large codebases with significant volume, this is a real budget consideration. If documentation generation, code formatting checks, and changelog updates can run on a mini model, and only security review and architecture assessment route to a reasoning model, the effective cost per task decreases without sacrificing quality where quality matters.
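A back-of-envelope illustration of how this compounds. The prices and token counts below are placeholders, not current OpenAI rates; the point is the shape of the arithmetic, not the specific numbers:

```python
# Hypothetical per-million-token prices and daily workload mix; plug in real rates.
PRICE_PER_MTOK = {"mini": 1.0, "reasoning": 10.0}  # $/1M tokens, illustrative

tasks = [
    # (count per day, avg tokens per task, tier the task actually needs)
    (500, 2_000, "mini"),        # docs, formatting checks, changelog updates
    (50, 8_000, "reasoning"),    # security review, architecture assessment
]

def daily_cost(route_everything_to=None):
    total = 0.0
    for count, tokens, tier in tasks:
        tier = route_everything_to or tier
        total += count * tokens * PRICE_PER_MTOK[tier] / 1_000_000
    return total

mixed = daily_cost()                     # right-sized routing
all_reasoning = daily_cost("reasoning")  # one capable model for everything
print(mixed, all_reasoning)              # 5.0 vs 14.0 with these placeholder numbers
```

With this (invented) mix, routing routine work to the mini model cuts the daily bill by roughly two thirds, and the high-volume routine tasks dominate the savings even though the reasoning tasks are individually more expensive.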
Defining Custom Agents With Model Configuration
The practical side of setting this up in the Codex custom agent system involves declaring the agent’s role clearly enough that the orchestrator routes to it correctly, then specifying the model appropriate for that role. The description field in the agent registration is what the orchestrating model reads to make routing decisions, so it needs to specify the conditions under which the agent should and should not be invoked. A security reviewer described as “reviews code” will get invoked constantly. A security reviewer described as “reviews code that handles user input, authentication tokens, file paths, or external API responses for security vulnerabilities” will be invoked where it adds value.
The model choice and the description work together. A cheap model with a precise description applied to a narrow task produces good results and low costs. A reasoning model with a vague description applied to everything produces accurate but expensive results. The tuning is in both dimensions at once.
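In Agents SDK terms, the routing difference is carried entirely by the tool description string. This fragment reuses the `security_reviewer` agent from the earlier example and contrasts the two descriptions quoted above:

```python
# Vague: the orchestrator will route almost everything here.
vague = security_reviewer.as_tool(
    tool_name="review_security",
    tool_description="Reviews code",
)

# Precise: invoked only where a reasoning model earns its cost.
precise = security_reviewer.as_tool(
    tool_name="review_security",
    tool_description=(
        "Reviews code that handles user input, authentication tokens, "
        "file paths, or external API responses for security vulnerabilities"
    ),
)
```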
The feature Willison pointed to is a relatively small addition to what Codex can do. But it opens up a design space that single-model coding agents cannot enter: workflows where the capability profile of the AI changes task by task, matched to what each task actually requires. That is a different kind of flexibility than context isolation or parallel execution, and for teams thinking seriously about building agent-driven workflows, it is worth building around from the start.