OpenAI Brings Back Codex, and This Time It Means Something Different

OpenAI quietly retired the original Codex API in March 2023. At the time it felt like an inevitable consolidation: GPT-4 was better at code than Codex had ever been, and the Codex model family had been upstaged so thoroughly that maintaining a separate API for it stopped making sense. GitHub Copilot moved on to GPT-4-class models under the hood, and the Codex name faded into something you mentioned in historical context.

Now the name is back, and the positioning is fundamentally different. The original Codex was a model: you sent it a prompt, it returned a completion. That was the whole interface. The new Codex is closer to an agent product, one that can take a software engineering task described in natural language and execute it across multiple steps, touching real files, running commands, reading test output, and iterating until it has something worth showing you.

The “almost everything” framing is doing real work in that title. It is not just code generation. It is closer to what Cognition’s Devin promised in 2024, or what Anthropic shipped with Claude Code: a system that understands a codebase in context, can navigate it without hand-holding, and can complete a task that would have taken a developer an hour of focused work.

What Changed Between 2021 and Now

The gap between the original Codex and this relaunch is not primarily a model capability gap, though that matters. It is an architectural one. The 2021 Codex was designed around the autocomplete paradigm, the same paradigm that Copilot still uses for its inline suggestions. You get a cursor position, you complete what comes next. The model has no persistent state, no ability to read test results, no mechanism for trying something, seeing it fail, and trying again.

What made that paradigm limiting was not that the model was bad at writing code. Even in 2021, Codex could produce impressively functional functions given a clear docstring. The problem was that software engineering is not a series of isolated completion requests. It is a feedback loop. You write something, run it, read the error, adjust the approach, check the surrounding code for assumptions you violated, look at what the test framework expects, and try again.

Agents address this structurally. The underlying model does not change much, but the scaffolding around it does. The agent loop typically involves a model that can emit tool calls, a runtime that executes those calls against real or sandboxed environments, and a feedback mechanism that feeds results back into the context window. SWE-Bench, which became the standard benchmark for measuring this kind of capability, evaluates agents on exactly this: given a real GitHub issue from a real open-source repository, can the agent produce a patch that passes the existing test suite?

Early SWE-Bench numbers from 2024 were in the 12-20% range for the best systems. By mid-2025, leading agents were pushing past 50% on the verified subset. The underlying models improved, but so did the scaffolding: better context management, smarter tool use, more careful handling of long codebases that do not fit in a single context window.

The Competitive Context

OpenAI is entering a market that already has clear players. Anthropic’s Claude Code ships as a CLI tool that runs in your terminal, reads your actual files, executes shell commands, and integrates with your existing workflow without requiring you to upload your codebase anywhere. GitHub Copilot Workspace is a cloud-based take on the same idea, integrated directly into the GitHub pull request workflow. Devin from Cognition runs in a remote sandboxed environment and has been targeting enterprise engineering teams with an emphasis on long-running autonomous tasks.

Each of these reflects a different philosophy about where the agent should live and how much autonomy it should have. Running locally, as Claude Code does, means you do not have to trust a third party with your codebase. It also means the agent can access your actual development environment, your language server, your running processes, your test runner. Running in the cloud, as Devin and Copilot Workspace do, means the environment is reproducible and isolated, and you can review what the agent did before accepting it.

OpenAI’s position with the new Codex likely involves a cloud-based sandbox approach, consistent with how they have structured Operator and other agentic products. The sandboxed environment solves the trust and isolation problem cleanly, but it introduces the context transfer problem: the agent needs to understand your codebase without having spent weeks inside it the way your senior engineer has. Good retrieval, indexed understanding of the repository, and careful prompt engineering around context injection are the levers that make or break cloud-based coding agents.

What Actually Matters for Developers Using These Tools

Having used several of these tools on real projects, what I keep noticing is that the meaningful variable is not the model’s raw coding ability. Most of the frontier models can write competent code for well-defined tasks. The variable is how well the system handles the things that are not well-defined: the implicit conventions in a codebase, the test infrastructure that only works if you configure your environment a specific way, the fact that one module has a slightly different error handling pattern than everything else and the agent needs to notice and respect that.

This is where agent products diverge from each other much more than their headline benchmark numbers suggest. A system that scores well on SWE-Bench’s isolated repository tasks might still frustrate you on your own codebase, because your codebase has fifteen years of accumulated decisions that are not documented anywhere.

Context anchoring, the practice of maintaining persistent documentation that captures architectural decisions, conventions, and the “why” behind non-obvious choices, helps considerably here. A well-maintained AGENTS.md or similar file that gives an agent orientation before it starts working reduces the failure modes that come from the agent making plausible but wrong assumptions. This is worth doing regardless of which agent product you use.

The Brand Revival Question

It is worth pausing on the decision to use the Codex name at all. OpenAI has a portfolio of product names that carry technical weight with developers: GPT, DALL-E, Whisper, Sora, Operator. Codex was associated specifically with the code completion era, and with GitHub Copilot’s early days. Reviving it for an agentic product is a deliberate signal about intended audience.

The message seems to be that Codex is for developers who want to delegate engineering tasks, not just get autocomplete suggestions. That is a meaningfully different product category, and using a distinct name for it makes sense. It also lets OpenAI position this separately from ChatGPT’s coding capabilities, which are impressive but general-purpose. Codex can be marketed with developer-specific tooling, developer-specific integrations, and developer-specific pricing that would feel out of place in a consumer chat interface.

Where This Goes

The realistic near-term picture for autonomous coding agents is not full autonomy on complex tasks. It is reliable autonomy on well-scoped tasks: add this endpoint, write tests for this module, update this dependency and fix the breakage, migrate these files to the new API. These tasks are tractable today, and systems like the new Codex that handle them reliably would save meaningful time even if they cannot touch the hard architectural problems.

The harder question is what happens as these tools get better at the medium-complexity tasks that currently require a developer to hold a lot of context in their head. That is not a question about Codex specifically; it is about the trajectory of the whole category. The tools being built now are the infrastructure on which that future runs, and OpenAI throwing the Codex brand at it is a statement about how seriously they take that direction.