The 'Almost' in OpenAI's New Codex Is Doing Real Work

OpenAI has relaunched Codex with a scope that looks entirely different from the original. The 2021 Codex was a GPT-3 variant fine-tuned on code from public GitHub repositories, with around 12 billion parameters. It powered the early GitHub Copilot as a capable tab-completion engine: it autocompleted functions, suggested variable names, and worked well within those limits. The new positioning is something else. “Codex for almost everything” implies end-to-end software engineering work: understanding a codebase, writing code, running tests, reading failures, iterating until something works, committing the result.

OpenAI deprecated the original Codex API endpoints in March 2023 as GPT-3.5 and GPT-4 made the specialized model unnecessary. Reviving the name now is a statement about category, not continuity. Codex in 2021 meant “a model that understands code.” In 2026, the intent is closer to “an agent that does software work.” That shift is not incremental.

Why the Hedge Matters

The word “almost” is not modesty; it is load-bearing.

“Everything” would be easy to dismiss: software engineering involves too much contextual judgment, too much communication, and too much institutional knowledge for any current system to handle all of it. “Almost everything” is a harder target to argue against, because it does not specify what falls outside the scope. The question of where the boundary sits ends up being the most important question, and the framing defers it.

This pattern shows up consistently in autonomous agent announcements. The demonstrated capabilities are real and often impressive. The gap between what is demonstrated and what is implied tends to live in the messy, under-specified work that makes up most of a developer’s actual day.

The Benchmark Story

Any serious evaluation of coding agents starts with SWE-bench, a dataset of real GitHub issues drawn from popular open-source Python repositories. An agent receives a repository snapshot and an issue (a bug report or feature request), then has to produce a patch that passes the tests from the pull request that originally resolved the issue, without being shown those tests. It is a harder bar than code generation benchmarks because it requires understanding an existing codebase, not writing something from scratch.
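The pass criterion can be sketched in a few lines. This is a minimal illustration, assuming test outcomes arrive as a mapping from test name to pass/fail. SWE-bench splits the hidden tests into those the patch should fix (FAIL_TO_PASS) and those that must keep passing (PASS_TO_PASS); the function name, argument names, and test names below are hypothetical and are not the benchmark harness's API.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """Return True only if every required test passed after the patch.

    results: dict mapping test name -> bool (hypothetical format).
    fail_to_pass: tests that failed before the fix and must pass now.
    pass_to_pass: tests that passed before and must not regress.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results counts as a failure.
    return all(results.get(name, False) for name in required)


# Hypothetical outcome: the bug is fixed, but an existing test regressed.
results = {"test_bug_fixed": True, "test_existing_behavior": False}
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_behavior"]))
```

Under this criterion a patch that fixes the reported bug but breaks existing behavior does not count as resolved, which is part of why the scores start so low.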

Early results established the baseline difficulty. GPT-4 resolved around 1.7 percent of issues when it was first evaluated. Cognition AI’s Devin, announced in early 2024 and treated as a landmark moment for autonomous software engineering, resolved 13.86 percent of issues on a random 25 percent sample of the benchmark (SWE-bench Verified, the human-validated subset, did not appear until later that year). That number generated significant attention despite meaning the agent could not resolve roughly 86 percent of the issues it was tested on.

By 2025, numbers had improved substantially. Agents built on frontier models with well-designed scaffolding were crossing 50 percent on SWE-bench Verified. The jump came from better base models combined with better agentic infrastructure: reliable file editing, sandboxed code execution, structured tool-use loops, and the ability to recover from errors mid-task rather than stopping at the first failure.
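The error-recovery part of that scaffolding can be sketched as a loop that feeds test failures back into the next attempt. This is a toy illustration under stated assumptions, not any vendor's implementation: `propose_patch` and `run_tests` are hypothetical callables standing in for the model call and the sandboxed test run.

```python
def run_agent_loop(propose_patch, run_tests, max_attempts=3):
    """Propose a patch, test it, and retry with the failure output as feedback.

    Returns (patch, attempts_used) on success, or (None, max_attempts)
    if no attempt passed.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(feedback)   # stand-in for the model call
        ok, output = run_tests(patch)     # stand-in for sandboxed execution
        if ok:
            return patch, attempt
        feedback = output                 # recover: carry the failure forward
    return None, max_attempts


# Hypothetical stubs: the first attempt fails, the second uses the feedback.
def propose_patch(feedback):
    return "patch-v1" if feedback is None else "patch-v2"


def run_tests(patch):
    if patch == "patch-v2":
        return True, "all tests passed"
    return False, "AssertionError in test_parse"


print(run_agent_loop(propose_patch, run_tests))
```

The design point is the `feedback` variable: an agent that stops at the first failure never gets to use the most informative signal it has.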

What SWE-bench measures well is the ability to resolve a well-defined, isolated issue in a project the agent has never seen. What it does not capture is the ability to understand why a decision was made six months ago, to make a judgment call between two architecturally valid approaches, or to write code that will read naturally to the team maintaining it. Those are also parts of software engineering, and they are harder to benchmark.

The Competitive Context

OpenAI is entering a market that has moved fast. Anthropic’s Claude Code has positioned itself as a terminal-native coding agent that runs inside the developer’s own environment, giving direct access to the project’s actual git state and file system. Cursor’s agent mode has been pushing similar capabilities inside an IDE. GitHub Copilot Workspace integrates into the pull request workflow directly. Devin targets longer-horizon autonomous work with a cloud execution model.

The divergence in philosophy across these tools is meaningful. Some run locally in your repository, keeping you in the loop through your normal git workflow. Others use cloud sandboxes, which enable asynchronous longer-running tasks but distance you from what the agent is doing and why. Some require approval on every file change; others run until they hit a decision they cannot make alone.

The “almost everything” framing from OpenAI suggests they are targeting the broader autonomy end of that spectrum: cloud execution, longer task horizons, less continuous human oversight. For some workflows that is exactly the right trade-off. For others, the reduction in control costs more than the increase in automation saves.

The Review Problem

There is a harder issue that tends to get less attention in these announcements: the cost of trusting the output.

An agent that generates twenty pull requests a week still requires someone to review twenty pull requests a week. If anything, the review burden increases because the code comes from a system that writes confidently, without the social context that makes it natural to ask “are you sure about this?” when a junior developer does something surprising. Code from an agent can be subtly wrong in ways that look right on first reading, because the model optimizes for plausible-looking output rather than correctness it has deeply verified.

The risk asymmetry is worth sitting with. If you review agent output carefully, you get efficiency gains on the generation side, but review time caps how much you save overall. If you start reviewing less carefully because the agent is usually right, you are accepting a tail risk of shipping problems that would have been caught with full attention. Neither outcome is as clean as the benchmark numbers suggest.
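A back-of-the-envelope version of that trade-off, as a sketch. Every number here is an invented assumption for illustration, not a measurement, and the function name is hypothetical.

```python
def minutes_saved_per_change(write_min, peer_review_min, prompt_min, agent_review_min):
    """Human minutes saved on one change when an agent writes the first draft."""
    by_hand = write_min + peer_review_min        # author writes, a peer reviews
    with_agent = prompt_min + agent_review_min   # human prompts, then reviews
    return by_hand - with_agent


# Illustrative assumptions only: 120 min to write by hand, 30 min peer review,
# 10 min to prompt the agent, and either a careful (60 min) or skimmed (15 min)
# review of the agent's output.
careful = minutes_saved_per_change(120, 30, 10, 60)
skimmed = minutes_saved_per_change(120, 30, 10, 15)
print(careful, skimmed)
```

In this toy model the extra savings from skimming come entirely out of review time, which is exactly the time that would have caught the subtle errors.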

This is not an argument against using coding agents. The tools are genuinely useful, and the leverage is real. The question is whether “almost everything” implies a level of end-to-end trust that the review problem does not yet support.

What Changes in Practice

From experience with current coding agents, the pattern of where they work well is fairly consistent. Bounded tasks with clear specifications are well-served: implementing a function from a precise description, writing tests for existing code, refactoring a file that has grown too large, translating a design into working code when the design is detailed enough. The gains there are concrete.

The ceiling shows up when the task is poorly specified, when the right answer depends on understanding decisions made months ago, or when the work requires knowing which of several architecturally valid options fits the team’s conventions. Those tasks depend on context that is distributed across conversation history, documentation, commit messages, comments, and the developer’s own memory. Agents are getting better at pulling in more of that context, but knowing what context matters has not been solved.

The “almost everything” claim probably means: almost everything where the task is well-defined and the relevant context is accessible. That is a large category of work. It is also not quite the same as what a developer does most of the time.

Where This Lands

OpenAI reviving Codex as an agentic coding platform is a direct response to a competitive landscape that has developed quickly and has not settled. The announcement comes from a company that was in some respects behind in the agentic coding market and has been watching competitors build real adoption.

The original Codex was a technical milestone that shaped what people expected from AI-assisted coding for the next several years. The new Codex enters a much more specific conversation, with users who have spent time with these tools and formed sharper opinions about where they break down.

The benchmark trajectory across the industry is real progress, and the “almost everything” framing is not purely marketing. The question worth holding onto is what occupies the remaining slice, because that slice tends to be where the interesting and consequential work lives.