From Autocomplete to Autonomous: What Five Years Did to Codex

The original OpenAI Codex shipped in August 2021 as a GPT-3 variant fine-tuned on roughly 54 million public GitHub repositories. It was technically impressive for its time: given a docstring, it could produce a working function. Given a comment, it could continue the pattern. GitHub licensed it to power early Copilot autocomplete, and for a while it felt like a genuine step change in how developers wrote code.

OpenAI deprecated the Codex API in March 2023. The reasoning was quiet but legible in retrospect: the base GPT-3.5 and GPT-4 models had absorbed so much code during training that the dedicated fine-tune no longer had an edge worth maintaining as a separate product. Codex as a model was folded into the larger foundation.

What OpenAI announced in April 2026 recycles the name but describes an entirely different category of product. Codex now refers to a software engineering agent, not a completion model. The gap between these two things is not incremental; it is architectural.

What the original Codex actually did

Codex was a next-token predictor with a strong prior toward syntactically valid code. Its interface was the completion API: you sent a prompt, it filled in what came next. The context window was 4,096 tokens. There was no execution, no tool use, no way for the model to run its output and observe what happened.

This was genuinely useful for boilerplate, for translating pseudocode into implementation, for suggesting API calls in unfamiliar libraries. But the ceiling was clear. Codex could not debug its own output. It could not look up documentation. It could not understand that its generated test suite was testing the wrong thing. It produced tokens; it did not solve problems.

The architecture that changed everything

The path from completion engine to software agent required several distinct capabilities, none of which the 2021 Codex had. The first was tool use: the model needed to invoke external systems, read their results, and incorporate that information into continued reasoning. The second was sandboxed execution: the model needed to actually run code, observe errors and output, and iterate based on what it saw. The third was significantly expanded context and improved long-horizon reasoning, because fixing a real bug requires holding a mental model of the entire relevant codebase, not just the adjacent lines.
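The propose, run, observe, iterate cycle described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not how any real agent is built: `scripted_policy` is a hypothetical stand-in for a model, and `run_in_sandbox` only gives process isolation in a throwaway directory, which is not a real security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a policy maps the history of
# (code, output) attempts to the next candidate program.
Attempt = Tuple[str, str]
Policy = Callable[[List[Attempt]], str]


def run_in_sandbox(code: str, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run code in a separate Python process inside a temporary directory.

    Returns (succeeded, combined stdout and stderr). This is process
    isolation only; a production sandbox would need far stronger limits.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stdout + proc.stderr


def agent_loop(policy: Policy, max_steps: int = 3) -> Tuple[bool, List[Attempt]]:
    """Ask the policy for code, execute it, and feed the output back in
    until a run succeeds or the step budget is exhausted."""
    history: List[Attempt] = []
    for _ in range(max_steps):
        code = policy(history)
        ok, output = run_in_sandbox(code)
        history.append((code, output))
        if ok:
            return True, history
    return False, history


def scripted_policy(history: List[Attempt]) -> str:
    """Stand-in for a model: emits a buggy program first, then a fix
    once the previous output contains a NameError."""
    if history and "NameError" in history[-1][1]:
        return "total = sum([1, 2, 3])\nassert total == 6\nprint(total)\n"
    return "assert totl == 6\n"
```

Calling `agent_loop(scripted_policy)` runs the buggy program, observes the `NameError`, and succeeds on the second attempt. The point is the shape of the loop: a completion model stops after the first call to `policy`, while an agent gets to see what its output actually did.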

OpenAI’s o-series models made the reasoning side of this tractable. The chain-of-thought training approach produced models that could work through multi-step problems in ways that the earlier Codex could not. Combined with tool-use scaffolding and execution sandboxes, the result was a qualitatively different category of behavior.

The benchmark that tracks this most directly is SWE-bench, which presents agents with real GitHub issues from popular open-source Python repositories and measures whether the agent can produce a patch that passes the repository's tests, including the tests associated with the fix that originally resolved the issue. In early 2024, the strongest systems were resolving around 13 to 14 percent of issues. By late 2024, capable models with appropriate scaffolding were reaching roughly 49 percent on the verified subset. The two figures come from different subsets of the benchmark, so they are not strictly comparable, but the progress within a single year was substantial, driven less by raw model capability than by the combination of better reasoning and better scaffolding infrastructure.
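To make the scoring rule concrete, here is a simplified sketch of how a single task might be judged. It is not the benchmark's harness code. The parameter names echo the dataset's `FAIL_TO_PASS` and `PASS_TO_PASS` fields, while the function names and the `results` dictionary are assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True if the patched repository passes every required test.

    results maps a test name to whether it passed after the patch.
    fail_to_pass lists tests the fix is expected to turn green.
    pass_to_pass lists tests that were green before and must stay green.
    A test missing from results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A patch that fixes the reported bug but breaks an unrelated test counts as unresolved, which is why headline percentages say something about regressions as well as fixes.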

The new Codex is OpenAI’s claim to be at or above that frontier, with the agent able to handle a wide range of software engineering tasks: writing features, fixing bugs, running tests, navigating codebases, creating pull requests. The “almost” in “Codex for almost everything” is doing real work, and it is worth taking seriously.

What “almost” actually covers

Autonomous coding agents are genuinely effective at a specific profile of task: well-scoped, verifiable, contained within a known codebase. Add a field to a data model. Write a test for this function. Fix the failing assertion with this error output. Refactor this method to use the new API. These tasks have clear acceptance criteria, often measurable by the existing test suite, and do not require judgment about product direction.

They are weaker at tasks requiring organizational context, stakeholder communication, or judgment about what to build and why. A coding agent can implement a specification but cannot produce one from a vague business requirement. It can pass a test suite but cannot evaluate whether the test suite covers the right cases. The “almost” boundary sits roughly at the edge of verifiable correctness.

This is not a criticism specific to the new Codex. It describes the state of all autonomous software engineering systems. Devin from Cognition AI shipped in 2024 with similar claims and similar constraints. GitHub Copilot Workspace has been iterating in the same space. The limiting factor is not engineering effort; it is the fundamental difficulty of specifying what “done” means for open-ended tasks.

The trust and verification problem

The more capable these agents become, the more the burden shifts from writing code to reviewing it. A developer who produces mediocre output is relatively cheap to supervise because the output is easy to reason about. An agent that produces a 400-line diff touching twelve files, all syntactically correct and test-passing, requires a different kind of review. The reviewer needs to understand not just what changed but whether the agent’s architectural choices were sound, whether the approach will scale, whether anything was quietly broken in the paths the tests do not cover.

This is the real open problem in autonomous software engineering, and “Codex for almost everything” does not resolve it. The HN discussion around the announcement spends significant time here. The productivity narrative is compelling until you account for the time required to safely review what the agent produced. For teams without strong test coverage or clear architectural boundaries, the review burden can exceed the generation benefit.

I build Discord bots and do a fair amount of systems programming. I use agentic coding tools daily. The tasks where autonomous generation saves the most time are also the tasks where I need to scrutinize the output least: adding a new command handler, wiring a config field through the stack, writing tests for pure functions. For anything that touches concurrency, event ordering, or persistent state, the review overhead grows faster than the generation benefit. That ratio is not unique to me.
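A back-of-the-envelope model makes that ratio easier to see. The function below is a hypothetical sketch, and every number in the usage note is invented for illustration rather than measured.

```python
def net_minutes_saved(manual_min: float, generate_min: float, review_min: float) -> float:
    """Time saved by delegating a task to an agent.

    manual_min: estimated time to write the change by hand.
    generate_min: time spent prompting and waiting for the agent.
    review_min: time needed to review the agent's diff safely.
    A negative result means delegating cost more than it saved.
    """
    return manual_min - (generate_min + review_min)
```

With made-up numbers: wiring a config field that would take 30 minutes by hand, 2 minutes to generate, and 5 minutes to review gives `net_minutes_saved(30, 2, 5)`, a saving of 23 minutes. A concurrency change that would take 120 minutes by hand, 5 to generate, and 150 to review safely gives `net_minutes_saved(120, 5, 150)`, a loss of 35 minutes. The generation term barely moves between the two cases; the review term decides the outcome.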

The distance between 2021 and now

The original Codex demonstrated that language models could generalize across code in ways that mattered to working developers. The new Codex makes the more ambitious claim that they can execute entire software development tasks from description to merged patch. Between those two positions sits five years of work on reasoning architectures, tool use protocols, sandboxed execution environments, and benchmarking infrastructure that made it possible to measure progress at all.

Whether “almost everything” covers 30 percent of what professional developers do or 70 percent is a question that the next year of production usage will answer more honestly than any benchmark. What is clear is that the product OpenAI is shipping now shares nothing with the API they deprecated in 2023 other than the name. That name is doing substantial marketing work. The underlying system deserves to be evaluated on its own terms, independent of whatever nostalgia the brand is borrowing from the original.