From Autocomplete to Agent: The Architecture Behind the New Codex

Most developers remember Codex as the model that made GitHub Copilot feel like magic in 2021. The Codex family, later exposed through the API under names such as code-davinci-002, descended from GPT-3 and was trained on public GitHub repositories, and it was genuinely impressive at the time. OpenAI quietly deprecated the original Codex API endpoints in March 2023, by which point GPT-3.5-turbo had largely overtaken it on code tasks. That was effectively the end of Codex as a product.

So when OpenAI announced Codex for almost everything, the revived name was somewhat misleading. The new Codex shares almost nothing with the original model except the brand. What OpenAI shipped is a fundamentally different kind of system, and the interesting engineering story is in exactly what had to change to get from stateless code completion to autonomous software engineering.

What Made the Original Codex a Dead End for Agents

The 2021 Codex model was a stateless function. You fed it a prompt, it emitted tokens, it was done. That design was fine for autocomplete but completely inadequate for software engineering tasks, which are inherently iterative. Writing code is easy; writing code that actually runs is the hard part. Any system that doesn’t have a feedback loop, that can’t execute its own output and observe what breaks, is going to hit a ceiling fast.

The limitation wasn’t the quality of the generated code. Even GPT-4, a far stronger model than the original Codex, could reliably generate plausible-looking code. The limitation was that “plausible-looking” and “correct” diverge rapidly as task complexity increases. Real software engineering involves reading error messages, adjusting, re-running tests, and checking whether a function actually returns what you think it returns. You need a loop, not a one-shot completion.
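To make the gap between “plausible-looking” and “correct” concrete, here is a small illustrative sketch. The function names and test cases are invented for this example; the point is only that the first version reads fine and still fails the moment something actually runs it.

```python
def median_plausible(values):
    """Looks right at a glance, but mishandles even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def median_checked(values):
    """The kind of revision you arrive at after seeing the failing check."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def run_checks(fn):
    """Return human-readable failures; an empty list means every check passed."""
    cases = [([3, 1, 2], 2), ([4, 1, 3, 2], 2.5)]
    failures = []
    for values, expected in cases:
        actual = fn(values)
        if actual != expected:
            failures.append(f"{fn.__name__}({values}) = {actual}, expected {expected}")
    return failures


if __name__ == "__main__":
    print(run_checks(median_plausible))  # one failure, for the even-length case
    print(run_checks(median_checked))    # no failures
```

A one-shot completion stops after the first function. Only a system that executes the checks ever finds out that a second attempt is needed.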

This is why the 2021 Codex, and every pure completion model that followed it, hit a practical ceiling around “write me a function to do X”. Anything more complex required a human in the loop at every iteration.

The New Architecture: Model, Loop, and Sandbox

The new Codex is built on codex-1, a fine-tuned variant of o3 that OpenAI trained specifically for software engineering tasks using reinforcement learning in sandboxed environments. The choice of o3 as the base matters. The o-series models are trained to produce an extended chain of thought before committing to an answer, which means the model can plan and work through multi-step problems rather than emitting a solution in a single pass.

But the model alone isn’t the product. The product is the agentic loop around it. The system can read files, write files, execute shell commands, run test suites, and observe the results before deciding what to do next. That loop is what enables it to actually complete tasks rather than just generate plausible code fragments.
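A minimal sketch of that loop, under heavy assumptions: the tool names, the action dictionary shape, and `scripted_policy` are all hypothetical and stand in for the model, which is not called here. It is not how Codex is implemented internally; it only shows the act, observe, decide cycle.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def read_file(root, path):
    return (root / path).read_text()


def write_file(root, path, content):
    (root / path).write_text(content)
    return f"wrote {path}"


def run_command(root, argv):
    # No shell involved; output is captured so it can be fed back as an observation.
    proc = subprocess.run(argv, cwd=root, capture_output=True, text=True, timeout=30)
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


# Hypothetical tool layer: every tool takes the workspace root first.
TOOLS = {"read_file": read_file, "write_file": write_file, "run_command": run_command}


def scripted_policy(history):
    """Stand-in for the model: choose the next action from past observations."""
    check = {"tool": "run_command", "args": [[sys.executable, "check.py"]]}
    if not history:
        return check
    last_action, last_observation = history[-1]
    if last_action["tool"] != "run_command":
        return check  # after an edit, re-run the checks
    if last_observation.startswith("exit=0"):
        return {"tool": "done", "args": []}
    return {"tool": "write_file", "args": ["answer.txt", "42"]}


def agent_loop(root, policy, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if action["tool"] == "done":
            return history
        observation = TOOLS[action["tool"]](root, *action["args"])
        history.append((action, observation))
    raise RuntimeError("step budget exhausted before the task was completed")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Toy "test suite": exits non-zero until answer.txt contains 42.
        (root / "check.py").write_text(
            "import pathlib, sys\n"
            "p = pathlib.Path('answer.txt')\n"
            "sys.exit(0 if p.exists() and p.read_text().strip() == '42' else 1)\n"
        )
        return [action["tool"] for action, _ in agent_loop(root, scripted_policy)]


if __name__ == "__main__":
    print(demo())
```

The step budget is the one design choice worth copying: a loop that can act on its own output also needs a way to stop when it is not converging.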

Sandboxing is non-trivial here and often gets glossed over. On macOS, the Codex CLI uses Apple’s Seatbelt (the same mechanism that App Store apps use) to restrict filesystem and network access. On Linux, execution happens inside Docker containers or via nsjail for tighter isolation. This matters not just for security but for reproducibility: sandboxed execution means the agent is running in a clean, predictable environment, which reduces the class of failures caused by environment state drift.
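The reproducibility half of that argument can be illustrated without any real sandbox. The sketch below is an assumption-laden toy for POSIX-like systems: `run_isolated` is a hypothetical helper, it is not Seatbelt, Docker, or nsjail, and it provides no security boundary at all. It only removes two common sources of drift, leftover files and inherited environment variables.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(argv, timeout=10):
    """Run argv in a fresh scratch directory with a minimal environment.

    NOT a security boundary: the child can still reach the rest of the
    filesystem and the network. This only makes the starting state predictable.
    """
    with tempfile.TemporaryDirectory() as scratch:
        proc = subprocess.run(
            argv,
            cwd=scratch,                 # empty working directory on every run
            env={"PATH": os.defpath},    # drop everything inherited from the caller
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode, proc.stdout


def probe():
    """Report what the child process sees: its directory listing and whether HOME leaked in."""
    script = "import os; print(os.listdir('.'), 'HOME' in os.environ)"
    return run_isolated([sys.executable, "-c", script])


if __name__ == "__main__":
    print(probe())
```

A real sandbox layers enforcement on top of this: the operating system refuses writes outside the workspace and blocks network access, rather than trusting the child to behave.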

The open-source CLI, available at openai/codex on GitHub and written in TypeScript, exposes three approval modes: fully automated (the agent does everything without asking), semi-automated (it asks before running shell commands), and manual (it proposes every action for human approval). That tiered model is the right design for a tool that’s going to touch your filesystem. Starting in full-auto mode on a production codebase is a bad idea regardless of benchmark scores.
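The tiering reduces to a small decision function. The sketch below uses hypothetical enum names chosen to mirror the description above; they are not the CLI’s actual flags or internal types.

```python
from enum import Enum


class ApprovalMode(Enum):
    # Hypothetical labels, not the CLI's real option names.
    FULL_AUTO = "full-auto"
    SEMI_AUTO = "semi-auto"
    MANUAL = "manual"


class ActionKind(Enum):
    READ_FILE = "read_file"
    WRITE_FILE = "write_file"
    RUN_SHELL = "run_shell"


def needs_approval(mode, action):
    """Decide whether a human must confirm the action before it runs."""
    if mode is ApprovalMode.FULL_AUTO:
        return False                           # the agent does everything without asking
    if mode is ApprovalMode.SEMI_AUTO:
        return action is ActionKind.RUN_SHELL  # only shell commands are gated
    return True                                # manual: every action is proposed first


if __name__ == "__main__":
    for mode in ApprovalMode:
        gated = [kind.value for kind in ActionKind if needs_approval(mode, kind)]
        print(mode.value, "asks before:", gated)
```

Keeping the policy in one pure function like this makes it easy to audit, which matters more than usual for a tool that can run arbitrary commands.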

What SWE-Bench Verified Actually Tells You

OpenAI reported codex-1 achieving approximately 72% on SWE-bench Verified, which measures the agent’s ability to resolve real GitHub issues from open-source Python repositories. That number matters for context, but it needs some unpacking.

SWE-bench Verified is a curated subset of SWE-bench where human annotators confirmed that the test suite actually validates the fix. That filtering makes it a more reliable benchmark than the full set, where underspecified issues and unreliable tests distort scores. A 72% pass rate means the agent produced a patch that made a real GitHub issue’s tests pass a little over seven times in ten.
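For readers who want the scoring rule spelled out, here is a rough sketch. SWE-bench tasks list tests that must go from failing to passing and tests that must keep passing; the function names, the result dictionary format, and the counts below are illustrative assumptions, not the official harness.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """A task counts as resolved only if the target tests now pass
    and the previously passing tests were not broken by the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "passed" for name in required)


def resolved_rate(outcomes):
    """Fraction of tasks resolved, given one boolean per task."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # One hypothetical task: the patch fixes the bug but breaks an older test.
    results = {"test_bug_fixed": "passed", "test_existing_behaviour": "failed"}
    print(is_resolved(results, ["test_bug_fixed"], ["test_existing_behaviour"]))

    # Illustrative counts only: 18 resolved out of 25 tasks.
    print(resolved_rate([True] * 18 + [False] * 7))
```

The all-or-nothing rule is why the headline number is stricter than it sounds: a patch that fixes the issue but breaks one existing test scores zero for that task.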

What the benchmark doesn’t measure: tasks that require understanding cross-service dependencies, work involving UI or visual output, tasks where the success criterion isn’t expressible as a test suite, or anything that requires coordinating across multiple repositories. The benchmark is Python-centric and heavily biased toward isolated, well-scoped issues. “Almost everything” in the product name reflects this honestly: it’s most things in a certain class of tasks, not software engineering in general.

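For comparison, Cognition’s Devin, which announced itself as the first autonomous software engineer in early 2024, scored around 13.8% on the original SWE-bench at launch. The two figures come from different benchmark variants, so they are not directly comparable, but even allowing for that, the progress across roughly eighteen months is not incremental.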

The Competitive Landscape Is Genuinely Crowded

OpenAI launched Codex CLI at roughly the same time Anthropic shipped Claude Code, their own terminal-native coding agent. The timing was close enough that the two announcements stepped on each other. Both tools follow the same architectural pattern: a capable base model, a file and shell tool layer, sandboxed execution, and an approval flow for risky operations.

The differentiators are subtle but real. Claude Code is built on Anthropic’s Claude models and tends to be praised for its editing precision and its handling of large codebases where context management matters. Codex is built on the o3 reasoning line and benefits from that model’s strength on tasks that require multi-step planning. The Hacker News discussion of the Codex launch featured predictable comparisons between the two, with developers noting that the choice often comes down to which model you find more reliable for your specific stack and task type.

GitHub Copilot has been building toward agentic features within VS Code, but it’s architecturally different in that it’s IDE-integrated rather than terminal-native. Cursor occupies a similar niche with a strong IDE-first experience. For developers who live in the terminal, Codex CLI and Claude Code are the natural comparison points. For developers who live in VS Code, Copilot’s agent mode is the more natural fit, even if the raw capability benchmark is lower.

Devin from Cognition remains interesting as a cloud-hosted agent with its own browser and development environment, positioning itself more as a “hire an agent” product than a developer tool. The use cases don’t fully overlap.

What “Almost Everything” Actually Means in Practice

The qualifier in the product name is doing real work. Coding agents in 2026 are excellent at a specific class of tasks: isolated, well-specified changes where correctness can be verified by running tests. They’re reliable for adding a new API endpoint with documented inputs and outputs, writing a migration for a schema change, converting a module from one format to another, and fixing a bug where the expected behavior is clearly specified in the issue.

They’re unreliable for tasks where the requirements are ambiguous, work that requires understanding organizational context (why was this designed this way?), changes that span multiple services with implicit contracts between them, and anything where the feedback loop is slow or expensive. If you can’t run tests locally that will tell the agent whether it succeeded, the agent is flying blind.

The pattern that works well is giving the agent a specific task with a clear success criterion, letting it run in a sandbox, reviewing the diff it produces, and then deciding whether to apply it. That’s closer to reviewing a junior developer’s PR than to having a senior engineer on your team. The agent doesn’t understand your system; it understands code.
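The review step in that workflow is easy to sketch with the standard library. Everything here is illustrative: `propose_change` and `should_apply` are hypothetical helpers, and the file name and contents are made up.

```python
import difflib


def propose_change(path, before, after):
    """Render the agent's edit as a unified diff for a human to review."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


def should_apply(tests_passed, reviewer_approved):
    """Both gates must hold: a clear success criterion and a human sign-off."""
    return tests_passed and reviewer_approved


if __name__ == "__main__":
    before = "def add(a, b):\n    return a - b\n"
    after = "def add(a, b):\n    return a + b\n"
    print(propose_change("calc.py", before, after))
    print(should_apply(tests_passed=True, reviewer_approved=False))
```

Passing tests alone never applies the change in this sketch, which is the same stance you would take with a junior developer’s PR.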

For a bot project like the kind I work on, the practical value shows up in exactly the unglamorous work: generating test coverage for existing handlers, writing boilerplate for a new command that follows an existing pattern, converting a module to use a new internal API. That’s genuinely useful, even if it’s not the autonomous software engineer the marketing copy implies.

The original 2021 Codex was impressive for its time and then clearly bounded. The 2025-2026 version is a different kind of system. The ceiling is much higher, but the ceiling still exists, and understanding where it sits is more useful than either the hype or the backlash.