From Code Completion to Software Engineering Agent: What the New Codex Actually Does

The word “almost” in OpenAI’s latest Codex announcement is doing more work than it appears. It is not false modesty. It is a load-bearing qualifier that tells you something real about the current state of agentic software engineering, and understanding what sits on either side of that line is worth the time.

To appreciate how far Codex has come, it helps to remember what it was. When OpenAI released Codex in August 2021, it was a GPT-3-class model fine-tuned on public GitHub repositories. Its job was next-token prediction over code, and it was good at that job. It powered the first versions of GitHub Copilot, autocompleting function bodies and suggesting imports. The model itself had no tools, no ability to run code, and no awareness of whether what it produced actually worked. It was, in the taxonomy of 2021, a very capable autocomplete engine.

That version of Codex was deprecated in March 2023, superseded by GPT-4, which was better at code without being purpose-trained for it. The Codex name went quiet for a while.

The Agentic Reframe

The Codex that OpenAI is talking about now is architecturally unrelated to that 2021 model. It is not a fine-tuned completion model. It is an agent: a system that wraps a language model in a loop that can read files, write files, execute shell commands, inspect output, and iterate based on what it observes. The model driving it is from the o-series or GPT-4o family depending on the task, but the model is almost secondary to the scaffolding around it.

The Codex CLI, which OpenAI open-sourced in April 2025, gave a clear view of this architecture. The CLI runs in your terminal, operates in one of three approval modes (suggest, auto-edit, and full-auto), and sandboxes execution using macOS Seatbelt or Docker depending on your platform. Network access is disabled by default in full-auto mode, which is a sensible constraint: an agent that can make arbitrary outbound connections while autonomously modifying your codebase is a much larger attack surface than one that cannot.
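
To make the three modes concrete, here is a minimal sketch of an approval gate. The mapping is an assumption based on how the modes are commonly described (suggest asks before edits and commands, auto-edit asks only before commands, full-auto asks for neither). The names `Mode`, `Action`, `AUTO_APPROVED`, and `needs_approval` are hypothetical and are not part of the Codex CLI.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class Action(Enum):
    READ = "read"    # read a file
    EDIT = "edit"    # apply a patch to a file
    SHELL = "shell"  # run a shell command


# Assumed mapping (not taken from the CLI source): the action kinds each
# mode lets through without asking a human first.
AUTO_APPROVED = {
    Mode.SUGGEST: {Action.READ},
    Mode.AUTO_EDIT: {Action.READ, Action.EDIT},
    Mode.FULL_AUTO: {Action.READ, Action.EDIT, Action.SHELL},
}


def needs_approval(mode: Mode, action: Action) -> bool:
    """Return True when a human must confirm the action before it runs."""
    return action not in AUTO_APPROVED[mode]
```

Under this mapping, `needs_approval(Mode.AUTO_EDIT, Action.SHELL)` is `True` while `needs_approval(Mode.AUTO_EDIT, Action.EDIT)` is `False`. The point of expressing the policy as data is that the gate sits outside the model: the agent can propose anything, but the mode decides what runs unattended.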

The tool loop is what makes this different from Copilot-style completion. When Codex is given a task, it does not produce a single code block and wait. It reads the relevant files, makes edits, runs the test suite, inspects the failures, adjusts, and repeats. The verification step is where a lot of the real value lives. A model that can check its own work against a concrete signal (test pass/fail, lint output, type errors) can recover from its own mistakes in ways that a one-shot completion cannot.

# Simplified view of the Codex agent loop (pseudocode)
result = None
while True:
    observation = read_context(result)       # files, test output, stderr
    action = model.next_action(observation)  # edit, run, search, ask
    result = execute(action, sandbox)        # sandboxed shell or fs op
    if result.is_terminal:                   # task finished or agent stopped
        break

This loop is not novel. LangChain, AutoGPT, and various research systems explored it from 2022 onward. What has changed is that the underlying models now follow tool-use instructions reliably enough for the loop to converge instead of spiral.
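
The difference between converging and spiraling is easier to see in a runnable form. The sketch below is an illustration, not the Codex implementation: `propose` stands in for the model, `verify` stands in for a concrete signal such as a test run, and `max_steps` is an assumed safety budget. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str  # e.g. failing test names, lint output, type errors


def run_until_verified(
    propose: Callable[[str], str],
    verify: Callable[[str], Verdict],
    max_steps: int = 5,
) -> Tuple[Optional[str], int]:
    """Propose, verify, and retry with feedback until the check passes
    or the step budget runs out.

    Returns (accepted candidate or None, number of steps used).
    """
    feedback = ""
    for step in range(1, max_steps + 1):
        candidate = propose(feedback)   # the model's next attempt
        verdict = verify(candidate)     # the external oracle
        if verdict.passed:
            return candidate, step
        feedback = verdict.feedback     # feed the failure back in
    return None, max_steps              # budget exhausted: did not converge
```

A proposer that uses the feedback converges in a couple of steps; one that ignores it burns the whole budget and returns `None`. The step cap is what keeps an unreliable proposer from spiraling indefinitely.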

What the Architecture Gets Right

Three design decisions in the current Codex stand out as technically sound.

First, the sandboxing story is honest. Rather than claiming the agent is safe while silently allowing unrestricted execution, OpenAI’s implementation gives operators explicit control over the capability level: network isolation is the default in full-auto mode, and broader access requires an explicit opt-in. For someone running this against a production codebase, that matters.

Second, the approval modes map well to real trust levels. You would not run full-auto on an unfamiliar codebase with no test coverage. You might run it on a well-tested library module where the tests are the oracle. The suggest mode lets you treat the agent more like a very capable colleague who drafts changes for your review, which is often the right default.

Third, context management is taken seriously. Large codebases do not fit in a context window. The agent needs a retrieval strategy, and modern coding agents (including Codex, Claude Code, and Cursor) have converged on a combination of file tree inspection, semantic search, and targeted reads rather than trying to ingest entire repositories. This is the right tradeoff: you want the agent to find the ten relevant files, not summarize the three thousand irrelevant ones.
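
As a rough illustration of targeted retrieval, the sketch below ranks files by how many tokens they share with the task description and returns only the top few. It uses plain lexical overlap as a stand-in for semantic search, and it is not how any of the named tools actually rank files. The function names and the in-memory `files` mapping are assumptions made for the example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into identifier-like tokens."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def rank_files(task: str, files: Dict[str, str], top_k: int = 10) -> List[str]:
    """Return up to top_k paths whose path and contents share the most
    tokens with the task description. Files with no overlap are dropped."""
    query = tokenize(task)
    scored = []
    for path, content in files.items():
        overlap = len(query & tokenize(path + " " + content))
        if overlap:
            scored.append((overlap, path))
    # Highest overlap first; ties broken by path for deterministic output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:top_k]]
```

Given a task like "fix refresh_token expiry in session", a file at `auth/session.py` that defines `refresh_token` ranks first, and files with no shared tokens never reach the context window at all. A real agent would swap the scoring function for embeddings or a code index, but the shape (score, cut off, then read) stays the same.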

The Benchmark Question

OpenAI has cited SWE-bench numbers in their Codex positioning, and SWE-bench is worth understanding. It consists of real GitHub issues from popular Python repositories, with the agent’s job being to produce a patch that makes the failing tests pass. It is a more grounded benchmark than many code benchmarks because it involves reading an actual issue, finding the relevant code, making a targeted fix, and not breaking other tests.
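
The success criterion can be sketched in a few lines. SWE-bench task instances distinguish tests that should go from failing to passing from tests that were already passing and must stay that way. The function below is a hypothetical helper that expresses that rule; it is not the benchmark harness's API, and the `results` mapping of test name to pass/fail is an assumed input format.

```python
from typing import Dict, Iterable


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True when every target test now passes and no previously passing
    test broke. A test missing from results counts as a failure."""
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken
```

A patch that fixes the reported bug but breaks an unrelated test does not count as resolved under this rule, which is what makes the benchmark harder to game than one that only checks the target behavior.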

Recent top scores on SWE-bench Verified have been in the 50 to 70 percent range depending on the system, which means a substantial fraction of real software issues remain out of reach. The failures cluster around tasks requiring broader context (understanding how a feature was designed to work, not just what is broken), multi-step architectural changes, and anything involving ambiguous requirements. These are not implementation problems that better models will cleanly solve. They are problems of specification clarity, which is a different category.

What “Almost” Actually Excludes

The honest reading of “Codex for almost everything” is that the current system handles well-defined, verifiable tasks with high reliability. Writing a function given a clear description. Fixing a bug where the expected behavior is unambiguous. Adding a test for an existing function. Refactoring code to match a specified pattern. These are tasks where the success condition is checkable, the scope is bounded, and the agent can observe whether it has succeeded.

The “almost” starts to slip when the task requires judgment that is not encodable as a test. Should this API return a 404 or a 403 when the resource exists but the user lacks permission? How should this error message read for a non-technical user? Is this abstraction the right level for this use case, or does it optimistically assume the codebase will grow in a direction it may not? These are questions where the agent can produce an answer, but there is no automated oracle to verify it, and wrong answers do not always surface immediately.

From my own experience building and maintaining a Discord bot with relatively complex state management, the tasks where I would fully trust an autonomous agent are a minority of the actual work. They are also, not coincidentally, the most tedious tasks: writing boilerplate, adding parameter validation, updating tests to match a changed interface. Offloading those is genuinely useful. The judgment calls around how the system should behave at its edges are not tasks I would hand off without review.

The Competitive Context

Codex is not operating in isolation. Anthropic’s Claude Code takes a similar terminal-agent approach with comparable tool capabilities. Cursor integrates the agent loop into an IDE, which lowers the friction for developers who think primarily in terms of files rather than terminal sessions. Devin, from Cognition AI, targets longer-horizon autonomous tasks and runs in a cloud environment rather than locally. GitHub Copilot Workspace takes the issue-to-pull-request pipeline and embeds it directly in the repository interface.

The differentiation between these systems is narrowing. The underlying models are close enough in capability that the real differences are in the scaffolding: how context is managed, how verification is done, how the human is kept in the loop, and how the sandboxing story holds up under adversarial conditions. Prompt injection attacks on coding agents are a real concern, and any system that processes code it did not write (dependencies, external documentation, issue templates) is a potential vector.

Where This Goes

The honest trajectory for coding agents is not that they will soon handle everything. It is that the boundary of what they handle reliably will expand, and the character of the remaining human work will shift. Writing boilerplate, wiring up integrations, and fixing well-understood bug classes will increasingly be machine work. The remaining human contribution will concentrate in the things that require judgment: deciding what to build, specifying success conditions clearly enough for an agent to verify them, and reviewing outputs with the context of how the system is actually used.

That is not a trivial shift. For most developers, writing a good specification is harder than writing the corresponding code. The bottleneck moves upstream, not away.

For now, “Codex for almost everything” is a real and useful tool. The “almost” is not a marketing hedge. It is an accurate description of where the capability frontier sits, and taking it seriously is more useful than either dismissing the tool or overstating what it can do.