Codex Goes Agentic, and the 'Almost' Is Doing a Lot of Work

Source: hackernews

The name Codex has been through a few lives. When OpenAI first released it in 2021, it was a model: a GPT-3 fine-tune on GitHub code that powered GitHub Copilot and demonstrated that large language models could complete functions and suggest boilerplate with surprising accuracy. By 2023 the Codex model name had largely faded into the background as GPT-4 took over, and most people just called the thing Copilot or ChatGPT.

OpenAI’s new Codex announcement revives the name for a different kind of product: a cloud-hosted software engineering agent that handles full tasks asynchronously in sandboxed environments, not just token-by-token completions inline in your editor. The shift in framing, from a model you call to an agent you assign work to, matters more than the branding, and the Hacker News discussion that accompanied it reflects genuine interest in whether the capability claims hold up under scrutiny.

The Architecture Underneath

The cloud Codex agent runs each task in an isolated environment with the repository checked out, network access restricted, and a time budget to complete the work. You hand it a task description, it runs in the background, and returns with diffs, explanations, and test results. Multiple tasks can run in parallel. This is architecturally similar to what Devin (from Cognition) shipped in early 2024: a persistent, sandboxed computer-use agent that could write code, run terminals, browse documentation, and open pull requests.
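The shape of that workflow, several tasks started at once, each with its own time budget, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_in_sandbox`, `run_task`, `run_backlog`, and `TaskResult` are hypothetical names, and the sleep stands in for whatever the real agent does inside its container. None of this is OpenAI's API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str
    status: str  # "done" or "timed_out"
    diff: str = ""


async def run_in_sandbox(task: str, work_seconds: float) -> str:
    """Hypothetical stand-in for an agent working inside an isolated checkout."""
    await asyncio.sleep(work_seconds)  # pretend to edit files and run tests
    return f"diff for: {task}"


async def run_task(task: str, work_seconds: float, budget_seconds: float) -> TaskResult:
    """Run one task and give up once its time budget is spent."""
    try:
        diff = await asyncio.wait_for(
            run_in_sandbox(task, work_seconds), timeout=budget_seconds
        )
        return TaskResult(task, "done", diff)
    except asyncio.TimeoutError:
        return TaskResult(task, "timed_out")


async def run_backlog(backlog: dict[str, float], budget_seconds: float) -> list[TaskResult]:
    """Start every task at once; gather returns results in input order."""
    jobs = [run_task(task, secs, budget_seconds) for task, secs in backlog.items()]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    # Simulated work durations in seconds (assumed values for the demo).
    backlog = {"fix flaky date parser": 0.01, "rewrite the build system": 1.0}
    for result in asyncio.run(run_backlog(backlog, budget_seconds=0.1)):
        print(result.status, "-", result.task)
```

The point of the sketch is the failure mode: a task that blows its budget comes back as a status with no diff, not as a half-finished change.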

The model underneath is codex-1, described as a fine-tune of o3 optimized specifically for software engineering workflows. OpenAI reported around 72% on SWE-bench Verified at launch, which is a meaningful number in context. SWE-bench Verified is a human-validated subset of the original SWE-bench: annotators confirmed that each issue is adequately specified and that its test suite actually validates the fix, filtering out tasks with ambiguous descriptions or broken test coverage. A 72% score means the agent resolves roughly seven in ten real GitHub issues from open-source Python repositories, given only the issue text and the codebase.
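To put the percentage in task counts: SWE-bench Verified contains 500 tasks, so the arithmetic is short. The helper name `resolved_tasks` is made up for this illustration, and 0.72 is the approximate launch figure quoted above, not an exact score.

```python
def resolved_tasks(pass_rate: float, total_tasks: int) -> tuple[int, int]:
    """Return (resolved, unresolved) task counts for a benchmark pass rate."""
    resolved = round(pass_rate * total_tasks)
    return resolved, total_tasks - resolved


# SWE-bench Verified has 500 tasks; 0.72 is the approximate reported rate.
resolved, unresolved = resolved_tasks(0.72, 500)
print(f"{resolved} resolved, {unresolved} unresolved")
```

The unresolved count is the one to keep in mind: even at this score, well over a hundred curated, well-tested issues still come back wrong or unfinished.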

For comparison, earlier strong results on SWE-bench Verified came from Claude 3.5/3.7 Sonnet with custom scaffolding in the 49 to 62 percent range, and from various research systems in the 40 to 60 percent range. Devin’s original release scored around 13.8% on the non-verified benchmark. The jump is substantial, though part of it is the model (o3-class reasoning is genuinely better at multi-step debugging) and part is the training distribution: if you fine-tune specifically on software engineering tasks and evaluate on a software engineering benchmark, you are going to do well there. That is not a criticism, just a calibration.

What the Open-Source CLI Does Differently

Running parallel to the cloud agent is the open-source Codex CLI, released under Apache 2.0 in early 2025. The CLI version runs locally, uses the same codex-1 model via API (or o4-mini for a cheaper option), and has a layered trust model: suggest mode always asks before doing anything, auto-edit will write files without prompting but still asks before running shell commands, and full-auto runs completely autonomously inside a sandbox.
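The three modes amount to a small policy table. Here is a minimal sketch of that idea, assuming just two kinds of action (writing a file and running a shell command). The `Mode`, `Action`, and `needs_confirmation` names are invented for illustration and are not taken from the Codex CLI source.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class Action(Enum):
    WRITE_FILE = "write_file"
    RUN_SHELL = "run_shell"


# Actions each mode may take without asking the user first.
AUTO_APPROVED = {
    Mode.SUGGEST: set(),
    Mode.AUTO_EDIT: {Action.WRITE_FILE},
    Mode.FULL_AUTO: {Action.WRITE_FILE, Action.RUN_SHELL},
}


def needs_confirmation(mode: Mode, action: Action) -> bool:
    """True when the agent has to stop and ask before taking the action."""
    return action not in AUTO_APPROVED[mode]


def must_be_sandboxed(mode: Mode) -> bool:
    """Only the fully autonomous mode is confined to a sandbox in this sketch."""
    return mode is Mode.FULL_AUTO


if __name__ == "__main__":
    for mode in Mode:
        asks = [a.value for a in Action if needs_confirmation(mode, a)]
        print(f"{mode.value}: asks before {asks or 'nothing'}")
```

Writing it as a table makes the trade visible: every step up in autonomy removes one prompt, and the last step swaps the remaining prompt for a sandbox.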

The sandboxing on macOS uses Apple’s sandbox-exec, also called Seatbelt, a profile-based sandbox that restricts file system access, network calls, and process spawning to whatever the active policy allows. On Linux you get Docker or a similar container. Network access is disabled by default in full-auto mode, which matters because the common failure mode for an unconstrained agent is not dramatic data destruction; it’s quieter things, like downloading a dependency it found referenced in an issue comment, or making an outbound API call with credentials it found in your local environment.
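Two of those quieter risks, writes outside the project and credentials leaking from the environment, come down to simple checks. The sketch below is not how Seatbelt or the Codex CLI enforces anything; real sandboxes do this at the operating-system level. The function names and the `SECRET_MARKERS` list are assumptions chosen to show the policy idea in plain Python.

```python
from pathlib import Path

# Substrings that suggest a variable holds a credential (an assumed heuristic).
SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def is_write_allowed(target: str, writable_roots: list[str]) -> bool:
    """Allow a write only if the resolved path sits under a writable root."""
    resolved = Path(target).resolve()  # collapses ".." and follows symlinks
    return any(
        resolved.is_relative_to(Path(root).resolve()) for root in writable_roots
    )


def scrub_environment(env: dict[str, str]) -> dict[str, str]:
    """Drop variables whose names look like credentials before spawning a tool."""
    return {
        name: value
        for name, value in env.items()
        if not any(marker in name.upper() for marker in SECRET_MARKERS)
    }


if __name__ == "__main__":
    roots = ["/tmp/repo"]
    print(is_write_allowed("/tmp/repo/src/app.py", roots))
    print(is_write_allowed("/tmp/repo/../secrets.txt", roots))
    print(scrub_environment({"PATH": "/usr/bin", "OPENAI_API_KEY": "sk-example"}))
```

Resolving the path before comparing it is the important detail: a naive string-prefix check would wave `/tmp/repo/../secrets.txt` straight through.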

I’ve spent time with tools in this space, and the three-tier permission model in Codex CLI is thoughtful. Reading and writing files feels lower-stakes than running arbitrary shell commands, which feels lower-stakes than unrestricted network access. Most developers have that intuition implicitly; the CLI makes those seams explicit rather than hiding them behind a single “do you trust this agent” dialog.

The ‘Almost’ Is Load-Bearing

“Codex for almost everything” is an interesting choice of title, because the qualifier does real work. The honest version of what current coding agents do well is: fix discrete, well-described bugs in existing codebases; write new functions against clear specifications; add tests for existing code; refactor within a bounded scope; update dependencies with predictable downstream effects.

Four kinds of work remain genuinely hard: tasks that require sustained architectural judgment, tasks where the specification is ambiguous and needs clarification mid-stream, tasks that touch infrastructure or deployment (because the agent cannot observe what happens after it pushes), and tasks in large codebases where the relevant context is spread across hundreds of files with non-obvious dependencies. SWE-bench Verified, despite being a rigorous benchmark, skews toward the tractable end of this spectrum: single-repository Python projects with clear test suites that verify the fix. The benchmark is useful for comparing systems, but it does not fully represent the working conditions of a senior engineer at a large company dealing with a ten-year-old Java monolith.

The other hard part is the review loop. The output quality distribution from agentic coding tools is wide. Sometimes the agent writes exactly what you would have written and you can merge it in minutes. Other times it produces something that passes the tests but introduces a subtle behavioral change that the tests do not cover. The speed gain from letting an agent write code is real; the cognitive cost of reviewing unfamiliar diffs at volume is also real, and it compounds in ways that are not immediately obvious when you are evaluating the tool on a single isolated task.

Comparison to What I Actually Use

Claude Code, which I use daily for this bot’s codebase, takes a different approach. It is a terminal agent that runs in your local environment, uses a broad Claude model rather than a software-engineering-specific fine-tune, and leans heavily on explicit conversation and confirmation. It does not have the benchmark numbers that codex-1 has on SWE-bench, but it handles open-ended questions well: “explain how this middleware interacts with the event loop,” “what would break if I changed this type,” “help me think through the architecture of this feature.” Those are the questions I spend more time on than bug fixes.

Cursor sits in a third position: not a pure agent but an IDE with increasingly strong agentic features. Its Composer mode handles multi-file edits with good codebase context, but it is still largely interactive rather than autonomous. The UX friction is lower than a CLI agent; the autonomy ceiling is also lower.

The Codex cloud agent and Claude Code are both aiming at the same thing: a developer who can describe a task in natural language and come back to a working diff. The differences are in the underlying model (o3-class fine-tune versus broad-capability Sonnet), the execution environment (sandboxed cloud container versus your local machine), and the UX surface (ChatGPT interface versus terminal). For teams that want parallelism and want to hand off a backlog of tickets, the cloud agent wins on throughput. For workflows that require local context, private configurations, unreleased dependencies, or VPN-gated services, a local agent matters more.

What This Changes Day to Day

The practical upshot is that for straightforward engineering tasks, the bar for automation has moved. Fixing a well-described bug, writing tests for a function you just shipped, updating a library and patching the call sites: these are now tasks you can credibly hand off and expect back as a clean diff within minutes. The tooling, whether it is Codex, Claude Code, or something else, has gotten good enough that the bottleneck is increasingly in how clearly you can specify the task and how quickly you can review the result.

The “almost everything” framing is both marketing and an honest technical statement. What gets left out is work that requires judgment, context that exists outside the repository, and feedback loops that span more than a single session. That is still a large portion of software engineering. The gap is narrowing, but it is not gone, and the teams that use these tools well will be the ones who understand exactly where the “almost” ends.
