Codex Outgrows Its Name: Computer Use, Memory, and the Architecture of an Agentic Desktop
Source: openai
The original Codex was a model. It lived at an API endpoint, accepted a prompt, and returned code. GitHub Copilot was built on it, and for a while it was the most visible demonstration of what large language models could do with code. OpenAI deprecated that API in March 2023, directing developers toward GPT-3.5 and GPT-4. The name went quiet for a while.
Now Codex is back as something structurally different: a desktop application for macOS and Windows that combines computer use, in-app browsing, image generation, persistent memory, and a plugin system into a single interface aimed at developer workflows. The announcement frames this as Codex for “almost everything,” which is a significant claim and worth examining carefully.
What Computer Use Actually Means Here
Computer use is the feature that changes the most about what the tool can do. In the context of AI agents, computer use means the model can observe a screen and issue inputs: clicks, keystrokes, scroll events, form submissions. It sees what you would see sitting at the machine and acts on it.
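The observe-and-act cycle described above can be sketched in a few lines. This is a minimal illustration, not how the Codex app is implemented: the Action type, run_loop, FakeScreen, and scripted_policy are all hypothetical names, and the "model" is a hard-coded policy standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One input event the agent wants to issue (all fields are hypothetical)."""
    kind: str         # "click", "type", "scroll", or "done"
    target: str = ""  # a label the model read off the screen
    text: str = ""    # payload for "type" actions


def run_loop(
    observe: Callable[[], str],
    decide: Callable[[str, List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Alternate between observing the screen and issuing one input."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = observe()                # what a person at the machine would see
        action = decide(screen, history)  # stand-in for the model call
        if action.kind == "done":
            break
        act(action)                       # click, keystroke, scroll, ...
        history.append(action)
    return history


class FakeScreen:
    """A toy two-step login flow used in place of a real display."""

    def __init__(self) -> None:
        self.state = "login form"

    def observe(self) -> str:
        return self.state

    def act(self, action: Action) -> None:
        if action.kind == "type":
            self.state = "login form (filled)"
        elif action.kind == "click":
            self.state = "dashboard"


def scripted_policy(screen: str, history: List[Action]) -> Action:
    """Hard-coded decisions; a real agent would query a model here."""
    if screen == "login form":
        return Action("type", target="username", text="dev")
    if screen == "login form (filled)":
        return Action("click", target="Sign in")
    return Action("done")


screen = FakeScreen()
steps = run_loop(screen.observe, scripted_policy, screen.act)
print([a.kind for a in steps], screen.state)
```

The max_steps cap is a deliberate choice in the sketch: a loop that acts on its own observations needs some bound so a misread screen cannot keep it running indefinitely.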
Anthropic introduced Claude’s Computer Use API in October 2024. OpenAI followed with its own CUA (Computer-Using Agent) capabilities via a computer-use-preview model available through the Responses API. The Codex desktop app takes that underlying capability and wraps it in a first-party, polished experience rather than leaving it as a raw API for developers to scaffold themselves.
The practical difference is substantial. A coding agent that controls only its own tool calls (reading files, running shell commands, editing text) operates entirely inside a defined sandbox. A coding agent with computer use can open a browser to read documentation, navigate a GUI application that has no API, interact with local development environments, and move between applications the way a human developer would. The surface area of what it can access expands from whatever tools its scaffolding exposes to essentially anything on the machine with a visual interface.
This also means the failure modes are different. A tool-call-only agent fails by calling the wrong tool or producing a bad edit. A computer-use agent can navigate to the wrong page, click the wrong button, or misread a UI state and cascade that misreading through subsequent actions. The error surface is wider and the errors are harder to catch programmatically.
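One common way to limit that kind of cascade is to attach a postcondition to every action and stop as soon as the screen does not show what was expected. The sketch below assumes this strategy purely for illustration; nothing in the announcement says Codex works this way, and run_with_checks and its arguments are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (action description, text expected to be visible afterwards)
Step = Tuple[str, str]


def run_with_checks(
    steps: List[Step],
    perform: Callable[[str], None],
    read_screen: Callable[[], str],
) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and halt at the first failed postcondition."""
    completed: List[str] = []
    for action, expected in steps:
        perform(action)
        if expected not in read_screen():
            # Stop here so later actions do not build on a misread state.
            return completed, f"after {action!r}: expected {expected!r} on screen"
        completed.append(action)
    return completed, None


# Toy environment: the second click lands on the wrong page.
pages = {"open settings": "Settings", "open billing": "Profile"}
current = {"text": ""}


def perform(action: str) -> None:
    current["text"] = pages.get(action, "")


plan = [
    ("open settings", "Settings"),
    ("open billing", "Billing"),
    ("click upgrade", "Upgrade"),
]
done, error = run_with_checks(plan, perform, lambda: current["text"])
print(done, error)
```

In this toy run the third step is never attempted, which is the point: the misread is caught at the step where it happens rather than several actions later.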
Browsing as a Development Primitive
In-app browsing addresses a specific friction point in developer workflows. When you are implementing an API integration, you spend time reading the provider’s documentation, then switching to your editor, then reading error messages, then switching back. The context split is real and it accumulates over the course of a session.
Having browsing available inside the same environment where code is being generated collapses that loop. The agent can read the documentation for the library you are about to use, pull a relevant example, and generate code that reflects what it just read, all in a single operation. The model is not reasoning from training data about how an API probably works; it is reading the current documentation and working from that.
This matters more than it might appear at first. Training data cutoffs mean that models have no knowledge of API changes, deprecations, or new features added after their cutoff date. A browsing tool transforms a static knowledge problem into a retrieval problem. The model is no longer limited to what it knew at training time about any given SDK or service.
The tradeoff is that browsing is slower than memory retrieval and the model needs to extract relevant information from a page that may contain a lot of noise. Web pages are not structured for machine reading; navigation requires choosing the right links, scrolling to the right sections, and filtering out sidebars, ads, and unrelated content. How well the Codex app handles that extraction will determine whether browsing feels like a useful feature or a slow one.
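To make the noise problem concrete, here is a rough sketch of the crudest possible filter: drop everything inside navigation, sidebar, and script elements and keep the rest. It uses only Python's standard html.parser module. The list of tags to skip is an assumption chosen for this example, and real pages need far more careful extraction than this.

```python
from html.parser import HTMLParser
from typing import List

# Elements treated as noise in this sketch (an assumption, not a standard).
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><nav>Home | Docs | Pricing</nav>"
    "<main><h1>Authentication</h1>"
    "<p>Pass the token in the Authorization header.</p></main>"
    "<aside>Sponsored: try our cloud</aside>"
    "<script>var tracking = 1;</script></html>"
)
print(extract_main_text(page))
```

A tag-based filter like this fails on pages that mark up their sidebars with generic div elements, which is one reason extraction quality is hard to guarantee.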
Memory as the Persistence Problem Solved
Persistent memory is the feature that makes the most difference for recurring work. Without it, every session starts from zero. The agent has no knowledge of your project’s conventions, your preferred patterns, the decisions you made last week, or the context behind the current task. You or the system prompt has to carry all of that.
OpenAI added memory to ChatGPT in 2024, initially as an opt-in feature that stored facts the model determined were worth keeping. The Codex app brings memory into a development-focused context, where the relevant things to remember are different: project structure, architectural decisions, team conventions, recurring tasks, and the state of ongoing work.
The hard engineering problem in agent memory is the write path, not the read path. Deciding what to store, how to structure it, when to update it, and how to handle contradictions between old and new information is significantly harder than doing a retrieval query. A memory system that stores everything becomes noisy. One that stores too little fails to provide continuity. Getting the write policy right is where these systems succeed or fail in practice.
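A toy write policy shows where those decisions sit. This sketch assumes a keyed store with an importance threshold and a "newer value wins" rule for contradictions. MemoryStore, the threshold of 0.5, and the importance scores are all hypothetical, and nothing here describes how the Codex app actually stores memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryEntry:
    key: str
    value: str
    importance: float
    superseded: List[str] = field(default_factory=list)  # older values, kept for audit


class MemoryStore:
    """Keyed memory with a deliberately simple write policy."""

    def __init__(self, min_importance: float = 0.5) -> None:
        self.min_importance = min_importance
        self.entries: Dict[str, MemoryEntry] = {}

    def write(self, key: str, value: str, importance: float) -> str:
        # Decide what to store: drop low-value facts to limit noise.
        if importance < self.min_importance:
            return "skipped"
        existing = self.entries.get(key)
        if existing is None:
            self.entries[key] = MemoryEntry(key, value, importance)
            return "stored"
        # Same fact seen again: nothing to change except its weight.
        if existing.value == value:
            existing.importance = max(existing.importance, importance)
            return "unchanged"
        # Contradiction: the newer value wins, the old one is retained.
        self.entries[key] = MemoryEntry(
            key, value, importance, existing.superseded + [existing.value]
        )
        return "updated"


store = MemoryStore()
print(store.write("test runner", "pytest", 0.9))
print(store.write("lunch", "pizza", 0.1))
print(store.write("test runner", "unittest", 0.8))
print(store.entries["test runner"].superseded)
```

Even this small version has to answer the hard questions: who assigns the importance score, and whether "newer wins" is the right rule when the newer statement is the mistaken one.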
The read path matters too: retrieval quality determines whether the stored memory actually surfaces when it is relevant. Semantic similarity search works well for factual retrieval but less well for procedural or structural information. A well-designed memory system for development work probably needs multiple retrieval strategies, not just embedding similarity.
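One way to combine several retrieval strategies is reciprocal rank fusion, which merges ranked lists without needing their scores to be comparable. The sketch below uses two stand-in rankers, keyword overlap and file-path matching, in place of real embedding search. The rankers, the item fields, and the constant k=60 are assumptions for illustration only.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]
Ranker = Callable[[str, List[Item]], List[str]]  # returns item ids, best first


def keyword_ranker(query: str, items: List[Item]) -> List[str]:
    """Rank by how many query words appear in the item's text."""
    words = set(query.lower().split())
    scored = [(len(words & set(it["text"].lower().split())), it["id"]) for it in items]
    return [item_id for score, item_id in sorted(scored, key=lambda p: -p[0]) if score > 0]


def path_ranker(query: str, items: List[Item]) -> List[str]:
    """Structural signal: keep items whose file path mentions a query word."""
    words = query.lower().split()
    return [it["id"] for it in items if any(w in it["path"].lower() for w in words)]


def fuse(query: str, items: List[Item], rankers: List[Ranker], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each ranker contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = {}
    for ranker in rankers:
        for rank, item_id in enumerate(ranker(query, items), start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: -scores[item_id])


memories = [
    {"id": "a", "text": "run the migrations before deploy", "path": "docs/deploy.md"},
    {"id": "b", "text": "deploy script lives in tools", "path": "tools/release.sh"},
    {"id": "c", "text": "prefer tabs over spaces", "path": "style.md"},
    {"id": "d", "text": "steps for shipping a release", "path": "runbooks/deploy.md"},
]
print(fuse("deploy migrations", memories, [keyword_ranker, path_ranker]))
```

Item "d" shares no words with the query and would be invisible to the keyword ranker alone; it surfaces only because the path ranker sees where it lives, which is the kind of structural information a single similarity search tends to miss.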
Image Generation in a Developer Tool
Image generation is the unexpected addition here. It is worth thinking about where this actually fits in a developer’s workflow, because the use cases are less obvious than for the other features.
The most direct use cases are around UI and design work: generating placeholder images for mockups, creating test assets for image processing pipelines, producing documentation screenshots, or prototyping visual components before committing to a design. For developers working on products that involve visual content, having generation available without switching to a separate tool reduces friction.
There is also a more subtle use case: using generated images as communication artifacts. Describing a UI layout in text and generating a rough visual representation gives you something concrete to evaluate and iterate on before writing any code. The image becomes an intermediate artifact in the design loop rather than an end product.
Plugins and the Ecosystem Bet
Plugins extend the agent’s capabilities beyond what OpenAI ships natively. The shape of a plugin system matters a lot: whether it follows something like the Model Context Protocol conventions, what trust model applies to plugin-provided tools, and how plugins can interact with the agent’s memory and context window.
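The trust-model question can be illustrated with a small registry in which each plugin tool declares the permissions it needs and calls are refused unless the user has granted all of them. This is a hypothetical sketch: Tool, ToolRegistry, and the permission strings are invented for the example and are not taken from the Codex plugin system or from the Model Context Protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    permissions: FrozenSet[str]     # what the plugin says it needs
    handler: Callable[[dict], str]  # plugin-provided code


class ToolRegistry:
    """Holds plugin tools and enforces user-granted permissions on each call."""

    def __init__(self, granted: Iterable[str]) -> None:
        self.granted = frozenset(granted)
        self.tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, args: dict) -> str:
        tool = self.tools[name]
        missing = tool.permissions - self.granted
        if missing:
            raise PermissionError(
                f"{name} needs ungranted permissions: {sorted(missing)}"
            )
        return tool.handler(args)


registry = ToolRegistry(granted={"read:ci"})
registry.register(
    Tool("ci_status", frozenset({"read:ci"}), lambda a: f"build {a['id']}: passed")
)
registry.register(
    Tool("create_ticket", frozenset({"write:tracker"}), lambda a: "ticket created")
)

print(registry.call("ci_status", {"id": 42}))
try:
    registry.call("create_ticket", {"title": "flaky test"})
except PermissionError as exc:
    print(exc)
```

Checking permissions at call time rather than at registration keeps the decision with the user: a plugin can be installed without being trusted to write anywhere.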
OpenAI has run a plugin experiment before. ChatGPT’s plugin ecosystem launched in 2023, generated significant developer interest, and was eventually folded into a different model (GPTs, then the Assistants API). The lesson from that experience is that plugin discoverability and quality control are harder than the plugin protocol itself. A marketplace of hundreds of plugins is only useful if the agent reliably chooses the right one and the plugin performs reliably when called.
For a developer-focused tool, the highest-value plugins are probably integrations with existing infrastructure: CI/CD systems, issue trackers, deployment platforms, monitoring services, database clients. If the agent can query your test results, read your error logs, and create tickets in your project management system without leaving the session, that is genuinely useful. If the plugin ecosystem fills up with novelty integrations and the infrastructure connections are mediocre, the plugin system adds complexity without adding value.
What the “Almost” in Almost Everything Signals
The qualifier in “almost everything” is doing real work. Computer use, browsing, memory, image generation, and plugins together do expand the surface area of what a desktop agent can do substantially. But there are categories of developer work where these capabilities still fall short.
Code review at scale, which means navigating a large unknown codebase and producing reliable architectural suggestions, depends on retrieval quality that context windows and browsing cannot fully substitute for. Long-running autonomous tasks, ones that span hours or days rather than a single session, require checkpoint and recovery mechanisms that go beyond in-session memory. Team-oriented workflows, where the agent needs to understand what multiple people are doing and coordinate across that, are not addressed by a single-user desktop app at all.
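For a sense of what checkpoint and recovery means at its simplest, the sketch below records each completed step in a JSON file and skips those steps when the task restarts. The function name, the file layout, and the step names are hypothetical, and a real system would also need to persist intermediate state, not just a list of finished steps.

```python
import json
import os
import tempfile
from typing import Callable, List


def run_with_checkpoints(
    steps: List[str], do_step: Callable[[str], None], path: str
) -> List[str]:
    """Run steps in order, skipping any already recorded in the checkpoint file."""
    done: List[str] = []
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)["done"]
    for step in steps:
        if step in done:
            continue  # finished in an earlier run
        do_step(step)
        done.append(step)
        # Write to a temp file and rename so a crash never leaves a torn file.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": done}, f)
        os.replace(tmp_path, path)
    return done


def demo() -> List[str]:
    """Simulate a crash during 'deploy', then a second run that resumes."""
    ran: List[str] = []

    def crash_on_deploy(step: str) -> None:
        if step == "deploy":
            raise RuntimeError("simulated crash")
        ran.append(step)

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "checkpoint.json")
        try:
            run_with_checkpoints(["build", "test", "deploy"], crash_on_deploy, path)
        except RuntimeError:
            pass
        run_with_checkpoints(["build", "test", "deploy"], ran.append, path)
    return ran


print(demo())
```

In the demo, "build" and "test" run once each even though the task is started twice. That only works because the steps are recorded after they succeed; a step that crashes halfway through still has to be safe to repeat.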
The “almost” is an honest acknowledgment that a desktop app with these features is a powerful tool for an individual developer’s workflow. It is not yet a system that can stand in for a broader slice of software development process.
The IDE Versus Agent Dichotomy
The broader question that Codex’s feature set raises is where the center of gravity in developer tooling will settle. Tools like Cursor 3 and Windsurf extend the editor into an agent container: the agent lives inside the development environment and operates on code within it. Tools like Codex, built as standalone agents with computer use, invert that: the agent operates at the OS level and the development environment is one of the things it can control.
Neither architecture is clearly superior. Editor-embedded agents have better access to language server data, tighter integration with the editor’s own context, and a more constrained action space that makes behavior more predictable. OS-level agents with computer use have broader reach, can coordinate across tools that have no APIs, and can handle workflows that do not live entirely in a single editor.
Most developers will probably use both approaches for different tasks, just as they use a screwdriver and a drill for different jobs. The more interesting question is which approach accumulates better memory and learning over time, because that is what will determine which one a developer reaches for as the default.