
The Plausibility Problem: What Fifteen Years of Automated Program Repair Research Tells Us About Codex Security

Source: OpenAI

Machines That Fix Code Are Not New

OpenAI’s Codex Security, now in research preview, promises something that sounds novel: an AI that finds vulnerabilities in your code and then patches them automatically. Most of the coverage treats this as a new capability. It is not, exactly. A research field called Automated Program Repair (APR) has been working on the same problem since roughly 2009, and the lessons from that work are directly relevant to evaluating what Codex Security can and cannot do.

The history is worth understanding, not as trivia but because the field identified the fundamental difficulty with automated patching early on. That difficulty has not been solved. It has been reframed.

GenProg and the Test-Suite Problem

The landmark APR paper was GenProg, published by Claire Le Goues and colleagues at the University of Virginia and the University of New Mexico in 2012. GenProg applied genetic programming to bug repair: mutate a buggy program using a small set of operators (insert, delete, swap statements), evaluate each candidate against a test suite, and evolve a population toward versions that pass. Applied to real bugs from open-source C projects, it produced patches without any human-authored templates.
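The loop GenProg ran can be sketched roughly as follows. This is a toy illustration, not GenProg's actual implementation: here a "program" is just a list of opaque statements and the tests are predicates over it, whereas GenProg operated on weighted C statement paths with a weighted fitness function.

```python
import random

def mutate(program):
    """Apply one of GenProg's three operators at a random statement."""
    p = list(program)
    if not p:
        return p
    i = random.randrange(len(p))
    op = random.choice(["delete", "insert", "swap"])
    if op == "delete":
        del p[i]
    elif op == "insert":
        # GenProg inserted code copied from elsewhere in the same program.
        p.insert(i, random.choice(program))
    else:
        j = random.randrange(len(p))
        p[i], p[j] = p[j], p[i]
    return p

def fitness(candidate, tests):
    """Number of passing tests; GenProg weighted the initially failing ones."""
    return sum(1 for t in tests if t(candidate))

def repair(program, tests, pop_size=20, generations=50):
    population = [mutate(program) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, tests), reverse=True)
        if fitness(population[0], tests) == len(tests):
            return population[0]   # plausible: passes every test we have
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in survivors]
    return None
```

Note what the loop optimizes: the test suite, and nothing else. Any candidate that satisfies every test is declared done.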

But the research community quickly identified a structural problem. GenProg was optimizing against the existing test suite, and test suites rarely cover the full intended behavior of a program. A system that passes the failing test might do so by deleting the code path that triggered the failure, returning a hardcoded value that satisfies the assertion, or clamping output in a way that breaks behavior the tests never checked.

The community coined a specific term for this: plausibly correct. A plausible patch passes the test suite. A correct patch actually fixes the underlying problem without introducing new ones. Subsequent analysis found that a substantial fraction of GenProg’s plausible patches were semantically wrong in one of these ways. The gap between plausibility and correctness became the central research problem of the entire field.

Prophet (2016, from Fan Long and Martin Rinard at MIT) attacked this by learning a correctness model from a corpus of human-authored patches. Rather than mutating randomly, Prophet prioritized candidates that statistically resembled the transformations real developers made when fixing real bugs. This improved patch correctness meaningfully and represented the first learning-based repair system to show results on real-world defects. It also foreshadowed the direction the field would eventually go: learning from human patches rather than searching combinatorially.

What LLMs Actually Change

The move from genetic programming and constraint-based repair to large language models is not just a technical upgrade. It changes the source of repair knowledge fundamentally.

GenProg drew on mutations of the program itself. Prophet drew on a learned distribution over human patch shapes. A model like the one powering Codex Security has absorbed an enormous breadth of code, security advisories, CVE write-ups, and patch diffs, meaning it carries something closer to the semantic intent behind fixes than to their surface statistics.

For well-understood vulnerability classes, this matters considerably. Consider SQL injection. A traditional APR system given a SAST finding would need templates to know that this:

cursor.execute("SELECT * FROM users WHERE id = " + user_id)

should become this:

cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))

A language model has seen this transformation in Python’s sqlite3, psycopg2, and SQLAlchemy’s raw text API, along with the edge cases where naive parameterization still fails: different parameter styles across database adapters, named versus positional parameters, cases where the same input reaches multiple queries. It has also seen the second step that developers commonly miss: updating hash comparison logic after switching password storage from MD5 to bcrypt, or adjusting downstream API responses after sanitizing a value that callers expected in its original form.
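The adapter-specific placeholder detail is concrete and checkable. The sketch below uses Python's standard-library sqlite3 to show why "add a placeholder" is not one mechanical transformation: the placeholder syntax itself varies by DB-API adapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_id = "1 OR 1=1"   # attacker-controlled input

# sqlite3 uses the 'qmark' paramstyle. The driver passes the whole value as
# data, so the payload matches no integer id and returns nothing.
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
assert rows == []

# sqlite3 also supports the 'named' style:
row = conn.execute("SELECT name FROM users WHERE id = :id", {"id": 1}).fetchone()
assert row == ("alice",)

# psycopg2, by contrast, uses the 'format' style, so the same fix there reads:
#   cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
# Using '?' with psycopg2, or '%s' with sqlite3, fails at runtime, which is
# exactly the adapter-specific detail a one-template fixer gets wrong.
```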

The model brings context that neither dataflow graphs nor genetic operators could represent: the import structure of the module, naming conventions of the codebase, whether a sanitizer already exists elsewhere that the fix should call rather than reinvent.

The Plausibility Problem Does Not Go Away

The APR field’s central insight still applies. Plausibility and correctness are different things, and language models are extraordinarily good at generating plausible output.

A model that produces a convincing patch for a path traversal vulnerability might sanitize the immediate file access correctly while missing a second access path three function calls up the stack. It might parameterize the flagged query while leaving semantically equivalent string concatenation in a different module. It might eliminate the traversal while introducing a time-of-check/time-of-use race condition in the file existence check. The patch compiles, the SAST finding disappears, the existing tests pass, and the code remains exploitable.
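The time-of-check/time-of-use case is worth seeing concretely. Both helpers below are hypothetical (the function names and the `base` upload-root parameter are illustrative, and `base` is assumed to be an already-resolved absolute path); the first is the shape of a plausible patch that trades one bug for another.

```python
import os

def read_upload_racy(base, filename):
    """A 'plausible' patch: traversal is blocked, but the exists/open pair
    is a time-of-check/time-of-use window; the file can be replaced with
    a symlink between the two calls."""
    path = os.path.realpath(os.path.join(base, filename))
    if not path.startswith(base + os.sep):
        raise ValueError("path traversal")
    if os.path.exists(path):        # check...
        with open(path) as f:       # ...then use: racy gap between the two
            return f.read()
    raise FileNotFoundError(filename)

def read_upload_safer(base, filename):
    """Validate once, then open directly and let the OS report absence.
    (Fully race-free access needs O_NOFOLLOW / dir-fd techniques beyond
    this sketch.)"""
    path = os.path.realpath(os.path.join(base, filename))
    if not path.startswith(base + os.sep):
        raise ValueError("path traversal")
    with open(path) as f:           # FileNotFoundError surfaces directly
        return f.read()
```

Both versions silence a path-traversal finding; only one of them avoids opening a window the original code never had.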

This is not hypothetical. The DARPA AI Cyber Challenge (AIxCC) in 2024 specifically tasked teams with building AI systems that could find and patch vulnerabilities in open-source software under competition conditions. Teams reported strong detection rates and uneven patch correctness, consistent with what APR researchers found a decade earlier with entirely different techniques. Detection is the easier half of the problem.

GitHub Copilot Autofix, the most direct competitor to Codex Security and already integrated into GitHub Advanced Security, publishes data on developer acceptance rates for its suggested fixes. Acceptance rate is not a correctness metric, since developers might accept patches they do not scrutinize closely. But it provides a real-world signal that the workflow is useful, and it gives a comparison baseline for evaluating Codex Security as it matures.

The Coverage Gap That No Automated Tool Closes

The vulnerability classes where AI-assisted patching works well are the ones that also yield to traditional SAST: injection vulnerabilities, use of deprecated cryptographic primitives, missing input validation, hardcoded credentials. These have well-understood, largely mechanical fixes.

Business logic vulnerabilities are different. Insecure Direct Object Reference (IDOR), broken access control, and authorization flaws that depend on knowing which users own which resources do not appear in dataflow graphs and do not have generic fixes. A patch for an IDOR vulnerability requires understanding the intended access control model of the application, which is not present in the code in any form a static analyzer or language model can extract reliably.
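A minimal sketch of why IDOR is invisible to these tools. Everything below is hypothetical (the store, the ownership rule, the function names); the point is that the vulnerable version contains no tainted flow into a dangerous sink, so there is nothing for a scanner to flag, and the fix depends on an access-control rule the code never stated.

```python
# Hypothetical document store; the ownership model lives in data, not in
# any dataflow graph a scanner can inspect.
DOCUMENTS = {
    101: {"owner": "alice", "body": "alice's notes"},
    102: {"owner": "bob", "body": "bob's notes"},
}

def get_document_idor(current_user, doc_id):
    """Vulnerable: any authenticated user can read any document by guessing
    its id. No injection, no dangerous sink: SAST sees nothing wrong."""
    return DOCUMENTS[doc_id]["body"]

def get_document_fixed(current_user, doc_id):
    """The fix requires knowing the intended rule (users read only their own
    documents). That rule is an assumption supplied here by a human; nothing
    in the original code expressed it."""
    doc = DOCUMENTS.get(doc_id)
    if doc is None or doc["owner"] != current_user:
        raise PermissionError("not authorized")
    return doc["body"]
```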

This coverage gap is not specific to Codex Security; it affects every automated security tool. But it matters more in the context of an agentic patcher than a passive scanner. A team that routes all SAST findings through an AI patcher and closes the resolved tickets may develop confidence that their security posture is improving when the more dangerous vulnerability classes remain entirely unaddressed.

Integration Is the Real Differentiator

The dimension where Codex Security and tools like it genuinely improve on the APR research prototypes is workflow integration. Academic repair systems operated on isolated programs with known bug locations and curated test suites. A production tool that opens a pull request with the proposed patch, explains the vulnerability in plain language, cites the relevant CWE, and includes a regression test is delivering a workflow artifact alongside the code transformation.

That workflow value is real and largely separable from the correctness question. Even if an organization treats AI-generated patches as first drafts requiring human review, having the draft ready alongside the finding reduces the time from detection to remediation. That metric matters: the mean time to remediate vulnerabilities in production code is measured in weeks to months across the industry, not because developers lack skill but because findings queue up while developers context-switch between features.

The research preview framing from OpenAI is appropriate. It signals that unattended pipeline use is not the intended deployment model yet, which is consistent with what APR research suggests: automated repair is most reliable as a starting point for human review, not as a replacement for it. For high-confidence, low-complexity findings, automation with lightweight review is reasonable. For anything involving multi-file data flow, access control logic, or unfamiliar codebases, human review is not optional.

What to Watch as the Tool Matures

A few signals will tell you whether Codex Security is developing in a trustworthy direction. The APR research community maintains standard benchmarks, including SWE-bench for general software engineering tasks and security-specific variants. Whether OpenAI publishes benchmark performance, and on which benchmarks, will indicate how seriously the correctness question is being engaged versus the plausibility question.

Confidence calibration matters too. A tool that distinguishes “mechanical fix, high confidence” from “complex multi-file change, review carefully” is more trustworthy than one that presents all patches with equal certainty. APR systems that lacked confidence estimation produced correct and incorrect patches that developers could not easily tell apart.

Finally, coverage of semantic vulnerabilities such as IDOR and broken access control would represent a genuine advance. No current tool handles these reliably. Progress there would mean something qualitatively new is happening, not just existing SAST with a smarter patch generator attached.

The question Codex Security needs to answer is the same one Le Goues and colleagues posed in 2012: when the machine says the code is fixed, how do you know it actually is?
