Static analysis for security has always made a bet: that dangerous code can be recognized by its shape. For three decades, that bet paid off tolerably well. OpenAI is now arguing, in their post about why Codex Security doesn’t include a SAST report, that the bet has expired.
The argument is worth working through carefully, because it is not just product positioning. It reflects a genuine shift in where vulnerabilities live and in what kind of reasoning is required to find them.
Why SAST Made Sense in 1999
SAST originated in the C/C++ world, where the most common vulnerability classes were direct consequences of unsafe library usage. gets() performs no bounds checking and overflows the destination buffer whenever the input is longer than the buffer. strcpy() has the same problem whenever the source is longer than the destination. printf(user_string) with a user-controlled format string is a format string vulnerability. These are not subtle. The dangerous call is visible in the source text itself, before any reasoning about data flow or program semantics.
Tools like Flawfinder and RATS were essentially grep with a database of dangerous function names. Their false positive rates were high, but the true positives were real, exploitable, and fixable. The tooling did genuine work.
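A first-generation scanner of this kind can be sketched in a few lines. This is an illustration of the idea, not Flawfinder's or RATS's actual implementation; the rule table, the function name scan_c_source, and the sample input are all hypothetical.

```python
import re

# Hypothetical rule database: function name -> reason it is flagged.
DANGEROUS_CALLS = {
    "gets": "no bounds checking on the destination buffer",
    "strcpy": "copies until NUL with no length limit",
    "sprintf": "unbounded formatted write",
}

# Match any listed name, as a whole word, followed by an opening parenthesis.
CALL_PATTERN = re.compile(r"\b(" + "|".join(DANGEROUS_CALLS) + r")\s*\(")


def scan_c_source(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, function_name, reason) for each textual match."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            name = match.group(1)
            findings.append((line_number, name, DANGEROUS_CALLS[name]))
    return findings


SAMPLE = '''\
char buf[64];
gets(buf);
strcpy(dst, "fixed literal");  /* harmless, still flagged */
/* remember not to call gets() here */
'''

for finding in scan_c_source(SAMPLE):
    print(finding)
```

The sample shows both halves of the trade-off: the real gets() call on line 2 is caught, while the constant-source strcpy() and the mention of gets() inside a comment are flagged as well, because the scanner matches text rather than meaning.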
Second-generation SAST added taint analysis: mark sources (HTTP parameters, file reads, environment variables) and sinks (SQL execution, OS commands, eval()), then track whether tainted data can reach a sink without passing through a sanitizer. CodeQL, Checkmarx, and Fortify all center on this model. For SQL injection in a trivial case, it works:
# SAST correctly flags this
query = "SELECT * FROM users WHERE name = '" + request.args["name"] + "'"
db.execute(query)
The taint path is direct and the sink is obvious. Taint analysis was a genuine improvement for this class of vulnerability.
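The source-to-sink model can be illustrated with a toy tracker built on Python's ast module. This is a minimal sketch for straight-line code only, under the assumption that the name request is the only source and any .execute() call is a SQL sink; production tools such as CodeQL model control flow, function calls, and sanitizers far more thoroughly.

```python
import ast

SOURCES = {"request"}   # names treated as attacker-controlled (assumption)
SINKS = {"execute"}     # method names treated as SQL sinks (assumption)


def loaded_names(node: ast.AST) -> set[str]:
    """Names that are read (not assigned) anywhere inside the node."""
    return {
        n.id for n in ast.walk(node)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }


def find_tainted_sinks(code: str) -> list[int]:
    """Line numbers of sink calls whose SQL text depends on a source."""
    tainted = set(SOURCES)
    flagged = []
    for stmt in ast.parse(code).body:
        if isinstance(stmt, ast.Assign):
            # Propagate taint from the right-hand side to simple targets.
            is_tainted = bool(loaded_names(stmt.value) & tainted)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    if is_tainted:
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            is_sink = isinstance(call.func, ast.Attribute) and call.func.attr in SINKS
            # Only the first argument (the SQL text) is treated as sensitive.
            if is_sink and call.args and loaded_names(call.args[0]) & tainted:
                flagged.append(stmt.lineno)
    return flagged


VULNERABLE = '''\
query = "SELECT * FROM users WHERE name = '" + request.args["name"] + "'"
db.execute(query)
'''

SAFE = '''\
query = "SELECT * FROM users WHERE name = ?"
db.execute(query, (request.args["name"],))
'''

print(find_tainted_sinks(VULNERABLE))
print(find_tainted_sinks(SAFE))
```

The concatenated query is flagged at its execute() call, while the parameterized version is not, because the user value never becomes part of the SQL text.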
The Problem Is Sanitizer Semantics
Where taint analysis breaks down is in how it handles the code between source and sink. Sanitizers are registered as binary: either a function is known to clear taint, or it is not. This causes both false positives and false negatives, often simultaneously in the same codebase.
Consider an XSS attempt with a naive filter:
def render_user_content(content):
    safe = content.replace('<script', '')
    return f"<div>{safe}</div>"
If replace('<script', '') is registered as a sanitizer, taint analysis clears the taint and misses the vulnerability. The filter is trivially bypassed with <SCRIPT (the match is case-sensitive), <scr<scriptipt (removing the inner match reassembles the tag), or <img src=x onerror=alert(1)> (no script tag involved at all). If it is not registered, taint analysis flags the code as XSS, which is technically correct but requires a human to verify that the sanitizer is actually insufficient.
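The insufficiency of the filter is easy to confirm by running it. The sketch below reuses render_user_content from above and compares it with a hypothetical render_escaped helper built on the standard library's html.escape; the payload list is illustrative, not exhaustive.

```python
import html


def render_user_content(content: str) -> str:
    # The naive filter from the text: removes one exact, lowercase substring.
    safe = content.replace('<script', '')
    return f"<div>{safe}</div>"


def render_escaped(content: str) -> str:
    # Hypothetical alternative: escape for an HTML element-content context.
    return f"<div>{html.escape(content)}</div>"


PAYLOADS = [
    "<SCRIPT>alert(1)</SCRIPT>",         # case variation: replace() is case-sensitive
    "<scr<scriptipt>alert(1)</script>",  # nested: the removal reassembles the tag
    "<img src=x onerror=alert(1)>",      # no script tag at all
]

for payload in PAYLOADS:
    print("naive:  ", render_user_content(payload))
    print("escaped:", render_escaped(payload))
```

All three payloads survive the naive filter as live markup, while the escaped output contains no tag-opening characters from the payload. Whether a transformation counts as a sanitizer depends on what it does to the value, which is exactly the information a binary sanitizer registry discards.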
The inverse problem occurs with unknown-but-correct sanitizers. A team that writes a rigorous sql_escape() function using a well-tested approach gets their code flagged because SAST does not know that function is safe.
What both cases require is not pattern matching but reasoning about whether a given transformation actually neutralizes the threat downstream. That is work that SAST cannot do: evaluating the semantic sufficiency of an intermediate check relative to the requirements of a specific sink.
OpenAI calls this constraint reasoning. The name is precise. The question is not “does a sanitizer exist” but “given the constraints this value must satisfy to be exploitable, can an attacker still satisfy them after every transformation the value passes through?”
Consider a more subtle case:
def get_user(user_id):
    if not user_id.isdigit():
        raise ValueError("bad id")
    if len(user_id) > 10:
        raise ValueError("too long")
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)
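A SAST taint analyzer sees user input flowing into a SQL string and either flags it (if isdigit() is not a registered sanitizer) or clears it (if it is). Constraint reasoning evaluates the actual question: after isdigit(), can the value still contain SQL metacharacters? No. isdigit() passes only strings made up entirely of digit characters. In Python that set is wider than [0-9], since it includes non-ASCII Unicode digits such as superscripts, but it contains no quotes, whitespace, comment markers, or operators, so no SQL injection payload can be built from it. (A non-ASCII digit may make the query fail, which is a robustness bug rather than an injection.) The correct verdict is not exploitable, and arriving at it requires understanding what the sink is vulnerable to, not just whether a sanitizer function name appears.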
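The claim about isdigit() can be checked exhaustively rather than taken on trust. The sketch below walks every Unicode code point and asks whether any character that str.isdigit() accepts is also a SQL metacharacter. The SQL_METACHARACTERS set is an illustrative assumption, not a complete grammar of any SQL dialect.

```python
import sys

# Characters an attacker would need to break out of a numeric SQL context.
# Illustrative assumption, not an exhaustive list for any specific dialect.
SQL_METACHARACTERS = set("'\";-/*\\ ()=,#")


def digit_characters():
    """Yield every single character for which str.isdigit() is true."""
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if ch.isdigit():
            yield ch


accepted = list(digit_characters())
non_ascii = [ch for ch in accepted if ch not in "0123456789"]
overlap = [ch for ch in accepted if ch in SQL_METACHARACTERS]

print("characters accepted by isdigit():", len(accepted))
print("of which non-ASCII:", len(non_ascii))
print("overlap with SQL metacharacters:", overlap)
```

Because a string passes isdigit() only when every character in it does, an empty overlap at the character level means no accepted string can carry a metacharacter. The non-ASCII count is also worth noting: the check is looser than [0-9], and a stricter validator such as str.isascii() combined with isdigit() would close that gap.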
The Formal Methods Parallel
Constraint reasoning is not a new concept in program analysis. Symbolic execution, developed since the 1970s, does exactly this: it maintains symbolic constraints on program variables and uses SMT solvers like Z3 to determine whether those constraints can be simultaneously satisfied. Tools like KLEE and angr use symbolic execution for security analysis and can reason about sanitizer sufficiency in ways SAST cannot.
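The question a solver answers, whether any input satisfies both the validator's constraints and the exploit's constraints, can be imitated at toy scale with bounded exhaustive search. The sketch below is a brute-force stand-in, not symbolic execution and not an SMT query; the alphabet, the length bound, and the helper names are all assumptions chosen for illustration.

```python
from itertools import product
from typing import Callable, Optional

ALPHABET = "0a' -"  # tiny illustrative alphabet (assumption)


def find_counterexample(
    validator: Callable[[str], bool],
    is_dangerous: Callable[[str], bool],
    max_len: int = 4,
) -> Optional[str]:
    """Return an input that passes validation yet is dangerous, or None.

    Searches every string over ALPHABET up to max_len characters.
    """
    for length in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if validator(candidate) and is_dangerous(candidate):
                return candidate
    return None


def contains_quote(value: str) -> bool:
    # Stand-in for "can break out of a quoted SQL string".
    return "'" in value


def strict_validator(value: str) -> bool:
    return value.isdigit()


def loose_validator(value: str) -> bool:
    # Bug: only the first character is checked.
    return value[0].isdigit()


print(find_counterexample(strict_validator, contains_quote))
print(find_counterexample(loose_validator, contains_quote))
```

The strict validator yields no counterexample within the bound, while the loose one is defeated by a two-character input. A real solver reaches the same kind of verdict symbolically, without enumerating candidates, and without the bound on length.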
The practical problem with symbolic execution is path explosion. Real programs have enormous numbers of distinct execution paths. Symbolic execution must either bound the analysis (and miss paths) or face combinatorial blowup. For a 100,000-line application with complex control flow, exhaustive symbolic execution is computationally intractable on any reasonable budget.
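The growth rate is easy to see with a small count. Assuming n independent two-way branches (a simplification, since real branches are often correlated and loops add more paths), the number of distinct paths is 2**n:

```python
from itertools import product


def enumerate_paths(branch_count: int) -> list[tuple[bool, ...]]:
    """Every combination of taken/not-taken outcomes for independent branches."""
    return list(product((False, True), repeat=branch_count))


for n in (4, 8, 16):
    print(n, "branches ->", len(enumerate_paths(n)), "paths")

# Counting without enumerating: 40 independent branches.
print(40, "branches ->", 2 ** 40, "paths")
```

Forty branches is a small function by production standards, and it already implies more than a trillion paths for an analysis that explores each one separately.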
LLM-based constraint reasoning approximates this. It does not run formal SMT queries; it applies learned representations of code semantics to reason about whether a constraint chain actually holds. The result is less formally guaranteed but capable of handling the messy complexity of real codebases, frameworks, and third-party dependencies that symbolic execution struggles with. What LLMs bring to this problem is the ability to apply that kind of semantic reasoning at production scale without the combinatorial costs that have historically made formal methods impractical outside narrow domains.
The Vulnerability Landscape Has Moved On
The more structurally important argument in OpenAI’s framing is historical. The vulnerabilities that SAST was built to detect, at scale, have largely been eliminated from mature codebases. Parameterized queries are now default in every serious ORM. Template engines auto-escape by default. Buffer overflow-prone C APIs are wrapped in safe abstractions. The direct-injection vulnerability classes that taint analysis excels at finding are rarer in new code and are being continuously surfaced by existing tooling in legacy code.
The vulnerabilities appearing in major incidents over the last five years are different in kind:
- Authentication bypass through logic errors, where the code is internally consistent but semantically wrong
- Second-order injection, where a payload stored during one request is executed during another, across a persistence boundary that per-request taint tracking typically does not follow
- Business logic vulnerabilities, where legitimate API calls in the wrong sequence produce unauthorized outcomes
- SSRF in cloud architectures, where the relevant trust boundary is between services rather than within a single function
None of these are visible to taint analysis. They require understanding what the code is supposed to do and identifying where its behavior diverges from that intent. That is semantic understanding, not pattern recognition. No rule written against an AST or dataflow graph will surface them.
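Second-order injection is the easiest of these to demonstrate concretely. The sketch below uses an in-memory SQLite database; the schema and the function names are hypothetical. The write path is correctly parameterized, so an analysis that looks at that request alone finds nothing, and the read path only misbehaves because it trusts a value that came back out of the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT)")


def register(display_name: str) -> None:
    # Request 1: parameterized insert, nothing to flag here.
    conn.execute("INSERT INTO users (display_name) VALUES (?)", (display_name,))


def ids_sharing_name_unsafe(user_id: int) -> list[int]:
    # Request 2: the name is read from the database, treated as trusted,
    # and concatenated into a second query.
    (name,) = conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    query = f"SELECT id FROM users WHERE display_name = '{name}' ORDER BY id"
    return [row[0] for row in conn.execute(query)]


def ids_sharing_name_safe(user_id: int) -> list[int]:
    # Same lookup, with the stored value passed as a parameter.
    (name,) = conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id FROM users WHERE display_name = ? ORDER BY id", (name,)
    )
    return [row[0] for row in rows]


register("alice")
register("bob")
register("x' OR '1'='1")  # stored harmlessly, detonates on a later read

print(ids_sharing_name_unsafe(3))
print(ids_sharing_name_safe(3))
```

The unsafe lookup for the third user returns every row in the table, while the safe version returns only that user. The attacker-controlled source and the vulnerable sink live in different requests, connected only by a row in a table.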
The Alert Fatigue Compounding Effect
False positive rates in SAST deployments consistently land between 30 and 70 percent in production codebases, a range documented across Gartner application security testing market analyses, Semgrep’s own documentation, and multiple enterprise case studies. Microsoft Research found that developers acted on roughly 14 percent of static analysis findings, with the rest dismissed or suppressed.
The operational consequence is not just wasted triage time. It is degraded signal. When a tool produces false positives at a 50 percent rate, developers learn to dismiss its findings. True positives get dismissed along with false positives. A team that has trained itself to ignore SAST alerts is arguably in a worse security posture than a team with no SAST at all, because they have built a workflow that systematically routes security findings toward being ignored.
An AI system that produces fewer, higher-confidence findings changes this calculus. The value is not just precision on any individual finding; it is the restoration of signal value to the output as a whole. A developer who trusts that a finding represents a genuine, exploitable issue will investigate it. A developer who has learned that findings are mostly noise will not, regardless of what the finding says.
What This Means in Practice
The Codex Security approach is not a complete replacement for all static analysis. Semgrep-style pattern matching for secrets, dependency vulnerabilities, and obviously dangerous API usage is fast, cheap, and useful at the margin. The appropriate conclusion is not to discard scanners but to be precise about what they can and cannot find, and to stop expecting taint analysis to catch the vulnerability classes that were never in its scope.
What the OpenAI post is marking is a genuine inflection point in the security tooling landscape. The vulnerability classes most relevant to security outcomes today are not the ones static analysis was designed to find. The tools that matter going forward are those capable of reasoning about what attackers can cause a program to do, not just what the code looks like. The formal methods community established this theoretically decades ago. What has changed is that reasoning at that level is now available at a scale and cost that makes it practical for production engineering teams.