
The Soundness Trap: Why SAST's False Positive Problem Is Structural, Not Accidental

Source: OpenAI

The standard complaint about static analysis security tools is that they generate too many false positives. Enterprise teams commonly disable them, or tune them into irrelevance, because the ratio of real vulnerabilities to noise is too low to justify the review burden. This complaint is usually framed as a tooling quality problem: if the tools were better, they would be more precise. The framing is wrong. SAST tools produce noisy reports by design. The false positive problem is not incidental to how they work; it is a direct consequence of what they are optimized for.

OpenAI’s post on why Codex Security doesn’t include a SAST report argues from the developer trust angle: a finding you don’t trust is worse than no finding at all, because ignoring alerts is a learned behavior, and once a developer learns to ignore a tool’s output, the real findings get buried too. That argument is correct and practically important. But the deeper reason SAST produces the noise it does is theoretical, and understanding it clarifies why the alternative approach is architecturally different rather than just better-implemented.

Soundness and What It Requires

A sound static analyzer is one that never misses a true vulnerability: if a vulnerability exists in the code, a sound tool will report it. In information-retrieval terms, soundness is perfect recall, and it sounds like an obvious goal. The problem is that achieving it at scale forces a specific tradeoff: the analyzer must report a finding whenever it cannot prove safety. When the analyzer is uncertain, the sound choice is to flag.

This is where the false positives come from. Uncertainty is pervasive in static analysis because precise reasoning about program behavior requires solving problems that are undecidable in the general case. Consider the three major sources:

Alias analysis. Determining which memory locations a pointer might refer to is central to security analysis, and it is one of the hardest problems in compilers. If a sanitization function is called through a function pointer, and the analyzer cannot determine what the pointer points to at that call site, it must conservatively assume the sanitization might not occur. Tainted input may still reach the sink. The finding fires. In real codebases, function pointers, interface dispatch, and virtual calls are common precisely in the middleware and framework layers where sanitization typically happens.
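A minimal sketch of the dispatch problem (hypothetical middleware, not drawn from any particular codebase): the sanitizer is selected at runtime through a table of function references, so an analyzer that cannot resolve the indirect call must conservatively assume no sanitizer runs, and the taint survives.

```python
import html
import shlex

# Hypothetical middleware: the sanitizer is chosen at runtime through a
# table of function references. A static analyzer that cannot resolve
# which entry `sanitize` points to must assume none of them runs.
SANITIZERS = {
    "html": html.escape,
    "shell": shlex.quote,
    "none": lambda s: s,
}

def render(user_input: str, mode: str) -> str:
    sanitize = SANITIZERS[mode]  # indirect call: target unknown statically
    return f"<div>{sanitize(user_input)}</div>"
```

At runtime `render("<b>", "html")` is perfectly safe, but proving that statically requires resolving the dictionary lookup, which in general requires knowing every value `mode` can take.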

Interprocedural analysis depth. Most SAST tools bound their cross-function analysis at some call depth to keep analysis tractable. If a security-relevant transformation happens seven function calls deep, outside the analysis boundary, the tool sees what looks like unsanitized input reaching a dangerous operation. The transformation happened; the tool’s horizon was too narrow to see it. Checkmarx and Fortify are known to struggle with this in codebases that have deep call hierarchies, which includes most production Java and C# applications.
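A hypothetical call chain makes the horizon problem concrete: the transformation that neutralizes the input sits several frames below the function where the sink appears, so a depth-bounded tool sees raw input flowing into an f-string query and fires.

```python
# Hypothetical call chain: the sanitizing transformation sits several
# frames below the sink's enclosing function. A tool that bounds
# interprocedural analysis at a shallow depth sees build_query consume
# raw_id and never reaches escape_id.
def escape_id(value: str) -> str:
    return str(int(value))  # coerces to integer; injection payloads raise ValueError

def layer3(value): return escape_id(value)
def layer2(value): return layer3(value)
def layer1(value): return layer2(value)

def build_query(raw_id: str) -> str:
    safe_id = layer1(raw_id)  # the transformation is four calls deep from here
    return f"SELECT * FROM users WHERE id = {safe_id}"
```

`build_query("1 OR 1=1")` raises before the query is ever built, but a tool whose analysis stops two or three calls deep cannot know that.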

Path feasibility. A path-insensitive analyzer might determine that tainted data can reach a dangerous operation along some execution path through the control flow graph. That path might be infeasible: a branch condition elsewhere in the function guarantees it cannot execute. Determining path feasibility requires constraint solving, which, in the general case, reduces to SAT. Tools that skip this step, which most do, will report the theoretical path as a vulnerability even when it cannot occur at runtime.
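A contrived sketch of an infeasible path: the tainted query can only be built when `use_literal` is true, and the sink can only be reached when it is false, so the flagged path requires contradictory branch outcomes and can never execute.

```python
import sqlite3

# Contrived illustration of an infeasible path. A path-insensitive
# analyzer merges both branch outcomes and reports taint from the
# f-string reaching cursor.execute; that path requires use_literal
# and not use_literal to hold simultaneously.
def fetch(cursor, raw: str, use_literal: bool):
    query = "SELECT * FROM items WHERE id = ?"
    if use_literal:
        query = f"SELECT * FROM items WHERE id = {raw}"  # tainted query built here
    if not use_literal:
        cursor.execute(query, (raw,))  # only the parameterized query reaches the sink
        return cursor.fetchall()
    return query
```

Proving the path infeasible means showing the conjunction of the two branch conditions is unsatisfiable, which is exactly the constraint-solving step most tools skip.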

The result of these constraints is what program analysis research calls an over-approximation: the set of reported vulnerabilities is a strict superset of the set of real vulnerabilities. This is not a failure mode. It is the mathematical structure of sound static analysis. Tools like CodeQL and Semgrep are building against this ceiling as much as any other constraint.

The Numbers

The practical consequence is well-documented. NIST’s SATE (Static Analysis Tool Exposition) program has been running structured evaluations of SAST tools since 2008. Their findings across editions consistently show median false discovery rates of 50% to 70% on real C and C++ codebases. The OWASP Benchmark, which tests detection on a synthetic Java web application with known vulnerabilities, shows most tools achieving true positive rates of 30% to 55% at a 50% false positive threshold, meaning that at the operating point where the tool catches half the real vulnerabilities, it also generates a false positive for nearly every true positive.
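To make that operating point concrete, here is the arithmetic under the assumption of a benchmark roughly balanced between vulnerable and safe test cases (the counts below are illustrative, not OWASP's actual totals):

```python
# Illustrative numbers, assuming a benchmark balanced between
# vulnerable and safe test cases.
vulnerable, safe = 1000, 1000
tpr, fpr = 0.50, 0.50  # the operating point described above

true_positives = vulnerable * tpr   # real findings reported
false_positives = safe * fpr        # noise reported
precision = true_positives / (true_positives + false_positives)
print(precision)  # 0.5: one false positive for every true positive
```

At 50% precision, a developer triaging the report is doing a coin flip on every finding.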

A 2023 study published in IEEE Transactions on Software Engineering evaluated CodeQL, Semgrep, Checkmarx, Fortify, and SpotBugs against a curated Java and C benchmark. The average false positive rate across tools was 67%. Semgrep’s syntactic rules performed better on precision but worse on recall; Checkmarx’s deeper dataflow engine caught more vulnerabilities but at a substantially higher noise floor.

These numbers correlate with what security teams report in practice. When developers cannot distinguish real findings from noise at better than a coin flip, they stop distinguishing. The tool gets tuned down, suppressed, or abandoned. The security signal is lost.

What Constraint Reasoning Does Differently

The approach Codex Security uses is not a better SAST tool; it asks a different question.

SAST asks: does this code match a pattern associated with vulnerability class X, where the pattern is expressed over an abstract representation of the program (AST, CFG, dataflow graph)? The answer is necessarily conservative because the pattern match must work under incomplete information.

Constraint-based reasoning asks: is there a concrete execution of this code, in this specific codebase context, where a specific security invariant is violated? The invariant might be “this SQL query parameter must be parameterized or sanitized before reaching the database driver,” and the constraint reasoning is over the actual call graph, the actual library semantics visible from the dependencies, and the actual data flow as the LLM understands it from reading the code.

Consider what this looks like in practice. A pattern-based tool sees:

```python
query = f"SELECT * FROM users WHERE id = {user_id}"
cursor.execute(query)
```

And fires unconditionally, because string interpolation into a SQL query matches the injection pattern. A constraint-based system would ask: where does user_id come from? If the call site is:

```python
@require_staff
def internal_health_check(request):
    report_id = config.get('default_report_id')  # integer from config file
    run_diagnostic_query(report_id)              # report_id becomes user_id in the query above
```

Then the constraint that user_id must be attacker-controlled for the injection to be exploitable is not satisfied. The finding is suppressed. The same pattern match that would fire in a SAST tool is recognized as non-exploitable in context.
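The suppression logic can be caricatured in a few lines. This is a schematic of the reasoning, not OpenAI's implementation; the names and the `TaintSource` structure are invented for illustration:

```python
from dataclasses import dataclass

# Schematic only: a pattern match becomes a finding only when every
# constraint on exploitability holds -- here, that the value reaching
# the sink originates from attacker-controlled input.
@dataclass
class TaintSource:
    name: str
    attacker_controlled: bool

def is_exploitable(pattern_matched: bool, source: TaintSource) -> bool:
    return pattern_matched and source.attacker_controlled

request_param = TaintSource("request.GET['id']", attacker_controlled=True)
config_value = TaintSource("config.get('default_report_id')", attacker_controlled=False)

assert is_exploitable(True, request_param)     # both approaches flag this
assert not is_exploitable(True, config_value)  # SAST flags; constraint reasoning suppresses
```

The real work, of course, is in establishing `attacker_controlled`, which requires reading the actual call graph and library semantics rather than matching a pattern.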

The tradeoff here is explicit: Codex Security is precision-optimized rather than recall-optimized. It accepts missing some real vulnerabilities in exchange for reporting only findings with sufficient confidence that developers trust them. The bet is that a smaller set of trusted findings produces better security outcomes than a large set of findings developers have learned to dismiss.

What Sound Analysis Gets Right That AI Does Not

Abandoning soundness has real costs. For some threat models, a recall-optimized tool with an honestly disclosed false positive rate is still preferable to a precision-optimized tool that misses novel vulnerability classes.

The categories where constraint-based AI reasoning struggles are the categories where SAST over-approximation is most valuable: race conditions in concurrent code, logic flaws with no pattern representation, and authentication failures that require understanding the application’s intended security model. An LLM analyzing code for SQL injection patterns will do that well; the same LLM analyzing whether a token refresh flow has a TOCTOU vulnerability in a distributed system is in harder territory.

There is also an auditing concern. SAST tools are deterministic and auditable. The same codebase run through CodeQL with the same query suite produces the same output, which is important for compliance workflows where reproducibility matters. A constraint-based AI system is probabilistic; two runs may disagree. For teams using security tooling to satisfy SOC 2 or PCI-DSS requirements, this is a non-trivial limitation.

The Design Statement

The decision not to include a SAST report in Codex Security is a claim about what security tooling is for. If the goal is coverage documentation, sound over-approximate analysis is the right tool: it proves you checked, even if the checking is noisy. If the goal is finding exploitable vulnerabilities that developers will actually fix, precision-optimized constraint reasoning has a stronger argument.

The industry has been running the soundness bet for twenty years. The adoption rates and suppression behavior suggest it has not produced the security outcomes it was supposed to. OpenAI’s architectural choice is a response to that record. Whether the constraint reasoning approach delivers on its precision claims across diverse languages, frameworks, and codebase sizes is what will determine whether it changes the tooling landscape or becomes one more approach teams quietly disable.
