Why SAST's False Positive Problem Is Fundamental, Not Fixable

When a SAST tool reports a vulnerability, it’s telling you something precise: that according to its model of code execution, a dangerous pattern exists. It is not telling you whether that pattern is reachable from user input, whether the input is constrained in a way that makes exploitation impossible, or whether some upstream validation mechanism renders the finding irrelevant. That distinction, between a code pattern and an actual vulnerability, is where the false positive problem lives, and it’s not fixable by writing better rules.

OpenAI’s post on why Codex Security doesn’t include a SAST report, published March 16, 2026, describes a deliberate design choice: skip taint analysis and pattern libraries, use AI-driven constraint reasoning and validation instead. The argument is that this approach finds real vulnerabilities with fewer false positives. Understanding what makes that claim credible requires stepping back from tool comparisons into the theory of what static analysis actually computes.

What Taint Analysis Computes

Most serious SAST tooling beyond grep-level pattern matching is built around taint analysis. The core idea: mark sources (user-controlled input) and trace data flow through the program to sinks (dangerous operations like SQL execution, file access, or eval). When tainted data reaches a sink without passing through a recognized sanitizer, report it.
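
The flow-tracking idea can be sketched in a few lines. This is a toy model over a straight-line program, not how any particular SAST engine is implemented; the statement format and the names SOURCES, SINKS, SANITIZERS, and find_taint_flows are all hypothetical.

```python
# Toy taint propagation over a straight-line program.
# Each statement is (target_variable, function_name, argument_names).
SOURCES = {"read_request_param"}   # calls that return user-controlled data
SINKS = {"db_execute"}             # calls that must not receive tainted data
SANITIZERS = {"escape_sql"}        # calls whose result is treated as clean


def find_taint_flows(statements):
    """Return the indexes of statements where tainted data reaches a sink."""
    tainted = set()
    findings = []
    for index, (target, func, args) in enumerate(statements):
        arg_is_tainted = any(arg in tainted for arg in args)
        if func in SINKS and arg_is_tainted:
            findings.append(index)
        if func in SOURCES:
            tainted.add(target)
        elif func in SANITIZERS:
            tainted.discard(target)
        elif arg_is_tainted:
            tainted.add(target)      # taint propagates through any other call
        else:
            tainted.discard(target)  # target was reassigned from clean data
    return findings


program = [
    ("user_id", "read_request_param", ["id"]),
    ("query", "format_string", ["user_id"]),
    ("rows", "db_execute", ["query"]),
]
sanitized_program = [
    ("user_id", "read_request_param", ["id"]),
    ("safe_id", "escape_sql", ["user_id"]),
    ("query", "format_string", ["safe_id"]),
    ("rows", "db_execute", ["query"]),
]
```

Here find_taint_flows(program) reports the db_execute call at index 2, and the sanitized variant yields no findings. The tracker knows only where data went, not what values it can hold.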

Consider this Python:

def get_user_record(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)

A taint analysis correctly flags this. user_id flows from an external parameter to a string interpolation feeding into db.execute. Now consider:

def get_user_record(user_id: UserID):
    query = f"SELECT * FROM users WHERE id = {user_id.value}"
    return db.execute(query)

Where UserID is:

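from dataclasses import dataclass

# frozen=True prevents value from being reassigned after validation.
@dataclass(frozen=True)
class UserID:
    value: int

    def __post_init__(self):
        # Dataclass annotations are not enforced at runtime, so check explicitly.
        if not isinstance(self.value, int) or self.value <= 0:
            raise ValueError("invalid user id")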

A taint-based SAST tool will typically still flag this. The tainted data still flows from an external call site through UserID construction into the query string. But it is not injectable. The value field is checked with isinstance and a range test on construction, and a positive int formatted into the query yields only digits, which cannot change the structure of the SQL. As long as nothing reassigns value after construction, the constraint on user_id.value makes the pattern inert.

The tool flags it anyway, because tracking that constraint is outside what taint analysis computes.
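
The difference can be checked against an in-memory SQLite database. The table contents and the helper name run_lookup are invented for illustration, and UserID is restated (as a frozen dataclass) so the block runs on its own.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen so the validated value cannot be reassigned
class UserID:
    value: int

    def __post_init__(self):
        if not isinstance(self.value, int) or self.value <= 0:
            raise ValueError("invalid user id")


def run_lookup(conn, id_fragment):
    # Same string-interpolation pattern as in the examples above.
    query = f"SELECT * FROM users WHERE id = {id_fragment}"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Unconstrained input: the payload rewrites the WHERE clause.
leaked = run_lookup(conn, "1 OR 1=1")

# Constrained input: the payload never gets past construction.
try:
    UserID("1 OR 1=1")
    rejected = False
except ValueError:
    rejected = True

# A value that passes validation can only select its own row.
single = run_lookup(conn, UserID(1).value)
```

The unconstrained call returns every row in the table, the payload is rejected when wrapped in UserID, and a validated ID returns a single row. The data flow is identical in both cases; only the set of possible values differs.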

The Soundness Trap

This gap is not a fixable bug in the tooling; it’s a consequence of what soundness requires.

A sound static analysis guarantees no false negatives: if a vulnerability exists, the tool finds it. To achieve soundness, the analysis must over-approximate. When it cannot prove that a path is safe, it assumes the path is unsafe. When it cannot prove that a type constraint holds at every call site, it assumes the constraint might be violated. Practical tools are rarely fully sound, but they inherit the same bias: when in doubt, report.
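
A minimal sketch of what over-approximation looks like at a branch, assuming a toy may-taint analysis. The function join and the variable names are hypothetical.

```python
def join(states):
    """Over-approximating merge: a variable is tainted if any branch taints it."""
    merged = {}
    for state in states:
        for var, is_tainted in state.items():
            merged[var] = merged.get(var, False) or is_tainted
    return merged


# Models:  if legacy_mode: value = raw_input
#          else:           value = sanitize(raw_input)
# The analysis cannot prove legacy_mode is always False, so it keeps both branches.
state_if_legacy = {"value": True}    # unsanitized branch
state_otherwise = {"value": False}   # sanitized branch

after_branch = join([state_if_legacy, state_otherwise])
```

After the merge, value is considered tainted even if the unsanitized branch never runs in practice. That is the conservative choice soundness demands, and it is also where a false positive is born.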

The alternative is to have the analysis prove safety before dismissing a finding, which means solving problems that do not scale to real codebases. Path explosion, pointer aliasing, indirect calls through function pointers or virtual dispatch, cross-module data flow: these make precise analysis computationally intractable for production code.
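
Path explosion is easy to underestimate. As a rough illustration, a function with n independent two-way branches has 2^n distinct paths. The name count_paths is hypothetical, and real programs add loops and calls that make the growth worse.

```python
from itertools import product


def count_paths(num_branches):
    """Count execution paths through num_branches sequential if/else statements."""
    # Each path is one choice of True/False per branch.
    return sum(1 for _ in product([True, False], repeat=num_branches))


small = count_paths(10)  # still cheap to enumerate

# 40 branches is too many to enumerate; compute the count directly instead.
large = 2 ** 40
```

Ten branches give 1,024 paths; forty give more than a trillion. A precise analysis would have to account for each one or find a way to merge them, and merging is where precision is lost.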

CodeQL, GitHub’s query-based analysis engine, is among the most sophisticated SAST approaches available. It builds a full relational model of code including data flow, control flow, and type hierarchy, then lets you query it with a Datalog-like language. It still produces false positives on the same class of problem: semantic constraints on values that are not encoded in the type system or visible to dataflow analysis.

False positive rates above 50% are commonly reported for SAST tools in production deployments, though the figure varies widely by tool, language, and rule set. Teams that deploy these tools at scale develop suppression workflows, triage queues, and custom exclusion rules to manage the noise. The tool finds things; humans decide what is real. For many security teams, that overhead is the dominant cost.

Constraint Reasoning as a Different Computation

What Codex Security describes as AI-driven constraint reasoning is structurally different from taint analysis. Instead of tracking where data flows, it reasons about what values can be. The distinction matters.

A constraint reasoner evaluating the UserID example would track not just that user_id.value reached the query, but that it must be a positive integer given the validation in __post_init__. It then evaluates the sink: a positive integer interpolated into this SQL string cannot produce injection, because integers do not contain SQL metacharacters. The finding is suppressed.
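
That reasoning step can be approximated by hand. The sketch below is a hand-written stand-in, not a description of how Codex Security works internally; the constraint labels, the SQL_METACHARACTERS set, and both function names are assumptions made for illustration.

```python
import string

# Characters that can change the structure of a SQL statement when
# interpolated unquoted. Illustrative, not exhaustive.
SQL_METACHARACTERS = set("'\";-/*()= ")


def possible_characters(constraint):
    """Return the characters str(value) can contain under a value constraint."""
    if constraint == "positive_int":
        return set(string.digits)
    if constraint == "any_string":
        return set(string.printable)
    raise ValueError(f"unknown constraint: {constraint}")


def injection_possible(constraint):
    """True if a value satisfying the constraint could carry a SQL metacharacter."""
    return bool(possible_characters(constraint) & SQL_METACHARACTERS)


raw_finding = injection_possible("any_string")       # unconstrained parameter
userid_finding = injection_possible("positive_int")  # value validated by UserID
```

The unconstrained parameter remains a finding, while the validated one does not. The data flow is the same in both cases; the verdict changes because the question is about the set of possible values rather than the path they travelled.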

This is closer to what a security engineer does during manual code review. They do not just look for patterns; they ask whether the conditions for exploitation are satisfiable given everything they know about the code.

The formal methods analog is SMT solving. Tools like KLEE (symbolic execution over LLVM bitcode), angr (binary analysis framework), and Triton (dynamic symbolic execution) do something related: they encode program state as logical formulas and use solvers like Z3 to determine whether a dangerous state is reachable. The approach is precise. It is also expensive. Symbolic execution on non-trivial programs hits path explosion quickly and requires substantial engineering effort to scale to production codebases.

AI constraint reasoning occupies a middle position. It does not enumerate paths formally, but it reasons about constraints semantically, drawing on a trained understanding of how code behaves. Skipping path enumeration is what lets it scale better than symbolic execution. It also approximates in the opposite direction from SAST: rather than over-approximating (flag everything uncertain), it reasons toward a conclusion about whether exploitation is plausible given the observable constraints.

The Validation Step

The second piece Codex Security emphasizes is validation. After identifying a potential vulnerability, the system attempts to construct a concrete proof of exploitability, confirming that the conditions for it are actually satisfiable.

This mirrors the distinction between static and dynamic analysis. DAST tools like OWASP ZAP find real vulnerabilities because they test against a running application. The tradeoff is coverage: DAST only reaches code paths exercised during testing, missing anything not triggered by the test harness.

AI validation attempts to combine the useful property of each approach: analyze statically across the full codebase, but filter findings through an exploitability check before surfacing them. If the analysis identifies a potential injection point, the validation step asks whether a concrete payload could be constructed that would reach and exploit it. If the answer is no, the finding is suppressed. This is where the reduction in false positives comes from: the initial scan is not more conservative, but what it finds must survive that check before anyone sees it.
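
The filter-then-report shape can be sketched as follows. This is a deliberately small model under stated assumptions, not the validation procedure Codex Security uses: the payload list, the two handlers, and is_exploitable are hypothetical, and "exploitable" is reduced to "a single-row lookup returned extra rows".

```python
import sqlite3

PAYLOADS = ["1 OR 1=1", "0 UNION SELECT id, name FROM users"]


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn


def lookup_raw(conn, user_id):
    # Candidate finding 1: no constraint on user_id.
    return conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchall()


def lookup_checked(conn, user_id):
    # Candidate finding 2: same sink, but input must be a positive int first.
    if not isinstance(user_id, int) or user_id <= 0:
        raise ValueError("invalid user id")
    return conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchall()


def is_exploitable(handler):
    """True if any payload makes a single-row lookup return extra rows."""
    for payload in PAYLOADS:
        conn = make_db()
        try:
            rows = handler(conn, payload)
        except (ValueError, sqlite3.Error):
            continue  # payload was rejected or did not form a valid query
        if len(rows) > 1:
            return True
    return False


candidates = {"lookup_raw": lookup_raw, "lookup_checked": lookup_checked}
confirmed = [name for name, handler in candidates.items() if is_exploitable(handler)]
```

Both handlers would be flagged by a flow-based scan, but only lookup_raw survives the check. The weakness of the approach is visible too: a finding is dropped whenever none of the tried payloads works, which is not the same as proving that no payload exists.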

What You Trade Away

Dropping a SAST report has a cost worth naming clearly.

Soundness disappears. A tool that only reports high-confidence findings will miss things. The class of vulnerabilities that require complex symbolic reasoning to detect, or that are exploitable only under unusual but possible conditions, will be underreported. If your threat model includes sophisticated adversaries with time to construct non-obvious exploits, constraint reasoning alone leaves gaps.

SAST also produces findings that are useful even when not immediately exploitable: code quality issues, insecure defaults, deprecated API usage, and patterns that create future risk. A significant portion of what organizations do with SAST output is not purely about finding active vulnerabilities; it is about enforcing standards and catching drift before it becomes a liability. That use case largely disappears when the tool filters based on current exploitability.

The OWASP treatment of SAST has historically framed these tools as part of a defense-in-depth strategy rather than a standalone solution. Tools built on constraint reasoning position themselves differently: fewer findings, higher signal. Whether that tradeoff fits a given security program depends on where the review bandwidth actually goes.

The Larger Pattern

Codex Security’s approach follows a trajectory visible across the tooling landscape over the past few years. Semgrep added AI-assisted rule generation and triage. GitHub’s security features began incorporating model-based exploitability reasoning. Snyk started reasoning about reachability rather than just flagging CVEs. The direction is consistent: move from pattern detection toward exploitability reasoning, from noise toward signal, from generating a report to generating an answer.

The SAST approach encoded a hard problem in rules. The constraint reasoning approach encodes it in training data and inference. Neither is complete. But for the specific problem of false positive fatigue, which has caused organizations to ignore security tooling output wholesale, the AI-driven approach is addressing something real.

Whether it is addressing the right thing depends on which adversary you are defending against.