
Four Generations of SAST and the False Positive Problem That Outlasted Each One

Source: openai

The false positive rate in static analysis has been a documented problem for as long as the tools have existed. Production SAST deployments consistently generate between 50% and 90% false positives depending on configuration and codebase. Developers learn this quickly. They stop reading the alerts. At that point the tool provides a false sense of coverage while real vulnerabilities sit unreviewed in a pile of noise.

OpenAI’s Codex Security doesn’t produce a SAST report. The decision reflects an architectural position on what the false positive problem is and why it has persisted through four generations of static analysis tooling.

The Four Generations

Static analysis for security didn’t start sophisticated. First-generation tools were pattern matchers: grep for strcpy, flag it, move on. The false positive rate was enormous because the tools had no understanding of context. A strcpy call copying a short string constant into an adequately sized buffer isn’t a vulnerability, but first-generation tools couldn’t see that.
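The first-generation approach can be sketched in a few lines. This is an illustrative toy, not any shipped tool: a bare text pattern with no notion of what the arguments are, so it flags the safe and unsafe calls alike.

```python
import re

# Toy first-generation scanner: pure text matching, no context.
# The pattern and function names are illustrative.
DANGEROUS_CALL = re.compile(r"\bstrcpy\s*\(")

def scan(source: str) -> list[int]:
    """Return 1-based line numbers containing a strcpy call."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if DANGEROUS_CALL.search(line)
    ]

c_code = '''\
char buf[16];
strcpy(buf, "hello");        /* safe: constant shorter than buf */
strcpy(buf, user_input);     /* potentially unsafe */
'''

print(scan(c_code))  # [2, 3] -- both calls flagged, context ignored
```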

Second-generation tools introduced AST (Abstract Syntax Tree) analysis. Instead of matching text patterns, they parsed code into a tree and matched against structure. This eliminated some of the most obvious false positives: the tool now understood that strcpy(buf, "hello") wasn’t the same as strcpy(buf, user_input). But it still couldn’t follow data flow. If user_input was assigned to a local variable three lines earlier, many tools still couldn’t connect those dots.
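The structural improvement can be sketched using Python’s own ast module as a stand-in for a C parser (the check itself is illustrative). A literal source argument is skipped, but because the check sees only the call site, it still cannot tell a tainted variable from a clean one:

```python
import ast

# Second-generation-style structural check, sketched over Python's AST
# rather than C. It distinguishes a constant argument from a variable
# one, but does not follow assignments (no data flow).
def flag_calls(source: str, func_name: str = "strcpy") -> list[int]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name):
            src_arg = node.args[1]
            # A literal source can't carry attacker data; skip it.
            if not isinstance(src_arg, ast.Constant):
                findings.append(node.lineno)
    return findings

code = '''\
strcpy(buf, "hello")
strcpy(buf, user_input)
tmp = user_input
strcpy(buf, tmp)
'''

# Line 1 is correctly skipped; lines 2 and 4 are flagged, but the tool
# has no idea whether tmp is actually attacker-controlled.
print(flag_calls(code))  # [2, 4]
```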

Third-generation tools added taint analysis: a formalized model of data flow where “tainted” values (attacker-controlled data) are tracked as they move through a program. When a tainted value reaches a “sink” (a dangerous operation like a SQL query or a raw strcpy call), the tool flags it. Tools like Fortify SCA, Checkmarx, and later CodeQL all built around this model. Taint analysis reduced false positives meaningfully for the cases it could see.
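A minimal intraprocedural version of the model can be sketched as follows. The source and sink names are illustrative stand-ins for a real tool’s catalogue, and only simple straight-line assignments are handled:

```python
import ast

SOURCES = {"input"}     # functions returning attacker-controlled data
SINKS = {"execute"}     # functions where tainted data is dangerous

def find_tainted_sinks(source_code: str) -> list[int]:
    """Flag sink calls reachable by tainted names (toy, straight-line only)."""
    tainted: set[str] = set()
    findings: list[int] = []
    for stmt in ast.parse(source_code).body:
        # Propagate taint through simple assignments: x = input() or x = y.
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            target, value = stmt.targets[0].id, stmt.value
            if (isinstance(value, ast.Call)
                    and isinstance(value.func, ast.Name)
                    and value.func.id in SOURCES):
                tainted.add(target)
            elif isinstance(value, ast.Name) and value.id in tainted:
                tainted.add(target)
        # Flag sink calls whose argument is a tainted name.
        elif (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)
                and stmt.value.func.id in SINKS):
            arg = stmt.value.args[0]
            if isinstance(arg, ast.Name) and arg.id in tainted:
                findings.append(stmt.lineno)
    return findings

code = '''\
user_id = input()
alias = user_id
execute(alias)
execute("SELECT 1")
'''

print(find_tainted_sinks(code))  # [3] -- the aliased taint reaches the sink
```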

The fourth generation added interprocedural analysis: tracking taint across function call boundaries. A tainted value passed as an argument to a helper function, modified, and returned could now be followed. The analysis is computationally expensive and forces approximation tradeoffs, but it still improved precision. The false positive problem nonetheless persisted.
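One common way to make this tractable is function summaries: rather than re-analyzing a callee at every call site, the tool precomputes whether taint flows from parameters to the return value. A sketch of the idea, with illustrative function names:

```python
# Hand-written summaries for two helpers: True means "taint of the
# argument propagates to the return value". Names are illustrative.
SUMMARIES = {
    "normalize": True,    # returns a transformed copy of its argument
    "version":   False,   # returns a program constant, ignores argument
}

def returns_taint(func_name: str, arg_is_tainted: bool) -> bool:
    """Apply a callee summary at a call site."""
    # Unknown callees default to "propagates" -- a conservative choice
    # that trades false positives for fewer false negatives.
    return SUMMARIES.get(func_name, True) and arg_is_tainted

print(returns_taint("normalize", True))   # True: taint flows through
print(returns_taint("version", True))     # False: constant return value
print(returns_taint("mystery", False))    # False: argument was clean
```

The default for unknown callees is exactly where the library modeling gap discussed below enters: every function without a summary forces a guess.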

What None of These Solved

The structural problem in all four generations is the same: approximation rooted in incomplete library knowledge. Taint analysis requires enumerating taint sources and sinks. If a source or sink comes from a third-party library, the tool needs a model of that library’s behavior. Those models are wrong, incomplete, or missing entirely for most real codebases.

This library modeling gap is where false positives and false negatives both originate. If a library function sanitizes its input before returning, but the SAST tool doesn’t have a model for that, it will flag any downstream use of the returned value. If a library function introduces a vulnerability without a corresponding model, the tool will miss it. Neither problem yields to better pattern matching or more precise AST analysis. Both require semantic understanding of what the library actually does.
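The dependence on models can be made concrete with a toy check. Whether the same call site is a finding depends solely on whether the tool’s catalogue knows the callee sanitizes; the function names are illustrative:

```python
def is_finding(sink_arg_func: str, sanitizer_models: set[str]) -> bool:
    """A sink fed by sink_arg_func(...) is flagged unless the tool has
    a model saying that function returns clean data."""
    return sink_arg_func not in sanitizer_models

# Suppose lib.escape_sql genuinely neutralizes SQL metacharacters.
call = "lib.escape_sql"

print(is_finding(call, sanitizer_models=set()))               # True: false positive
print(is_finding(call, sanitizer_models={"lib.escape_sql"}))  # False: suppressed
```

The code at the call site never changed; only the tool’s knowledge did. That is why better parsing alone cannot close the gap.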

Interprocedural analysis partially addresses this for code the tool can read, but third-party libraries are often compiled or minified. Even when source is available, building and maintaining accurate models is an ongoing burden. The fourth generation reduced false positives compared to its predecessors, but the underlying cause remained.

The SARIF Standard and What It Represents

The SARIF standard (Static Analysis Results Interchange Format) is an OASIS JSON schema for representing static analysis output. GitHub’s code scanning pipeline ingests SARIF. So does Azure DevOps. The format is well-designed for what it is: a portable container for tool findings that integrates into CI/CD workflows and developer review tooling.

SARIF is built around a specific model of what security tooling produces: a list of potential issues with locations, rule IDs, and severity levels. It has fields for kind (open, review, pass), level (error, warning, note), and suppressions. The assumption baked into the schema is that the output is a report to be reviewed and triaged, and that not all findings will be real.
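A minimal SARIF 2.1.0 document, trimmed to the fields mentioned above, makes the assumption visible. The tool name, rule ID, and file path here are illustrative:

```python
import json

# Minimal SARIF 2.1.0 log with a single result. Note that kind, level,
# and suppressions are all part of the format itself: the schema expects
# findings to be triaged, not trusted.
sarif = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "ExampleSAST",
                            "rules": [{"id": "SQLI-001"}]}},
        "results": [{
            "ruleId": "SQLI-001",
            "kind": "review",          # open / review / pass ...
            "level": "warning",        # error / warning / note
            "message": {"text": "Possible SQL injection."},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "app/db.py"},
                    "region": {"startLine": 42},
                }
            }],
            "suppressions": [],        # triage state lives in the format
        }],
    }],
}

print(json.dumps(sarif, indent=2))
```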

Suppression mechanisms, baseline comparisons, and deduplication features exist within the SARIF ecosystem precisely because the false positive problem is treated as a workflow management problem rather than a tool accuracy problem. SARIF doesn’t make this worse, but the toolchain that has grown around it encodes triage as the normal operating mode for developers receiving security findings.

Constraint Reasoning as a Different Model

What Codex Security does is structurally different. Rather than tracking taint through code and flagging when it reaches a sink, constraint reasoning asks whether the conditions necessary for exploitation can actually be satisfied.

Consider a classic SQL injection pattern:

```python
query = f"SELECT * FROM users WHERE id = {user_id}"
db.execute(query)
```

Taint analysis flags this if user_id traces back to any attacker-controlled input. Constraint reasoning asks: what are the constraints on user_id at this execution point? If earlier in the call chain user_id is cast to int and validated against a range, the string interpolation still looks dangerous syntactically, but it cannot be exploited with SQL metacharacters because the value can only ever be an integer. Propagating that constraint through the call chain eliminates the finding before it is ever reported.
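A toy version of that propagation, to make the contrast with taint tracking concrete. The coarse string-valued constraint representation and the step names are illustrative, not how any production system encodes this:

```python
# Toy constraint propagation over the SQL injection example above.
# Instead of asking "is user_id tainted?", we ask what values it can
# hold by the time it reaches the sink.

def constraint_after(step: str, constraint: str) -> str:
    """Apply one call-chain step to the current value constraint."""
    if step == "int_cast":          # user_id = int(user_id)
        return "integer"
    if step == "range_check":       # e.g. assert 0 < user_id < 10**9
        return "bounded integer" if constraint == "integer" else constraint
    return constraint               # unknown steps leave it unconstrained

def exploitable_at_sink(constraint: str) -> bool:
    # SQL metacharacters require the value to contain arbitrary
    # characters; an integer-valued variable cannot carry them.
    return constraint not in ("integer", "bounded integer")

chain = ["int_cast", "range_check"]
c = "attacker-controlled string"
for step in chain:
    c = constraint_after(step, c)

print(exploitable_at_sink(c))  # False: the finding is never reported
```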

The same reasoning applies to buffer size constraints, format string arguments that are provably constant, and race conditions that require a specific interleaving that lock-ordering analysis shows cannot occur. The question in each case is whether an attacker can construct input that reaches the dangerous state, not whether code contains a pattern structurally resembling a dangerous state.

This is harder to compute than taint analysis. It requires reasoning about value ranges, type invariants, and broader program state. For general programs this is undecidable, and any practical implementation will have bounds on how deeply it can reason. But AI models bring something rule-based constraint solvers lack: world knowledge about APIs, common idioms, and library behavior accumulated from training. Where a SAST tool needs an explicit model entry to know that int(x) in Python eliminates non-numeric content, a model trained on extensive code can reason from what it knows about the type system and the function’s semantics without a hand-written rule.

Why the SARIF Workflow Doesn’t Fit

Producing SARIF output from a constraint reasoning system is technically possible, but SARIF’s architecture of suppression fields and severity levels exists to support triage of uncertain findings. When findings are validated through constraint reasoning before they surface to the developer, the triage step is largely done. The developer receives confirmed vulnerabilities rather than candidates.

The operational difference is material. A developer looking at a CI run with 200 SAST findings knows from experience that a large fraction are false positives. They spend time triaging, or they develop rules of thumb for ignoring categories of findings. Either path carries risk: triage is expensive, and categorical ignoring creates blind spots for real vulnerabilities in deprioritized categories. A report of 12 confirmed vulnerabilities requires no such triage.

The tradeoff falls on the false negative side. Constraint reasoning that cannot follow a code path will either conservatively flag the case or, depending on implementation, drop it. Completeness is harder to guarantee than it is for exhaustive pattern matching, and a tool that only reports confirmed findings can miss things that a noisier tool would have caught by accident. That tradeoff is worth acknowledging.

But the false positive problem stems from the underlying model, not from misconfiguration within it. Constraint reasoning changes the model rather than tuning parameters on top of the old one, which is why SARIF, designed around the triage workflow, has no natural role in what the new output requires.

Four generations of SAST tools made real progress, each meaningfully better than the last. Each also kept the same fundamental output: a list of possible issues for a human to review and filter. Constraint reasoning produces a list of confirmed issues, and that difference changes what the output format needs to be.
