
The Signal Contamination Problem: Why Combining SAST with AI Security Analysis Backfires

Source: OpenAI

OpenAI published an explanation of why Codex Security doesn’t include a SAST report alongside its AI-driven findings. The argument isn’t simply that AI is better than SAST. The specific claim is that combining SAST output with AI-based security analysis produces worse results than AI alone. That’s worth examining on its own terms, because it cuts against the natural intuition that more signals always improve coverage.

What SAST Reports Look Like in Practice

A SAST tool’s output is a function of its rule database applied to your code. The tool parses source into an abstract syntax tree, traces data flow through the control flow graph, and fires when user-controlled input reaches a dangerous sink without passing through a recognized sanitization function. The last phrase determines where most false positives originate: recognized sanitization.

Here’s a common example in Rust:

// `db` is assumed to be an async SQL client already in scope (e.g. a
// tokio-postgres connection); error conversion is elided for brevity.
async fn query_user(id: &str) -> Result<User, Error> {
    // Parsing to u64 rejects anything that isn't a plain integer.
    let parsed_id: u64 = id.parse().map_err(|_| Error::InvalidId)?;
    let sql = format!("SELECT * FROM users WHERE id = {}", parsed_id);
    db.query_one(&sql, &[]).await
}

The taint analysis trace is clear: id enters as a string parameter, flows through format!, and lands in a SQL call. A SAST tool that doesn’t specifically model str::parse::<u64>() as sanitization for numeric injection contexts will fire here. The parsed_id binding is a u64, making injection impossible, but the tool matches the pattern; it doesn’t reason about the semantics.

For custom validation functions, the situation is worse. A tool can only recognize sanitization it has been explicitly taught to recognize, and no rule database has been taught about your internal validate_user_id() helper.
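This limitation can be sketched as a toy rule engine. The model below is illustrative only, not any real tool's implementation: taint is cleared only if a call path passes through a sanitizer the rule database has been explicitly taught, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Toy model of SAST sanitizer recognition: taint is considered cleared
/// only when the call path includes a sanitizer from the rule database.
fn is_taint_cleared(call_path: &[&str], known_sanitizers: &HashSet<&str>) -> bool {
    call_path.iter().any(|f| known_sanitizers.contains(f))
}

fn main() {
    // The rule database knows about a couple of built-in sanitizers.
    let known: HashSet<&str> = ["str::parse::<u64>", "html_escape"].into_iter().collect();

    // A modeled sanitizer in the path: taint is cleared, no finding.
    assert!(is_taint_cleared(
        &["request_param", "str::parse::<u64>", "db.query"],
        &known
    ));

    // Your internal helper does the same job, but the tool has never been
    // taught about it, so the finding fires anyway: a false positive.
    assert!(!is_taint_cleared(
        &["request_param", "validate_user_id", "db.query"],
        &known
    ));
}
```

The asymmetry is the point: adding `validate_user_id` to the database fixes this one helper, but every new internal pattern requires the same manual teaching step.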

The Rule Coverage Ceiling

Semgrep handles this better than most tools, with a pattern language close to the source syntax and a large published registry. Teams can write custom rules for their internal patterns, and well-maintained Semgrep configurations can substantially reduce false positive rates. But this requires ongoing investment: new frameworks, new library versions, and new internal patterns all require rule updates.

CodeQL is more semantically expressive, with interprocedural taint analysis and a query language capable of modeling complex sanitization paths. A thorough CodeQL analysis is genuinely impressive technical work. It can also take hours on large codebases, and its accuracy on any specific stack still depends on how well the query library was written for that stack’s idioms.

The OWASP Benchmark Project puts documented numbers on this variance: false positive rates across major SAST tools range from roughly 30% to over 70% on its test suites, with the spread driven primarily by how well each tool’s rules cover the specific framework under test. This isn’t a bug in any particular tool; it’s a structural consequence of rule-based analysis applied to codebase diversity.

The Combination Problem

The Codex Security argument, as OpenAI frames it in the article, isn’t that SAST findings are wrong to make. The claim is that their noise profile makes them a liability when combined with AI-based security output.

The AI tool reasons about exploitability. It traces call graphs, checks whether authentication middleware intercepts before the vulnerable function executes, and verifies whether user-controlled data actually reaches the dangerous operation given full project context. When it surfaces a finding, it’s backed by constraint reasoning: the conditions required for the vulnerability to be exploitable have been checked and confirmed.

SAST findings operate on a different basis: they report patterns. A pattern match that the AI tool would rule out, having confirmed that the endpoint requires admin authentication, still appears in the SAST output, because SAST knows nothing about the authentication layer.

In a combined report, both kinds of findings land in the same list. The developer has to maintain separate mental models for what each source means in practice: that the AI findings are high-confidence and require action, while the SAST findings require manual verification. That’s a workable division in principle. Under deadline pressure, developers are less consistent about maintaining that distinction than the theory assumes.

The deeper problem is that confidence calibration is contagious in mixed-signal environments. A developer who investigates eight SAST findings in a row and finds six of them false positives carries that expectation forward, regardless of whether the ninth finding came from SAST or the AI layer. The signal quality of a finding is shaped partly by the noise profile of the findings surrounding it.

Alert Fatigue as a Systems Problem

Alert fatigue in security tooling has been observed and documented for long enough that its mechanism is well-understood. When a security tool produces a high volume of findings with a significant false-positive fraction, developers adjust by lowering the priority they assign to each finding. This is rational: if the expected value of investigating a finding is low, spending less time on it makes sense.

The problem is that this calibration is uniform. It applies to real findings as well as false positives. A developer who has learned that a tool’s findings are about 40% real will not switch to careful investigation when the rate temporarily hits 100% for a genuine critical vulnerability. They apply the prior built from past experience.
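That dynamic can be made concrete with a toy calibration model. This is a sketch under assumed numbers, not data from the article: the developer's trust is modeled as the posterior mean of a Beta distribution updated with each investigated finding.

```rust
/// Toy model of a developer's trust prior: a Beta(1, 1) uniform prior
/// updated with each investigated finding. Posterior mean of the
/// true-positive rate = (tp + 1) / (tp + fp + 2).
fn posterior_mean(true_positives: u32, false_positives: u32) -> f64 {
    (true_positives as f64 + 1.0) / ((true_positives + false_positives) as f64 + 2.0)
}

fn main() {
    // After eight investigations with six false positives, the expected
    // hit rate sits at 30%...
    let after_noise = posterior_mean(2, 6);
    assert!(after_noise < 0.35);

    // ...and one genuine critical finding barely moves it. The prior,
    // not the individual finding, decides how hard the developer looks.
    let after_real_ninth = posterior_mean(3, 6);
    assert!(after_real_ninth < 0.40);
}
```

The ninth finding shifts the estimate by only a few points, which is the mechanism behind "they apply the prior built from past experience."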

Introducing SAST output into a high-precision AI security tool retrains that prior toward a lower expected value per finding. The SAST findings don’t just add noise; they reprice the AI findings in the developer’s mental model.

The economics of this are asymmetric. A tool that surfaces fifteen high-precision findings per week, thirteen of which are real, builds a different developer response than a tool that surfaces sixty findings per week, twenty of which are real. The second tool has higher total detection but lower per-finding trust, and lower per-finding trust is what causes real vulnerabilities to be dismissed.
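The arithmetic behind that asymmetry is simple enough to write down. Using the hypothetical weekly counts from the paragraph above:

```rust
/// Per-finding precision: the fraction of surfaced findings that are real.
fn precision(real_findings: f64, total_findings: f64) -> f64 {
    real_findings / total_findings
}

fn main() {
    let high_precision_tool = precision(13.0, 15.0); // ~0.87 per finding
    let high_volume_tool = precision(20.0, 60.0);    // ~0.33 per finding

    // The second tool finds more real issues in absolute terms (20 vs 13)...
    assert!(20.0 > 13.0);

    // ...but each individual finding carries far less expected value, and
    // per-finding expected value is what developers calibrate on.
    assert!(high_precision_tool > 0.85);
    assert!(high_volume_tool < 0.35);
}
```

Total detection and per-finding trust pull in opposite directions here, and the article's argument is that the second quantity governs fix rates.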

What This Approach Accepts

The honest accounting of excluding SAST is that it accepts different gaps in exchange for better signal quality.

SAST tools, given a rule for a vulnerability class, find every instance of that pattern with high consistency. An AI system covers the same class probabilistically, with coverage varying by how well the training data represents the specific code idioms in the target codebase. For common, well-documented patterns like SQL injection or command injection in mainstream frameworks, SAST coverage can be essentially exhaustive within its file scope. The AI’s coverage of the same patterns is harder to characterize precisely.

There’s also an explainability difference. A SAST taint trace is mechanically verifiable: you can follow the data flow path from source to sink in the code and confirm the finding independently. A finding backed by AI constraint reasoning requires trusting the model’s analysis, which is less auditable by inspection. Security teams that have built review workflows around data flow evidence need to adjust how they approach AI-backed findings.

DARPA’s AI Cyber Challenge in 2024 and 2025 demonstrated real capability: AI systems finding and patching vulnerabilities in production open source code. The evaluations concentrated on characterized vulnerability classes. Coverage of subtle multi-step vulnerabilities, race conditions, and logic flaws without syntactic signatures remains an empirically open question as these tools deploy to production codebases.

The Behavioral Bet

The decision to exclude SAST is ultimately a bet about what behavior the tool will induce. A smaller set of high-confidence findings may generate better security outcomes than a larger set with wider confidence variance, even if the larger set has higher theoretical coverage. The coverage metric that matters is not “findings surfaced” but “real vulnerabilities fixed,” and the path from the first to the second runs through developer trust.

Whether that bet pays off depends on data that isn’t yet publicly available: how AI-based security tools perform on diverse production codebases at scale, what their false negative rates are on vulnerability classes that SAST would catch reliably, and whether the higher per-finding precision translates to higher fix rates. The architectural reasoning is sound; the empirical record is still being written.

What OpenAI has made explicit by publishing this explanation is that the question of what to exclude from a security tool’s output is as consequential as the question of what to include, and that the answer is not primarily a question about detection capability but about how developers behave when they encounter the output.
