Confident Findings, Invisible Scope: The Coverage Trade in AI Security Analysis
Source: openai
The conversation around Codex Security and why it skips a traditional SAST report has focused almost entirely on the false positive problem. That framing is understandable: SAST false positive rates are genuinely bad, and OpenAI’s case for AI-driven constraint reasoning rests on the claim that a trusted finding is more valuable than a large noisy report. That claim is correct as far as it goes. But precision and coverage are two separate properties, and collapsing them into a single conversation about noise obscures a different kind of risk.
What Coverage Means in Security Analysis
When a SAST tool runs against a codebase, its coverage is explicit by construction. CodeQL ships organized query suites: security-and-quality, security-extended, and a core security suite. Each query targets a specific CWE category. You can enumerate the list, identify which categories are in scope, and make a coverage claim to a security auditor: we ran the extended query suite, each of whose queries maps to a named CWE, from CWE-89 (SQL injection) to CWE-134 (format string), and here are the findings. Semgrep makes this even more explicit with named rule sets and registry IDs. The output is noisy, but the scope is documented.
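The enumerability of rule-based coverage can be sketched directly. The rule IDs and CWE assignments below are illustrative, not the contents of any actual CodeQL or Semgrep suite:

```python
# Illustrative sketch: a rule-based suite's coverage is a finite,
# inspectable list. Rule IDs and CWE mappings are made up for illustration.
QUERY_SUITE = {
    "py/sql-injection": "CWE-89",
    "py/command-injection": "CWE-78",
    "py/path-traversal": "CWE-22",
    "py/format-string": "CWE-134",
}

def coverage_claim(suite):
    """Produce the auditable scope statement a SAST run can make."""
    cwes = sorted(set(suite.values()))
    return f"Checked {len(suite)} queries covering {', '.join(cwes)}"

print(coverage_claim(QUERY_SUITE))
```

The point is not the specific rules but that the function exists at all: the scope statement is computable from the configuration before the tool ever runs.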
This matters for a specific class of use case. When a team needs to demonstrate security due diligence to satisfy a SOC 2 audit, a PCI-DSS requirement, or a customer security questionnaire, the auditable coverage claim is part of the deliverable. The report says: we checked for these vulnerability classes, here are the findings, here is how we triaged them. A false positive in that context is an annoyance. A coverage gap is a different kind of problem.
AI constraint reasoning has no rule list. When Codex Security analyzes code and reports no findings, the output conveys one of two things: either the code has no exploitable vulnerabilities in the classes the system reasons about, or the system’s reasoning did not surface a vulnerability that exists. From the output alone, you cannot tell which is true. There is no CWE mapping to check, no query suite to inspect, no coverage claim to evaluate.
The Validation Step and What It Proves
OpenAI’s approach describes not just constraint reasoning but validation: confirming that a potential vulnerability is reachable with attacker-controlled data before reporting it as a finding. This is the mechanism that reduces false positives, and it is similar in spirit to what symbolic execution does. KLEE, published at OSDI 2008, and Microsoft’s SAGE, used in Windows development through the Windows 7 cycle, both take the same approach: before reporting a finding, generate a concrete input that demonstrates exploitability. The result is near-zero false positives on the paths they report.
Validation confirms a positive finding with high confidence. It does not affect coverage. If the constraint reasoning step never generates a candidate finding for a given vulnerability class, the validation step has nothing to confirm. The precision benefit of validation applies only to candidates the reasoning step generates; it provides no signal about what that step misses.
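This asymmetry can be made concrete with a toy sketch. All function names below are hypothetical, and the "validation" is a simplified stand-in for concrete-input generation in the KLEE/SAGE sense:

```python
# Toy sketch of validate-before-report; all names are hypothetical.
# Validation turns a candidate into a confirmed finding by demonstrating
# that attacker-controlled data reaches the output. It can only run on
# candidates the reasoning step actually generated.

def render_banner(user_input):
    # Vulnerable path the reasoning step surfaced as a candidate.
    return "<div>" + user_input + "</div>"

def render_footer(user_input):
    # Equally vulnerable path it never proposed.
    return "<footer>" + user_input + "</footer>"

def validate(candidate, payload="<script>alert(1)</script>"):
    """KLEE/SAGE-style confirmation: a concrete input reaches the sink."""
    return payload in candidate(payload)

candidates = [render_banner]          # only what the reasoning step generated
confirmed = [f.__name__ for f in candidates if validate(f)]

assert confirmed == ["render_banner"]      # precise on what it saw
assert validate(render_footer)             # exploitable in principle...
assert "render_footer" not in confirmed    # ...but invisible to the report
```

No amount of improvement to `validate` changes what ends up in `confirmed`; only the candidate-generation step controls that.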
This is the structural asymmetry. SAST over-approximates: it reports too many things. AI constraint reasoning under-approximates: it reports only what the model is confident about, and the decision about what deserves confidence is made inside a system with no external specification. In formal terms, SAST is sound, or attempts to be. AI constraint reasoning is precise on what it reports but makes no formal completeness claim over the full vulnerability space.
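Stated as set containment, the asymmetry looks like this (a schematic, not a formal model of either tool):

```python
# Schematic: true vulnerabilities vs. what each approach reports.
# Class names are illustrative placeholders.
true_vulns = {"sqli", "ssti", "authz_bypass"}

# SAST over-approximates: aims to contain the truth, plus noise.
sast_report = true_vulns | {"fp_1", "fp_2", "fp_3"}

# AI constraint reasoning under-approximates: only high-confidence findings.
ai_report = {"sqli"}

assert true_vulns <= sast_report   # soundness goal: nothing in scope is missed
assert ai_report <= true_vulns     # precision: everything reported is real
assert true_vulns - ai_report      # ...but real vulnerabilities go unreported
```

Both containments can hold simultaneously, which is why "fewer false positives" and "no false negatives" are answers to different questions.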
Where Both Approaches Fail at Boundaries
Security vulnerabilities in modern architectures commonly span service boundaries: an authenticated user in service A stores crafted data, and service B processes it later without validation. Neither SAST nor AI constraint reasoning handles this case reliably, but they fail in different ways.
SAST fails because most tools don’t track taint across service call boundaries. The tool sees a REST API call with a string parameter and has no model of what the receiving service does with it. Some tools support custom taint sources and sinks to bridge this, but configuring them requires understanding both services and the serialization format in between. The failure is visible: the tool simply doesn’t flag cross-service paths because those paths don’t appear in the data flow graph.
AI constraint reasoning fails because the context window limits how much code can be analyzed at once. If service A and service B are in separate repositories, the analysis of service B is likely performed without the context of how service A populates the data. The model may reason correctly about the code it sees but have no basis for flagging the vulnerability because the taint source is outside its analysis scope. The failure is invisible: the model analyzes service B’s processing code, sees valid-looking input handling, and reports no findings.
The difference is diagnostic. SAST's failure is explicit: the cross-service path was not analyzed because there is no data flow rule for it, and you can see this in the tool's scope documentation. The AI's failure is implicit: a clean report gives no indication that the relevant context was absent.
```python
import jinja2

# Service A: stores user-controlled data (db is a shared database handle)
def save_template(user_input):
    db.execute("INSERT INTO templates (content) VALUES (?)", (user_input,))

# Service B: processes stored templates, analyzed separately
def render_report():
    template = db.fetchone("SELECT content FROM templates")
    # Server-side template injection if content is attacker-controlled
    return jinja2.Template(template['content']).render()
```
A SAST tool analyzing service B flags nothing because template['content'] is a database read, not a taint source. An AI system analyzing service B in isolation makes the same mistake. The difference is that the SAST tool’s scope limitation is at least enumerable.
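To illustrate what the custom taint-source bridging mentioned earlier buys you, here is a toy model of the decision a configured taint tracker makes. This is not any real tool's configuration syntax, just the shape of the check:

```python
# Toy model of custom taint configuration (illustrative names only).
# Someone who understands service A must declare that reads from the
# templates table carry attacker-controlled data; the tool cannot infer it.
TAINT_SOURCES = {"db.fetchone:templates"}
SINKS = {"jinja2.Template"}

def flags_path(source, sink):
    """A finding is reported only if the source was declared tainted."""
    return source in TAINT_SOURCES and sink in SINKS

# With the declaration, the SSTI path becomes visible; without it,
# the structurally identical path is silently out of scope.
assert flags_path("db.fetchone:templates", "jinja2.Template")
assert not flags_path("db.fetchone:audit_log", "jinja2.Template")
```

The configuration is the coverage claim: every entry in `TAINT_SOURCES` is something a human asserted about the system, and every absence is an enumerable gap.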
The Calibration Problem
There is also a question of calibration: how should developers weigh a clean AI constraint reasoning report? With SAST, a clean report on a covered CWE means the pattern was checked and not found; a vulnerability outside the rules is still possible, but the scope boundary is explicit. A clean AI report conflates two different situations: "analyzed and found not vulnerable" in classes the model understands well, and "effectively unanalyzed" in classes the model has less training signal on. The model's confidence is not uniformly distributed across vulnerability classes, and from the outside there is no way to know which classes are well covered.
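One hedged way to formalize the calibration question: by Bayes' rule, the residual risk after a clean report is P(vuln | clean) = p(1 − r) / (p(1 − r) + (1 − p)), where p is the prior probability of a vulnerability in a given class and r is the tool's detection rate for that class. The numbers below are illustrative, not measured:

```python
# Residual risk after a clean report depends on per-class detection rate.
# Prior and recall values are illustrative, not measured.
def residual_risk(prior, recall):
    """P(vuln present | no finding) via Bayes' rule."""
    miss = prior * (1 - recall)
    return miss / (miss + (1 - prior))

prior = 0.10  # assumed base rate for the class
well_covered = residual_risk(prior, recall=0.95)
poorly_covered = residual_risk(prior, recall=0.30)
# A clean report is strong evidence only where recall is high -- and for
# an AI system, per-class recall is exactly the undisclosed number.
```

With these numbers, the well-covered class drops from a 10% prior to well under 1%, while the poorly covered class barely moves. The same clean report carries very different evidential weight depending on a quantity the tool does not publish.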
CVE distribution data makes this concrete. Buffer overflows, use-after-free, and SQL injection are heavily represented in published vulnerability research. Hardware timing side channels, protocol-level authentication bypasses, and logic flaws in business workflows are underrepresented relative to their actual prevalence. A model trained on the public vulnerability corpus will have corresponding gaps in its confidence, and those gaps may not align with the gaps that matter most for any specific application.
Symbolic execution tools like KLEE sidestep this problem entirely: they reason from first principles about program state, so their coverage of a given code path does not depend on training distribution. An LLM-based system’s coverage of a vulnerability class correlates with how well that class was represented in training data, and that correlation is not disclosed in the tool’s output.
The Useful Framing
The precision argument for AI constraint reasoning is real. Developers ignoring SAST output is a documented problem, and a tool that produces trusted findings developers act on immediately has genuine value. The OpenAI post is right that a finding you ignore is worse than no finding at all, because the learned dismissal behavior extends to real vulnerabilities. Precision-optimized tooling addresses this directly.
But precision and coverage are orthogonal properties, and the current framing treats choosing precision as if it makes coverage questions irrelevant. A tool that reports three high-confidence vulnerabilities and a tool that reports zero findings are producing fundamentally different kinds of output, and the zero-findings case requires knowing the tool’s coverage scope to interpret correctly.
For teams running AI-based security analysis, the practical implication is that it pairs well with explicit coverage verification from elsewhere: a SAST tool with a well-defined query scope for documented compliance claims, periodic manual penetration testing for the adversarial creativity that neither approach models well, or a structured threat model that identifies which vulnerability classes are highest priority for the application. The confidence of a clean AI report is most useful when you have independent verification that the scope of what was analyzed is adequate. Without that, confident silence is its own kind of noise.