The false positive rate dominates every discussion of SAST tooling. OpenAI’s explanation of why Codex Security skips the SAST report focuses there too: constraint reasoning produces fewer spurious alerts than taint analysis, and fewer alerts means developers actually act on the ones that remain. That argument is sound. But SAST’s precision problem also runs in the other direction, and the same root cause that generates false positives generates false negatives. That side of the equation is considerably harder to measure because false negatives produce silence rather than noise.
The Binary Decision at Every Library Call
Every taint-based SAST tool faces an unavoidable decision at each function call it cannot fully model: propagate the taint through, or drop it.
Propagating taint through an unmodeled function is the sound choice. If the function passes user-supplied data to a dangerous operation internally, taint tracking continues correctly. But if the function sanitizes or validates the data before returning, the taint label persists incorrectly after clean data exits the function. The developer gets an alert on a flow that was already made safe. That is a false positive.
Dropping taint at an unmodeled function boundary avoids that noise. But if the function stores tainted data in a location that gets retrieved later, or passes it through to a dangerous sink the tool cannot see, the taint chain is severed. The downstream dangerous operation appears to receive clean input. No alert fires. That is a false negative.
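The two policies and their failure modes can be made concrete with a toy taint tracker. This is a hypothetical sketch, not any real engine's implementation; the `PROPAGATE_THROUGH_UNKNOWN` flag stands in for the choice each tool bakes into its analysis:

```python
from dataclasses import dataclass

@dataclass
class Value:
    data: str
    tainted: bool

# Hypothetical policy flag: what to do when the tracker hits a call
# it has no model for. Real tools bake one choice (or a per-library
# model) into the engine.
PROPAGATE_THROUGH_UNKNOWN = True

def call_unmodeled(fn, value: Value) -> Value:
    # The tracker cannot see inside fn, so it must guess whether the
    # return value is still attacker-controlled.
    result = fn(value.data)
    if PROPAGATE_THROUGH_UNKNOWN:
        # Sound but noisy: if fn actually sanitized, any downstream
        # alert is a false positive.
        return Value(result, tainted=value.tainted)
    # Quiet but unsound: if fn passed the data through, the taint
    # chain is severed and the sink sees "clean" input.
    return Value(result, tainted=False)

def sink(value: Value) -> bool:
    # Returns True when an alert would fire at a dangerous operation.
    return value.tainted

user = Value("'; DROP TABLE reports; --", tainted=True)
print(sink(call_unmodeled(str.strip, user)))  # True under propagation
```

Flipping the flag silences the alert for this flow and every other flow through an unmodeled call, safe or not; that is the whole trade in two branches.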
Tools like CodeQL and Semgrep address this by building explicit models of common libraries. CodeQL’s Python framework models include Django’s ORM, Flask’s request object, and the standard library. Semgrep’s Pro tier includes interprocedural taint models for common Java and JavaScript frameworks. These models encode the binary decision explicitly: django.db.models.query.QuerySet.filter() with parameterized inputs is declared a sanitizing operation for SQL injection purposes, so taint drops after it.
The models are genuinely valuable. They reduce false positives on covered frameworks substantially. But they create a sharp discontinuity: library calls that appear in the model are handled with meaningful accuracy, and library calls that do not appear in the model are not. That boundary is finite, actively maintained by a relatively small number of contributors, and systematically concentrated in the frameworks with the highest industry adoption at the time the model was written.
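The discontinuity is visible in the shape of the lookup itself. A hypothetical sketch of such a model database (the entries and names are illustrative, not CodeQL's or Semgrep's actual format):

```python
# Hypothetical model database mapping fully qualified call names to
# the vulnerability classes they sanitize. Entries are illustrative.
SANITIZER_MODELS = {
    "django.db.models.query.QuerySet.filter": {"sql-injection"},
    "markupsafe.escape": {"xss"},
}

def taint_survives(call_name: str, vuln_class: str,
                   default_propagate: bool = True) -> bool:
    model = SANITIZER_MODELS.get(call_name)
    if model is not None:
        # Modeled call: a meaningful, per-vulnerability answer.
        return vuln_class not in model
    # Unmodeled call: fall off the cliff to the blanket default,
    # accurate or not.
    return default_propagate

print(taint_survives("django.db.models.query.QuerySet.filter", "sql-injection"))  # False
print(taint_survives("acme.internal.validate_category", "sql-injection"))         # True
```

Everything inside the dictionary gets a per-vulnerability answer; everything outside it gets the same blanket default, which is exactly the boundary the section describes.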
The Recognized Sanitizer Asymmetry
Consider a team running Semgrep against a Python service with a custom input validation layer:
```python
def validate_category(value: str) -> str:
    allowed = frozenset({'summary', 'detail', 'audit'})
    if value not in allowed:
        raise ValueError(f"Invalid category: {value!r}")
    return value

def run_report(conn, user_input: str):
    category = validate_category(user_input)  # custom sanitizer
    return conn.execute(
        f"SELECT * FROM reports WHERE category = '{category}'"
    )
```
The taint flow is user_input (tainted) → validate_category() (unmodeled) → conn.execute() (dangerous sink). Semgrep’s taint mode propagates taint through unmodeled calls by default. Unless the team has written a Semgrep rule explicitly declaring validate_category as a sanitizer, the SQL call fires an alert. The security constraint is satisfied by construction: category can only be one of three safe string literals. The tool cannot see this.
The fix requires writing a custom rule declaring the function as a sanitizer. That rule needs to be updated when the function is renamed, when the validation logic moves to a different module, or when the service adds a second validation helper for a different purpose. The maintenance burden scales with the quantity of internal validation code.
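The safety-by-construction property the tool misses is easy to demonstrate at runtime: any value outside the allow-list raises before the SQL string is ever built.

```python
def validate_category(value: str) -> str:
    allowed = frozenset({'summary', 'detail', 'audit'})
    if value not in allowed:
        raise ValueError(f"Invalid category: {value!r}")
    return value

print(validate_category("summary"))  # passes through unchanged

try:
    validate_category("x' OR '1'='1")  # an injection attempt
except ValueError as exc:
    print(exc)  # rejected before the query is ever constructed
```

No string that is not one of the three literals can reach the f-string interpolation, which is the constraint the alert fails to account for.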
The inverse problem produces false negatives. If the same team caches user-supplied values through an in-house Redis wrapper that SAST treats as a safe retrieval source:
```python
import json

def get_report_config(user_id: int) -> dict:
    # retrieves data originally supplied by the user during setup,
    # stored in Redis as JSON by a separate ingestion service
    return json.loads(redis_client.get(f"config:{user_id}"))

def generate_report(conn, user_id: int):
    config = get_report_config(user_id)
    # config["template"] was user-supplied; SAST doesn't know this
    return conn.execute(
        f"SELECT * FROM templates WHERE name = '{config['template']}'"
    )
```
Here, the taint chain is broken at redis_client.get(). That call looks to SAST like a database read, returning clean data from a known source. The user-supplied origin of config["template"] is invisible. The SQL interpolation below it produces no finding.
The OWASP Web Security Testing Guide describes multi-hop injection as a documented vulnerability class. The stored injection pattern specifically exploits this kind of analysis gap: malicious content is stored in a system SAST treats as a clean source, retrieved later, and used in a dangerous operation. SAST’s taint model systematically misses these because its architecture treats external stores as trust boundaries by default.
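Whatever the analysis sees or misses, this multi-hop gap closes at the sink if the query is parameterized: the template name then travels as bound data, never as SQL text. A minimal sqlite3 sketch of that fix (the schema and function name are illustrative):

```python
import sqlite3

def fetch_template(conn, template_name: str):
    # The placeholder keeps template_name out of the SQL text, so its
    # origin (Redis, user input, anywhere) no longer matters here.
    return conn.execute(
        "SELECT body FROM templates WHERE name = ?",
        (template_name,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE templates (name TEXT, body TEXT)")
conn.execute("INSERT INTO templates VALUES ('weekly', '<h1>Report</h1>')")

# A hostile value pulled from the cache is inert as a bound parameter.
print(fetch_template(conn, "weekly' OR '1'='1"))  # []
print(fetch_template(conn, "weekly"))             # [('<h1>Report</h1>',)]
```

Parameterization is the defense that holds regardless of which analysis architecture is watching, which is why it remains the baseline recommendation for stored-injection patterns.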
The Library Gap at Scale
The modeling gap widens predictably with codebase age and internal infrastructure complexity. Young services using standard frameworks land almost entirely within the modeled surface. Services that have accumulated internal libraries, proprietary ORMs, or non-standard caching and messaging layers push increasingly outside it.
For Go and Rust codebases, the problem is more severe because the SAST rule database is thinner to begin with. gosec covers a fixed set of patterns in the Go standard library and major database drivers; it has limited models for the broader ecosystem. Rust’s SAST tooling is thinner still, and the language’s memory safety guarantees mean the tools that do exist focus on application-layer concerns, exactly where rule databases have the least coverage.
The Veracode State of Software Security research documents that developers fix roughly 56% of flaws identified by static analysis. That rate is partly a consequence of the false positive noise covered in the earlier posts. But it also reflects findings that are real in the abstract yet unexploitable in context; mixed in with the false positives, these train developers to underweight output from the same tool that is quietly missing flows through unmodeled infrastructure.
How Constraint Reasoning Addresses Both Sides
The LLM-based constraint reasoning in Codex Security does not operate on a recognized-versus-unrecognized binary at library boundaries. It reasons about what a function does by reading its code and inferring what invariants it establishes on its return value.
For the custom validate_category() function: the system reads the function body, observes the frozenset membership check, infers that the return value is constrained to three specific safe string literals, and evaluates whether that constraint satisfies the security requirement at the SQL sink. No rule has to be written declaring the function as a sanitizer. The semantic analysis is the declaration.
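A toy version of that inference fits in a few lines of AST walking. This is a deliberately narrow sketch of the idea, nothing like the LLM's actual mechanism: it recognizes only the specific pattern of a literal frozenset membership guard that raises on failure.

```python
import ast
import textwrap

SOURCE = textwrap.dedent("""
    def validate_category(value: str) -> str:
        allowed = frozenset({'summary', 'detail', 'audit'})
        if value not in allowed:
            raise ValueError(f"Invalid category: {value!r}")
        return value
""")

def infer_return_constraint(src: str):
    # Recognizes one narrow pattern: a literal frozenset plus a
    # `not in` guard that raises. When both are present, the return
    # value is constrained to the literal members.
    fn = ast.parse(src).body[0]
    literals, guarded = None, False
    for node in ast.walk(fn):
        if (isinstance(node, ast.Call)
                and getattr(node.func, "id", "") == "frozenset"
                and node.args and isinstance(node.args[0], ast.Set)):
            literals = {e.value for e in node.args[0].elts
                        if isinstance(e, ast.Constant)}
        if (isinstance(node, ast.If)
                and isinstance(node.test, ast.Compare)
                and any(isinstance(op, ast.NotIn) for op in node.test.ops)
                and any(isinstance(s, ast.Raise) for s in node.body)):
            guarded = True
    return literals if guarded else None

print(infer_return_constraint(SOURCE) == {'summary', 'detail', 'audit'})  # True
```

The point of the sketch is the shape of the question: the constraint is derived from the function body itself, so no external declaration has to exist or be kept in sync with it.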
For the Redis cache case: if the retrieval function is visible in the project scope, the system can follow the data lineage back through the cache and reason about whether values retrieved from it carry user-supplied content in forms that could be dangerous at the downstream operation. The question changes from “is this call in the model” to “what does this call do.”
This addresses both failure modes through the same mechanism: reasoning about semantics rather than consulting a binary known/unknown classification. False positives from unmodeled sanitizers decrease because the system can evaluate what the sanitizer does. False negatives from trust-boundary assumptions decrease because the system can follow taint through intermediate storage when the storage mechanism is in scope.
The constraint reasoning framing in OpenAI’s article concentrates on the validation step as the source of precision improvement, which is accurate. But the interprocedural coverage improvement is the other half of the same story. Constraint reasoning does not just validate findings more precisely; it finds a class of multi-hop flows that taint-based analysis would have quietly dropped.
The Coverage Transparency Problem
There is a measurement asymmetry that favors SAST from a security program perspective, and it is worth naming directly.
When SAST misses a vulnerability because the taint flow crosses an unmodeled library boundary, the tool at least has a defined scope. You can enumerate what the model covers, identify libraries that are not covered, and make a reasoned judgment about the gap. CodeQL’s query documentation specifies which framework sinks and sources are modeled. The boundary of what is checked is, in principle, knowable.
When AI-based constraint reasoning misses a vulnerability, the failure mode is less transparent. The model does not produce a manifest of which library calls it analyzed accurately versus where it relied on a plausible but incorrect semantic inference. A finding of zero vulnerabilities in a project with a complex custom caching layer might reflect accurate analysis or confident mischaracterization of what the layer does. From the outside, these produce identical output: silence.
This is not an argument against constraint reasoning. It is an argument for being precise about what the improvement buys. The false positive reduction is real and measurable by comparing findings to a ground truth. The false negative reduction for cross-boundary flows is plausible and likely real for well-understood library patterns; it becomes harder to verify empirically in exactly the parts of the codebase where SAST was also most likely to miss things.
Where This Leaves the Trade-off
The architectural difference between SAST and constraint reasoning is usually framed as a precision argument: fewer false positives means developers take findings seriously. That framing is accurate and important. But it understates the case by focusing on one side of a two-sided problem.
SAST produces false positives and false negatives from the same root cause: the binary library modeling problem. Every function call that is not in the model either propagates taint it should have dropped or drops taint it should have propagated. A tool that models more libraries reduces both error types, but the modeling effort scales with the diversity of the library ecosystem and degrades as internal codebases grow beyond the standard-library surface.
Constraint reasoning with project-wide scope addresses the root cause rather than expanding the model database. It reasons about library semantics rather than looking them up, which means its accuracy degrades more smoothly as the codebase grows less conventional, rather than falling off a cliff at the model boundary.
The false positives are what developers complain about in SAST postmortems. The false negatives are what appear in incident reports. Evaluating constraint reasoning only on the false positive dimension tells half the story, and it is the less consequential half.