
The Problem SAST Was Never Built to Solve


Why Rules Never Scale

Static application security testing (SAST) has existed as a formal discipline since at least the early 2000s. Tools like Fortify, Coverity, and their successors built entire product categories around a simple premise: compile a set of rules about dangerous code patterns, scan codebases at scale, and report matches. The pitch was compelling. The reality was a sustained struggle with false positives.

The core issue isn’t that the tools are poorly built. It’s that rules operating on syntax cannot reliably distinguish between code that is dangerous and code that merely looks dangerous. Consider a routine that queries a database:

def get_user(conn, user_id):
    # user_id is interpolated directly into the SQL string
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return conn.execute(query)

A SAST tool with a SQL injection rule will flag this. Now consider:

def get_user(conn, user_id: int):
    # user_id is passed separately; the driver handles escaping
    query = "SELECT * FROM users WHERE id = %s"
    return conn.execute(query, (user_id,))

Both queries involve user-supplied data reaching a SQL execution point. The first is exploitable. The second is parameterized and safe. A pattern-matching rule struggles to tell them apart without understanding the semantics of parameterized queries, the type system, and the specific database driver in use.

This is the false positive treadmill: every rule that catches real vulnerabilities also catches patterns that look similar but are safe. Tuning rules to reduce false positives means writing more rules, which means maintaining a larger rule set, which means slower scan times and ongoing maintenance burden for every new framework or language idiom that emerges.
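To make the treadmill concrete, here is a deliberately naive pattern rule in Python. This is an illustration, not any real tool’s rule syntax, and `ALLOWED_TABLES` is a hypothetical allowlist invented for the example. The rule cannot separate attacker-controlled interpolation from a lookalike that draws only from hardcoded constants:

```python
import re

# Toy syntactic rule (illustration only): flag any SQL string built
# with an f-string, since interpolation may carry untrusted input.
RULE = re.compile(r'f"(SELECT|INSERT|UPDATE|DELETE)\b.*\{')

exploitable = 'query = f"SELECT * FROM users WHERE id = {user_id}"'
# Same syntax, but the interpolated value comes from a hypothetical
# hardcoded allowlist, so this line is actually safe.
safe_lookalike = 'query = f"SELECT * FROM {ALLOWED_TABLES[name]}"'

for line in (exploitable, safe_lookalike):
    if RULE.search(line):
        print("ALERT:", line)
# Both lines are flagged: syntax alone cannot tell attacker-controlled
# data apart from a value drawn from a constant table.
```

Narrowing the regex to skip the allowlist case means encoding knowledge about that specific idiom, which is exactly the per-pattern maintenance burden described above.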

Taint Tracking and Its Limits

Modern SAST tools moved past simple pattern matching toward taint analysis, which tracks data flow from untrusted sources (user input, HTTP parameters, file reads) to dangerous sinks (SQL queries, shell commands, HTML output). Tools like CodeQL and Semgrep support this model to varying degrees.

Taint analysis is genuinely more powerful than syntactic rules. It can follow data through variable assignments, function calls, and some control flow branches. But it has structural limits that are difficult to engineer around.

First, taint analysis depends on having a correct and complete model of what constitutes a sanitizer. If a custom function strips HTML tags before interpolating into a template, the taint tracker needs to know that this function satisfies the required security constraint. Without that knowledge, it either propagates taint through the sanitizer (false positive) or trusts it blindly (potential false negative).
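A toy model makes the sanitizer problem visible. In this sketch (hypothetical names throughout, not any real tool’s internals), the tracker clears taint only for functions on its known-sanitizer list, so a custom wrapper that is perfectly safe still propagates taint:

```python
import html

def strip_tags(value: str) -> str:
    # Custom project-level sanitizer: escapes HTML before the value
    # is interpolated into a template. Safe, but unknown to the tool.
    return html.escape(value)

# Toy taint model: only functions on this list clear taint.
KNOWN_SANITIZERS = {"html.escape"}  # strip_tags is not in the model

def is_tainted_after(func_name: str, input_tainted: bool) -> bool:
    # Taint survives any function the model does not recognize.
    if func_name in KNOWN_SANITIZERS:
        return False
    return input_tainted

# Flow: request input -> strip_tags -> template sink
print(is_tainted_after("strip_tags", True))   # True: a false positive
print(is_tainted_after("html.escape", True))  # False: modeled sanitizer
```

Adding `strip_tags` to the list fixes this one case, but every codebase has its own wrappers, and blindly trusting them invites the opposite failure: a false negative when a wrapper is subtly broken.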

Second, inter-procedural analysis across large codebases is computationally expensive. Most tools make practical trade-offs: they analyze within modules, they ignore dynamic dispatch, they approximate across library boundaries. These approximations introduce both false positives and false negatives.

Third, taint tracking has no model of context. It doesn’t know that a parameter marked as tainted was already validated by an authentication middleware layer several stack frames earlier. The taint propagates regardless, because the tool has no way to reason about the semantic meaning of what that middleware does.
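The context problem looks like this in miniature (a hypothetical handler, not code from any real framework): the value is checked against the authenticated session upstream, but a context-free taint tracker still sees untrusted input reaching a query string:

```python
def auth_middleware(request):
    # Upstream check: user_id must belong to the authenticated session.
    if request["user_id"] not in request["session"]["allowed_ids"]:
        raise PermissionError("user_id not permitted for this session")
    return request

def handler(request):
    # By the time execution reaches here, user_id has been validated
    # several frames earlier, but a taint tracker that cannot reason
    # about what auth_middleware means still flags this line.
    return f"SELECT * FROM orders WHERE user_id = {request['user_id']}"

req = {"user_id": 7, "session": {"allowed_ids": {7, 8}}}
print(handler(auth_middleware(req)))
```

Whether the middleware’s check actually makes the interpolation safe is a semantic question about what `auth_middleware` guarantees, which is precisely what a flow-only model cannot answer.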

CodeQL’s QL query language lets security researchers write sophisticated dataflow queries that can encode some of this context, but doing so requires significant expertise. It shifts the work from the tool to the analyst, and the resulting queries are still operating within a rule-based framework that struggles with novel patterns.

What Constraint Reasoning Actually Means

OpenAI’s Codex Security approach is architecturally different from both syntactic rules and taint tracking, though it borrows ideas from formal verification and abstract interpretation.

The core idea in constraint reasoning is to track what must be true about a value at each point in a program’s execution. Rather than asking “does tainted data reach this sink,” it asks what constraints a value needs to satisfy at a given point, and whether those constraints can be verified given everything known about how the value was produced.

This framing connects to ideas from abstract interpretation, which was formalized by Patrick and Radhia Cousot in the late 1970s. Abstract interpretation proves properties about programs by analyzing them over abstract domains rather than concrete values. Instead of computing that a variable equals 42, you compute that it’s a positive integer, or that it came from a sanitized source. When the abstract value at a dangerous operation lacks the required property, you have a potential vulnerability.
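A minimal sketch of the idea, far simpler than the Cousots’ formalism: values are abstracted to provenance labels rather than concrete contents, combining values keeps the least trusted label, and the sink checks the label against its required property:

```python
from enum import Enum

class Abs(Enum):
    # A tiny abstract domain ordered by trust (illustration only).
    CONSTANT = 1    # program literal
    SANITIZED = 2   # passed through a verified sanitizer
    TAINTED = 3     # derived from untrusted input

def join(a: Abs, b: Abs) -> Abs:
    # Combining two values keeps the least trusted label.
    return max(a, b, key=lambda label: label.value)

def check_sink(label: Abs) -> str:
    # The sink's constraint: no tainted content may reach it.
    return "potential vulnerability" if label is Abs.TAINTED else "ok"

# Constant SQL text combined with raw user input stays tainted.
print(check_sink(join(Abs.CONSTANT, Abs.TAINTED)))    # potential vulnerability
# Constant SQL combined with a sanitized value meets the constraint.
print(check_sink(join(Abs.CONSTANT, Abs.SANITIZED)))  # ok
```

The hard part, of course, is assigning the right label to the output of arbitrary code, which is where the approach described next leans on a model rather than a hand-built lattice.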

The key contribution of an LLM-backed approach is the ability to reason about semantics that were never explicitly encoded in a formal grammar. An LLM can read a custom input validation function, understand what it does, and determine whether it satisfies the required security constraint for the context where the sanitized value is eventually used. It can also reason about framework-level invariants without needing those invariants to be manually encoded as rules. For example, the facts that Express’s req.params values are always strings, or that Django’s ORM parameterizes queries by default, are things a model with broad code exposure can apply without a rule author having written them down.

Here is a simplified view of the difference in reasoning:

# SAST taint analysis reasoning
# 1. user_data comes from request.args (tainted source)
# 2. user_data reaches conn.execute() (dangerous sink)
# 3. ALERT: potential SQL injection

# Constraint reasoning approach
# 1. At conn.execute(), the query argument must satisfy: no unsanitized user content
# 2. Trace backwards: what is the query argument's derivation?
# 3. user_data is passed via %s placeholder to a parameterized execute() call
# 4. Parameterized queries satisfy the constraint by construction
# 5. No vulnerability; constraint is met

The validation step that follows constraint analysis is equally important. Once a potential vulnerability is identified, the system reasons about whether it is actually exploitable given the full calling context. This is the stage that eliminates the most false positives: not just “tainted data reaches a sink” but “is there a reachable code path with no intervening constraint satisfaction that an attacker could exercise?”
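The reachability question can be pictured as a path search over a call graph (a hypothetical structure invented for this sketch): a finding is exploitable only if some path from an attacker-reachable entry point hits the sink without passing a node that satisfies the constraint:

```python
def exploitable(graph, entry, sink, satisfying_nodes):
    # Depth-first search that refuses to continue through any node
    # known to satisfy the sink's constraint (e.g., a validator).
    stack, seen = [entry], set()
    while stack:
        node = stack.pop()
        if node in seen or node in satisfying_nodes:
            continue
        if node == sink:
            return True
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

# Hypothetical call graph: handler reaches the sink two ways,
# one through validation and one around it.
g = {"handler": ["validate", "raw_path"],
     "validate": ["sink"],
     "raw_path": ["sink"]}

print(exploitable(g, "handler", "sink", {"validate"}))              # True
print(exploitable(g, "handler", "sink", {"validate", "raw_path"}))  # False
```

In the first case `raw_path` bypasses validation, so the finding stands; in the second, every path satisfies the constraint before the sink, so the alert is suppressed.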

The Trade-offs Worth Naming

This approach has genuine advantages for precision. It also has properties that SAST tools don’t share, and not all of those are favorable.

SAST tools are deterministic. Run CodeQL twice on the same commit and you get the same results. LLM-based analysis has variance in what it finds and how it explains findings. This matters for CI pipeline integration, for audit trails, and for regulatory compliance contexts where reproducibility is required.

SAST tools are also fast on subsequent runs with caching and incremental analysis. LLM inference at the scale of a large codebase has meaningful cost and latency implications, particularly for workflows that want security feedback on every pull request.

SAST rules are auditable in a way that LLM reasoning is not. A security team can read a Semgrep rule and understand exactly what it matches. Explaining why an LLM-based system flagged or did not flag something requires trusting an explanation rather than inspecting a specification. Even with chain-of-thought output, there’s no formal guarantee that the stated reasoning is the actual reasoning.

None of these are arguments for preferring false positives from syntactic rules. But they are real engineering trade-offs that any team integrating AI-based security analysis should account for before treating it as a wholesale replacement for existing tooling. Used alongside traditional SAST, it fills in the gaps that rule-based systems structurally cannot cover. Used as the sole mechanism, it introduces a different set of unknowns.

Where This Leaves Security Tooling

The tooling landscape for code security has been slowly moving toward semantic analysis for years. SAST tools have added machine learning layers, dataflow analysis, and framework-specific knowledge to improve precision. The step OpenAI is taking with Codex Security is to treat semantic understanding as the primary analysis mechanism rather than a post-processing layer on top of rules.

Whether this holds up across large, polyglot, legacy codebases with unconventional patterns is a question that real-world deployment will answer over time. The theoretical argument for why constraint reasoning should outperform rule-based SAST is sound. The practical questions are how well the system handles the long tail of code that doesn’t resemble common training distributions, and whether security and compliance teams can work with results that carry no formal SAST attestation. Moving beyond rule-based reports is a win for precision, but for teams with audit requirements that specifically mandate SAST coverage, it may introduce a gap that requires a different conversation.
