I’ve used Semgrep, SonarQube, Bandit, and a handful of others over the years. They all produce the same category of output: a ranked list of things that look like they might be wrong. Some of those things are wrong. A lot of them aren’t. The developer experience of working with a SAST report is fundamentally a triage process, not a remediation process.
OpenAI’s explanation of why Codex Security doesn’t generate this kind of report cuts to a structural issue that the security tooling industry has been working around for years: static analysis tools understand code syntax and data flow graphs, not code semantics. The distinction matters because it determines what the tool can and cannot reason about.
How SAST Actually Works
A SAST tool operates on a representation of your code, typically an AST combined with a control flow graph and a data flow graph. The analysis applies rules to this representation. For taint-based analysis, those rules look roughly like this: certain sources (HTTP request parameters, environment variables, database reads) produce tainted data; that taint propagates through the data flow graph; if it reaches a dangerous sink (a SQL query, a file path, a command execution) without passing through a declared sanitizer, flag it as a potential vulnerability.
This model handles a lot of real vulnerability classes, which is why SAST tools are useful. SQL injection, XSS, path traversal, command injection, and others all share the same structural signature: attacker-controlled data reaches a dangerous operation. The taint model captures this structure.
The problem is that “dangerous” depends on context in ways the taint model cannot represent. Consider a path traversal check:
```python
def get_user_file(filename):
    safe_name = filename.replace("../", "")
    return open(os.path.join(BASE_DIR, safe_name)).read()
```
The SAST tool sees a sanitizer call (replace) on a tainted value before it reaches an open() call. Depending on how the tool’s sanitizer rules are configured, it may or may not flag this. If the sanitizer is registered, the tool clears the taint and reports no issue. But the sanitization is incomplete: it doesn’t handle URL-encoded variants (%2e%2e%2f), null bytes, or Unicode normalization tricks. Whether this is a real vulnerability depends on what the calling code does with the filename, what the web framework does to URL parameters, and whether the underlying OS performs any normalization. The taint model doesn’t have the semantic machinery to reason about any of this.
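The incompleteness isn’t hypothetical. Even ignoring encodings entirely, a single left-to-right replace() can splice the forbidden sequence back together, because removing one occurrence of "../" can bring the surrounding characters into alignment:

```python
def naive_sanitize(filename):
    # Same single-pass removal as the snippet above.
    return filename.replace("../", "")

# "....//": removing the "../" at index 2 leaves ".." + "/" = "../" —
# the "sanitized" output is exactly the sequence the filter removes.
bypassed = naive_sanitize("....//etc/passwd")   # → "../etc/passwd"

# Encoded forms never match the literal pattern and pass through untouched.
encoded = naive_sanitize("%2e%2e%2fsecret")     # → "%2e%2e%2fsecret"
```

A declared-sanitizer rule in a taint engine has no way to know this; it clears the taint on any value that passed through the registered function.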
Second-Order Vulnerabilities
Second-order injection is where SAST breaks down most visibly. The classic case:
```python
def register_user(username):
    # Input sanitized here
    safe_username = escape_sql(username)
    db.execute(f"INSERT INTO users (username) VALUES ('{safe_username}')")

def generate_report(user_id):
    # This query is safe
    user = db.fetchone(f"SELECT username FROM users WHERE id = {int(user_id)}")
    # This one isn't - username came from the database but was originally user-controlled
    db.execute(f"SELECT * FROM audit_log WHERE actor = '{user['username']}'")
```
The second query in generate_report uses data from the database, not directly from user input. SAST tools typically don’t track taint across database boundaries. The data was written by the user, stored, and then retrieved; the tool sees it as an untainted database read. The vulnerability doesn’t appear in the output.
Catching this requires understanding that data stored in a database retains its origin properties when retrieved. That’s a semantic claim about what the system does, not a structural property of the code.
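The fix that satisfies the semantic constraint doesn’t care where the data came from: keep values out of the SQL text entirely. A sketch using the standard library’s sqlite3 (the users/audit_log schema here is hypothetical, standing in for whatever the real application uses):

```python
import sqlite3

def generate_report_safe(db, user_id):
    # Parameterized queries separate SQL text from data, so the stored
    # username's origin (user-controlled or not) no longer matters.
    cur = db.execute("SELECT username FROM users WHERE id = ?", (int(user_id),))
    username = cur.fetchone()[0]
    cur = db.execute("SELECT * FROM audit_log WHERE actor = ?", (username,))
    return cur.fetchall()
```

Even if the stored username is something like `x' OR '1'='1`, the second query matches only the literal string, because the value travels as a bound parameter rather than as query syntax.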
The same problem appears in deserialization vulnerabilities, where the danger depends on the deserializer and the class of the object being reconstructed, not on any data flow pattern visible in the call site. It appears in TOCTOU races, where the window between a security check and the operation it guards is a runtime property, not a syntactic one. These are vulnerability classes where pattern matching is structurally insufficient.
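For the TOCTOU case specifically, the problem is that nothing in the data flow distinguishes racy code from safe code. A minimal illustration (`write_if_absent` is a made-up helper, not from any library):

```python
import os

def write_if_absent(path, data):
    # TOCTOU: the check and the open are two separate system calls.
    # Another process can create `path` in the gap between them, and
    # no static pattern on this function's AST reveals that window.
    if not os.path.exists(path):          # time of check
        with open(path, "w") as f:        # time of use
            f.write(data)
```

The safe version collapses check and use into one atomic operation, e.g. `open(path, "x")`, which fails if the file already exists. Telling these two shapes apart requires knowing that `exists` and `open` race at runtime, which is a claim about the operating system, not about the code’s structure.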
What Constraint Reasoning Does Instead
Rather than tracking taint through a data flow graph, constraint-based analysis builds a model of what must be true for the code to be safe, then evaluates whether the actual code satisfies that model. For the SQL example, the constraint is: values interpolated into this query string must be either non-attacker-controlled or properly parameterized. The analysis reasons about whether that constraint holds, drawing on semantic understanding of what functions do, not just how data flows between them.
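To make the contrast concrete, here is a deliberately crude constraint check rather than a taint trace: instead of asking “did tainted data reach this sink?”, it asks “does this call satisfy the rule that SQL text must be a constant, with values passed as parameters?” This toy only recognizes calls whose method is named `execute` and is nothing like how an AI system actually reasons; it just shows the shape of a constraint-style check.

```python
import ast

def violates_query_constraint(source_code):
    """Flag .execute() calls whose first argument is not a constant string.

    Constraint being enforced: the SQL text itself must be constant;
    any runtime value must travel as a bound parameter instead.
    """
    findings = []
    for node in ast.walk(ast.parse(source_code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args
                and not isinstance(node.args[0], ast.Constant)):
            findings.append(node.lineno)
    return findings
```

Note what changed: this check is origin-blind. It flags the second-order query from the previous section just as readily as a first-order one, because the constraint is about what the code must guarantee, not about where the data came from.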
This is related in spirit to symbolic execution and formal methods tools like Facebook’s Infer, which has been used for production bug detection at scale. Infer works by computing pre- and post-conditions for code fragments and checking whether they compose correctly. The difference with AI-driven constraint reasoning is that it can leverage a broader base of semantic knowledge, including knowledge about what security-relevant functions do in natural language terms, without requiring explicit formal specifications or hand-written analysis rules.
For the incomplete sanitizer case, the analysis can ask whether replace('../', '') prevents all known path traversal techniques. An LLM-based system knows about URL encoding, null bytes, and Unicode normalization as concepts, and can reason about whether the sanitizer addresses them. A SAST tool with a declared sanitizer rule simply marks the taint as cleared.
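The semantic constraint for this example is “the resolved path must stay inside the base directory.” Enforcing that directly is far more robust than enumerating traversal encodings. One sketch of what the fixed code might look like (function name and error handling are illustrative):

```python
import os

def resolve_user_file(base_dir, filename):
    # Enforce the constraint itself: resolve the candidate path and
    # verify it remains under base_dir, rather than trying to strip
    # every known traversal encoding out of the input.
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, filename))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes base directory")
    return candidate
```

A constraint-reasoning analysis can recognize that this version discharges the safety obligation, while the replace() version does not, because it understands what realpath and the prefix check accomplish, not merely that some function touched the tainted value.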
The Precision-Recall Tradeoff
Traditional SAST tools optimize for recall: they want to surface every possible vulnerability, accepting a high false positive rate. This is a defensible choice. Missing a real vulnerability is costly; triaging a false positive is annoying but cheap. The report is designed to be reviewed by a security engineer who filters it down.
AI-driven analysis shifts the optimization toward precision: report only what the system is confident is a real vulnerability. This changes the output from a list to review to a set of findings to act on directly. There’s no triage step, but there’s also reduced visibility into what the tool checked and didn’t check.
The precision focus creates a coverage gap. SAST tools can enumerate the rules they apply and claim coverage over specific vulnerability classes. AI analysis doesn’t have a rule list. When it doesn’t report a vulnerability, you can’t easily tell whether the code is safe or the analysis simply didn’t model that vulnerability class. For a formal security audit, that distinction matters considerably.
Tools like CodeQL attempt to bridge this by making their query language auditable, so you can inspect exactly what analysis was performed and write new queries to extend coverage. AI-based systems face a different auditing challenge: explaining a finding in natural language is not the same as providing a formal proof that the reasoning is complete and correct.
What This Means in Practice
The Codex Security approach is better suited for developers who want to know what to fix, not what to review. For security engineers running comprehensive audits, the lack of a structured report and explicit coverage claims is a real limitation. These are different use cases, and conflating them produces either too much noise or false confidence in coverage.
The more interesting question is whether AI constraint reasoning can be made auditable over time. Formal verification tools like Lean and Coq show that machine-checkable proofs are achievable for software properties; the challenge is automating specification inference at scale. If AI systems can generate not just findings but verifiable reasoning traces, the precision advantage carries over without the auditability cost. That combination of high-confidence findings with transparent reasoning would represent something genuinely new in security tooling rather than a repackaging of existing techniques.
Static analysis tools have always had the same underlying constraint: they can see what the code looks like, but reasoning about what it does requires something more. AI systems that can close that gap will produce genuinely different results, and that starts with not treating a pattern-match report as equivalent to a vulnerability assessment.