The Formal Methods Problem That AI Security Analysis Finally Makes Tractable
Source: OpenAI
The Gap Between Theory and Practice
Security researchers have known for a long time that the right way to find vulnerabilities in code is not pattern matching. It is reasoning about what must be true at each point in a program’s execution, and checking whether those constraints are satisfied. This is not a novel insight from OpenAI. It is the basic premise behind symbolic execution, bounded model checking, and formal verification, all of which predate large language models by decades.
What those approaches share with Codex Security’s constraint reasoning is the underlying idea: instead of asking whether code matches a suspicious pattern, ask whether the security properties a piece of data must have can be proven given its derivation. What they share equally is a history of scaling problems that kept them out of routine developer workflows. The interesting question about the LLM-based approach is not whether constraint reasoning is theoretically sound. It is whether AI analysis sidesteps the specific computational problems that made the formal approach impractical.
What Symbolic Execution Actually Does
Symbolic execution, developed in the 1970s by James King and systematically engineered into tools like KLEE and Microsoft Research’s SAGE in the 2000s, runs a program with symbolic inputs instead of concrete values. Where a concrete execution of get_user(conn, 42) would evaluate user_id as the integer 42, a symbolic execution keeps user_id as an unconstrained symbol and tracks constraints as the program branches.
When the program hits a conditional like if user_id > 0, it forks: one path adds the constraint user_id > 0, the other adds user_id <= 0. At every sink, it checks whether the accumulated constraints on a value are consistent with exploitation. For SQL injection: is there a satisfying assignment of the input symbols that results in unsanitized user content reaching the query, given all the constraints accumulated along this execution path?
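The forking behavior described above can be sketched in a few lines. This is a toy illustration, not the API of KLEE or any real tool; the function and variable names are invented. Each branch splits the state, and each side records the constraint it assumed:

```python
# Toy symbolic forking: instead of evaluating a branch, split the state and
# record both outcomes as path constraints. Names are illustrative only.

def explore(branches, path=()):
    """Enumerate every path through a list of independent branch
    conditions, accumulating each path's constraint set."""
    if not branches:
        return [path]
    cond, rest = branches[0], branches[1:]
    taken = explore(rest, path + (cond,))              # branch taken
    not_taken = explore(rest, path + (f"not ({cond})",))  # branch skipped
    return taken + not_taken

# Two conditionals -> four paths, each with its own constraint set.
paths = explore(["user_id > 0", "role == 'admin'"])
for p in paths:
    print(" AND ".join(p))
```

At each sink, a real engine would hand one such constraint set to an SMT solver and ask whether any input assignment satisfies it while leaving unsanitized content in the query.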
This is constraint reasoning in the formal sense. It is also, in theory, exactly what you want for security analysis. It reasons about semantics, not syntax. It accounts for control flow. It can determine whether a particular parameterized query is safe without needing a rule author to encode the semantics of parameterized queries.
The problem is path explosion.
// A simple function with 10 conditionals produces up to 2^10 = 1024
// execution paths. A realistic codebase has millions.
void process(int a, int b, int c, int d, int e) {
    if (a > 0) { /* ... */ }
    if (b > 0) { /* ... */ }
    if (c > 0) { /* ... */ }
    /* ... */
}
Each conditional doubles the number of paths to explore. In a non-trivial program, the number of paths is astronomically large. Tools like KLEE use heuristics to prioritize interesting paths, and SAT/SMT solvers like Z3 handle the constraint satisfaction at each point efficiently. But even with aggressive pruning, symbolic execution stalls on real-world codebases. KLEE works well on self-contained modules and kernel utilities; it struggles on the Django app in front of a PostgreSQL database.
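The growth rate is worth making concrete. A minimal sketch (the helper name is invented) that enumerates taken/not-taken decisions the way a path-sensitive analysis must:

```python
import itertools

# Each independent conditional doubles the path count: n branches give
# 2**n paths. Enumeration is cheap at n = 10 and hopeless at the scale
# of a real codebase.

def count_paths(n_branches):
    # one path per combination of taken/not-taken decisions
    return sum(1 for _ in itertools.product([True, False], repeat=n_branches))

print(count_paths(10))  # 1024
print(2 ** 40)          # path bound for a mere 40 conditionals: over a trillion
```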
This is why symbolic execution never became a routine part of software development pipelines. It is the right tool for auditing a cryptographic library or verifying a critical firmware component. It does not scale to a web service with 200,000 lines of Python and dependencies on a dozen third-party packages.
Where SAST Fits In This Lineage
SAST tools like Semgrep, CodeQL, and Checkmarx are engineering compromises. They retain some of the formal methods vocabulary, particularly taint analysis, which tracks data from untrusted sources to dangerous sinks. But they abandon the path-sensitive constraint tracking that makes symbolic execution theoretically complete.
CodeQL’s QL query language lets you write dataflow queries that encode context, and the underlying Datalog-style analysis is sophisticated compared to simple pattern matching. But it is still fundamentally a rule system. Someone had to encode the knowledge that parameterized queries are safe. Someone had to model which functions count as sanitizers. The analysis is bounded by the completeness of the rule set, not by reasoning about the code itself.
Semgrep’s taint mode has the same character. You can write rules that track data flow and recognize sanitizers, but novel sanitizers in application code, custom validation logic that your team wrote three years ago, or framework idioms not covered by the public rule registry are invisible to the analysis. The tool has no way to reason about what an unfamiliar function does.
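The blind spot can be sketched directly. This is a caricature of rule-based taint checking, not Semgrep's actual implementation, and every function name in it is invented: safety is defined as "passed through a sanitizer on the rule author's list," so an in-house validator the list does not know about leaves safe code flagged.

```python
# Toy rule-based taint check: a value is "safe" only if it flowed through
# a sanitizer the rule author enumerated. All names are illustrative.

KNOWN_SANITIZERS = {"escape_sql", "quote_param"}

def is_flagged(flow):
    """flow: ordered list of function names the value passed through."""
    return not any(fn in KNOWN_SANITIZERS for fn in flow)

# Covered by the rule set: correctly not flagged.
print(is_flagged(["request.GET", "escape_sql", "conn.execute"]))        # False

# Your team's custom validator, written years ago: invisible to the
# rules, so this safe flow is still reported as a finding.
print(is_flagged(["request.GET", "validate_order_id", "conn.execute"]))  # True
```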
This produces the false positive treadmill that anyone who has run SAST in CI has experienced. The tool flags every database query involving user-supplied data, every template interpolation, every shell command that constructs strings at runtime, because distinguishing safe from unsafe requires semantic understanding the rule system does not have. Security engineers then spend time triaging noise, tuning rules, and writing suppressions, which takes time away from the vulnerabilities that actually matter.
What LLMs Change About This
The LLM-based approach in Codex Security is not symbolic execution and does not claim to be formally sound. It is making a different bet: that a model with broad exposure to code can perform approximate semantic reasoning accurately enough to be more useful than rules, without the computational intractability of path-sensitive analysis.
Where symbolic execution computes constraint satisfaction exactly but cannot scale, an LLM approximates it. It reads a custom sanitization function and infers whether it satisfies the security constraint required at the sink, not by formally proving it but by recognizing the pattern from training. It understands that Django’s ORM parameterizes queries by default not because a rule encodes this but because it has processed enough Django code to treat it as background knowledge.
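Consider what such a custom sanitizer might look like. The function below is hypothetical, written for illustration: no public rule registry lists it, so a rule system flags its output, but reading the body shows the returned value can only be a string of decimal digits, which discharges the constraint at a SQL sink by construction.

```python
# Hypothetical in-house validator, invented for illustration. A rule-based
# tool cannot classify it; reading the body shows its output alphabet is
# digits only, so the sink's constraint is satisfied by construction.

def validate_order_id(raw: str) -> str:
    value = int(raw, 10)  # raises ValueError on any non-numeric input
    if value <= 0:
        raise ValueError("order id must be positive")
    return str(value)     # canonical decimal digits, nothing else

print(validate_order_id("42"))  # prints 42
# validate_order_id("42; DROP TABLE orders") raises ValueError
```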
This trades soundness for tractability, which is a real trade-off. Formal symbolic execution gives you a proof certificate; LLM analysis gives you a confidence level. There will be vulnerabilities the model misses and safe code it flags. But the important comparison is not against the theoretical ideal of complete formal verification. It is against the alternative actually available: SAST tools with 30-70% false positive rates on real codebases, requiring constant rule maintenance and producing findings that developers learn to ignore.
The validation step that follows constraint analysis is what produces practical precision. Once a potential vulnerability is identified, the system reasons about exploitability given the full calling context. This filters findings to those with a plausible attacker-reachable path along which the required security constraint is never satisfied. It is the same question symbolic execution answers by exhaustive path enumeration, answered here by trained semantic inference.
# Taint analysis alone concludes: alert.
#   user_data (tainted) -> conn.execute() (sink)
# Constraint reasoning continues: at conn.execute(), is there an
# unsanitized user-controlled value in the query string?

conn.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# The %s placeholder passes user_id as a driver parameter, not as
# interpolated SQL. The constraint is satisfied by construction.
# Finding suppressed.
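For contrast, the same sink fails the constraint when the value is spliced into the query text with string formatting. A sketch, with an attacker-chosen input invented for illustration:

```python
# Unsafe counterpart: string interpolation makes the user-supplied value
# part of the SQL text itself, so the constraint at the sink cannot hold.

user_id = "1 OR 1=1"  # attacker-chosen input
unsafe_query = "SELECT * FROM users WHERE id = %s" % user_id
print(unsafe_query)   # SELECT * FROM users WHERE id = 1 OR 1=1
```

Here the attacker rewrites the WHERE clause, which is exactly the satisfying assignment a symbolic engine would search for.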
The Engineering Trade-offs
None of this means constraint reasoning is a complete replacement for existing tooling. The properties SAST tools retain are worth naming.
SAST analysis is deterministic. Run the same scan twice and get the same results. This matters for audit trails, for compliance workflows that require reproducible scan results, and for teams trying to track vulnerability trends over time. LLM inference has variance. Two runs on the same commit may produce slightly different findings, which complicates integration into workflows that treat security scan output as a stable signal.
SAST tools are auditable. A Semgrep rule is a human-readable specification of exactly what the tool detects. You can read it, understand it, and explain to a security auditor precisely what coverage it provides. An LLM-based system can produce chain-of-thought reasoning, but that reasoning is not a formal specification; it is the model’s best explanation of its own output, which may not fully reflect the underlying computation.
SAST rules also have a stable relationship to compliance frameworks. If your organization has an audit requirement for SAST coverage of OWASP Top 10 categories, a Semgrep or CodeQL scan produces attestable output. Constraint reasoning that skips a formal SAST report may require additional justification in regulated environments, regardless of its actual precision advantage.
For most teams, the right answer is probably both: traditional SAST for its determinism, auditability, and compliance footprint; AI-driven constraint reasoning for the class of vulnerabilities that require semantic understanding and that SAST rules structurally cannot catch. The Codex Security framing presents the absence of a SAST report as a precision feature, which it is. Whether it also introduces a compliance gap depends on the team’s obligations.
What This Points Toward
The direction here is clear: security tooling is moving toward semantic analysis, and the question is which mechanism for doing that analysis is practical at scale. Symbolic execution got the theory right in the 1970s and has remained too expensive for routine use ever since. LLMs offer a way to approximate the same reasoning at a cost that fits into developer workflows.
The unknowns are around the tail of atypical code. Well-studied frameworks and common patterns are likely to be handled well by a model with broad training. Legacy codebases with unusual conventions, internal DSLs, or heavy use of metaprogramming are harder to reason about, and there is no formal guarantee about how the model handles cases far from its training distribution. That is an empirical question, and the answer will emerge from deployment at scale over the next few years.
What the approach gets right is the diagnosis of why SAST’s false positive problem has been so persistent. It is not a bug in the tools; it is a consequence of using rules to approximate semantic reasoning. Fixing it requires better semantic reasoning, not better rules. Formal methods had the right idea. Making it practical required waiting for a different kind of analysis engine.