
Why Formal Methods Never Won the SAST Precision War


Traditional SAST tools have a false positive problem that the security research community identified and began working on long before LLMs entered the conversation. OpenAI’s post on why Codex Security doesn’t include a SAST report frames the decision around developer trust and precision, but the technical approach it describes, AI-driven constraint reasoning, follows in a lineage of tools that have been trying to solve the same problem since at least 2008. Understanding why those earlier tools couldn’t achieve broad adoption puts the current moment in better context.

What Symbolic Execution Actually Does

The core problem with SAST is path feasibility: a tool may flag a code path as vulnerable even when that path cannot execute at runtime, because checking whether a path is feasible requires solving constraints, and constraint solving at scale is expensive. The field developed symbolic execution as a direct response to this.
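A minimal illustration of the feasibility problem, using hypothetical code a naive pattern-matcher might still flag:

```python
def handle(user_id: int) -> str:
    if user_id > 100:
        mode = "admin"
    else:
        mode = "user"
    # A pattern-based SAST rule sees a path into the branch below and flags it.
    # But the path condition (user_id > 100) AND (user_id < 50) is
    # unsatisfiable, so the "dangerous" branch can never execute at runtime.
    if mode == "admin" and user_id < 50:
        return "DANGER"
    return mode
```

Proving that branch dead requires exactly the constraint reasoning the paragraph above describes: a tool that only matches syntax cannot rule the path out.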

Symbolic execution, rather than running code with concrete values, runs it with symbolic variables representing unknown inputs. At each branch, the engine records the constraint imposed by the branch condition. When a dangerous operation is reached, the engine asks a SAT or SMT solver whether the accumulated path constraints are satisfiable. If they are, the satisfying input is concrete evidence of exploitability. If not, the path is infeasible and no finding is reported.
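The mechanics can be sketched with a toy stand-in for the solver query. A real engine would hand the path constraints to an SMT solver such as Z3 rather than enumerate a domain; the brute-force search and all names below are illustrative only:

```python
def feasible(constraints, domain=range(-128, 128)):
    """Toy satisfiability check: return a witness input for which every
    accumulated path constraint holds, or None if the path is infeasible.
    (A real engine queries an SMT solver instead of enumerating.)"""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None

# Path A: branch "x > 100" taken, then branch "x < 50" taken.
# The conjunction is unsatisfiable, so no finding is reported.
dead_path = feasible([lambda x: x > 100, lambda x: x < 50])

# Path B: "x > 100" taken, then "x >= 50". Satisfiable; the witness
# value doubles as a reproducing test input for any finding on this path.
witness = feasible([lambda x: x > 100, lambda x: x >= 50])
```

The witness input is what gives symbolic execution its near-zero false positive rate: every report ships with an input that drives execution down the flagged path.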

KLEE, developed at Stanford and published in 2008, demonstrated this on LLVM bitcode. Its original paper reported 56 serious bugs across the applications it tested, including three in GNU Coreutils that had survived over 15 years of traditional testing and review. Its false positive rate on those bugs was effectively zero: every reported path came with a concrete test input that reproduced the issue. Microsoft’s SAGE system extended the approach to whitebox fuzzing, using symbolic execution to generate inputs that maximize code coverage, and ran continuously against Windows components through the Windows 7 development cycle.

Why Symbolic Execution Didn’t Become Standard Practice

The results from KLEE and SAGE were technically impressive. The tools did not achieve broad adoption. The reasons are worth examining because they are not accidents of timing or marketing.

State explosion is the fundamental constraint. The number of paths through a program grows exponentially with the number of branches: a single 1000-line function with 50 independent branch points has up to 2^50 paths, roughly a quadrillion, far beyond exhaustive exploration. Symbolic execution engines manage this through heuristic search strategies, memoization, and selective coverage goals, but the explosion is never fully tamed. Practical symbolic execution is effective on isolated modules or small programs; on a 500,000-line codebase, it requires significant engineering effort just to scope the analysis.
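The arithmetic behind the explosion is easy to check. The throughput figure below is an illustrative assumption, not a measured number:

```python
# Worst case: n independent branch points yield 2**n distinct paths.
branches = 50
paths = 2 ** branches

# Assume a (generous, hypothetical) engine that fully explores one
# million paths per second, solver queries included.
seconds = paths / 1_000_000
years = seconds / (60 * 60 * 24 * 365)

print(f"{paths:,} paths, roughly {years:.0f} years to enumerate")
```

Even at that optimistic rate, exhaustive exploration of one such function takes decades, which is why every practical engine prunes and prioritizes rather than enumerates.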

External calls compound this. Symbolic execution on LLVM bitcode can reason about code the compiler sees, but system calls, separately compiled library functions, and network interactions all represent boundaries where the analysis must either model the behavior explicitly or abandon symbolic tracking. Modeling glibc correctly is a research problem in its own right. Real applications make hundreds of external calls, each a potential analysis boundary.

The tooling burden is substantial. Running KLEE requires building a project with Clang to produce LLVM bitcode, writing driver programs that initialize symbolic inputs correctly, configuring memory limits and timeout budgets, and then interpreting solver output. This is not a workflow that integrates into a CI pipeline without significant investment. Teams that used it effectively tended to be dedicated security engineering groups with time to invest in the setup; it never became something a developer could run with a single command against arbitrary code.

Facebook’s Infer and the Compositional Approach

Facebook’s Infer took a different approach. Rather than exploring execution paths, Infer uses bi-abduction, a technique from separation logic, to automatically infer pre- and post-conditions for procedures. It analyzes code compositionally: each function is summarized, and summaries are composed to reason interprocedurally without exploring the full call graph. This makes it tractable on large codebases. Facebook ran it against Android and iOS code during CI and used it to find null dereference, memory leak, and resource handling bugs at production scale. Mozilla and Spotify adopted it for similar purposes.
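A toy sketch can illustrate the compositional idea. Infer’s actual summaries are separation-logic assertions inferred by bi-abduction, not hand-written dictionaries; everything below, including the function names and summary fields, is illustrative:

```python
# Each analyzed function gets a reusable summary. Callers are checked
# against callee summaries, so the callee body is analyzed exactly once
# regardless of how many call sites exist.
summaries = {
    # Hypothetical callee that dereferences its argument unconditionally.
    "get_name": {"derefs_param": True, "checks_null_first": False},
    # Hypothetical callee that guards the dereference with a null check.
    "safe_name": {"derefs_param": True, "checks_null_first": True},
}

def check_call(callee: str, arg_may_be_null: bool) -> list[str]:
    """Report a possible null dereference using only the callee's summary,
    never its body -- the essence of compositional analysis."""
    s = summaries[callee]
    if arg_may_be_null and s["derefs_param"] and not s["checks_null_first"]:
        return [f"possible null dereference passing into {callee}"]
    return []
```

Because each summary is computed once and composed at call sites, analysis cost grows roughly with the number of functions rather than the number of interprocedural paths, which is what made Infer viable at Facebook’s scale.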

Infer’s limitation is the class of properties it can check. Bi-abduction is particularly well-suited to memory and resource safety: null pointers, heap leaks, use-after-free. Security vulnerabilities involving taint propagation, input validation, and authentication logic require different formal machinery, and Infer’s analysis does not generalize to them without explicit specification work. Writing security specifications in separation logic is expert-level work that most teams are not staffed to do.

The pattern across both KLEE and Infer is the same: technically superior signal at higher operational cost. Each tool required specialized knowledge to deploy, generated results that required specialized knowledge to interpret, and produced coverage claims that required understanding the tool’s formal model to evaluate. The barrier was not the quality of the findings but the path to getting them.

What LLM Constraint Reasoning Adds to This History

The approach Codex Security describes sidesteps the scaling and specification problems that limited symbolic execution and Infer by replacing formal machinery with learned semantic knowledge.

A symbolic execution engine needs an explicit model of what cursor.execute(query) does to data if it is going to detect SQL injection. Someone must write that model, or the analysis produces no finding at the call site. An LLM-based constraint reasoner already knows that calling cursor.execute with an f-string containing unparameterized user input violates the security invariant for database queries, without any explicit specification, because that knowledge was encoded in its training. The same applies to path traversal sanitizers, shell command construction, deserialization gadgets, and the other patterns that SAST rules encode by hand.
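The cursor.execute pattern can be made concrete with sqlite3 from the standard library; the table, payload, and function names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")
cur = conn.cursor()

def lookup_unsafe(cur, name):
    # The flagged pattern: unparameterized user input spliced into an
    # f-string query, so input can rewrite the SQL itself.
    return cur.execute(
        f"SELECT secret FROM users WHERE name = '{name}'"
    ).fetchall()

def lookup_safe(cur, name):
    # The invariant: user input travels as a bound parameter and is
    # never interpreted as SQL text.
    return cur.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
leaked = lookup_unsafe(cur, payload)  # injected OR clause dumps every row
safe = lookup_safe(cur, payload)      # payload matches no literal name
```

A hand-written SAST rule, a symbolic model of cursor.execute, and a trained model all ultimately encode the same fact: the first function violates the invariant and the second does not.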

This removes the setup cost that prevented symbolic execution from reaching most development teams. The tradeoff is formal rigor. KLEE provides a concrete satisfying input as evidence for each finding; you can reproduce the bug with that input and verify the report mechanically. AI constraint reasoning provides a natural language explanation of why the finding is believed to be a real vulnerability, which is harder to verify and may be incorrect in ways that a formal proof would not be.

The concern about novel vulnerability classes is real. Symbolic execution finds bugs in classes it has never seen before, because it reasons from first principles about program state. LLM-based analysis extrapolates from its training distribution, and its confidence on vulnerability patterns underrepresented in training data is lower. Bug classes that were poorly documented or undiscovered when the model was trained represent a genuine gap that formal approaches do not share.

The Common Thread

The history of precision-oriented security analysis tools is a history of technically correct approaches that could not be delivered at acceptable operational cost. KLEE needed Clang and a custom driver. Infer needed a specialized build integration and expertise in separation logic. Both produced better signal than most SAST tools; neither became standard practice for the teams that most needed better tooling.

AI constraint reasoning is a bet that the barrier was the delivery mechanism rather than the underlying approach. If developers can get precision-oriented security analysis without configuration, formal specifications, or compilation instrumentation, the adoption curve changes even if the formal guarantees do not match what symbolic execution can provide. The security research community has argued for precision over recall for decades, in journal papers and tool evaluations that most development teams never read. The argument is not new; making it accessible might be.
