
The SAST Report Format Is an Admission of Uncertainty

Source: openai

When OpenAI published their explanation of why Codex Security doesn’t produce a SAST report, the interesting part wasn’t the AI angle. It was this: the report format itself exists because SAST tools cannot validate their own findings. The SARIF output, the severity tiers, the triage workflows, the suppression pragmas scattered through your codebase — all of that infrastructure exists to offload a discrimination problem that the tool created but couldn’t solve. Once you build a system that validates findings before surfacing them, the report format becomes the wrong artifact entirely.

That framing shifts how you read the decision. This isn’t “AI is better than SAST.” It’s that if you change the fundamental constraint — if findings are confirmed before they appear — then the entire downstream apparatus the industry built around SAST is solving the wrong problem.

How Taint Analysis Actually Creates the Problem It’s Famous For

Understanding why requires a brief look at how modern SAST tools work. The core technique is taint analysis: mark certain nodes in your code’s data flow graph as sources (HTTP parameters, environment variables, database reads), define sinks (SQL execution, shell commands, HTML output), and flag any path where tainted data reaches a sink without passing through a declared sanitizer.

CodeQL expresses this in a Datalog-style query language where you write explicit predicates about sources, sinks, and sanitizers. Semgrep has a taint mode with similar semantics. Both are genuinely sophisticated tools. The problem is structural, not one of implementation quality.

At every function call that isn’t explicitly modeled, the tool faces a binary decision: propagate taint forward (conservative, sound) or drop taint (unsound). If your codebase wraps database queries in a custom ORM, sanitizes inputs inside an internal validation library, or routes user data through a Redis layer before it reaches a SQL query, the tool either fires when it shouldn’t or misses what it should catch. The two failure modes share the same root cause. NIST’s SATE evaluations have measured false discovery rates of 50 to 70 percent across SAST tools on real codebases, consistently, across years of evaluation. The OWASP Benchmark shows that at the operating point where tools catch half of real vulnerabilities, they generate nearly one false positive per true positive.
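The binary decision described above can be sketched in a few lines. This is a toy model, not any real tool's implementation; the SOURCES/SINKS/SANITIZERS sets and the linear call chain are illustrative assumptions:

```python
# Toy sketch of the choice a taint analyzer faces at an unmodeled call.
SOURCES = {"request.args"}       # where taint originates
SINKS = {"cursor.execute"}       # where tainted data is dangerous
SANITIZERS = {"escape_sql"}      # declared cleansers
MODELED = SOURCES | SINKS | SANITIZERS

def propagate(call_chain, assume_tainted_at_unmodeled=True):
    """Walk a linear call chain; report whether taint reaches a sink."""
    tainted = False
    for fn in call_chain:
        if fn in SOURCES:
            tainted = True
        elif fn in SANITIZERS:
            tainted = False
        elif fn in SINKS:
            return tainted  # fire iff taint survives to the sink
        elif fn not in MODELED:
            # The structural dilemma: keep taint (sound but noisy) or
            # drop it (quiet but unsound). Neither is correct in general.
            tainted = tainted and assume_tainted_at_unmodeled
    return False

chain = ["request.args", "custom_orm.save", "cursor.execute"]
print(propagate(chain, assume_tainted_at_unmodeled=True))   # conservative: fires
print(propagate(chain, assume_tainted_at_unmodeled=False))  # unsound: silent
```

The unmodeled custom ORM call is exactly where the two failure modes fork: the same gap in the model produces either a false positive or a false negative depending on which policy the tool picks.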

This is also why the tuning spiral exists. Teams deploy a SAST tool, get overwhelmed by noise, add # nosec pragmas and // nosemgrep: rule-id comments and .semgrepignore entries, then wonder why real findings get missed. The suppression mechanism is load-bearing. It’s not a workaround. It’s how the tool is designed to be used.
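In practice the spiral looks like this. A minimal sketch of what accumulates in a codebase, assuming Bandit's `# nosec` syntax and Semgrep's `# nosemgrep` syntax; the rule ID and function are illustrative:

```python
import subprocess  # nosec B404 - Bandit: suppress the "subprocess import" warning

def restart_service(name: str) -> None:
    # nosemgrep: dangerous-subprocess-use  (hypothetical rule ID, suppressed inline)
    subprocess.run(["systemctl", "restart", name], check=True)
```

Each pragma is a human recording "I looked at this unvalidated finding and decided to ignore it" — which is the discrimination work the tool offloaded.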

What Constraint Reasoning Changes

The design premise behind Codex Security is different in kind, not degree. Instead of asking “does tainted data reach a dangerous sink?”, it asks “is there a concrete execution of this code where a specific security invariant is violated?” The invariant for SQL injection is something like: values interpolated into this query string must be either non-attacker-controlled or properly parameterized.

Consider a simple case:

# SAST: fires. String interpolation into SQL query.
query = f"SELECT * FROM users WHERE id = {user_id}"
cursor.execute(query)

A taint analyzer sees user_id as potentially tainted and an f-string into cursor.execute() as a dangerous pattern. But:

@require_staff
def internal_diagnostic(request):
    report_id = config.get('default_report_id')  # integer from application config
    run_internal_query(report_id)  # interpolates report_id into a SQL string

Constraint reasoning asks whether report_id can be attacker-controlled. It can read the config loading code, understand that default_report_id is a static integer set at deploy time, and determine that the invariant isn’t violated. Finding suppressed, not because a suppression pragma was added, but because the constraint was evaluated.

This is related to symbolic execution, which has a long lineage — James King’s 1976 work, Microsoft’s SAGE, Stanford’s KLEE. The fundamental power of symbolic execution is that it can report a concrete satisfying input as proof that a path is reachable. The fundamental weakness is path explosion: at N branch points you get 2^N paths, and real programs have enough branches that exhaustive exploration is infeasible. LLM-based constraint reasoning approximates this without the path explosion problem because it doesn’t enumerate paths — it reasons about invariants. The tradeoff is that it doesn’t produce a formal proof.
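The 2^N growth is easy to make concrete. A toy enumerator, under the assumption of N independent branch points, each path identified by its vector of branch outcomes (its path condition):

```python
from itertools import product

def symbolic_paths(n_branches: int):
    """Yield each path as a tuple of branch outcomes — one per combination."""
    yield from product((True, False), repeat=n_branches)

# Worst-case path counts a symbolic executor must cover:
for n in (4, 10, 20):
    print(f"{n} branches -> {sum(1 for _ in symbolic_paths(n)):>9} paths")
```

Twenty branch points already means over a million paths, and each path's feasibility check is an SMT query — which is why exhaustive exploration stops scaling long before real program sizes.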

The Vulnerability Classes That Shift

The techniques diverge most noticeably on a few specific vulnerability patterns.

Custom sanitizers. If your application validates a category parameter like this:

ALLOWED_CATEGORIES = frozenset(['electronics', 'clothing', 'books'])

def get_products(category: str) -> list:
    if category not in ALLOWED_CATEGORIES:
        raise ValueError('invalid category')
    return db.execute(f"SELECT * FROM products WHERE category = '{category}'")

Taint analysis sees the f-string and fires. It has no model for frozenset membership checks as sanitizers unless you explicitly declare one. A constraint reasoning system reading this code understands that category can only take three specific values, none of which are injection payloads. The invariant is satisfied.

Authorization and IDOR. Taint analysis has no dangerous sink for “forgot to verify record ownership.” There’s nothing to taint, nothing to track. The OWASP Top 10’s Broken Access Control consistently tops the list partly because SAST tools generate almost no signal about it. Constraint reasoning can ask whether the code that performs a database lookup by user-supplied ID includes a check that the returned record belongs to the requesting user — a semantic property, not a data flow property.
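A minimal sketch of the pattern, with an in-memory dict standing in for the database and all names (DB, Forbidden, the invoice schema) illustrative:

```python
class NotFound(Exception): ...
class Forbidden(Exception): ...

# Illustrative store: invoice 1 belongs to user 7.
DB = {1: {"id": 1, "owner_id": 7, "total": 120}}

def get_invoice_vulnerable(invoice_id: int, requester_id: int) -> dict:
    # Lookup by user-supplied ID with no ownership check. Nothing here is
    # a taint sink, so taint analysis has nothing to fire on.
    invoice = DB.get(invoice_id)
    if invoice is None:
        raise NotFound
    return invoice

def get_invoice_fixed(invoice_id: int, requester_id: int) -> dict:
    invoice = DB.get(invoice_id)
    if invoice is None:
        raise NotFound
    # The semantic property a constraint reasoner can check: does the
    # returned record belong to the requester?
    if invoice["owner_id"] != requester_id:
        raise Forbidden
    return invoice
```

The diff between the two functions is the entire vulnerability, and it involves no dangerous sink at all — only a missing comparison.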

Second-order injection. Data written to Redis by an attacker, retrieved later, interpolated into a SQL query. The retrieval from Redis looks like a clean database read. Taint analysis across storage boundaries requires explicit Redis source modeling, which most configurations don’t have.
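The flow across the storage boundary, sketched with a plain dict standing in for Redis (a real deployment would use a client like redis-py; function names are illustrative):

```python
cache = {}  # stand-in for Redis

def save_display_name(user_id: int, name: str) -> None:
    # Attacker-controlled value enters storage. Nothing fires here:
    # a cache write is not a dangerous sink.
    cache[f"user:{user_id}:name"] = name

def build_report_query(user_id: int) -> str:
    # The retrieval looks like a clean cache read. Without an explicit
    # Redis source model, taint analysis treats `name` as untainted —
    # so the interpolation below goes unflagged.
    name = cache[f"user:{user_id}:name"]
    return f"SELECT * FROM reports WHERE author = '{name}'"

save_display_name(42, "bob'; DROP TABLE reports; --")
print(build_report_query(42))
```

The taint chain is broken at the cache boundary even though the attacker's payload survives it intact.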

What Codex Security Gives Up

This is where the OpenAI post is worth reading carefully, because they’re explicit about the tradeoff. Codex Security is precision-optimized, not recall-optimized. It accepts missing some vulnerabilities in exchange for reporting only high-confidence findings. The failure mode is quiet: a false negative is invisible. You get a clean report and cannot distinguish it from a clean report on a codebase the system analyzed inadequately.

SAST tools have the opposite failure mode. Their false positives are visible and auditable. You can inspect the rule, trace the data flow, understand why the tool fired. You can audit the CodeQL query suite and produce a coverage claim for a security auditor. Codex Security has no CWE mapping, no query suite to inspect, no explicit coverage specification. The non-determinism is a compliance problem: two runs on the same codebase may disagree, which is incompatible with SOC 2 or PCI-DSS requirements that ask for reproducible, auditable findings.

Some vulnerability classes are structurally outside what constraint reasoning can address. Race conditions and TOCTOU require reasoning about temporal interleavings, not code structure. Deserialization gadget chain analysis requires classpath inspection that exceeds static reasoning. Authorization as a systemic property requires knowing what the intended security model is, not just what the code does — and that knowledge isn’t in the code.

The Artifact Problem

The deepest point in the Codex Security design is that the SAST report format exists to solve a problem SAST created. SARIF exists because tools generate findings they cannot validate. Severity tiers exist to help developers prioritize which unvalidated findings to investigate. Suppression pragmas exist to let developers record which unvalidated findings they’ve decided to ignore. The tooling ecosystem around SAST is, largely, infrastructure for managing uncertainty.

When findings are validated before they surface, none of that infrastructure is needed. A finding that appears in Codex Security’s output is reported because the system determined the invariant is violated with attacker-controlled data. There’s no queue to triage, no list to suppress, no severity tier to interpret. The CI integration keeps --fail-on-findings always on because every finding that appears is one that should block.

That’s a genuine architectural shift, not a marginal improvement on existing tools. The question is whether the precision tradeoff is acceptable for a given use case. For a team that needs auditable coverage for compliance purposes, SAST’s noisy-but-auditable output may be preferable to a quiet-but-opaque system. For a team that has stopped looking at their SAST output because 80% of it is noise, the tradeoff goes the other direction.

The tools are solving different problems. The interesting thing is that Codex Security makes that explicit rather than pretending the SAST approach can be fixed by better tuning.
