
The Audit Infrastructure Behind the SAST Report

Source: OpenAI

OpenAI's explanation of why Codex Security doesn't produce a SAST report rests on a signal-quality argument: AI-driven findings and pattern-matched SAST findings have incompatible confidence profiles, and combining them degrades developer trust in the output as a whole. That reasoning is sound. But the SAST report format itself is a historical artifact with a specific origin, and understanding where it came from explains who Codex Security was designed to serve and who it wasn't.

How the Compliance Market Built SAST

Static analysis research predates commercial SAST by decades. Dataflow analysis, abstract interpretation, and taint tracking have roots in compiler theory work going back to the 1970s. What changed in the early 2000s was the emergence of enterprise security procurement requirements that created a commercial market for tools producing documented security assessments.

Organizations selling to financial institutions, healthcare providers, and government agencies started facing contractual and regulatory demands that their software had been tested for vulnerabilities. The buyers demanding this weren’t typically engineering teams. They were CISOs, risk managers, and compliance officers, and their deliverable to auditors was evidence: documented proof that security scanning had occurred, that findings had been generated, and that remediation had been tracked.

Fortify Software, Coverity, Checkmarx, and Veracode all grew their enterprise businesses selling into this compliance market. The purchasing decision was made by the same organizational function that procures legal, audit, and risk services, and the output those buyers needed was a comprehensive report they could file. Whether developers resolved the findings was, in practice, a secondary concern.

This shaped the tools at a product level. A SAST report with 300 findings including 200 false positives is still a compliance artifact. It demonstrates that scanning occurred and that findings were documented. NIST 800-53 SA-11 is the control that makes this explicit: it requires organizations to “employ static code analysis tools to identify common flaws and document the results of the analysis.” The operative requirement is documentation. FedRAMP, which layers on 800-53, carries the same language, meaning any organization pursuing a FedRAMP authorization needs SAST output as part of its assessment package, regardless of what other security tooling it runs.

What the Output Format Reflects

SARIF, today's industry-standard output format for SAST tools (and the format accepted by GitHub Advanced Security, Azure DevOps, and VS Code's security views), encodes this lineage directly. A representative finding looks like:

{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "Semgrep", "version": "1.x"}},
    "results": [{
      "ruleId": "python.django.security.injection.sql",
      "level": "error",
      "message": {"text": "Potential SQL injection via cursor.execute()"},
      "locations": [{
        "physicalLocation": {
          "artifactLocation": {"uri": "app/views.py"},
          "region": {"startLine": 47}
        }
      }],
      "partialFingerprints": {
        "primaryLocationLineHash": "abc123def456"
      }
    }]
  }]
}

The partialFingerprints field exists to allow findings to be tracked across code changes in audit management platforms like DefectDojo. The schema assumes findings will be imported into a tracking system where compliance teams can monitor resolution status over time. Nothing in the format captures whether the finding is exploitable, whether it exists in reachable code, or whether the full calling context changes the risk assessment. Those questions were not the design constraints that produced the format.
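To make the tracking mechanism concrete, here is a minimal sketch of what a location-based fingerprint can look like. This is illustrative only: each tool computes partialFingerprints its own way, and this is not the actual algorithm behind primaryLocationLineHash. The idea is that hashing the finding's line plus a little surrounding context, with whitespace normalized, lets a tracking platform recognize the same finding even after the file shifts underneath it.

```python
import hashlib

def line_fingerprint(source_lines, line_number, context=2):
    """Fingerprint a finding by hashing its line plus a few lines of
    surrounding context, with leading/trailing whitespace stripped so
    reindentation alone does not break tracking.

    Illustrative sketch, not the algorithm any real tool uses.
    """
    start = max(0, line_number - 1 - context)          # line_number is 1-based
    end = min(len(source_lines), line_number + context)
    window = "\n".join(line.strip() for line in source_lines[start:end])
    return hashlib.sha256(window.encode("utf-8")).hexdigest()[:16]
```

A tracking platform can then key open findings by the pair (ruleId, fingerprint): when a later scan reports the same pair, it is treated as the same finding even though startLine has moved, which is exactly the resolution-status bookkeeping the compliance workflow needs.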

Semgrep and CodeQL both export SARIF. Both are genuinely capable analysis tools that can find real vulnerabilities. But the output format they share was designed around the compliance workflow, and that design shapes how their output is consumed downstream.
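The downstream consumption pattern is easy to see in code. A minimal Python sketch of what an audit-management importer does with a SARIF report (the structure follows the example above; no tool-specific API is assumed):

```python
from collections import Counter

def findings_per_rule(report: dict) -> Counter:
    """Tally SARIF results by ruleId across every run in a report.

    `report` is a parsed SARIF 2.1.0 document, e.g. the result of
    json.load() on a scanner's output file.
    """
    counts = Counter()
    for run in report.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts
```

Note what the importer has to work with: rule identifiers, locations, and counts. Exploitability is simply not present in the record, so no amount of downstream tooling can surface it.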

The Developer Experience as a Secondary Concern

The consequence of designing for compliance buyers is that developer experience became a secondary concern. This isn’t a criticism of the tools’ technical quality. It reflects the procurement incentives during the period when the market formed.

Developers encounter SAST output in a different context from the compliance team. They are responsible for investigating and resolving findings under time pressure, and their expected return from investigating any given finding depends on the tool’s precision. The BSIMM studies of security program maturity across hundreds of organizations document a consistent pattern: SAST finding backlogs accumulate faster than they are resolved, and teams at lower maturity levels cope by deprioritizing SAST work until compliance deadlines force a triage cycle. This is rational behavior given low per-finding precision. It is also precisely the failure mode that renders the compliance artifact useless as an engineering feedback signal.
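The incentive at work can be put in back-of-envelope terms. With precision p (the fraction of findings that are true positives), a payoff v for fixing a real bug, and a triage cost c per finding, the expected return on investigating one finding is p·v − c; when low precision drives that quantity negative, deprioritizing the backlog is the rational move. A sketch with hypothetical figures (the hour values below are made up for illustration):

```python
def expected_return(precision, payoff_true_positive, triage_cost):
    """Expected net value, in arbitrary units such as engineer-hours
    saved, of investigating a single finding: p * v - c."""
    return precision * payoff_true_positive - triage_cost

# Hypothetical figures: a confirmed bug is worth 2.0 hours of avoided
# incident work; triaging any finding, true or false, costs 1.0 hour.
low_precision = expected_return(100 / 300, 2.0, 1.0)   # the 300-finding report above
high_precision = expected_return(0.9, 2.0, 1.0)        # a validated-findings tool
```

At one-third precision the expected return is negative and the backlog rationally rots; at high precision the same arithmetic makes every finding worth opening.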

GitHub Code Scanning made a meaningful step toward developer integration by displaying SARIF findings inline in pull requests, reducing the friction of triage by surfacing findings in the workflow where code changes happen. But it does not change what the findings contain: pattern matches without exploitability validation. The SARIF format moves into the PR review UI, but it carries the same precision characteristics it always had.

What Codex Security Is Designed Around

A Codex Security finding is a different kind of artifact. Rather than a machine-readable record formatted for audit import, it is a validated judgment: the reasoning system has checked whether the identified code path is exploitable given the full project context, the authentication layers above it, and the constraints on user-controlled inputs before they reach the vulnerable point. The output is closer to a PR review comment from a security engineer who has read the relevant surrounding code than to a SARIF entry.

This is what makes the comparison between Codex Security and SAST partly a category error. The precision argument is real, but the more structural difference is that the two tools are built for different consumers. SAST produces documentation for the compliance workflow. Codex Security produces actionable findings for the development workflow. Those are different products with different design constraints, and they happen to overlap on the question of which vulnerabilities exist in the code.

Meta’s Infer static analyzer is an instructive parallel. Infer uses separation logic and bi-abduction to find real null pointer dereferences, resource leaks, and memory safety issues at scale. It is a more sophisticated analyzer than most commercial SAST tools for its target vulnerability classes, and it has found real bugs in Facebook’s production iOS and Android code before release. It does not produce SARIF output and is not commonly used as an SA-11 compliance artifact. Teams that use Infer typically run a conventional SAST tool alongside it for compliance coverage and use Infer for engineering depth. The tools serve different purposes even when they analyze similar code.

The Practical Gap for Regulated Teams

The institutional reality is that organizations operating under FedRAMP, organizations subject to NIST 800-53 SA-11, and organizations whose auditors have operationalized security testing requirements as “show us your SAST reports” cannot substitute Codex Security for their SAST tooling regardless of which tool produces more accurate findings. The auditor needs SARIF output from a recognized static analysis tool to check against the control. A constraint-reasoned AI finding, however precise, does not satisfy that documentation requirement in the current audit framework.

This is not a gap that Codex Security's design failed to address. It's a gap that reflects the fact that the tool was designed around a different use case. The compliance attestation problem and the engineering feedback problem require different outputs, and a tool that tried to serve both simultaneously would have to compromise toward the noise tolerance the compliance market permits.

The practical outcome for most organizations will be layered tooling: conventional SAST for compliance coverage and audit evidence alongside AI-based constraint reasoning for the findings developers act on during development. OWASP SAMM describes a similar principle in its security testing guidance, distinguishing between coverage-oriented testing and depth-oriented testing. SAST provides the coverage layer that regulators ask for. Tools like Codex Security provide the depth layer that actually changes what code ships.

The SAST report wasn’t designed poorly. It was designed for a buyer who needed documentation of due diligence, and the tools delivered that reliably. Codex Security is designed for a buyer who needs vulnerabilities fixed before code ships. Those are different problems, and the absence of a SAST report in the output is the most direct statement of which problem the tool is solving.
