What a Build Gate Is For
The theory of shifting security left is straightforward: find vulnerabilities before merge, not after. A build gate enforces this by treating a security finding the same way a failing test treats a broken assertion. The pull request does not merge until the finding is addressed. The gate is the mechanism that gives the policy teeth.
Tests work as gates because they have low false positive rates by construction. A test either passes or it fails, and when it fails, the failure is almost always real. Developers trust the signal, so they act on it. The gate holds.
SAST tools do not have this property. Their false positive rates, even on well-configured rule sets, run high enough that blocking on every finding would halt development completely. So teams do what seems reasonable: they tune.
The Tuning Spiral
A typical deployment starts with a tool like Semgrep or Bandit integrated into a CI pipeline. The configuration looks roughly like this:
```yaml
# .github/workflows/security.yml (initial setup)
name: Security Gate
on: [pull_request]
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Semgrep
        run: |
          semgrep --config=p/owasp-top-ten \
            --severity=ERROR \
            --error \
            --json \
            --output=semgrep-results.json
      - name: Fail on findings
        run: |
          count=$(jq '.results | length' semgrep-results.json)
          if [ "$count" -gt 0 ]; then exit 1; fi
```
This blocks the build on any ERROR-severity finding from the OWASP Top Ten rule set. Within a week, the team discovers that half the findings are false positives. Some rules flag their internal framework code, which wraps inputs correctly but in a way the tool cannot see. Others fire on test fixtures that use deliberately unsafe constructs. Developers start complaining that the gate is blocking valid work.
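To see why the rate runs high, consider the internal framework code described above. The hypothetical helper below escapes user content before interpolation, but the sanitization lives one call away from the sink, where many pattern-based rules cannot see it:

```python
import html

def safe_render(user_content: str) -> str:
    """Escape, then interpolate: safe, but shaped exactly like an XSS finding."""
    return f"<div>{html.escape(user_content, quote=True)}</div>"
```

A rule that flags any string interpolation into HTML fires on the return line regardless. The finding is a false positive, and the team pays the triage cost anyway.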
The first response is to suppress individual rules. Then to lower the severity threshold, or to maintain a growing .semgrepignore file. Then individual developers start adding inline suppression comments.
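The suppression file accumulates entry by entry. A hypothetical .semgrepignore after a few months (paths are illustrative):

```
# .semgrepignore
tests/
fixtures/
vendor/
src/legacy/billing/
scripts/one_off/
```

Each path was excluded to quiet a specific complaint, and nothing forces anyone to revisit the list.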
```python
# six months into the deployment
import subprocess

import yaml

def process_user_input(data: dict) -> str:
    query = f"SELECT * FROM users WHERE id = {data['user_id']}"  # nosec B608
    return db.execute(query)

def render_template(user_content: str) -> str:
    return f"<div>{user_content}</div>"  # nosec B703

def load_config(path: str) -> dict:
    with open(path) as f:
        return yaml.load(f)  # nosec B506 - internal path only

def run_command(cmd: str) -> str:
    return subprocess.check_output(cmd, shell=True)  # nosec B602 B603
```
Bandit’s # nosec pragma silences the finding on that line. Semgrep has # nosemgrep: rule-id. GitHub Advanced Security lets you dismiss an alert with a reason and a comment; the dismissal shows in the security tab and does not reopen on new commits to the same line. Each individual suppression is locally defensible. Collectively, they hollow out the gate.
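One way to see the hollowing out directly is to count it. A quick audit sketch in POSIX shell (hypothetical helpers, matching the comment markers shown above):

```shell
# Count inline SAST suppressions under a directory. Every match is a
# finding that someone decided the gate should not see.
count_suppressions() {
    pattern="$1"
    dir="$2"
    grep -rI --include='*.py' -o -e "$pattern" "$dir" 2>/dev/null | wc -l | tr -d ' '
}

audit_suppressions() {
    dir="${1:-.}"
    echo "bandit  (nosec):     $(count_suppressions 'nosec' "$dir")"
    echo "semgrep (nosemgrep): $(count_suppressions 'nosemgrep' "$dir")"
}
```

Tracking that number over time gives a direct measure of how much of the tool's output the team has opted out of.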
The CI configuration evolves to match:
```yaml
# .github/workflows/security.yml (twelve months later)
name: Security Scan
on: [pull_request]
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Semgrep
        run: |
          # the growing .semgrepignore in the repo root is applied automatically
          semgrep --config=p/owasp-top-ten \
            --severity=ERROR \
            --exclude-rule=python.django.security.injection.tainted-sql-string \
            --exclude-rule=python.flask.security.xss.reflected-xss-all-tags \
            --json \
            --output=semgrep-results.json
        continue-on-error: true
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: security-results
          path: semgrep-results.json
```
The --error flag is gone. continue-on-error: true means the pipeline passes regardless. The output is uploaded as an artifact. Nobody reads artifacts from security scans that don’t fail builds.
Advisory Mode as an Outcome
This is alert fatigue, and it is a security failure mode, not just an operational inconvenience. The term gets used loosely, but it means something specific here: a team learns through repeated experience that a system's alerts are unreliable, and stops treating them as signals that require action. Once that behavior is learned, it applies to the real findings too.
The outcome is advisory mode. The scan runs, the report is generated, and the report is available for review. In practice, review happens during compliance audits, not during development. The triage backlog grows. Findings from six months ago sit alongside findings from today, and there is no systematic way to distinguish which ones were real. The tool has become compliance theater: evidence that scanning occurs, not evidence that vulnerabilities are caught.
This is not a failure of intent. The teams that end up here generally started with good-faith attempts to run a blocking gate. The tuning spiral is the structural outcome of running a high-false-positive tool in a context that requires low false positives, which is every context where developer attention is the scarce resource.
What Precision Changes
OpenAI’s post on why Codex Security doesn’t include a SAST report frames the precision problem from the developer trust angle: a finding you don’t trust is worse than no finding at all. The argument applies directly to the gate problem. A gate that developers route around is worse than no gate, because it creates an illusion of coverage while consuming engineering time.
Precision-optimized analysis, which validates that a potential vulnerability is actually reachable with attacker-controlled data before reporting it, changes the math on false positives at the gate. If the tool reports a finding, the finding is real. That property is what makes a gate holdable.
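The difference between matching a pattern and confirming a path shows up in a toy example. Both functions below contain the identical textual pattern, string interpolation into SQL; only one is reachable with attacker-controlled data. This is illustrative code, not any engine's internals:

```python
# Reachable: user_id arrives straight from the request, so an attacker
# controls what lands in the query. A precise tool reports this one.
def lookup_user(user_id: str) -> str:
    return f"SELECT * FROM users WHERE id = {user_id}"

ALLOWED_TABLES = {"users", "orders"}

# Not reachable: the interpolated value must survive a closed allowlist
# check first, so no attacker-controlled string reaches the sink. A
# pattern matcher flags it anyway; a reachability check stays silent.
def count_rows(table: str) -> str:
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    return f"SELECT count(*) FROM {table}"
```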
The CI integration for a precision gate looks different, and the difference is meaningful:
```yaml
# .github/workflows/security.yml (precision-based gate)
name: Security Gate
on: [pull_request]
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run precision security analysis
        run: |
          codex-security scan \
            --output=findings.json \
            --fail-on-findings
      - name: Block on confirmed vulnerabilities
        if: failure()
        run: |
          echo "Build blocked: confirmed exploitable vulnerabilities found"
          jq '.findings[] | {file, line, type, severity}' findings.json
          exit 1
```
The key properties: --fail-on-findings is always present. There is no continue-on-error. There is no suppression list, because the findings that appear are confirmed findings, not noise. A developer who sees this gate fail can treat the output as a real vulnerability description, not as a triage task that might or might not be worth investigating.
The absence of a #nosec-equivalent workflow is significant. When the false positive rate is low, there is no pressure to maintain suppression infrastructure. The gate stays a gate.
What the Gate Still Cannot Do
Precision solves the false positive problem. It does not solve the coverage problem. A gate that blocks only on confirmed findings is only as useful as the set of vulnerability classes the analysis reasons about. If the tool reasons about SQL injection and command injection but not about insecure deserialization or server-side request forgery, those vulnerability classes pass through the gate regardless of how precise the analysis is.
This matters for thinking clearly about what a precision gate provides. It provides a strong guarantee on a bounded set of vulnerability classes: the things it finds are real, and the build will not merge with those things present. It does not provide a guarantee that the merged code contains no vulnerabilities. The scope of the guarantee is the scope of the tool’s reasoning surface.
For most teams, that bounded guarantee is significantly more useful than the illusory broad coverage of a SAST deployment that has drifted into advisory mode. A gate that reliably blocks ten vulnerability classes, and does so without generating friction that leads to suppression, is more valuable in practice than a tool that nominally covers fifty classes but gets routed around on the third week of deployment.
The engineering decision is to be clear about that scope, document it, and supplement with other controls for the classes outside it. That framing is more honest than the alternative, which is running SAST in advisory mode and describing it as a security gate.
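Documenting that scope can be as lightweight as a file checked in next to the workflow. A hypothetical sketch, with class names and supplementary controls purely illustrative:

```yaml
# security-gate-scope.yml — what the gate does and does not guarantee
blocking:            # classes the precision gate confirms and blocks on
  - sql-injection
  - command-injection
  - path-traversal
out-of-scope:        # classes handled by supplementary controls instead
  insecure-deserialization: "dependency review plus code review checklist"
  ssrf: "egress allowlist enforced at the network layer"
```

A file like this makes the bounded guarantee auditable: anyone can see what the gate promises and what still depends on other controls.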