
Amazon's Sign-Off Policy Reveals a Problem the Industry Has No Tooling to Solve

Source: lobsters

After a string of production outages, Amazon made headlines by requiring senior engineers to sign off on any AI-assisted code change before it reaches production. The company also held a mandatory all-engineering meeting, which is the kind of organizational signal that marks a genuine inflection point, not just a policy memo. Companies hold mandatory engineering-wide meetings when something crosses a severity threshold that normal incident review processes can’t absorb.

The policy is defensible. Given what we know about how AI tools fail in production, adding a human gate is the correct interim response. But the policy also exposes something the reporting hasn’t focused on: the software industry currently has no reliable, mechanical way to know which code an AI wrote. That gap makes Amazon’s sign-off rule an honor system, and honor systems degrade under deadline pressure.

What Kind of Bug Breaks Production

The outages that prompted this policy were not caused by syntax errors or type mismatches. Those get caught by CI. What leaked through were changes that were syntactically valid, passed all tests, and were wrong in ways that only surfaced under specific production conditions.

This is not a new failure mode. It’s the central problem that decades of automated program repair research have tried to solve. The repair literature calls a fix that makes a test suite pass without being semantically correct a “plausible patch.” When researchers analyzed the patches produced by GenProg, one of the earliest repair systems, a substantial fraction of its plausible patches turned out to be wrong, achieving correctness by deleting the failing code path, hardcoding values that satisfied assertions, or clamping output in ways that broke untested behavior. The Prophet system from MIT in 2016 tried to close this gap by learning a correctness model from human-authored patches, but the plausibility problem was never solved, only reframed.

LLMs are extraordinarily good at generating plausible output. That’s what makes them useful and what makes the gap between plausibility and correctness harder to detect than it was with earlier repair tools. An AI assistant doesn’t generate obviously broken code. It generates code that looks exactly like what a competent engineer would write, except it may have gotten a critical invariant wrong about the live system’s state.

Infrastructure code is particularly exposed to this failure mode. A VPC routing rule, an IAM policy document, or a service mesh configuration doesn’t have meaningful automated test coverage. Its correctness depends on understanding the production topology, which exists outside the code repository. An AI tool working from the repository alone can generate a configuration that is valid against the schema, passes linting, and is wrong for reasons only a senior engineer with operational context would recognize.
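The gap can be made concrete with a toy sketch: a routing entry that passes every check CI can run but violates an invariant that exists only as operational knowledge. Everything here is hypothetical, the schema check, the CIDRs, and the reserved-prefix rule alike.

```python
# Illustrative sketch: a config change that is schema-valid yet wrong
# against an operational invariant that lives outside the repository.
# All names, CIDRs, and rules below are hypothetical.
import ipaddress

def schema_valid(route: dict) -> bool:
    """What CI can check: required keys present, CIDR parses."""
    try:
        ipaddress.ip_network(route["destination"])
        return isinstance(route.get("target"), str)
    except (KeyError, ValueError):
        return False

# Production knowledge not present in the repo: this prefix is reserved
# for a partner peering link and must never be routed locally.
RESERVED_PREFIXES = [ipaddress.ip_network("10.64.0.0/12")]

def violates_topology(route: dict) -> bool:
    dest = ipaddress.ip_network(route["destination"])
    return any(dest.subnet_of(p) for p in RESERVED_PREFIXES)

proposed = {"destination": "10.70.0.0/16", "target": "local"}
print(schema_valid(proposed))       # True: passes every automated check
print(violates_topology(proposed))  # True: wrong, for reasons only an
                                    # engineer with topology context knows
```

The point of the sketch is that `violates_topology` can only exist if someone encoded the invariant, and most such invariants are never encoded anywhere.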

The Provenance Problem

The sign-off policy, as reported, applies to “AI-assisted changes.” This is where things get structurally complicated. “AI-assisted” is currently a self-reported property. There is no standard mechanism in Git, GitHub, GitLab, or any common CI system that captures which parts of a diff came from an LLM versus human typing. Engineers have no formal way to mark their commits as AI-assisted except through commit message conventions, and conventions without enforcement degrade.

The category is also genuinely fuzzy. A complete function generated by an LLM is clearly AI-assisted. An algorithm that an LLM explained and a human then rewrote from scratch is less clear. A human-written algorithm with AI-generated tests sits somewhere else again. Amazon Q Developer, Amazon’s primary AI coding assistant, integrates into the IDE at the suggestion level, which means a single commit might contain lines from five different AI interactions interspersed with human edits.

A real provenance system would need three layers. At the IDE level, tools like Amazon Q or GitHub Copilot would record AI contributions as metadata that survives the edit-and-commit cycle. At the commit level, a Git trailer convention could propagate that metadata:

AI-Assisted: amazon-q/2026-03
AI-Assisted: github-copilot/gpt-4o

At the CI layer, policy gates would enforce review requirements based on that metadata, the same way branch protection rules require approving reviews before merge. This produces an automated gate instead of an honor system, plus a timestamped audit trail of every AI-assisted change that shipped with a named approver.
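A minimal sketch of that CI-layer gate, assuming the hypothetical `AI-Assisted` trailer convention above (nothing here is an existing standard):

```python
# Sketch of a CI policy gate keyed on a hypothetical "AI-Assisted"
# Git trailer. Trailers are "Key: value" lines in the final paragraph
# of a commit message.

def parse_trailers(message: str) -> dict[str, list[str]]:
    """Extract Git-style trailers from the last paragraph of a message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return {}
    trailers: dict[str, list[str]] = {}
    for line in paragraphs[-1].splitlines():
        if ":" not in line:
            return {}  # final paragraph is prose, not a trailer block
        key, _, value = line.partition(":")
        trailers.setdefault(key.strip(), []).append(value.strip())
    return trailers

def requires_senior_signoff(message: str) -> bool:
    """Policy: any commit declaring an AI-Assisted trailer needs review."""
    return bool(parse_trailers(message).get("AI-Assisted"))

message = """Fix route propagation for the peering VPC

Tighten the CIDR match so transit traffic is not blackholed.

AI-Assisted: amazon-q/2026-03
"""
```

In practice, a branch protection rule or merge-queue check would run something like `requires_senior_signoff` over every commit in the range being merged and block the merge until an approving review from the designated group exists.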

The technical difficulty here is low. The gap is standardization across tooling vendors. Vendors have little structural incentive to adopt conventions they didn’t create, and there is no industry body currently positioned to drive this convergence.

What the Historical Record Suggests

This is not the first time the industry has faced a gap between process-based controls and the tooling needed to enforce them mechanically. The pattern is recognizable from change management history.

ITIL-style Change Advisory Boards were the original answer to production risk: a human review before any change reaches production. CABs worked, roughly, but they were vulnerable to exactly the same failure mode as Amazon’s current policy. Change request systems could be bypassed, forms could be filled inaccurately, approvals could be gamed under deadline pressure, and the gap between what was approved and what was deployed was difficult to audit after the fact.

Terraform and Pulumi moved infrastructure state into code, making it reviewable in pull requests with full auditability. Kubernetes admission controllers enforce policy at apply time, not after. Deployment pipelines enforce staging-before-production without requiring manual confirmation at each stage. Each of these transitions followed the same arc: a process-based control was replaced or supplemented by a tool-level enforcement mechanism that closed the gap between what the process required and what actually happened.

Amazon’s sign-off rule is a process-based control at the beginning of that arc. The objection that it won’t scale is correct. It is also not the right objection to raise right now. The policy is the correct interim state while the industry builds the tooling to enforce it mechanically. Objecting to it on scalability grounds is equivalent to objecting to pre-deployment CAB reviews in 2005 because Terraform would eventually replace them.

Safety-critical industries have navigated this transition before. Aviation software certified under DO-178C requires review artifacts keyed to software criticality levels, with multiple sign-offs for safety-critical functions. Medical device software under IEC 62304 requires design reviews formally traceable to safety requirements for Class C software. Those review requirements were not designed to be the final state. They were designed to hold while the industry built better tooling, and they have been supplemented over time with automated verification tools that check properties the human reviewers were previously catching manually.

What a Sign-Off Actually Has to Do

For the policy to be worth anything, the senior engineer signing off has to engage with specific questions that CI cannot answer. Does this change interact with anything in the production topology not captured in the code? Does it depend on operational invariants maintained by humans rather than enforced by the system? Does the logic hold at production scale, under failure modes not modeled in any test?

A pro-forma sign-off from a senior engineer who glanced at the diff and confirmed that CI is green provides the same false assurance as skipping the review. It creates a paper trail without creating an actual safety check, which is arguably worse than no policy because it diffuses accountability without adding oversight.

The DARPA AI Cyber Challenge in 2024 provided a useful data point on where AI patch generation actually works and where it doesn’t. Teams building AI systems to find and fix vulnerabilities in real open-source software found strong detection rates but uneven patch correctness. Mechanical vulnerability classes (SQL injection, deprecated crypto primitives, missing input validation, hardcoded credentials) were handled well. Business logic vulnerabilities (IDOR, broken access control, authorization flaws) were not. Those require understanding the intended access control model, which is not present in the code in any form a static analyzer or LLM can extract reliably. The same distinction applies to infrastructure code: mechanical properties are checkable, operational invariants are not.

The SWE-bench benchmark has become the standard measure for AI software engineering capability on general tasks, and it shows a similar pattern: AI tools perform well on well-specified bugs with good test coverage and poorly on bugs where the correct behavior requires understanding context outside the repository.

The Industry Is Watching

Amazon is not operating in isolation. Microsoft is watching how the mandatory-review model holds up, with GitHub Copilot Autofix integrated directly into the PR workflow. Meta published CyberSecEval as a framework for measuring AI security capability, which suggests they are thinking carefully about what their tools can and cannot do. Open-source projects like Redox OS made the inverse choice, banning LLM-generated contributions outright because their Developer Certificate of Origin model requires contributors to assert provenance of their work, and LLM-generated code has no clear provenance.

The Redox position is interesting precisely because it’s a tooling-level enforcement rather than a process-level request. You cannot accidentally commit LLM-generated code to Redox because the contribution model rejects it structurally. That’s a much stronger control than a sign-off requirement, and it points at what the eventual solution looks like: move the enforcement down the stack until it’s no longer possible to bypass the policy without actively circumventing tooling.
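As a toy illustration of enforcement moving down the stack, here is what a commit-msg hook enforcing a Redox-style ban might look like, again assuming the hypothetical `AI-Assisted` trailer convention rather than any mechanism Redox actually uses:

```python
#!/usr/bin/env python3
# Sketch of structural enforcement: a commit-msg hook that rejects any
# commit declaring AI assistance via a hypothetical "AI-Assisted" trailer.
import re
import sys

def check(message: str) -> int:
    """Return a nonzero exit status if the commit declares AI assistance."""
    if re.search(r"^AI-Assisted:", message, flags=re.MULTILINE):
        sys.stderr.write(
            "rejected: this project does not accept AI-generated changes\n")
        return 1
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    # Git invokes commit-msg hooks with the message file path as argv[1].
    with open(sys.argv[1]) as f:
        sys.exit(check(f.read()))
```

A hook like this only catches honestly declared AI assistance, which is exactly the honor-system weakness; Redox’s actual control works through the DCO’s provenance assertion, which binds the contributor legally rather than mechanically.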

For large engineering organizations running mixed human-AI development at scale, an outright ban is not practical. The realistic path is provenance tracking built into the tools, propagated through the commit graph, and gated at CI. Amazon’s sign-off rule is the process layer that needs to exist while that infrastructure gets built. Organizations that are building provenance tracking now, whether internally or by converging on something that could become an industry standard, will be better positioned when the next round of incidents arrives. Because there will be a next round. The underlying failure mode (AI tools generating plausible but operationally wrong changes) is structural, not incidental. It follows directly from how these models work.
