Amazon's AI Sign-Off Policy and the Provenance Problem It Can't Yet Enforce
Source: lobsters
Amazon’s decision to require senior engineers to sign off on AI-assisted changes, reported by Ars Technica, is a sound policy response to a real problem. But understanding why it is difficult to enforce reveals something more interesting about where the industry’s AI-assisted development practices still need to mature.
The report describes a string of production outages attributed to AI-generated changes, followed by a mandatory company-wide engineering meeting and a new sign-off requirement. The policy makes sense on its face: add a review gate for a category of changes associated with elevated risk. The harder question is what “AI-assisted” means in practice and how a policy enforces that boundary without tooling that captures it.
The Attribution Problem
In a large engineering organization, change attribution is an infrastructure problem, not a process problem. You know a change passed CI because the pipeline enforced it and logged the result. You know a change received code review because your PR tooling required approval before merge. These controls work because they are mechanically enforced at the point of change, not dependent on engineer self-reporting.
“This change was AI-assisted” is, in most organizations today, a self-reported property. There is no standard mechanism in Git, GitHub, GitLab, or any common CI system that captures which parts of a diff came from an LLM versus human typing. Engineers using Amazon Q Developer, GitHub Copilot, or other AI coding tools have no formal way to mark their commits as AI-assisted short of adding a commit message convention, and conventions without tooling enforcement degrade quickly under deadline pressure.
The category is also genuinely fuzzy. A change where an engineer used an LLM to generate a complete function is clearly AI-assisted. A change where the engineer asked an LLM to explain the behavior of an existing function, then rewrote it themselves, is less clear. A change that uses a human-written algorithm but AI-generated tests sits somewhere else again. Any policy built on “AI-assisted” as a discrete binary inherits all of this ambiguity. Amazon’s policy, as reported, relies on engineers self-classifying their own changes before the review gate applies.
What Automation Taught the Industry About Change Management
The industry has navigated this kind of challenge before with a different category of risky automation: infrastructure changes.
Before modern infrastructure-as-code tooling, infrastructure changes were high-risk and loosely controlled. ITIL-style change advisory boards (CABs) grew up to add human review before production changes, exactly the model Amazon is applying to AI-assisted code now. CABs worked at a certain scale, but they carried the same structural vulnerability: the control lived in process, not enforcement. Change request systems could be bypassed. Forms could be filled inaccurately. Approvals could be gamed under time pressure.
The industry’s durable answer was to push the control down into tooling. Terraform and Pulumi made infrastructure state explicit and reviewable in pull requests, turning a process artifact into a code artifact with all the auditability that comes with version control. Kubernetes admission controllers enforce policy at apply time, not after the fact. Deployment pipelines enforce staging-before-production constraints without requiring anyone to manually confirm compliance each time.
The same arc is likely ahead for AI-assisted code. Process controls are the correct first step when tooling does not yet exist. But process controls without enforcement degrade over time, and the durable solution is gates at the tool layer.
What AI Provenance Tooling Would Need to Look Like
A robust AI provenance system for code changes would need to operate at several levels.
At the IDE level, tools like Amazon Q or Copilot would record AI contributions as metadata that survives the edit-and-commit cycle. If an engineer accepts a Copilot suggestion and modifies it before committing, the result still carries the provenance marker. A reasonable definition might treat any change that originated from an LLM suggestion, regardless of subsequent edits, as AI-assisted: although a human made the decision to accept the suggestion, the correctness of the implementation still partly traces back to the model’s output.
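A minimal sketch of what such IDE-level tracking could look like, under the definition above. All names, fields, and tool identifiers here are hypothetical, not any real plugin’s API:

```python
# Hypothetical sketch of IDE-level provenance tracking. The class names,
# fields, and tool identifiers are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Marks a region of a file as having originated from an LLM suggestion."""
    tool: str             # e.g. "amazon-q" (assumed identifier)
    model: str            # model identifier as reported by the tool
    accepted_text: str    # the suggestion as originally accepted
    edited: bool = False  # set once the engineer modifies the accepted text

@dataclass
class FileProvenance:
    path: str
    records: list = field(default_factory=list)

    def accept_suggestion(self, tool: str, model: str, text: str) -> None:
        self.records.append(ProvenanceRecord(tool, model, text))

    def on_edit(self, current_text: str) -> None:
        # Under the definition above, a record stays AI-assisted even after
        # human edits; we only note that edits occurred.
        for r in self.records:
            if r.accepted_text not in current_text:
                r.edited = True

    @property
    def ai_assisted(self) -> bool:
        return bool(self.records)
```

The key design point is that `on_edit` never clears a record: human modification downgrades nothing, it only annotates, so the marker survives all the way to commit time.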
At the commit level, a standard for expressing AI provenance in commit metadata would close the attribution gap. Git already supports trailer conventions for attribution. A hypothetical standard might look like:
AI-Assisted: amazon-q/2026-03
AI-Assisted: github-copilot/gpt-4o
These are straightforward to emit from IDE integrations and straightforward to parse in CI pipelines. Informal versions of this already exist: some developers include Generated with Claude Code or Co-Authored-By: github-actions[bot] in commits. The gap is standardization across tooling vendors, not technical difficulty. Vendors have little structural incentive to adopt conventions they did not create, which is why this likely requires either an industry working group or a dominant player making their convention the default.
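To make the “straightforward to parse” claim concrete, here is a sketch of how a CI step might extract the hypothetical AI-Assisted trailer. The trailer name and the tool/version value format are assumptions carried over from the examples above:

```python
# Sketch of parsing the hypothetical AI-Assisted trailer from a commit
# message, as a CI step might do. The "tool/version" format is an assumption.
import re

TRAILER_RE = re.compile(
    r"^AI-Assisted:\s*(?P<tool>[\w.-]+)/(?P<version>[\w.-]+)\s*$",
    re.MULTILINE,
)

def ai_assisted_trailers(commit_message: str) -> list:
    """Return (tool, version) pairs for every AI-Assisted trailer found."""
    return [(m.group("tool"), m.group("version"))
            for m in TRAILER_RE.finditer(commit_message)]

msg = """Fix VPC route propagation

AI-Assisted: amazon-q/2026-03
AI-Assisted: github-copilot/gpt-4o
"""
```

Git itself already ships `git interpret-trailers` for reading and writing trailer lines, so real tooling would not even need the hand-rolled regex; the sketch just shows how little machinery the parsing side requires.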
At the CI layer, policy gates could enforce review requirements based on this metadata. The same way branch protection rules require approving reviews before merge, a policy rule could require a designated approver specifically when an AI-Assisted trailer is present. This moves the enforcement from honor system to automated gate, and it creates an audit trail: every AI-assisted change that shipped to production has a timestamped approval from a named senior engineer.
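The gate logic itself is simple once the metadata exists. A hedged sketch, where the data shapes and the designated-approver list are assumptions rather than any real platform’s API:

```python
# Hedged sketch of a CI-layer policy gate: if any commit in a PR carries an
# AI-Assisted trailer, require approval from a designated senior reviewer.
# SENIOR_APPROVERS and the function signature are hypothetical.
SENIOR_APPROVERS = {"alice", "bob"}  # hypothetical designated approvers

def merge_allowed(commit_messages: list, approvals: set) -> bool:
    """commit_messages: commit message strings in the PR;
    approvals: usernames of reviewers who approved."""
    ai_assisted = any("AI-Assisted:" in msg for msg in commit_messages)
    if not ai_assisted:
        return True  # only the normal branch-protection rules apply
    # AI-assisted change: demand at least one senior approval, which is
    # also what produces the audit trail of who signed off.
    return bool(SENIOR_APPROVERS & approvals)
```

This is the same shape as existing branch-protection rules; the only new ingredient is that the condition keys off commit metadata instead of file paths or branch names.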
Why Infrastructure Code Carries the Specific Risk
There is a reason the incidents that prompted Amazon’s policy involved operational and infrastructure changes rather than application logic. Infrastructure code has a different risk profile for AI generation.
Application logic typically has test coverage, operates under defined interface contracts, and fails in ways that surface during staging or canary rollouts before reaching full production. Infrastructure code often has none of these properties. A change to a Terraform module for VPC routing, an IAM policy document, or a service mesh routing rule may have no automated test coverage at all. Its correctness depends on understanding the operational state of a live system, the implicit contracts between services, and the load characteristics that only appear at scale.
AI models generating infrastructure changes face a harder version of the context problem. They cannot observe the actual state of your production systems. They generate plausible-looking configurations based on patterns in training data. For application code, plausible is often good enough. For infrastructure code operating at AWS scale, plausible and correct diverge precisely in the scenarios with the largest blast radius: regional failover configurations, IAM permission boundaries, network ACL ordering.
This is why senior sign-off makes specific sense for this category. The engineers with deep production mental models are positioned to catch the gap between “this is valid Terraform syntax” and “this will break our cross-region failover under a specific partition scenario.” That knowledge is not written down anywhere the model could have seen it.
The Broader Implication
Amazon’s policy is a reasonable interim measure applied to a real problem. It is also a signal that provenance and enforcement tooling for AI-assisted development has not kept pace with the adoption of the AI tools themselves.
The Redox OS project made the inverse choice: ban LLM-generated contributions outright, treating provenance as a prerequisite for the Developer Certificate of Origin model they rely on. That works for an open-source project that can control its contributor policies. A company the size of Amazon cannot effectively ban the tools, so it is adding a review layer instead.
Both approaches are working around the same absence: a standard, tooling-enforced way to say that a given piece of code was AI-assisted, what model produced it, and when. Without that infrastructure, every organization is building its own convention or relying on self-reporting, which means the audit trail that a post-incident review would need simply does not exist in most codebases today.
The organizations that build that infrastructure now, whether through internal tooling or by converging on an industry standard, will be in a substantially better position when the next round of AI-related incidents prompts the question: how much of what shipped was model-generated, and is there any record of who reviewed it?