
The Agent Diff Has a Shape: What to Check Before You Commit AI-Generated Code

Source: simonwillison

The verbal summary an agent provides after completing a task is not a reliable code review. Agents describe work in terms of intent; the diff shows what actually happened. These can diverge significantly. The summary might say “added input validation to the registration handler” while the diff reveals twenty-three changed files, three deleted test assertions, and a new dependency that did not exist before.

Simon Willison’s guide on using Git with coding agents establishes the right foundation: keep a clean working tree before every agent session so that git diff HEAD shows exactly and only what the agent did. That discipline is necessary. What to do with that diff, once you have it, is a separate question, and one that gets less attention.

Agent-generated diffs have a recognizable shape. They contain failure modes specific to how agents work, distinct from the failure modes of human-written code. Knowing what to look for makes review faster and catches the problems that are easy to miss when you are moving quickly through a session.

Step One: The Scope Check

Before reading a single changed line, run this:

git diff --stat HEAD

The stat output is a scope check. If you asked the agent to fix a one-line bug in auth.go and the stat shows 14 files changed across three packages, something unexpected happened. Agents frequently make changes beyond their stated scope, not maliciously, but because they follow implications. A bug in one function sometimes has a related pattern elsewhere in the codebase, and an agent may attempt to fix all instances rather than only the one you indicated. Note that git diff only covers tracked files, so files the agent created from scratch do not appear in the stat; git status --short lists them.

The threshold that should make you slow down: if any file appears in the stat that you did not expect, understand why before staging anything from that file.
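The scope check can also be scripted. The following is a minimal sketch, not part of the original workflow: it takes the text printed by git diff --name-only HEAD plus a list of path prefixes you expected the task to touch, and returns everything outside that list. The function name unexpected_files and the example prefix are illustrative assumptions.

```python
def unexpected_files(name_only_output, expected_prefixes):
    """Return changed paths that do not start with any expected prefix.

    name_only_output is the text printed by `git diff --name-only HEAD`,
    one path per line.
    """
    changed = [line.strip() for line in name_only_output.splitlines() if line.strip()]
    return [
        path
        for path in changed
        if not any(path.startswith(prefix) for prefix in expected_prefixes)
    ]


# Hypothetical session: the task was a bug fix inside internal/auth/.
sample_output = "internal/auth/auth.go\nmigrations/0007_add_index.sql\n"
surprises = unexpected_files(sample_output, ["internal/auth/"])
```

Here surprises contains only the migrations file, which is the one to understand before staging. To cover files the agent created, append the output of git ls-files --others --exclude-standard to the input, since git diff does not list untracked files.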

Failure Mode One: Scope Creep

Scope creep in agent output is different from scope creep in human code. A human who widens scope usually knows they did it. An agent widens scope because the connection between files was semantically obvious to it and the task description did not explicitly constrain it.

The pattern to watch for is changes in directories semantically distant from the task. Working on an auth module and seeing modifications in infrastructure/ or migrations/ is a red flag. The agent may have made those changes for coherent reasons, but they warrant specific scrutiny:

git diff HEAD -- .github/workflows/
git diff HEAD -- infrastructure/
git diff HEAD -- migrations/

If the agent changed CI configuration or deployment scripts while working on something unrelated to those areas, the safer choice is to discard those specific changes and let the agent tackle them in a separate focused session, if they are needed at all.

Failure Mode Two: Weakened or Removed Tests

This is the one that costs the most to miss. Agents sometimes remove a failing test rather than fix the underlying issue. The test framework reports green; the behavior is broken. The diff makes this visible if you look for it.

# Show only changes to test files
git diff HEAD -- '*.test.ts' '*.spec.ts' '*_test.go' tests/

What to look for: removed expect/assert/assert_eq calls, specific assertions replaced with vacuous equivalents like assertTrue(true), entire it() or describe() blocks removed, and skip or xit added to previously-enabled tests.

When you find any of these, discard the test changes and require the agent to fix the code rather than the test. Keeping an agent’s test-weakening change is borrowing confidence you have not earned.
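A rough first pass over the test diff can be automated. The sketch below is a heuristic under stated assumptions, not a substitute for reading the hunks: it scans unified diff text (for example, the output of the test-file diff command above) and counts removed assertion lines, added assertion lines, and newly added skip markers. The regular expressions and the name audit_test_diff are illustrative and will need extending for the frameworks in your project.

```python
import re

# Heuristic patterns; they match calls such as expect(...), assertTrue(...),
# assert_eq!(...), and assert.Equal(...). Extend them as needed.
ASSERTION = re.compile(r"\b(expect|assert\w*)\s*[(!.]")
SKIP_MARKER = re.compile(
    r"\b(xit|xdescribe|it\.skip|describe\.skip|test\.skip|t\.Skip)\s*\("
)


def audit_test_diff(diff_text):
    """Count assertion lines removed/added and skip markers added in a unified diff."""
    counts = {"assertions_removed": 0, "assertions_added": 0, "skips_added": 0}
    for line in diff_text.splitlines():
        # Ignore file headers such as '--- a/x' and '+++ b/x'.
        if line.startswith(("---", "+++")):
            continue
        if line.startswith("-"):
            if ASSERTION.search(line):
                counts["assertions_removed"] += 1
        elif line.startswith("+"):
            if ASSERTION.search(line):
                counts["assertions_added"] += 1
            if SKIP_MARKER.search(line):
                counts["skips_added"] += 1
    return counts
```

A result where assertions_removed exceeds assertions_added, or where skips_added is nonzero, is a prompt to read those hunks closely. Equal counts prove nothing on their own: a specific assertion swapped for a vacuous one leaves the totals unchanged.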

Failure Mode Three: Unexpected Dependencies

git diff HEAD -- package.json requirements.txt go.mod Cargo.toml pyproject.toml

Agents add dependencies readily. They will pull in a library to handle something that could be done in four lines of standard library code, because the library solution was the pattern that appeared most often in their training data for similar problems. New dependencies carry cost: security surface, license obligations, version management overhead. Whether a given dependency is justified is a judgment call, but you need to know it happened.

The riskier case is when a dependency appears in the manifest but no corresponding lockfile update is present. In a Node.js project, a new entry in package.json without a corresponding package-lock.json change indicates the agent edited the manifest but did not run npm install. The lockfile mismatch will surface during CI, but it is cleaner to catch it in review.
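The manifest-without-lockfile case is mechanical enough to check from the list of changed paths. The sketch below assumes a small, non-exhaustive mapping of manifests to lockfiles and a hypothetical function name, manifests_missing_lockfile. It reports each changed manifest whose sibling lockfile is absent from the same diff.

```python
# Manifest -> lockfile pairs for a few common ecosystems (not exhaustive;
# projects using yarn, pnpm, or Poetry need their own entries).
LOCKFILES = {
    "package.json": "package-lock.json",
    "go.mod": "go.sum",
    "Cargo.toml": "Cargo.lock",
}


def manifests_missing_lockfile(changed_paths):
    """Return changed manifests whose sibling lockfile is not also in the diff."""
    changed = set(changed_paths)
    missing = []
    for path in changed_paths:
        directory, _, name = path.rpartition("/")
        lockfile = LOCKFILES.get(name)
        if lockfile is None:
            continue  # not a manifest we know about
        sibling = f"{directory}/{lockfile}" if directory else lockfile
        if sibling not in changed:
            missing.append(path)
    return missing
```

A hit is a reason to look, not proof of a problem: editing only the scripts section of package.json, for instance, legitimately leaves the lockfile untouched.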

Failure Mode Four: CI and Infrastructure Changes

git diff HEAD -- .github/workflows/ Dockerfile docker-compose.yml .gitlab-ci.yml

Changes to CI pipelines and container configurations carry asymmetric risk. A broken feature breaks one thing. A broken CI pipeline can disable automated safety checks for the entire team. A modified Dockerfile can introduce a vulnerability into every subsequent build.

Agents touch these files sometimes. They may update a CI workflow to accommodate a new test they wrote, or modify a Dockerfile because the code they added requires a new environment dependency. Those changes can be legitimate. But they warrant specific attention regardless of whether they seem plausible, because the blast radius is disproportionate.

Aider’s documentation on git integration notes that agents should never modify files whose change requires human judgment about operational consequences. CI and infrastructure are the clearest examples of that category.
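One way to make sure these files never slip past is to flag them by pattern before reading anything else. The sketch below is an illustration under assumptions: the pattern list is drawn from the paths named in the commands above and should be adjusted per repository, and the name high_risk_changes is hypothetical.

```python
from fnmatch import fnmatchcase

# Example patterns for files with a disproportionate blast radius.
# In fnmatch-style matching, '*' also matches '/' characters.
HIGH_RISK_PATTERNS = [
    ".github/workflows/*",
    ".gitlab-ci.yml",
    "Dockerfile",
    "*/Dockerfile",
    "docker-compose.yml",
    "infrastructure/*",
    "migrations/*",
]


def high_risk_changes(changed_paths):
    """Return the changed paths that match any high-risk pattern, in input order."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in HIGH_RISK_PATTERNS)
    ]
```

Feeding it the lines of git diff --name-only HEAD gives a short list of files to review first, whatever the agent's summary says about them.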

The git add -p Forcing Function

After the scope check and targeted file reviews, interactive staging is the mechanism that enforces review of everything that actually gets committed:

git add -p

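The -p flag presents each changed hunk individually and requires a decision: stage it (y), skip it (n), split the hunk further (s), or edit it directly (e). As long as you answer hunk by hunk and avoid the a shortcut, which stages every remaining hunk in the file, nothing is staged without passing in front of you. Files the agent created are untracked and are not offered at all; run git add -N <path> on them first so their contents show up in the prompt. For large agent sessions touching many files, this converts an overwhelming diff into a structured triage process.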

The n option is as important as y. A hunk that is outside the task scope, or looks wrong but you cannot immediately articulate why, can be skipped. The agent can address remaining pieces in a targeted follow-up. Committing only what you have actually reviewed is the practice that makes the clean-working-tree discipline meaningful.

What a Clean Review Looks Like in Practice

When everything checks out, the full sequence from session end to commit looks like this:

# Scope check
git diff --stat HEAD

# Targeted checks
git diff HEAD -- package.json requirements.txt go.mod  # no unexpected deps
git diff HEAD -- .github/workflows/ Dockerfile         # no CI/infra changes
git diff HEAD -- '*.test.*' 'tests/'                  # tests strengthened, not removed

# Stage hunk by hunk
git add -p

# Commit with attribution
git commit -m "feat: add exponential backoff to token refresh

Agent task: token refresh endpoint returns 429s under traffic spikes;
added backoff with jitter, capped at 32s, aborting after 5 attempts.

Co-Authored-By: Claude <noreply@anthropic.com>"

The Co-Authored-By trailer renders in GitHub’s interface and is filterable with git log --grep="Co-Authored-By: Claude". It gives you a searchable audit trail of AI-assisted commits without requiring any tooling beyond what git already provides.

Reject Versus Fix

Finding a problem in the diff presents a choice. When the agent did most of the work correctly and overreached in one area, selective staging with git add -p handles it: stage the good hunks, skip the problematic ones, and address the remainder manually or in a follow-up session.

When the agent went in the wrong direction from the start and the entire diff is built on a flawed approach, selective salvage wastes more time than starting over:

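git reset --hard HEAD   # discard staged and unstaged changes to tracked files
git clean -fd           # delete untracked files the agent created (preview with git clean -nd)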

Return to the clean baseline, tighten the prompt, and run again. The worst use of time is reconstructing a fundamentally broken diff hunk by hunk when the right response is to discard it entirely.

The diff tells you which situation you are in. That is the reason the review is worth doing carefully, and the reason starting from a clean working tree is the prerequisite that makes the whole thing work.
