When Your AI Editor Rewrites the Room to Fix a Lightbulb

There is a specific kind of frustration that comes from asking an AI coding assistant to rename a variable and getting back a file where the indentation has changed, three functions have been reordered, some error handling has been added that you did not want, and the original variable rename is buried somewhere in the middle. The task was done, technically. But so was a lot of other stuff you never asked for.

This is what the article “Over-editing refers to a model modifying code beyond what is necessary” calls over-editing, and it puts a formal name to something that anyone who uses LLM-based coding tools regularly will recognize immediately. The naming matters because you cannot reason clearly about a problem until you have separated it from the general noise of “AI sometimes does weird things.”

What Over-Editing Actually Is

Over-editing is not about correctness. The edits a model makes when it over-edits are often individually defensible. Renaming tmp to temporary_buffer is arguably cleaner. Adding a null check is arguably safer. Reformatting to a consistent style is arguably more readable. The problem is that none of those changes were requested, and their presence in a diff obscures the one change that was.

The minimal editing principle is simple: a model should make the smallest change that satisfies the request. If the user says “fix the off-by-one error in this loop,” the model should change the loop bounds. It should not also refactor the surrounding function, add logging, or rewrite the variable names. The scope of the output should match the scope of the input.
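
To make the principle concrete, here is a small sketch of what "change the loop bounds" looks like. The function names are invented for illustration; the point is that the fixed version differs from the buggy one in exactly one place.

```python
# Hypothetical example: both function names are made up for this sketch.

def sum_first_n_buggy(values, n):
    """Intended to sum the first n values, but stops one short."""
    total = 0
    for i in range(n - 1):  # off-by-one: never reaches index n - 1
        total += values[i]
    return total


def sum_first_n_fixed(values, n):
    """Minimal fix: only the loop bound changes."""
    total = 0
    for i in range(n):
        total += values[i]
    return total
```

Everything else, including the variable names and the manual loop a reviewer might prefer to replace with sum(), stays exactly as it was.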

This sounds obvious. It is also, apparently, quite hard to train for.

Why Models Over-Edit

The root cause is in how these models are trained to be helpful. RLHF (Reinforcement Learning from Human Feedback) rewards outputs that human raters judge as good. When a rater sees a model make a targeted, minimal fix, it looks… fine. When a rater sees a model make the fix and also clean up the surrounding code, it often looks better. More effort. More thoroughness. More value delivered.

The problem is that raters are usually evaluating a single response in isolation. They are not the person who later has to review a PR where a two-line bug fix comes with 47 lines of reformatting, or who has to figure out whether the model’s “helpful” null check changed any behavior. The training signal does not penalize collateral edits because collateral edits are not obviously bad in context-free evaluation.

This is the same mechanism that drives AI sycophancy, just applied to code instead of opinions. The model is not trying to deceive you; it has learned that doing more tends to score better, so it does more.

The Benchmark Gap

The evaluation problem compounds this. SWE-bench, the dominant benchmark for coding agent capability, measures whether a model can resolve GitHub issues. It checks if the tests pass after the model’s changes. It does not measure whether the model made unnecessary changes, or how large the diff was relative to the minimal required diff, or whether unrelated parts of the codebase were touched.

This means a model can score very well on SWE-bench while being terrible at minimal editing. The benchmark incentivizes task completion, not surgical precision. Until there is a widely adopted benchmark that penalizes over-editing, the training incentives will continue to favor thoroughness over restraint.

A model that rewrites 200 lines to fix a 3-line bug passes SWE-bench the same as a model that makes exactly the 3-line fix. From a deployment perspective, those two models behave very differently.
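
A toy version of that gap, as a rough sketch: the file contents, the single test, and the helper names below are all assumptions made up for illustration, and a one-assertion check stands in for a real test suite. Both patches pass the check; only the line count tells them apart.

```python
import difflib

# Hypothetical file contents, invented for illustration.
ORIGINAL = """\
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) + 1)
"""

MINIMAL_FIX = """\
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
"""

OVER_EDITED_FIX = """\
def mean(values):
    if not values:
        return 0.0
    return sum(values) / len(values)
"""


def passes_test(source: str) -> bool:
    """Stand-in for a benchmark's test suite: one behavioral check."""
    namespace = {}
    exec(source, namespace)
    return namespace["mean"]([2, 4]) == 3


def changed_line_count(before: str, after: str) -> int:
    """Count lines added or removed between two versions of a file."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


for name, patch in [("minimal", MINIMAL_FIX), ("over-edited", OVER_EDITED_FIX)]:
    print(name, passes_test(patch), changed_line_count(ORIGINAL, patch))
```

A pass/fail score sees two identical results here. The over-edited patch also quietly decides what mean([]) should return, a choice nobody asked for and the test never exercises.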

What This Costs in Practice

The practical costs accumulate in ways that are easy to underestimate.

Code review becomes harder. A PR that contains a bug fix and also contains 80 lines of incidental reformatting forces reviewers to mentally separate the functional changes from the cosmetic ones. This takes time and introduces the possibility of missing something important because it was buried in noise.

Blame history is polluted. git blame becomes less useful when large portions of a file were touched by an AI assistant doing cleanup that was never requested. The historical signal of who changed what and why gets diluted.

Unexpected behavior changes slip through. Over-editing is not always purely cosmetic. A model that decides to “improve” your error handling while fixing an unrelated bug may change behavior in ways that are hard to spot in review. The requested change passes tests; the unrequested change has edge cases that do not.

Trust erodes. If you cannot predict the scope of what a model will change when given a task, you have to review everything carefully every time. This undermines the efficiency gains that made the tool worth using.

How Tooling Tries to Constrain It

Different tools have taken different approaches to limiting over-editing, with varying success.

Aider's diff edit format uses SEARCH/REPLACE blocks, where the model must specify exactly what text to find and exactly what to replace it with. This structure makes it mechanically harder to make sweeping changes, because each replacement must be anchored to existing code. A model that wants to reformat your entire file has to reproduce the existing text in SEARCH sections before it can change any of it, which tends to discourage it.

<<<<<<< SEARCH
def calculate_total(items):
    sum = 0
    for item in items:
        sum += item.price
    return sum
=======
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
>>>>>>> REPLACE

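The anchoring only constrains anything if the tool applying the block is strict about it. The sketch below is one way such a tool could behave, not how any particular product implements it: the function name and the exactly-one-match rule are assumptions, and real tools may be more forgiving about whitespace or repeated matches.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one SEARCH/REPLACE edit, refusing to guess when the anchor is unclear."""
    if not search:
        raise ValueError("SEARCH text must not be empty")
    matches = source.count(search)
    if matches != 1:
        raise ValueError(f"SEARCH text must match exactly once, found {matches} matches")
    return source.replace(search, replace, 1)


source = (
    "def calculate_total(items):\n"
    "    sum = 0\n"
    "    for item in items:\n"
    "        sum += item.price\n"
    "    return sum\n"
)
search = source  # the SEARCH section quotes the existing function verbatim
replace = (
    "def calculate_total(items):\n"
    "    total = 0\n"
    "    for item in items:\n"
    "        total += item.price\n"
    "    return total\n"
)

updated = apply_search_replace(source, search, replace)
print(updated)
```

An edit whose SEARCH text does not appear in the file is rejected instead of being applied somewhere approximate, so the model cannot change code it has not quoted first.
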
Unified diff format has similar properties: every changed line must be anchored to context lines, which makes it harder to accidentally change things far from the intended edit site.
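
For a feel of what that anchoring looks like, Python's standard difflib module can generate a unified diff. The file name and the function being patched are hypothetical; n=1 asks for one line of context on each side of the change.

```python
import difflib

# Hypothetical before/after contents of a small file.
before_lines = [
    "def is_adult(age):",
    "    return age > 18",
    "",
    "print(is_adult(18))",
]
after_lines = [
    "def is_adult(age):",
    "    return age >= 18",
    "",
    "print(is_adult(18))",
]

diff_lines = list(
    difflib.unified_diff(
        before_lines,
        after_lines,
        fromfile="a/age_check.py",
        tofile="b/age_check.py",
        lineterm="",
        n=1,  # lines of unchanged context kept around each change
    )
)
print("\n".join(diff_lines))
```

The changed line shows up once with a minus and once with a plus, sandwiched between unchanged context lines that tie it to a specific place in the file.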

Line-range targeting, available in some editors, lets you explicitly tell the model “only look at lines 40 through 55.” This narrows the context the model has access to, which limits what it can touch.
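
One way to picture this, as a sketch and not a description of any specific editor: slice out the selected lines, hand only those to the model, and splice the result back. Here a plain callable stands in for the model, and both function names are hypothetical.

```python
from typing import Callable


def edit_line_range(
    source: str,
    start: int,
    end: int,
    edit: Callable[[list[str]], list[str]],
) -> str:
    """Let `edit` rewrite lines start..end (1-indexed, inclusive) and nothing else."""
    lines = source.splitlines()
    if not 1 <= start <= end <= len(lines):
        raise ValueError(f"line range {start}-{end} is outside the file")
    visible = lines[start - 1:end]  # the only lines the editor ever sees
    rewritten = edit(visible)
    result = lines[:start - 1] + rewritten + lines[end:]
    trailing = "\n" if source.endswith("\n") else ""
    return "\n".join(result) + trailing


def overeager_rename(lines: list[str]) -> list[str]:
    """Stand-in for a model that renames tmp everywhere it can reach."""
    return [line.replace("tmp", "temporary_buffer") for line in lines]


source = "tmp = load()\nprocess(tmp)\ntmp = load_more()\nprocess(tmp)\n"
result = edit_line_range(source, 3, 4, overeager_rename)
print(result)
```

The stand-in editor is as overeager as it likes, but lines 1 and 2 come back untouched because it never had them in the first place.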

System prompt instructions help too, but they are fragile. Telling a model “only make the minimal necessary change” in a system prompt will work some of the time and fail some of the time, especially for complex tasks where the model has to reason about what counts as necessary.

The Connection to Agentic Coding

This problem becomes more serious as AI coding tools become more agentic. When a model is making a single edit in a chat interface, over-editing is annoying. When a model is autonomously running through a codebase making dozens of edits as part of a larger task, over-editing can compound into something that is genuinely hard to review or reverse.

An agent that over-edits at each step will produce a diff that is much larger than necessary, touch files that were not relevant to the task, and introduce incidental changes that interact with each other in unpredictable ways. The more autonomous the system, the more important minimal editing becomes, because the human review step happens less frequently.

This is one of the reasons that systems like Aider put significant effort into their edit formats. Getting the edit format right is not a UX detail; it is a core part of making an autonomous coding agent trustworthy.

What Would Fix It

A proper fix has to happen at the training level. Models need to be evaluated and rewarded based on edit minimality, not just task completion. This probably means:

Benchmarks that score solutions partly on diff size relative to an oracle minimal solution
Human preference data that specifically penalizes unnecessary changes, collected from reviewers who see the before-and-after in realistic code review contexts
Automated metrics that flag when a model changes code outside the relevant region for a given task
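
The first and third of these are simple enough to sketch. What follows is one possible shape for such metrics, not an established benchmark definition: the function names, the ratio, and the idea of passing the relevant region as a line range are all assumptions, and the file contents are invented.

```python
import difflib


def changed_original_lines(before: str, after: str) -> set[int]:
    """1-indexed lines of `before` that were replaced or deleted.

    A pure insertion is attributed to the line it lands in front of.
    """
    matcher = difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )
    touched: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        touched.update(range(i1 + 1, max(i2, i1 + 1) + 1))
    return touched


def over_edit_ratio(before: str, candidate: str, oracle: str) -> float:
    """Lines touched by the candidate, relative to an oracle minimal patch."""
    oracle_size = len(changed_original_lines(before, oracle))
    if oracle_size == 0:
        raise ValueError("oracle patch does not change anything")
    return len(changed_original_lines(before, candidate)) / oracle_size


def lines_outside_region(before: str, after: str, region: range) -> list[int]:
    """Touched lines that fall outside the region relevant to the task."""
    return sorted(n for n in changed_original_lines(before, after) if n not in region)


BEFORE = """\
import math

def circle_area(r):
    return math.pi * r

def square_area(s):
    return s * s
"""
ORACLE = BEFORE.replace("math.pi * r", "math.pi * r * r")
CANDIDATE = ORACLE.replace("s * s", "s ** 2")  # unrequested tweak to another function

print(over_edit_ratio(BEFORE, CANDIDATE, ORACLE))
print(lines_outside_region(BEFORE, CANDIDATE, range(3, 5)))
```

A ratio near 1 means the candidate stayed close to the oracle. The harder problems are the ones the sketch waves away: producing a trustworthy oracle patch and deciding what the relevant region is for a given task.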

In the meantime, the most effective mitigation for individual developers is to be explicit about scope in prompts, use tools that structurally constrain edits, and treat AI-generated diffs with the same scrutiny you would apply to a junior developer who tends to over-engineer things. The output might be mostly right; the excess might matter.

Over-editing is not a catastrophic failure mode. Models that do it still get tasks done. But it is a quiet tax on every interaction, paid in review time, in blame pollution, and in the creeping sense that you are not quite sure what changed or why. Putting a name to it is a useful first step toward demanding better.