When AI Editors Can't Leave Well Enough Alone

Anyone who has used an AI coding assistant for more than a few sessions has encountered this: you ask it to fix a specific bug, and it hands you back a diff that fixes the bug, renames three variables for “clarity,” adds type annotations to an adjacent function, reformats a block it decided was hard to read, and splits one method into two because it thought the single-responsibility principle applied. The bug is fixed. Everything else is noise you now have to review, reason about, and accept or reject.

This pattern has a name now. A recent post by nrehiew calls it over-editing: a model modifying code beyond what is necessary to fulfill the request. The framing is useful because it separates a property of the edit from a property of the output. The code the model produces might be better in some aesthetic sense while still being a worse edit, because the quality of an edit depends on how much it changed relative to what the task required, not only on how the resulting code reads.

That distinction matters more than it might initially appear.

Why Models Over-Edit

Large language models do not perform edits in any algorithmic sense. They do not locate a region of code, compute a minimal delta, and apply it. They generate text, token by token, conditioned on the context they were given. When that context includes a file and an instruction, the model produces a completion that it has learned, through training, to associate with “good” outputs given inputs like this one.

The problem is that training data and feedback signals have almost no representation of minimal edits as a virtue. Models are trained on human-written code, where complete rewrites and opinionated refactoring are common. They are evaluated on benchmarks like SWE-bench where the pass/fail signal is whether tests pass, not whether the change was minimal. And when humans provide preference feedback during RLHF, annotators rating two outputs will often prefer the one that looks more complete, more polished, more thorough, even when the task was narrow.

The result is a systematic bias toward doing more. A model that adds error handling, improves naming, and adds a docstring while fixing a bug has been rewarded for those additions during training, because those additions look like better code. The fact that they were not requested is not a signal the training pipeline reliably captures.

This is structurally similar to sycophancy in preference-trained models: the model learns to produce outputs that look impressive to evaluators rather than outputs that are most useful. Over-editing is sycophancy applied to code quality instead of opinion validation.

The Concrete Costs

Over-editing is annoying in isolation. It compounds badly in practice.

The first cost is review burden. Every line that changes is a line a developer has to read and verify. A surgical three-line fix is reviewable in thirty seconds. A fix plus an unsolicited refactoring might require understanding why variables were renamed, verifying that the refactored control flow preserves edge-case behavior, and checking that the reformatted code will pass the project’s linter configuration. The minimal fix takes thirty seconds to review; the over-edited version might take twenty minutes.

The second cost is blast radius. Every line changed is a line that could have been changed incorrectly. A model that renames a variable might rename it in four of five call sites, leaving one that breaks at runtime. A model that restructures a function to improve readability might silently alter the order of side effects. The more the model touches, the more surface area exists for new bugs.
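A minimal Python sketch of the partial-rename failure described above. The function and variable names (`apply_discount`, `price`, `order_total`) are hypothetical and chosen only for illustration; the point is that the missed reference sits on a branch that ordinary inputs never reach.

```python
def apply_discount(order_total, rate):
    """Hypothetical function after an unrequested rename of `price` to `order_total`."""
    discounted = order_total * (1 - rate)
    if discounted < 0:
        # Missed call site: the old name survives on a rarely exercised branch.
        message = f"discount exceeds price {price}"
        raise ValueError(message)
    return discounted


# The common path works, so a quick manual check passes.
print(apply_discount(100, 0.5))

# The edge-case path fails at runtime with a NameError instead of the
# intended ValueError, because `price` no longer exists.
try:
    apply_discount(100, 1.5)
except NameError as exc:
    print(f"runtime failure: {exc}")
```

Nothing in this file fails at import time, which is why this class of mistake tends to surface after review rather than during it.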

The third cost is git history quality. git blame and git bisect are precision tools. A commit that fixes one bug should identify one causal change. When an AI assistant’s commit mixes the fix with reformatting and renaming, bisect results become ambiguous and blame output stops pointing at meaningful decisions. Small, focused commits are not just aesthetic; they are operational infrastructure.

# A minimal fix: clear intent, bisectable
git show a3f91bc
# -    return items[idx]
# +    if idx < 0 or idx >= len(items):
# +        raise IndexError(f"index {idx} out of range")
# +    return items[idx]

# An over-edited fix: same bug fixed, now good luck with bisect
git show d8e72af  
# -def get_item(items, idx):
# +def get_item(items: list, idx: int) -> Any:
#      """Retrieve item at index."""
# -    return items[idx]
# +    if idx < 0 or idx >= len(items):
# +        raise IndexError(f"index {idx} out of range")
# +    return items[idx]
# +
# +
# +def _validate_index(idx: int, length: int) -> None:
# ...

The second commit is “better code.” It is a worse commit.

How Editing Tools Try to Constrain This

Tool builders have been aware of this problem and approached it through a few different mechanisms.

Aider, one of the more widely used AI coding assistants, offers multiple edit formats and has documented tradeoffs between them. Its “whole” format asks the model to return the entire updated file, which makes changes easy for the model to express but maximizes over-editing risk. Its “udiff” format asks the model to produce something close to a unified diff, constraining how much it can touch. The tradeoff is that patch-style formats require the model to correctly identify and reproduce context lines, which even frontier models sometimes fail at, producing patches that do not apply cleanly.

The SEARCH/REPLACE block format, which is what Aider’s “diff” edit format uses and which several other tools have adopted, sits in between:

<<<<<<< SEARCH
def get_item(items, idx):
    return items[idx]
=======
def get_item(items, idx):
    if idx < 0 or idx >= len(items):
        raise IndexError(f"index {idx} out of range")
    return items[idx]
>>>>>>> REPLACE

This forces the model to identify exactly what it is replacing, which mechanically discourages touching code it did not need to touch. If the model wants to rename a variable, it has to emit SEARCH text that quotes every line containing the old name and REPLACE text that repeats those lines with the new one. That friction alone suppresses some over-editing.
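To make the mechanics concrete, here is a rough sketch of how a tool could apply such a block. This is not Aider’s implementation; the function name `apply_search_replace` and the rule that SEARCH text must match exactly once are assumptions made for this example. Real tools typically add fuzzy matching and file targeting.

```python
import re

# Matches one SEARCH/REPLACE block in the marker style shown above.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_search_replace(source: str, edit: str) -> str:
    """Apply every SEARCH/REPLACE block found in `edit` to `source`."""
    blocks = BLOCK_RE.findall(edit)
    if not blocks:
        raise ValueError("no SEARCH/REPLACE block found")
    for search, replace in blocks:
        count = source.count(search)
        if count != 1:
            # Refuse ambiguous or stale edits rather than guessing.
            raise ValueError(f"SEARCH text matched {count} times; expected 1")
        source = source.replace(search, replace, 1)
    return source


source = (
    "def get_item(items, idx):\n"
    "    return items[idx]\n"
    "\n"
    "def other():\n"
    "    return 1\n"
)

edit = '''<<<<<<< SEARCH
def get_item(items, idx):
    return items[idx]
=======
def get_item(items, idx):
    if idx < 0 or idx >= len(items):
        raise IndexError(f"index {idx} out of range")
    return items[idx]
>>>>>>> REPLACE
'''

print(apply_search_replace(source, edit))
```

Because only quoted text can change, `other()` is untouched by construction. The model can still over-edit, but each extra change costs it another explicit block.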

But none of these are complete solutions. They are interfaces that make over-editing harder, not models that have learned not to over-edit.

The Measurement Gap

The deeper problem is that the field does not have a standard way to penalize over-editing in evaluation. SWE-bench measures whether tests pass. The METR study on AI software engineering showed that many benchmark-passing patches would be rejected in real code review, but it focused on correctness and style issues, not specifically on edit minimality.

There are formal tools for reasoning about minimal edits. The Myers diff algorithm, which underlies Git’s diff output and many other comparison tools, defines a minimum edit sequence between two strings in terms of insertions and deletions. But minimum edit distance in raw text is not the same as minimal edit in semantic terms: you can make a semantically identical change with very different diff sizes depending on formatting, and you can make a large diff that represents a genuinely minimal semantic change if the surrounding code required restructuring.
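The gap between textual and semantic size can be illustrated with Python’s standard library. This is a sketch under stated assumptions: `difflib` uses its own matching heuristic rather than the Myers algorithm, and comparing `ast.dump` output is only a crude stand-in for semantic equivalence. The helper names `changed_line_count` and `same_ast` are made up for the example.

```python
import ast
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count added plus removed lines in a line-based diff."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def same_ast(before: str, after: str) -> bool:
    """True when both sources parse to the same syntax tree (ignores layout)."""
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


original = "def total(xs):\n    return sum(x for x in xs if x > 0)\n"

# Pure reformatting: many lines change, behavior does not.
reformatted = (
    "def total(xs):\n"
    "    return sum(\n"
    "        x\n"
    "        for x in xs\n"
    "        if x > 0\n"
    "    )\n"
)

# A real behavior change: one character differs.
behavior_change = "def total(xs):\n    return sum(x for x in xs if x >= 0)\n"

for label, candidate in [("reformatted", reformatted), ("behavior change", behavior_change)]:
    print(label, changed_line_count(original, candidate), same_ast(original, candidate))
```

The reformatted version produces the larger textual diff while leaving the syntax tree identical, and the behavior change produces the smaller one. A raw diff count alone would rank them the wrong way around.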

A more useful metric would be something like: given the task specification and the correct output, what is the smallest set of changes to the input file that produces a semantically equivalent correct output, and how far does the model’s actual edit deviate from that? This is expensive to compute at scale, but it would give a meaningful signal about over-editing tendency.
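A very rough textual proxy for such a metric can be sketched as follows. It assumes a human-written minimal reference fix is available for each task, which is a strong assumption, and it counts changed lines rather than semantic changes. The name `over_edit_ratio` and its definition are hypothetical and not taken from any existing benchmark.

```python
import difflib


def changed_lines(before: str, after: str) -> int:
    """Number of lines removed from `before` plus lines added in `after`."""
    matcher = difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def over_edit_ratio(original: str, reference_fix: str, model_fix: str) -> float:
    """Lines the model changed, relative to a minimal reference fix.

    A value near 1.0 means the model changed about as much as the reference;
    larger values suggest changes beyond what the task needed.
    """
    reference = changed_lines(original, reference_fix)
    if reference == 0:
        raise ValueError("reference fix does not change the original")
    return changed_lines(original, model_fix) / reference


original = "def get_item(items, idx):\n    return items[idx]\n"

reference_fix = (
    "def get_item(items, idx):\n"
    "    if idx < 0 or idx >= len(items):\n"
    '        raise IndexError(f"index {idx} out of range")\n'
    "    return items[idx]\n"
)

# Same fix, plus unrequested type annotations on the signature.
model_fix = (
    "def get_item(items: list, idx: int):\n"
    "    if idx < 0 or idx >= len(items):\n"
    '        raise IndexError(f"index {idx} out of range")\n'
    "    return items[idx]\n"
)

print(over_edit_ratio(original, reference_fix, model_fix))
```

A ratio like this inherits every weakness of line-based diffs, so it is better read as a cheap screening signal than as the semantic measure described above.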

Some researchers have looked at related metrics. CodeBLEU measures structural similarity between generated code and reference code, which captures some of this, but it is primarily used for code generation evaluation, not editing evaluation. Dedicated editing benchmarks like EditEval exist for natural language, and code-specific analogs are starting to emerge, but none have become standard.

What Good Editing Behavior Actually Looks Like

From a developer’s perspective, a well-behaved AI editor should have a narrow interpretation of its mandate. “Fix this bug” means fix the bug. It does not mean fix the bug and improve adjacent code that struck the model as suboptimal. “Add error handling here” means add error handling at that specific call site. It does not mean add error handling and also refactor the function that contains the call site.

This requires something closer to a principle of least intervention: change the minimum necessary to accomplish the task, and leave everything else exactly as it was, including code that the model might “know” is not idiomatic or could be written differently.
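One way a tool could check this principle after the fact is to compare which top-level functions actually changed against the ones the task was allowed to touch. The sketch below is an assumption-laden illustration, not an existing tool: the names `functions_changed` and `stays_in_scope` are hypothetical, and because it compares syntax trees it will not notice changes to comments or formatting.

```python
import ast


def functions_changed(before: str, after: str) -> set[str]:
    """Names of top-level functions whose syntax tree differs between versions."""

    def index(source: str) -> dict[str, str]:
        return {
            node.name: ast.dump(node)
            for node in ast.parse(source).body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        }

    old, new = index(before), index(after)
    # Functions that were added or removed count as changed too.
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


def stays_in_scope(before: str, after: str, allowed: set[str]) -> bool:
    """True when every changed function is one the task permitted."""
    return functions_changed(before, after) <= allowed


before = '''
def get_item(items, idx):
    return items[idx]

def helper(x):
    return x + 1
'''

# The requested fix, plus unrequested annotations on an adjacent function.
after = '''
def get_item(items, idx):
    if idx < 0 or idx >= len(items):
        raise IndexError("index out of range")
    return items[idx]

def helper(x: int) -> int:
    return x + 1
'''

print(sorted(functions_changed(before, after)))
print(stays_in_scope(before, after, allowed={"get_item"}))
```

A check like this can only reject an out-of-scope edit; it does nothing to make the model less inclined to produce one.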

This is a harder constraint to satisfy than it sounds. When a model is generating a function that contains a bug fix, it has no clean way to say “I am deliberately leaving this other code exactly as-is.” The generation process does not have that concept. What would help is training on data where minimal edits are explicitly labeled and rewarded, and evaluation pipelines that penalize unnecessary changes directly rather than ignoring them.

Some of this comes down to instruction following, and the best current models are meaningfully better at narrow task execution than earlier ones. Asking Claude or GPT-4o to fix only a specific thing and leave everything else untouched produces more constrained edits than asking the same of models from two years ago. But the tendency to add, improve, and expand is not gone. It is just more suppressible through explicit prompting.

Why This Is Worth Taking Seriously

Over-editing is easy to treat as a minor annoyance because any individual instance is not catastrophic. The fix still works; you just have some extra diff to review. But at scale, across a development workflow that involves dozens of AI-assisted edits per day, the accumulated review burden is substantial. And the occasional over-edited bug introduction, where the model silently changed something it should not have touched while addressing something else, can cost more time to debug than the original fix saved.

The framing in nrehiew’s post is right to treat this as a model property, not just a user experience complaint. Models have over-editing tendencies that are consistent and measurable, and those tendencies come from training choices and evaluation gaps. Fixing them requires addressing the root cause, not just prompting around the symptom.

The benchmark conversation in AI coding has focused on whether models can solve tasks. The next conversation needs to include how they solve them, and whether the changes they make beyond the task are worth the cost of reviewing them.