A new paper on arXiv ran a controlled experiment on what happens to people after AI assistance is removed. The result: participants who worked with AI tools performed significantly worse on independent follow-up tasks than a control group that never had AI access at all. They also gave up faster. Not just “not as good as when they had AI” but worse than people who never touched it.
That last part is the one worth sitting with. In this experiment, using AI left people worse at the task than never using AI at all.
The Mechanism: Productive Struggle
The researchers propose several interacting explanations, but they all converge on the same educational psychology concept: desirable difficulty. Robert Bjork’s work on this goes back to the early 1990s and the core claim is straightforward. Effortful cognitive processing during practice is what produces durable memory and skill. When you struggle to retrieve something, construct an answer, or work through a problem without external scaffolding, the act of struggle is itself the mechanism of learning. Remove the struggle, and you remove the encoding.
AI assistance, in this framing, functions like an infinitely available worked example. Cognitive Load Theory, developed by John Sweller, has long established that worked examples are highly effective for novices but should be gradually withdrawn as competence develops, because the transition to independent problem-solving is what consolidates skill. AI never withdraws. It remains equally capable and equally available regardless of how much the user has learned. There is no fading of assistance, no moment where the scaffolding comes down.
The study also found that AI-assisted participants showed reduced persistence on hard problems after the tool was removed. They spent less time on tasks before abandoning them. This is consistent with a learned-helplessness pattern: after repeated exposure to a workflow where difficult problems are routed to an external system, the user’s tolerance for difficulty without that system decreases. The hard problem that would have triggered more effort now triggers a shorter search and an earlier quit.
This Has Happened Before
GPS is the canonical prior case. A 2015 study by Dahmani and Bohbot found that habitual GPS users showed reduced grey matter volume in the hippocampus compared to people who navigated by memory and landmarks. The hippocampus is central to spatial navigation and to memory consolidation more broadly. Heavy GPS use was not just a behavioral change; it corresponded to structural differences in the brain. The 2009 work by Burnett also showed that GPS-reliant drivers performed worse on unaided route recall tasks.
The calculator literature is more complicated. A widely cited 1986 meta-analysis by Hembree and Dessart found that calculator use improved problem-solving performance and attitudes toward math when paired with conceptual instruction, but degraded mental arithmetic fluency when used as a replacement for understanding. The key variable was pedagogy, not the tool itself. Students who used calculators without also being taught the underlying operations fared worse on arithmetic tasks than students who learned without calculators.
Betsy Sparrow, Jenny Liu, and Daniel Wegner published what became known as the “Google Effects on Memory” paper in Science in 2011. They found that when people expected to be able to look something up later, they were less likely to remember the information itself and more likely to remember where to find it. This is a benign version of cognitive offloading, the redistribution of cognitive work onto external systems. The AI case described in the new paper is a more consequential version: offloading not just memory but generative reasoning and problem construction.
The Counterargument Worth Taking Seriously
The strongest objection to the paper’s framing is also the most practical one: the study measures performance without AI, but in most real professional contexts, AI tools will remain available indefinitely. If the goal is task output, not skill development, and if AI is always present, the question of independent performance may simply not be relevant.
This is a coherent position and not a naive one. Writing replaced oral memory for a significant portion of human knowledge transmission. Calculators replaced mental arithmetic in most professional contexts. Society accepted those trade-offs because the replaced skills were superseded, not just supplemented. The argument for AI assistance follows the same logic: if the tool is permanently available and reliable, skill atrophy is the cost of a trade-off that has positive expected value.
The problem with this argument is that it assumes a continuity of access and a level of reliability that do not actually exist. AI tools fail, have rate limits, produce incorrect outputs that the user must evaluate, and are unavailable in contexts where latency, security, or cost prohibit them. More fundamentally, evaluating AI output requires enough independent competence to identify when the output is wrong. A user who has fully offloaded their problem-solving capability cannot catch the errors in the AI’s answers.
There is also the expertise reversal effect to consider. Research on worked examples shows that as expertise increases, worked examples become less useful and eventually counterproductive, because experts have well-developed schemas for processing problems directly and the detailed guidance interferes rather than assists. This suggests AI assistance may be most harmful to learners and least harmful to experienced practitioners who engage more critically with AI output. The negative effects in the paper may be substantially concentrated in people who are still developing competence.
What This Means in Practice
The calculator research offers a useful model for thinking about this. The variable that determined whether calculators helped or hurt was whether the underlying conceptual layer was also being taught. When it was, calculators freed up attention for higher-order reasoning. When it was not, calculators replaced the reasoning process entirely.
The same distinction applies to AI assistance. Using an AI tool to speed up boilerplate you already understand is different from using it to generate solutions to problems you have not yet learned to solve. The former is augmentation; the latter is substitution. The paper’s findings apply most directly to the substitution case.
For developers, this suggests that the useful discipline is maintaining deliberate practice without AI assistance on the kinds of problems that build foundational competence. Not as a purist stance against the tools, but as a practical recognition that the tools only augment capability that exists independently. A developer who cannot think through an algorithm without autocomplete filling in each step does not have the competence to evaluate whether the autocomplete’s algorithm is correct, efficient, or appropriate for the constraint.
The paper is unlikely to change how most people use AI tools, and it probably should not. The productivity gains from AI assistance are real. The 2023 study by Peng et al. measured developers completing a programming task about 55% faster with Copilot than without. Those gains matter. But they are gains in output speed, not gains in capability. The research on AI assistance is a reminder that confusing the two has a cost, and that the cost accumulates over time in ways that do not show up until the tool is unavailable or wrong.
The GPS analogy ends on a concrete note: plenty of people who use GPS constantly can still read a map. They just choose not to. The choice to maintain that capability is the point.