The Mental Work Coding Agents Don’t Eliminate

Simon Willison published a piece on the cognitive cost of coding agents this week, and it touched on something I’ve been turning over for a while. The promise of agents like Claude Code, Cursor, and similar tools is straightforward: they write the code while you direct them at a higher level. The implicit assumption is that this reduces your cognitive load. That framing misses what’s actually happening.

The Work Shifts, It Doesn’t Disappear

When I started using Claude Code heavily for Ralph, my Discord bot, the most striking thing wasn’t how much faster I moved. It was how the nature of my work changed. I stopped spending most of my time writing code. I started spending it reading code, approving changes, checking diffs, and deciding whether the agent had understood my intent correctly.

That’s not lighter work; it’s different work, and in some ways harder.

Writing code is a synthetic act. You build a mental model of the problem and translate it into executable form, getting immediate feedback as you go. The tight loop between thought and output is why flow states are so achievable when programming. Reviewing agent-generated code is a verification act. You hold your intent in working memory while scanning someone else’s solution to check whether it matches. The two modes draw on different cognitive resources, and moving from one to the other is not a clean transition.

Gerald Weinberg’s widely cited estimate for multitasking in software development is that adding a second simultaneous project costs roughly 20% of your productive time in context switching. By the third concurrent project, you’re losing 40% to switching overhead alone. Agentic coding doesn’t produce the same profile as juggling separate projects, but it creates a similar pattern: the agent runs, you do other things, the agent surfaces a decision point, you switch back, you reconstruct what the agent has done, you decide, you switch away again. The reconstruction step is the expensive one.
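To make those percentages concrete, here is a small sketch of the arithmetic. It uses only the two figures cited above; the zero-loss baseline for a single project and the eight-hour day are my assumptions, and nothing here claims that agent sessions follow the same numbers as separate projects.

```python
# Fraction of total working time lost to context switching, keyed by the
# number of simultaneous projects. The values for two and three projects are
# the ones cited above; the single-project baseline of zero is an assumption.
SWITCHING_LOSS = {1: 0.0, 2: 0.20, 3: 0.40}


def productive_hours(total_hours: float, projects: int) -> tuple[float, float]:
    """Return (productive hours overall, productive hours per project)."""
    if projects not in SWITCHING_LOSS:
        raise ValueError(f"no switching-loss figure for {projects} projects")
    productive = total_hours * (1 - SWITCHING_LOSS[projects])
    return productive, productive / projects


if __name__ == "__main__":
    for n in sorted(SWITCHING_LOSS):
        overall, each = productive_hours(8.0, n)
        print(f"{n} project(s): {overall:.1f}h productive, {each:.1f}h each")
```

On an eight-hour day, three concurrent projects leave 4.8 productive hours in total, or 1.6 per project. The per-project figure is the one that stings.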

Attention Residue and the Agent Loop

Sophie Leroy’s concept of attention residue describes what happens when you switch away from a task before it reaches a natural stopping point: part of your attention stays behind, on the unfinished work, even as you engage with something else. The effect is measurable and consistent across studies: people perform worse on a new task when they’ve left a previous one incomplete.

Agentic coding creates this condition structurally. The agent is always in motion. You can’t reach a clean handoff because the handoff is the entire workflow. You’re perpetually mid-task while the agent runs, and perpetually pulled back when it needs a decision.

Gloria Mark’s research at UC Irvine on workplace interruptions found that it takes an average of 23 minutes to fully return to a task after a disruption. That figure seems high for the quick approval prompts coding agents generate, but those prompts aren’t neutral. Each one requires you to understand enough of what the agent has done to make a judgment about what it should do next. At the wrong moment, a 30-second decision becomes a context reload.

The Verification Problem

There’s a specific cognitive demand in reviewing AI-generated code that doesn’t map cleanly onto traditional code review. When you review a colleague’s pull request, you’re evaluating code written by someone who shares your understanding of the codebase, who made decisions you can reason about, and who you can ask questions of. The author had context you can recover.

When you review agent-generated code, you’re evaluating code written by a system that had exactly the context you gave it, no more. If the agent made a wrong turn early, subsequent code can look locally reasonable while being systemically wrong. To catch those errors, you need to hold three things in mind at once: what you asked for, what the agent has done so far, and how the current state relates to your original intent.

This is a form of mental simulation, and it’s expensive. It’s closer to debugging than reviewing.

I’ve lost hours to this pattern: approving changes that looked correct in isolation, then discovering twenty interactions later that they had accumulated into something structurally wrong. The agent wasn’t doing anything obviously bad at any individual step. The problem was the direction of travel, which I’d stopped actively tracking.
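One habit that would have helped me is forcing a whole-session review after a fixed number of approvals, instead of trusting myself to notice drift. A minimal sketch of that idea follows. The `DriftCheckpoint` class and its default interval of five are hypothetical, invented for illustration; this is not a feature of Claude Code or any other agent.

```python
from dataclasses import dataclass


@dataclass
class DriftCheckpoint:
    """Counts approved steps and flags when a direction review is due.

    Hypothetical helper: the interval is an assumed value, not a
    recommendation, and should be tuned to the task.
    """

    interval: int = 5
    approvals_since_review: int = 0

    def record_approval(self) -> bool:
        """Record one approved step; return True when a review is due."""
        self.approvals_since_review += 1
        return self.approvals_since_review >= self.interval

    def record_review(self) -> None:
        """Reset the counter after re-reading the session against the goal."""
        self.approvals_since_review = 0


if __name__ == "__main__":
    checkpoint = DriftCheckpoint(interval=3)
    for step in range(1, 7):
        if checkpoint.record_approval():
            print(f"step {step}: stop and compare the whole diff to the original intent")
            checkpoint.record_review()
```

The counter does nothing clever. Its only job is to turn "I should check the direction of travel at some point" into a prompt that arrives whether or not I feel like I need it.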

Tool Design and Cognitive Budget

Not all agents create this overhead equally. The interaction model matters.

Tools that surface reasoning, not just results, reduce the verification burden. When Claude Code explains why it chose a particular approach, I don’t need to reconstruct that reasoning from the diff. I can evaluate the stated reasoning instead, which is faster and more reliable. The code still matters, but the explanation gives me something concrete to agree or disagree with.

Interrupt granularity matters too. An agent that asks for approval on every file edit creates a different cognitive rhythm than one that runs a full implementation and surfaces a single diff. Neither is universally better. Fine-grained interrupts keep you aligned with what the agent is doing but fragment attention continuously. Large batch approvals let you stay in a working state for longer but increase the risk of missing drift. The right tradeoff depends heavily on how well you can specify the task upfront.

Claude Code’s model, asking for tool-use confirmations rather than line-by-line edits, sits in a reasonable middle position. But it means you need to maintain enough situational awareness to distinguish consequential tool uses (overwriting a file you haven’t reviewed) from routine ones (reading a file you already know). That awareness has to come from somewhere.
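The distinction between consequential and routine tool uses can be written down as an explicit rule, which at least makes it clear what the awareness has to cover. The sketch below is my own illustration under stated assumptions: `ToolCall`, `ApprovalPolicy`, and the action names are hypothetical and do not correspond to Claude Code’s actual permission system or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    """A hypothetical description of one tool use requested by an agent."""

    action: str  # assumed action names: "read", "write", anything else
    path: str = ""


@dataclass
class ApprovalPolicy:
    """Tracks which files the human has reviewed in this session."""

    reviewed: set[str] = field(default_factory=set)

    def mark_reviewed(self, path: str) -> None:
        self.reviewed.add(path)

    def needs_attention(self, call: ToolCall) -> bool:
        """Return True when the call deserves more than a reflexive approval."""
        if call.action == "read":
            # Reading changes nothing, so treat it as routine.
            return False
        if call.action == "write":
            # Overwriting a file that has not been reviewed is consequential.
            return call.path not in self.reviewed
        # Unknown actions (shell commands, network calls): be conservative.
        return True


if __name__ == "__main__":
    policy = ApprovalPolicy()
    policy.mark_reviewed("bot/commands.py")
    for call in (
        ToolCall("read", "bot/config.py"),
        ToolCall("write", "bot/commands.py"),
        ToolCall("write", "bot/config.py"),
    ):
        label = "look closely" if policy.needs_attention(call) else "routine"
        print(f"{call.action} {call.path}: {label}")
```

The file paths are made up. The rule that a write to an already-reviewed file is routine is itself a simplification, since a reviewed file can still be overwritten with something wrong, but it shows how much state the reviewer is implicitly carrying.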

The Skill Set Changes

What Willison’s piece identifies, and what I’ve arrived at through building with these tools over the past year, is that effective agentic development requires a genuinely different skill set from effective solo coding.

Solo coding rewards the ability to hold a complex system in working memory and manipulate it incrementally. Agentic development rewards the ability to specify clearly upfront, monitor direction rather than just output, and catch divergence early before small misalignments compound. It’s closer to technical project management than programming, and it’s not a skill that good programmers automatically possess.

I’ve watched capable engineers struggle with agent-based workflows not because the agents lacked capability, but because maintaining the supervisory thread was itself the bottleneck. They’d set the agent on a task, let their attention drift, come back, approve changes too quickly, and spend the next few hours unpicking the accumulated results. The constraint was their ability to sustain supervisory attention over long runs.

This isn’t an argument that agentic coding is worse. It’s an argument that the cognitive costs are real, specific, and different from what most discussions acknowledge. The efficiency gains are genuine for well-scoped tasks in familiar territory. For work in unfamiliar codebases, or on systems where interconnections are subtle, the supervisory overhead can eat the gains and then some.

For Ralph, where I know the entire codebase and the tasks are well-bounded, agents accelerate the work substantially and the cognitive overhead is manageable. For work in less familiar territory, I’ve learned to be deliberate about how much of the decision-making I delegate and at what granularity.

The agent will write the code; the cognitive work of tracking whether it’s writing the right code remains yours, and it compounds across a session in ways that are worth accounting for before you hand over the wheel.