
Coding Agents Don't Remove Cognitive Load, They Redistribute It

Source: simonwillison

Simon Willison published a piece yesterday on the cognitive impact of coding agents that’s worth reading carefully, because it names something most developer experience writing tends to skip over: using these tools is genuinely exhausting in ways that sneak up on you. I want to push that observation further and ground it in a framework that makes the mechanism clearer.

Cognitive Load Theory and Why It Matters Here

Cognitive load theory, developed by psychologist John Sweller in the late 1980s, divides the mental work of learning and problem-solving into three types. Intrinsic load is the irreducible complexity of the task itself. Extraneous load is friction introduced by how the task is presented or the tools you’re using. Germane load is the productive mental effort that builds schemas and understanding.

The promise of coding agents is that they reduce extraneous load. You don’t have to remember the exact API signature; you don’t have to write boilerplate by hand; you don’t have to context-switch to documentation tabs. That part of the promise is real. GitHub Copilot does reduce the friction of certain mechanical tasks, and Claude Code in an agentic loop can complete multi-step refactors that would have taken an afternoon of tedious editing.

The problem is that the framework also tells you what happens next. When extraneous load drops sharply, the intrinsic load of the task doesn’t change. And if you’re not careful, the germane load that produces understanding gets crowded out by a new and different form of extraneous load: the cognitive overhead of supervising an agent.

The Shift, Not the Elimination

The classic mental model for development cost is simple: you think about the problem, you write code, you verify it works. Thinking is expensive, writing is moderately expensive, and verification is relatively cheap if the code is yours.

Agents invert the last two. Writing becomes nearly free, at least for the first draft. Verification becomes the expensive step, and it’s expensive in a way that’s qualitatively different from verifying code you wrote yourself.

When you write a function, you hold a model of it in your head as you write it. Verification is mostly checking that the implementation matches the model you already built. When you read AI-generated code, you’re building the model from scratch from the output, not from the process. That’s harder and slower, because you’re reverse-engineering the author’s intent rather than confirming your own.

This is the asymmetry at the heart of agent-assisted development. Code generation scales up. Code comprehension doesn’t scale the same way, because comprehension is load-bearing in a way that generation isn’t. You can always generate more code. You cannot skip understanding it.

The Principal-Agent Problem in Software

Economists use the principal-agent problem to describe situations where one party (the agent) acts on behalf of another (the principal) but has different incentives or information. The classic examples involve shareholders and executives, or clients and lawyers. The agent is not malicious; they’re just optimizing for something slightly different from what the principal actually wants.

Coding agents have a version of this property. Claude Code, Cursor, and GitHub Copilot’s workspace features all optimize for completing the stated task. They will aim for code that passes tests, compiles, and fulfills the literal specification. They have no particular incentive to produce code that you, the developer, will understand easily, or that maintains the local conventions of your codebase, or that is trivially reviewable six months from now by someone who wasn’t in the conversation.

This divergence shows up in practice. Agents tend toward completeness over clarity. They’ll produce a working solution that handles edge cases you didn’t ask about, which sounds good until you’re reviewing 300 lines of generated TypeScript trying to figure out which parts are load-bearing and which parts are speculative. The code may well be correct, but your mental model of it is full of holes, and the holes are invisible until you hit one.

Context Window Management Is Its Own Cognitive Tax

Long agent sessions in Claude Code or similar agentic loops introduce a specific category of overhead that doesn’t get discussed enough: keeping the agent on track as context grows.

Language models have finite context windows, and performance degrades as context fills. Instructions given early in a session compete with everything added since. An agent working through a large refactor across many files will progressively lose coherence about constraints you established at the start — architectural decisions, naming conventions, invariants you explicitly stated. The agent doesn’t flag this. It just quietly starts optimizing locally, for the immediate task, and your global constraints drift.

Managing this requires you to periodically reassert context, summarize earlier decisions, or start new sessions with distilled instructions. All of that is cognitive work that falls entirely on you. It’s not captured in lines-of-code metrics or task completion times. It’s the invisible overhead of being the agent’s external memory.

For Discord bot development, I’ve hit this concretely. A session that starts with “keep the handler architecture consistent with the existing command pattern” will drift by the time the agent is working on the fifth file, producing code that technically works but introduces structural inconsistency. Catching and correcting that drift requires holding the original constraint in your own head and pattern-matching against each generated output. That’s not building understanding of the problem; it’s entirely overhead.
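Some of that external-memory work can be written down rather than held in your head. The following is a minimal sketch, not a feature of any particular agent tool: `SessionNotes`, its fields, and the five-turn reminder interval are all hypothetical names and values chosen for illustration. It keeps the constraints you stated at the start, plus decisions made along the way, and produces a short block of text to paste back into the session or to seed a fresh one.

```python
from dataclasses import dataclass, field


@dataclass
class SessionNotes:
    """Hypothetical external memory for one agent session."""

    constraints: list[str] = field(default_factory=list)  # rules stated up front
    decisions: list[str] = field(default_factory=list)    # choices made along the way
    turns_since_reminder: int = 0

    def record_turn(self) -> None:
        """Call once per exchange with the agent."""
        self.turns_since_reminder += 1

    def needs_reminder(self, every_n_turns: int = 5) -> bool:
        # The interval is an arbitrary assumption; tune it to how fast drift appears.
        return self.turns_since_reminder >= every_n_turns

    def reminder(self) -> str:
        """Build the text to paste into the session and reset the turn counter."""
        lines = ["Constraints that still apply:"]
        lines += [f"- {c}" for c in self.constraints]
        if self.decisions:
            lines.append("Decisions made so far:")
            lines += [f"- {d}" for d in self.decisions]
        self.turns_since_reminder = 0
        return "\n".join(lines)


notes = SessionNotes(
    constraints=[
        "keep the handler architecture consistent with the existing command pattern"
    ]
)
```

Calling `notes.record_turn()` after each exchange and checking `notes.needs_reminder()` turns "remember to reassert the constraint" into a prompt you respond to instead of a thing you have to track. It does not remove the review work; it only makes one piece of the overhead explicit.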

The Trust Calibration Problem

Every interaction with a coding agent forces a low-stakes but high-frequency decision: verify this output carefully, or trust it and move on. Neither extreme is right. Verifying everything defeats the purpose; trusting everything is how you ship subtle bugs at scale.

The problem is that calibration is hard to maintain. Agents are often right on the mechanical tasks, which trains you to trust quickly. Then they produce something subtly wrong in a domain where they’re less reliable — concurrency, security boundaries, protocol edge cases — and you miss it because your verification threshold has drifted toward trust.

This is not a hypothetical. Security researchers have demonstrated that LLMs produce plausible-looking but vulnerable code with enough frequency to be a real risk. The vulnerability doesn’t announce itself. It looks like correct code until you reason carefully about the invariants.

The trust calibration problem is aggravated by task complexity. For a one-line utility function, trusting the agent output is usually fine. For a multi-file change that touches authentication logic or state management across concurrent operations, careful verification is warranted. The agent doesn’t signal which category a given output falls into. You have to maintain that judgment yourself, across every interaction, for the entire session.
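One way to keep that judgment from drifting is to decide the rule once, outside the session, and apply it mechanically. The sketch below is an illustrative heuristic under stated assumptions, not an established method: the marker list, the 50-line cutoff, and the function name `review_depth` are all made up for this example and would need tuning for a real codebase.

```python
from pathlib import PurePosixPath

# Assumed markers of sensitive code; adjust for your own project layout.
SENSITIVE_MARKERS = ("auth", "crypto", "session", "lock", "queue")


def review_depth(changed_paths: list[str], lines_changed: int) -> str:
    """Return "careful" or "skim" for an agent-produced change."""
    # Any path component containing a marker counts as touching sensitive code.
    touches_sensitive = any(
        marker in part.lower()
        for path in changed_paths
        for part in PurePosixPath(path).parts
        for marker in SENSITIVE_MARKERS
    )
    multi_file = len(changed_paths) > 1
    large = lines_changed > 50  # arbitrary cutoff, an assumption of this sketch
    if touches_sensitive or (multi_file and large):
        return "careful"
    return "skim"
```

Under these assumptions, `review_depth(["src/utils/format.py"], 3)` returns `"skim"`, while `review_depth(["src/auth/login.py"], 3)` returns `"careful"` regardless of size. The point is not that path names are a reliable signal. It is that a rule fixed in advance cannot be worn down by a run of correct outputs the way an in-the-moment threshold can.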

What the Germane Load Is Actually Buying

None of this is an argument against using agents. The productivity gains on the right class of task are real, and I’m not interested in arguing otherwise.

The argument is for precision about what kind of cognitive work remains, and why it matters. The intrinsic complexity of the software you’re building hasn’t dropped. The germane load of building actual understanding of your system has, if anything, become more important — because you’re now reviewing code you didn’t produce, at higher volume, with a trust calibration problem that requires constant attention.

Willison’s observation that these tools are cognitively tiring maps directly onto this. The tiredness isn’t from writing code. It’s from sustained high-vigilance review of outputs you can’t fully trust, in sessions where you’re also managing context drift and making rapid trust calibration decisions. That’s a different kind of hard than staring at a blank editor, but it’s not easier.

The developers who do well with these tools in the long run will be the ones who recognize this shift explicitly and manage for it: time-boxing agent sessions to limit context drift, maintaining genuine code review standards on agent output rather than rubber-stamping it, and protecting the germane cognitive work that builds actual system understanding rather than letting it get crowded out by the volume of generated code.

The tools are powerful. The cognitive cost was always there. It’s just located somewhere different now.
