
The Bottleneck Doesn't Disappear When AI Writes the Code

Source: simonwillison

Simon Willison’s recent piece on coding after coders arrives at a moment when the question of programming’s future is genuinely harder to dismiss than it has ever been. The tools are real. The productivity gains are documented. People with no programming background are building functional software. And yet the framing of “the end of programming as we know it” carries an echo that anyone who has followed this space long enough will recognize.

This is not the first time someone credible has made this call.

A Pattern Worth Taking Seriously

In 1982, James Martin published Application Development Without Programmers, arguing that fourth-generation languages and relational database tools would allow business analysts to build their own systems, sidelining professional developers for the bulk of enterprise software work. The prediction was not crazy. SQL and spreadsheet tools did automate entire categories of work that had previously required Fortran or COBOL programmers. The class of developer who specialized in writing custom reports and data-entry screens shrank.

But it did not end programming. It shifted the bottleneck. The analysts who tried to replace developers with 4GL tools discovered that specifying what a query needed to produce required a precision of thought that was not fundamentally different from programming. The hard part was never the syntax. It was knowing what you wanted, precisely enough to express it in any language.

The CASE tools era of the late 1980s and early 1990s tried the same move at a higher level of abstraction: generate code from visual diagrams. These tools mostly failed, and the failure was instructive. Organizations found that the analysts using the tools lacked the conceptual vocabulary to specify systems with enough precision for code generation to work. The bottleneck did not move below the developer; it exposed that the developer’s real value was never the typing.

Visual Basic in the 1990s did something these earlier waves did not. It genuinely democratized Windows programming for a specific category of applications. The corporate developer building internal tools, the power user who wanted to automate Excel, the small business owner who needed a simple database front-end: these people built things with VB that would previously have required hiring a programmer. Programming did not end, but a real market segment moved.

The no-code and low-code wave of the 2010s followed the same pattern again. Bubble, Webflow, Power Apps, and their peers carved out genuine markets. They work extremely well for a constrained category of problem: applications whose requirements fit neatly within the platform’s model of the world, with limited integration complexity and tolerance for the abstractions the platform imposes. At the boundary where your requirements diverge from the platform’s assumptions, you hit a wall.

Each of these waves correctly identified that a layer of implementation was about to get cheaper. Each of them underestimated how thoroughly the bottleneck would shift upward rather than disappear.

What Is Different This Time

The current AI coding tools are more capable than anything in that lineage. GitHub Copilot, Cursor, and Claude Code operate across arbitrary codebases in natural language. Andrej Karpathy’s term “vibe coding,” coined in early 2025, captured a real emergent behavior: experienced developers operating at a level of abstraction where they describe intent and accept implementations without always reading the code carefully. Non-programmers are building functional prototypes.

The breadth genuinely exceeds prior waves. There is no obvious platform boundary where the tools stop working. The question is whether this difference in degree amounts to a difference in kind, or whether the structural dynamic is the same.

I think the structural dynamic is largely the same, even if the magnitude is greater.

What AI coding tools do well is generate plausible implementation given a clear specification. What they do poorly is identify when a specification is underspecified, recognize non-functional requirements that were implicit, and evaluate whether their own output is correct in the ways that matter. A GitHub study on Copilot showed substantial productivity gains on self-contained, well-defined tasks. The conditions that produced those results — isolated task, clear spec, correctness verified externally — are not the conditions of most production software work.
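A hypothetical illustration of what "underspecified" means in practice: both functions below satisfy the literal spec "remove duplicates from a list," but only one preserves input order, a non-functional requirement that is almost always implicit. (These function names and the example data are invented for this sketch, not taken from any real codebase.)

```python
# Two implementations of the spec "remove duplicates from a list".
# Both meet the literal wording; they differ on an unstated property.

def dedupe_fast(items):
    # Satisfies "remove duplicates", but set() discards input order.
    return list(set(items))

def dedupe_stable(items):
    # dict preserves insertion order (Python 3.7+), so order survives.
    return list(dict.fromkeys(items))

events = ["login", "click", "login", "purchase", "click"]
print(dedupe_stable(events))  # ['login', 'click', 'purchase']
# dedupe_fast(events) returns the same elements in arbitrary order --
# correct against the spec as written, wrong against the spec as meant.
```

A tool that generates either version has done its job; noticing that order matters is the part that required knowing what you wanted.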

Specifying a system well enough for AI to implement it correctly requires understanding the domain, the failure modes, the performance requirements, the security constraints, and the maintenance implications. None of that knowledge is supplied by the tool. All of it has to come from somewhere.

The Specification Problem in Practice

I build Discord bots and dabble in systems programming. AI assistance does not deliver the same leverage in both domains.

For Discord bot features — event handlers, command parsing, state management in JSON files, scheduled processors — AI tools give me significant leverage. The tasks are tractable. The specifications are expressible. When the output is wrong, the failure is usually visible quickly. I still read the code before merging anything non-trivial, because subtle logic errors in async event handling fail in ways that only surface under specific conditions, but the iteration loop is short.
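A minimal sketch of the kind of subtle async bug I mean, written in plain asyncio rather than real discord.py code: two handlers read shared state, await something (say, a JSON save), then write back, and one update is silently lost. The bug only surfaces when two events land close together.

```python
import asyncio

# Shared state of the sort a bot might persist to a JSON file.
counts = {"greetings": 0}

async def on_message_buggy(save_delay=0.01):
    current = counts["greetings"]     # read shared state
    await asyncio.sleep(save_delay)   # stand-in for an awaited save
    counts["greetings"] = current + 1 # write-back clobbers concurrent updates

async def main():
    # Two events arrive nearly simultaneously: both read 0, both write 1.
    await asyncio.gather(on_message_buggy(), on_message_buggy())
    print(counts["greetings"])  # prints 1, not the expected 2

asyncio.run(main())
```

Run sequentially, the handler is correct; run concurrently, it drops updates. AI-generated handlers pass casual testing with this bug intact, which is exactly why I still read the code.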

Systems programming work is different. Writing code that touches memory layout, synchronization primitives, or kernel interfaces requires reasoning about properties that are genuinely difficult to specify in natural language. “Don’t use more than N bytes of stack in this path” is not a requirement you can hand to an AI and verify easily. The Redox OS project’s policy against LLM-generated contributions is extreme, but the reasoning is sound: in domains where subtle correctness failures have severe consequences, the ability to evaluate what the AI produced requires the same depth of understanding that would allow you to write it correctly yourself.

This is the structural ceiling on vibe coding. It works in domains where specifications are tractable and failures are visible. It carries much higher costs in domains where neither holds.

What Programming Becomes

Willison has spent several years building tools on top of language models — his Datasette project and associated LLM tooling are serious attempts to understand what these capabilities actually enable in practice. His perspective is pragmatic rather than eschatological, and that pragmatism is useful here.

The claim that this is “the end of programming as we know it” is accurate in a limited sense. The version of programming that consisted primarily of translating well-understood specifications into working code is under real pressure. If that was your primary value, the tools are compressing the market for it.

The version of programming that consists of understanding complex systems deeply enough to specify what they should do, diagnosing failure modes, making architectural decisions with long-term consequences, and evaluating whether implementations match intent — that version is not ending. The demand for it is arguably increasing, because the tools make the gap between a clear specification and working code narrower, which means the specification itself becomes the scarce, high-value input.

Every prior wave of programming automation produced the same outcome: it made a specific category of implementation cheaper, and it raised the floor on what the remaining programming work required. 4GLs did not eliminate developers; they eliminated the need for developers to do certain things, and developers moved up. The same pattern has a reasonable chance of repeating.

The uncomfortable corollary is that developers who have primarily been doing the things that are getting automated — and have not been building the judgment and domain knowledge that sits above implementation — are the ones who will feel the most pressure. The tools are not neutral in who they help. They compress the gap between an experienced developer with domain knowledge and a less experienced one more than they compress the gap between a developer with real judgment and someone without it. That judgment gap may be getting wider, not smaller.
