
The What/How Loop Was Building Something Besides Software

Source: martinfowler

Back in January, Martin Fowler published a conversation with Unmesh Joshi and Rebecca Parsons about how LLMs reshape the abstractions in software. The central idea is what they call the what/how loop: the iterative process of defining intent at one level of abstraction and filling it in with implementation at the next level down. The conversation frames this as a mechanism for managing cognitive load, and argues that LLMs are changing how we navigate it.

That framing is correct as far as it goes. But it focuses on what the loop produces, working software with managed complexity, rather than what the loop does to the person running it. Every traversal of the what/how boundary, every time a developer dives from intent into implementation and back, is also a training event. The developer who wrote the pagination function by hand, hit the N+1 query failure, traced it back through the ORM, and fixed it carries something afterward: the capacity to evaluate that class of code on sight. An LLM traversing the same boundary carries nothing. It generates; it does not learn.

This distinction matters more as the loop accelerates.

What the Loop Was Training

Unmesh Joshi wrote a companion piece to the Fowler conversation, The Learning Loop and LLMs, published in November 2025. His concern is precisely this: that LLMs become counterproductive when used to shortcut the learning loop, the iterative cycle through which developers build genuine understanding of the how by working through it, hitting failures, and tracing reasons.

The mechanism is not subtle. Software at every layer contains decisions that are not obvious from the interface. A database query that returns correct results under low load can produce N+1 failures under realistic load, not because the query was wrong but because the ORM traverses a relationship per result row. A developer who has traced that failure knows to look for it. One who learned SQL entirely from LLM output, whose mental model was formed from generated code rather than debugged code, has no reason to look.

# This looks correct. It is correct for small datasets.
# (Flask-SQLAlchemy-style models assumed; joinedload is from SQLAlchemy's ORM.)
from sqlalchemy.orm import joinedload

def get_user_orders(user_ids):
    users = User.query.filter(User.id.in_(user_ids)).all()
    # If 'orders' is a lazy relationship, this is N+1:
    return [{'user': u.name, 'orders': [o.total for o in u.orders]} for u in users]

# The fix requires knowing that the failure exists:
def get_user_orders(user_ids):
    users = User.query.filter(User.id.in_(user_ids)).options(
        joinedload(User.orders)
    ).all()
    return [{'user': u.name, 'orders': [o.total for o in u.orders]} for u in users]

An LLM generating the first version against a prompt asking for a function that returns users with their orders is not making an error in any obvious sense. The code is syntactically correct, the logic is coherent, and it will pass unit tests that do not instrument the database. The failure is latent. Detecting it requires knowing that lazy loading exists, understanding what it does, and recognizing that the result set size makes it a problem here. That knowledge comes from having been burned by it, not from reading documentation about it.
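One way to make the latent failure visible is to count queries in a test, the way an ORM event hook would. The sketch below simulates this with a stand-in store rather than a real ORM; the class and method names are illustrative, not any library's API:

```python
# A minimal sketch of query counting, the instrumentation that exposes N+1.
# 'Store' stands in for a database session; every fetch increments a counter
# the way a test hook on SQL statement execution would.

class Store:
    def __init__(self, orders_by_user):
        self.orders_by_user = orders_by_user
        self.queries = 0

    def fetch_users(self, user_ids):
        self.queries += 1                      # SELECT ... FROM users
        return list(user_ids)

    def fetch_orders(self, user_id):
        self.queries += 1                      # one SELECT per user: the N+1
        return self.orders_by_user[user_id]

    def fetch_orders_batch(self, user_ids):
        self.queries += 1                      # single joined/IN query
        return {u: self.orders_by_user[u] for u in user_ids}

data = {i: [i * 10] for i in range(1, 11)}

# Lazy traversal: 1 query for the users, plus 1 per user's orders.
store = Store(data)
users = store.fetch_users(data)
_ = {u: store.fetch_orders(u) for u in users}
lazy_queries = store.queries                   # 11 for 10 users

# Eager traversal, as in the joinedload fix above: one batched query.
store = Store(data)
users = store.fetch_users(data)
_ = store.fetch_orders_batch(users)
eager_queries = store.queries                  # 2, regardless of result size

assert lazy_queries == 11 and eager_queries == 2
```

A unit test that asserts only on the returned data passes either way; one that asserts on the query count fails the lazy version the moment the result set grows, which is the instrumentation the paragraph above says most test suites lack.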

Why Speed Changes the Calculus

The important thing about the what/how loop, as Fowler, Parsons, and Joshi describe it, is that it is iterative. You write a what, something generates a how, the how surfaces implicit decisions you did not specify, you refine the what, you regenerate. With hand-written code, this cycle ran at the speed of typing, debugging, and rethinking, which meant each iteration had meaningful duration. The duration was not waste. It was the interval in which a developer’s mental model could update.

With LLMs, the same cycle runs in seconds. You can traverse the what/how boundary fifty times in an afternoon without building any understanding of what lies on the other side. Each iteration produces working code. The aggregate produces no depth.

This is not a criticism of LLMs specifically. It is a structural property of any translator that operates faster than the learner can absorb. The same concern accompanied every prior what/how automation wave. When 4GLs made generating database queries cheap in the 1980s, the bottleneck shifted to specifying what the query should produce, and analysts found that precise specification required the same kinds of thinking as implementation. When CASE tools in the late 1980s tried to generate code from visual diagrams, they failed not because the generators were bad but because analysts lacked the conceptual vocabulary to specify systems precisely enough. That vocabulary came from implementation experience.

The pattern is consistent: automation tools that bypass a learning loop produce people who cannot evaluate the automation’s output. Joel Spolsky’s Law of Leaky Abstractions made this precise in 2002: all non-trivial abstractions leak, and when they leak, you must understand what they abstract to fix them. Using the abstraction without understanding the layer below is fine until it fails. At that point, the developer who never crossed the boundary cannot diagnose the failure.

LLMs leak. Every probabilistic translator leaks. The question is whether the developers using them have built enough understanding of the how to diagnose failures when they appear.

Evaluating What You Did Not Write

Rebecca Parsons’ contribution to the Fowler conversation is a formal framing that sharpens this problem. The what/how separation maps to the distinction between a language’s denotational semantics, what a program means, and its operational semantics, how those semantics execute. Denotational and operational semantics are formal frameworks precisely because natural-language intent is insufficient for verifying correctness: the same English description admits multiple implementations, and not all of them satisfy the same invariants.

This is why evaluating LLM-generated code is harder than it looks. Three implementations of the same prompt can differ in subtle, consequential ways:

// Prompt: "return the top 5 users by activity score"

// A: sorts ascending, so slice(0, 5) returns the five lowest scorers
const top5 = users.sort((a, b) => a.activityScore - b.activityScore).slice(0, 5);

// B: correct direction, but filters out admins (unstated assumption)
const top5 = allUsers
  .filter(u => u.role !== 'admin')
  .sort((a, b) => b.activityScore - a.activityScore)
  .slice(0, 5);

// C: correct
const top5 = users.sort((a, b) => b.activityScore - a.activityScore).slice(0, 5);

Evaluating these requires understanding the domain well enough to know that admin filtering is not implied, and understanding the sort direction well enough to catch A’s inversion. Neither is derivable from the prompt alone. Both require what the loop would have built: the capacity to read code critically rather than just generate it.
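Some of that evaluation can itself be written down. A Python restatement of the checks (the data shape and field names mirror the JavaScript example above and are assumptions) shows what a test would need to assert to catch both A's inversion and B's unstated filter:

```python
def top5(users):
    # Reference behavior (variant C): highest activityScore first.
    return sorted(users, key=lambda u: u["activityScore"], reverse=True)[:5]

users = [
    {"name": f"u{i}", "activityScore": i * 10,
     "role": "admin" if i == 7 else "member"}
    for i in range(1, 9)
]
result = top5(users)
scores = [u["activityScore"] for u in result]

# Catches variant A: scores must be non-increasing, starting at the maximum.
assert scores == sorted(scores, reverse=True)
assert scores[0] == max(u["activityScore"] for u in users)

# Catches variant B: the admin scores 70 and belongs in the top 5;
# silently filtering by role would drop them.
assert any(u["role"] == "admin" for u in result)
```

Note that writing the second assertion requires already knowing that role filtering is a plausible unstated assumption, which is the point: the test encodes domain judgment, it does not replace it.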

This is not a new observation, but it has a new weight. With deterministic translators (compilers, query planners, garbage collectors), the evaluation burden was light: passing tests plus compilation equaled trustworthy output within defined semantics. The how was guaranteed by the translator’s correctness. With probabilistic translators, the same guarantee does not hold. Fluent-looking code with plausible structure can be wrong in ways that tests do not catch and that code review will miss unless the reviewer brings evaluative depth.

The Redox OS project bans LLM-generated contributions for exactly this reason. In domains where subtle correctness failures have severe consequences, using a probabilistic translator requires the same depth of understanding that would allow you to write the code correctly yourself. The abstraction evaluation problem is most acute where failures are least visible.

What to Do About It

The practical implication is not that developers should avoid LLMs. It is that the value of LLMs depends significantly on the depth of understanding the developer brings to evaluating their output, and that depth requires maintaining the learning loop even when it is no longer strictly necessary for producing software.

Some of this can be structural. Architecture fitness functions, executable tests that verify structural properties rather than behavioral ones, catch layer boundary violations automatically without requiring a reviewer who knows the codebase history. Strong type systems make the what machine-checkable, so evaluation does not depend entirely on developer judgment. Swift Package Manager targets that correspond to architectural layers will reject generated code that crosses declared dependency boundaries before any human reviews it.
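A fitness function of this kind can be small. The sketch below checks a declared layering (the layer names, module sources, and allowed-dependency map are illustrative assumptions) by parsing imports with Python's ast module, so it rejects a violation regardless of who, or what, wrote the code:

```python
# A sketch of an architecture fitness function: a structural test that
# flags imports crossing declared layer boundaries.
import ast

# Declared dependency direction: ui -> domain -> storage, never the reverse.
ALLOWED = {"ui": {"domain"}, "domain": {"storage"}, "storage": set()}

# Stand-ins for real module sources; a real check would read files from disk.
MODULES = {
    "ui": "import domain\n",
    "domain": "import storage\n",
    "storage": "import domain\n",   # violation: storage must not reach upward
}

def violations(modules, allowed):
    found = []
    for name, source in modules.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    target = alias.name.split(".")[0]
                    if target in allowed and target not in allowed[name]:
                        found.append((name, target))
    return found

print(violations(MODULES, ALLOWED))  # [('storage', 'domain')]
```

Run as a test, this fails the build on the upward import before any human reviews the generated code, which is what makes the check structural rather than behavioral.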

But none of these tools replaces evaluative capacity built from traversing the loop manually. They catch the failures that can be mechanically specified in advance. Latent performance problems, semantic mismatches with unstated domain assumptions, plausible-but-wrong edge case handling: these require a reviewer who has been on the other side of the same failures.

The Fowler conversation treats the what/how loop as a mechanism for managing cognitive load. That is accurate. It is also a mechanism for building the cognitive capacity that makes the loop navigable at all. LLMs can generate the how indefinitely without building anything. The developer who uses them without maintaining their own traversal of the boundary ends up dependent on a translator they cannot evaluate, which is precisely the failure mode that every prior automation wave produced when it moved faster than the learning it was supposed to replace.

The bottleneck in software development shifts upward as automation improves. That has been true since Fortran. What also shifts upward is the required depth of understanding at the new boundary. The what/how loop does not disappear when an LLM crosses it for you. It relocates, and when it leaks, you still need to know what is underneath.
