
The What/How Loop Is a Training Loop, and LLMs Changed the Training

Source: martinfowler

A few weeks back I finished a Telegram transport integration that had been sitting on my backlog. I used Claude to generate most of the connection handling, the message queue logic, the retry behavior. It passed tests, worked in production within an hour of starting, and the commit landed clean.

Three days later, I was looking at intermittent dropped messages under high load and genuinely did not know where to start. I had not written the queue logic; I had not traced through the retry state machine. I knew what the code was supposed to do, and I had no map of what it actually did under pressure.

This experience has a theoretical frame now. In their January 2026 conversation, Martin Fowler, Rebecca Parsons, and Unmesh Joshi discuss how LLMs fit into the long history of software abstraction: specifically the recurring cycle of specifying intent (the what) and generating implementation (the how). The observation that sticks predates LLMs; it concerns what the what/how loop has been doing all along besides producing working code.

The loop has been running since Fortran

Every major shift in programming has been a renegotiation of the what/how boundary. Fortran (1957) automated the translation from algebraic expressions to machine code. SQL automated query planning: you specify what data you want, the query planner decides join order, index selection, and scan strategy. Garbage collectors automated memory reclamation. React’s reconciler automated DOM updates from declarative UI descriptions.

Joel Spolsky’s Law of Leaky Abstractions (2002) is the canonical statement of what happens next: every non-trivial abstraction eventually exposes the layer beneath. The query plan regresses. The garbage collector pauses at inopportune moments. React’s reconciliation hits an edge case. When that happens, you need how-knowledge, and if you have it you fix the problem; if you do not, you are stuck.

The Fowler conversation places LLMs in this lineage. Rebecca Parsons brings a formal semantics background to the discussion, and the framing clarifies something important. Compilers and query planners provide deterministic translation. The compiler that translates your C to machine code gives you a formal guarantee that the semantics of your program are preserved. The SQL planner may choose any execution strategy, but it is contractually bound to return the correct result. The translation is certain at the semantic level.

LLMs do not provide that guarantee. A prompt maps to a distribution over implementations, and different runs of the same prompt can produce different outputs with varying degrees of correctness. That is a description of the technology. The consequence is that the evaluation burden on the developer is heavier than it is with deterministic translation, and the skills that enable that evaluation are different from those that mattered with prior automation waves.

What the loop was also doing

Here is the part the Fowler article surfaces that I find most significant: every traversal of the what/how boundary has historically been a training event for the developer who made it.

You write a query, it returns wrong results, you trace the query plan, you learn something about how the database handles your data distribution. You write a function that fails under concurrent load, trace the thread interleaving, and learn something about lock granularity. Each failure event produces how-knowledge. That knowledge improves your what-specification the next time: you know which edge cases to name explicitly, which assumptions to state, which constraints to surface.

Unmesh Joshi’s companion piece, The Learning Loop and LLMs, makes this argument directly. The iterative cycle through which developers build genuine how-knowledge is precisely what enables them to evaluate probabilistic output. The developer who has traced an N+1 query failure knows to look for it. A developer who has never encountered that failure has no reason to look for it, and an LLM will generate code exhibiting it without any signal.

# Looks correct. Passes unit tests that don't instrument the database.
def get_user_orders(user_ids):
    users = User.query.filter(User.id.in_(user_ids)).all()
    return [{'user': u.name, 'orders': [o.total for o in u.orders]} for u in users]

# The fix requires knowing the failure mode exists: eager-load the
# relationship so the orders arrive with the users in one query.
from sqlalchemy.orm import joinedload

def get_user_orders(user_ids):
    users = User.query.filter(User.id.in_(user_ids)).options(
        joinedload(User.orders)
    ).all()
    return [{'user': u.name, 'orders': [o.total for o in u.orders]} for u in users]

The test that verifies behavior against a small fixture will not surface the N+1 pattern. A developer who has never hit this in production, because prior implementations were LLM-generated, has no model for why the first version might degrade at scale.
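One way to make the pattern visible is to count queries rather than check results. A minimal sketch using stdlib sqlite3 as a stand-in for the ORM (the tables and data are invented for the demo): the naive shape issues one query per user, the JOIN shape issues one total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace'), (3, 'barbara');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

query_count = 0
def run(sql, params=()):
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# N+1 shape: one query for the users, then one per user for their orders.
users = run("SELECT id, name FROM users")
for user_id, _name in users:
    run("SELECT total FROM orders WHERE user_id = ?", (user_id,))
n_plus_one = query_count  # 1 + len(users) = 4 queries

# JOIN shape: everything in a single round trip.
query_count = 0
run("""SELECT u.name, o.total FROM users u
       LEFT JOIN orders o ON o.user_id = u.id""")
joined = query_count  # 1 query

print(n_plus_one, joined)
```

A fixture of three users makes both versions pass a behavioral test; only the query count reveals that one of them scales linearly with row count.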

The same dynamic appears in bot development at a different layer. Command rate limiting is a common feature: an LLM generates sliding window logic that works correctly under light load. Under sustained burst traffic, the window state management can develop race conditions. A developer who generated the logic without reading it starts debugging from ‘the rate limiter is broken.’ A developer who wrote a naive token bucket first, hit exactly this problem, and upgraded to a sliding window implementation has a meaningfully different starting point for that debugging session.
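For illustration, here is the kind of naive token bucket that developer might have written by hand first (class name and parameters are invented, not from the article). The explicit lock marks exactly the shared state, the token count and last-refill timestamp, that a generated sliding-window version can update without coordination under burst traffic.

```python
import threading
import time

class TokenBucket:
    """Naive token bucket: `capacity` tokens, refilled at `rate` per second.
    The lock guards the shared (tokens, updated) pair -- the same state a
    race-prone sliding-window implementation mutates without coordination."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = TokenBucket(capacity=3, rate=1.0)
burst = [bucket.allow() for _ in range(5)]
print(burst)  # first 3 requests pass, the rest are throttled
```

Having written and debugged something like this, the developer knows which two fields the lock exists to protect, which is precisely the map that reading-free generation skips.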

Speed changes the calculus

With LLMs, the what/how loop runs in seconds. Before, a day of implementation work might produce three or four meaningful traversals of the boundary: write some code, hit a failure, trace it, understand it, revise. With an LLM, you can traverse the boundary fifty times in an afternoon without building any understanding of what lies on the other side. Each traversal produces working code, yet the aggregate can leave you with no depth in the how.

My Telegram integration took about an hour of real work. Before LLMs, the same integration would have taken a day or two, and I would have read through the connection handling, made mistakes in the queue management, debugged the retry logic, and accumulated knowledge I did not accumulate this time. The speed gain is real, the knowledge gap is equally real, and the knowledge gap is invisible in the output.

Fred Brooks drew the relevant distinction in No Silver Bullet (1986): accidental complexity is friction from tools and representations; essential complexity is the inherent difficulty of the problem. LLMs absorb a substantial portion of accidental complexity: boilerplate, library syntax, implementation patterns. They leave essential complexity untouched, and essential complexity is where the hard debugging sessions live.

What you can do about it

The remediation is to structure the loop so traversals build understanding rather than skip it.

Read the generated code before committing it, even briefly. The goal is a mental map of where the how-knowledge lives so you can find your way back when something breaks. This is slower than accepting output without reading but much faster than debugging without navigational memory.

Use types to encode domain semantics explicitly. TypeScript’s type system makes the what machine-verifiable. When an LLM generates code that conflates two different meanings of ‘active user’ or ‘pending message,’ a well-typed codebase surfaces the conflict statically. Types are the cheapest form of structural specification available, and they encode knowledge that would otherwise live only in the heads of the developers who wrote the original code.
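The same idea carries to Python's type system. A sketch using `NewType` (the identifiers and the two meanings of 'active user' are invented for the demo): at runtime both are plain ints, which is exactly why the distinction has to live in the types for a checker like mypy or pyright to enforce.

```python
from typing import NewType

# Two meanings of "active user" that generated code can silently conflate:
# active in the current session vs. holding an active subscription.
SessionActiveId = NewType("SessionActiveId", int)
SubscriberActiveId = NewType("SubscriberActiveId", int)

def send_renewal_notice(user_id: SubscriberActiveId) -> str:
    # The signature states which kind of "active" this function means.
    return f"renewal notice -> user {user_id}"

session_user = SessionActiveId(42)
subscriber = SubscriberActiveId(7)

# A static checker flags this conflation before it ships:
# send_renewal_notice(session_user)   # error: incompatible type
print(send_renewal_notice(subscriber))
```

The wrapper types cost nothing at runtime; they exist purely to make the domain distinction machine-verifiable.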

Write the test before you accept the generated code. Kent Beck’s original framing treated tests as executable specifications. Writing the test first forces you to state the what precisely before evaluating whether the generated how satisfies it. If you cannot write the test, the specification is underspecified, and the LLM is filling in intent from its training distribution rather than from your actual requirements.
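Concretely, the test-first shape looks like this. The function name and backoff schedule are illustrative, not from the article; the point is that the specification, including the cap edge case, exists before any generated implementation is evaluated against it.

```python
# Specification first: the test states the what, including edge cases,
# before any generated how is accepted.
def test_backoff_schedule():
    assert backoff_seconds(0) == 1    # attempts are zero-indexed
    assert backoff_seconds(1) == 2    # exponential growth
    assert backoff_seconds(4) == 16
    assert backoff_seconds(10) == 30  # capped at 30s, not 1024

# Only then is a generated implementation judged against the spec:
def backoff_seconds(attempt: int) -> int:
    return min(2 ** attempt, 30)

test_backoff_schedule()
print("spec satisfied")
```

If you cannot state the cap or the indexing convention in the test, the LLM will choose them for you from its training distribution.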

For structural properties, architecture fitness functions from Ford, Parsons, and Kua’s Building Evolutionary Architectures (O’Reilly, 2017) provide executable structural what-specification. They verify module boundaries, dependency directions, and interface constraints automatically, giving generated code a structural test beyond behavioral tests.
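A minimal sketch in that spirit, using only the stdlib `ast` module (the module names and layering rule are invented for the demo): parse a module's source and assert a dependency direction, here that a domain-layer module never imports from the transport layer.

```python
import ast

# Stand-in source for a domain-layer module; in practice you would read
# real files from the repository.
domain_source = """
from app.models import Order
import json
"""

FORBIDDEN_PREFIX = "app.transport"

def imported_modules(source: str) -> set[str]:
    """Collect every module name imported by the given source."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)
    return mods

violations = {m for m in imported_modules(domain_source)
              if m.startswith(FORBIDDEN_PREFIX)}
assert not violations, f"domain layer depends on transport: {violations}"
print("dependency direction holds")
```

Run in CI, a check like this fails the build the moment generated code wires a dependency in the wrong direction, regardless of whether the behavioral tests pass.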

The bottleneck relocates rather than disappears

The Fowler conversation frames this clearly: LLMs extend the seventy-year abstraction staircase. Natural language specification joins Fortran, SQL, and garbage collection as automation waves that moved the what/how boundary upward. Human responsibility migrates to a higher level of abstraction each time.

What changes is the precision required at the new level. SQL freed developers from writing query plans but required them to understand relational data modeling well enough to state queries that return the right data. LLMs free developers from writing boilerplate but require them to understand their domain and its edge cases well enough to state specifications that produce correct behavior, and to recognize when the generated how fails to match the intended what.

The how-knowledge required for that recognition comes from traversals of the boundary under conditions of failure. Speed makes those traversals optional in a way prior automation never did. The compiler always made you wait; the N+1 failure always surfaced eventually in production. LLMs generate code that runs, and the traversal that builds no understanding is the one where the output worked on the first try and you accepted it without reading it.

The loop still needs to run, and LLMs made it optional; that is the trap.
