There is a useful distinction between writing code and specifying behavior, and Hillel Wayne’s recent piece in his newsletter makes the case that LLMs handle these two tasks very differently.
Vibe coding works well enough for implementation. You describe your intent loosely, the model produces something plausible, you iterate. The feedback loop is fast and the output is usually close enough to be useful. The model has seen enormous amounts of code, and most code follows familiar patterns.
Specifications are a different animal. A formal spec, whether written in TLA+, Alloy, Z notation, or even just a rigorous English requirements document, has to be precise and unambiguous. Every edge case matters. A specification that is mostly correct is not correct. This is the domain where Hillel spends most of his time, and according to his article, it is where LLMs fall apart.
The failure mode is not that the model refuses or produces garbage. It is subtler. The model produces something that looks like a specification, has the right shape, uses the right terminology, and is wrong in ways that are easy to miss. Specifications require careful reasoning about state, invariants, and behavior under all possible conditions. LLMs are pattern matchers trained on text; they are good at producing text that resembles correct specifications, which is not the same thing.
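To make that concrete, here is a minimal sketch of the kind of reasoning a spec demands: exhaustively checking an invariant over every reachable state of a small system. The protocol, state names, and bug below are my own illustration, not from Hillel's article. Each process raises its flag only *after* checking the other's, an error that looks plausible in prose but breaks mutual exclusion.

```python
# Toy exhaustive checker for a two-process mutual-exclusion protocol.
# The bug: each process raises its flag *after* checking the other's,
# so both can slip past the check at once. Everything here is
# illustrative, not taken from the article.

def step(state):
    """Yield every successor of state = ((flag0, flag1), (loc0, loc1))."""
    flags, locs = state
    for i in (0, 1):
        f, l = list(flags), list(locs)
        if locs[i] == "idle":
            l[i] = "trying"
        elif locs[i] == "trying":
            if flags[1 - i]:          # other flag up: keep waiting
                continue
            l[i] = "entering"         # saw the flag down; commit to entering
        elif locs[i] == "entering":
            f[i] = True               # BUG: flag raised after the check
            l[i] = "critical"
        else:                          # critical -> release and go idle
            f[i] = False
            l[i] = "idle"
        yield (tuple(f), tuple(l))

def check(init, invariant):
    """Explore all reachable states; return a violating state or None."""
    seen, frontier = {init}, [init]
    while frontier:
        s = frontier.pop()
        if not invariant(s):
            return s
        for t in step(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return None

mutual_exclusion = lambda s: s[1] != ("critical", "critical")
init = ((False, False), ("idle", "idle"))
violation = check(init, mutual_exclusion)  # a state with both in "critical"
```

A reviewer skimming the transition rules would likely nod along; only exhaustive checking (or the careful reasoning it mechanizes) surfaces the interleaving where both processes enter. That gap between "reads correctly" and "is correct" is exactly the one at issue.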
I ran into a version of this building a Discord bot scheduler. I asked an LLM to help me reason through a state machine for job scheduling: what happens when a job is cancelled mid-run, when two triggers fire simultaneously, when a retry limit is reached. The code it produced was fine. The reasoning it offered about state transitions was subtly inconsistent. When I asked it to formalize the constraints, it kept softening them into prose that sounded correct but would not have caught a class of bugs I cared about.
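What I wanted from the model was something like the following: an explicit transition table where illegal events fail loudly instead of being absorbed. The state names, events, and retry limit here are mine, reconstructed for illustration; this is the shape of formalization I was after, not the bot's actual code.

```python
from enum import Enum, auto

class JobState(Enum):
    PENDING = auto()
    RUNNING = auto()
    CANCELLED = auto()
    FAILED = auto()
    DONE = auto()

# Every legal transition written down explicitly. Anything not listed
# (a second simultaneous trigger, cancelling a finished job) is an
# illegal event, not a silent no-op. Names are illustrative.
TRANSITIONS = {
    (JobState.PENDING, "trigger"): JobState.RUNNING,
    (JobState.PENDING, "cancel"):  JobState.CANCELLED,
    (JobState.RUNNING, "cancel"):  JobState.CANCELLED,  # cancel mid-run
    (JobState.RUNNING, "succeed"): JobState.DONE,
    (JobState.RUNNING, "fail"):    JobState.FAILED,
    (JobState.FAILED,  "retry"):   JobState.PENDING,
}

MAX_RETRIES = 3  # illustrative limit

class Job:
    def __init__(self):
        self.state = JobState.PENDING
        self.retries = 0

    def apply(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal event {event!r} in state {self.state}")
        if event == "retry":
            if self.retries >= MAX_RETRIES:
                raise ValueError("retry limit reached")
            self.retries += 1
        self.state = TRANSITIONS[key]
```

The point of the table is that my edge cases become forced decisions: two triggers firing simultaneously means a `"trigger"` event arriving in `RUNNING`, which raises instead of quietly double-starting the job. The model's prose versions kept leaving exactly those cells of the table unfilled.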
The interesting question is why this gap exists. My working theory is that implementation code is forgiving. A bug in implementation code usually produces a visible failure at some point. A bug in a specification might never surface as a test failure because the tests were derived from the same flawed understanding. The error is load-bearing in a way that makes it invisible.
LLMs also seem to struggle with the discipline of leaving things unspecified. A good formal spec says exactly what must be true and nothing more; it does not fill in implementation details to seem complete. LLMs, trained to produce fluent and complete-seeming output, tend to over-specify or blur the line between what is required and what is one possible approach.
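The distinction is easy to show with a toy requirement. Suppose the spec for a dedupe operation is only "each input element appears exactly once in the output." The function names and example below are mine; the second predicate illustrates over-specification by also pinning first-occurrence order, which is one possible implementation rather than a requirement.

```python
def is_valid_dedupe(inp, out):
    """What the requirement actually demands: same elements,
    each appearing exactly once. Order is deliberately unspecified."""
    return set(out) == set(inp) and len(out) == len(set(out))

def overspecified_dedupe(inp, out):
    """Over-specified: additionally pins first-occurrence order,
    an implementation choice smuggled in as a requirement."""
    seen, expected = set(), []
    for x in inp:
        if x not in seen:
            seen.add(x)
            expected.append(x)
    return out == expected

inp = [3, 1, 3, 2]
sorted_out = [1, 2, 3]
loose_ok = is_valid_dedupe(inp, sorted_out)        # True: order is free
strict_ok = overspecified_dedupe(inp, sorted_out)  # False: order pinned
```

The over-specified predicate rejects a perfectly conforming implementation. In my experience the LLM's instinct runs toward the second predicate: it reads as more complete, which is precisely the fluency bias working against rigor.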
None of this means LLMs are useless for specification work. They can help with boilerplate, explain notation, and find obvious gaps when prompted carefully. But the vibing approach, where you trust the model to fill in the hard parts, does not transfer. Specifications are a domain where sloppiness costs you later, and LLMs optimized for fluency do not have a strong prior toward rigor.
The broader implication is worth sitting with. As more developers reach for LLMs as their primary thinking tool, the skills associated with precise specification may atrophy. That would be a bad trade.