Hillel Wayne has a piece up arguing that LLMs are bad at vibing specifications, and it cuts at something I’ve been turning over for a while.
The vibe coding framing, popularized by Andrej Karpathy, is roughly: prompt an LLM casually, accept the output without reading it too carefully, iterate until it works. For throwaway scripts and UI scaffolding, this is often fine. The feedback loop is fast, correctness is observable, and if something breaks you notice immediately.
Specifications are a different category of problem entirely.
A specification is a precise, formal description of what a system must do, not just what it appears to do when you run it a few times. This includes things like TLA+ models, formal invariants, API contracts, state machine descriptions, and security properties. The whole point of writing a spec is to capture behavior that is difficult or impossible to observe directly from running the code.
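To make that concrete in ordinary Python rather than TLA+: the value of a spec is that an invariant is checked over *every* reachable state of a model, not just the handful of runs you happen to observe. The toy mutex model and all of the names below are invented for illustration; this is a sketch of the idea, not a real specification language.

```python
# A toy state-machine "spec" checked by exhaustive search, loosely in the
# spirit of a TLA+ invariant. The model (a two-process lock) is illustrative.
from itertools import product

# State: which process holds the lock (None, 0, or 1).
def step(state, proc, action):
    if action == "acquire" and state is None:
        return proc
    if action == "release" and state == proc:
        return None
    return state  # action not enabled; state unchanged

def invariant(state):
    # Mutual exclusion: at most one holder. Trivially true for this model;
    # the point is that the check ranges over every reachable state.
    return state in (None, 0, 1)

# Breadth-first exploration of the full reachable state space.
seen, frontier = {None}, [None]
while frontier:
    s = frontier.pop()
    assert invariant(s), f"invariant violated in state {s!r}"
    for proc, action in product((0, 1), ("acquire", "release")):
        t = step(s, proc, action)
        if t not in seen:
            seen.add(t)
            frontier.append(t)

print(sorted(seen, key=repr))  # → [0, 1, None]: all reachable states
```

Running the code a few times samples some paths; the exhaustive check is what the running code can't give you, and it's exactly the part you can't eyeball from plausible-looking output.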
When you vibe a spec, you lose the thing that makes it valuable.
LLMs are good at producing text that looks like a specification. They’ve seen enough TLA+, enough Alloy, enough formal contracts that they can generate syntactically plausible output. But generating something that looks correct and generating something that is correct are very different. With code, the runtime tells you when you’re wrong. With a specification, there is no runtime to catch the error. You’re the runtime. That requires a kind of careful, deliberate reasoning that informal iteration doesn’t support.

There’s also a subtler issue. Specifications are where you figure out what you actually want, not just describe what you think you want. Writing a formal model forces you to resolve ambiguities, confront edge cases, and make tradeoffs explicit. If you outsource that process to an LLM through loose prompts, you don’t get precision back; you just get the ambiguity laundered through plausible-looking formalism.
This maps onto something I’ve noticed when using LLMs for anything that requires deep semantic correctness rather than syntactic fluency. SQL queries are a good example. An LLM will produce a query that looks right and runs without errors, but silently computes the wrong thing in edge cases that the casual reader wouldn’t notice. The further you get from “does this execute” toward “does this mean what I think it means,” the less reliable the output becomes.
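A classic instance of this failure mode, sketched here with `sqlite3` and an invented customers/orders schema: a `WHERE` clause on the right-hand table of a `LEFT JOIN` silently turns it into an inner join. Both queries execute cleanly; only one means what the prompt presumably intended.

```python
# "Count each customer's shipped orders, including customers with none."
# The schema and data are hypothetical, chosen to expose the bug.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         status TEXT);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (10, 1, 'shipped');
""")

# Looks right and runs without errors -- but the WHERE filter discards the
# NULL rows the LEFT JOIN produced, so 'grace' (no orders) vanishes.
wrong = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# The intended meaning: the status filter belongs in the join condition,
# so unmatched customers survive with a count of zero.
right = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customers c LEFT JOIN orders o
      ON o.customer_id = c.id AND o.status = 'shipped'
    GROUP BY c.name ORDER BY c.name
""").fetchall()

print(wrong)  # → [('ada', 1)]               -- 'grace' silently dropped
print(right)  # → [('ada', 1), ('grace', 0)]
```

Nothing in the runtime flags the first query; the error is visible only to a reader who already knows precisely what the query was supposed to mean.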
None of this is a knock on using LLMs in specification work. They’re genuinely useful for explaining formal notation, translating between equivalent representations, or helping you get started with an unfamiliar formalism. The problem is specifically the vibing approach: treating the spec like a code artifact you can iterate toward informally.
Wayne’s point is essentially that precision requires deliberate effort, and that effort can’t be offloaded through casual prompting. That seems right. The interesting follow-up question is whether better prompting discipline, formal verification pipelines, or future model improvements can close the gap, or whether this is a structural limitation of how these models reason about formal semantics. My guess is it’s partly both, but the structural part is larger than the current discourse acknowledges.