
Specs as a First-Class LLM Interface: What CodeSpeak Gets Right

Source: hackernews

Andrey Breslav spent years designing Kotlin, a language built on the premise that type safety and expressive syntax should not be in tension. The null safety system, the extension functions, the sealed classes: all of it reflects a philosophy that the compiler should catch your mistakes before they become runtime bugs. That same instinct shows up clearly in his new project, CodeSpeak, which drew considerable Hacker News discussion when it was announced on March 12, 2026.

The pitch is simple on its face: instead of writing natural language prompts to direct LLM behavior, you write formal specifications. But the implications of that shift are worth unpacking carefully, because the problem CodeSpeak is trying to solve is one that anyone who has shipped an LLM-backed feature in production will recognize immediately.

The Prompt Engineering Problem

When you write an English prompt for an LLM, you are not programming. You are negotiating. The model has internalized billions of tokens of human text, and your carefully worded instruction is just another token sequence it will interpret probabilistically. Two prompts that are semantically identical to a human reader can produce dramatically different output distributions. A prompt that works reliably on gpt-4o might degrade when you migrate to the next model version. And crucially, you cannot formally verify what your prompt actually says, because it is not a specification in any rigorous sense.

This creates a category of problems that are familiar to developers but that the AI tooling ecosystem has been slow to address seriously: prompt brittleness, prompt drift across model updates, the inability to write deterministic tests for LLM behavior, and the general difficulty of reasoning about what a prompt guarantees.

The field has not been entirely asleep on this. LMQL arrived in 2022 as a SQL-inspired query language for LLMs, letting you constrain generation inline with Python-style control flow and type constraints. Guidance from Microsoft followed a template-based approach, interleaving generation and constraint in Handlebars-style syntax. Outlines takes the problem at the decoding layer, using finite state machines and regular expressions to constrain token sampling so the output is structurally guaranteed before it even leaves the model. DSPy, from Stanford’s NLP group, reframes the whole problem: instead of writing prompts at all, you write declarative modules that DSPy’s optimizer compiles down to effective prompts automatically.

Each of these tools is solving a real problem. But they are mostly solving it inside existing languages, as libraries. What Breslav appears to be arguing with CodeSpeak is that the problem is deep enough to warrant its own language.

Why “Specs” Is the Right Frame

The word choice matters here. A specification, in the computer science sense, defines the contract between a component and its callers: what inputs are valid, what outputs are guaranteed, what invariants hold. API specifications like OpenAPI and AsyncAPI work this way. TypeSpec, Microsoft’s newer API description language, tries to generalize the idea further. The notion of writing behavior as a contract first, and then generating implementations from that contract, is a classical software engineering idea that predates LLMs by decades.

Applied to LLMs, a spec-first approach would mean describing not what you want to say to the model, but what you want the interaction to guarantee. The input schema, the output schema, the behavioral constraints, the examples that serve as ground truth, the failure modes that are unacceptable. The spec becomes the source of truth; the LLM prompt becomes a derived artifact, something the toolchain can generate or optimize on your behalf.
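To make the inversion concrete, here is a rough sketch of that shape in plain Python. None of these names come from CodeSpeak, whose actual syntax has not been shown here; the point is only the direction of derivation, with the spec as source of truth and the prompt as generated output.

```python
# Hypothetical sketch: the spec is the source of truth; the prompt is derived from it.
# All names here (LLMSpec, derive_prompt, the triage example) are illustrative only.
from dataclasses import dataclass, field

@dataclass
class LLMSpec:
    """A behavioral contract for one LLM interaction."""
    input_schema: dict           # JSON-Schema-style description of valid inputs
    output_schema: dict          # what the caller is guaranteed to get back
    constraints: list[str]       # semantic invariants, stated declaratively
    examples: list[tuple[dict, dict]] = field(default_factory=list)  # ground truth

def derive_prompt(spec: LLMSpec) -> str:
    """The prompt is a derived artifact: regenerated from the spec, never hand-edited."""
    lines = ["Return JSON matching this schema:", str(spec.output_schema)]
    lines += [f"Constraint: {c}" for c in spec.constraints]
    for inp, out in spec.examples:
        lines.append(f"Example input: {inp} -> expected output: {out}")
    return "\n".join(lines)

triage = LLMSpec(
    input_schema={"type": "object", "properties": {"ticket": {"type": "string"}}},
    output_schema={"type": "object", "properties": {"priority": {"enum": ["low", "high"]}}},
    constraints=["Mentions of data loss imply priority 'high'."],
    examples=[({"ticket": "prod db wiped"}, {"priority": "high"})],
)
prompt = derive_prompt(triage)
```

Under this arrangement, editing the prompt directly would be as nonsensical as editing compiler output: you change the spec, and the toolchain regenerates or re-optimizes the prompt.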

This is a meaningful inversion of how most teams currently work. Today, the prompt is the source of truth, and the spec (if one exists at all) is documentation you write after the fact. Codifying this distinction at the language level rather than at the convention level is a significant design decision.

What Breslav Brings to This

Kotlin’s design history is relevant context here. Breslav joined JetBrains in 2010 to design what became Kotlin, and the language that shipped reflects a careful study of Java’s failure modes: verbose boilerplate, nullable references that produce NullPointerExceptions at runtime, the impedance mismatch between OOP and functional patterns. Every major Kotlin feature is a direct response to a concrete class of Java bugs or ergonomic friction points.

If he is applying the same methodology to LLM programming, the design of CodeSpeak is probably a direct response to specific, observed failure modes in prompt-based LLM interaction. The emphasis on specs over English suggests he has identified ambiguity and unverifiability as the root failures, the same way he identified nullability as Java’s central safety hole.

Language designers who have shipped production languages think about these problems differently than library authors. The tradeoffs between expressiveness and analyzability, between flexibility and guarantees, between developer ergonomics and tooling capabilities, are problems Breslav has navigated before at scale.

The Constrained Generation Connection

One technical dimension worth examining is how spec-driven LLM interaction relates to structured or constrained generation. Tools like Outlines and llama.cpp’s grammar mode work by encoding output constraints as a finite state machine and then masking the logits at each decoding step to eliminate tokens that would violate the grammar. This guarantees structural correctness without any post-hoc parsing or retry logic.

The limitation of current constrained generation approaches is that they operate at the token level, not the semantic level. You can guarantee that the output is valid JSON; you cannot guarantee that the JSON object’s priority field actually reflects a reasonable assessment of the input. The structural contract is enforced, but the semantic contract is still just a prompt.
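The gap is easy to see in miniature. The toy decoder below hand-rolls a three-state machine over a tiny vocabulary and masks the model's scores at each step; real systems like Outlines compile regexes or grammars into automata over the full token vocabulary, so this is an assumption-laden sketch of the mechanism, not anyone's actual API.

```python
# Toy illustration of token-level constrained decoding, in the spirit of Outlines
# and llama.cpp grammars. The vocabulary, state machine, and scores are invented.

VOCAB = ['{"priority":', '"low"', '"high"', '"purple"', '}', 'hello']

# Transition table: state -> set of vocabulary indices that keep the output well-formed.
FSM = {
    "start": {0},        # must open the object
    "key":   {1, 2, 3},  # any *string* is structurally fine -- even a nonsense one
    "value": {4},        # must close the object
}
NEXT = {"start": "key", "key": "value", "value": "done"}

def constrained_decode(model_scores):
    """Greedy decode, masking tokens that would violate the grammar at each step."""
    state, out = "start", []
    for scores in model_scores:                       # one score list per step
        allowed = FSM[state]
        best = max(allowed, key=lambda i: scores[i])  # argmax over *allowed* tokens only
        out.append(VOCAB[best])
        state = NEXT[state]
    return "".join(out)

# Even when the model strongly prefers free text ('hello'), the mask forces valid JSON.
scores = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],  # model wants 'hello'; mask allows only '{"priority":'
    [0.0, 0.1, 0.2, 0.6, 0.0, 0.0],  # model picks '"purple"': valid JSON, absurd semantics
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]
result = constrained_decode(scores)  # {"priority":"purple"}
```

The output is guaranteed to parse, but nothing stops the model from filling the slot with garbage: the structural contract is enforced in the mask, while the semantic one still rides on the prompt.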

A proper spec language could potentially bridge this gap by making semantic constraints explicit and tool-verifiable, not just structurally but in terms of behavior over a defined test corpus. When a spec change breaks expected behavior on a set of canonical examples, the toolchain can flag it, the same way a type error fails a build. This is closer to property-based testing applied to LLM behavior than to anything that exists today.
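Whether CodeSpeak works this way is not confirmed by the announcement, but the "canonical examples as a build gate" idea can be sketched in a few lines. The model call here is a stub that simulates a regression; a real toolchain would run the actual LLM over the corpus.

```python
# Hypothetical sketch of spec-change regression checking over a canonical corpus.
# `run_model` is a stand-in for an LLM call; "v2" simulates a regression caused
# by a spec edit. All names and data are invented for illustration.

CANONICAL = [
    ("server on fire, prod down", "high"),
    ("typo in the footer", "low"),
]

def run_model(spec_version: str, ticket: str) -> str:
    """Stand-in for the LLM; v2 misclassifies trivia as urgent after a spec edit."""
    if spec_version == "v2" and "typo" in ticket:
        return "high"
    return "high" if "prod" in ticket else "low"

def check_spec(spec_version: str) -> list[str]:
    """Like a type error failing a build: list the canonical examples now violated."""
    failures = []
    for ticket, expected in CANONICAL:
        got = run_model(spec_version, ticket)
        if got != expected:
            failures.append(f"{ticket!r}: expected {expected}, got {got}")
    return failures

v1_failures = check_spec("v1")  # empty: v1 passes the corpus
v2_failures = check_spec("v2")  # non-empty: the regression is flagged before it ships
```

The analogy to property-based testing is loose but useful: the corpus plays the role of generated inputs, and the spec's expected outputs play the role of the property being checked.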

The Adoption Challenge

New languages face a brutal adoption problem even when they solve real problems. Kotlin succeeded in part because JetBrains controlled IntelliJ IDEA, the dominant Java IDE, and because Google blessed Kotlin for Android development. CodeSpeak has neither of those tailwinds.

The Hacker News thread for CodeSpeak drew significant discussion, with 236 comments on 271 points as of the original posting, suggesting genuine developer interest rather than mere curiosity. But interest and adoption are different things. The LLM tooling ecosystem is fragmented, model-specific in many ways, and moving fast enough that any language-level bet carries real risk of being overtaken by a shift in how models work.

There is also a real question about whether developers want to learn a new language to interact with LLMs, or whether they would prefer libraries that bring spec-like guarantees to existing languages. The DSPy model, where you stay in Python and the optimizer handles prompt generation, has proven attractive precisely because it minimizes the conceptual surface area you have to internalize.

The Case for Language-Level Thinking

Despite those headwinds, there is a principled argument that a dedicated language is the right level to address this problem. The history of type systems and contract languages suggests that when a class of bugs is fundamental enough, you need language-level support to make the guarantees real. You cannot bolt null safety onto Java as a library; Kotlin had to encode it in the type system. You cannot add meaningful algebraic effects to Python as a library; you need language semantics.

If LLM interaction really does require formal specification of behavioral contracts, input and output schemas, and semantic invariants, then trying to express all of that as decorators on Python functions or as chained method calls on a fluent API will produce something that is technically workable but semantically awkward. The spec will not quite fit the host language’s idioms, and the result will be a leaky abstraction.

CodeSpeak is a bet that the leakiness is bad enough to justify the cost of a new language. Given Breslav’s track record, it is a bet worth watching carefully.

The space between natural language prompts and formal verification has a lot of room in it, and right now most of the tools live closer to the prompt end than the verification end. If CodeSpeak can find a practical equilibrium that gives developers real guarantees without requiring them to write formal proofs, it could matter considerably. The prior art, from LMQL to Guidance to DSPy, has shown that structured LLM programming has genuine demand. The question is whether a new language is the right vessel for it.
