Formalizing the LLM Interface: What Andrey Breslav Sees That Prompt Engineers Miss
Source: hackernews
Andrey Breslav spent years designing Kotlin to make the JVM feel less hostile to developers who just wanted to express intent clearly without fighting the type system. Now he’s applying that same instinct to a different problem: the fact that talking to large language models in English is a fundamentally poor interface for software systems.
The project is called Codespeak, and its premise is captured in its tagline: talk to LLMs in specs, not English.
The Actual Problem with Prompt Engineering
Prompt engineering has always been a workaround dressed up as a discipline. The core issue is that natural language is underspecified. When you write a prompt like “summarize this document concisely and highlight the key technical decisions”, you’re relying on the model’s prior training to disambiguate what “concisely” means, what counts as “technical”, and what “highlight” implies about format. That works well enough for one-off use, but at the scale of a production software system it creates a cluster of real problems.
First, prompts are brittle across model versions. A prompt tuned against GPT-4o may behave differently against GPT-4.1 or Claude 3.7, not because the models are worse, but because the behavior of ambiguous natural language instructions shifts with training distribution changes. Teams maintaining LLM-powered systems in production know this as the silent regression problem: you upgrade a model and something subtly breaks with no clear diff to examine.
Second, natural language prompts don’t compose. You can’t write a function that takes two prompt fragments and reason formally about what their combination will produce. You can concatenate strings, but that’s not composition in any meaningful sense. There’s no type signature, no precondition, no postcondition.
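The contrast can be made concrete. The sketch below is illustrative, not any real library: `Task`, `compose`, and the type names are invented to show what a composition contract could look like, next to the string concatenation that prompts actually offer.

```python
from dataclasses import dataclass

# String concatenation: always "succeeds", but says nothing about
# what the combined prompt will produce.
summarize = "Summarize this document concisely."
extract = "List the key technical decisions."
combined = summarize + " " + extract  # a valid string, unknown behavior

# A hypothetical typed alternative: each fragment declares input and
# output types, and composition is only defined when the types line up.
@dataclass(frozen=True)
class Task:
    instruction: str
    in_type: type
    out_type: type

def compose(first: Task, second: Task) -> Task:
    """Sequential composition: defined only if first's output feeds second."""
    if first.out_type is not second.in_type:
        raise TypeError("tasks do not compose: output/input types differ")
    return Task(f"{first.instruction} Then: {second.instruction}",
                first.in_type, second.out_type)

class Document: ...
class Summary: ...
class DecisionList: ...

t1 = Task("Summarize this document concisely.", Document, Summary)
t2 = Task("List the key technical decisions.", Summary, DecisionList)
pipeline = compose(t1, t2)  # well-typed: Document -> DecisionList
# compose(t2, t1) would raise TypeError: the types do not line up
```

Nothing here makes the model obey the contract, which is exactly the point: the host language can express the signature, but only the interface itself could enforce it.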
Third, and most importantly for systems builders: English prompts are not testable in the way software should be testable. You can write eval harnesses, but you’re ultimately measuring emergent behavior rather than verifying against a declared specification.
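The difference between an eval and a spec check is worth spelling out. A minimal sketch, with invented constraints (word limit, required content) standing in for a declared specification:

```python
# Verifying an output against a declared spec, rather than scoring it
# in an aggregate eval. The constraints are illustrative only.
def check_summary_spec(output: str, max_words: int = 50) -> list[str]:
    """Return a list of spec violations; empty list means conforming."""
    violations = []
    if len(output.split()) > max_words:
        violations.append(f"summary exceeds {max_words} words")
    if "decision" not in output.lower():
        violations.append("no technical decision highlighted")
    return violations

ok = check_summary_spec("Key decision: adopt gRPC for internal APIs.")
bad = check_summary_spec("word " * 60)
```

An eval harness averages a score across a dataset; a spec check gives a per-output verdict with named violations, which is what a regression diff needs.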
Libraries like DSPy, Guidance, Instructor, and Outlines have all tried to paper over this with different approaches. DSPy replaces handwritten prompts with trainable modules that optimize themselves against a metric. Guidance constrains model output to follow a grammar. Instructor coerces model outputs into Pydantic schemas. Outlines does structured generation at the token level. Each solves a piece of the puzzle inside a host language, usually Python, but none of them actually changes the interface. They’re all still fundamentally operating on natural language under the hood and using code to manage the mess around it.
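The schema-coercion approach is the easiest to illustrate. Instructor does this with Pydantic models against a live API; the stdlib-only sketch below is a simplified stand-in for the same idea, with an invented schema and a hardcoded reply in place of a model call:

```python
import json
from dataclasses import dataclass, fields

# Declare the shape you want back. (TechDecision is an invented example.)
@dataclass
class TechDecision:
    title: str
    rationale: str

def parse_model_output(raw: str) -> TechDecision:
    """Parse the model's JSON reply and reject anything off-schema."""
    data = json.loads(raw)
    expected = {f.name for f in fields(TechDecision)}
    if set(data) != expected:
        raise ValueError(f"output does not match schema: {set(data)} != {expected}")
    return TechDecision(**data)

# Stand-in for a model reply; in practice this comes from an API call.
reply = '{"title": "Use gRPC", "rationale": "Lower latency than REST"}'
decision = parse_model_output(reply)
```

Note what the schema does not capture: it validates structure after the fact but says nothing about whether the rationale is accurate or the title well chosen. The prompt that produced the JSON is still English.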
A language designer’s instinct when facing this situation is different from a library author’s instinct. A library author asks: what can I build within the existing system that makes this easier? A language designer asks: what is the right primitive?
What a Spec Language for LLMs Looks Like
The “specs, not English” framing is doing a lot of work. In software, a specification is a formal description of what something should do, separating the what from the how. SQL is a specification language in this sense: you specify the shape of the data you want, and the query planner figures out how to get it. You don’t tell the database which indexes to walk; you declare a relation and let the engine handle execution.
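The SQL analogy is concrete enough to demonstrate. Using an in-memory SQLite database (the table and data are invented for illustration):

```python
import sqlite3

# SQL as a specification: declare the relation you want and let the
# engine choose the execution plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "eu", 40.0), (2, "us", 25.0), (3, "eu", 10.0)])

# The query states *what* (regional totals over 20), not *how*:
# no mention of scans, indexes, or join order.
rows = conn.execute(
    "SELECT region, SUM(total) FROM orders"
    " GROUP BY region HAVING SUM(total) > 20"
).fetchall()
```

The spec-language bet is that LLM interaction can get the same separation: declare the constraints, let a compiler or runtime decide how to elicit conforming output.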
Applying that model to LLM interaction means writing something that formally declares the structure of your intent, the constraints on the output, the relationship between inputs and expected outputs, and possibly the verification conditions, rather than writing prose that tries to guide the model through natural language instruction.
This is conceptually closer to LMQL, a query language for language models developed at ETH Zürich for constrained generation with control flow, than it is to standard prompt engineering. LMQL lets you add conditional branches mid-generation and constrain which tokens are valid at each point. But LMQL is still embedded in Python and still relies heavily on string interpolation and natural language fragments within queries.
A fully realized spec language would go further. It would have its own grammar, its own semantics, its own notion of types for the entities being communicated about. It would let you reason about LLM interaction the way you reason about a type-checked function call, with static guarantees where possible and well-defined failure modes where not.
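What might such a spec read like? The following sketch is entirely invented for illustration; the source gives no details of Codespeak’s actual syntax.

```
spec Summarize
  input  doc: Document
  output summary: Text
  constraint words(summary) <= 120
  constraint summary mentions each item in doc.technical_decisions
  verify    readable(summary)      -- checked post-generation
```

The point is not this particular syntax but what it makes possible: a type checker can reject a spec that references a field the input doesn’t have, and a runtime knows exactly which constraints to enforce during generation and which to verify afterward.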
Breslav’s background makes him unusually well positioned to think about this. Kotlin’s design was heavily influenced by making intent explicit. The language consistently chose to make what programmers meant more visible in the syntax: explicit nullability with ?, extension functions that read as member functions without altering class hierarchies, coroutines that express asynchronous logic in sequential style. These are all choices about making the gap between intent and syntax smaller. That same instinct applied to LLM interfaces points directly toward specs.
Why a Language and Not a Library
The most common dismissal of projects like this is: you could build that as a library. Python is flexible enough. TypeScript has template literal types. Why invent a new language?
The answer is that languages and libraries have different affordances for the tools built around them. A language can have a type system that understands its own semantics. It can have an LSP implementation that gives you completions and error checking as you write. It can have a compiler that transforms specs into optimized prompt strategies, potentially switching between different execution backends (different models, different APIs) based on the spec rather than requiring the developer to manage that. A library lives inside a host language’s type system and tooling, which means it’s always fighting the host’s semantics when trying to express domain-specific concepts.
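The backend-switching idea deserves a concrete shape. The sketch below is hypothetical: the backend names and the routing rule are invented, and nothing suggests Codespeak works this way. It only shows what “the compiler chooses execution from the spec” could mean in practice:

```python
from dataclasses import dataclass

# Invented spec properties a compiler might route on.
@dataclass
class Spec:
    needs_structured_output: bool
    max_latency_ms: int

def choose_backend(spec: Spec) -> str:
    """Pick an execution strategy from declared spec properties."""
    if spec.needs_structured_output:
        return "constrained-decoding-backend"  # grammar-enforced generation
    if spec.max_latency_ms < 200:
        return "small-fast-model"
    return "large-general-model"

backend = choose_backend(Spec(needs_structured_output=True, max_latency_ms=500))
```

With a library, this routing logic lives in application code the developer maintains; with a language, it can live behind the spec, the way a query planner lives behind SQL.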
SQL survives as a distinct language because relational algebra has its own mathematical structure that doesn’t map cleanly onto general-purpose type systems. The argument for a spec language for LLMs is analogous: the semantics of specifying LLM behavior (probabilistic outputs, context dependence, verification requirements) are different enough from general programming that forcing them into a library abstraction always leaves seams showing.
There’s prior art here beyond LMQL. Microsoft Research’s TypeChat experiments used TypeScript type definitions as the specification for LLM outputs, getting structured data back by giving the model a schema and asking it to produce conforming JSON. That works, and it’s a meaningful step toward the spec approach, but it’s still using a general-purpose type language as a proxy for a domain-specific one. The spec is always going to be slightly ill-fitting when expressed in types designed for a different purpose.
The Adoption Problem
The Hacker News discussion around Codespeak surfaced the predictable skepticism about yet another language in the LLM tooling space. The concern is reasonable: this space has fragmented badly. Every few months there’s a new framework, a new abstraction layer, a new way to structure prompts. Most of them have short half-lives.
What distinguishes a language with staying power from a framework that fades is usually whether the core abstraction is right. SQL survived because the relational model turned out to be a genuinely powerful way to think about structured data, and the language mapped cleanly onto it. Kotlin survived because the JVM interop story was excellent and the language design decisions were consistently good.
For Codespeak to have staying power, the spec abstraction needs to map cleanly onto how people actually want to reason about LLM behavior in production systems. That means it needs to handle the messy cases: multi-turn conversations, tool use, context window management, retrieval augmentation, graceful degradation when model behavior doesn’t meet spec. Whether the design handles these at the language level or punts them to a runtime is one of the most interesting architectural questions in the project.
Breslav has shown with Kotlin that he can design a language that earns adoption by being genuinely useful rather than just technically elegant. The LLM tooling space badly needs something that takes the interface problem seriously at a foundational level rather than adding another abstraction layer on top of English prose. Whether Codespeak is that thing is still an open question, but the instinct behind it, that formal specifications are the right primitive for this problem, seems correct.