
When Prompt Engineering Gets a Type System

Source: hackernews

The problems with prompting LLMs in natural language have been obvious for years, but the ecosystem has mostly responded with better templates and longer system prompts. CodeSpeak, a new language from Andrey Breslav, the lead designer of Kotlin, takes a different position: natural language is the wrong substrate for LLM instructions, and a proper specification language should replace it.

That is a provocative claim, and the HackerNews response reflects it, with over 200 comments across a wide spectrum of reactions. But the claim has real theoretical backing, and Breslav’s involvement matters for specific technical reasons, not just credential-dropping.

The Ambiguity Problem

When you ask an LLM to “write a function that parses dates,” you are relying on shared context that may or may not be present. What date formats? What should happen on failure: an exception or a null return? What calendar system? English prose hides all of these questions behind a surface fluency that mimics precision while achieving none.
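To make the hidden decisions concrete, here is a sketch of what the same request looks like once every unstated choice is forced into the open. The field names and rendering are purely illustrative, not drawn from CodeSpeak or any real spec language:

```python
# Hypothetical structured spec making the hidden decisions in
# "write a function that parses dates" explicit. Every field below is a
# question the prose version silently left to the model.
date_parser_spec = {
    "task": "parse_date",
    "accepted_formats": ["YYYY-MM-DD", "DD/MM/YYYY"],
    "calendar": "gregorian",
    "on_failure": "return_null",   # vs. "raise_exception"
    "timezone": "naive",           # no timezone inference
}

def render_prompt(spec: dict) -> str:
    """Flatten the spec into an instruction string with no room for
    interpretive slack about formats, failure modes, or calendars."""
    lines = [f"Task: {spec['task']}"]
    lines.append(f"- accepted formats: {', '.join(spec['accepted_formats'])}")
    lines.append(f"- calendar: {spec['calendar']}")
    lines.append(f"- on failure: {spec['on_failure']}")
    lines.append(f"- timezone handling: {spec['timezone']}")
    return "\n".join(lines)

print(render_prompt(date_parser_spec))
```

The point is not the particular rendering but that each key corresponds to one of the questions above; a prose prompt answers none of them unless the author remembers to.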

Prompt engineers have developed workarounds: explicit enumeration of constraints, few-shot examples, chain-of-thought priming, XML tags, system prompt hierarchies. These are effective at the margins, but they all operate within the same paradigm. They make natural language less ambiguous rather than eliminating ambiguity structurally.

The formal methods community has spent decades on exactly this problem in a different domain. Languages like TLA+, Alloy, and Z notation were built to describe system behavior without the interpretive slack that natural language allows. A TLA+ spec for a concurrent algorithm can be model-checked. It either satisfies the properties you claimed or it does not. There is no version of “it depends on what the author meant.”

CodeSpeak appears to be drawing on this lineage, applying spec-first thinking to the problem of communicating intent to language models.

What Already Exists

This territory is not unexplored. Several projects have tried to impose structure on LLM interaction, each with different emphases.

LMQL from ETH Zurich is the most academically grounded attempt. It adds query language semantics to prompt construction, allowing constraints like where len(TOKENS(answer)) < 100 or where answer in ["yes", "no"] to be expressed programmatically and enforced during generation. The runtime interleaves Python execution with model generation rather than treating the prompt as a static string.
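The enforcement idea can be sketched in plain Python. This is a toy analogue of an LMQL `where` clause, not the real LMQL runtime, which masks token probabilities during decoding rather than filtering finished candidates:

```python
# Toy sketch of constrained generation in the LMQL style: a predicate
# plays the role of the `where` clause, and candidates violating it are
# rejected. Real LMQL enforces this during decoding, token by token.
from typing import Callable, Iterable, Optional

def constrained_generate(candidates: Iterable[str],
                         where: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate satisfying the constraint, or None."""
    for answer in candidates:
        if where(answer):
            return answer
    return None

# Analogue of: where answer in ["yes", "no"]
model_outputs = ["maybe", "Yes!", "no"]
result = constrained_generate(model_outputs, lambda a: a in ["yes", "no"])
print(result)  # "no": the first candidate inside the allowed set
```

The difference between this sketch and the real thing matters: rejecting finished outputs wastes generation, while decode-time masking guarantees conformance on the first pass, which is why LMQL needs backend support.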

Microsoft’s Guidance takes a templating approach, using Handlebars-like syntax to define structured outputs and interleave generation with logical control flow. The core insight is that templates can represent grammars, and LLMs with constrained decoding can be made to follow them reliably.
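The templates-as-grammars insight can be illustrated without Guidance itself. The sketch below uses regex slot grammars as a stand-in for Guidance's Handlebars-like syntax, so the names and mechanism are assumptions, not its API:

```python
# Toy illustration of the grammar idea behind Guidance: a template with
# typed slots, where each slot restricts what may be emitted to a fixed
# set of alternatives. Constrained decoding makes this hold at
# generation time; here we merely check it after the fact.
import re

TEMPLATE = "Sentiment: {sentiment}\nConfidence: {confidence}"
SLOT_GRAMMAR = {
    "sentiment": re.compile(r"^(positive|negative|neutral)$"),
    "confidence": re.compile(r"^(low|medium|high)$"),
}

def fill(template: str, values: dict) -> str:
    """Fill slots only if each value matches its slot's grammar."""
    for slot, pattern in SLOT_GRAMMAR.items():
        if not pattern.match(values[slot]):
            raise ValueError(f"{slot}={values[slot]!r} violates grammar")
    return template.format(**values)

print(fill(TEMPLATE, {"sentiment": "positive", "confidence": "high"}))
```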

DSPy from Stanford moves further from the prompt-as-string model. You define modules with typed signatures, and DSPy compiles those signatures into prompts through an optimizer. The language model never sees your spec directly; the framework mediates between your declared intent and whatever surface representation actually elicits the desired behavior.
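A minimal sketch of the signature idea, using invented names rather than DSPy's actual classes: declare input and output fields, and let a "compiler" choose the surface form the model sees.

```python
# Hypothetical signature-to-prompt compilation in the DSPy spirit, not
# DSPy's real API. The developer declares intent as typed fields; the
# framework owns the prompt string.
from dataclasses import dataclass

@dataclass
class Signature:
    inputs: dict   # field name -> description
    outputs: dict  # field name -> description

def compile_to_prompt(sig: Signature) -> str:
    """One possible rendering. A real optimizer would search over many
    such renderings and keep whichever scores best on examples."""
    parts = ["Given the fields:"]
    parts += [f"  {k}: {v}" for k, v in sig.inputs.items()]
    parts.append("Produce the fields:")
    parts += [f"  {k}: {v}" for k, v in sig.outputs.items()]
    return "\n".join(parts)

qa = Signature(inputs={"question": "a factual question"},
               outputs={"answer": "a short factual answer"})
print(compile_to_prompt(qa))
```

Because the rendering lives behind `compile_to_prompt`, the framework is free to swap it out entirely without touching the developer's declared intent, which is precisely the mediation the article describes.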

Microsoft’s TypeChat takes an integration approach rather than a new language, using TypeScript types as the specification. You define the shape of what you want as an interface, and TypeChat generates prompts that instruct the LLM to produce JSON conforming to that schema.
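TypeChat's pattern, transposed to Python for illustration (the real library is TypeScript and its API differs): derive the prompt from a declared schema, then validate that the model's JSON reply actually conforms before accepting it.

```python
# Sketch of the TypeChat pattern, not its real API: the schema is both
# the prompt source and the validator, so the type declaration is the
# single point of truth about the output shape.
import json

SCHEMA = {"name": str, "quantity": int}  # stand-in for a TS interface

def schema_prompt(user_request: str) -> str:
    fields = ", ".join(f'"{k}": {t.__name__}' for k, t in SCHEMA.items())
    return (f"Translate the request into JSON with fields {{{fields}}}.\n"
            f"Request: {user_request}")

def validate(reply: str) -> dict:
    """Reject replies that are not JSON or have wrongly typed fields.
    TypeChat similarly retries when the reply fails type checking."""
    obj = json.loads(reply)
    for key, typ in SCHEMA.items():
        if not isinstance(obj.get(key), typ):
            raise TypeError(f"field {key!r} is not {typ.__name__}")
    return obj

order = validate('{"name": "apples", "quantity": 3}')
print(order)
```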

Each represents a real improvement over free-form prompting for specific use cases, but each carries limitations. LMQL requires constrained generation support from the backend. Guidance’s templating can become complex enough to obscure intent. DSPy’s compilation step adds a layer of indirection that makes debugging harder. TypeChat’s JSON-centric model works well for data extraction but poorly for tasks that require narrative generation.

What a Language Designer Brings

Breslav’s background is relevant here in a specific way. Kotlin’s design success came from decisions that look obvious in retrospect but were carefully reasoned in context: null safety as a type-level concern rather than a runtime one, data classes as first-class language citizens, extension functions for clean API evolution. None of these ideas were new, but Kotlin packaged them in a way that felt natural to Java developers. Adoption followed because the language did not demand a paradigm shift to get started.

A spec language for LLMs faces a similar challenge. The most rigorous approach, full formal logic with model checking, is also the most inaccessible. A pragmatic language designer might ask: what is the minimum formalism that eliminates most of the ambiguity, while remaining writable by someone who has never read a formal methods paper? That constraint shapes language design toward constructs with immediate intuitive payoff rather than theoretical completeness.

The Kotlin parallel extends to tooling. One of Kotlin’s early advantages was IntelliJ support from day one, courtesy of JetBrains. Whether CodeSpeak ships with comparable tooling investment remains to be seen, but a language without an editor integration is a hard sell regardless of design elegance.

There is also a design philosophy question here about where the spec boundary sits. A type system like TypeScript constrains outputs after the fact, checking that what the LLM produced conforms to a schema. A spec language operating at the input side would constrain what you are allowed to ask and how, making certain classes of ambiguity inexpressible rather than just detectable.
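The two boundaries can be contrasted in a few lines. Everything here is hypothetical, including the list of "ambiguous" terms; the point is only where the check runs relative to the model call:

```python
# Toy contrast between the two spec boundaries. Output-side checking
# accepts any prompt and tests what comes back; input-side checking
# rejects the prompt itself, so no model call is ever made with an
# ambiguity the spec language cannot express.
AMBIGUOUS_TERMS = {"soon", "few", "fast"}   # illustrative, not exhaustive

def check_output(result: str, allowed: set) -> bool:
    """Output-side: the ambiguity already reached the model; we can
    only detect a bad result after the fact."""
    return result in allowed

def check_input(prompt: str) -> str:
    """Input-side: refuse the request before any generation happens."""
    bad = AMBIGUOUS_TERMS & set(prompt.lower().split())
    if bad:
        raise ValueError(f"ambiguous terms not expressible: {sorted(bad)}")
    return prompt

check_input("retry the request 3 times")        # passes
# check_input("retry the request a few times")  # would raise
```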

The Deeper Question

The strongest criticism of the spec-over-English premise is also the simplest: LLMs were trained on natural language. Their internal representations are optimized for interpreting prose, not formal notation. Writing a spec that is perfectly unambiguous by human lights does not guarantee the model interprets it the way you intend.

This is a real concern, and the field has evidence on both sides. LLMs do respond to structured formats, sometimes dramatically so. JSON-like representations, typed schemas, and constrained templates routinely outperform equivalent prose for structured tasks. At the same time, models are sensitive to surface form in ways that formal semantics cannot predict. Two spec strings with identical meaning can produce different outputs depending on incidental formatting choices.

What the spec language approach addresses is the developer’s side of the interface. Even if the LLM internally interprets your spec through some learned statistical approximation, you gain debuggability: when the output is wrong, you have a formal object you can inspect, compare, and modify systematically, rather than iterating on natural language where causality is harder to trace.

DSPy’s optimizer makes this point implicitly. The fact that you can optimize a signature-based program suggests that the structured representation, even without being parsed logically by the model, provides enough invariance to be worth working with. Structure buys repeatability, and repeatability is what makes a system debuggable.

Where This Lands

The broader field is converging toward structured LLM interaction. The mechanism varies, from type-based schemas to query languages to dedicated spec languages, but the underlying premise is consistent. Natural language is a good interface between humans; it is a poor interface between a developer’s intent and a probabilistic system that needs consistency to be useful in production.

CodeSpeak is the most explicit statement of that premise so far. Whether the language delivers depends on design and tooling choices that will only be visible as the project develops. The problem it is attacking is real, and the prior art is mature enough that the design space is well-understood. Having someone with Breslav’s track record in pragmatic language design working on it is a meaningful signal that the result might actually be something developers will use.
