Andrey Breslav spent over a decade shaping Kotlin into a language known for its expressive type system, null safety, and careful attention to the gap between what programmers intend and what code actually does. That same concern, the distance between intent and execution, is now the animating idea behind CodeSpeak, his new project that reframes how developers communicate with large language models.
The premise is simple to state and hard to dismiss: natural language is a terrible interface for specifying software behavior, and using English to tell an LLM what to do is no exception. Prompts are ambiguous, brittle under rephrasing, and difficult to compose. CodeSpeak is an attempt to replace them with a formal specification language.
This is not a new idea. But it matters that Breslav is the one building it.
What Prompt Engineering Actually Is
The term “prompt engineering” sounds like a discipline, but in practice it is closer to folklore. You learn that adding “think step by step” improves reasoning on certain tasks. You discover that the order of few-shot examples matters. You find that the same model responds differently to “do not include” versus “exclude”. None of these behaviors are documented in any specification. They emerge from training dynamics that the model’s own developers do not fully understand, and they change between model versions without warning.
This is not a criticism of LLMs as technology. It is an observation about the interface layer. We are using a communication medium designed for humans, with all its ambiguity and contextual dependency, to specify behavior that needs to be precise and reproducible. The mismatch is architectural.
Formally specifying behavior is exactly what type systems do for compilers, what schemas do for databases, and what contracts do for APIs. The idea of applying the same rigor to LLM interactions is not exotic. Several projects have explored it.
Prior Art and Where It Falls Short
TypeChat, from Microsoft Research, uses TypeScript type definitions to constrain LLM output shapes. You define an interface, the model is asked to produce JSON that satisfies it, and TypeScript’s type checker validates the result. It solves one narrow slice of the problem: output structure. It says nothing about input semantics, behavioral contracts, or how the LLM should reason through a task.
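TypeChat itself is TypeScript, but its core loop, define a schema, ask the model for JSON, run a checker over the result, is easy to sketch. The following is a minimal stand-in in Python, not TypeChat's API: the `Order` shape and `validate_order` checker are illustrative inventions.

```python
import json
from dataclasses import dataclass

# Hypothetical schema, playing the role of a TypeScript interface.
@dataclass
class Order:
    item: str
    quantity: int

def validate_order(raw: str) -> Order:
    """Parse raw model output and reject anything that does not
    satisfy the Order shape, the way TypeChat runs TypeScript's
    checker over the model's JSON before accepting it."""
    data = json.loads(raw)
    if not isinstance(data.get("item"), str):
        raise TypeError("item must be a string")
    qty = data.get("quantity")
    if not isinstance(qty, int) or isinstance(qty, bool):
        raise TypeError("quantity must be an integer")
    return Order(item=data["item"], quantity=qty)

# A conforming response passes; '{"item": "coffee", "quantity": "two"}'
# would raise TypeError and trigger a retry in a real pipeline.
order = validate_order('{"item": "coffee", "quantity": 2}')
```

The validation happens after generation, which is exactly the limitation described above: the model is asked to comply, and the checker only catches the cases where it did not.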
LMQL, from ETH Zurich, goes further with a query language that lets you constrain generation token by token, scripting LLM behavior with Python control flow and logical constraints. It is technically impressive but feels like a low-level assembly language for prompts: powerful if you know exactly what you want, cumbersome for expressing higher-level intent.
DSPy, from Stanford, takes a different angle entirely. Rather than specifying behavior explicitly, you define program structure and let an optimizer tune the prompts for you. The appeal is that it removes prompt brittleness by making prompts a compiled artifact rather than a handwritten one. The cost is opacity: the prompts that DSPy generates are often unreadable, and debugging failures requires understanding the optimization process, not just the specification.
Outlines, from dottxt, enforces structured generation at the token level using finite-state machines and context-free grammars. It guarantees that outputs conform to a schema by construction, without relying on the model to voluntarily comply. That is a genuine contribution, but it operates at the generation layer, not at the specification layer.
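The by-construction idea can be illustrated with a toy decoder. This is not Outlines' implementation, which compiles regexes and grammars into automata over a real tokenizer's vocabulary; it is a hand-written DFA for comma-separated integers, with single characters standing in for tokens, showing how invalid tokens are masked before sampling rather than filtered afterward.

```python
# DFA accepting comma-separated integers, e.g. "1,23,4".
# State 0: a digit must come next. State 1: inside a number (accepting).
TRANSITIONS = {
    (0, "digit"): 1,
    (1, "digit"): 1,
    (1, ","): 0,
}
ACCEPTING = {1}

def step(state, ch):
    kind = "digit" if ch.isdigit() else ch
    return TRANSITIONS.get((state, kind))  # None = transition not allowed

def allowed(state, vocab):
    """The mask: only tokens with a defined transition may be sampled."""
    return [t for t in vocab if step(state, t) is not None]

def generate(preferred_tokens, vocab):
    """Greedy decode: take the 'model's' preferred token when the DFA
    permits it, otherwise fall back to the first allowed token, so the
    output conforms to the grammar by construction."""
    state, out = 0, []
    for preferred in preferred_tokens:
        ok = allowed(state, vocab)
        tok = preferred if preferred in ok else ok[0]
        out.append(tok)
        state = step(state, tok)
    return "".join(out), state in ACCEPTING

vocab = list("0123456789,")
```

Here the model's second comma in a row would be masked out, so the output can never contain ",," no matter what the model prefers.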
The pattern across these tools is that each solves a real subproblem while leaving adjacent problems untouched. TypeChat structures outputs. LMQL scripts generation. DSPy optimizes prompts. Outlines enforces syntax. None of them provide a unified language for specifying what a language model interaction is supposed to accomplish, end to end.
What a Language Designer Brings
Breslav’s background makes CodeSpeak worth taking seriously as an attempt at that unified layer. Kotlin was designed with deliberate attention to the failure modes of Java: verbose boilerplate that obscured intent, null references that made programs lie about their types, and a type system that could not express many useful invariants. The design choices in Kotlin were not aesthetic preferences. They were responses to concrete categories of bugs and miscommunications between programmer intent and runtime behavior.
Applying that same analytical lens to LLM interaction produces a specific kind of project. A language designer who has spent years thinking about the gap between what programmers mean and what code does will naturally focus on formal semantics, on what it means for a specification to be complete, consistent, and unambiguous.
The HN discussion around CodeSpeak’s launch attracted significant engagement, which suggests the idea is landing with the people most likely to use it: developers who have spent enough time writing prompts to be frustrated with the status quo, and language enthusiasts who recognize the conceptual territory.
The Hard Problems Ahead
Formal specification languages for LLMs face a structural challenge that formal specification languages for traditional software do not. When you write a type signature, a compiler checks it against a deterministic execution engine. The engine does what the spec says, or it raises an error. When you write a formal spec for LLM behavior, you are specifying behavior for a probabilistic system that was trained, not programmed. The model may not comply, and there is no enforcement mechanism short of sampling and checking outputs at runtime.
This means a spec language for LLMs has to solve two distinct problems. First, it needs to express intent precisely enough that a model trained to follow specs can interpret it correctly. Second, it needs to provide enough structure that violations can be detected and handled. These are both hard problems, and they pull in different directions. A spec that is maximally precise may be too rigid for a model to satisfy reliably. A spec that is maximally flexible may not add much over English.
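The detection half of that problem reduces to treating the spec as a runtime test suite. A minimal sketch, with an invented constraint format (named predicates over raw output) rather than anything CodeSpeak-specific:

```python
import json

def check(spec: dict, raw: str) -> list[str]:
    """Run every named constraint against a raw model output and return
    the names of the violated ones. A predicate that crashes (e.g. on
    unparseable output) counts as violated."""
    violations = []
    for name, predicate in spec.items():
        try:
            if not predicate(raw):
                violations.append(name)
        except Exception:
            violations.append(name)
    return violations

def sample_until_valid(model, spec, max_tries=3):
    """The only enforcement mechanism available for a probabilistic
    backend: resample until an output passes every check, or give up."""
    for _ in range(max_tries):
        out = model()
        if not check(spec, out):
            return out
    raise RuntimeError("no conforming sample within budget")

# Illustrative spec: output must be JSON with a short "summary" field.
SPEC = {
    "is_json": lambda s: isinstance(json.loads(s), dict),
    "has_summary": lambda s: "summary" in json.loads(s),
    "summary_short": lambda s: len(json.loads(s)["summary"]) <= 80,
}
```

The tension described above is visible even here: every constraint added to `SPEC` makes the spec more precise and, at the same time, shrinks the set of outputs the model can produce that will pass.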
There is also the question of model support. TypeScript types only constrain LLM outputs if you use a validation layer at runtime. LMQL’s constraints only work if you have access to the generation process. A specification language that wants to work across different LLM providers, with different APIs and different generation mechanisms, will need either a compilation strategy that maps specs to model-specific prompt formats, or direct model training to understand and follow the spec language. The first approach gives you portability but reintroduces translation ambiguity. The second requires either training your own models or convincing frontier labs to add spec-following to their training regimes.
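The first approach, compiling one spec into provider-specific prompt formats, can be sketched as follows. The `Spec` shape and both renderings are invented for illustration; the point is only that the spec stays portable while each backend gets its own translation, and that translation step is where ambiguity re-enters.

```python
from dataclasses import dataclass, field

# Hypothetical minimal spec: just enough structure to show the trade-off.
@dataclass
class Spec:
    task: str
    constraints: list = field(default_factory=list)

def compile_spec(spec: Spec, provider: str) -> str:
    """Render one spec into the prompt format a given backend expects.
    The spec is the single source of truth; the renderings are lossy."""
    rules = "\n".join(f"- {c}" for c in spec.constraints)
    if provider == "chat":
        return f"You must: {spec.task}\nHard constraints:\n{rules}"
    if provider == "completion":
        return f"Task: {spec.task}\nRules:\n{rules}\nOutput:"
    raise ValueError(f"unknown provider: {provider}")

spec = Spec(task="summarize the ticket",
            constraints=["at most 80 characters", "plain text"])
```

Whether "Hard constraints" or "Rules" elicits better compliance from a given model is exactly the kind of folklore the spec language was supposed to eliminate, which is why this route only relocates the problem rather than removing it.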
Why This Direction Is Right Anyway
None of these challenges make the project wrong. They make it hard. The same objections could have been raised against type inference in the 1990s, or against garbage collection becoming a default in the 2000s, or against null safety becoming a language-level guarantee in the 2010s. Each of those changes required tooling, runtime support, and eventually ecosystem adoption before it delivered on its promise.
The current state of prompt engineering, where teams maintain large text files of carefully phrased instructions and check them into version control like configuration artifacts, is genuinely unsustainable at scale. The developers who write those prompts have no tooling to check for contradictions, no type system to catch semantic errors before deployment, and no abstraction mechanism that does not require them to trust the LLM to correctly interpret compositional English prose.
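Contradiction checking is a concrete example of what a formal surface enables. English prose cannot be linted, but constraints normalized to (property, value) pairs, a structure a spec language would provide for free, can be; the normalization below is hypothetical, not a feature of any shipping tool.

```python
def find_contradictions(constraints):
    """Flag properties that two constraints pin to different values,
    i.e. the static check that plain-text prompt files cannot support.
    Input: a list of (property, required_value) pairs."""
    seen, conflicts = {}, []
    for prop, value in constraints:
        if prop in seen and seen[prop] != value:
            conflicts.append((prop, seen[prop], value))
        seen.setdefault(prop, value)
    return conflicts

rules = [("tone", "formal"), ("length", "short"), ("tone", "casual")]
# find_contradictions(rules) → [("tone", "formal", "casual")]
```

A prompt file containing both "keep a formal tone" and "keep it casual" would sail through review today; the equivalent spec fails to compile.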
A formal specification language does not solve all of those problems immediately. But it creates a foundation on which those solutions can be built. Tooling, static analysis, testing frameworks, and model support all become possible once there is a formal surface to attach them to.
Breslav built a career on that kind of patient, foundational work. CodeSpeak is early, and most of the hard questions remain open. But the observation driving it, that English is the wrong language for specifying LLM behavior, is correct. The field will eventually converge on something like what CodeSpeak is attempting. The interesting question is whether this particular design will be the one that survives.