
The Interface Definition Language Is Back, and This Time the Server Is a Language Model

Source: hackernews

There is a well-worn template for what happens when you need to communicate with a system that lives at the boundary of your application’s type system. You write a specification language. The spec describes inputs and outputs formally. Tooling generates whatever the local environment needs: client stubs, validation schemas, documentation, wire format serializers. The spec is the source of truth, and everything else is derived from it.

This is what CORBA’s Interface Definition Language did in the 1990s. It’s what Apache Thrift and Protocol Buffers did for distributed systems through the 2000s and 2010s. It’s what OpenAPI does for REST services today. When Andrey Breslav, the designer of Kotlin, launched CodeSpeak in March 2026 to significant discussion on Hacker News, the framing he chose, "specs instead of English," places the project in this lineage. CodeSpeak is an interface definition language, except the system on the other side of the interface is a large language model.

That framing clarifies both what it’s likely to get right and what will be genuinely difficult.

How Interface Definition Languages Earn Their Keep

The value proposition of an IDL is not primarily syntactic. It’s that you get a single source of truth that compiles into artifacts for multiple consumers. A .proto file in Protocol Buffers can generate Python client code, Go server stubs, and a JSON schema for documentation, all from the same definition. The spec encodes the contract once, and the rest is mechanical derivation.

For network services, this solved a concrete problem. Before widespread adoption of gRPC or Thrift, teams maintained client libraries in three languages by hand, kept documentation in sync with actual behavior by convention, and discovered mismatches between what a service accepted and what a client sent at runtime, in production. An IDL moved that discovery forward to compile time. If the schema says a field is required and the client does not set it, the generated code enforces that before the message leaves the machine.

People who build production LLM integrations make a structurally identical complaint about prompt engineering. Discovery happens at runtime, in production, when a user triggers an edge case you did not think to test. Your English prompt has no type system, and there is no compiler to catch the mismatch between what you said and what you meant.

The LLM as Service

The LLM-as-service framing is not new. Every major model provider exposes a REST API, and structured output modes already give you something closer to a typed interface. OpenAI’s tool calling, Anthropic’s tool use, and Google’s function calling all require you to write a JSON schema describing what the model should produce or invoke. That schema is an interface definition in miniature.

{
  "name": "classify_text",
  "description": "Classify the input and return a structured result",
  "parameters": {
    "type": "object",
    "properties": {
      "category": { "type": "string", "enum": ["positive", "negative", "neutral"] },
      "confidence": { "type": "number" },
      "explanation": { "type": "string" }
    },
    "required": ["category", "confidence"]
  }
}

The problem is that this schema lives embedded in a Python dict literal or a TypeScript object literal, with no dedicated parser, no linter that understands its semantics, no diff tool that can tell you what changed about the contract between versions. It is not a first-class artifact you can maintain separately from the code that uses it.
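A minimal sketch of what first-class treatment would add, assuming nothing beyond the standard library: keep the schema as a standalone JSON document and run even a trivial lint over it. The helper name `lint_tool_schema` is hypothetical, not part of any provider's tooling.

```python
import json

# The schema as a standalone artifact rather than an inline dict literal.
# This one contains a contract bug: "explanation" is required but never declared.
SCHEMA_SOURCE = """
{
  "name": "classify_text",
  "parameters": {
    "type": "object",
    "properties": {
      "category": {"type": "string", "enum": ["positive", "negative", "neutral"]},
      "confidence": {"type": "number"}
    },
    "required": ["category", "confidence", "explanation"]
  }
}
"""

def lint_tool_schema(schema: dict) -> list[str]:
    """Trivial lint: every required field must actually be declared."""
    params = schema["parameters"]
    declared = set(params.get("properties", {}))
    missing = [f for f in params.get("required", []) if f not in declared]
    return [f"required field {f!r} is not declared in properties" for f in missing]

schema = json.loads(SCHEMA_SOURCE)
print(lint_tool_schema(schema))
# flags 'explanation': required but never declared in properties
```

Embedded in a dict literal, the same bug is just another line in a code diff; as a standalone artifact, it is something a schema-aware tool can catch before the contract ships.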

This is a pattern Discord bot developers will recognize from a different angle. Discord’s slash command API forces you to declare parameter names, types, descriptions, and valid choices as a formal schema before your handler runs. That is the input half of what a spec language would provide. What is missing is the same rigor on the output side: a formal declaration of what the model should produce, with dedicated tooling to enforce and evolve it as your requirements change. You have been writing half an IDL this whole time without calling it that.

Where the Analogy Gets Strained

The hard part is that a language model is not a gRPC service. A gRPC service executes a deterministic function: given the same inputs and the same code, it produces the same outputs. A language model is probabilistic. The semantics of the function are learned from a training distribution, not written by a programmer, and the same prompt at the same temperature may produce different outputs.

Protocol Buffers specifies message structure; it says nothing about what a service does with messages. For a deterministic service, behavioral verification is a testing problem. For a probabilistic model, it is a different class of problem entirely.

LMQL from ETH Zürich and Outlines both approach structural constraint at the token level, making it mechanically impossible for the model to produce output that violates a declared schema. That gives you structural conformance. It does not give you semantic conformance: the model can produce a structurally valid response that is semantically wrong. Microsoft’s TypeChat experiments showed that TypeScript type definitions work well as output schemas for LLMs, but the type definition tells you nothing about whether the content of a sentiment field actually reflects the sentiment of the input.
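The gap between the two kinds of conformance fits in a few lines. The validator below is a hypothetical stand-in for constrained decoding or schema validation, and the model output is invented for illustration:

```python
# Structural conformance is checkable mechanically; semantic conformance is not.

def conforms(resp: dict) -> bool:
    """Structural check: category is in the enum, confidence is numeric."""
    return (
        resp.get("category") in {"positive", "negative", "neutral"}
        and isinstance(resp.get("confidence"), (int, float))
    )

text = "This product ruined my week."
model_output = {"category": "positive", "confidence": 0.93}  # semantically wrong

assert conforms(model_output)  # passes: the declared schema is fully satisfied
# No schema check can tell us that "positive" contradicts the input text.
```

Constrained decoding guarantees the first assertion by construction; nothing at the schema level addresses the comment below it.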

A spec language for LLMs has to account for this gap. One reasonable approach is to treat the spec as a contract over a test corpus: the spec is valid when model outputs conform to the schema on a set of declared examples. This is closer to property-based testing than to type checking, but it is the right shape for the domain. Whether CodeSpeak handles this at the language level or defers it to a separate evaluation framework is one of the most consequential design decisions in the project.
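The corpus-as-contract idea can be sketched concretely. Everything here is illustrative: `run_model` stands in for a real LLM call, and the corpus format is invented, not CodeSpeak's.

```python
# A spec validated against a declared example corpus: the spec "holds"
# only if the model's output conforms on every declared example.

CORPUS = [
    {"input": "I love this.", "expect_category": "positive"},
    {"input": "Terrible experience.", "expect_category": "negative"},
]

def run_model(text: str) -> dict:
    # Stand-in for an LLM call: a trivial keyword classifier.
    return {"category": "positive" if "love" in text else "negative"}

def spec_holds(corpus) -> bool:
    """Property-based shape: quantify conformance over declared examples."""
    return all(
        run_model(ex["input"])["category"] == ex["expect_category"]
        for ex in corpus
    )

print(spec_holds(CORPUS))  # True for this toy corpus
```

The check is probabilistic in practice, since the model behind `run_model` is, which is exactly why this resembles property-based testing more than type checking.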

What the Compilation Model Looks Like

In the IDL world, the compiler transforms a spec into artifacts: generated code, wire format documentation, validation logic. The analogous pipeline for a spec language targeting LLMs would transform a spec into prompts, tool calling schemas, structured output configurations, and test harnesses.

BAML from Boundary already does something close: you write a function definition in a custom schema format, and the tooling generates prompt templates and output parsers in your host language. The spec is the source of truth; the host language code is derived. That is the IDL pattern, implemented as a library.

A dedicated language can go further. The compiler could optimize prompt generation based on the target model, select different generation strategies (constrained decoding versus retry-and-validate) based on the output schema complexity, or generate multi-turn conversation scaffolding from a declarative spec. These optimizations are difficult to implement cleanly in a library because they require understanding the semantics of the spec, not just its structure. A library lives inside the host language’s type system, which means it is always working against the grain when trying to express domain-specific concepts that do not map cleanly onto general-purpose types.
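The retry-and-validate strategy mentioned above, as a minimal sketch. `call_model` and `validate` are hypothetical stand-ins for a model client and a schema check; a real compiler-selected strategy would feed the validation error back into the next prompt rather than retrying blindly, as the comment notes.

```python
# Retry-and-validate: the fallback a compiler might select when the target
# provider offers no constrained decoding for the declared schema.

def validate(resp: dict) -> bool:
    return resp.get("category") in {"positive", "negative", "neutral"}

def call_with_retries(call_model, prompt: str, max_attempts: int = 3) -> dict:
    last = None
    for attempt in range(max_attempts):
        last = call_model(prompt, attempt)
        if validate(last):
            return last
        # A real implementation would append the validation failure to the
        # next prompt instead of retrying with the same one.
    raise ValueError(f"no valid output after {max_attempts} attempts: {last!r}")

# Toy model: produces an invalid response once, then a conforming one.
responses = iter([{"category": "meh"}, {"category": "neutral"}])
result = call_with_retries(lambda p, a: next(responses), "classify: ...")
```

Which strategy is cheaper depends on the schema: constrained decoding pays per token, while retry-and-validate pays per failed round trip, and a compiler that sees the whole spec can make that call per function.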

The Track Record and the Bet

Breslav’s track record with Kotlin is the main reason to take CodeSpeak seriously past the announcement phase. Kotlin’s design reflected a sustained commitment to pragmatism: null safety because NullPointerException was the leading cause of Java crashes, coroutines because existing async solutions in the JVM ecosystem were too heavy, seamless Java interop because the existing ecosystem was too large to discard. Every significant design choice addressed a specific, observed failure mode rather than an aesthetic preference.

The same methodology applied to LLM programming would produce a language aimed at the most painful concrete problems: prompt brittleness across model updates, inability to test against a declared contract, no composability at the interface level. These are the places where English prompts fail in production, and they are the places where a spec language, done right, would provide real value.

The IDL lineage gives this kind of project a known pattern, with 30 years of lessons about schema evolution, backward compatibility, and tooling ecosystems. Protocol Buffers’ field numbering scheme, OpenAPI’s discriminator support, and Thrift’s optional field handling all emerged from production experience with what breaks when specs change. CodeSpeak will face analogous challenges as models update and as the structured output capabilities of different providers diverge.

The probabilistic nature of the target system is the genuine novelty, the thing no prior IDL had to contend with. How CodeSpeak draws the line between structural guarantees and semantic guidance, between what the spec enforces and what it only declares as intent, will determine whether it finds a design that is both rigorous and practical. The instinct to treat this as an interface definition problem rather than a prompt engineering problem is the right place to start.
