From Ubiquitous Language to Executable Specifications: What LLMs Reveal About Domain Precision
Source: martinfowler
Back in January, Martin Fowler published a conversation with Rebecca Parsons and Unmesh Joshi about how LLMs reshape the abstractions developers build. The central concept is the what/how loop: the iterative process of specifying intent at one level of abstraction and having something (a compiler, a query planner, or an LLM) generate the concrete realization at the next level down. That framing is useful, and the conversation has been circulating among engineering teams that take AI-assisted development seriously.
What makes it worth revisiting two months on is something specific to Parsons’s contribution. She brings programming language theory to a conversation that could easily stay at the level of developer intuitions, and the formal framing illuminates a problem that more informal treatments miss: the gap between what domain language expresses and what machine-executable specification requires.
Denotational Intent and Operational Binding
The formal version of what/how separation is the distinction between denotational semantics and operational semantics. Denotational semantics defines what a program means: the mathematical objects it denotes, independent of any execution mechanism. Operational semantics defines how it runs: the specific steps the machine takes to produce a result. A well-designed language lets you reason at the denotational level without thinking about the operational one, because the denotational meaning is preserved regardless of which operational path is chosen.
This distinction has immediate practical significance. When you write SQL to retrieve rows matching a condition, you state what data you want in denotational terms, and the query planner is free to choose any operational path that preserves that meaning. The abstraction holds because the query planner is formally bound to preserve denotational correctness. You can trust the translation.
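The same property can be shown in miniature. In this sketch (hypothetical names and data), two operationally different lookups denote the same function from id to user, which is exactly the kind of equivalence a query planner is bound to preserve when it swaps execution plans:

```typescript
type User = { id: number; name: string };

const users: User[] = [
  { id: 1, name: 'Ada' },
  { id: 2, name: 'Grace' },
];

// Operational path 1: linear scan over the array.
function findByScan(id: number): User | undefined {
  return users.find((u) => u.id === id);
}

// Operational path 2: hash-index lookup, with the index built once up front.
const index = new Map(users.map((u) => [u.id, u]));
function findByIndex(id: number): User | undefined {
  return index.get(id);
}

// Both paths denote the same mapping; either may be substituted for the
// other without changing what any caller observes.
console.log(findByScan(2)?.name); // 'Grace'
console.log(findByIndex(2)?.name); // 'Grace'
```

Choosing between the two is purely an operational decision; nothing at the call site needs to change, which is what it means for the abstraction to hold.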
The problem with natural language as a specification medium is that it has strong denotational content and no operational binding. When you describe a feature in a prompt, you communicate meaning. That meaning is partial, ambiguous, and dependent on context the prompt does not state. The LLM generates something that matches the prompt’s literal surface meaning while potentially violating the full semantic contract the developer intended. There is no formal guarantee equivalent to what the query planner provides, and the LLM is not bound to honor one.
Where Domain Language Falls Short
The connection to Domain-Driven Design is direct. Eric Evans’s ubiquitous language, introduced in Domain-Driven Design: Tackling Complexity in the Heart of Software (Addison-Wesley, 2003), is a practical attempt to bridge the gap between domain expert intent and software implementation without requiring full formal specification. You establish a shared vocabulary where each term carries stable semantics agreed on by the domain experts and developers working within a bounded context. “Order” has specific state transitions and invariants. “Customer” has specific lifecycle behaviors. Every team member understands what the terms mean, and that shared understanding does the work that formal specifications would otherwise do.
Ubiquitous language works because the semantics of the terms are held jointly by the people using them. When a developer writes code involving an Order, they mean the full set of behavioral constraints that the bounded context has defined for that term. The precision is tacit and social, not formal and machine-verifiable.
LLMs are trained on domain language. They learn that “Order” relates to “Customer,” to “LineItem,” to “Fulfillment.” They generate code that uses these terms in syntactically coherent ways. What they do not have is the semantic contract: the bounded context that gives each term its precise meaning within this system.
Consider what happens with a term as apparently simple as “active customer”:
// What "active customer" means in the billing bounded context
function isActive(customer: Customer): boolean {
return customer.subscriptionStatus === 'paid' && !customer.suspendedAt;
}
// What "active customer" means in the access control bounded context
function isActive(customer: Customer): boolean {
return customer.lastLoginAt > subDays(new Date(), 90) && !customer.deactivatedAt;
}
// What "active customer" means in the analytics bounded context
function isActive(customer: Customer): boolean {
return customer.eventCount30Days > 0;
}
All three implementations are correct within their bounded context. An LLM generating a feature that involves “active customers” will select one interpretation from its training distribution, not from the explicit bounded context of the system being modified. The prompt says “active customer”; the generated code honors some definition of that term; whether it honors the right one depends on how much bounded context was embedded in the prompt. In most cases, almost none of it was.
This is the specification gap. Ubiquitous language closes the vocabulary gap between domain experts and developers. It does not close the gap between developer intent and machine-executable specification, because that closure requires the semantic contracts to be explicit somewhere the machine can read them.
The Evaluation Problem Requires Domain Semantics
The reason this matters extends beyond LLMs sometimes generating wrong implementations. Developers review generated code and catch errors, and the review process is load-bearing. The deeper problem is that a certain class of errors is invisible without domain knowledge that the reviewer must hold independently.
Consider: an LLM generates a method that checks customer activity before processing a charge. The method compiles, the unit tests pass, and the logic looks reasonable to a developer who is not deeply familiar with the billing domain’s definition of “active.” The error surfaces as a production edge case months later, because catching it earlier required understanding the bounded context semantic that was never stated in the prompt, the code, or the tests.
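A concrete, hypothetical version of that failure mode: the generated check below compiles, reads sensibly, and would pass a shallow test, but it applies the analytics definition of activity on a billing path, where "active" is supposed to mean a paid, non-suspended subscription. All names here are illustrative, not from any real codebase:

```typescript
type Customer = {
  subscriptionStatus: 'paid' | 'free';
  suspendedAt: Date | null;
  eventCount30Days: number;
};

// Plausible-looking generated code: "active" here is the analytics
// definition (recent product usage), not the billing one (paid and not
// suspended). Nothing in the types or the surface logic flags the mismatch.
function canCharge(customer: Customer): boolean {
  return customer.eventCount30Days > 0;
}

// A suspended free-tier user who clicked around recently slips through.
const suspendedFreeUser: Customer = {
  subscriptionStatus: 'free',
  suspendedAt: new Date('2024-01-10'),
  eventCount30Days: 12,
};
console.log(canCharge(suspendedFreeUser)); // true, but charging them is wrong
```

A reviewer who does not already know the billing context's definition of "active" has nothing here to object to.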
This is the concern Unmesh Joshi raised in his earlier piece on the learning loop: the developer reviewing LLM output needs genuine understanding of the domain’s “what” to evaluate whether the “how” is correct. Lexical familiarity with domain terms is not sufficient. Semantic familiarity with the contracts those terms carry is what makes evaluation possible.
The asymmetry between writing specifications and verifying them is real, and it usually favors verification. A developer who could not have written a complex SQL query from scratch can still recognize that the generated query has the wrong join condition. That asymmetry is what makes code review viable at all when working with generated code. But the asymmetry breaks down when the error is a violation of an unstated semantic contract rather than a mechanical error in logic. You cannot recognize a violation of a contract you were never told exists.
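The wrong-join-condition case can be made concrete with an in-memory sketch (hypothetical data): a reviewer who could not have written the join from scratch can still see that joining orders to customers on the wrong key changes the result.

```typescript
type Customer = { id: number; region: string };
type Order = { id: number; customerId: number };

const customers: Customer[] = [
  { id: 1, region: 'EU' },
  { id: 2, region: 'US' },
];
const orders: Order[] = [{ id: 10, customerId: 2 }];

// Correct join condition: order.customerId = customer.id
const correct = orders.map((o) => customers.find((c) => c.id === o.customerId));

// Mechanically wrong join condition: order.id = customer.id. It type-checks
// and runs, but matches nothing here; the wrong key is visible to a reviewer
// who knows the schema, even one who could not have written the query.
const wrong = orders.map((o) => customers.find((c) => c.id === o.id));

console.log(correct[0]?.region); // 'US'
console.log(wrong[0]); // undefined
```

The mechanical error is recognizable from the code alone. A semantic-contract violation of the kind described above leaves no comparable trace in the code.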
Making the Semantic Contract Machine-Verifiable
The practical response is to make the semantic contracts explicit and verifiable, so that the machine can check them rather than relying on the reviewer to hold them in memory.
Strong type systems are one mechanism. A type that encodes the billing domain’s definition of “active” cannot be accidentally substituted for the access control domain’s definition:
```typescript
// Explicit domain types make bounded context visible to the LLM
type BillingActiveStatus = { readonly subscriptionStatus: 'paid'; readonly suspendedAt: null };
type AccessActiveStatus = { readonly lastLoginAt: Date; readonly deactivatedAt: null };

function chargeBillingActiveCustomer(customer: Customer & BillingActiveStatus): Receipt { ... }
function grantAccessToActiveCustomer(customer: Customer & AccessActiveStatus): Session { ... }
```
The type signature communicates which bounded context’s semantic is in play. The LLM can read this without tracing through any implementation. More importantly, if the LLM generates code that passes the wrong type of customer to either function, the compiler catches it before any reviewer sees the code.
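One way to make that binding operational, sketched here with hypothetical field names, is a type-guard "smart constructor": the only route into the billing-active type is a check that encodes the billing context's definition, so any code holding a value of that type has provably passed it.

```typescript
type Customer = {
  id: number;
  subscriptionStatus: 'paid' | 'free';
  suspendedAt: Date | null;
};

// The billing context's definition of "active", encoded as a narrowed type.
type BillingActive = Customer & { subscriptionStatus: 'paid'; suspendedAt: null };

// Type guard: passing this check is the only way to obtain a BillingActive
// value, so the semantic contract travels with the type.
function isBillingActive(c: Customer): c is BillingActive {
  return c.subscriptionStatus === 'paid' && c.suspendedAt === null;
}

const paid: Customer = { id: 1, subscriptionStatus: 'paid', suspendedAt: null };
const suspended: Customer = { id: 2, subscriptionStatus: 'paid', suspendedAt: new Date() };

console.log(isBillingActive(paid)); // true
console.log(isBillingActive(suspended)); // false
```

Downstream functions then accept `BillingActive` rather than `Customer`, and the compiler refuses any value that has not been through the guard.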
Architecture fitness functions extend this to structural properties. Neal Ford, Rebecca Parsons, and Patrick Kua’s Building Evolutionary Architectures (O’Reilly, 2017) introduced fitness functions as executable tests that verify structural and behavioral properties rather than individual feature behavior. In a Java or Kotlin codebase, ArchUnit implements these directly:
```kotlin
@ArchTest
val billingContextIsolation: ArchRule = noClasses()
    .that().resideInAPackage("..billing..")
    .should().dependOnClassesThat()
    .resideInAPackage("..access.impl..")
    .because("billing and access control have separate definitions of customer activity")
```
When an LLM generates code that crosses a bounded context boundary, the fitness function fails. The violation is explicit, machine-attributable, and visible in CI before any human reviews the change. The semantic contract that previously lived only in the team’s shared knowledge now lives in an executable test.
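The same idea carries to codebases without ArchUnit. Here is a minimal hand-rolled sketch (hypothetical paths, not a real tool's API) of a fitness function that fails when an import edge crosses from the billing context into access-control internals:

```typescript
// A dependency edge: one file importing another. In CI these edges would
// come from an import graph extracted from the codebase; hard-coded here.
type ImportEdge = { from: string; to: string };

// Structural fitness function: report every import that crosses from the
// billing context into access-control implementation internals.
function boundaryViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter(
    (e) => e.from.includes('/billing/') && e.to.includes('/access/impl/')
  );
}

const edges: ImportEdge[] = [
  { from: 'src/billing/charge.ts', to: 'src/billing/status.ts' },
  { from: 'src/billing/charge.ts', to: 'src/access/impl/session.ts' },
];

const violations = boundaryViolations(edges);
console.log(violations.length); // 1: the billing -> access.impl import
```

A CI step that fails when the violation list is non-empty turns the boundary from a convention into an enforced constraint.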
This is the direction Birgitta Böckeler’s harness engineering framing points toward: codebase design choices that make the “what” accessible to the LLM through interfaces, types, and structure rather than through prompts alone. Types encode bounded context semantics. Fitness functions make those semantics verifiable. Together they convert tacit domain knowledge into machine-readable specification.
The Precision Demand Has Always Been There
The Fowler conversation positions LLMs as raising the importance of specification precision. That framing is accurate but understates the continuity. The precision demand has been there since the beginning, and every wave of automation that tried to make “how” generation cheaper ran into the same wall.
CASE tools in the late 1980s and early 1990s could generate code from visual diagrams. They mostly failed not because the generation mechanism was wrong, but because analysts lacked the conceptual vocabulary to specify systems precisely enough for generation to work. Fourth-generation languages generated database queries; the bottleneck shifted immediately to specifying exactly what queries to produce. Low-code platforms moved application structure below the line; specifying complex business rules still required the same conceptual work that programming required.
What LLMs change is the breadth of the surface area where the precision demand applies. Where 4GLs raised the specification bar for database queries and no further, LLMs raise it for arbitrary feature development. Every natural language prompt is a specification claim, and the gap between what the prompt literally says and what the developer fully intended is the loop failure point.
Rebecca Parsons’s formal semantics background matters here because the field already has tools for closing this gap under stricter conditions. Formal specification languages, type systems, executable tests, architecture fitness functions: these are all mechanisms for converting informal intent into machine-verifiable meaning. They make the denotational semantics operational enough that the system can verify they have been honored.
The domain language was always insufficient as a formal specification; working developers could fill in the gaps from shared tacit knowledge about bounded context semantics. LLMs trained on that language cannot access the tacit knowledge, only the surface vocabulary. The parts that developers previously carried in their heads have to become explicit in types, tests, and structural constraints. The tools for doing that are not new. The urgency of using them is.