Bringing Proof Obligations into TypeScript with LemmaScript and Dafny

TypeScript’s type system is, by mainstream language standards, remarkably expressive. Conditional types, template literal types, the infer keyword, mapped types, variadic tuples: collectively they let you encode a surprising amount of invariant knowledge at the type level. But there is a categorical difference between “this value has shape X” and “this function preserves property Y,” and no amount of generic cleverness closes it. A sort function can be typed to accept and return number[], but nothing in TypeScript’s type system can say that the output is ordered, that it is a permutation of the input, or that no element was dropped.
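To make the gap concrete, here is a minimal sketch. The function names are invented for illustration. Both functions have the same signature and the compiler accepts both, yet only the first is a sort in the sense a caller would expect.

```typescript
// Both functions have the type (xs: number[]) => number[].
// The names are illustrative only.

// Returns a sorted copy that is a permutation of the input.
function sortAscending(xs: number[]): number[] {
  return xs.slice().sort((a, b) => a - b);
}

// Type-checks identically, but silently drops duplicates,
// so the output is not a permutation of the input.
function sortAscendingLossy(xs: number[]): number[] {
  return xs
    .filter((x, i) => xs.indexOf(x) === i)
    .sort((a, b) => a - b);
}

const input = [3, 1, 3, 2];
const kept = sortAscending(input);      // [1, 2, 3, 3]
const lost = sortAscendingLossy(input); // [1, 2, 3]
```

Nothing in either signature distinguishes the two; the difference lives entirely in behavior.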

This is the gap LemmaScript targets. It is a verification toolchain that lets you annotate TypeScript with proof obligations, then discharges those obligations through Dafny, Microsoft Research’s verification-aware programming language. The premise is straightforward: keep writing TypeScript, but reach into the formal methods world for correctness guarantees you cannot get from types alone.

What Dafny Actually Does

Dafny is worth understanding properly before talking about any toolchain built on top of it. It was created by Rustan Leino at Microsoft Research and has been publicly available since around 2009. The language itself looks vaguely like a mix of C# and ML, but its defining feature is that it treats verification as a first-class compile-time concern rather than a test-time one.

You express specifications using a handful of constructs:

requires clauses state preconditions that must hold before a function runs
ensures clauses state postconditions the function must guarantee on return
invariant clauses on loops state what must hold on entry to the loop and after every iteration
decreases clauses provide termination arguments
modifies and reads clauses constrain heap access and mutation

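These are not runtime assertions. Dafny turns them into verification conditions (by way of the Boogie intermediate verification language) and hands those to Z3, Microsoft Research’s SMT solver, which attempts to prove them statically. If Z3 cannot discharge a proof obligation, verification fails and the program is rejected before it ever runs. The feedback is at the line level: Dafny points to the specific postcondition, invariant, or assertion it could not verify, although working out why is often left to you.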

A simple example in Dafny:

method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 { return -x; }
  return x;
}

This is not a test. Z3 will prove that for any integer x, the returned value is non-negative and equal in magnitude to the input. No finite test suite can establish that, because the set of integers is unbounded.

Dafny also supports ghost variables and lemmas: constructs that exist purely for the proof and are erased during compilation, so they cost nothing at runtime. This lets you write inductive proofs over recursive data structures, prove properties of sorting algorithms, verify cryptographic protocols, and so on.

Why TypeScript’s Type System Falls Short Here

TypeScript’s type system is Turing-complete at the type level, which sounds impressive until you realize what it cannot express. A type can describe structure; it cannot describe behavior over arbitrary inputs.

Consider a binary search implementation:

function binarySearch(arr: number[], target: number): number {
  let lo = 0, hi = arr.length - 1;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

The TypeScript return type is number. You can document that it returns the index of target if found, or -1 otherwise, but you cannot enforce it. A caller cannot know whether this is correct for all inputs without reading the implementation or running tests. The type says nothing about the contract.
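The closest plain TypeScript gets is checking the documented contract at runtime. The sketch below writes that contract as an executable predicate; the function names are hypothetical, and the check only ever speaks about the one call it is given, never about all inputs.

```typescript
// Precondition: arr is sorted in non-decreasing order.
function isSorted(arr: number[]): boolean {
  for (let i = 1; i < arr.length; i++) {
    if (arr[i - 1] > arr[i]) return false;
  }
  return true;
}

// Postcondition for a single call of a search function.
function meetsSearchContract(arr: number[], target: number, result: number): boolean {
  if (result === -1) {
    // -1 is only acceptable when target really is absent.
    return arr.every((x) => x !== target);
  }
  // Otherwise result must be an in-bounds integer index pointing at target.
  const inBounds = Math.floor(result) === result && result >= 0 && result < arr.length;
  return inBounds && arr[result] === target;
}

const okFound = meetsSearchContract([1, 3, 5], 3, 1);     // true
const okAbsent = meetsSearchContract([1, 3, 5], 4, -1);   // true
const falseMiss = meetsSearchContract([1, 3, 5], 3, -1);  // false: 3 is present
const wrongIndex = meetsSearchContract([1, 3, 5], 3, 2);  // false: arr[2] is 5
```

A predicate like this is useful in tests, but it is still evidence about individual calls rather than a statement the compiler can enforce.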

Libraries like Zod and io-ts help with runtime validation of data shapes at system boundaries, but they are still about shape, not behavior. Effect gives you typed errors and dependency injection, which improves correctness in a different dimension. None of these touch the question of algorithmic invariants.

There have been explorations of dependent-type-style constraints in TypeScript, mostly by encoding them into the type system using tricks, but they are fragile, and nothing checks that the encoding is faithful. The encoding does not prove anything; the types merely resolve in ways that happen to mirror the invariant you want.

The Prior Art Landscape

This problem is not new. The formal methods community has been solving it in other ecosystems for decades.

Liquid Haskell adds refinement types to Haskell, letting you annotate functions with predicates that are checked by an SMT solver. The syntax feels natural in Haskell because the language is already pure and its type system is expressive. You can write:

{-@ bsearch :: xs:[Int] -> t:Int -> {v:Int | v >= -1} @-}

And Liquid Haskell will verify the bound statically.

F* from MSR and INRIA goes further, offering a full dependent type system with effect tracking. It is the main language of Project Everest, which aims at a formally verified HTTPS stack, including TLS. The HACL* cryptographic library, which ships in Firefox, was verified with F*.

Why3 is a deductive verification platform that can target multiple SMT solvers and theorem provers. Frama-C does similar work for C via the ACSL annotation language.

What all of these have in common is that each is tied to its own ecosystem (Haskell, C, or a dedicated verification language) and asks developers to accept a significant annotation burden and unfamiliar tooling. None of them targets TypeScript, and the TypeScript ecosystem has no established equivalent, which is what makes LemmaScript interesting.

What LemmaScript Actually Does

LemmaScript’s approach is to treat Dafny as a verification backend rather than a primary language. You write TypeScript and annotate functions with specification comments or decorators that LemmaScript understands. The toolchain then translates the annotated TypeScript into Dafny, runs the verification, and reports results back in terms of your original TypeScript source.

The translation layer is the hard part. TypeScript and Dafny have fundamentally different type systems and execution models. Dafny’s types are mathematical: integers are unbounded by default, sequences are immutable and functional, and the heap model is carefully controlled. JavaScript and TypeScript carry decades of runtime behavior, including prototype chains, coercion, and mutation everywhere.
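A small sketch of the mismatch, using the Abs contract from earlier. This only illustrates JavaScript number semantics; it says nothing about how LemmaScript actually handles them.

```typescript
// The Dafny Abs method, transliterated to TypeScript.
function abs(x: number): number {
  return x < 0 ? -x : x;
}

// Its two ensures clauses, written as a runtime predicate over JS numbers.
function absContractHolds(x: number): boolean {
  const y = abs(x);
  return y >= 0 && (y === x || y === -x);
}

const holdsForInteger = absContractHolds(-7); // true
// NaN >= 0 is false and NaN === NaN is false, so the "proved" contract fails.
const holdsForNaN = absContractHolds(NaN);    // false

// Dafny's int is unbounded; number is an IEEE 754 double.
const big = Math.pow(2, 53);          // 9007199254740992
const plusOneIsLost = big + 1 === big; // true here, false for mathematical integers
```

Any translation has to decide what to do with inputs like these, typically by excluding them through preconditions or by restricting the supported types.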

LemmaScript necessarily works on a restricted subset of TypeScript. You cannot verify arbitrary TypeScript, because much of JavaScript’s semantics has no clean Dafny equivalent. But for pure functions operating on simple data types, the mapping is tractable.

A verified binary search in this style might look something like:

// @requires arr is sorted
// @ensures result === -1 || arr[result] === target
function binarySearch(arr: number[], target: number): number {
  // ...
}

The toolchain takes these annotations, generates a Dafny method with the corresponding requires and ensures clauses, and runs verification. If Dafny can prove the contract, the function is marked verified. If not, you get a counterexample or a failing obligation to debug.
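As a rough mental model only, the generation step can be pictured as rendering a contract into a Dafny method header. Everything in the sketch below is hypothetical: the Contract shape, the emitDafnyHeader function, and the choice to map number to int and number[] to seq<int> are assumptions made for illustration, not LemmaScript’s actual design.

```typescript
// Hypothetical, simplified contract model. None of these names come from
// LemmaScript; clause bodies are assumed to be written in Dafny syntax already.
interface Contract {
  name: string;
  params: { name: string; dafnyType: string }[];
  result: { name: string; dafnyType: string };
  requires: string[];
  ensures: string[];
}

// Render the contract as a Dafny method header (the body is omitted).
function emitDafnyHeader(c: Contract): string {
  const params = c.params.map((p) => `${p.name}: ${p.dafnyType}`).join(", ");
  const header = `method ${c.name}(${params}) returns (${c.result.name}: ${c.result.dafnyType})`;
  const requires = c.requires.map((r) => `  requires ${r}`);
  const ensures = c.ensures.map((e) => `  ensures ${e}`);
  return [header].concat(requires, ensures).join("\n");
}

// Assumption: number maps to int and number[] to seq<int>,
// which ignores floating-point behavior entirely.
const binarySearchContract: Contract = {
  name: "BinarySearch",
  params: [
    { name: "arr", dafnyType: "seq<int>" },
    { name: "target", dafnyType: "int" },
  ],
  result: { name: "result", dafnyType: "int" },
  requires: ["forall i, j :: 0 <= i < j < |arr| ==> arr[i] <= arr[j]"],
  ensures: ["result == -1 || (0 <= result < |arr| && arr[result] == target)"],
};

const dafnyHeader = emitDafnyHeader(binarySearchContract);
```

Rendering the header is the easy part. Translating the function body and mapping verifier errors back to TypeScript source locations is where a real toolchain spends its effort.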

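The value here is not that you run Dafny in production. Dafny code is erased or compiled separately. The value is the proof itself: a guarantee, stronger than any test, that the contract holds for all inputs that satisfy the preconditions. That guarantee is about the Dafny model of your function, so it is only as strong as the translation is faithful to what the TypeScript actually does.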

The Annotation Burden Trade-off

Formal verification does not come free. Writing a correct ensures clause for a non-trivial function often requires as much thought as writing the function itself. For the binary search above, a complete specification would need to express that the output index is within bounds, that arr[result] === target when not -1, and for completeness, that -1 is only returned when target is genuinely absent from arr. That last property requires a universal quantification:

ensures result == -1 ==> forall i :: 0 <= i < arr.Length ==> arr[i] != target

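Writing this correctly, and writing the loop invariant that lets Dafny prove it, is non-trivial. A loop invariant has to be inductive: it must hold when the loop is entered, each iteration must preserve it, and together with the negated loop condition it must be strong enough to imply the postcondition. Dafny does not infer it for you in the general case. Finding that invariant is often the bulk of the verification work.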
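For the binary search, one invariant that works is “target does not occur to the left of lo or to the right of hi.” The sketch below checks that invariant at runtime on every iteration. It is a debugging aid under the assumption that arr is sorted, not a proof, and the function name is made up for illustration.

```typescript
// Binary search with its loop invariant checked at runtime.
// Invariant: target occurs neither in arr[0..lo) nor in arr(hi..end],
// and the bounds satisfy 0 <= lo <= hi + 1 <= arr.length.
function binarySearchChecked(arr: number[], target: number): number {
  let lo = 0;
  let hi = arr.length - 1;

  const invariantHolds = (): boolean => {
    for (let i = 0; i < lo; i++) if (arr[i] === target) return false;
    for (let i = hi + 1; i < arr.length; i++) if (arr[i] === target) return false;
    return 0 <= lo && lo <= hi + 1 && hi < arr.length;
  };

  while (lo <= hi) {
    if (!invariantHolds()) throw new Error("loop invariant violated");
    const mid = Math.floor((lo + hi) / 2);
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }

  // On exit lo === hi + 1, so the two excluded ranges cover the whole array,
  // which is exactly the "target is absent" postcondition.
  if (!invariantHolds()) throw new Error("loop invariant violated");
  return -1;
}
```

Calling it on an unsorted array such as [5, 1, 3] with target 5 trips the check, which shows where the sortedness precondition is actually used: it is what makes each iteration preserve the invariant.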

This is inherent to the domain, not a flaw in LemmaScript specifically. Any verification toolchain faces it. Liquid Haskell is often easier because Haskell’s purity lets the prover infer more. Dafny gives you more control but demands more annotation.

For TypeScript developers encountering this for the first time, the learning curve is real. The payoff is also real, particularly for code where correctness is genuinely load-bearing: parsing, cryptography, data structure invariants, financial calculations.

Where This Sits in the Ecosystem

The most direct comparison is to TypeScript’s own type-level programming. Developers already push the type system hard to encode invariants: ReadonlyArray, branded types, discriminated unions, the satisfies operator. These catch a real class of bugs. LemmaScript targets a different class entirely, one that types cannot reach.
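A branded type is a good example of how far the type-level approach goes and where it stops. In this sketch (all names are illustrative), the brand records that a runtime check was performed, and the compiler then refuses unbranded arrays. What it cannot do is verify that the brand was earned.

```typescript
// A branded type: structurally a readonly number[], plus a phantom tag.
type SortedArray = readonly number[] & { readonly __brand: "SortedArray" };

// The intended way to obtain the brand: a runtime check, then a cast.
function asSorted(xs: readonly number[]): SortedArray | null {
  for (let i = 1; i < xs.length; i++) {
    if (xs[i - 1] > xs[i]) return null;
  }
  return xs as SortedArray;
}

// Relies on sortedness: the first element is assumed to be the smallest.
function smallest(xs: SortedArray): number | undefined {
  return xs[0];
}

const checked = asSorted([1, 2, 5]);  // a SortedArray
const rejected = asSorted([3, 1, 2]); // null

// The compiler takes any cast on trust, so the brand is a convention, not a proof.
const forged = [3, 1, 2] as unknown as SortedArray;
const wrongAnswer = smallest(forged); // 3, although the smallest element is 1
```

The brand prevents accidental misuse, which is valuable, but the link between the tag and the property it names is maintained by discipline rather than by the checker.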

The more honest comparison is to property-based testing via fast-check. Property testing also tries to verify behavioral invariants by generating many inputs and checking that properties hold. It is more accessible, integrates naturally into existing test suites, and catches many real-world bugs. But it samples the input space rather than covering it. A verified Dafny proof covers every input that satisfies the preconditions; fast-check covers the inputs the generator happened to produce.
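The sampling limitation is easy to demonstrate. The sketch below deliberately does not use fast-check; it hand-rolls a tiny seeded generator as a dependency-free stand-in, so it has none of fast-check’s shrinking or generator combinators. All function names are invented for illustration.

```typescript
// Tiny seeded linear congruential generator, so runs are reproducible.
function makeRng(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // in [0, 1)
  };
}

function binarySearch(arr: number[], target: number): number {
  let lo = 0, hi = arr.length - 1;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

// Deliberately planted bug: wrong for exactly one target value.
function binarySearchRareBug(arr: number[], target: number): number {
  if (target === 1000003) return -1;
  return binarySearch(arr, target);
}

type Search = (arr: number[], target: number) => number;

// Property: the result is a valid index of target, or -1 with target absent.
function holdsFor(search: Search, arr: number[], target: number): boolean {
  const r = search(arr, target);
  return r === -1 ? arr.indexOf(target) === -1 : arr[r] === target;
}

// Check the property on random sorted arrays with values in [0, 50).
function checkProperty(search: Search, runs: number, seed: number): boolean {
  const rng = makeRng(seed);
  for (let n = 0; n < runs; n++) {
    const length = Math.floor(rng() * 20);
    const arr: number[] = [];
    for (let i = 0; i < length; i++) arr.push(Math.floor(rng() * 50));
    arr.sort((a, b) => a - b);
    const target = Math.floor(rng() * 50);
    if (!holdsFor(search, arr, target)) return false;
  }
  return true;
}

const correctPasses = checkProperty(binarySearch, 1000, 42);      // true
const buggyPasses = checkProperty(binarySearchRareBug, 1000, 42); // also true
const buggyCaught = !holdsFor(binarySearchRareBug, [1000003], 1000003); // true
```

The planted bug survives a thousand runs because the generator never produces 1000003. A proof of the postcondition would not go through for binarySearchRareBug at all, since the claim has to hold for every target.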

For most TypeScript applications, property testing is the right tool. For code where “most inputs” is not good enough, formal verification is worth the cost. LemmaScript makes that option available without requiring developers to fully learn Dafny as a primary language, which lowers the barrier enough that it might actually get used.

The project is early. The subset of TypeScript that maps cleanly to Dafny is restricted, the tooling is not mature, and diagnosing a failed proof from Dafny’s error messages is notoriously difficult even when you are working in Dafny directly. But the direction is sound. Bringing SMT-backed proof obligations into the TypeScript workflow, even partially, gives the ecosystem something it has not had before: a path from “we tested this thoroughly” to “we proved this correct.”

That is a meaningful distinction, and it is worth paying attention to even if you never use it in production.