· 7 min read ·

The Grammar-as-Data Parser That Made the JSONata Port Tractable

Source: hackernews

The Reco.ai post about rewriting JSONata with AI in a day has been covered thoroughly from a few angles: the economics of running a JavaScript interpreter at serverless scale, the conformance test suite that made correctness verifiable, and the semantic traps in undefined propagation and sequence handling that kept prior ports from reaching full compatibility. Those explanations are accurate. What they mostly skip over is the parser layer, and why translating an expression parser from JavaScript to another language is often the hardest part of this kind of port, except when the parser is a Pratt parser.

What Makes Expression Parsing Hard to Port

A conventional recursive descent parser encodes the grammar as a collection of mutually recursive procedures. You have a parseExpression function that calls parseTerm, which calls parseFactor, which calls parseUnary, which eventually reaches atoms. The call structure mirrors the grammar’s production hierarchy. Operator precedence is implicit in which function calls which; a + has lower precedence than * because parseAddition calls parseMultiplication rather than the other way around.

This encoding is natural for humans to read and debug, but it has a property that makes translation harder: the grammar is implicit in the recursion structure. To translate the parser to a new language, you have to understand the grammar it encodes, because the precedence relationships are not written down anywhere explicitly. They emerge from which functions call which. An AI assistant translating a recursive descent expression parser has to correctly infer and preserve those call relationships, and a mistake in one call site can break the precedence of an entire class of expressions in ways that are not immediately obvious from any single failing test.

Pratt Parsers Work Differently

A Pratt parser, named after Vaughan Pratt’s 1973 paper “Top Down Operator Precedence”, inverts this encoding. Each token type carries three pieces of data:

  • A nud (null denotation) function: how to parse this token when it appears in prefix position, at the start of an expression
  • A led (left denotation) function: how to parse this token when it appears in infix position, after an expression already exists to its left
  • A bp (binding power): a number representing the token’s operator precedence

The core parsing function is then a fixed, grammar-independent loop:

function expression(rbp) {
  let token = current;
  advance();
  let left = token.nud();

  while (rbp < current.bp) {
    token = current;
    advance();
    left = token.led(left);
  }

  return left;
}

That is the complete algorithm. The grammar is encoded in the nud and led functions attached to each token type, and in the bp binding power numbers. To add a new infix operator, you register it with a binding power and a led function. The expression function never changes.

JSONata’s parser uses this approach, which is a natural choice for expression languages in the JavaScript ecosystem. Douglas Crockford used a Pratt parser in JSLint and described the technique in a widely read 2007 essay. The approach has spread through languages and runtimes that need to parse operator-heavy expression grammars without the friction of generating a parser from a formal grammar specification.

Why This Architecture Is Easier to Translate

When an AI assistant is asked to translate a Pratt parser, the problem structure is fundamentally more favorable than translating a recursive descent parser.

First, the core algorithm is minimal. The expression(rbp) loop is a dozen lines with no grammar-specific logic. Translating it to Go, Rust, or any other language is mechanical. The function has a clear specification stated in the algorithm itself: call the leading token’s nud, then consume subsequent tokens while their binding power exceeds the right binding power parameter. There is nothing to infer.

Second, each operator’s parsing logic is isolated. The nud for the minus sign handles unary negation. The led for the multiplication operator handles binary multiplication. These are independent functions with independent failure modes. If the translation of the ~> chain operator is wrong, only expressions using ~> fail. The failure is narrow and the feedback from the test suite is precise: these tests fail, they all involve this operator, here is the current translation of this operator’s handler. The correction is focused rather than systemic.

Third, and most importantly, the grammar is data. Binding powers, nud functions, and led functions are values assigned to token types during parser initialization. In JavaScript, this typically looks like a map from token types to structs or objects with function properties. Translating this to another language is a data conversion problem, not a structural refactoring problem. The precedence of + is the number 50. It is written down explicitly. Moving that number and its associated parsing function to a new language requires no inference about the grammar.

Compare this to porting a recursive descent expression parser where the precedence of + is implied by the fact that parseAddition is called from parseExpression and calls parseMultiplication. A model translating that structure has to understand what the recursion encodes before it can translate it faithfully.

A Concrete Example

Here is a simplified version of how JSONata registers its + operator. The pattern is characteristic of Pratt parsers:

registerInfix('+', 50, function(node, left) {
  node.type = 'binary';
  node.value = '+';
  node.lhs = left;
  node.rhs = expression(50);
  return node;
});

The binding power is 50. The led function receives the left-hand side, calls expression(50) to parse the right-hand side at the same precedence level (which handles left-associativity), and returns an AST node. The recursive call to expression with the same binding power is the mechanism for left-associativity across all operators; it is not specific to +.

Translating this to Go involves creating a token dispatch function with a corresponding case:

case TokenPlus:
  return &BinaryNode{
    Value: "+",
    LHS:   left,
    RHS:   p.expression(50),
  }

The structure is nearly identical. The binding power constant is still explicit. The recursive expression call with the same binding power still handles left-associativity. A model that has seen Pratt parsers can make this translation accurately because the structural correspondence is direct and the invariants are visible in the code.

The Full Expression Grammar

JSONata’s expression syntax is moderately complex. It covers arithmetic and comparison operators with standard precedence, boolean operators, path navigation using dot notation, predicate application with brackets, array and object constructors, conditional expressions, lambda definitions with the $ prefix convention, partial function application, and the ~> pipe-chain operator. This is roughly twenty distinct token types that need non-trivial nud or led implementations.

Under a recursive descent approach, implementing this grammar requires careful thought about how each construct fits in the precedence hierarchy and how left-recursive cases are handled. The interaction between path navigation and predicate application, for instance, requires specific decisions about associativity that would be encoded implicitly in the call structure. Under the Pratt approach, path navigation has a binding power, predicate application has a binding power, and the expression loop handles their interaction automatically.

For the JSONata port specifically, this means the parser was the part of the codebase where the AI-assisted translation was most likely to be accurate on the first pass, and where failures were most likely to be narrow and fixable in isolation. The evaluator had harder semantic challenges, particularly around sequence handling and undefined propagation, because those require preserving behavioral invariants across dozens of functions simultaneously. The parser had one hard problem, getting the binding powers right, and each binding power was a single number sitting next to the code that used it.

The Broader Observation

Pratt parsers have been recommended for expression parsing in practical contexts for years, particularly after Bob Nystrom’s Crafting Interpreters popularized them for the embedded language crowd. The standard reason given is that adding a new operator does not require touching the core parsing algorithm. That is true, and it is a real advantage for maintainability.

The portability advantage is less often discussed. Parsers built around explicit grammar data, where precedence is a number and parsing behavior is a function attached to a token type, translate more cleanly to new languages than parsers where the grammar is encoded in recursive procedure structure. This is not specific to AI-assisted translation; hand-ported Pratt parsers are easier to get right than hand-ported recursive descent parsers for the same reason. The grammar is a first-class artifact of the implementation rather than an emergent property of the call graph.

For the JSONata rewrite, the practical consequence was that the parser layer likely went fast and stayed correct while the harder work happened in the evaluator’s semantic edge cases. The conformance test suite is what made that semantic work verifiable. But the parser architecture is what kept the translation of the parsing layer from becoming another hard problem on top of the ones that were genuinely difficult.

If you are building a DSL or expression language and expect to port it later, the Pratt approach is worth considering specifically because it makes the grammar portable. Not just between implementations by human engineers, but between implementations produced by an AI translating your code into a target language it has never seen your specific grammar in before.

Was this interesting?