
53% Faster, 61% Fewer Allocations: What the Liquid Speedup Teaches About Ruby at Scale

Source: simonwillison

Simon Willison flagged a notable performance result from the Shopify/liquid repository this week: a 53% improvement in parse and render throughput, paired with 61% fewer object allocations. These are not incremental gains from micro-optimizations; numbers of this magnitude usually mean something structural changed. Looking at the history of how Liquid has been optimized makes the picture clearer.

Liquid is the sandboxed template engine at the core of Shopify storefronts. Every product page, every collection listing, every cart page rendered by a Shopify merchant goes through it. The engine was written by Tobias Lütke in 2006 and deliberately limited: no arbitrary Ruby execution, no file I/O, no access to the broader object graph unless you explicitly expose it via Liquid::Drop. The constraint is the point. You can parse a template once, cache the result, and hand it to untrusted theme developers knowing they cannot exfiltrate data or call system commands.

That safety model, though, comes with a performance ceiling that compiled template engines do not have. ERB, Haml, and Slim all work by compiling templates down to Ruby bytecode that the MRI runtime can optimize. Liquid interprets an AST on every render call. Each {{ product.title | upcase }} walks the node tree, resolves the variable through a scope stack, dispatches to a filter, and accumulates the result into a string. At Shopify’s request volume, even a few microseconds of unnecessary overhead per render call adds up quickly.

Why Allocations Are the Right Thing to Measure

The 61% allocation reduction is the more interesting figure. Parse and render throughput depends on many variables, including CPU frequency and contention from concurrent requests. Allocation counts are more deterministic and point directly at root causes.

MRI Ruby (the standard C implementation) uses a generational mark-and-sweep garbage collector. Short-lived objects that do not survive the first GC cycle are cheap, but they are not free. Each allocation burns time in the allocator, and a high allocation rate means the minor GC fires more frequently. In a web server handling concurrent requests, GC pauses are shared across all threads; one allocation-heavy request can delay responses for unrelated work.

More specifically, when you allocate many small String objects during a hot render path, you are generating work that the GC has to trace and collect before it can give that memory back. Template engines are particularly prone to this because they fundamentally exist to concatenate strings, and naive string handling allocates a new object at every step.
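That relationship between allocation rate and minor-GC frequency can be observed directly with the stdlib. A minimal sketch (exact counts vary by Ruby version and heap tuning):

```ruby
# Sketch: watch the minor GC fire as short-lived strings pile up.
# GC.stat(:minor_gc_count) reports how many minor collections have run.
before = GC.stat(:minor_gc_count)

500_000.times do
  "product title".dup   # a short-lived String, garbage by the next line
end

minor_gcs = GC.stat(:minor_gc_count) - before
puts "minor GC ran #{minor_gcs} times for 500k throwaway strings"
```

Every one of those collections is a pause shared by the whole process, which is why cutting the allocation count matters beyond the raw allocator cost.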

The String Problem

The most immediate allocation pressure in a Ruby template engine comes from string literals in the source code itself. Without # frozen_string_literal: true, every string literal in a Ruby file allocates a new String object each time that line is executed. Tag names like "if", "for", "unless"; whitespace like " "; output separators: all of these become heap objects on every parse call.

Adding the frozen string literal magic comment to all source files makes those literals compile-time constants. The object is allocated once at load time, and every subsequent reference to it is a pointer to the same immortal object. For a library like Liquid, which parses those tag name strings on every template parse, the savings compound across thousands of requests.

# Before: every call to this method allocates a new String object
def tag_name
  "if"
end

# After: with `# frozen_string_literal: true` at the top of the file,
# "if" is a single frozen object, allocated once at load time
def tag_name
  "if"
end

Frozen strings also have a secondary effect: because the receiver cannot mutate them, the interpreter and library code can skip defensive duplication when passing them around. It is a low-effort change with a surprisingly wide impact.
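The deduplication itself is easy to verify in MRI. A small sketch, assuming a file without the magic comment (`.freeze` on a literal has been interned this way since Ruby 2.1, and the magic comment applies the same treatment file-wide):

```ruby
# Without freezing: each evaluation of the literal is a new object.
def plain_tag
  "if"
end

# With freezing: the literal is interned to one immortal object.
def frozen_tag
  "if".freeze
end

puts plain_tag.equal?(plain_tag)    # two calls, two distinct objects
puts frozen_tag.equal?(frozen_tag)  # two calls, one shared object
```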

Output Buffers and the Cost of String Concatenation

Another category of string allocation comes from how render output is accumulated. The naive approach is to return a String from each node’s render method and concatenate them as you walk the tree:

def render(context)
  output = ""
  nodelist.each do |node|
    output = output + node.render(context)  # new String on every iteration
  end
  output
end

This is O(n²) in total bytes copied. Each + operation allocates a new String large enough to hold both operands, copies both into it, and discards the old left-hand side. A template with 100 nodes allocates roughly 100 intermediate strings, and each later copy drags the entire accumulated output along with it.

The fix is to pass a mutable output buffer down through the render tree:

def render_to_output_buffer(context, output)
  nodelist.each do |node|
    node.render_to_output_buffer(context, output)  # mutates in place
  end
  output
end

Using String#<< (in-place append) instead of + means the output grows in place. The allocator still resizes the underlying buffer periodically, but no new String object is created and the accumulated data is not copied on every step. This change alone tends to produce the largest single reduction in allocation counts for template engines with any nontrivial template size.
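The allocation difference is measurable with nothing but the stdlib. A sketch using `GC.stat(:total_allocated_objects)` (a hypothetical harness, not Liquid's benchmark suite):

```ruby
# Count objects allocated by a block via GC.stat's monotonic counter.
def allocations
  GC.disable                                  # keep the count deterministic
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

nodes = Array.new(100) { |i| "node#{i} " }    # stand-in for rendered nodes

plus   = allocations { nodes.reduce(+"") { |out, s| out + s } }  # new String per step
append = allocations { out = +""; nodes.each { |s| out << s }; out }

puts "with +:  #{plus} allocations"
puts "with <<: #{append} allocations"
```

On MRI the `+` version allocates one intermediate String per node, while the `<<` version allocates essentially one buffer total.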

Object Pooling for the Context and Strainer

Beyond strings, the two most expensive objects to construct per-render are Liquid::Context and Liquid::Strainer. The Context holds the variable scope stack, interrupt state, and various bookkeeping data. The Strainer wraps the filter dispatch mechanism, and its construction historically involved Ruby’s Class.new to build a custom anonymous class per render, which is expensive.

Shopify introduced Liquid::StrainerFactory to address this: rather than constructing a new strainer class on each render, a factory caches strainer classes keyed on the filter set. Combined with object pooling at the Context level (pre-allocating a pool of Context instances, resetting them between renders instead of creating fresh ones), the per-render object count drops substantially.
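Both patterns can be sketched in a few lines of plain Ruby. These are hypothetical classes (`StrainerCache`, `ContextPool`) for illustration, not Liquid's actual implementation:

```ruby
# StrainerFactory-style caching: build the anonymous class once per
# distinct filter set and reuse it, instead of Class.new per render.
class StrainerCache
  @classes = {}

  def self.for(filter_modules)
    @classes[filter_modules] ||= Class.new do
      filter_modules.each { |mod| include mod }
    end
  end
end

# Context-style pooling: pre-allocate, then reset and reuse between
# renders instead of constructing fresh objects.
class ContextPool
  def initialize(size)
    @pool = Array.new(size) { {} }   # a Hash stands in for Liquid::Context
  end

  def checkout
    @pool.pop || {}                  # grow lazily if the pool runs dry
  end

  def checkin(context)
    context.clear                    # reset state between renders
    @pool.push(context)
  end
end
```

The cache key is the filter set itself, so two renders with the same filters share one class; the pool trades a small amount of resident memory for zero per-render construction cost.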

This pattern is common in high-throughput Java and Go code but less common in Ruby, where the GC has historically been assumed to make allocation cheap enough to not worry about. At Shopify’s scale, that assumption breaks.

Cached Expression Parsing

The third major category is deferred but repeated work. In older versions of Liquid, variable expressions like product.title | upcase | truncate: 50 were tokenized at parse time but not fully converted into reusable objects. Parts of the expression were re-parsed on each render call.

Moving to cached Expression objects means the parsing work happens once when Liquid::Template.parse(source) is called, and each subsequent render call simply evaluates the already-parsed structure. This is the correct model for a template engine designed around parse-once, render-many semantics, but getting there requires careful separation of parsing state from rendering state.
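A toy version of that parse-once, evaluate-many split might look like this. `CachedExpression` is a hypothetical class that ignores filter arguments (the real parser handles `truncate: 50` and far more):

```ruby
# Sketch: compile "product.title | upcase" into a reusable structure at
# parse time; renders only walk the precomputed path and filter list.
class CachedExpression
  FILTERS = { "upcase" => :upcase, "downcase" => :downcase }.freeze

  def initialize(source)
    parts    = source.split("|").map(&:strip)  # done once, at parse time
    @path    = parts.shift.split(".")
    @filters = parts.map { |name| FILTERS.fetch(name) }
  end

  def evaluate(context)                         # done on every render
    value = @path.reduce(context) { |obj, key| obj[key] }
    @filters.reduce(value) { |v, f| v.public_send(f) }
  end
end

expr = CachedExpression.new("product.title | upcase")
expr.evaluate({ "product" => { "title" => "Red Shirt" } })  # => "RED SHIRT"
```

All of the string splitting happens in `initialize`; `evaluate` allocates nothing beyond the filter results, which is exactly the property you want when one parsed template serves thousands of renders.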

The C Extension Layer

For workloads where even the optimized pure-Ruby path is not fast enough, Shopify maintains liquid-c, a C extension that reimplements the Tokenizer, Expression evaluation, and parts of the rendering pipeline. The pure-Ruby gains matter because not every Liquid user can or will pull in a native extension, and they set the baseline against which the C extension is measured.

The interesting thing about the relationship between the two is that the optimization techniques carry over. Object pooling at the Ruby level helps whether or not the C extension is active, because the C code still calls back into Ruby for Drop method calls and filter dispatch.

The Broader Point

What makes this result worth paying attention to is not Liquid specifically. These techniques (frozen string literals, buffer-based output accumulation, object pooling for expensive-to-construct objects, cached parsing results) apply to any Ruby library that lives on a hot path. The Liquid team measured, found the bottlenecks with memory_profiler and benchmark-ips, and applied targeted fixes.

Most Ruby performance work I see focuses on algorithmic changes or database query reduction. Allocation pressure at the library level is less frequently audited, partly because the tools to measure it are less integrated into the standard workflow. The Liquid benchmark suite, living in the performance/ directory of the repo, is a model worth copying: representative templates, realistic context data, and output in both iterations-per-second and total allocation counts.

A 61% reduction in allocations without changing what the library does is the kind of improvement that comes from measuring the right thing and then actually fixing it.
