
How Liquid Got 53% Faster: Allocation Reduction in a Sandboxed Ruby Template Engine

Source: simonwillison

Shopify’s Liquid template engine recently landed a significant optimization: 53% faster parse and render, with 61% fewer object allocations. The allocation number is the one that explains the rest. In Ruby, a 61% reduction in allocations translating to a 53% speed improvement tells you that GC pressure was real but not the only bottleneck, that the remaining execution is also faster, and that the optimization was thorough rather than narrowly targeted at one hot path.

To understand why allocation count is the right metric to report here, it helps to understand what Liquid is and why it is architecturally different from the other Ruby templating options.

Why Liquid Cannot Compile to Ruby

ERB, Haml, and Slim all work by compiling template source text into Ruby code that is then evaluated. ERB wraps template fragments in string concatenation and embeds Ruby expressions directly:

# ERB compiles <%= product.title %> approximately to:
_buf << product.title.to_s

After compilation, rendering is native Ruby execution. This makes ERB fast once compiled, but it also means the template has full access to the Ruby runtime. You cannot safely give ERB templates to untrusted users.

Liquid was built for Shopify precisely so that merchants could customize their storefront templates without being able to execute arbitrary Ruby. The engine is a sandbox: variable access is restricted to an explicitly provided context, filters are allowlisted, and there is no path to Ruby’s object model from within a template.

Enforcing that sandbox requires interpreting templates rather than compiling them to Ruby. Every {{ product.title }} and {% for variant in product.variants %} is represented as a Ruby object in a parse tree, and rendering means walking that tree with a controlled execution environment. This is the source of Liquid’s inherent allocation overhead. Compiled template engines escape it by generating native Ruby code; sandboxed interpreted engines cannot, because the moment you compile to Ruby you lose the ability to restrict what the template can do.
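The tree-walking model can be sketched in a few lines. The class names below are illustrative, not Liquid's real internals, but they show the key property: every node renders against an explicitly provided context, and there is no route from a template to arbitrary Ruby.

```ruby
# Hypothetical node classes, sketching the interpreted-render model.
class TextNode
  def initialize(text)
    @text = text
  end

  def render(_context)
    @text
  end
end

class VariableNode
  def initialize(*path)
    @path = path
  end

  # Lookup is confined to the provided context Hash: a node cannot
  # reach Ruby's object model, only the data it was handed.
  def render(context)
    @path.reduce(context) { |scope, key| scope.is_a?(Hash) ? scope[key] : nil }.to_s
  end
end

class Template
  def initialize(nodes)
    @nodes = nodes
  end

  def render(context)
    @nodes.map { |node| node.render(context) }.join
  end
end

template = Template.new([
  TextNode.new("Title: "),
  VariableNode.new("product", "title")
])
template.render("product" => { "title" => "Ringed Planet Mug" })
# => "Title: Ringed Planet Mug"
```

Note that the sandbox is a property of the data flow, not of any access check: the only names a template can resolve are keys in the context it was given.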

Where Allocations Accumulate

During render, allocation pressure builds in several predictable ways.

The central runtime object is Liquid::Context, which maintains a stack of variable scopes as an Array of Hashes. Every {% for %} loop iteration pushes and pops a scope frame. A loop over 50 product variants allocates and discards 50 Hash objects, each subject to garbage collection. For a product page iterating over variants, images, and metafields, scope frame allocation accumulates quickly.
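A toy version of that scope stack (a hypothetical MiniContext, not Liquid's actual Context class) makes the per-iteration Hash visible:

```ruby
# Simplified sketch of the scope-stack pattern: each loop iteration
# pushes a fresh Hash that becomes garbage the moment it is popped.
class MiniContext
  def initialize(assigns)
    @scopes = [assigns] # innermost scope first
  end

  def push_scope(scope = {})
    @scopes.unshift(scope)
  end

  def pop_scope
    @scopes.shift
  end

  # Lookup walks from the innermost scope outward.
  def [](key)
    scope = @scopes.find { |s| s.key?(key) }
    scope && scope[key]
  end
end

ctx = MiniContext.new("product" => { "variants" => %w[small medium large] })
rendered = []
ctx["product"]["variants"].each do |variant|
  ctx.push_scope("variant" => variant) # one short-lived Hash per iteration
  rendered << ctx["variant"]
  ctx.pop_scope
end
rendered # => ["small", "medium", "large"]
```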

Variable lookup adds more. A reference like product.variants.first.price in a template historically triggered a render-time split('.') call, producing a temporary Array of path segments on every access. In a template referencing product variables dozens of times, this is a predictable and eliminable source of short-lived objects.

Filter evaluation compounds it further. A pipeline like {{ product.title | upcase | truncate: 50 }} routes through Liquid::Strainer, the filter sandbox. Depending on how strainer instantiation is handled, each filter call can trigger per-call allocation overhead that adds up across a page with many output variables.

Output accumulation is subtler. BlockBody#render builds output by appending to a String buffer. The critical distinction is whether appending uses << (in-place mutation, no allocation) or += (creates a new String on each concatenation). Any code path using += for output is a quadratic allocation pattern, one that gets worse as templates grow longer.
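The two append styles can be compared directly by counting allocations. The helper below is illustrative, using GC.stat's cumulative allocation counter:

```ruby
# Count objects allocated while a block runs.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

parts = ["chunk "] * 1_000

# += rebinds the variable to a brand-new String on every append.
plus = allocated_objects do
  buf = +""
  parts.each { |p| buf += p }
end

# << mutates one buffer in place; no per-append String is created.
shovel = allocated_objects do
  buf = +""
  parts.each { |p| buf << p }
end

plus > shovel # the += path allocates roughly one extra String per append
```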

Why the GC Is the Right Starting Point

In C or Rust, profiling CPU cycles is the natural starting point for performance work. In Ruby, the GC introduces a layer between allocation count and wall-clock time that makes allocation count the better primary metric.

Ruby’s GC is a generational, incremental mark-and-sweep collector. Minor GC runs trigger when the young generation fills, so their frequency scales with the allocation rate. Each run pauses Ruby execution. In a web server handling concurrent requests, GC pauses averaging a few milliseconds per request accumulate into visible latency under load. The relationship between allocations and response time is not theoretical; it shows up in production p99 latency.
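Those counters are directly observable. GC.stat exposes the minor GC count, and an allocation-heavy loop pushes it up (the loop body here is an arbitrary stand-in for allocation-heavy render work):

```ruby
# Minor GC frequency tracks allocation volume.
before = GC.stat(:minor_gc_count)

# Allocate enough short-lived young objects to trigger minor GC runs.
500_000.times { Object.new }

after = GC.stat(:minor_gc_count)
after > before # allocation-heavy code drove the minor GC counter up
```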

This is also why allocation reduction compounds with YJIT, the just-in-time compiler Shopify built and shipped in Ruby 3.1. YJIT eliminates method dispatch overhead and inlines hot paths, but it cannot eliminate GC pauses caused by high allocation rates. A codebase that generates fewer objects benefits from YJIT more cleanly, because the CPU cycles freed by the JIT are not interrupted by GC running to collect short-lived objects from the previous request.

The Techniques Behind the Numbers

A 61% allocation reduction in a mature, production codebase comes from a combination of targeted changes, not a single architectural shift. The standard toolkit for this kind of work in Ruby includes several high-yield techniques.

Pre-splitting variable lookup paths. Storing VariableLookup path segments as a frozen Array at parse time, rather than calling split('.') at render time, eliminates allocations on every variable access. Since parse results are cached and reused across many renders, this is a one-time cost traded for zero render-time allocation on every variable reference.
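The trade can be sketched with two hypothetical lookup classes (not Liquid's real VariableLookup), one splitting at render time and one at parse time:

```ruby
# Splits the path on every evaluate call: one temporary Array per render.
class SlowLookup
  def initialize(markup)
    @markup = markup
  end

  def evaluate(data)
    @markup.split(".").reduce(data) { |scope, key| scope[key] }
  end
end

# Splits once at parse time and freezes the result: zero render-time
# splitting, and the frozen segments are safely shared across renders.
class FastLookup
  def initialize(markup)
    @path = markup.split(".").map(&:freeze).freeze
  end

  def evaluate(data)
    @path.reduce(data) { |scope, key| scope[key] }
  end
end

data = { "product" => { "title" => "Mug" } }
FastLookup.new("product.title").evaluate(data) # => "Mug"
```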

Frozen string literals. Adding # frozen_string_literal: true to source files makes all string literals return the same frozen, shared object instead of allocating a new mutable String on each call. For a template engine full of string-keyed lookups, filter names, and tag names, this reduces a large class of allocations to zero with no behavioral change.
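The sharing effect is easy to observe. The pragma applies per file, so this sketch uses String#freeze on a literal instead, which Ruby optimizes to return the same shared object, the same guarantee the pragma provides file-wide:

```ruby
# Without the pragma (or .freeze), each call allocates a new String.
def literal_without_pragma
  "upcase"
end

# A frozen literal is deduplicated: every call returns one shared object.
def literal_frozen
  "upcase".freeze
end

a = literal_frozen
b = literal_frozen
a.equal?(b)                                            # => true: one shared object
literal_without_pragma.equal?(literal_without_pragma)  # => false: two allocations
```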

Scope frame reuse. Rather than allocating a new Hash for each loop iteration’s scope frame, a pool of cleared and reused Hash objects eliminates the GC cost of iteration entirely. Ruby does not have built-in object pooling, but an Array-backed free list works well for bounded-size scope frames.
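A minimal free list along those lines, as a sketch rather than Liquid's actual implementation:

```ruby
# Array-backed pool of reusable scope Hashes: frames are cleared and
# recycled instead of allocated per iteration and swept by the GC.
class ScopePool
  def initialize
    @free = []
  end

  def checkout
    @free.pop || {} # reuse a cleared Hash when one is available
  end

  def checkin(scope)
    scope.clear     # drop contents but keep the Hash's internal capacity
    @free << scope
  end
end

pool = ScopePool.new
first = pool.checkout
first["variant"] = "small"
pool.checkin(first)

second = pool.checkout # same Hash object, recycled and empty
first.equal?(second)   # => true
```

Clearing rather than discarding also preserves the Hash's grown internal table, so a frame that once held many keys does not pay the growth cost again on reuse.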

Output buffer discipline. Auditing all render paths to ensure output is accumulated with << rather than +=, and that a single shared buffer is threaded through recursive render calls rather than allocated per-node.

String deduplication via String#-@. Introduced in Ruby 2.3 and deduplicating through the global interning table since Ruby 2.5, -"string" returns a frozen, shared string. Applied to parsed identifiers, variable names, filter names, and tag names, this reduces memory footprint without changing behavior, which also lowers GC pressure across long-running processes.
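A quick illustration of the interning behavior; the strings are built dynamically here so that literal sharing does not mask the effect:

```ruby
# -@ interns a string: equal contents map to one frozen object in the
# global deduplication table, however the string was constructed.
name_a = -("title" + "")
name_b = -("ti" + "tle")

name_a.equal?(name_b) # => true: same interned object
name_a.frozen?        # => true
```

For a parser, this means the thousandth template referencing product.title holds references to the same few frozen identifier strings as the first, instead of a thousand private copies.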

None of these techniques are exotic. They are the standard output of an allocation profiling pass using tools like vernier (a sampling profiler with allocation call-graph support developed at Shopify) or ObjectSpace.trace_object_allocations. The discipline is in identifying the highest-count allocation sites and addressing them systematically rather than guessing at bottlenecks.
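A minimal trace with ObjectSpace shows how allocations get attributed to source lines; render_like_work is a hypothetical stand-in for the code under test:

```ruby
require "objspace"

# Stand-in workload: each interpolation allocates a new String.
def render_like_work
  10.times.map { |i| "item-#{i}" }
end

ObjectSpace.trace_object_allocations_start
result = render_like_work
ObjectSpace.trace_object_allocations_stop

# While trace data is retained, every traced object can be asked
# where it was allocated.
file = ObjectSpace.allocation_sourcefile(result.first)
line = ObjectSpace.allocation_sourceline(result.first)
# file/line now point at the exact line that allocated the String
```

Aggregating those file/line pairs over a realistic render is what produces the ranked list of allocation sites that a pass like this works down.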

The liquid-c Context

It is worth noting that the liquid-c gem, a C extension reimplementing Liquid’s lexer and parser, already achieves 3 to 8x speedups for the parse phase. The pure-Ruby improvements described here almost certainly target the render phase, where liquid-c does not apply. The two approaches are complementary: liquid-c handles parse-time overhead for applications that can install native extensions, while pure-Ruby render optimizations benefit everyone, including applications running in environments where native extensions are unavailable or impractical.

Scale Makes Fundamentals Matter

Shopify runs millions of storefronts. A 53% reduction in template render time is not a local optimization; it is a reduction in the compute required to serve every product page, collection page, and checkout flow across the platform. At that scale, a focused allocation-reduction pass pays for itself rapidly in infrastructure costs.

For Ruby library and application developers, the Liquid approach is worth borrowing directly: start performance work with an allocations profile rather than a CPU profile, target the highest-count allocations first, and benchmark against workloads that reflect real usage rather than synthetic microbenchmarks. The specific numbers Shopify landed, 53% faster and 61% fewer allocations, are the kind of result you get when the profiling is honest and the changes address root causes rather than symptoms.
