Sparse Strips: How a Master's Thesis Brings GPU Thinking to CPU 2D Rendering

Source: lobsters

CPU 2D rendering has a long history of being the unglamorous part of the graphics stack. GPU rasterization gets the attention, the dedicated hardware, and the research spotlight. The CPU path gets maintained, slowly optimized, and occasionally rewritten when a new SIMD extension arrives. Most of the fundamental algorithmic ideas in CPU vector graphics trace back to work done in the 1980s and 1990s.

A 2025 master’s thesis from ETH Zurich makes a case that the algorithmic foundations deserve revisiting. The thesis introduces sparse strips, a rendering approach that decomposes the image plane differently from the classical scanline algorithm and exploits SIMD parallelism more aggressively than most existing CPU renderers do. The result is a rendering model that feels, structurally, like a port of the ideas behind GPU tile-based deferred rendering to a CPU pipeline.

The Problem with Scanline Rendering

Classical scanline rasterization works roughly as follows. For each shape in the scene, you convert its boundary into a sorted list of edge crossings. You then sweep downward through the image one row at a time. At each row, you consult the active edge table, compute which spans of pixels are covered, and fill them in. The algorithm has good worst-case characteristics and fits the sequential memory access patterns of older CPUs well.
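To make the sweep concrete, here is a minimal sketch of the per-row step in Rust. It is an illustration rather than any particular library's implementation: the `Edge` type and the `scanline_spans` function are hypothetical names, the non-zero fill rule is assumed, and the active set is recomputed for each row instead of being maintained incrementally as a real active edge table would be.

```rust
/// A non-horizontal polygon edge, stored so that y0 < y1.
/// `dir` is +1 if the original edge pointed downward, -1 if upward.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy)]
struct Edge {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
    dir: i32,
}

/// Returns the filled x-spans for the scanline sampled at height `y`,
/// using the non-zero winding rule.
fn scanline_spans(edges: &[Edge], y: f32) -> Vec<(f32, f32)> {
    // Collect (x, direction) crossings for every edge active on this row.
    let mut crossings: Vec<(f32, i32)> = edges
        .iter()
        .filter(|e| e.y0 <= y && y < e.y1)
        .map(|e| {
            let t = (y - e.y0) / (e.y1 - e.y0);
            (e.x0 + t * (e.x1 - e.x0), e.dir)
        })
        .collect();

    // Sort crossings left to right.
    crossings.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Walk the crossings, emitting a span whenever the winding number
    // leaves zero and later returns to it.
    let mut spans = Vec::new();
    let mut winding = 0;
    let mut start = 0.0;
    for (x, dir) in crossings {
        if winding == 0 {
            start = x;
        }
        winding += dir;
        if winding == 0 {
            spans.push((start, x));
        }
    }
    spans
}
```

Every row repeats the same filter, interpolate, and sort steps against the same edges. That repeated per-row work is what the rest of the article is concerned with.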

The trouble is that it maps badly onto modern hardware. Contemporary CPUs have wide SIMD registers (256 bits with AVX2, 512 bits with AVX-512) that can process 8 or 16 float32 values in a single instruction. Scanline rendering processes one row at a time, accumulating coverage along a single horizontal axis. The parallelism available in the vertical direction goes unused. When shapes are large and vertically deep, you spend a lot of time re-evaluating the same edge geometry against row after row with no opportunity to batch that work across rows.

Memory access patterns are also a concern. Modern renderers compete for cache with everything else running on the system. A scanline sweep that maintains a large active edge table and touches many different regions of the framebuffer in an unpredictable order produces pressure on L1 and L2 that compounds as scene complexity grows.

Libraries like Skia and Cairo have addressed these problems incrementally, adding SIMD-accelerated span filling, alpha pipeline vectorization, and tile-level culling. The Blend2D library goes further with JIT-compiled pixel pipelines that match the actual pixel format and blending mode at runtime. These are substantial engineering achievements, but they are optimizations within the scanline model rather than departures from it.

What Strips Are, and Why Sparsity Matters

The core idea in the thesis is to divide the image into horizontal strips, thin bands of pixels spanning the full width of the viewport and some fixed number of scanlines tall. Rather than processing one row at a time, the renderer processes one strip at a time. Within a strip, SIMD instructions can work across multiple rows simultaneously, filling the register width in both the horizontal and vertical dimensions.

The “sparse” part of the name refers to how the renderer handles strips that contain no geometry. In a complex document or UI scene, large portions of the image are often empty or covered by a single solid background color. A strip that falls entirely outside the bounding boxes of all active paths can be skipped with a single bounds check. Strips that are fully interior to a large filled shape can be handled with a fast solid-fill path. Only strips that contain actual path boundaries, where the interesting anti-aliasing work happens, require the full coverage computation.
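The three-way split described above can be sketched as follows. This is a deliberately simplified illustration, not the thesis's code: shapes are restricted to axis-aligned rectangles, all shapes are treated as one opaque fill, and the names `StripKind`, `Rect`, `classify_strip`, and the strip height of 8 are assumptions made for the example.

```rust
/// Assumed strip height in scanlines (illustrative choice).
const STRIP_HEIGHT: u32 = 8;

#[derive(Debug, PartialEq, Clone, Copy)]
enum StripKind {
    /// No shape touches the strip: skip it.
    Empty,
    /// Some shape covers every pixel of the strip: fast solid fill.
    Solid,
    /// A shape boundary passes through the strip: full coverage computation.
    Boundary,
}

/// Filled axis-aligned rectangle, half-open: [x0, x1) by [y0, y1).
struct Rect {
    x0: u32,
    y0: u32,
    x1: u32,
    y1: u32,
}

fn classify_strip(strip: u32, viewport_width: u32, shapes: &[Rect]) -> StripKind {
    let top = strip * STRIP_HEIGHT;
    let bottom = top + STRIP_HEIGHT;
    let mut kind = StripKind::Empty;
    for r in shapes {
        // Ignore degenerate rectangles and ones that miss the strip entirely.
        let degenerate = r.x0 >= r.x1 || r.y0 >= r.y1;
        let misses = r.y1 <= top || r.y0 >= bottom || r.x0 >= viewport_width;
        if degenerate || misses {
            continue;
        }
        let covers_rows = r.y0 <= top && r.y1 >= bottom;
        let covers_width = r.x0 == 0 && r.x1 >= viewport_width;
        if covers_rows && covers_width {
            return StripKind::Solid;
        }
        kind = StripKind::Boundary;
    }
    kind
}
```

For general paths the `Solid` test would need winding information rather than a bounding-box comparison, but the control flow stays the same: cheap tests first, expensive coverage work only for `Boundary` strips.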

This decomposition mirrors what GPU tile-based deferred renderers do at the hardware level. Mobile GPUs from ARM and Imagination Technologies have long used tile-based architectures where the screen is divided into tiles (typically 16x16 or 32x32 pixels), geometry is binned into tiles during a first pass, and then tiles are rendered independently with full locality. The CPU sparse strips approach is not trying to replicate tile-based rendering exactly, but it is importing the central intuition: spatial decomposition lets you exploit locality, skip empty regions cheaply, and parallelize work at a coarser granularity than individual pixels.
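The binning pass translates to strips in a straightforward way. The sketch below is an assumption about how such a pass could look on the CPU, not a description of any GPU driver or of the thesis's code; `Segment`, `bin_segments`, and the strip height are hypothetical. Each segment is recorded in every strip its vertical extent touches, so that later stages only visit geometry relevant to the strip they are rendering.

```rust
/// Assumed strip height in scanlines (illustrative choice).
const STRIP_HEIGHT: f32 = 8.0;

type Point = (f32, f32); // (x, y)

/// A flattened path segment (hypothetical type, for illustration only).
struct Segment {
    p0: Point,
    p1: Point,
}

/// For each strip, returns the indices of the segments whose vertical
/// extent touches that strip. Segments fully outside the image are dropped.
fn bin_segments(segments: &[Segment], strip_count: usize) -> Vec<Vec<usize>> {
    let mut bins = vec![Vec::new(); strip_count];
    if strip_count == 0 {
        return bins;
    }
    let image_height = strip_count as f32 * STRIP_HEIGHT;
    for (i, s) in segments.iter().enumerate() {
        let lo = s.p0.1.min(s.p1.1);
        let hi = s.p0.1.max(s.p1.1);
        if hi < 0.0 || lo >= image_height {
            continue; // entirely above or below the image
        }
        // Clamp to the image and convert the y-range to strip indices.
        let first = (lo.max(0.0) / STRIP_HEIGHT) as usize;
        let last = ((hi / STRIP_HEIGHT) as usize).min(strip_count - 1);
        for bin in &mut bins[first..=last] {
            bin.push(i);
        }
    }
    bins
}
```

A strip whose bin comes back empty is a candidate for the cheap skip path. The binning is conservative: a segment that ends exactly on a strip's top edge is still recorded in that strip.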

The Vello renderer from the Linebender project takes this idea to its logical extreme on the GPU side. Vello is a compute-shader based 2D renderer that runs an explicit pipeline of stages: flattening Bezier curves to polylines, coarse binning of line segments into tiles, and fine rasterization within each tile using atomics or prefix sums to accumulate coverage. The result is a GPU renderer that scales to scenes with millions of paths. Sparse strips is essentially asking whether some of the same structural clarity can work on a CPU.

SIMD in the Inner Loop

SIMD is where the performance argument becomes concrete. Consider a strip that is 8 scanlines tall. An AVX2 register holds 8 float32 values. If you orient your data so that the 8 elements of a register correspond to the 8 rows in your strip, you can compute signed area contributions from a line segment crossing that strip and accumulate coverage for all 8 rows simultaneously in a single pass over the segment list.

The key primitive is the winding number contribution of a line segment to a horizontal strip. For a segment that enters the strip from one side and exits from another, you need to determine, for each row in the strip, at what x-coordinate the segment crosses that row. This is a straightforward linear interpolation that maps cleanly onto SIMD arithmetic. Computing 8 crossing x-positions with 8-wide SIMD takes roughly the same number of instructions as computing a single crossing position would in scalar code.
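As a rough sketch of that interpolation, the function below computes the crossing x-position for all 8 rows of a strip at once. It uses a plain fixed-size array rather than explicit intrinsics, on the assumption that the compiler can vectorize the loop; a hand-written SIMD version would replace the loop with vector instructions. The name `row_crossings`, the 8-lane width, and the choice to sample at row centers are illustrative assumptions, not details taken from the thesis.

```rust
/// One lane per scanline in the strip (matches 8 x f32 in a 256-bit register).
const LANES: usize = 8;

/// Returns the x-coordinate at which the line through (x0, y0) and (x1, y1)
/// crosses the center of each of the 8 rows in the strip starting at
/// `strip_top`. The caller must ensure y0 != y1 (no horizontal segments).
fn row_crossings(x0: f32, y0: f32, x1: f32, y1: f32, strip_top: f32) -> [f32; LANES] {
    // dx/dy is shared by all rows, so it is computed once.
    let inv_slope = (x1 - x0) / (y1 - y0);
    let mut xs = [0.0f32; LANES];
    for (lane, x) in xs.iter_mut().enumerate() {
        // Sample at the vertical center of each row.
        let row_y = strip_top + lane as f32 + 0.5;
        *x = x0 + (row_y - y0) * inv_slope;
    }
    xs
}
```

The loop body is one subtraction, one multiplication, and one addition per lane, with no dependency between lanes. That independence is what makes the 8-wide version cost about the same as the scalar one.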

Coverage accumulation can be done with the same register. For non-zero or even-odd fill rules, you accumulate signed crossing counts per row across the width of the strip as you walk each path’s segments. The result, after a horizontal prefix sum across the strip width, is a coverage mask that can drive the alpha blending of the final pixel colors.
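The accumulate-then-prefix-sum step can be sketched as follows. To keep the example short it produces a binary inside/outside mask rather than fractional anti-aliased coverage, and it assumes the signed crossing counts have already been written into a per-column array of deltas. `FillRule`, `coverage_mask`, and the layout of `deltas` are hypothetical choices for this illustration.

```rust
/// Rows per strip (illustrative choice).
const ROWS: usize = 8;

#[derive(Clone, Copy)]
enum FillRule {
    NonZero,
    EvenOdd,
}

/// `deltas[x][row]` holds the signed winding change that takes effect at
/// pixel column `x` on the given row. A running sum from left to right
/// turns these deltas into a winding number per pixel, and the fill rule
/// turns the winding number into an inside/outside decision.
fn coverage_mask(deltas: &[[i32; ROWS]], rule: FillRule) -> Vec<[bool; ROWS]> {
    // One running winding number per row, updated column by column.
    let mut winding = [0i32; ROWS];
    deltas
        .iter()
        .map(|column| {
            let mut inside = [false; ROWS];
            for row in 0..ROWS {
                winding[row] += column[row];
                inside[row] = match rule {
                    FillRule::NonZero => winding[row] != 0,
                    FillRule::EvenOdd => winding[row] % 2 != 0,
                };
            }
            inside
        })
        .collect()
}
```

The inner loop over `row` touches all 8 rows with the same operation, so the running sum is a single vector add per column. With deltas of +1, +1, -1, -1 at columns 1 through 4, the winding numbers are 0, 1, 2, 1, 0, 0: the non-zero rule fills columns 1 to 3, while the even-odd rule leaves column 2 unfilled.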

This is not entirely unlike what fontdue, the pure-Rust font rasterizer, does internally. Fontdue converts glyph outlines to a sequence of line segments and then integrates coverage per row using SIMD to accelerate the inner loop. The sparse strips approach generalizes this to full vector graphics scenes, not just glyph rasterization, and makes the strip structure explicit as the organizing principle of the pipeline.

The ETH Zurich Context

The ETH Zurich PLF group (Programming Languages and Frameworks) has been involved in document rendering research connected to Typst, the modern typesetting system. Typst is written in Rust and aims to provide LaTeX-quality layout with substantially faster compilation and rendering. The thesis fits naturally into this context: a typesetting engine that can lay out complex documents in milliseconds needs a rendering backend that can keep pace.

The relevance to a tool like Typst is worth unpacking. Typst documents contain text, vector graphics, and images. Text rendering in particular involves thousands of small glyph outlines rendered at precise subpixel positions. A renderer that handles this workload well needs to be fast on many small, spatially distributed paths, not just a few large filled shapes. The sparse strips approach skips strips with no geometry cheaply and processes covered strips with SIMD. That suits this workload profile better than a general-purpose scanline renderer, which pays overhead proportional to path count regardless of coverage density.

Where CPU Rendering Still Wins

There is a reasonable question about why CPU rendering warrants this level of investment at all, given how capable GPU renderers have become. The answer is that the GPU is not always available, or not always appropriate.

Headless rendering on servers, where there may be no GPU or only a generic software rasterizer, still happens constantly. PDF generation, screenshot services, document-to-image pipelines: these workloads run on CPU because spinning up a GPU context for short-lived rendering tasks carries overhead that dominates for small documents. A fast CPU renderer in this context is not competing with the latest GPU hardware; it is competing with software fallbacks that have not been touched in a decade.

GPU rendering also has bus transfer costs. If you are rendering to a surface that will ultimately be consumed by the CPU (embedded in a PDF, serialized to PNG, passed to an image processing pipeline), any GPU-rendered result needs to be read back, which is expensive. Rendering directly into CPU memory with a fast CPU renderer avoids that round-trip entirely.

There is also correctness sensitivity in certain domains. PDF and SVG rendering have requirements around exact antialiasing and sub-pixel accuracy. A CPU renderer whose behavior is deterministic and reproducible across hardware is easier to validate for these requirements than a GPU renderer whose output depends on driver behavior and floating-point rounding in shaders.

Where This Sits in the Ecosystem

The landscape of 2D CPU renderers in Rust has developed quickly over the past few years. tiny-skia provides a pure-Rust port of Skia’s core rasterizer with SIMD-accelerated pixel operations. The kurbo library provides Bezier math primitives used throughout the Linebender ecosystem. peniko handles color and paint types. None of these, on its own, constitutes a complete high-performance CPU renderer along the lines of what the thesis describes.

The closest prior work in production use is probably the rendering path in resvg, which leans on tiny-skia for rasterization. It is solid but does not attempt the kind of strip-level SIMD exploitation that the thesis proposes.

The thesis positions sparse strips not as a final answer but as a proof of concept that the algorithmic direction is worth pursuing. The rendering quality requirements for real-world use (subpixel antialiasing for text, correct handling of self-intersecting paths, proper compositing for transparency groups) add substantial complexity beyond what a research prototype needs to handle. Getting from proof of concept to production-quality renderer involves significant engineering work on top of the core algorithm.

What Makes This Worth Paying Attention To

Most optimizations in CPU rendering are incremental. Use wider SIMD instructions where they fit, add a fast path for axis-aligned rectangles, tune the active edge table data structure. These are real gains, but they are gains within an existing framework.

The sparse strips approach is architecturally different. It reorganizes the rendering pipeline around a spatial decomposition that was not present before, and it does so in a way that opens up SIMD parallelism that scanline rendering structurally cannot exploit. That is a different kind of contribution.

The timing also matters. As Vello continues to mature as the GPU-side answer for high-performance 2D graphics in the Rust ecosystem, there is increasing pressure on the CPU-side story to keep pace. A CPU renderer that can handle document-scale workloads efficiently, without GPU setup overhead, would fill a genuine gap that tools like Typst currently work around by accepting that rendering is slower than it could be.

The thesis will not ship as a production renderer on its own. But it provides a concrete algorithmic foundation and performance analysis that someone building the next generation of CPU-side vector rendering could build from. In a field where the conventional wisdom has held for thirty years, that is more than most academic work manages to deliver.
