
DuckDB at Scale: Why Your Laptop's SSD Matters More Than Its RAM

Source: hackernews

The DuckDB team’s benchmark post demonstrates TPC-H at scale factor 100 (100GB of raw data) completing on a MacBook Air with 8GB of RAM. The result is genuine, and the engineering behind it is well-documented: vectorized columnar execution, grace hash joins for out-of-core processing, Parquet predicate pushdown, and Apple Silicon’s fast NVMe storage. The benchmark invites a generalization that does not hold uniformly, though: that RAM is no longer a constraint for analytical work on a laptop. RAM has become a secondary constraint rather than a primary one; three others have taken its place, and how they interact tells you precisely when a single node is enough and when it is not.

Memory Bandwidth: The Rate at Which CPU Touches Data

DuckDB’s vectorized execution processes data in batches of 2,048 column values, a size chosen so that a batch of 8-byte values occupies just 16 KiB and sits comfortably in per-core cache, far below the 8-30 MB of shared L3 on modern high-performance parts. Tight arithmetic operations, column scans, and filter evaluations execute against cache-resident data rather than against main memory. The compiler auto-vectorizes these loops into SIMD instructions: ARM NEON on Apple Silicon, AVX-512 on capable x86 cores. Multiple comparisons or additions execute per instruction cycle instead of one.
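The cache-residency claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming 8-byte values (BIGINT or DOUBLE columns) and the 2,048-value batch size above:

```python
# Why a 2,048-value vector is cache-resident: size per column vector,
# assuming 8-byte values (e.g. BIGINT or DOUBLE).
VECTOR_SIZE = 2048          # DuckDB's default batch size
BYTES_PER_VALUE = 8

vector_bytes = VECTOR_SIZE * BYTES_PER_VALUE
print(vector_bytes // 1024, "KiB per column vector")   # 16 KiB

# Even an operator touching a dozen columns at once stays far below
# typical per-core cache sizes, leaving room for hash tables and code.
working_set = 12 * vector_bytes
print(working_set // 1024, "KiB for 12 columns")        # 192 KiB
```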

This design traces back to MonetDB/X100, the 2005 CIDR paper by Peter Boncz, Marcin Zukowski, and Niels Nes that identified the central bottleneck in analytical database engines: a tuple-at-a-time engine triggers a round-trip to main memory at every operator boundary, stalling the CPU on memory latency rather than keeping it computing. Vectorized execution over cache-sized batches amortizes those round-trips across thousands of values. The CPU remains the bottleneck for cache-resident data; main memory only comes into play when loading new batches.

When data must stream from RAM into cache faster than it can be reused, memory bandwidth becomes the limiting factor. Apple Silicon’s M-series delivers roughly 100-120 GB/s between RAM and CPU. A comparable x86 laptop with DDR5 typically sustains 50-70 GB/s. That gap has a direct effect on column scan throughput and hash table operations, which are the inner loops of most analytical queries. The M4 MacBook Air’s memory bandwidth sits closer to entry-level server hardware than to a standard x86 laptop from three years ago, which is part of why the DuckDB benchmark lands where it does on that specific machine.
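A bandwidth-bound scan is simple to model: bytes moved divided by sustained bandwidth. The sketch below uses the article's ballpark GB/s figures, not measurements, and a hypothetical `scan_seconds` helper:

```python
# Rough scan-time model for a bandwidth-bound column scan: data streams
# from RAM to CPU at the machine's sustained memory bandwidth.
def scan_seconds(gb_scanned: float, bandwidth_gbs: float) -> float:
    return gb_scanned / bandwidth_gbs

# Scanning a 20 GB column projection on each class of machine:
m_series = scan_seconds(20, 110)   # ~110 GB/s Apple Silicon
ddr5_x86 = scan_seconds(20, 60)    # ~60 GB/s DDR5 x86 laptop
print(f"M-series: {m_series:.2f}s, x86 laptop: {ddr5_x86:.2f}s")
```

The roughly 2x gap compounds across every scan-heavy operator in a query plan, which is why the same query on the same data can feel like a different class of workload across the two machines.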

I/O Bandwidth: The Cost of Spilling

DuckDB’s out-of-core processing encompasses three spillable operators. The grace hash join partitions both the build and probe sides by hashing the join key, so matching rows land in the same partition on both sides; partition pairs are joined sequentially, keeping peak memory bounded. The partitioned hash aggregation flushes oversized partial aggregates to disk and merges them after all input has been processed. The external merge sort generates sorted runs that fit in memory, writes them to temporary storage, and merges them in a final pass.
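The grace hash join's partitioning idea fits in a few lines. This is a minimal sketch, with in-memory lists standing in for spilled partition files and toy build/probe tables standing in for real relations; the point is only that hashing both sides on the join key guarantees matching rows share a partition, so partition pairs can be joined one at a time:

```python
# Grace hash join sketch: partition both sides by hash of the join key,
# then run an ordinary hash join over one partition pair at a time.
N_PARTITIONS = 4

def partition(rows, key_index):
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(row[key_index]) % N_PARTITIONS].append(row)
    return parts

def join_pair(build_rows, probe_rows):
    # Classic in-memory hash join over a single partition pair.
    table = {}
    for row in build_rows:
        table.setdefault(row[0], []).append(row)
    return [b + p for p in probe_rows for b in table.get(p[0], [])]

orders = [(1, "2024-01-01"), (2, "2024-01-02")]           # build side
lineitems = [(1, 9.99), (1, 5.00), (2, 3.50), (3, 1.0)]   # probe side

result = []
for b_part, p_part in zip(partition(orders, 0), partition(lineitems, 0)):
    result.extend(join_pair(b_part, p_part))  # one pair in memory at a time

print(sorted(result))  # the key-3 lineitem has no matching order
```

In the real operator the partitions live in temporary files rather than lists, which is exactly why the storage numbers in the next section matter.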

All three operators write and read large sequential chunks, not scattered small pages. DuckDB’s buffer manager does not perform random-access I/O during spill; it writes entire partitions as contiguous blocks. This is a deliberate design choice, and it matters considerably depending on the storage hardware involved.

Apple Silicon MacBooks use NVMe controllers with sequential read speeds of 3-7 GB/s depending on model and I/O size. A SATA SSD peaks near 550 MB/s. A mid-range spinning disk runs 100-200 MB/s. The cost of a spill operation scales directly with those numbers. A grace hash join that writes and then reads back a 15GB partitioned build side moves 30GB of sequential I/O: roughly 4-10 seconds on M-series NVMe, close to a minute on a SATA SSD, and over three minutes on a spinning disk. On Apple Silicon, the round-trip cost of spilling from memory to disk and back is close to the cost of a fast S3 read. On spinning rust, the same operation is a multi-minute pause.
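The round-trip arithmetic is worth making explicit. A sketch using the article's ballpark sequential rates and a hypothetical `spill_seconds` helper:

```python
# Round-trip spill cost model: a spilled partition is written once and
# read back once, so the I/O volume is twice the partition size.
def spill_seconds(partition_gb: float, seq_bandwidth_gbs: float) -> float:
    return 2 * partition_gb / seq_bandwidth_gbs

# Ballpark sequential bandwidths for each storage class, in GB/s:
for name, bw in [("M-series NVMe", 6.0), ("SATA SSD", 0.55), ("spinning disk", 0.15)]:
    print(f"{name}: {spill_seconds(15, bw):.0f}s for a 15 GB build side")
```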

import duckdb

con = duckdb.connect()
# Setting memory_limit below available RAM forces DuckDB to spill
# temp_directory should point at your fastest available storage
con.execute("SET memory_limit = '6GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")
con.execute("SET threads = 8")

Pointing temp_directory at an external NVMe enclosure rather than the internal SSD can improve spill performance further when the internal drive is under contention. The claim that 8GB handles 100GB is most accurate on the hardware DuckDB benchmarked it on. On hardware with slower storage, the out-of-core story is weaker, and the practical limit before query times become impractical is lower.

Storage Capacity: The Hard Ceiling

Memory bandwidth and I/O bandwidth are rate constraints; you pay a time penalty when you exceed them. Storage capacity is an absolute limit: you cannot spill data that has nowhere to go.

DuckDB’s out-of-core execution requires that your working dataset plus temporary spill files fit on local storage simultaneously. The base MacBook Air ships with 256GB of storage. After macOS and typical applications, available space is roughly 150-180GB. TPC-H SF100 in compressed Parquet occupies 25-30GB, and spill files for complex multi-join queries peak temporarily at 40-60GB on top of that. Those numbers fit within the base configuration, which is how the benchmark works on a $1,099 machine.

TPC-H SF500 in compressed Parquet is approximately 125GB. Spill space for complex queries adds another 60-100GB. A 256GB SSD cannot accommodate this; a 512GB SSD is marginal. For analytical workloads in the 300-600GB range, storage capacity rather than RAM becomes the binding constraint.
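The capacity check is a one-line inequality: dataset plus peak spill must fit in free space at the same time. A sketch using the article's rough sizes and a hypothetical `fits` helper:

```python
# Storage-budget check: the working dataset and its temporary spill
# files must coexist on local storage.
def fits(free_gb: float, dataset_gb: float, peak_spill_gb: float) -> bool:
    return dataset_gb + peak_spill_gb <= free_gb

# SF100 on a base 256GB Air (~165 GB free after OS and apps):
print(fits(165, 30, 60))    # ~30 GB Parquet + ~60 GB peak spill -> fits
# SF500 on the same machine:
print(fits(165, 125, 100))  # ~125 GB + ~100 GB -> capacity is the wall
```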

The practical implication: for out-of-core DuckDB workloads, upgrading from 256GB to 512GB or 1TB of storage extends the viable dataset range more than upgrading from 8GB to 16GB of RAM. A RAM upgrade buys more in-memory execution and less frequent spilling. A storage upgrade raises the absolute ceiling beyond which no amount of efficient spilling can help.

Query Shape Determines the Spill Budget

Two queries over the same 100GB input dataset can have radically different peak memory requirements. The difference comes from the cardinality of intermediate state, not from input volume.

A GROUP BY on a low-cardinality column, say 50 distinct product categories, accumulates partial sums for 50 groups. The hash table stays tiny regardless of how many input rows stream through it; essentially no spilling occurs. A join between two large tables on a high-cardinality key, such as joining 70GB of order line items to 15GB of orders on a unique order identifier, builds a hash table proportional to the number of distinct join keys. At TPC-H SF100, the lineitem table contains 600 million rows. Without join reordering, the intermediate hash table grows to 20-40GB before aggregation collapses it.

-- Low spill budget: hash table holds 50 partial aggregates
SELECT product_category, SUM(revenue)
FROM orders
GROUP BY product_category;

-- High spill budget: hash table grows proportional to join key cardinality
-- Peak intermediate state depends heavily on join ordering
SELECT o.order_date, SUM(l.extended_price)
FROM orders o
JOIN lineitem l ON o.order_key = l.order_key
GROUP BY o.order_date;
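The scaling difference between the two queries can be demonstrated in a few lines. A sketch of the low-cardinality case, with synthetic data standing in for the orders table:

```python
# Intermediate-state size tracks key cardinality, not input volume.
from collections import defaultdict

sums = defaultdict(float)
for i in range(1_000_000):              # a million input rows...
    category = f"cat_{i % 50}"          # ...but only 50 distinct groups
    sums[category] += 1.0               # streaming partial aggregate

# The aggregation hash table holds 50 entries no matter how many rows
# streamed through. A join build side instead holds one entry per
# distinct join key, which for a unique key means one entry per row.
print(len(sums))  # 50
```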

DuckDB’s query planner uses column statistics to estimate join cardinality and reorders joins to minimize the largest intermediate hash table. When a small table joins a large one, the planner builds the hash table over the small side and probes it with the large side, keeping peak memory bounded by the smaller relation. When both sides are large, the grace hash join partitions the data and processes it in passes, targeting each partition at a fraction of the configured memory limit.

The EXPLAIN ANALYZE output shows operator-level memory usage, which makes it possible to identify exactly which operator is responsible for peak memory consumption:

EXPLAIN ANALYZE
SELECT o.order_date, SUM(l.extended_price)
FROM orders o
JOIN lineitem l ON o.order_key = l.order_key
GROUP BY o.order_date;

The output includes peak memory per operator. An operator that exceeds the memory limit will show spill statistics, including bytes written and read. This tells you whether a query is running within budget or relying heavily on I/O, which in turn tells you whether a faster SSD would help or whether the query itself needs restructuring.

Parquet Selectivity Shrinks the Effective Problem Size

The numbers above assume DuckDB needs to process the full physical dataset. Most analytical queries do not.

Parquet stores per-column statistics (minimum value, maximum value, and null count) for each row group, a horizontal slice of the file commonly sized around 128MB of uncompressed data. DuckDB’s Parquet reader checks these statistics before reading any file content. A filter predicate causes the reader to skip every row group whose statistics make a match impossible: if every value in a row group’s date column predates a WHERE clause threshold, the entire row group is skipped without any I/O or decompression. Column pruning further reduces physical I/O by reading only the columns a query references.

A query nominally over 100GB of Parquet may physically read 6-8GB after these optimizations apply. The effective problem size has shrunk by an order of magnitude before the execution engine processes a single value. This is not a universal property: it requires that your data be organized or sorted in a way that makes the per-row-group statistics meaningful to your predicate. A filter on a column with no correlation to storage order provides no row-group skipping benefit. For well-partitioned event data filtered by time range, the effect is substantial, and it compresses the spill budget proportionally.
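The pruning logic itself is a simple min/max comparison. A sketch with hypothetical row-group statistics (real Parquet footers carry these per column chunk), assuming data written in date order so the statistics are meaningful to the predicate:

```python
# Row-group pruning sketch: a row group can be skipped when even its
# maximum value cannot satisfy the filter.
row_groups = [
    {"date_min": "2024-01-01", "date_max": "2024-01-31", "bytes": 128_000_000},
    {"date_min": "2024-02-01", "date_max": "2024-02-29", "bytes": 128_000_000},
    {"date_min": "2024-03-01", "date_max": "2024-03-31", "bytes": 128_000_000},
]

threshold = "2024-02-15"   # WHERE event_date >= '2024-02-15'

to_read = [rg for rg in row_groups if rg["date_max"] >= threshold]
skipped = len(row_groups) - len(to_read)
print(f"read {len(to_read)} row groups, skipped {skipped}")  # read 2, skipped 1
```

If the same data were shuffled so every row group spanned the full date range, every `date_max` would pass the check and nothing would be skipped, which is the sorting caveat above in concrete form.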

Where Distributed Infrastructure Remains Necessary

Storage capacity sets the practical data size ceiling for single-node work. Beyond that limit, you need network-attached storage, which DuckDB can query directly over S3 or HTTPS at the cost of I/O speed, or a distributed system.

Concurrent multi-user workloads are a separate case. DuckDB uses a single-writer lock and is not designed to serve dozens of simultaneous heavy queries without resource contention. A shared analytical service with multiple users and response-time requirements needs connection pooling and resource isolation that an embedded database does not provide.

Long-running fault-tolerant ETL jobs, the kind that checkpoint progress across stages and resume from failure after a node crash, remain genuinely harder on a single process than on a system with built-in task retry. A DuckDB query that runs for four hours and fails at hour three restarts from the beginning.

For everything else, the Hacker News discussion around this benchmark reflects a consistent finding from practitioners. Single-user or low-concurrency queries over datasets that fit on a modern laptop SSD, with completion times measured in minutes rather than hours, cover the majority of analytical work done by most engineering teams, and the crossover point where distributed infrastructure is genuinely necessary sits considerably higher than intuitions built on 2019-era tooling would suggest. The RAM limit is real but manageable. The SSD limit is the one worth planning around.
