Why DuckDB on a Base MacBook Outperforms Your Spark Cluster for Single-Node Workloads
Source: hackernews
The DuckDB team published a benchmark post showing their engine processing datasets well beyond available RAM on the cheapest MacBook Apple currently sells. The reaction on Hacker News was predictable: surprise, some skepticism, and a lot of “but Spark can do this too.” The surprise is understandable; the skepticism is mostly wrong; and the Spark comparison misses the point entirely.
The more interesting question is not whether DuckDB is fast. It is why this specific combination of hardware and software works so well, and what that means for how a data engineer should make tooling decisions in 2026.
What the TPC-H Benchmark Actually Measures
TPC-H is an industry-standard benchmark for analytical query workloads. It generates synthetic relational data at configurable scale factors: SF1 is roughly 1 GB of raw data, SF10 is 10 GB, SF100 is 100 GB, and SF1000 is 1 TB. The queries involve multi-table joins, aggregations, and filters that stress every part of an analytical engine: I/O throughput, join ordering, memory management, and CPU utilization.
When you see DuckDB completing SF100 queries on a machine with 8 GB of RAM, the engine is processing a 100 GB dataset with roughly one-twelfth of that volume available as memory. That is not a trick. It is a consequence of how DuckDB’s architecture interacts with the storage and memory subsystem.
DuckDB’s Execution Engine: Vectorized and Columnar
DuckDB uses a vectorized execution engine. Data moves through the query plan in batches of fixed-size vectors, 2048 values by default. This matters for two reasons.
First, modern CPUs are designed to execute the same operation on many values in parallel. SIMD instructions process several values per instruction: a 512-bit AVX-512 register on x86 holds eight 64-bit values, and a 128-bit NEON register on ARM holds two. A row-at-a-time engine wastes this capacity entirely. When DuckDB’s filter kernel iterates over a vector of 2048 integers comparing each against a threshold, the compiler auto-vectorizes this into a tight loop of NEON instructions on Apple Silicon, and the CPU executes the comparison for multiple values per cycle instead of one.
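As a toy illustration of the batch-at-a-time model, here is a pure-Python sketch with the vector size hard-coded to DuckDB's default of 2048. The operator names and structure are invented for illustration; real DuckDB kernels are compiled C++ over typed vectors, which is what makes auto-vectorization possible.

```python
# Toy sketch of a vector-at-a-time pipeline (illustrative only).
# Each operator consumes and produces fixed-size batches instead of
# single rows, which is the property SIMD kernels exploit.

VECTOR_SIZE = 2048  # DuckDB's default vector width

def scan(values):
    """Leaf operator: yields fixed-size batches of the input column."""
    for i in range(0, len(values), VECTOR_SIZE):
        yield values[i:i + VECTOR_SIZE]

def filter_gt(child, threshold):
    """Filter operator: consumes batches, keeps qualifying values."""
    for batch in child:
        kept = [v for v in batch if v > threshold]
        if kept:
            yield kept

def sum_all(child):
    """Aggregate operator: drains the pipeline into one value."""
    total = 0
    for batch in child:
        total += sum(batch)
    return total

# Pipeline equivalent to: SELECT sum(x) FROM t WHERE x > 4999
result = sum_all(filter_gt(scan(list(range(10_000))), 4999))
```

The same loop structure, compiled over contiguous typed arrays instead of Python lists, is what the hardware turns into SIMD comparisons.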
Second, columnar storage means DuckDB only reads the columns a query actually touches. A Parquet file storing 50 columns in a 10 GB dataset might only require reading 3 columns for a given aggregation, reducing the physical I/O to a fraction of the total file size. Combined with predicate pushdown, DuckDB can skip entire row groups before they are ever read from disk. The Parquet format stores min/max statistics per row group per column. A filter like WHERE order_date > '2024-01-01' lets DuckDB skip any row group whose maximum order_date is earlier than the threshold without decoding a single value.
This is how DuckDB processes 100 GB in 8 GB of RAM: it never loads 100 GB. It loads the relevant slices, processes them in batches, and discards data that has been aggregated.
-- TPC-H Query 1 against a 100 GB Parquet dataset
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM read_parquet('lineitem_sf100.parquet')
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
DuckDB’s query planner pushes the l_shipdate filter into the Parquet reader. Row groups where the entire date range falls after the cutoff are skipped at the metadata level. The remaining row groups are decoded column-by-column, loading only the seven columns the query references. On SF100, this can reduce physical I/O by 60-80% depending on data distribution.
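The row-group skipping described above amounts to a comparison against footer statistics before any values are decoded. A minimal sketch, with made-up statistics standing in for a real Parquet footer:

```python
# Sketch of min/max (zone-map) row-group skipping. The statistics here
# are simulated; a real reader takes them from the Parquet file footer.
from datetime import date

row_groups = [
    {"min_shipdate": date(1996, 1, 2), "max_shipdate": date(1997, 6, 30)},
    {"min_shipdate": date(1997, 7, 1), "max_shipdate": date(1998, 8, 31)},
    {"min_shipdate": date(1998, 9, 5), "max_shipdate": date(1998, 12, 1)},
]

cutoff = date(1998, 9, 2)  # WHERE l_shipdate <= DATE '1998-09-02'

# A row group can only contain matches if its minimum shipdate is <= the
# cutoff; otherwise every row fails the predicate and decoding is skipped.
survivors = [g for g in row_groups if g["min_shipdate"] <= cutoff]
```

The check is per row group and per column, so the cost of elimination is a few comparisons against metadata rather than a scan of the data itself.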
The Buffer Manager and Out-of-Core Processing
DuckDB’s buffer manager handles working sets larger than RAM through a least-recently-used eviction policy. When a hash table for a join or aggregation grows beyond available memory, the buffer manager spills the coldest pages to disk. These are written to temporary files in DuckDB’s own block format, not generic OS swap.
The critical design detail is that DuckDB controls what gets spilled and how. It knows which pages belong to a hash table’s probe side versus its build side. It can partition the build side of a hash join by key hash, spill partitions to disk, and read them back one partition at a time during the probe phase, keeping peak memory usage bounded. This grace hash join approach has been part of database research since the 1980s, and DuckDB implements it correctly.
The temporary files written during spilling are sequential. DuckDB does not scatter random pages to random offsets; it writes and reads in large sequential chunks, which is the access pattern that NVMe SSDs are optimized to deliver.
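The partitioning idea behind a grace hash join fits in a few lines. In this sketch, in-memory lists stand in for the spilled partition files, and the schema is invented for illustration:

```python
# Simplified grace hash join: partition both sides by a hash of the key,
# then join each partition pair independently. In a real engine the build
# partitions would be spilled to temp files and read back one at a time;
# lists stand in for those files here.

N_PARTITIONS = 4

def partition(rows, key_index):
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(row[key_index]) % N_PARTITIONS].append(row)
    return parts

def grace_hash_join(build, probe):
    """Join (key, x) build rows against (key, y) probe rows on key."""
    results = []
    build_parts = partition(build, 0)   # "spill" build side by key hash
    probe_parts = partition(probe, 0)
    for bp, pp in zip(build_parts, probe_parts):
        # Only one build partition's hash table is in memory at a time,
        # so peak memory is roughly 1/N_PARTITIONS of the build side.
        table = {}
        for key, x in bp:
            table.setdefault(key, []).append(x)
        for key, y in pp:
            for x in table.get(key, []):
                results.append((key, x, y))
    return results

orders = [(1, "o1"), (2, "o2"), (3, "o3")]
lineitems = [(1, "l1"), (1, "l2"), (3, "l3"), (4, "l4")]
joined = grace_hash_join(orders, lineitems)
```

Because a matching build and probe row always hash to the same partition, each partition pair can be joined in isolation, which is what makes the memory bound possible.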
Configuring DuckDB for out-of-core work requires little more than a memory limit and a spill directory:
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit='6GB'")  # leave headroom below physical RAM
con.execute("SET temp_directory='/tmp/duckdb_spill'")  # where spilled blocks land
con.execute("SET threads=8")  # optional: cap worker threads
# This join produces an intermediate result larger than available RAM.
# DuckDB partitions the hash join and spills gracefully.
result = con.execute("""
SELECT
n.n_name AS nation,
year(o.o_orderdate) AS order_year,
sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM read_parquet('lineitem_sf100/*.parquet') l
JOIN read_parquet('orders_sf100/*.parquet') o
ON l.l_orderkey = o.o_orderkey
JOIN read_parquet('customer_sf100/*.parquet') c
ON o.o_custkey = c.c_custkey
JOIN read_parquet('nation.parquet') n
ON c.c_nationkey = n.n_nationkey
GROUP BY nation, order_year
ORDER BY revenue DESC
""").fetchdf()
The lineitem table at SF100 is around 70 GB; the join against orders adds roughly another 15 GB. DuckDB processes this by partitioning the relations, spilling to the temp directory as needed, and joining partition by partition. The user sets a couple of config values and writes a standard SQL query.
Apple Silicon’s Unified Memory Architecture
The base MacBook Air uses an M-series chip with unified memory. The CPU, GPU, and Neural Engine share a single pool of LPDDR5-class memory on the same package, connected via a high-bandwidth on-chip fabric.
On conventional x86 laptops, the CPU accesses main memory through a memory controller with typical bandwidth around 30-60 GB/s for a dual-channel consumer configuration. Apple Silicon’s M3 delivers around 100 GB/s of memory bandwidth in the base configuration; the M4 generation pushes this further.
For DuckDB specifically, high memory bandwidth matters at the point where vectorized kernels scan data out of the buffer cache. When DuckDB processes a 2048-element vector of 64-bit doubles from a cached page, that is a sequential memory read. The throughput of that read is bounded by memory bandwidth. At 100 GB/s, a saturated memory bus delivers data fast enough that the compute units stay fed; the bottleneck shifts to arithmetic and hash table lookups rather than memory latency.
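The arithmetic behind that claim is short. A back-of-envelope sketch using the figures quoted above (illustrative numbers only, not a measurement):

```python
# How many 2048-value vectors of doubles can a 100 GB/s memory bus
# deliver per second? Rough arithmetic with the figures from the text.

VECTOR_VALUES = 2048
BYTES_PER_DOUBLE = 8
bandwidth_bytes_per_s = 100e9  # ~100 GB/s on a base M3

vector_bytes = VECTOR_VALUES * BYTES_PER_DOUBLE  # 16 KiB per vector
vectors_per_second = bandwidth_bytes_per_s / vector_bytes
```

At roughly six million vectors per second of raw delivery capacity, the scan itself stops being the constraint, which is why the bottleneck shifts to arithmetic and hash table lookups.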
The unified architecture also eliminates a PCIe bottleneck relevant to GPU-accelerated analytics. On systems with discrete GPUs, moving data from CPU-accessible RAM to GPU VRAM crosses PCIe at 16-32 GB/s. On Apple Silicon, the CPU and GPU share the same physical memory, so frameworks that delegate computation to the GPU can do so without copying. DuckDB does not currently issue Metal compute shaders for query execution, but the memory architecture removes a constraint that would otherwise make GPU offload expensive.
NVMe SSD Speeds and Sequential Access Patterns
The SSDs in modern MacBooks deliver sequential read throughput around 7 GB/s. That figure is not a cherry-picked benchmark condition; it is what Apple’s NVMe controllers sustain when reading large sequential runs.
Conventional spinning disks read sequentially at 100-200 MB/s. SATA SSDs plateau around 550 MB/s. NVMe through PCIe 4.0 reaches 5-7 GB/s. The difference is an order of magnitude, and it changes the economics of out-of-core processing entirely.
DuckDB’s spill-to-disk behavior writes and reads large sequential chunks. When a hash join spills 8 GB of build-side data and later reads it back in 256 MB partitions, the effective I/O pattern is a long sequential write followed by a series of sequential reads. At 7 GB/s read throughput, DuckDB can reload 8 GB of spilled data in just over one second. The same read would take roughly 15 seconds on a SATA SSD and close to a minute on spinning disk.
The practical consequence: the performance penalty for exceeding available RAM on modern MacBook hardware is small enough that it does not dominate query time. Spilling is no longer a catastrophic fallback; it is a routine part of processing large datasets.
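The same back-of-envelope arithmetic makes the comparison concrete, using the sequential-read figures quoted above:

```python
# Reload time for an 8 GB spilled partition across storage generations,
# using the sequential-read throughput figures from the text.

spill_bytes = 8 * 10**9

throughput_bytes_per_s = {
    "spinning disk": 150 * 10**6,  # ~150 MB/s
    "SATA SSD":      550 * 10**6,  # ~550 MB/s
    "NVMe (PCIe 4)": 7 * 10**9,    # ~7 GB/s
}

reload_seconds = {
    device: spill_bytes / bw
    for device, bw in throughput_bytes_per_s.items()
}
# Roughly: spinning disk ~53 s, SATA SSD ~15 s, NVMe ~1.1 s
```

A fifty-fold gap in reload time is the difference between spilling as a catastrophe and spilling as routine.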
Why Spark Has Overhead That Makes It the Wrong Tool Here
Apache Spark is a distributed coordination framework. It was designed to execute query stages across tens or hundreds of machines, routing data between them over a network shuffle layer. That design incurs structural overhead that does not disappear when you run Spark in local mode on a single machine.
Spark’s execution model serializes intermediate data between stages, materializes shuffle output to disk at every stage boundary, and schedules tasks through a driver process with per-task overhead. The JVM introduces garbage collection pauses proportional to heap pressure, and the shuffle manager carries a fixed per-partition setup cost. Even with the Tungsten memory-management and code-generation optimizations, writing shuffle data to disk between pipeline stages remains an architectural invariant.
For a 100 GB TPC-H workload on a single machine with fast NVMe and 100 GB/s memory bandwidth, Spark’s coordination layer costs more time than the actual computation. DuckDB runs in-process with no serialization boundary, no shuffle network, no JVM. Its intermediate state lives in native memory under the buffer manager’s control.
The right mental model: Spark’s overhead is the price of distributed fault tolerance and coordination. On a single node, you pay that price and receive nothing in return.
When You Genuinely Still Need Distributed Systems
Distributed data processing remains the correct choice in specific circumstances:
- Data volume: If your raw dataset is multiple terabytes that cannot be stored on a single machine, distribution is unavoidable. A petabyte-scale data warehouse requires it.
- Concurrent workloads: A single DuckDB process does not serve many simultaneous heavy analytical queries efficiently. A cluster routes queries to separate workers and scales horizontally.
- Fault tolerance over long jobs: A computation that runs for 12 hours on one machine has no recovery path if the machine fails. Spark with checkpointing to object storage survives worker failures mid-run.
- Organizational access patterns: When dozens of analysts need simultaneous query access to shared datasets, a query service backed by distributed compute is the right architecture regardless of individual query size.
None of these apply to the common case where a data engineer needs to analyze a 50 GB export from a production database, explore a year of event logs, or prototype a transformation pipeline before deploying it. For those workloads, spinning up a cluster introduces latency, cost, and operational complexity that a single well-configured DuckDB process eliminates.
The Tooling Implication
The DuckDB benchmark matters because it shifts the threshold at which distributed systems become necessary. Five years ago, 100 GB of data required either substantial RAM (and expensive cloud instances) or a distributed cluster. Today, a base MacBook handles it with DuckDB in a time that is practical for interactive work.
The implication for tooling decisions is direct. Reach for DuckDB when your workload is analytical, fits within a few hundred gigabytes, and runs on one machine. Reach for Spark, Trino, or BigQuery when you have genuine data volume, genuine concurrency requirements, or genuine fault tolerance needs. The mistake is applying distributed infrastructure to problems that a well-engineered single-node engine solves better, faster, and without the operational overhead.
The four technical factors behind the result are DuckDB’s vectorized columnar engine with predicate pushdown, the buffer manager’s grace hash join and sequential spill design, the memory bandwidth Apple Silicon delivers, and the sequential I/O throughput that modern NVMe provides. Each one is meaningful independently. Together they change what is possible on a laptop.