How Three Architectural Choices Let DuckDB Process 100GB on 8GB of RAM
Source: hackernews
The DuckDB team’s recent post makes a claim that sounds like marketing: a base MacBook Air with 8GB of RAM can handle datasets that would make a Spark cluster sweat. The surprising thing is that it’s true, and the reasons are more specific and interesting than “modern laptops are fast.”
Three technical properties converge to make this work. Each one alone would be interesting. Together, they change the economics of when you actually need a cluster.
The hardware story that gets skipped over
The “big data requires a cluster” assumption was built on concrete hardware realities. A well-provisioned server in 2006 had 4-8GB of RAM. The datasets being analyzed were already larger than that. MapReduce was a reasonable answer to a genuine problem.
Apple Silicon’s M3 and M4 chips change several of those variables simultaneously. The base MacBook Air has approximately 100 GB/s of memory bandwidth and an NVMe drive capable of 3+ GB/s sequential reads. Both numbers matter, for different reasons.
The memory bandwidth matters because DuckDB’s vectorized execution engine is frequently memory-bandwidth-bound during column scans and hash table operations. The engine processes data in batches of 2,048 rows (one vector), using SIMD instructions that can compare or project multiple values per clock cycle. The bottleneck shifts from compute to how fast you can feed data to the compute units. Apple Silicon’s on-package memory eliminates the PCIe bus that separates CPU and memory on conventional server hardware, and the LPDDR5X pool runs fast enough that the arithmetic units rarely wait.
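The batch-at-a-time idea can be illustrated with a toy sketch (this is NumPy standing in for a vectorized engine, not DuckDB's actual code; `VECTOR_SIZE` matches the batch size described above):

```python
import numpy as np

VECTOR_SIZE = 2048  # DuckDB processes one vector of this many rows per operator call


def filter_column(values: np.ndarray, threshold: float) -> np.ndarray:
    """Toy vectorized filter: process one 2,048-row batch at a time.

    Each comparison runs over a whole batch at once (SIMD under the hood
    in NumPy) rather than branching per row -- the same shape of work
    that makes a scan memory-bandwidth-bound instead of compute-bound.
    """
    out = []
    for start in range(0, len(values), VECTOR_SIZE):
        batch = values[start:start + VECTOR_SIZE]
        out.append(batch[batch > threshold])  # one batched comparison, no per-row branch
    return np.concatenate(out) if out else values[:0]
```

The per-batch loop body touches each value exactly once, so throughput is governed by how fast memory can stream batches in, which is where the 100 GB/s figure becomes the operative number.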
The NVMe speed matters for a different reason: it determines the cost of DuckDB’s spilling behavior when the dataset exceeds available RAM.
How out-of-core execution actually works
DuckDB has supported out-of-core execution since version 0.8. By the current 1.x release line, it covers the three operators that most commonly blow up memory budgets: hash joins, hash aggregations, and sorting.
For hash joins, DuckDB implements a grace hash join. When the estimated size of the build-side hash table would exceed the memory budget, DuckDB partitions both sides of the join by hashing the join key. Rows with the same key hash land in the same partition on both sides. DuckDB then processes each partition pair in sequence, keeping peak memory bounded to the size of the largest partition rather than the size of the full table.
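The partition-then-join structure can be sketched in a few lines of plain Python (an illustration of the grace hash join idea, not DuckDB's implementation; in the real engine the partitions are spilled to temp files rather than held in lists):

```python
from collections import defaultdict


def grace_hash_join(build_rows, probe_rows, key, num_partitions=4):
    """Sketch of a grace hash join over lists of dicts.

    Phase 1: partition both inputs by hash of the join key, so rows with
    the same key always land in the same partition pair.
    """
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build_rows:
        build_parts[hash(row[key]) % num_partitions].append(row)
    for row in probe_rows:
        probe_parts[hash(row[key]) % num_partitions].append(row)

    # Phase 2: join one partition pair at a time.  Peak memory is bounded
    # by the largest partition's hash table, never the full build side.
    results = []
    for p in range(num_partitions):
        table = defaultdict(list)
        for row in build_parts[p]:
            table[row[key]].append(row)
        for row in probe_parts[p]:
            for match in table[row[key]]:
                results.append({**match, **row})
    return results
```

Only one partition's hash table exists at any moment, which is exactly the memory-bounding property the paragraph above describes.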
For GROUP BY aggregations, the engine builds a hash table incrementally. When it grows too large, partial aggregates for some partitions are flushed to disk and merged back after the full input has been processed.
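A minimal sketch of the flush-and-merge pattern, again as illustration rather than DuckDB internals (the "spill" here is an in-memory list where the real engine writes temp files):

```python
def external_group_sum(rows, key, value, budget=1000):
    """Sketch of out-of-core GROUP BY ... SUM with a bounded hash table.

    Accumulate partial sums until the table holds more than `budget`
    distinct groups, then flush it and start fresh.  All partials are
    merged once the full input has been consumed.
    """
    spilled, current = [], {}
    for row in rows:
        k = row[key]
        current[k] = current.get(k, 0) + row[value]
        if len(current) > budget:
            spilled.append(current)  # a real engine writes this partition to disk
            current = {}
    spilled.append(current)

    # Merge phase: partial aggregates for the same key combine additively.
    merged = {}
    for partial in spilled:
        for k, v in partial.items():
            merged[k] = merged.get(k, 0) + v
    return merged
```

Note why low-cardinality GROUP BY spills cheaply: each flushed partial is only as large as the number of distinct groups, not the number of input rows.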
The critical implementation detail in both cases is that spill I/O is sequential, not random. DuckDB writes and reads in large sequential chunks, which is the access pattern NVMe handles well. Loading an 8GB spilled partition from a modern NVMe drive at 3 GB/s takes a few seconds. The same read from a 2015-era SATA SSD at 500 MB/s takes over 15 seconds. On the spinning disks that were common when Spark was being designed, it would take minutes. The whole out-of-core story only works at tolerable query times because the storage is fast enough to make spilling cheap.
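The arithmetic above is worth making explicit. A quick back-of-envelope check (the bandwidth figures are ballpark assumptions for each storage generation, not measurements):

```python
def spill_read_seconds(partition_gb: float, bandwidth_gb_s: float) -> float:
    """Sequential read time for one spilled partition of `partition_gb` GB."""
    return partition_gb / bandwidth_gb_s


# One 8 GB spilled partition across three storage generations; HDD sequential
# throughput is assumed optimistic -- real spill patterns on spinning disks
# involved enough seeking to push this well past the sequential bound.
for name, bw in [("NVMe, ~2024", 3.0), ("SATA SSD, ~2015", 0.5),
                 ("7200rpm HDD, ~2009", 0.1)]:
    print(f"{name}: {spill_read_seconds(8, bw):.0f} s")
```

The roughly 6x gap between each generation is why the same out-of-core algorithm reads as "cheap safety net" on NVMe and "query killer" on spinning disks.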
Configuration requires almost nothing:
```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '6GB'")  # cap below physical RAM to leave headroom
con.execute("SET temp_directory = '/tmp/duckdb_spill'")
```
The default memory limit is 80% of system RAM. No query changes are needed. The planner decides when to spill without user intervention.
Parquet reduces the problem before it starts
Out-of-core execution is the safety net. Parquet pushdown is what keeps most queries from needing it.
Parquet files store per-column statistics per row group: minimum value, maximum value, and null count. Row groups are typically 128MB of uncompressed data. When DuckDB scans a Parquet file with a filter predicate, it evaluates those statistics before decompressing anything. If a row group’s maximum value for a date column falls before the filter threshold, the entire row group is skipped at the metadata level: zero bytes are decompressed.
Combined with column pruning (only columns referenced by the query are decoded at all), a query nominally over 100GB of Parquet often generates 6-10GB of actual I/O. The problem has already shrunk by an order of magnitude before any vectorized execution happens.
This extends to remote data. With the httpfs extension loaded, the same pushdown applies to Parquet on S3 via HTTP range requests. DuckDB fetches only the row group byte ranges it needs:
```python
con.execute("INSTALL httpfs; LOAD httpfs;")
result = con.execute("""
    SELECT region, SUM(revenue) AS total
    FROM read_parquet(
        's3://my-bucket/sales/year=2025/**/*.parquet',
        hive_partitioning = true
    )
    GROUP BY region
    ORDER BY total DESC
""").df()
```
The network round-trip to read metadata is cheap. The savings in skipped range fetches accumulate quickly on filtered queries.
Where DuckDB compares favorably to Spark
Spark’s architecture was designed for distributed fault tolerance. On a single node, you pay the full architectural cost and receive nothing in return.
Spark serializes intermediate data through the JVM heap. It materializes shuffle boundaries to disk between every pipeline stage as an invariant of the execution model, even in local mode, even with Tungsten’s binary format optimization. The driver-based scheduler adds per-task overhead. GC pressure scales with heap size and intermediate state volume.
DuckDB runs in-process with no serialization boundary. Intermediate state lives in native memory under the buffer manager’s control. The buffer manager knows which pages belong to which operator’s data structures, and can make informed eviction decisions rather than treating all memory uniformly.
For a single-user exploratory workload against a 50-100GB dataset that fits on local storage, Spark’s distributed machinery is overhead without benefit. DuckDB completes the same query in a fraction of the time, with no cluster bootstrap latency and no per-query infrastructure cost.
The actual ceiling
The practical ceiling for DuckDB on a base MacBook is roughly 10-30x physical RAM, depending heavily on query shape. Low-cardinality GROUP BY is cheap to spill because partial aggregates are small. Large build-side hash joins against tables with few filters are more expensive: more data gets materialized into partitions and read back.
The TPC-H SF100 benchmark (approximately 100GB) completes on an 8GB machine with a 4GB memory limit. That’s the right order of magnitude for legitimately calling this “big data on a cheap MacBook.”
The workloads where this breaks down are real and worth naming: datasets that exceed local SSD capacity entirely, concurrent multi-user write workloads (DuckDB’s single-writer lock serializes writes), real-time streaming ingestion, multi-hour ETL jobs where mid-run machine failure needs recovery, and horizontal scaling under load. For those, you need a cluster or a cloud warehouse. DuckDB is not competing with Redshift at the 10-terabyte scale.
For the majority of analytical workloads that practitioners actually run day-to-day, the cheapest MacBook is now a credible local data warehouse. The mental model that conflated “big data” with “distributed systems” was correct in 2009. It was built on hardware that is no longer the baseline, against tools that have since made sequential spill I/O cheap enough to blur the line between in-memory and out-of-core execution. The HN discussion around the original post reflects practitioners updating their intuitions in real time, sharing benchmarks from their own datasets and consistently finding the crossover point where they actually need a cluster sitting higher than they expected.
The correct takeaway isn’t that clusters are obsolete. It’s that the default assumption should flip: start local, reach for distributed infrastructure when the workload requires it, rather than provisioning clusters for work a laptop can handle.