How DuckDB Turns 8GB of Unified Memory Into a Serious Data Warehouse
Source: hackernews
The cheapest MacBook Apple sells right now is the 13-inch MacBook Air with 8GB of unified memory. For most data workflows built around Pandas or PySpark, that spec is a hard ceiling: load more than a few gigabytes of CSV and you’re watching swap thrash until the process gets killed. DuckDB’s recent post demonstrates that this machine can handle big data workloads that would be impractical with conventional in-memory tooling. The interesting part is not that DuckDB is fast; it has been fast for a while. The interesting part is why 8GB is enough, and what the architecture looks like that makes it so.
The Unified Memory Difference
Apple Silicon changes the memory calculus in a subtle but important way. A traditional x86 laptop with 8GB of DDR5 shares that bandwidth between the CPU doing compute and the OS doing everything else. On M-series chips, the memory sits in the same package as the CPU, GPU, and Neural Engine, with memory bandwidth around 100 GB/s on the M3 compared to roughly 50 GB/s for typical DDR5 laptop configurations. When DuckDB is streaming columnar data through vectorized execution, that bandwidth difference matters more than raw capacity. A lot of query execution time in columnar databases is spent shuffling data through the memory hierarchy, and a tight memory bus makes that cheaper.
This does not mean 8GB is suddenly limitless. It means that DuckDB can do more useful work per byte of memory it holds at a given moment, and the threshold at which spilling to disk becomes the bottleneck moves outward.
Out-of-Core Processing in DuckDB
DuckDB has supported out-of-core execution since the 0.8 era, and by 1.x the implementation covers the most common bottleneck operations: hash joins, hash aggregations, and sorting. The mechanism is straightforward to configure:
PRAGMA memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';
With those two settings, DuckDB will partition in-memory hash tables and flush partitions to the temp directory when the limit is approached, then read them back during the probe phase. The same applies to aggregations: when a GROUP BY over a large fact table exceeds the memory budget, DuckDB partitions the aggregation state, spills to disk, and merges the results. Sort operations use an external merge sort that streams sorted runs through disk rather than requiring the full dataset to be in memory at once.
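To confirm the budget took effect, DuckDB exposes its active configuration through the built-in duckdb_settings() table function:

-- Returns the currently active values for both settings.
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'temp_directory');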
The critical thing is that none of this requires you to write different SQL. The query planner decides when to spill; you set the budget and walk away. Compare this to the Spark model, where you are writing DataFrames with explicit partitioning strategies, managing shuffle partitions (spark.sql.shuffle.partitions), and reasoning about serialization overhead. For a solo analyst or a small engineering team, the cognitive overhead gap between the two is enormous.
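As a concrete sketch (table and column names here are hypothetical), a join whose build side exceeds the budget needs no special handling; DuckDB partitions the hash table and spills behind the scenes:

-- Identical SQL whether the hash table fits in memory or spills.
SELECT o.region, SUM(l.amount)
FROM lineitems l
JOIN orders o ON l.order_id = o.order_id
GROUP BY o.region;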
Vectorized Execution and Why It Matters for I/O
DuckDB uses a vectorized execution model: operators process batches of rows (typically 2048 values per vector) rather than one row at a time. This is the same fundamental idea as Vectorwise and MonetDB/X100, work that goes back to Peter Boncz’s 2002 dissertation.
For out-of-core workloads, the batch size matters because it determines how aggressively DuckDB can pipeline I/O with computation. When a hash join is reading spilled partitions back from disk, the vectorized scan means it can be reading ahead into the next partition while computing on the current one. The same applies to Parquet reads: DuckDB’s native Parquet reader pushes down column selection and row group filters before decoding, so for a scan over a 50GB Parquet file asking for three columns out of thirty, most of the file never touches memory at all.
-- Only the 'amount', 'region', 'ts' columns are decoded.
-- Row groups not matching the WHERE are skipped entirely.
SELECT region, SUM(amount)
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE ts >= '2025-01-01'
GROUP BY region;
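You can verify the pushdown yourself: prefixing the query with EXPLAIN prints a plan whose Parquet scan node lists only the projected columns and the pushed-down filter.

EXPLAIN
SELECT region, SUM(amount)
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE ts >= '2025-01-01'
GROUP BY region;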
This is where the MacBook Air’s fast NVMe storage (the M3 Air gets around 3 GB/s sequential reads) starts pulling its weight. When the bottleneck is I/O and DuckDB is doing aggressive predicate pushdown, the gap between local NVMe and a high-end cloud instance’s attached EBS volume is smaller than most people assume.
The Comparison That Matters
Pandas materializes the entire dataset as a dense in-memory representation. A 10GB CSV will consume close to 10GB of RAM (often more, once string columns are boxed as Python objects), and any operation that creates a new column or intermediate result duplicates that memory. Polars is much better on this front, using Apache Arrow’s columnar layout and lazy evaluation via its LazyFrame API, but both libraries are fundamentally in-memory engines. When you exceed available RAM, they swap and stall.
Polars’ out-of-core streaming mode is experimental and limited in which operations it supports. DuckDB’s spill path is stable and general-purpose.
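The practical consequence is that DuckDB can query a file larger than RAM directly, streaming it through the pipeline instead of loading it first (the file name below is hypothetical):

-- Streams the CSV through the operators; peak memory stays near
-- the configured limit rather than scaling with the file size.
SELECT region, AVG(amount)
FROM read_csv_auto('events_20gb.csv')
GROUP BY region;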
The other obvious comparison is cloud. A Redshift Serverless workgroup or a BigQuery slot will handle terabyte-scale queries without any local constraints, but you are paying per query and dealing with upload latency to get your data there in the first place. For iterative exploratory analysis where you are running dozens of queries against a dataset you already have locally, the economics of keeping it on your laptop with DuckDB are straightforward.
What the 8GB Scenario Actually Looks Like
The practical ceiling for the cheapest MacBook running DuckDB is roughly 10-30x the physical RAM, depending on the query. Aggregations with low-cardinality GROUP BY keys are cheap to spill; they produce small partial aggregates. Hash joins where the build side is large are more expensive, because you are materializing and re-reading partition data. Sorts are somewhere in between.
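A sketch of the two extremes (schemas hypothetical):

-- Cheap to spill: a handful of GROUP BY keys means tiny partial
-- aggregates, regardless of input size.
SELECT region, SUM(amount) FROM events GROUP BY region;

-- Expensive to spill: the build side must be partitioned to disk
-- and re-read during the probe phase.
SELECT a.key, b.payload
FROM big_a a JOIN big_b b ON a.key = b.key;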
For a dataset of around 100GB of Parquet files, a well-written DuckDB query on an M3 Air with a 4GB memory limit will complete in minutes, not hours, assuming you are not doing something pathological like a many-to-many join with no filter. The TPC-H benchmarks DuckDB publishes consistently show competitive performance against much more heavyweight systems at the 10-100GB scale.
The Broader Shift
What the “big data on the cheapest MacBook” story actually represents is the slow decomposition of the assumption that scale requires infrastructure. The original big data stack (Hadoop, Hive, early Spark) was built around the constraint that a single machine could not hold the data or do the compute. Commodity servers were cheap; you bought more of them. That model made sense at the time.
Single-node hardware has gotten faster at a rate the distributed model did not anticipate. Apple Silicon’s memory bandwidth, fast NVMe storage, and DuckDB’s columnar execution engine combine to put a surprising amount of analytical capability in a $1,099 laptop. The implication is not that distributed systems are obsolete; at true petabyte scale they remain the right tool. The implication is that the threshold where you actually need them has moved up by an order of magnitude, and most data teams are not operating anywhere near that threshold.
DuckDB’s out-of-core engine is the piece that closes the gap. Fast hardware gets you to the edge; spill-to-disk execution lets you go past it without rewriting your queries or provisioning a cluster. That combination is what makes the cheapest MacBook a legitimate data warehouse.