DuckDB's MacBook Benchmark Depends on How You Write Your Parquet Files

Source: hackernews

The DuckDB team’s benchmark post showing a base MacBook handling analytical queries over datasets much larger than its RAM has generated substantial discussion about execution engines, Apple Silicon memory bandwidth, and whether you still need a cluster. Most of that discussion focuses on DuckDB’s runtime behavior. The part that gets less attention is that the benchmark uses well-structured Parquet files, and Parquet is doing a significant portion of the work before DuckDB’s execution engine processes a single row.

This matters for practitioners. If you want to reproduce those results on your own data, you need to understand what Parquet’s structure contributes and what write-time decisions enable it.

Parquet Stores Query-Time Information at Write Time

Parquet is a columnar binary format. Each column in a dataset is stored contiguously, compressed independently, and annotated with statistics computed when the file is written: the minimum value, maximum value, and null count for each row group. A row group is a horizontal slice of the table, typically covering 128MB of uncompressed data. Those statistics live in the file’s metadata footer and are readable without decoding any row data.

When DuckDB reads a Parquet file and finds a WHERE clause filter, it evaluates that filter against the row group statistics before any I/O on the data itself. A row group whose recorded maximum event_ts falls before the query’s lower bound is skipped entirely. No decompression, no decoding, no memory allocation for that row group.

This is the same idea as a clustered index on a traditional database. An index does not remove data from the table; it provides a structure that makes irrelevant data skippable. Parquet row group statistics serve the same function, built directly into the file format rather than maintained as a separate structure. The critical difference from an index: the statistics only reflect the actual data range present in each row group. If your data is sorted by event_ts before being written, then most row groups will have non-overlapping time ranges, and most will be skippable for any time-range query. If your data is written in random order, every row group might contain timestamps from across the entire date range, and statistics-based skipping will not eliminate any row groups at all.

For a 100GB Parquet dataset, the difference between sorted and unsorted write order can mean the difference between DuckDB reading 8GB and reading 95GB for a typical time-range filter query.

Hive Partitioning Extends Skipping to the Directory Level

Parquet row group statistics operate within individual files. Hive-style directory partitioning applies the same skip logic at a higher level. A dataset organized as:

/data/events/
  year=2025/
    month=01/
      region=us-east/
        part-0001.parquet

lets DuckDB prune entire directories before opening any files. A query filtering for year=2025 AND month=03 AND region=us-east does not stat, open, or read the metadata footer of any file outside those directories. The file system layout becomes the index.

DuckDB reads Hive-partitioned datasets transparently:

import duckdb

conn = duckdb.connect()
conn.execute("SET memory_limit = '6GB'")

result = conn.execute("""
    SELECT
        month,
        SUM(revenue) AS monthly_revenue,
        COUNT(*)     AS event_count
    FROM read_parquet(
        '/data/events/**/*.parquet',
        hive_partitioning = True
    )
    WHERE year = 2025
      AND region = 'us-east'
    GROUP BY month
    ORDER BY month
""").df()

With a 100GB dataset spanning multiple years and regions, a query scoped to one region and one year might physically read 3-5% of total file content after directory pruning. The “100GB problem” is a 3-5GB problem by the time DuckDB opens its first row group.

Converting Your Data With DuckDB Itself

DuckDB can generate well-structured Parquet output from raw sources, which means the conversion step does not require a separate tool. The COPY statement accepts Parquet as a target format with partition and ordering options:

conn.execute("""
    COPY (
        SELECT
            *,
            -- PARTITION_BY takes column names, not expressions,
            -- so derive the partition columns in the SELECT
            year(event_ts)  AS year,
            month(event_ts) AS month
        FROM read_csv_auto('/raw/events/*.csv')
        ORDER BY event_ts, region
    )
    TO '/data/events'
    (
        FORMAT PARQUET,
        PARTITION_BY (year, month, region),
        ROW_GROUP_SIZE 131072
    )
""")

The ORDER BY before writing is the critical detail. It sorts the data so that each row group contains a contiguous time range and a single region value. Row group statistics then accurately reflect narrow, non-overlapping ranges. The PARTITION_BY creates the Hive directory structure for directory-level pruning. Both affect subsequent query performance independently.

A ROW_GROUP_SIZE of 131072 rows is close to DuckDB's default of 122,880 rows and works well for most workloads; elsewhere in the Parquet ecosystem, row groups targeting roughly 128MB of uncompressed data are the common default. Smaller row groups mean finer-grained skipping but larger metadata footers and more file-open overhead. For datasets where most queries filter to under 5% of rows, smaller row groups can improve performance. For datasets where most queries scan large fractions, the metadata overhead outweighs the skip benefit.

What the Benchmark Is Actually Measuring

When DuckDB’s benchmark post reports query times against 100GB datasets on an 8GB machine, the numbers reflect the full stack: Parquet’s skip logic, DuckDB’s vectorized execution over the remaining data, and the buffer manager handling any intermediate state that exceeds memory. Parquet’s contribution to that result is not a constant. It depends on the data.

For a well-structured Parquet dataset with sorted write order and Hive partitioning, a selective time-range query might touch 5-10% of physical file bytes. For a flat, unsorted Parquet file without partitioning, the same query reads closer to 100% of the file. Both cases exercise DuckDB’s out-of-core machinery, but the first case gives the engine a 10-20x smaller input.

The Hacker News discussion around the post includes practitioners reporting their own benchmark numbers, and the variance is wide. Some see results comparable to DuckDB’s; others see much slower times. Data layout is the likely explanation for a significant portion of that variance.

The Write-Once, Query-Many Trade-Off

Structuring Parquet files for query performance requires paying a cost at write time. Sorting a 100GB dataset before writing it takes time and temporary disk space. Choosing the right partition key requires understanding your query patterns in advance. Getting the sort order wrong (or choosing a partition dimension that does not match actual filter predicates) means the benefits do not materialize.

For analytical datasets that are written infrequently and queried many times, this trade-off is strongly favorable. A daily pipeline that produces a new Parquet partition can sort and structure that partition in the pipeline step. Subsequent queries against it run faster indefinitely. The write cost is paid once; the read benefit compounds over every query.

For datasets that are appended to continuously or queried in unpredictable patterns, the calculus is less clear. Sorting requires a full dataset pass, which is expensive for large incremental writes. Apache Iceberg and Delta Lake address this with compaction and z-ordering operations that restructure existing data in the background. DuckDB can read Iceberg tables through its iceberg extension, which means you can query an Iceberg table with DuckDB and benefit from its file-level statistics without managing Parquet files directly.

The benchmark result that motivated all this discussion is real. Reproducing it on your own data requires paying attention to how your data is written, not just which engine reads it.
