Running TPC-H 1000 on Apache DataFusion on a single node

TL;DR: We successfully executed the full TPC‑H benchmark at scale factor 1000 (≈1 TB) using Apache DataFusion on a single node. To our knowledge, this is the first public demonstration of all 22 queries completing at this scale. This benchmark aligns with our belief that a single well‑provisioned machine is sufficient for the vast majority of data workloads.

Why we ran TPC‑H 1000 on a single node

At Embucket, our philosophy is simple: the majority of production data workloads don’t require a distributed fleet of machines. Modern cloud instances offer staggering compute and memory capacity, and most real‑world datasets are smaller than 1 TB. Testing TPC‑H at scale factor 1000 (SF1000) is a way to validate that belief and to push the limits of modern query engines in a controlled environment.

We chose Apache DataFusion for this test because it’s a fast‑evolving, Rust‑native query engine that we’re using as a building block for our broader mission: building Autonomous Data Infrastructure.

Benchmark environment

Hardware

AWS EC2 r6gd.metal (64 vCPUs, 512 GB RAM, NVMe SSD)
Region: us-east-2

Data format & storage

TPC‑H data generated via tpchgen-cli (Rust)
Stored as 20 Parquet files per table
Benchmarks run with both local SSD and remote S3 storage (local SSD used for final measurements)

Execution method

datafusion-cli with datafusion.execution.target_partitions = 32
Queries executed one at a time
Each query run twice:
- First run: cold cache
- Second run: hot cache (timings averaged across hot runs)

Metrics captured

Per‑query runtimes
Total runtime
Memory and CPU usage monitored externally (no exhaustive logs)
Profiling via EXPLAIN ANALYZE

The problematic queries: Q18 and Q21

Out of the 22 TPC‑H queries, two presented significant challenges at SF1000: Q18 and Q21. Initial attempts to run the original versions resulted in out‑of‑memory errors or excessive runtimes. We tried to fix this by tuning:

Adjusting the number of partitions (higher parallelism = higher memory pressure)
Enabling spilling to disk with a fast local SSD

Despite these, we faced instability: tuning for one query often broke the other. Ultimately, we rewrote both queries with structural optimizations.

How we rewrote the queries

Q18: The original query made two passes over the massive lineitem table. Intermediate results were huge due to the orders × lineitem join. We rewrote the query to filter lineitem early, reduce passes, and streamline joins.

Q21: The original used three self‑joins and EXISTS/NOT EXISTS subqueries on lineitem. We transformed these into semi/anti‑joins over pre‑filtered subsets. This drastically reduced row movement, enabled cheaper join strategies, and improved query plan stability.

Query rewrites

Original Q18

SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(l_quantity)
FROM customer, orders, lineitem
WHERE o_orderkey IN (
    SELECT l_orderkey
    FROM lineitem
    GROUP BY l_orderkey
    HAVING SUM(l_quantity) > 313
)
AND c_custkey = o_custkey
AND o_orderkey = l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
LIMIT 100;

Modified Q18

WITH order_totals AS (
    SELECT l_orderkey, SUM(l_quantity) AS sum_qty
    FROM lineitem
    GROUP BY l_orderkey
    HAVING SUM(l_quantity) > 300
)
SELECT c.c_name, c.c_custkey, o.o_orderkey, o.o_orderdate, o.o_totalprice, ot.sum_qty
FROM order_totals ot
JOIN orders o ON o.o_orderkey = ot.l_orderkey
JOIN customer c ON c.c_custkey = o.o_custkey
ORDER BY o.o_totalprice DESC, o.o_orderdate
LIMIT 100;

Original Q21

SELECT s_name, COUNT(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
    SELECT * FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
    SELECT * FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
    AND l3.l_suppkey <> l1.l_suppkey
    AND l3.l_receiptdate > l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = 'ARGENTINA'
GROUP BY s_name
ORDER BY numwait DESC, s_name
LIMIT 100;

Modified Q21

WITH failed_orders AS (
    SELECT o_orderkey FROM orders WHERE o_orderstatus = 'F'
),
order_supplier_counts AS (
    SELECT l.l_orderkey, COUNT(DISTINCT l.l_suppkey) AS num_suppliers
    FROM lineitem l
    INNER JOIN failed_orders fo ON l.l_orderkey = fo.o_orderkey
    GROUP BY l.l_orderkey
    HAVING COUNT(DISTINCT l.l_suppkey) > 1
),
late_receipts AS (
    SELECT l.l_orderkey, l.l_suppkey
    FROM lineitem l
    INNER JOIN failed_orders fo ON l.l_orderkey = fo.o_orderkey
    WHERE l.l_receiptdate > l.l_commitdate
),
saudi_late_receipts AS (
    SELECT lr.l_orderkey, lr.l_suppkey, s.s_name
    FROM late_receipts lr
    INNER JOIN supplier s ON lr.l_suppkey = s.s_suppkey
    INNER JOIN nation n ON s.s_nationkey = n.n_nationkey
    WHERE n.n_name = 'SAUDI ARABIA'
)
SELECT slr.s_name AS s_name, COUNT(*) AS numwait
FROM saudi_late_receipts slr
INNER JOIN order_supplier_counts osc ON slr.l_orderkey = osc.l_orderkey
WHERE NOT EXISTS (
    SELECT 1 FROM late_receipts lr2
    WHERE lr2.l_orderkey = slr.l_orderkey AND lr2.l_suppkey <> slr.l_suppkey
)
GROUP BY slr.s_name
ORDER BY numwait DESC, s_name
LIMIT 100;

Results

Per‑query runtimes

Total runtime

All the measurements with the code to reproduce this benchmark can be found here: https://github.com/Embucket/benchmarks/tree/main/datafusion

Closing thoughts

We believe this benchmark is an important milestone for the DataFusion community. It shows that with the right setup and some careful query engineering, DataFusion can run TPC‑H 1000 on a single node.

This is only the beginning. As part of our journey toward Autonomous Data Infra, we’ll continue pushing the boundaries of what can be done with modern data engines on modern hardware.

Stay tuned.

Running TPC-H 1000 on Apache DataFusion on a single node

Running TPC-H 1000 on Apache DataFusion on a single node

Why we ran TPC‑H 1000 on a single node

Benchmark environment

The problematic queries: Q18 and Q21

How we rewrote the queries

Query rewrites

Original Q18

Modified Q18

Original Q21

Modified Q21

Results

Closing thoughts

Blog

4 Things to Keep in Mind While Benchmarking Data Engines

How We Tested Embucket's Snowflake Compatibility with SQL Logic Tests (SLT)

Running TPC-H 1000 on Apache DataFusion on a single node

Implementing MERGE INTO for Apache Iceberg with Datafusion

BYOC option for Snowflake workloads