
arxiv: 2606.24647 · v1 · pith:MNIGHN7Znew · submitted 2026-06-23 · 💻 cs.DB

Accelerating Presto with GPUs

Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3

classification 💻 cs.DB
keywords Presto · GPU acceleration · cuDF · TPC-H · distributed query processing · analytical benchmarks · data movement

The pith

Presto can run GPU operators with efficient data movement to deliver up to 6x cost/performance gains on analytical benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Presto, a distributed SQL query engine, to execute operators on GPUs. It solves two main problems: moving data from storage into GPU memory and allowing operators to exchange data while staying in GPU memory during distributed execution. Experiments with the cuDF library on TPC-H queries guided the design choices and architectural changes needed inside the existing Presto framework. The result is a system that shows substantial cost and speed improvements over CPU-only Presto on standard benchmarks, with the code released in open-source Presto/Velox.

Core claim

We extended Presto to be GPU-aware by building mechanisms for efficient data transfer from storage to GPU operators and for data exchange between operators without leaving GPU memory, even when queries span multiple nodes. Initial experiments running TPC-H-derived queries on a multi-GPU cluster with cuDF measured the effects of different architectures and configurations. These measurements informed the integration of GPU execution paths into Presto, producing up to 6x cost/performance improvements over CPU Presto on standard analytical benchmarks.

What carries the argument

GPU-aware Presto extensions that manage data ingestion to GPU operators and inter-operator exchanges while keeping data resident in GPU memory across distributed nodes.
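
To make the role of GPU-resident exchange concrete, here is a minimal Python sketch of a toy cost model. It is not code from the paper or from Presto/Velox. It assumes that data read from storage first lands in host memory, that every operator needs its input in the memory of the device it runs on, and that an exchange without GPU-resident support stages its data through host memory. The `Operator` class, the `count_copies` function, and the example plan are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """One step of a query plan in this toy model (hypothetical type)."""
    name: str
    device: str               # "gpu" or "cpu"
    is_exchange: bool = False


def count_copies(plan, gpu_resident_exchange):
    """Count host<->device copies needed to run the plan in order.

    Assumption: input data starts in host memory. An exchange that is not
    GPU-resident forces the data back to host memory before it is sent.
    """
    location = "host"
    copies = 0
    for op in plan:
        if op.is_exchange and not gpu_resident_exchange:
            needed = "host"
        else:
            needed = "gpu" if op.device == "gpu" else "host"
        if location != needed:
            copies += 1
            location = needed
    return copies


# Hypothetical all-GPU plan with two exchanges between stages.
plan = [
    Operator("scan", "gpu"),
    Operator("filter", "gpu"),
    Operator("exchange", "gpu", is_exchange=True),
    Operator("join", "gpu"),
    Operator("exchange", "gpu", is_exchange=True),
    Operator("aggregate", "gpu"),
]

print(count_copies(plan, gpu_resident_exchange=True))   # 1
print(count_copies(plan, gpu_resident_exchange=False))  # 5
```

In this model the plan needs a single host-to-device copy when exchanges stay in GPU memory, and five copies when each exchange bounces through host memory. The real system's behavior depends on details the summary does not give, so the numbers only illustrate why the paper treats exchange as a critical challenge.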

If this is right

  • Analytical queries achieve up to 6x better cost/performance than CPU Presto on TPC-H benchmarks.
  • Data can remain in GPU memory during exchanges between operators even in distributed settings.
  • The open-source changes become available for production workloads through Presto/Velox.
  • GPU execution paths integrate into the existing Presto framework without replacing the entire engine.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar data-movement patterns could reduce costs in other distributed SQL engines that adopt GPU operators.
  • Mixed CPU/GPU query plans may need additional scheduling logic when production data distributions differ from benchmarks.
  • Larger GPU clusters could expose new bottlenecks in cross-node data exchange that small-scale tests do not reveal.

Load-bearing premise

Performance characteristics measured in isolated cuDF experiments on TPC-H queries will carry over to the integrated Presto system when it runs full distributed queries that mix CPU and GPU paths on real production data volumes.

What would settle it

Measure end-to-end runtime and cost of the integrated Presto system on a multi-GPU cluster executing full distributed queries against the same workload run on CPU-only Presto, using production-scale data volumes rather than benchmark subsets.
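
As a rough illustration of how such a comparison could be scored, the following Python sketch computes the cost of one workload run and the resulting cost/performance ratio. The formula (runtime times hourly price times node count) is a common convention and an assumption here, not the paper's stated methodology. All runtimes, prices, and node counts below are made-up placeholder values, and the function names are hypothetical.

```python
def cost_per_run(runtime_seconds, hourly_price_usd, node_count):
    """Cost in USD of running the workload once on a cluster.

    Assumes on-demand pricing billed per node-hour with no minimum charge.
    """
    return runtime_seconds / 3600.0 * hourly_price_usd * node_count


def cost_performance_gain(cpu_cost_usd, gpu_cost_usd):
    """Ratio > 1 means the GPU configuration is cheaper per workload run."""
    return cpu_cost_usd / gpu_cost_usd


# Placeholder inputs, NOT measurements from the paper.
cpu_cost = cost_per_run(runtime_seconds=1800, hourly_price_usd=2.0, node_count=4)
gpu_cost = cost_per_run(runtime_seconds=300, hourly_price_usd=12.0, node_count=2)

print(f"CPU cost per run: ${cpu_cost:.2f}")
print(f"GPU cost per run: ${gpu_cost:.2f}")
print(f"Cost/performance gain: {cost_performance_gain(cpu_cost, gpu_cost):.2f}x")
```

With these placeholder inputs the GPU cluster is six times faster but only twice as cost-effective, which shows why runtime and price have to be reported together.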

Figures

Figures reproduced from arXiv: 2606.24647 by Daniel Bauer, Deepak Majeti, Devavret Makkar, Greg Kimball, Karthikeyan Natarajan, Luis Garces-Erice, Sean Rooney, Todd Mostak, Zoltan Arnold Nagy.

Figure 1: Overview of Software Architecture for Motivating Experiments
Figure 3: Velox CudfVector Data Flow – table with stream
Figure 4: UcxExchange Architecture encapsulates the UCX context, worker and listener. The communicator creates the UCX listener for incoming connection requests and also registers an active message handler for handling the initial request for data from receivers. This handshake contains sufficient information to uniquely identify the receiver. At each worker the communicator maintains a list of remote UCX endpoints…
Figure 5: Query execution time (seconds) for all 22 TPC-H queries at SF=1000 on 8
Figure 6: Q5 execution time (seconds) on 8×A100 GPUs across scale factors SF=1000–10000, comparing HttpExchange vs UcxExchange. UcxExchange maintains >10× speedup at all scale factors. In order to show how performance varies across scale factor, we chose a query that was join-heavy, but ran on our infrastructure up to scale factor SF=10,000
Figure 7: Weak-scaling analysis: TPC-H benchmark on A100 GPUs where data size and worker count increase proportionally
Figure 8: Total TPC-H runtime by GPU configuration on single servers with modern GPUs. Scale factors range from 1K to 30K
Figure 9: Price-performance Presto GPU advantage on AWS.
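
The weak-scaling setup named in the Figure 7 caption (data size and worker count growing proportionally) is usually summarized as an efficiency relative to the smallest configuration: with perfect weak scaling the runtime stays constant, so efficiency is the baseline runtime divided by the runtime at N workers. The Python sketch below shows that calculation. The function name is hypothetical and the runtimes are placeholder values, not numbers read from the figure.

```python
def weak_scaling_efficiency(runtime_by_workers):
    """Map worker count -> efficiency relative to the smallest configuration.

    Assumes the data size was scaled in proportion to the worker count,
    so an ideal system would keep the runtime constant (efficiency 1.0).
    """
    base_workers = min(runtime_by_workers)
    base_runtime = runtime_by_workers[base_workers]
    return {
        workers: base_runtime / runtime
        for workers, runtime in sorted(runtime_by_workers.items())
    }


# Placeholder runtimes in seconds, NOT taken from the paper.
runtimes = {1: 100.0, 2: 110.0, 4: 125.0, 8: 160.0}

for workers, efficiency in weak_scaling_efficiency(runtimes).items():
    print(f"{workers} workers: efficiency {efficiency:.3f}")
```

An efficiency that falls as workers are added would point at growing exchange or coordination cost, which is the kind of bottleneck a cross-node exchange design is meant to limit.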
Original abstract

We describe how we extended Presto to be GPU-aware. We focus on two critical challenges: efficiently moving data from storage to GPU operators, and enabling data exchange between operators without leaving GPU memory even when a query is distributed. To guide our design, we conducted a series of initial experiments in which we executed queries derived from the TPC-H benchmark on a multi-GPU cluster using NVIDIA's C++ cuDF data-frame library, and measured how different architectures and configurations influenced performance. We show how these insights inform our extensions to Presto, detailing the architectural changes required to integrate GPU execution into the existing Presto framework. Our initial evaluation demonstrates substantial cost/performance (up to 6x) improvements over CPU Presto on standard analytical benchmarks. Our code is available as part of open-source Presto/Velox, and we have started to use it to run customer production workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript describes extensions to the Presto query engine to support GPU execution via NVIDIA cuDF. It identifies two key challenges—data movement from storage to GPU operators and GPU-resident data exchange for distributed queries—and uses TPC-H-derived experiments on a multi-GPU cluster to inform the design. Architectural changes for integration into Presto are detailed, and an initial evaluation is reported to show up to 6x cost/performance gains over CPU Presto on analytical benchmarks, with the implementation open-sourced in Presto/Velox and already deployed on some production workloads.

Significance. If the reported gains are shown to hold for the integrated Presto system under distributed mixed CPU/GPU workloads, the work would be significant for accelerating analytical query engines, offering a practical path to GPU offloading in production systems like Presto. The open-sourcing of the code and mention of production use are strengths that support reproducibility and real-world relevance.

major comments (1)
  1. [Abstract] The central claim of 'up to 6x' cost/performance improvements over CPU Presto is load-bearing, yet the text supplies no end-to-end measurements for the modified Presto engine on full distributed queries that cross CPU/GPU boundaries. The 6x figure is presented as arising from standalone cuDF TPC-H experiments; without numbers that include storage-to-GPU transfer, operator handoff, and partial CPU fallback overheads, it is unclear whether the integrated system achieves the claimed gains.
minor comments (1)
  1. The manuscript would benefit from explicit enumeration of the exact query set, hardware configuration (GPU count, interconnect), and baseline Presto version used in the initial evaluation, even if only summarized.
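
The concern in the major comment can be illustrated with an Amdahl's-law style estimate: if only part of a query's runtime is accelerated, and integration adds transfer or hand-off overhead, the end-to-end speedup is lower than the speedup of the accelerated operators alone. The Python sketch below is a generic back-of-the-envelope model, not an analysis from the paper. The function name, the overhead term, and all input values are assumptions chosen for illustration.

```python
def end_to_end_speedup(accelerated_fraction, kernel_speedup, overhead_fraction=0.0):
    """Estimate whole-query speedup from an operator-level speedup.

    accelerated_fraction: share of the original CPU runtime spent in
        operators that move to the GPU (0.0 to 1.0).
    kernel_speedup: speedup of those operators in isolation.
    overhead_fraction: extra time added by transfers, hand-offs, or CPU
        fallback, expressed as a share of the original CPU runtime.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    new_runtime = (
        (1.0 - accelerated_fraction)
        + accelerated_fraction / kernel_speedup
        + overhead_fraction
    )
    return 1.0 / new_runtime


# Placeholder inputs, NOT measurements from the paper.
print(f"{end_to_end_speedup(0.9, 10.0):.2f}x without integration overhead")
print(f"{end_to_end_speedup(0.9, 10.0, overhead_fraction=0.05):.2f}x with 5% overhead")
```

Under these placeholder inputs a 10x operator-level speedup yields roughly 5.3x end to end, and about 4.2x once a 5% overhead is added. This is a runtime estimate only; a cost/performance figure would also need hardware prices.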

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the constructive comment on the abstract. We address it point by point below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'up to 6x' cost/performance improvements over CPU Presto is load-bearing, yet the text supplies no end-to-end measurements for the modified Presto engine on full distributed queries that cross CPU/GPU boundaries. The 6x figure is presented as arising from standalone cuDF TPC-H experiments; without numbers that include storage-to-GPU transfer, operator handoff, and partial CPU fallback overheads, it is unclear whether the integrated system achieves the claimed gains.

    Authors: We agree that the abstract as written can be read as claiming the 6x figure for the integrated Presto system. In fact the reported measurements come from the standalone cuDF TPC-H experiments that were performed to guide the design of data movement and GPU-resident exchange. The manuscript describes the required architectural changes to Presto but does not include end-to-end timings of the modified engine that incorporate storage-to-GPU transfers, operator hand-off, or CPU fallback. We will revise the abstract (and the corresponding evaluation paragraph) to state explicitly that the 6x gains are measured in the cuDF experiments and that full-system evaluation of the Presto integration remains future work. This change will be incorporated in the next revision.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering paper with external benchmark claims

full rationale

The paper describes a GPU extension to Presto and reports measured speedups on analytical benchmarks. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. Claims rest on direct experimental comparisons to CPU Presto rather than any internal redefinition or prediction-by-construction. This is the expected non-finding for an implementation-focused systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on existing GPU hardware and the cuDF library rather than introducing new theoretical elements; no free parameters or invented entities are required for the central claim.

axioms (1)
  • Domain assumption: NVIDIA GPUs and the cuDF library deliver efficient execution for the analytical operators tested in TPC-H queries.
    The design choices and performance expectations depend on the capabilities of this specific hardware and library stack.


