Accelerating Presto with GPUs
Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3
The pith
Presto can run GPU operators with efficient data movement to deliver up to 6x cost/performance gains on analytical benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extended Presto to be GPU-aware by building mechanisms for efficient data transfer from storage to GPU operators and for data exchange between operators without leaving GPU memory, even when queries span multiple nodes. Initial experiments running TPC-H-derived queries on a multi-GPU cluster with cuDF measured the effects of different architectures and configurations. These measurements informed the integration of GPU execution paths into Presto, producing up to 6x cost/performance improvements over CPU Presto on standard analytical benchmarks.
What carries the argument
GPU-aware Presto extensions that manage data ingestion to GPU operators and inter-operator exchanges while keeping data resident in GPU memory across distributed nodes.
If this is right
- Analytical queries achieve up to 6x better cost/performance than CPU Presto on TPC-H benchmarks.
- Data can remain in GPU memory during exchanges between operators even in distributed settings.
- The open-source changes become available for production workloads through Presto/Velox.
- GPU execution paths integrate into the existing Presto framework without replacing the entire engine.
Where Pith is reading between the lines
- Similar data-movement patterns could reduce costs in other distributed SQL engines that adopt GPU operators.
- Mixed CPU/GPU query plans may need additional scheduling logic when production data distributions differ from benchmarks.
- Larger GPU clusters could expose new bottlenecks in cross-node data exchange that small-scale tests do not reveal.
Load-bearing premise
Performance characteristics measured in isolated cuDF experiments on TPC-H queries will carry over to the integrated Presto system when it runs full distributed queries that mix CPU and GPU paths on real production data volumes.
What would settle it
Measure end-to-end runtime and cost of the integrated Presto system on a multi-GPU cluster executing full distributed queries against the same workload run on CPU-only Presto, using production-scale data volumes rather than benchmark subsets.
Figures
read the original abstract
We describe how we extended Presto to be GPU-aware. We focus on two critical challenges: efficiently moving data from storage to GPU operators, and enabling data exchange between operators without leaving GPU memory even when a query is distributed. To guide our design, we conducted a series of initial experiments in which we executed queries derived from the TPC-H benchmark on a multi-GPU cluster using NVIDIA's C++ cuDF data-frame library, and measured how different architectures and configurations influenced performance. We show how these insights inform our extensions to Presto, detailing the architectural changes required to integrate GPU execution into the existing Presto framework. Our initial evaluation demonstrates substantial cost/performance (up to 6x) improvements over CPU Presto on standard analytical benchmarks. Our code is available as part of open-source Presto/Velox, and we have started to use it to run customer production workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes extensions to the Presto query engine to support GPU execution via NVIDIA cuDF. It identifies two key challenges—data movement from storage to GPU operators and GPU-resident data exchange for distributed queries—and uses TPC-H-derived experiments on a multi-GPU cluster to inform the design. Architectural changes for integration into Presto are detailed, and an initial evaluation is reported to show up to 6x cost/performance gains over CPU Presto on analytical benchmarks, with the implementation open-sourced in Presto/Velox and already deployed on some production workloads.
Significance. If the reported gains are shown to hold for the integrated Presto system under distributed mixed CPU/GPU workloads, the work would be significant for accelerating analytical query engines, offering a practical path to GPU offloading in production systems like Presto. The open-sourcing of the code and mention of production use are strengths that support reproducibility and real-world relevance.
major comments (1)
- [Abstract] Abstract: the central claim of 'up to 6x' cost/performance improvements over CPU Presto is load-bearing, yet the text supplies no end-to-end measurements for the modified Presto engine on full distributed queries that cross CPU/GPU boundaries. The 6x figure is presented as arising from standalone cuDF TPC-H experiments; without numbers that include storage-to-GPU transfer, operator handoff, and partial CPU fallback overheads, it is unclear whether the integrated system achieves the claimed gains.
minor comments (1)
- The manuscript would benefit from explicit enumeration of the exact query set, hardware configuration (GPU count, interconnect), and baseline Presto version used in the initial evaluation, even if only summarized.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive comment on the abstract. We address it point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of 'up to 6x' cost/performance improvements over CPU Presto is load-bearing, yet the text supplies no end-to-end measurements for the modified Presto engine on full distributed queries that cross CPU/GPU boundaries. The 6x figure is presented as arising from standalone cuDF TPC-H experiments; without numbers that include storage-to-GPU transfer, operator handoff, and partial CPU fallback overheads, it is unclear whether the integrated system achieves the claimed gains.
Authors: We agree that the abstract as written can be read as claiming the 6x figure for the integrated Presto system. In fact the reported measurements come from the standalone cuDF TPC-H experiments that were performed to guide the design of data movement and GPU-resident exchange. The manuscript describes the required architectural changes to Presto but does not include end-to-end timings of the modified engine that incorporate storage-to-GPU transfers, operator hand-off, or CPU fallback. We will revise the abstract (and the corresponding evaluation paragraph) to state explicitly that the 6x gains are measured in the cuDF experiments and that full-system evaluation of the Presto integration remains future work. This change will be incorporated in the next revision. revision: yes
Circularity Check
No circularity: empirical engineering paper with external benchmark claims
full rationale
The paper describes a GPU extension to Presto and reports measured speedups on analytical benchmarks. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. Claims rest on direct experimental comparisons to CPU Presto rather than any internal redefinition or prediction-by-construction. This is the expected non-finding for an implementation-focused systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption NVIDIA GPUs and the cuDF library deliver efficient execution for the analytical operators tested in TPC-H queries
Reference graph
Works this paper leans on
-
[1]
Tanveer Ahmad. 2022. Benchmarking Apache Arrow Flight - A wire-speed protocol for data transfer, querying and microservices. InBenchmarking in the Data Center: Expanding to the Cloud(Seoul, Republic of Korea)(BID’22). As- sociation for Computing Machinery, New York, NY, USA, Article 1, 10 pages. https://doi.org/10.1145/3527199.3527264
-
[2]
2021.Amazon Athena – Big Data Analytics Options on A WS
Amazon. 2021.Amazon Athena – Big Data Analytics Options on A WS. Technical Report. Amazon Web Services. https://docs.aws.amazon.com/whitepapers/ latest/big-data-analytics-options/amazon-athena.html Amazon Athena uses Presto with full ANSI SQL support
2021
-
[3]
Felipe Aramburú, William Malpica, Kaouther Abrougui, Amin Aramoon, Romulo Auccapuclla, Claude Brisson, Matthijs Brobbel, Colby Farrell, Pradeep Garigipati, Joost Hoozemans, Supun Kamburugamuve, Akhil Nair, Alexander Ocsa, Johan Peltenburg, Rubén Quesada López, Deepak Sihag, Ahmet Uyar, Dhruv Vats, Michael Wendt, Jignesh M. Patel, and Rodrigo Aramburú. 202...
arXiv 2025
-
[4]
BlazingSQL Team. 2020. BlazingSQL: A GPU-Accelerated SQL Engine Built on RAPIDS. https://github.com/BlazingDB/blazingsql Open-source GPU SQL engine built on cuDF and Apache Arrow
2020
-
[5]
Peter Boncz, Thomas Neumann, and Orri Erling. 2014. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. InTechnology Conference on Performance Evaluation and Benchmarking. 61–76. https://doi. org/10.1007/978-3-319-04936-6_5
-
[6]
Sebastian Breß, Max Heimel, Norbert Siegmund, Ladjel Bellatreche, and Gunter Saake. 2014. Exploring the Design Space of a GPU-Aware Database Architecture. InNew Trends in Databases and Information Systems, Barbara Catania, Tania Cerquitelli, Silvia Chiusano, Giovanna Guerrini, Mirko Kämpf, Alfons Kemper, Boris Novikov, Themis Palpanas, Jaroslav Pokorný, a...
2014
-
[7]
2025.TPC-H DBGEN
Electrum. 2025.TPC-H DBGEN. https://github.com/electrum/tpch-dbgen
2025
-
[8]
1994.MPI: A Message-Passing Interface Stan- dard
Message Passing Interface Forum. 1994.MPI: A Message-Passing Interface Stan- dard. Technical Report CS-94-230. University of Tennessee, Knoxville, TN, USA. https://www.mpi-forum.org/docs/mpi-1.1/mpi-11.ps
1994
-
[9]
Hao Gao and Nikolai Sakharnykh. 2021. Scaling Joins to a Thousand GPUs. In ADMS@VLDB. https://api.semanticscholar.org/CorpusID:237250537
2021
-
[10]
Max Heimel, Michael Saecker, Holger Pirk, Stefan Manegold, and Volker Markl
-
[11]
InProceed- ings of the VLDB Endowment, Vol
Hardware-Oblivious Parallelism for In-Memory Column-Stores. InProceed- ings of the VLDB Endowment, Vol. 6. 709–720. https://doi.org/10.14778/2536360. 2536370
-
[12]
Yoav Helfman. 2024. Nimble, A New Columnar File Format. https://www. youtube.com/watch?v=bISBNVtXZ6M. Presentation at Meta
2024
-
[13]
Kohei KaiGai. 2016. PG-Strom: GPU Accelerated Sequential Scan and JOIN for PostgreSQL. InPGCon 2016. https://heterodb.github.io/pg-strom/
2016
-
[14]
Dimitrios Koutsoukos, Ingo Müller, Renato Marroquín, Ana Klimovic, and Gus- tavo Alonso. 2021. Modularis: modular relational analytics over heteroge- neous distributed platforms.Proc. VLDB Endow.14, 13 (Sept. 2021), 3308–3321. https://doi.org/10.14778/3484224.3484229
-
[15]
KvikIO Development Team. 2025. KvikIO’s C++ documentation. https://docs. rapids.ai/api/libkvikio/nightly/
2025
-
[16]
Yinan Li, Bailu Ding, Ziyun Wei, Lukas M. Maas, Momin Al-Ghosien, Spyros Blanas, Nicolas Bruno, Carlo Curino, Matteo Interlandi, Craig Peeper, Kaushik Rajan, Surajit Chaudhuri, and Johannes Gehrke. 2025. Scaling GPU-Accelerated Databases Beyond GPU Memory Size.Proc. VLDB Endow.18, 11 (Sept. 2025), 4518–4531. https://doi.org/10.14778/3749646.3749710
-
[17]
Todd Mostak. 2017. An Overview of MapD (now HeavyDB): A GPU-Powered Analytics Platform. (2017). https://www.heavy.ai/ HeavyDB (formerly MapD/OmniSciDB) is a GPU-accelerated SQL database for analytics
2017
-
[18]
NVIDIA OpenSHMEM Development Team. 2025. NVIDIA OpenSHMEM Li- brary (NVSHMEM) Documentation. https://docs.open-mpi.org/en/v5.0.3/man- openshmem/index.html
2025
-
[19]
Pedro Pedreira, Orri Erling, Masha Basmanova, Kevin Wilfong, Laith Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. 2022. Velox: meta’s unified execution engine.Proc. VLDB Endow.15, 12 (Aug. 2022), 3372–3384. https: //doi.org/10.14778/3554821.3554829
-
[20]
RAPIDS Development Team. 2018. RAPIDS: Open GPU Data Science. https: //docs.rapids.ai/api/cudf/stable/
2018
-
[21]
Amazon Web Services. 2025. AWS Pricing. https://aws.amazon.com/pricing/. Accessed: 2025-11-28
2025
-
[22]
Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A Study of the Funda- mental Performance Characteristics of GPUs and CPUs for Database Analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data(Portland, OR, USA)(SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1617–1632. https://doi.org/10.114...
-
[23]
Akash Shankaran, George Gu, Weiting Chen, Binwei Yang, Chidamber Kulka- rni, Mark Rambacher, Nesime Tatbul, and David E. Cohen. 2023. The Gluten Open-Source Software Project: Modernizing Java-based Query Engines for the Lakehouse Era. InJoint Proceedings of Workshops at the 49th International Confer- ence on Very Large Data Bases (VLDB 2023)(Vancouver, Ca...
2023
-
[24]
Agarwal, Ashish Mittal, Saksham Chintalapani, Rekha Singhal, and Biswapesh Chatterjee
Ankit Shankhdhar, Saurabh Sethia, Hemant Sharma, Pulkit Salecha, Chunxu Zhang, Beinan Chen, Neha Jain, Qi Yan, Reetika Sethi, Abhisek Agrawal, Tim Park, Venkata Koganti, Sreeni Prasad, Arunachalam Sankar, Yi Xu, Manoj K. Agarwal, Ashish Mittal, Saksham Chintalapani, Rekha Singhal, and Biswapesh Chatterjee. 2024. HBO: History-Based Query Optimization in Pr...
-
[25]
Harshit Sharma and Anmol Sharma. 2024. A Comprehensive Overview of GPU Accelerated Databases. arXiv:2406.13831 [cs.DB] https://arxiv.org/abs/2406. 13831
arXiv 2024
-
[26]
Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cher- niack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2005. C-store: A Column-oriented DBMS. InProceedings of the 31st International Conference on Very Large Data Bases (VLDB ’05). VLDB Endowment, 553–564. h...
arXiv 2005
-
[27]
Yutian Sun, Tim Meehan, Rebecca Schlussel, Wenlei Xie, Masha Basmanova, Orri Erling, Andrii Rosa, Shixuan Fan, Rongrong Zhong, Arun Thirupathi, Nikhil Collooru, Ke Wang, Sameer Agarwal, Arjun Gupta, Dionysios Logothetis, Kostas Xirogiannopoulos, Amit Dutta, Varun Gajjala, Rohit Jain, Ajay Palakuzhy, Prithvi Pandian, Sergey Pershin, Abhisek Saikia, Pranjal...
-
[28]
The Unified Communication X Library
The Unified Communication X Library [n.d.]. The Unified Communication X Library. http://www.openucx.org
-
[29]
X, formerly Twitter
X, formerly Twitter 2013.Announcing Parquet 1.0: Columnar Storage for Hadoop. X, formerly Twitter. https://blog.x.com/engineering/en_us/a/2013/announcing- parquet-10-columnar-storage-for-hadoop Version 1.0
2013
-
[30]
Bobbi Yogatama, Yifei Yang, Kevin Kristensen, Devesh Sarda, Abigale Kim, Adrian Cockcroft, Yu Teng, Joshua Patterson, Gregory Kimball, Wes McKinney, Weiwei Gong, and Xiangyao Yu. 2025. Rethinking Analytical Processing in the GPU Era. arXiv:2508.04701 [cs.DB] https://arxiv.org/abs/2508.04701
arXiv 2025
-
[31]
Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2013. The Yin and Yang of Pro- cessing Data Warehousing Queries on GPU Devices. InProceedings of the VLDB Endowment, Vol. 6. 817–828. https://doi.org/10.14778/2536206.2536210 13
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.