PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

Carsten Binnig; Jigao Luo; Muhammad El-Hindi; Nils Boeschen

arxiv: 2512.02862 · v3 · pith:JQQWWX4Bnew · submitted 2025-12-02 · 💻 cs.DB

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

Jigao Luo , Nils Boeschen , Muhammad El-Hindi , Carsten Binnig This is my paper

Pith reviewed 2026-05-21 18:48 UTC · model grok-4.3

classification 💻 cs.DB

keywords PyTorchdistributed query processingGPU accelerationOLAPRDMA networksNVMe storageI/O overlapstorage-resident data

0 comments

The pith

PystachIO optimizes PyTorch I/O overlap to deliver up to 3x speedups for distributed GPU query processing on storage-resident OLAP data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how tensor computation runtimes such as PyTorch can support distributed query processing for large-scale OLAP workloads where data exceeds aggregated GPU memory and must reside on fast storage. It finds that direct use of PyTorch abstractions for RDMA networks and NVMe storage leaves GPU and I/O bandwidth underutilized because computation and data movement do not overlap enough. PystachIO adds targeted optimizations to improve this overlap and reports concrete end-to-end gains. A sympathetic reader would care because the work shows a route to running analytical queries on the same GPU-centric hardware already deployed for AI without requiring all data to fit in memory.

Core claim

PystachIO is a PyTorch-based distributed OLAP engine that pairs fast RDMA network and high-bandwidth NVMe storage I/O with optimizations that maximize overlap between computation and data movement, thereby raising utilization of GPU, network, and storage resources and producing up to 3x end-to-end speedups over existing distributed GPU query processing approaches.

What carries the argument

A collection of PyTorch-level optimizations that enforce sufficient overlap of computation with network and storage transfers over RDMA and NVMe.

If this is right

Up to 3x end-to-end speedups over prior distributed GPU approaches become attainable.
Query processing scales to storage-resident data that exceeds total GPU memory.
GPU, network, and storage bandwidth see higher utilization in distributed GPU clusters.
AI tensor runtimes become practical bases for analytical database engines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlap techniques could allow AI training and analytics jobs to share GPU clusters and fast interconnects without separate infrastructure.
Porting the approach to other tensor runtimes would test whether the gains are specific to PyTorch or more general.
Measuring performance across query mixes with varying data skew or network latency would clarify which optimizations matter most in practice.

Load-bearing premise

The proposed optimizations will reliably increase bandwidth utilization across different workloads and hardware configurations without creating new bottlenecks.

What would settle it

Running the same queries and hardware with the PystachIO optimizations applied but measuring no improvement or a slowdown relative to naive PyTorch I/O would show the central performance claim does not hold.

Figures

Figures reproduced from arXiv: 2512.02862 by Carsten Binnig, Jigao Luo, Muhammad El-Hindi, Nils Boeschen.

**Figure 2.** Figure 2: PystachIO: a database architecture that exploits fast [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: TCRs provide fast networking and storage primi [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Naive and optimized variants of a distributed hash [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: Distributed join performance on 2 GPU nodes for [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Timeline of bandwidth monitoring of one involved NIC during different distributed join variants. 2 Nodes, 120M/160M [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Impact of transfer size and concurrent computation [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 10.** Figure 10: GPU Parquet reader scan on TPC-H SF300 lineitem: storage bus bandwidth monitored during reading, with different optimizations. 8 threads are used in all cases except for the blocking reader. Fully Overlapped Parquet Reader Kernels RG0 Kernels RG1 I/O RG0 I/O RG2 3: Kernels of all RGs Blocking Parquet Reader: I/O and GPU Running In Sequence + Metadata Caching single-threaded multi-threaded Kernels RG2 Time… view at source ↗

**Figure 12.** Figure 12: GPU Parquet reader scan on TPC-H SF300 lineitem: average storage bus bandwidth under different optimizations. Left: 8 threads are used in all cases except for the blocking reader. Right: Thread scalability from 1 to 8. GPU memory, marked with • in [PITH_FULL_IMAGE:figures/full_fig_p007_12.png] view at source ↗

**Figure 11.** Figure 11: GPU Parquet reader: from blocking to fully over [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗

**Figure 13.** Figure 13: GPU Parquet reader scan on TPC-H SF300 lineitem: scalability across multiple SSDs compared with the CPU baseline (128 threads). Left: storage bus bandwidth. Right: effective storage bandwidth. producing millions of such synchronization triggers during multistream reading. To eliminate this bottleneck and achieve full GPU saturation, we enabled CUDA pinned memory and replaced all missync-inducing operati… view at source ↗

**Figure 14.** Figure 14: GPU Parquet reader scan on TPC-H SF300 lineitem: scalability across multiple dataset SFs, compared with the cuDF blocking reader using 1 SSD and 4 SSDs. rewrites [58]. To demonstrate PystachIO’s capability as a TCRbacked storage engine, we further evaluate its scalability with all optimizations enabled, showing that PystachIO can serve as a highthroughput database storage for OLAP using TCRs. Scalabilit… view at source ↗

**Figure 16.** Figure 16: TPC-H Query 3 runtime executed on 2 GPUs with [PITH_FULL_IMAGE:figures/full_fig_p010_16.png] view at source ↗

**Figure 15.** Figure 15: Query processing in TCRs: from fully blocking to [PITH_FULL_IMAGE:figures/full_fig_p010_15.png] view at source ↗

**Figure 17.** Figure 17: Dataset scalability up to TPC-H SF1000: query [PITH_FULL_IMAGE:figures/full_fig_p011_17.png] view at source ↗

**Figure 18.** Figure 18: TPC-H SF500 runtime breakdown with 2 GPUs. [PITH_FULL_IMAGE:figures/full_fig_p012_18.png] view at source ↗

**Figure 19.** Figure 19: GPU-node scalability up to 8 GPUs: query perfor [PITH_FULL_IMAGE:figures/full_fig_p012_19.png] view at source ↗

read the original abstract

The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a prototype of a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents PystachIO, a PyTorch-based prototype for distributed OLAP query processing on storage-resident data in GPU clusters equipped with RDMA networks and high-bandwidth NVMe storage. It argues that naive use of PyTorch I/O abstractions underutilizes GPU and bandwidth resources due to insufficient overlap of computation and data movement, and introduces optimizations (including custom CUDA streams for RDMA overlap and prefetch scheduling) to address this. The central empirical claim is that these techniques deliver up to 3x end-to-end speedups relative to prior distributed GPU query engines.

Significance. If the speedups prove robust and attributable to the described I/O-overlap techniques rather than implementation artifacts, the work would be significant for showing how tensor runtimes originally built for AI can be adapted to out-of-core analytical workloads at scale. This could help bridge the gap between ML frameworks and database systems in modern GPU-centric data centers, with potential for broader adoption of PyTorch-style abstractions in query engines.

major comments (1)

[Evaluation] Evaluation section: The central claim of up to 3x end-to-end speedups is presented without reported details on workloads, data sizes, baselines, or measurement methodology. Without an ablation study that disables individual optimizations (e.g., RDMA overlap via custom CUDA streams or prefetch scheduling) while holding the rest of the PyTorch runtime fixed, or without re-implementing at least one baseline inside the identical PyTorch stack, it is impossible to attribute gains specifically to the proposed I/O fixes versus differences in query planning, data layout, or engineering effort.

minor comments (2)

[Introduction] The abstract and introduction would benefit from a concise table or paragraph summarizing the exact set of proposed optimizations and how each targets a specific underutilization issue.
[Methods] Notation for network and storage bandwidth utilization metrics should be defined explicitly before their first use in the methods or evaluation sections to improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that the evaluation section requires additional details and an ablation study to strengthen attribution of the reported speedups to the I/O optimizations. We will revise the manuscript to address these points.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The central claim of up to 3x end-to-end speedups is presented without reported details on workloads, data sizes, baselines, or measurement methodology. Without an ablation study that disables individual optimizations (e.g., RDMA overlap via custom CUDA streams or prefetch scheduling) while holding the rest of the PyTorch runtime fixed, or without re-implementing at least one baseline inside the identical PyTorch stack, it is impossible to attribute gains specifically to the proposed I/O fixes versus differences in query planning, data layout, or engineering effort.

Authors: We agree that the current manuscript lacks sufficient detail in the evaluation section, which limits the ability to attribute gains specifically to the I/O-overlap techniques. In the revision we will expand this section to report the exact workloads and queries, data sizes and scale factors, baseline systems with versions and configurations, and the full measurement methodology including hardware, repetition counts, and timing procedures. We will also add an ablation study that disables RDMA overlap via custom CUDA streams and prefetch scheduling independently while holding the remainder of the PyTorch runtime fixed. This will quantify the incremental benefit of each optimization. Re-implementing an external baseline inside the PyTorch stack is outside the scope of the work and would require substantial unrelated engineering; the internal ablation study addresses the core attribution concern without that requirement. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with independent benchmarks

full rationale

The paper describes a prototype system (PystachIO) and reports measured end-to-end speedups on storage-resident OLAP workloads. No mathematical derivation chain, fitted parameters, or equations appear in the abstract or context. The central claim rests on direct experimental comparison rather than any quantity defined in terms of itself or reduced via self-citation to the authors' prior inputs. The evaluation is therefore self-contained against external hardware and baseline systems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper. No free parameters, mathematical axioms, or invented entities are described in the abstract; the contribution is a prototype implementation and performance measurements rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5740 in / 1064 out tokens · 41936 ms · 2026-05-21T18:48:18.912033+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PystachIO ... overlaps storage I/O, networking, and computation ... chunking ... deferred synchronization ... reader combining ... dynamic buffering
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

naive use of TCR I/O abstractions underutilizes GPU and I/O bandwidth

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Daniel Abadi. 2025. Parquet and ORC’s many shortfalls for machine learning (ML) workloads, and what should be done about them. https://www.starburst. io/blog/parquet-orc-machine-learning/

work page 2025
[2]

Azim Afroozeh, Lotte Felius, and Peter Boncz. 2024. Accelerating GPU Data Processing using FastLanes Compression. InProceedings of the 20th International Workshop on Data Management on New Hardware, DaMoN 2024, Santiago, Chile, 10 June 2024, Carsten Binnig and Nesime Tatbul (Eds.). ACM, 8:1–8:11. https: //doi.org/10.1145/3662010.3663450

work page doi:10.1145/3662010.3663450 2024
[3]

Patel, and Rodrigo Aramburú

Felipe Aramburú, William Malpica, Kaouther Abrougui, Amin Aramoon, Romulo Auccapuclla, Claude Brisson, Matthijs Brobbel, Colby Farrell, Pradeep Garigipati, Joost Hoozemans, Supun Kamburugamuve, Akhil Nair, Alexander Ocsa, Johan Peltenburg, Rubén Quesada López, Deepak Sihag, Ahmet Uyar, Dhruv Vats, Michael Wendt, Jignesh M. Patel, and Rodrigo Aramburú. 202...

work page arXiv 2025
[4]

Claude Barthels, Gustavo Alonso, Torsten Hoefler, Timo Schneider, and Ingo Müller. 2017. Distributed Join Algorithms on Thousands of Cores.Proc. VLDB Endow.10, 5 (2017), 517–528. https://doi.org/10.14778/3055540.3055545

work page doi:10.14778/3055540.3055545 2017
[5]

Nils Boeschen, Tobias Ziegler, and Carsten Binnig. 2024. GOLAP: A GPU-in- Data-Path Architecture for High-Speed OLAP.Proc. ACM Manag. Data2, 6 (2024), 237:1–237:26. https://doi.org/10.1145/3698812

work page doi:10.1145/3698812 2024
[6]

Sebastian Breß, Henning Funke, and Jens Teubner. 2016. Robust Query Processing in Co-Processor-accelerated Databases. InProceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 1891–1906. https://doi.org/10.114...

work page doi:10.1145/2882903.2882936 2016
[7]

Jiashen Cao, Rathijit Sen, Matteo Interlandi, Joy Arulraj, and Hyesoon Kim. 2023. GPU Database Systems Characterization and Optimization.Proc. VLDB Endow. 17, 3 (2023), 441–454. https://doi.org/10.14778/3632093.3632107

work page doi:10.14778/3632093.3632107 2023
[8]

NVIDIA Corporation. 2025. NVIDIA DGX A100 Hardware Overview. https://docs. nvidia.com/dgx/dgxa100-user-guide/introduction-to-dgxa100.html. Accessed: 2025-11-29

work page 2025
[9]

Microsoft Fabric. 2025. Understand V-Order for Microsoft Fabric Warehouse. https://learn.microsoft.com/en-us/fabric/data-warehouse/v-order. Accessed: 2025-11-27

work page 2025
[10]

Nuno Faria and Andrew Lamb with Apache DataFusion Contributors. 2025. DataFusion Pull Request: Cache Parquet metadata in built in parquet reader. https://github.com/apache/datafusion/pull/16971. Accessed: 2025-11-26

work page 2025
[11]

NumFOCUS Foundation. 2025. NumPy. https://numpy.org/

work page 2025
[12]

NumFOCUS Foundation and Wes McKinney. 2025. pandas - Python Data Analysis Library. https://pandas.pydata.org/

work page 2025
[13]

Hao Gao and Nikolai Sakharnykh. 2021. Scaling Joins to a Thousand GPUs. In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2021, Copen- hagen, Denmark, August 16, 2021, Rajesh Bordawekar and Tirthankar Lahiri (Eds.). 55–64. http://www.adms-conf.org/2021-camera-ready/g...

work page 2021
[14]

Chengxin Guo, Hong Chen, Feng Zhang, and Cuiping Li. 2019. Distributed Join Algorithms on Multi-CPU Clusters with GPUDirect RDMA. InProceedings of the 48th International Conference on Parallel Processing, ICPP 2019, Kyoto, Japan, August 05-08, 2019. ACM, 65:1–65:10. https://doi.org/10.1145/3337821.3337862

work page doi:10.1145/3337821.3337862 2019
[15]

Tushar Gupta. 2024. Evolution of Data Center Design to Handle AI Workloads. In34th International Telecommunication Networks and Applications Conference, ITNAC 2024, Sydney, Australia, November 27-29, 2024. IEEE, 1–8. https://doi.org/ 10.1109/ITNAC62915.2024.10815309

work page doi:10.1109/itnac62915.2024.10815309 2024
[16]

Xiangpeng Hao. 2024. Caching in DataFusion. https://blog.xiangpeng.systems/ posts/caching-datafusion/. Accessed: 2025-11-26

work page 2024
[17]

Xiangpeng Hao and Andrew Lamb. 2024. How Good is Parquet for Wide Tables (Machine Learning Workloads) Really? https://www.influxdata.com/blog/how- good-parquet-wide-tables/

work page 2024
[18]

Xiangpeng Hao, Andrew Lamb, Qi Zhu, Weston Pace, and Jigao Luo. 2025. Project idea: How good is well-configured parquet compared to proprietary file formats? https://github.com/XiangpengHao/liquid-cache/issues/227. Accessed: 2025-11- 27

work page 2025
[19]

Dong He, Supun Chathuranga Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstanti- nos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes.Proc. VLDB Endow.15, 11 (2022), 2811–2825. https: //doi.org/10.14778/3551793.3551833

work page doi:10.14778/3551793.3551833 2022
[20]

Sven Hepkema, Azim Afroozeh, Charlotte Felius, Peter Boncz, and Stefan Manegold. 2025. G-ALP: Rethinking Light-weight Encodings for GPUs. In Proceedings of the 21st International Workshop on Data Management on New Hardware, DaMoN 2025, Berlin, Germany, June 22-27, 2025. ACM, 11:1–11:10. https://doi.org/10.1145/3736227.3736242

work page doi:10.1145/3736227.3736242 2025
[21]

Zezhou Huang, Krystian Sakowski, Hans Lehnert, Wei Cui, Carlo Curino, Matteo Interlandi, Marius Dumitru, and Rathijit Sen. 2025. GPU Acceleration of SQL Analytics on Compressed Data.CoRRabs/2506.10092 (2025). https://doi.org/10. 48550/ARXIV.2506.10092 arXiv:2506.10092

work page arXiv 2025
[22]

Marko Kabic, Shriram Chandran, and Gustavo Alonso. 2025. Maximus: A Modular Accelerated Query Engine for Data Analytics on Heterogeneous Systems.Proc. ACM Manag. Data3, 3 (2025), 187:1–187:25. https://doi.org/10.1145/3725324

work page doi:10.1145/3725324 2025
[23]

Greg Kimball, Zoltan Arnold Nagy, Devavret Makkar, Daniel Bauer, and Chengcheng Jin. 2025. Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF. https://developer.nvidia.com/blog/accelerating-large- scale-data-analytics-with-gpu-native-velox-and-nvidia-cudf/#hybrid_cpu- gpu_execution_in_apache_spark Accessed: 2025-11-26

work page 2025
[24]

Greg Kimball and Karthikeyan Natarajan. 2025. Accelerating Velox with RAPIDS cuDF - VeloxCon 2025. https://prestodb.io/wp-content/uploads/presto-users/ Accelerating-Velox-with-RAPIDS-cuDF-VeloxCon-April-2025.pdf Accessed: 2025-11-26

work page 2025
[25]

Laurens Kuiper. 2025. Query Engines: Gatekeepers of the Parquet File Format. https://duckdb.org/2025/01/22/parquet-encodings. Accessed: 2025-11-27

work page 2025
[26]

Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis

work page
[27]

ACM Manag

BtrBlocks: Efficient Columnar Compression for Data Lakes.Proc. ACM Manag. Data1, 2 (2023), 118:1–118:26. https://doi.org/10.1145/3589263

work page doi:10.1145/3589263 2023
[28]

Andrew Lamb. 2025. Apache Parquet Metadata Parsing Benchmarks. https: //github.com/alamb/parquet_footer_parsing. Accessed: 2025-11-26

work page 2025
[29]

Andrew Lamb, Yijie Shen, Daniël Heres, Jayjeet Chakraborty, Mehmet Ozan Kabak, Liang-Chi Hsieh, and Chao Sun. 2024. Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine. InCompanion of the 2024 International Conference on Management of Data, SIGMOD/PODS 2024, Santiago, Chile, June 9-15, 2024, Pablo Barceló, Nayat Sánchez-Pi, Alexandr...

work page doi:10.1145/3626246.3653368 2024
[30]

Maas, Momin Al-Ghosien, Spyros Blanas, Nicolas Bruno, Carlo Curino, Matteo Interlandi, Craig Peeper, Kaushik Rajan, Surajit Chaudhuri, and Johannes Gehrke

Yinan Li, Bailu Ding, Ziyun Wei, Lukas M. Maas, Momin Al-Ghosien, Spyros Blanas, Nicolas Bruno, Carlo Curino, Matteo Interlandi, Craig Peeper, Kaushik Rajan, Surajit Chaudhuri, and Johannes Gehrke. 2025. Scaling GPU-Accelerated Databases beyond GPU Memory Size.Proc. VLDB Endow.18, 11 (2025), 4518–4531. https://www.vldb.org/pvldb/vol18/p4518-li.pdf

work page 2025
[31]

Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. 2023. A Deep Dive into Common Open Formats for Analytical DBMSs.Proc. VLDB Endow.16, 11 (2023), 3044–3056. https://doi.org/10.14778/3611479.3611507

work page doi:10.14778/3611479.3611507 2023
[32]

Jigao Luo. 2025. cuDF Issue: Unstable pipelining performance in Parquet reading due to mis-sync. https://github.com/rapidsai/cudf/issues/18967. Accessed: 2025-11-26

work page 2025
[33]

Jigao Luo. 2025. [Story] Towards a faster Parquet reader with pipelining and mul- tistream optimization. https://github.com/rapidsai/cudf/issues/18892. Accessed: 2025-11-27

work page 2025
[34]

Jigao Luo and Muhammad Haseeb. 2025. cuDF POC Pull Request: Metadata caching prototype in Parquet reader. https://github.com/rapidsai/cudf/pull/18891. Accessed: 2025-11-26

work page 2025
[36]

Pump Up the Volume: Processing Large Data on GPUs with Fast In- terconnects. InProceedings of the 2020 International Conference on Manage- ment of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1633–1649. http...

work page doi:10.1145/3318464.3389705 2020
[37]

Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, and Volker Markl

work page
[38]

InSIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G

Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. InSIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G. Ives, Angela Bonifati, and Amr El Abbadi (Eds.). ACM, 1017–1032. https://doi.org/10.1145/3514221.3517911

work page doi:10.1145/3514221.3517911 2022
[39]

Linux Foundation Meta AI. 2025. PyTorch. https://pytorch.org/

work page 2025
[40]

Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware.Proc. VLDB Endow.4, 9 (2011), 539–550. https://doi.org/10.14778/ 2002938.2002940

work page arXiv 2011
[41]

Hamish Nicholson, Konstantinos Chasialis, Antonio Boffa, and Anastasia Aila- maki. 2025. The Effectiveness of Compression for GPU-Accelerated Queries on Out-of-Memory Datasets. InProceedings of the 21st International Workshop on Data Management on New Hardware, DaMoN 2025, Berlin, Germany, June 22-27,

work page 2025
[42]

https://doi.org/10.1145/3736227.3736240

ACM, 10:1–10:10. https://doi.org/10.1145/3736227.3736240

work page doi:10.1145/3736227.3736240
[43]

Jesse Noffsinger, Maria Goodpaster, Mark Patel, Haley Chang, Pankaj Sachdeva, and Arjita Bhan. 2025. The cost of compute: A $ 7 trillion race to scale data centers. https://www.mckinsey.com/industries/technology-media-and- telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar- race-to-scale-data-centers

work page 2025
[44]

NVIDIA. 2025. NCCL. https://developer.nvidia.com/nccl

work page 2025
[45]

NVIDIA. 2025. NVIDIA GPUDirect RDMA. https://docs.nvidia.com/cuda/ gpudirect-rdma/

work page 2025
[46]

NVIDIA. 2025. NVIDIA GPUDirect Storage. https://docs.nvidia.com/gpudirect- storage/

work page 2025
[47]

NVIDIA. 2025. RAPIDS Accelerator for Apache Spark. https://docs.nvidia.com/ spark-rapids/index.html. Accessed: 2025-11-28. 13

work page 2025
[48]

2025.CUB: Cooperative primitives for CUDA

NVIDIA Corporation. 2025.CUB: Cooperative primitives for CUDA. https: //nvidia.github.io/cccl/cub/ Accessed: 2025-11-26

work page 2025
[49]

2025.CUDA C Programming Guide: Concurrent Execution between Host and Device

NVIDIA Corporation. 2025.CUDA C Programming Guide: Concurrent Execution between Host and Device. https://docs.nvidia.com/cuda/cuda-c-programming- guide/index.html#concurrent-execution-host-device Accessed: 2025-11-26

work page 2025
[50]

2025.CUDA Runtime API: API Synchronization Behav- ior

NVIDIA Corporation. 2025.CUDA Runtime API: API Synchronization Behav- ior. https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html# api-sync-behavior__memcpy-async Accessed: 2025-11-26

work page 2025
[51]

2025.Thrust: Parallel Algorithms Library

NVIDIA Corporation. 2025.Thrust: Parallel Algorithms Library. https://developer. nvidia.com/thrust Accessed: 2025-11-26

work page 2025
[52]

Weston Pace, Chang She, Lei Xu, Will Jones, Albert Lockett, Jun Wang, and Raunak Shah. 2025. Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings.CoRRabs/2504.15247 (2025). https://doi.org/10. 48550/ARXIV.2504.15247 arXiv:2504.15247

work page arXiv 2025
[53]

NVIDIA NCCL Repository. 2025. NCCL Overlap Issue. https://github.com/ NVIDIA/nccl/issues/338

work page 2025
[54]

Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A Study of the Funda- mental Performance Characteristics of GPUs and CPUs for Database Analytics. InProceedings of the 2020 International Conference on Management of Data, SIG- MOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-C...

work page arXiv 2020
[55]

Amazon Staff. 2025. Amazon to invest up to $ 50 billion to ex- pand AI and supercomputing infrastructure for US Government agen- cies. https://www.aboutamazon.com/news/company-news/amazon-ai- investment-us-federal-agencies

work page 2025
[56]

Polars Team. 2024. GPU acceleration with Polars and NVIDIA RAPIDS. https: //pola.rs/posts/gpu-engine-release/. Accessed: 2025-11-28

work page 2024
[57]

RAPIDS Development Team. 2025. NVIDIA RAPIDS KvikIO. https://docs.rapids. ai/api/kvikio/stable/

work page 2025
[58]

RAPIDS Development Team. 2025. NVIDIA RAPIDS libcudf, pylibcudf and cuDF: GPU DataFrame Library. https://github.com/rapidsai/cudf

work page 2025
[59]

RAPIDS Development Team. 2025. NVIDIA RMM: RAPIDS Memory Manager. https://github.com/rapidsai/rmm

work page 2025
[60]

Lasse Thostrup, Gloria Doci, Nils Boeschen, Manisha Luthra, and Carsten Binnig

work page
[61]

ACM Manag

Distributed GPU Joins on Fast RDMA-capable Networks.Proc. ACM Manag. Data1, 1 (2023), 29:1–29:26. https://doi.org/10.1145/3588709

work page doi:10.1145/3588709 2023
[62]

In: Proc

Deepak Vohra. 2016.Practical Hadoop Ecosystem: A Definitive Guide to Hadoop- Related Frameworks and Tools(1st ed.). Apress, USA. https://doi.org/10.1007/978- 1-4842-2199-0

work page doi:10.1007/978- 2016
[63]

Yunsong Wang. 2025. cuCollections Pull Request: Use pinned host memory for improved async performance. https://github.com/NVIDIA/cuCollections/pull/

work page 2025
[64]

Accessed: 2025-11-27

work page 2025
[65]

Jake Hemstad with NVIDIA CCCL Team. 2023. CCCL Issue: Universal 64-bit index type support in Thrust/CUB algorithms. https://github.com/NVIDIA/cccl/ issues/47. Accessed: 2025-11-26

work page 2023
[66]

Greg Kimball with RAPIDS cuDF Team. 2023. cuDF Issue: Add 64-bit size type option at build-time for libcudf. https://github.com/rapidsai/cudf/issues/13159. Accessed: 2025-11-26

work page 2023
[67]

Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, and Rathijit Sen. 2025. Terabyte-Scale Analytics in the Blink of an Eye.CoRRabs/2506.09226 (2025). https://doi.org/10.48550/ARXIV.2506.09226 arXiv:2506.09226

work page doi:10.48550/arxiv.2506.09226 2025
[68]

Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, and Yonggang Wen. 2024. Deep Learning Workload Scheduling in GPU Datacenters: A Survey.ACM Comput. Surv.56, 6 (2024), 146:1–146:38. https://doi.org/10.1145/3638757

work page doi:10.1145/3638757 2024
[69]

Bobbi Yogatama, Yifei Yang, Kevin Kristensen, Devesh Sarda, Abigale Kim, Adrian Cockcroft, Yu Teng, Joshua Patterson, Gregory Kimball, Wes McKinney, Weiwei Gong, and Xiangyao Yu. 2025. Rethinking Analytical Processing in the GPU Era.CoRRabs/2508.04701 (2025). https://doi.org/10.48550/ARXIV.2508.04701 arXiv:2508.04701

work page doi:10.48550/arxiv.2508.04701 2025
[70]

Ben Zaitlen, Peter Entschev, and Rick Zamora. 2024. Best Practices for Multi-GPU Data Analysis Using RAPIDS with Dask. https://developer.nvidia.com/blog/best- practices-for-multi-gpu-data-analysis-using-rapids-with-dask/. Accessed: 2025- 11-28

work page 2024
[71]

Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. 2023. An Empirical Evaluation of Columnar Storage Formats. Proc. VLDB Endow.17, 2 (2023), 148–161. https://doi.org/10.14778/3626292. 3626298 14

work page doi:10.14778/3626292 2023

[1] [1]

Daniel Abadi. 2025. Parquet and ORC’s many shortfalls for machine learning (ML) workloads, and what should be done about them. https://www.starburst. io/blog/parquet-orc-machine-learning/

work page 2025

[2] [2]

Azim Afroozeh, Lotte Felius, and Peter Boncz. 2024. Accelerating GPU Data Processing using FastLanes Compression. InProceedings of the 20th International Workshop on Data Management on New Hardware, DaMoN 2024, Santiago, Chile, 10 June 2024, Carsten Binnig and Nesime Tatbul (Eds.). ACM, 8:1–8:11. https: //doi.org/10.1145/3662010.3663450

work page doi:10.1145/3662010.3663450 2024

[3] [3]

Patel, and Rodrigo Aramburú

Felipe Aramburú, William Malpica, Kaouther Abrougui, Amin Aramoon, Romulo Auccapuclla, Claude Brisson, Matthijs Brobbel, Colby Farrell, Pradeep Garigipati, Joost Hoozemans, Supun Kamburugamuve, Akhil Nair, Alexander Ocsa, Johan Peltenburg, Rubén Quesada López, Deepak Sihag, Ahmet Uyar, Dhruv Vats, Michael Wendt, Jignesh M. Patel, and Rodrigo Aramburú. 202...

work page arXiv 2025

[4] [4]

Claude Barthels, Gustavo Alonso, Torsten Hoefler, Timo Schneider, and Ingo Müller. 2017. Distributed Join Algorithms on Thousands of Cores.Proc. VLDB Endow.10, 5 (2017), 517–528. https://doi.org/10.14778/3055540.3055545

work page doi:10.14778/3055540.3055545 2017

[5] [5]

Nils Boeschen, Tobias Ziegler, and Carsten Binnig. 2024. GOLAP: A GPU-in- Data-Path Architecture for High-Speed OLAP.Proc. ACM Manag. Data2, 6 (2024), 237:1–237:26. https://doi.org/10.1145/3698812

work page doi:10.1145/3698812 2024

[6] [6]

Sebastian Breß, Henning Funke, and Jens Teubner. 2016. Robust Query Processing in Co-Processor-accelerated Databases. InProceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 1891–1906. https://doi.org/10.114...

work page doi:10.1145/2882903.2882936 2016

[7] [7]

Jiashen Cao, Rathijit Sen, Matteo Interlandi, Joy Arulraj, and Hyesoon Kim. 2023. GPU Database Systems Characterization and Optimization.Proc. VLDB Endow. 17, 3 (2023), 441–454. https://doi.org/10.14778/3632093.3632107

work page doi:10.14778/3632093.3632107 2023

[8] [8]

NVIDIA Corporation. 2025. NVIDIA DGX A100 Hardware Overview. https://docs. nvidia.com/dgx/dgxa100-user-guide/introduction-to-dgxa100.html. Accessed: 2025-11-29

work page 2025

[9] [9]

Microsoft Fabric. 2025. Understand V-Order for Microsoft Fabric Warehouse. https://learn.microsoft.com/en-us/fabric/data-warehouse/v-order. Accessed: 2025-11-27

work page 2025

[10] [10]

Nuno Faria and Andrew Lamb with Apache DataFusion Contributors. 2025. DataFusion Pull Request: Cache Parquet metadata in built in parquet reader. https://github.com/apache/datafusion/pull/16971. Accessed: 2025-11-26

work page 2025

[11] [11]

NumFOCUS Foundation. 2025. NumPy. https://numpy.org/

work page 2025

[12] [12]

NumFOCUS Foundation and Wes McKinney. 2025. pandas - Python Data Analysis Library. https://pandas.pydata.org/

work page 2025

[13] [13]

Hao Gao and Nikolai Sakharnykh. 2021. Scaling Joins to a Thousand GPUs. In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2021, Copen- hagen, Denmark, August 16, 2021, Rajesh Bordawekar and Tirthankar Lahiri (Eds.). 55–64. http://www.adms-conf.org/2021-camera-ready/g...

work page 2021

[14] [14]

Chengxin Guo, Hong Chen, Feng Zhang, and Cuiping Li. 2019. Distributed Join Algorithms on Multi-CPU Clusters with GPUDirect RDMA. InProceedings of the 48th International Conference on Parallel Processing, ICPP 2019, Kyoto, Japan, August 05-08, 2019. ACM, 65:1–65:10. https://doi.org/10.1145/3337821.3337862

work page doi:10.1145/3337821.3337862 2019

[15] [15]

Tushar Gupta. 2024. Evolution of Data Center Design to Handle AI Workloads. In34th International Telecommunication Networks and Applications Conference, ITNAC 2024, Sydney, Australia, November 27-29, 2024. IEEE, 1–8. https://doi.org/ 10.1109/ITNAC62915.2024.10815309

work page doi:10.1109/itnac62915.2024.10815309 2024

[16] [16]

Xiangpeng Hao. 2024. Caching in DataFusion. https://blog.xiangpeng.systems/ posts/caching-datafusion/. Accessed: 2025-11-26

work page 2024

[17] [17]

Xiangpeng Hao and Andrew Lamb. 2024. How Good is Parquet for Wide Tables (Machine Learning Workloads) Really? https://www.influxdata.com/blog/how- good-parquet-wide-tables/

work page 2024

[18] [18]

Xiangpeng Hao, Andrew Lamb, Qi Zhu, Weston Pace, and Jigao Luo. 2025. Project idea: How good is well-configured parquet compared to proprietary file formats? https://github.com/XiangpengHao/liquid-cache/issues/227. Accessed: 2025-11- 27

work page 2025

[19] [19]

Dong He, Supun Chathuranga Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstanti- nos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes.Proc. VLDB Endow.15, 11 (2022), 2811–2825. https: //doi.org/10.14778/3551793.3551833

work page doi:10.14778/3551793.3551833 2022

[20] [20]

Sven Hepkema, Azim Afroozeh, Charlotte Felius, Peter Boncz, and Stefan Manegold. 2025. G-ALP: Rethinking Light-weight Encodings for GPUs. In Proceedings of the 21st International Workshop on Data Management on New Hardware, DaMoN 2025, Berlin, Germany, June 22-27, 2025. ACM, 11:1–11:10. https://doi.org/10.1145/3736227.3736242

work page doi:10.1145/3736227.3736242 2025

[21] [21]

Zezhou Huang, Krystian Sakowski, Hans Lehnert, Wei Cui, Carlo Curino, Matteo Interlandi, Marius Dumitru, and Rathijit Sen. 2025. GPU Acceleration of SQL Analytics on Compressed Data.CoRRabs/2506.10092 (2025). https://doi.org/10. 48550/ARXIV.2506.10092 arXiv:2506.10092

work page arXiv 2025

[22] [22]

Marko Kabic, Shriram Chandran, and Gustavo Alonso. 2025. Maximus: A Modular Accelerated Query Engine for Data Analytics on Heterogeneous Systems.Proc. ACM Manag. Data3, 3 (2025), 187:1–187:25. https://doi.org/10.1145/3725324

work page doi:10.1145/3725324 2025

[23] [23]

Greg Kimball, Zoltan Arnold Nagy, Devavret Makkar, Daniel Bauer, and Chengcheng Jin. 2025. Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF. https://developer.nvidia.com/blog/accelerating-large- scale-data-analytics-with-gpu-native-velox-and-nvidia-cudf/#hybrid_cpu- gpu_execution_in_apache_spark Accessed: 2025-11-26

work page 2025

[24] [24]

Greg Kimball and Karthikeyan Natarajan. 2025. Accelerating Velox with RAPIDS cuDF - VeloxCon 2025. https://prestodb.io/wp-content/uploads/presto-users/ Accelerating-Velox-with-RAPIDS-cuDF-VeloxCon-April-2025.pdf Accessed: 2025-11-26

work page 2025

[25] [25]

Laurens Kuiper. 2025. Query Engines: Gatekeepers of the Parquet File Format. https://duckdb.org/2025/01/22/parquet-encodings. Accessed: 2025-11-27

work page 2025

[26] [26]

Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis

work page

[27] [27]

ACM Manag

BtrBlocks: Efficient Columnar Compression for Data Lakes.Proc. ACM Manag. Data1, 2 (2023), 118:1–118:26. https://doi.org/10.1145/3589263

work page doi:10.1145/3589263 2023

[28] [28]

Andrew Lamb. 2025. Apache Parquet Metadata Parsing Benchmarks. https: //github.com/alamb/parquet_footer_parsing. Accessed: 2025-11-26

work page 2025

[29] [29]

Andrew Lamb, Yijie Shen, Daniël Heres, Jayjeet Chakraborty, Mehmet Ozan Kabak, Liang-Chi Hsieh, and Chao Sun. 2024. Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine. InCompanion of the 2024 International Conference on Management of Data, SIGMOD/PODS 2024, Santiago, Chile, June 9-15, 2024, Pablo Barceló, Nayat Sánchez-Pi, Alexandr...

work page doi:10.1145/3626246.3653368 2024

[30] [30]

Maas, Momin Al-Ghosien, Spyros Blanas, Nicolas Bruno, Carlo Curino, Matteo Interlandi, Craig Peeper, Kaushik Rajan, Surajit Chaudhuri, and Johannes Gehrke

Yinan Li, Bailu Ding, Ziyun Wei, Lukas M. Maas, Momin Al-Ghosien, Spyros Blanas, Nicolas Bruno, Carlo Curino, Matteo Interlandi, Craig Peeper, Kaushik Rajan, Surajit Chaudhuri, and Johannes Gehrke. 2025. Scaling GPU-Accelerated Databases beyond GPU Memory Size.Proc. VLDB Endow.18, 11 (2025), 4518–4531. https://www.vldb.org/pvldb/vol18/p4518-li.pdf

work page 2025

[31] [31]

Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. 2023. A Deep Dive into Common Open Formats for Analytical DBMSs.Proc. VLDB Endow.16, 11 (2023), 3044–3056. https://doi.org/10.14778/3611479.3611507

work page doi:10.14778/3611479.3611507 2023

[32] [32]

Jigao Luo. 2025. cuDF Issue: Unstable pipelining performance in Parquet reading due to mis-sync. https://github.com/rapidsai/cudf/issues/18967. Accessed: 2025-11-26

work page 2025

[33] [33]

Jigao Luo. 2025. [Story] Towards a faster Parquet reader with pipelining and mul- tistream optimization. https://github.com/rapidsai/cudf/issues/18892. Accessed: 2025-11-27

work page 2025

[34] [34]

Jigao Luo and Muhammad Haseeb. 2025. cuDF POC Pull Request: Metadata caching prototype in Parquet reader. https://github.com/rapidsai/cudf/pull/18891. Accessed: 2025-11-26

work page 2025

[35] [36]

Pump Up the Volume: Processing Large Data on GPUs with Fast In- terconnects. InProceedings of the 2020 International Conference on Manage- ment of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1633–1649. http...

work page doi:10.1145/3318464.3389705 2020

[36] [37]

Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, and Volker Markl

work page

[37] [38]

InSIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G

Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. InSIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G. Ives, Angela Bonifati, and Amr El Abbadi (Eds.). ACM, 1017–1032. https://doi.org/10.1145/3514221.3517911

work page doi:10.1145/3514221.3517911 2022

[38] [39]

Linux Foundation Meta AI. 2025. PyTorch. https://pytorch.org/

work page 2025

[39] [40]

Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware.Proc. VLDB Endow.4, 9 (2011), 539–550. https://doi.org/10.14778/ 2002938.2002940

work page arXiv 2011

[40] [41]

Hamish Nicholson, Konstantinos Chasialis, Antonio Boffa, and Anastasia Aila- maki. 2025. The Effectiveness of Compression for GPU-Accelerated Queries on Out-of-Memory Datasets. InProceedings of the 21st International Workshop on Data Management on New Hardware, DaMoN 2025, Berlin, Germany, June 22-27,

work page 2025

[41] [42]

https://doi.org/10.1145/3736227.3736240

ACM, 10:1–10:10. https://doi.org/10.1145/3736227.3736240

work page doi:10.1145/3736227.3736240

[42] [43]

Jesse Noffsinger, Maria Goodpaster, Mark Patel, Haley Chang, Pankaj Sachdeva, and Arjita Bhan. 2025. The cost of compute: A $ 7 trillion race to scale data centers. https://www.mckinsey.com/industries/technology-media-and- telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar- race-to-scale-data-centers

work page 2025

[43] [44]

NVIDIA. 2025. NCCL. https://developer.nvidia.com/nccl

work page 2025

[44] [45]

NVIDIA. 2025. NVIDIA GPUDirect RDMA. https://docs.nvidia.com/cuda/ gpudirect-rdma/

work page 2025

[45] [46]

NVIDIA. 2025. NVIDIA GPUDirect Storage. https://docs.nvidia.com/gpudirect- storage/

work page 2025

[46] [47]

NVIDIA. 2025. RAPIDS Accelerator for Apache Spark. https://docs.nvidia.com/ spark-rapids/index.html. Accessed: 2025-11-28. 13

work page 2025

[47] [48]

2025.CUB: Cooperative primitives for CUDA

NVIDIA Corporation. 2025.CUB: Cooperative primitives for CUDA. https: //nvidia.github.io/cccl/cub/ Accessed: 2025-11-26

work page 2025

[48] [49]

2025.CUDA C Programming Guide: Concurrent Execution between Host and Device

NVIDIA Corporation. 2025.CUDA C Programming Guide: Concurrent Execution between Host and Device. https://docs.nvidia.com/cuda/cuda-c-programming- guide/index.html#concurrent-execution-host-device Accessed: 2025-11-26

work page 2025

[49] [50]

2025.CUDA Runtime API: API Synchronization Behav- ior

NVIDIA Corporation. 2025.CUDA Runtime API: API Synchronization Behav- ior. https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html# api-sync-behavior__memcpy-async Accessed: 2025-11-26

work page 2025

[50] [51]

2025.Thrust: Parallel Algorithms Library

NVIDIA Corporation. 2025.Thrust: Parallel Algorithms Library. https://developer. nvidia.com/thrust Accessed: 2025-11-26

work page 2025

[51] [52]

Weston Pace, Chang She, Lei Xu, Will Jones, Albert Lockett, Jun Wang, and Raunak Shah. 2025. Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings.CoRRabs/2504.15247 (2025). https://doi.org/10. 48550/ARXIV.2504.15247 arXiv:2504.15247

work page arXiv 2025

[52] [53]

NVIDIA NCCL Repository. 2025. NCCL Overlap Issue. https://github.com/ NVIDIA/nccl/issues/338

work page 2025

[53] [54]

Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A Study of the Funda- mental Performance Characteristics of GPUs and CPUs for Database Analytics. InProceedings of the 2020 International Conference on Management of Data, SIG- MOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-C...

work page arXiv 2020

[54] [55]

Amazon Staff. 2025. Amazon to invest up to $ 50 billion to ex- pand AI and supercomputing infrastructure for US Government agen- cies. https://www.aboutamazon.com/news/company-news/amazon-ai- investment-us-federal-agencies

work page 2025

[55] [56]

Polars Team. 2024. GPU acceleration with Polars and NVIDIA RAPIDS. https: //pola.rs/posts/gpu-engine-release/. Accessed: 2025-11-28

work page 2024

[56] [57]

RAPIDS Development Team. 2025. NVIDIA RAPIDS KvikIO. https://docs.rapids. ai/api/kvikio/stable/

work page 2025

[57] [58]

RAPIDS Development Team. 2025. NVIDIA RAPIDS libcudf, pylibcudf and cuDF: GPU DataFrame Library. https://github.com/rapidsai/cudf

work page 2025

[58] [59]

RAPIDS Development Team. 2025. NVIDIA RMM: RAPIDS Memory Manager. https://github.com/rapidsai/rmm

work page 2025

[59] [60]

Lasse Thostrup, Gloria Doci, Nils Boeschen, Manisha Luthra, and Carsten Binnig

work page

[60] [61]

ACM Manag

Distributed GPU Joins on Fast RDMA-capable Networks.Proc. ACM Manag. Data1, 1 (2023), 29:1–29:26. https://doi.org/10.1145/3588709

work page doi:10.1145/3588709 2023

[61] [62]

In: Proc

Deepak Vohra. 2016.Practical Hadoop Ecosystem: A Definitive Guide to Hadoop- Related Frameworks and Tools(1st ed.). Apress, USA. https://doi.org/10.1007/978- 1-4842-2199-0

work page doi:10.1007/978- 2016

[62] [63]

Yunsong Wang. 2025. cuCollections Pull Request: Use pinned host memory for improved async performance. https://github.com/NVIDIA/cuCollections/pull/

work page 2025

[63] [64]

Accessed: 2025-11-27

work page 2025

[64] [65]

Jake Hemstad with NVIDIA CCCL Team. 2023. CCCL Issue: Universal 64-bit index type support in Thrust/CUB algorithms. https://github.com/NVIDIA/cccl/ issues/47. Accessed: 2025-11-26

work page 2023

[65] [66]

Greg Kimball with RAPIDS cuDF Team. 2023. cuDF Issue: Add 64-bit size type option at build-time for libcudf. https://github.com/rapidsai/cudf/issues/13159. Accessed: 2025-11-26

work page 2023

[66] [67]

Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, and Rathijit Sen. 2025. Terabyte-Scale Analytics in the Blink of an Eye.CoRRabs/2506.09226 (2025). https://doi.org/10.48550/ARXIV.2506.09226 arXiv:2506.09226

work page doi:10.48550/arxiv.2506.09226 2025

[67] [68]

Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, and Yonggang Wen. 2024. Deep Learning Workload Scheduling in GPU Datacenters: A Survey.ACM Comput. Surv.56, 6 (2024), 146:1–146:38. https://doi.org/10.1145/3638757

work page doi:10.1145/3638757 2024

[68] [69]

Bobbi Yogatama, Yifei Yang, Kevin Kristensen, Devesh Sarda, Abigale Kim, Adrian Cockcroft, Yu Teng, Joshua Patterson, Gregory Kimball, Wes McKinney, Weiwei Gong, and Xiangyao Yu. 2025. Rethinking Analytical Processing in the GPU Era.CoRRabs/2508.04701 (2025). https://doi.org/10.48550/ARXIV.2508.04701 arXiv:2508.04701

work page doi:10.48550/arxiv.2508.04701 2025

[69] [70]

Ben Zaitlen, Peter Entschev, and Rick Zamora. 2024. Best Practices for Multi-GPU Data Analysis Using RAPIDS with Dask. https://developer.nvidia.com/blog/best- practices-for-multi-gpu-data-analysis-using-rapids-with-dask/. Accessed: 2025- 11-28

work page 2024

[70] [71]

Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. 2023. An Empirical Evaluation of Columnar Storage Formats. Proc. VLDB Endow.17, 2 (2023), 148–161. https://doi.org/10.14778/3626292. 3626298 14

work page doi:10.14778/3626292 2023