Parallel R-tree-based Spatial Query Processing on a Commercial Processing-in-Memory System
Pith reviewed 2026-05-10 11:22 UTC · model grok-4.3
The pith
Broadcasting top R-tree levels to PIM chips enables up to 3.66x faster spatial range queries with 3.4x lower energy use than CPU execution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a broadcast-based mapping of R-tree range queries onto a commercial UPMEM PIM system, where the tree is built bottom-up on the CPU, top levels are broadcast to DPUs for global filtering, and lower levels are distributed for parallel batched execution, delivers up to 3.66x kernel speedup and 2.70x end-to-end speedup over CPU R-tree search while consuming about 3.4x less energy on datasets such as Lakes with 8.4M rectangles.
What carries the argument
The broadcast-based execution method, which sends upper R-tree levels to all DPUs for filtering and partitions lower levels across DPUs for parallel queries to keep communication from dominating.
If this is right
- Strong scaling from 512 to 2,540 DPUs reduces kernel time from 64.9 s to 17.6 s on the Lakes dataset.
- The PIM kernel uses roughly 3.4x less energy than the CPU R-tree search across the evaluated workloads.
- Broadcast-based execution consistently beats subtree partitioning by avoiding communication dominance on all tested datasets.
- The approach supports queries on up to 16M rectangles and 3.9M queries while maintaining speedup.
Where Pith is reading between the lines
- Similar broadcast strategies might extend to other hierarchical indexes such as quadtrees or kd-trees for spatial or multidimensional queries on PIM hardware.
- The energy savings could become more pronounced on larger clusters or with future PIM chips that have even higher DPU counts.
- This mapping reduces reliance on complex CPU-side partitioning for distributed spatial databases.
Load-bearing premise
The assumption that broadcasting top R-tree levels and distributing lower levels will consistently keep communication overhead from dominating execution across the tested UPMEM hardware without major bottlenecks or data skew.
What would settle it
Measuring kernel time and energy on a dataset with strong spatial skew or on a PIM system with higher broadcast latency would show whether the reported speedups and energy savings still appear.
Figures
read the original abstract
The growing volume of data in scientific domains has made spatial query processing increasingly challenging due to high data transfer costs across the memory hierarchy and limited memory bandwidth. To address these bottlenecks and reduce the energy consumed on data movement, this work explores Processing-in-Memory (PIM) systems by executing range queries directly inside memory chips. Unlike prior PIM studies centered on linear scans or hash-based queries, this work is the first to map R-tree range queries onto commercial PIM hardware. The proposed broadcast-based method constructs the R-tree bottom-up on the CPU, broadcasts top levels to UPMEM DPUs (DRAM Processing Units) for global filtering, and distributes lower levels for parallel batched queries in a CPU-DPU system. We evaluate our approach on two real spatial datasets, Sports (999K rectangles) and Lakes (8.4M rectangles), and assess scalability using a synthetic dataset with up to 16M rectangles and 3.9M queries on a commercial UPMEM PIM system with up to 2,540 DPUs. Across all datasets, broadcast-based execution consistently outperforms subtree partitioning by preventing communication from dominating execution. On the Lakes dataset, strong scaling from 512 to 2,540 DPUs reduces kernel time from 64.9 s to 17.6 s, yielding up to 3.66x kernel and 2.70x end-to-end speedup relative to the CPU R-tree search on the same system. The PIM kernel also consumes approximately 3.4x less energy than the corresponding CPU search (e.g., 59.6 kJ vs. 167.0 kJ on Lakes), demonstrating scalable and energy-efficient hierarchical spatial range queries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first mapping of R-tree range queries to commercial UPMEM PIM hardware. It builds the R-tree bottom-up on the CPU, broadcasts top levels to DPUs for global filtering, and distributes lower levels for batched parallel search. Experiments on Sports (999K rectangles), Lakes (8.4M rectangles), and synthetic data (up to 16M rectangles, 3.9M queries) report consistent outperformance over subtree partitioning, strong scaling from 512 to 2540 DPUs (kernel time 64.9s to 17.6s on Lakes), up to 3.66x kernel and 2.70x end-to-end speedup vs. CPU R-tree, and ~3.4x lower energy (e.g., 59.6 kJ vs. 167.0 kJ).
Significance. If the empirical results hold under broader conditions, the work establishes feasibility of hierarchical spatial indexes on PIM, with demonstrated scaling and energy benefits over CPU baselines. Credit is due for using two real spatial datasets, synthetic scaling to 16M objects, and direct energy measurements on up to 2540 DPUs; these elements provide concrete evidence of practical viability beyond linear-scan PIM studies.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation: The central speedup and scaling claims (3.66x kernel, 2.70x end-to-end, strong scaling 512→2540 DPUs) rest on the broadcast method preventing host-DPU communication from dominating. No per-phase timing breakdowns, DPU load-balance statistics, or experiments under controlled spatial skew are reported, leaving the assumption untested for clustered real-world data such as Lakes.
- [Method] Method: The description of distributing lower R-tree levels does not address how subtrees exceeding the 64 MB per-DPU memory limit are handled (splitting, paging, or additional transfers). This is load-bearing for the feasibility claim on the 8.4M-rectangle Lakes dataset.
- [Evaluation] Evaluation: Reported speedups and energy figures lack error bars, number of runs, or precise CPU baseline implementation details (e.g., R-tree variant, threading). This weakens assessment of the 3.4x energy reduction and consistent outperformance statements.
minor comments (1)
- [Abstract] Abstract: The phrase 'consistently outperforms subtree partitioning' would benefit from a brief parenthetical on the exact CPU baseline configuration used for the 2.70x end-to-end figure.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We address each major comment point by point below, indicating planned revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract and Evaluation] The central speedup and scaling claims (3.66x kernel, 2.70x end-to-end, strong scaling 512→2540 DPUs) rest on the broadcast method preventing host-DPU communication from dominating. No per-phase timing breakdowns, DPU load-balance statistics, or experiments under controlled spatial skew are reported, leaving the assumption untested for clustered real-world data such as Lakes.
Authors: We agree that per-phase breakdowns and load-balance data would strengthen the claims. In the revised manuscript we will add per-phase timing breakdowns (host-DPU transfer vs. DPU kernel execution) for the Lakes dataset and report DPU load-balance statistics (e.g., min/max/mean rectangles per DPU). Although the Lakes dataset already exhibits real-world clustering, we will also include a controlled synthetic experiment with varying degrees of spatial skew to explicitly test robustness. These additions will appear in the Evaluation section. revision: yes
-
Referee: [Method] The description of distributing lower R-tree levels does not address how subtrees exceeding the 64 MB per-DPU memory limit are handled (splitting, paging, or additional transfers). This is load-bearing for the feasibility claim on the 8.4M-rectangle Lakes dataset.
Authors: We thank the referee for highlighting this omission. In our implementation, lower-level subtrees are partitioned across DPUs so that each assigned subtree (or subtree fragment) fits within the 64 MB limit; when a subtree exceeds the limit it is split at an internal node and the fragments are assigned to separate DPUs, with the host performing a lightweight merge for any query that crosses fragment boundaries. We will revise the Method section to describe this splitting and coordination strategy explicitly, including its application to the Lakes dataset. revision: yes
-
Referee: [Evaluation] Reported speedups and energy figures lack error bars, number of runs, or precise CPU baseline implementation details (e.g., R-tree variant, threading). This weakens assessment of the 3.4x energy reduction and consistent outperformance statements.
Authors: We agree that these details are necessary for reproducibility and confidence in the results. In the revised paper we will state that all timing and energy measurements were repeated five times and report averages with error bars, and we will specify the CPU baseline as a standard R-tree with bulk loading executed single-threaded on the host. These clarifications will be added to the Evaluation section. revision: yes
Circularity Check
No circularity: empirical systems benchmarking with direct measurements
full rationale
This is a systems implementation paper describing an R-tree mapping to UPMEM PIM hardware, with all performance claims (speedups, energy figures, scaling) derived from direct hardware measurements on real datasets rather than any mathematical derivation, fitted parameter, or self-citation chain. No equations, predictions, or load-bearing self-citations appear in the provided text; the broadcast-based partitioning is an engineering choice evaluated empirically, not derived by construction from prior results. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The UPMEM commercial PIM hardware (with up to 2540 DPUs) supports efficient broadcast of R-tree top levels and parallel batched queries on lower levels without prohibitive communication or synchronization costs.
Reference graph
Works this paper leans on
-
[1]
Hitting the memory wall: Implications of the obvious,
W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”ACM SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995
work page 1995
-
[2]
Introduction to UPMEM PIM: Processing-in-Memory (PIM) on DRAM Accelerator (White Paper)
UPMEM, “Introduction to UPMEM PIM: Processing-in-Memory (PIM) on DRAM Accelerator (White Paper).” White paper, UPMEM, Greno- ble, France, 2018
work page 2018
-
[3]
J. G ´omez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,”IEEE Access, vol. 10, pp. 52565–52608, 2022
work page 2022
-
[4]
Accelerating large table scan using processing-in-memory technology,
A. Baumstark, M. A. Jibril, and K.-U. Sattler, “Accelerating large table scan using processing-in-memory technology,”Datenbank-Spektrum, vol. 23, no. 3, pp. 199–209, 2023
work page 2023
-
[5]
Design and analysis of a processing-in-dimm join algorithm: A case study with upmem dimms,
C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y . Kim, “Design and analysis of a processing-in-dimm join algorithm: A case study with upmem dimms,”Proceedings of the ACM on Management of Data, vol. 1, no. 2, pp. 1–27, 2023
work page 2023
-
[6]
Pim-trie: A skew-resistant trie for processing-in- memory,
H. Kang, Y . Zhao, G. E. Blelloch, L. Dhulipala, Y . Gu, C. McGuffey, and P. B. Gibbons, “Pim-trie: A skew-resistant trie for processing-in- memory,” inProceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 1–14, 2023
work page 2023
-
[7]
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent RAM,” IEEE micro, vol. 17, no. 2, pp. 34–44, 2002
work page 2002
-
[8]
The true processing in memory accelerator,
F. Devaux, “The true processing in memory accelerator,” inProceedings of the IEEE Hot Chips 31 Symposium (HCS), pp. 1–24, IEEE Computer Society, 2019
work page 2019
-
[9]
MapReduce algorithms for GIS polygonal overlay processing,
S. Puri, D. Agarwal, X. He, and S. K. Prasad, “MapReduce algorithms for GIS polygonal overlay processing,” in2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp. 1009–1016, IEEE, 2013
work page 2013
-
[10]
GCMF: an efficient end-to-end spatial join system over large polygonal datasets on GPGPU platform,
D. Aghajarian, S. Puri, and S. Prasad, “GCMF: an efficient end-to-end spatial join system over large polygonal datasets on GPGPU platform,” inProceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’16, (New York, NY , USA), Association for Computing Machinery, 2016
work page 2016
-
[11]
S. Puri and S. K. Prasad, “A Parallel Algorithm for Clipping Polygons with Improved Bounds and a Distributed Overlay Processing System Using MPI,” in2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 576–585, 2015
work page 2015
-
[12]
Fine-grained dynamic load balancing in spatial join by work stealing on distributed memory,
J. Yang, S. Puri, and H. Zhou, “Fine-grained dynamic load balancing in spatial join by work stealing on distributed memory,” inProceedings of the 30th International Conference on Advances in Geographic Infor- mation Systems, SIGSPATIAL ’22, (New York, NY , USA), Association for Computing Machinery, 2022
work page 2022
-
[13]
Hierarchical Filter and Refinement System Over Large Polygonal Datasets on CPU-GPU,
Y . Liu, J. Yang, and S. Puri, “Hierarchical Filter and Refinement System Over Large Polygonal Datasets on CPU-GPU,” in2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 141–151, 2019
work page 2019
-
[14]
R-trees: A dynamic index structure for spatial searching,
A. Guttman, “R-trees: A dynamic index structure for spatial searching,” inProceedings of the 1984 ACM SIGMOD international conference on Management of data, pp. 47–57, 1984
work page 1984
-
[15]
STR: A simple and efficient algorithm for R-tree packing,
S. T. Leutenegger, M. A. Lopez, and J. Edgington, “STR: A simple and efficient algorithm for R-tree packing,” inProceedings 13th international conference on data engineering, pp. 497–506, IEEE, 1997
work page 1997
-
[16]
I. Kamel and C. Faloutsos, “Parallel R-trees,”ACM SIGMOD Record, vol. 21, no. 2, pp. 195–204, 1992
work page 1992
-
[17]
Parallel Range Query Processing on R-Tree with Graphics Processing Unit,
B. Yu, H. Kim, W. Choi, and D. Kwon, “Parallel Range Query Processing on R-Tree with Graphics Processing Unit,” in2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, pp. 1235–1242, 2011
work page 2011
-
[18]
GPU-based Parallel R-tree Construction and Querying,
S. K. Prasad, M. McDermott, X. He, and S. Puri, “GPU-based Parallel R-tree Construction and Querying,” in2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp. 618–627, IEEE, 2015
work page 2015
-
[19]
Parallel multi-dimensional range query processing with R-trees on GPU,
J. Kim, S.-G. Kim, and B. Nam, “Parallel multi-dimensional range query processing with R-trees on GPU,”Journal of Parallel and Distributed Computing, vol. 73, no. 8, pp. 1195–1207, 2013
work page 2013
-
[20]
Parallel spatial query processing on gpus using R-trees,
S. You, J. Zhang, and L. Gruenwald, “Parallel spatial query processing on gpus using R-trees,” inProceedings of the 2nd ACM SIGSPATIAL international workshop on analytics for big geospatial data, pp. 23–31, 2013
work page 2013
-
[21]
LibRTS: A Spatial Indexing Library by Ray Tracing,
L. Geng, R. Lee, and X. Zhang, “LibRTS: A Spatial Indexing Library by Ray Tracing,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’25, (New York, NY , USA), p. 396–411, Association for Computing Machinery, 2025
work page 2025
-
[22]
Master-client R-trees: A new parallel R-tree architecture,
B. Schnitzer and S. T. Leutenegger, “Master-client R-trees: A new parallel R-tree architecture,” inProceedings. Eleventh International Conference on Scientific and Statistical Database Management, pp. 68– 77, IEEE, 1999
work page 1999
-
[23]
A Design of Parallel R-tree on Cluster of Workstations,
L. Shuhua, Z. Fenghua, and S. Yongqiang, “A Design of Parallel R-tree on Cluster of Workstations,” inInternational Workshop on Databases in Networked Information Systems, pp. 119–133, Springer, 2000
work page 2000
-
[24]
Catfish: Adaptive RDMA-enabled R-Tree for Low Latency and High Throughput,
M. Xiao, H. Wang, L. Geng, R. Lee, and X. Zhang, “Catfish: Adaptive RDMA-enabled R-Tree for Low Latency and High Throughput,” in2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 164–175, 2019
work page 2019
-
[25]
An RDMA-enabled In-memory Computing Platform for R-tree on Clusters,
M. Xiao, H. Wang, L. Geng, R. Lee, and X. Zhang, “An RDMA-enabled In-memory Computing Platform for R-tree on Clusters,”ACM Trans. Spatial Algorithms Syst., vol. 8, Feb. 2022
work page 2022
-
[26]
PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,
J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,”ACM SIGARCH Computer Architecture News, vol. 43, no. 3S, pp. 336–348, 2015
work page 2015
-
[27]
Near-data processing: Insights from a micro- 46 workshop,
R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, and S. Swanson, “Near-data processing: Insights from a micro- 46 workshop,”IEEE Micro, vol. 34, no. 4, pp. 36–42, 2014
work page 2014
-
[28]
Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology,
V . Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology,” inProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 273–287, 2017
work page 2017
-
[29]
H. Kim, Y . Zhao, A. Pavlo, and P. B. Gibbons, “No cap, this memory slaps: Breaking through the memory wall of transactional database systems with processing-in-memory,”Proceedings of the VLDB Endow- ment, vol. 18, no. 11, pp. 4241–4254, 2025
work page 2025
-
[30]
Pim-zd-tree: A fast space-partitioning index leveraging processing-in-memory,
Y . Zhao, H. Kang, Z. Men, Y . Gu, G. E. Blelloch, L. Dhulipala, C. McGuffey, and P. B. Gibbons, “Pim-zd-tree: A fast space-partitioning index leveraging processing-in-memory,” inProceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 480–495, 2026
work page 2026
-
[31]
Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,
S. Cai, B. Tian, H. Zhang, and M. Gao, “Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,”Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–25, 2024
work page 2024
-
[32]
UCR-STAR: The UCR spatio-temporal active repository,
S. Ghosh, T. Vu, M. A. Eskandari, and A. Eldawy, “UCR-STAR: The UCR spatio-temporal active repository,”SIGSPATIAL Special, vol. 11, no. 2, pp. 34–40, 2019
work page 2019
-
[33]
Spiderweb: A spatial data generator for big spatial data applications
P. Katiyar, T. Wu, S. Migliorini, A. Belussi, and A. Eldawy, “Spiderweb: A spatial data generator for big spatial data applications.” https://spider. cs.ucr.edu/, 2021
work page 2021
-
[34]
A full-system perspective on upmem performance,
B. Friesel, M. L ¨utke Dreimann, and O. Spinczyk, “A full-system perspective on upmem performance,” inProceedings of the 1st Workshop on Disruptive Memory Systems, pp. 1–7, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.