pith. sign in

arxiv: 2604.14445 · v1 · submitted 2026-04-15 · 💻 cs.DB · cs.DC

Parallel R-tree-based Spatial Query Processing on a Commercial Processing-in-Memory System

Pith reviewed 2026-05-10 11:22 UTC · model grok-4.3

classification 💻 cs.DB cs.DC
keywords R-treeProcessing-in-Memoryspatial range queriesUPMEMparallel query processingenergy efficiencyhierarchical indexes
0
0 comments X

The pith

Broadcasting top R-tree levels to PIM chips enables up to 3.66x faster spatial range queries with 3.4x lower energy use than CPU execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how to run R-tree range queries directly inside commercial Processing-in-Memory hardware to avoid expensive data transfers across the memory hierarchy. It builds the tree on the CPU then broadcasts the upper levels to all DPUs for quick global filtering while distributing lower levels for batched parallel searches across thousands of units. Tests on real datasets with up to 8.4 million rectangles and synthetic sets up to 16 million show the method scales well and beats both CPU baselines and alternative PIM partitioning schemes. A reader would care because scientific data volumes keep growing and spatial queries are common, so cutting movement costs and energy could make large-scale location analysis faster and cheaper. The work focuses on hierarchical structures rather than simple scans or hashes to show PIM can handle complex index-based queries.

Core claim

The central claim is that a broadcast-based mapping of R-tree range queries onto a commercial UPMEM PIM system, where the tree is built bottom-up on the CPU, top levels are broadcast to DPUs for global filtering, and lower levels are distributed for parallel batched execution, delivers up to 3.66x kernel speedup and 2.70x end-to-end speedup over CPU R-tree search while consuming about 3.4x less energy on datasets such as Lakes with 8.4M rectangles.

What carries the argument

The broadcast-based execution method, which sends upper R-tree levels to all DPUs for filtering and partitions lower levels across DPUs for parallel queries to keep communication from dominating.

If this is right

  • Strong scaling from 512 to 2,540 DPUs reduces kernel time from 64.9 s to 17.6 s on the Lakes dataset.
  • The PIM kernel uses roughly 3.4x less energy than the CPU R-tree search across the evaluated workloads.
  • Broadcast-based execution consistently beats subtree partitioning by avoiding communication dominance on all tested datasets.
  • The approach supports queries on up to 16M rectangles and 3.9M queries while maintaining speedup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar broadcast strategies might extend to other hierarchical indexes such as quadtrees or kd-trees for spatial or multidimensional queries on PIM hardware.
  • The energy savings could become more pronounced on larger clusters or with future PIM chips that have even higher DPU counts.
  • This mapping reduces reliance on complex CPU-side partitioning for distributed spatial databases.

Load-bearing premise

The assumption that broadcasting top R-tree levels and distributing lower levels will consistently keep communication overhead from dominating execution across the tested UPMEM hardware without major bottlenecks or data skew.

What would settle it

Measuring kernel time and energy on a dataset with strong spatial skew or on a PIM system with higher broadcast latency would show whether the reported speedups and energy savings still appear.

Figures

Figures reproduced from arXiv: 2604.14445 by Michael Gowanlock, Satish Puri, Tasmia Jannat.

Figure 1
Figure 1. Figure 1: Processing-in-Memory (PIM) System Organization [8]. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Workflow for Broadcast PIM R-tree range query [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Three-step subtree-based PIM R-tree (PIM baseline) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Three-level STR R-tree layout for a dataset of [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: DPU-index–guided upper-level filtering in Broadcast [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of CPU-parallel speedup on MILL and PIM kernel speedup on UPMEM-PIM across datasets. CPU speedup plateaus at approximately 6–7×, while PIM kernel speedup increases substantially with dataset size. Here, CPU speedup is relative to CPU sequential execution on MILL, while PIM kernel speedup is relative to CPU sequential execution on UPMEM-PIM [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Kernel and communication time breakdown for the [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Effect of tasklet parallelism on PIM performance. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Average per-batch timing: host-to-DPU transfer, DPU [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
read the original abstract

The growing volume of data in scientific domains has made spatial query processing increasingly challenging due to high data transfer costs across the memory hierarchy and limited memory bandwidth. To address these bottlenecks and reduce the energy consumed on data movement, this work explores Processing-in-Memory (PIM) systems by executing range queries directly inside memory chips. Unlike prior PIM studies centered on linear scans or hash-based queries, this work is the first to map R-tree range queries onto commercial PIM hardware. The proposed broadcast-based method constructs the R-tree bottom-up on the CPU, broadcasts top levels to UPMEM DPUs (DRAM Processing Units) for global filtering, and distributes lower levels for parallel batched queries in a CPU-DPU system. We evaluate our approach on two real spatial datasets, Sports (999K rectangles) and Lakes (8.4M rectangles), and assess scalability using a synthetic dataset with up to 16M rectangles and 3.9M queries on a commercial UPMEM PIM system with up to 2,540 DPUs. Across all datasets, broadcast-based execution consistently outperforms subtree partitioning by preventing communication from dominating execution. On the Lakes dataset, strong scaling from 512 to 2,540 DPUs reduces kernel time from 64.9 s to 17.6 s, yielding up to 3.66x kernel and 2.70x end-to-end speedup relative to the CPU R-tree search on the same system. The PIM kernel also consumes approximately 3.4x less energy than the corresponding CPU search (e.g., 59.6 kJ vs. 167.0 kJ on Lakes), demonstrating scalable and energy-efficient hierarchical spatial range queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents the first mapping of R-tree range queries to commercial UPMEM PIM hardware. It builds the R-tree bottom-up on the CPU, broadcasts top levels to DPUs for global filtering, and distributes lower levels for batched parallel search. Experiments on Sports (999K rectangles), Lakes (8.4M rectangles), and synthetic data (up to 16M rectangles, 3.9M queries) report consistent outperformance over subtree partitioning, strong scaling from 512 to 2540 DPUs (kernel time 64.9s to 17.6s on Lakes), up to 3.66x kernel and 2.70x end-to-end speedup vs. CPU R-tree, and ~3.4x lower energy (e.g., 59.6 kJ vs. 167.0 kJ).

Significance. If the empirical results hold under broader conditions, the work establishes feasibility of hierarchical spatial indexes on PIM, with demonstrated scaling and energy benefits over CPU baselines. Credit is due for using two real spatial datasets, synthetic scaling to 16M objects, and direct energy measurements on up to 2540 DPUs; these elements provide concrete evidence of practical viability beyond linear-scan PIM studies.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation: The central speedup and scaling claims (3.66x kernel, 2.70x end-to-end, strong scaling 512→2540 DPUs) rest on the broadcast method preventing host-DPU communication from dominating. No per-phase timing breakdowns, DPU load-balance statistics, or experiments under controlled spatial skew are reported, leaving the assumption untested for clustered real-world data such as Lakes.
  2. [Method] Method: The description of distributing lower R-tree levels does not address how subtrees exceeding the 64 MB per-DPU memory limit are handled (splitting, paging, or additional transfers). This is load-bearing for the feasibility claim on the 8.4M-rectangle Lakes dataset.
  3. [Evaluation] Evaluation: Reported speedups and energy figures lack error bars, number of runs, or precise CPU baseline implementation details (e.g., R-tree variant, threading). This weakens assessment of the 3.4x energy reduction and consistent outperformance statements.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'consistently outperforms subtree partitioning' would benefit from a brief parenthetical on the exact CPU baseline configuration used for the 2.70x end-to-end figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We address each major comment point by point below, indicating planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The central speedup and scaling claims (3.66x kernel, 2.70x end-to-end, strong scaling 512→2540 DPUs) rest on the broadcast method preventing host-DPU communication from dominating. No per-phase timing breakdowns, DPU load-balance statistics, or experiments under controlled spatial skew are reported, leaving the assumption untested for clustered real-world data such as Lakes.

    Authors: We agree that per-phase breakdowns and load-balance data would strengthen the claims. In the revised manuscript we will add per-phase timing breakdowns (host-DPU transfer vs. DPU kernel execution) for the Lakes dataset and report DPU load-balance statistics (e.g., min/max/mean rectangles per DPU). Although the Lakes dataset already exhibits real-world clustering, we will also include a controlled synthetic experiment with varying degrees of spatial skew to explicitly test robustness. These additions will appear in the Evaluation section. revision: yes

  2. Referee: [Method] The description of distributing lower R-tree levels does not address how subtrees exceeding the 64 MB per-DPU memory limit are handled (splitting, paging, or additional transfers). This is load-bearing for the feasibility claim on the 8.4M-rectangle Lakes dataset.

    Authors: We thank the referee for highlighting this omission. In our implementation, lower-level subtrees are partitioned across DPUs so that each assigned subtree (or subtree fragment) fits within the 64 MB limit; when a subtree exceeds the limit it is split at an internal node and the fragments are assigned to separate DPUs, with the host performing a lightweight merge for any query that crosses fragment boundaries. We will revise the Method section to describe this splitting and coordination strategy explicitly, including its application to the Lakes dataset. revision: yes

  3. Referee: [Evaluation] Reported speedups and energy figures lack error bars, number of runs, or precise CPU baseline implementation details (e.g., R-tree variant, threading). This weakens assessment of the 3.4x energy reduction and consistent outperformance statements.

    Authors: We agree that these details are necessary for reproducibility and confidence in the results. In the revised paper we will state that all timing and energy measurements were repeated five times and report averages with error bars, and we will specify the CPU baseline as a standard R-tree with bulk loading executed single-threaded on the host. These clarifications will be added to the Evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems benchmarking with direct measurements

full rationale

This is a systems implementation paper describing an R-tree mapping to UPMEM PIM hardware, with all performance claims (speedups, energy figures, scaling) derived from direct hardware measurements on real datasets rather than any mathematical derivation, fitted parameter, or self-citation chain. No equations, predictions, or load-bearing self-citations appear in the provided text; the broadcast-based partitioning is an engineering choice evaluated empirically, not derived by construction from prior results. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on engineering implementation choices and hardware assumptions rather than new theoretical constructs; no free parameters are fitted to produce the reported results, and no new entities are postulated.

axioms (1)
  • domain assumption The UPMEM commercial PIM hardware (with up to 2540 DPUs) supports efficient broadcast of R-tree top levels and parallel batched queries on lower levels without prohibitive communication or synchronization costs.
    Invoked in the description of the broadcast-based method and scaling experiments; treated as given by the commercial platform.

pith-pipeline@v0.9.0 · 5616 in / 1402 out tokens · 67719 ms · 2026-05-10T11:22:30.204788+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Hitting the memory wall: Implications of the obvious,

    W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”ACM SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995

  2. [2]

    Introduction to UPMEM PIM: Processing-in-Memory (PIM) on DRAM Accelerator (White Paper)

    UPMEM, “Introduction to UPMEM PIM: Processing-in-Memory (PIM) on DRAM Accelerator (White Paper).” White paper, UPMEM, Greno- ble, France, 2018

  3. [3]

    Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,

    J. G ´omez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system,”IEEE Access, vol. 10, pp. 52565–52608, 2022

  4. [4]

    Accelerating large table scan using processing-in-memory technology,

    A. Baumstark, M. A. Jibril, and K.-U. Sattler, “Accelerating large table scan using processing-in-memory technology,”Datenbank-Spektrum, vol. 23, no. 3, pp. 199–209, 2023

  5. [5]

    Design and analysis of a processing-in-dimm join algorithm: A case study with upmem dimms,

    C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y . Kim, “Design and analysis of a processing-in-dimm join algorithm: A case study with upmem dimms,”Proceedings of the ACM on Management of Data, vol. 1, no. 2, pp. 1–27, 2023

  6. [6]

    Pim-trie: A skew-resistant trie for processing-in- memory,

    H. Kang, Y . Zhao, G. E. Blelloch, L. Dhulipala, Y . Gu, C. McGuffey, and P. B. Gibbons, “Pim-trie: A skew-resistant trie for processing-in- memory,” inProceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 1–14, 2023

  7. [7]

    A case for intelligent RAM,

    D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent RAM,” IEEE micro, vol. 17, no. 2, pp. 34–44, 2002

  8. [8]

    The true processing in memory accelerator,

    F. Devaux, “The true processing in memory accelerator,” inProceedings of the IEEE Hot Chips 31 Symposium (HCS), pp. 1–24, IEEE Computer Society, 2019

  9. [9]

    MapReduce algorithms for GIS polygonal overlay processing,

    S. Puri, D. Agarwal, X. He, and S. K. Prasad, “MapReduce algorithms for GIS polygonal overlay processing,” in2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp. 1009–1016, IEEE, 2013

  10. [10]

    GCMF: an efficient end-to-end spatial join system over large polygonal datasets on GPGPU platform,

    D. Aghajarian, S. Puri, and S. Prasad, “GCMF: an efficient end-to-end spatial join system over large polygonal datasets on GPGPU platform,” inProceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’16, (New York, NY , USA), Association for Computing Machinery, 2016

  11. [11]

    A Parallel Algorithm for Clipping Polygons with Improved Bounds and a Distributed Overlay Processing System Using MPI,

    S. Puri and S. K. Prasad, “A Parallel Algorithm for Clipping Polygons with Improved Bounds and a Distributed Overlay Processing System Using MPI,” in2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 576–585, 2015

  12. [12]

    Fine-grained dynamic load balancing in spatial join by work stealing on distributed memory,

    J. Yang, S. Puri, and H. Zhou, “Fine-grained dynamic load balancing in spatial join by work stealing on distributed memory,” inProceedings of the 30th International Conference on Advances in Geographic Infor- mation Systems, SIGSPATIAL ’22, (New York, NY , USA), Association for Computing Machinery, 2022

  13. [13]

    Hierarchical Filter and Refinement System Over Large Polygonal Datasets on CPU-GPU,

    Y . Liu, J. Yang, and S. Puri, “Hierarchical Filter and Refinement System Over Large Polygonal Datasets on CPU-GPU,” in2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 141–151, 2019

  14. [14]

    R-trees: A dynamic index structure for spatial searching,

    A. Guttman, “R-trees: A dynamic index structure for spatial searching,” inProceedings of the 1984 ACM SIGMOD international conference on Management of data, pp. 47–57, 1984

  15. [15]

    STR: A simple and efficient algorithm for R-tree packing,

    S. T. Leutenegger, M. A. Lopez, and J. Edgington, “STR: A simple and efficient algorithm for R-tree packing,” inProceedings 13th international conference on data engineering, pp. 497–506, IEEE, 1997

  16. [16]

    Parallel R-trees,

    I. Kamel and C. Faloutsos, “Parallel R-trees,”ACM SIGMOD Record, vol. 21, no. 2, pp. 195–204, 1992

  17. [17]

    Parallel Range Query Processing on R-Tree with Graphics Processing Unit,

    B. Yu, H. Kim, W. Choi, and D. Kwon, “Parallel Range Query Processing on R-Tree with Graphics Processing Unit,” in2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, pp. 1235–1242, 2011

  18. [18]

    GPU-based Parallel R-tree Construction and Querying,

    S. K. Prasad, M. McDermott, X. He, and S. Puri, “GPU-based Parallel R-tree Construction and Querying,” in2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp. 618–627, IEEE, 2015

  19. [19]

    Parallel multi-dimensional range query processing with R-trees on GPU,

    J. Kim, S.-G. Kim, and B. Nam, “Parallel multi-dimensional range query processing with R-trees on GPU,”Journal of Parallel and Distributed Computing, vol. 73, no. 8, pp. 1195–1207, 2013

  20. [20]

    Parallel spatial query processing on gpus using R-trees,

    S. You, J. Zhang, and L. Gruenwald, “Parallel spatial query processing on gpus using R-trees,” inProceedings of the 2nd ACM SIGSPATIAL international workshop on analytics for big geospatial data, pp. 23–31, 2013

  21. [21]

    LibRTS: A Spatial Indexing Library by Ray Tracing,

    L. Geng, R. Lee, and X. Zhang, “LibRTS: A Spatial Indexing Library by Ray Tracing,” inProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’25, (New York, NY , USA), p. 396–411, Association for Computing Machinery, 2025

  22. [22]

    Master-client R-trees: A new parallel R-tree architecture,

    B. Schnitzer and S. T. Leutenegger, “Master-client R-trees: A new parallel R-tree architecture,” inProceedings. Eleventh International Conference on Scientific and Statistical Database Management, pp. 68– 77, IEEE, 1999

  23. [23]

    A Design of Parallel R-tree on Cluster of Workstations,

    L. Shuhua, Z. Fenghua, and S. Yongqiang, “A Design of Parallel R-tree on Cluster of Workstations,” inInternational Workshop on Databases in Networked Information Systems, pp. 119–133, Springer, 2000

  24. [24]

    Catfish: Adaptive RDMA-enabled R-Tree for Low Latency and High Throughput,

    M. Xiao, H. Wang, L. Geng, R. Lee, and X. Zhang, “Catfish: Adaptive RDMA-enabled R-Tree for Low Latency and High Throughput,” in2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 164–175, 2019

  25. [25]

    An RDMA-enabled In-memory Computing Platform for R-tree on Clusters,

    M. Xiao, H. Wang, L. Geng, R. Lee, and X. Zhang, “An RDMA-enabled In-memory Computing Platform for R-tree on Clusters,”ACM Trans. Spatial Algorithms Syst., vol. 8, Feb. 2022

  26. [26]

    PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,

    J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,”ACM SIGARCH Computer Architecture News, vol. 43, no. 3S, pp. 336–348, 2015

  27. [27]

    Near-data processing: Insights from a micro- 46 workshop,

    R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, and S. Swanson, “Near-data processing: Insights from a micro- 46 workshop,”IEEE Micro, vol. 34, no. 4, pp. 36–42, 2014

  28. [28]

    Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology,

    V . Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology,” inProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 273–287, 2017

  29. [29]

    No cap, this memory slaps: Breaking through the memory wall of transactional database systems with processing-in-memory,

    H. Kim, Y . Zhao, A. Pavlo, and P. B. Gibbons, “No cap, this memory slaps: Breaking through the memory wall of transactional database systems with processing-in-memory,”Proceedings of the VLDB Endow- ment, vol. 18, no. 11, pp. 4241–4254, 2025

  30. [30]

    Pim-zd-tree: A fast space-partitioning index leveraging processing-in-memory,

    Y . Zhao, H. Kang, Z. Men, Y . Gu, G. E. Blelloch, L. Dhulipala, C. McGuffey, and P. B. Gibbons, “Pim-zd-tree: A fast space-partitioning index leveraging processing-in-memory,” inProceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 480–495, 2026

  31. [31]

    Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,

    S. Cai, B. Tian, H. Zhang, and M. Gao, “Pimpam: Efficient graph pattern matching on real processing-in-memory hardware,”Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–25, 2024

  32. [32]

    UCR-STAR: The UCR spatio-temporal active repository,

    S. Ghosh, T. Vu, M. A. Eskandari, and A. Eldawy, “UCR-STAR: The UCR spatio-temporal active repository,”SIGSPATIAL Special, vol. 11, no. 2, pp. 34–40, 2019

  33. [33]

    Spiderweb: A spatial data generator for big spatial data applications

    P. Katiyar, T. Wu, S. Migliorini, A. Belussi, and A. Eldawy, “Spiderweb: A spatial data generator for big spatial data applications.” https://spider. cs.ucr.edu/, 2021

  34. [34]

    A full-system perspective on upmem performance,

    B. Friesel, M. L ¨utke Dreimann, and O. Spinczyk, “A full-system perspective on upmem performance,” inProceedings of the 1st Workshop on Disruptive Memory Systems, pp. 1–7, 2023