pith. machine review for the scientific record. sign in

arxiv: 2604.05885 · v1 · submitted 2026-04-07 · 💻 cs.DC · astro-ph.CO· physics.comp-ph

Recognition: no theorem link

JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:30 UTC · model grok-4.3

classification 💻 cs.DC astro-ph.COphysics.comp-ph
keywords GPUtree traversalnearest neighborsfriends-of-friendsMorton orderspatial searchJAXCUDA
0
0 comments X

The pith

Morton plane-based trees let dual-tree algorithms run efficiently on GPUs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a Morton plane-based tree hierarchy tailored for GPU architectures to overcome thread divergence and irregular memory access in traditional tree traversals. The flattened layout supports collaborative dual-tree walks across thread groups, enabling efficient exact k-nearest neighbour search and friends-of-friends clustering. For large datasets with over ten million points, these implementations show more than ten times better performance than other GPU libraries and scale strongly to multiple GPUs. A reader interested in high-performance spatial computing would care because these algorithms are foundational for simulations and data analysis in fields like astrophysics and machine learning.

Core claim

The authors establish that their Morton plane-based tree hierarchy creates a data layout allowing highly coalesced memory accesses and low divergence during dual-tree traversal with collaborative threads. This yields GPU implementations of k-nearest neighbour search and friends-of-friends that improve performance by more than an order of magnitude over competing libraries for large N and scale to distributed multi-GPU setups.

What carries the argument

Morton (z-order) plane-based tree hierarchy that flattens the spatial tree for GPU-optimized dual traversal.

If this is right

  • kNN search runs over 10x faster on GPUs for N greater than 10 million.
  • FoF clustering achieves similar speedups and multi-GPU scaling.
  • The framework can support a broad class of tree-based spatial algorithms on GPUs.
  • Strong scaling enables handling of even larger datasets across distributed systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This hierarchy might apply to other algorithms like Barnes-Hut n-body simulations for faster gravity calculations.
  • Performance could be further tested with varying data distributions to confirm robustness.
  • Integration in JAX opens possibilities for end-to-end differentiable spatial processing in ML models.

Load-bearing premise

The proposed Morton plane-based tree hierarchy will generate memory access patterns regular enough and thread divergence low enough on modern GPUs to achieve the reported speedups without major costs from building the tree or balancing the load.

What would settle it

Running a benchmark on a standard GPU with 10^7 particles where the k-nearest neighbor query time with JZ-Tree is not at least 10 times lower than with the best alternative GPU library.

Figures

Figures reproduced from arXiv: 2604.05885 by Adrian Gutierrez Adame, Jens St\"ucker, Lukas Winkler, Oliver Hahn, Thomas Fl\"oss.

Figure 1
Figure 1. Figure 1: Illustration of uncoalesced versus coalesced memory access. Coalesced [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The z-order curve for a regular grid (left) and for randomly distributed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the steps involved in building a split based tree structure. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Execution time in ms of different parts of the neighbour search [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: (a) Performance of JZ-TREE for finding k = 16 neighbours of different distributions in d = 3 dimensions. (b) Efficiency scaling for distributed computing for different numbers of GPUs as a function of the number of points per GPU. The method scales well across different problem setups and to large problem sizes on multi-GPU. SCIPY-CKDTREE turns out more than two orders of magnitude slower than JZ-TREE and … view at source ↗
Figure 8
Figure 8. Figure 8: Performance of the full FoF algorithm of [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Scaling of the kNN search for 106 points for (a) k = 16 neighbours with dimension (b) d = 3 with neighbour number and (c) k = 16 and d = 3 and given source count with varying query size. dimension number. However, that the performance gap to this brute-force approach is only a factor 10 makes it seem likely that per query point about 10% of all source points need to be checked. It is quite possible that th… view at source ↗
read the original abstract

Algorithms based on spatial tree traversal are widely regarded as among the most efficient and flexible approaches for many problems in CPU-based high-performance computing (HPC). However, directly transferring these algorithms to GPU architectures often yields substantially smaller performance gains than expected in light of the high computational throughput of modern GPUs. The branching nature of tree algorithms leads to thread divergence and irregular memory access patterns -- both of which may severely limit GPU performance. To address these challenges, we propose a Morton (z-order) 'plane-based tree hierarchy' that is specifically designed for GPU architectures. The resulting flattened data layout enables efficient dual-tree traversal with collaborative execution across thread groups, leading to highly coalesced memory access patterns. Based on this framework we present implementations of two important spatial algorithms -- exact $k$-nearest neighbour search and friends-of-friends (FoF) clustering. For both cases, we observe more than an order-of-magnitude performance improvement over the closest competing GPU libraries for large problem sizes ($N \gtrsim 10^7$), together with strong scaling to distributed multi-GPU systems. We provide an open-source implementation, 'JZ-Tree' (JAX z-order tree), which serves as a foundation for efficient GPU implementations of a broad class of tree-based algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces JZ-Tree, a Morton (z-order) plane-based tree hierarchy optimized for GPU architectures to support efficient collaborative dual-tree walks. It provides JAX+CUDA implementations of exact k-nearest neighbor search and friends-of-friends clustering, claiming >10x speedups over competing GPU libraries for N ≳ 10^7 together with strong multi-GPU scaling, and releases the code as open source.

Significance. If the performance and scaling results hold, the work would be a useful practical advance for GPU-accelerated spatial algorithms in HPC, addressing thread divergence and memory irregularity via the specialized flattened layout. Strengths include the detailed kernel pseudocode, separate timing of tree construction, benchmarks up to N=10^8 with explicit library comparisons, and the reproducible open-source release.

minor comments (3)
  1. [§4.2] §4.2: the description of the collaborative dual-tree walk would benefit from an additional sentence clarifying how thread-group synchronization avoids divergence in the plane-based traversal.
  2. [Figure 7] Figure 7: the multi-GPU scaling plot axes and legend are difficult to read at print size; consider increasing font size or splitting into separate panels.
  3. [§5.1] §5.1: while tree-build costs are stated to be sub-dominant, a short table summarizing build vs. query times across the tested N range would make this explicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation for minor revision. The report accurately captures the key contributions of JZ-Tree, including the Morton plane-based hierarchy, collaborative dual-tree walks, performance claims for N ≳ 10^7, and the open-source release. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity; algorithmic implementation with external benchmarks

full rationale

The manuscript describes a Morton plane-based tree hierarchy, flattened layout, collaborative dual-tree walk, and CUDA kernels for kNN and FoF, supported by pseudocode, timing breakdowns, and direct comparisons to external libraries (nanoflann, cuML). Performance claims rest on empirical measurements for N up to 10^8 and multi-GPU scaling, not on any fitted parameters, self-referential predictions, or load-bearing self-citations. Tree construction costs are separately quantified and shown sub-dominant. The open-source JAX+CUDA release provides an independent reproduction path. No equations or uniqueness theorems reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The contribution rests on the assumption that Morton ordering plus a plane-based flattening yields coalesced accesses and low divergence on GPUs; no numerical free parameters are introduced in the abstract, and the new entity is the plane-based hierarchy itself.

axioms (1)
  • standard math Morton (z-order) curves preserve spatial locality sufficiently for tree-based neighbor queries
    Invoked as the foundation for the flattened data layout and dual-tree traversal.
invented entities (1)
  • plane-based tree hierarchy no independent evidence
    purpose: To produce a flattened layout that supports collaborative dual-tree walks with coalesced memory access on GPUs
    New data structure proposed to overcome thread divergence and irregular access in conventional tree algorithms.

pith-pipeline@v0.9.0 · 5548 in / 1516 out tokens · 51395 ms · 2026-05-10T19:30:21.110743+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    2020, Proceedings of the National Academy of Science, 117, 30055, doi: 10.1073/pnas.1912789117

    K. Cranmer, J. Brehmer, and G. Louppe, “The frontier of simulation- based inference,”Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 055–30 062, 2020. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.1912789117

  2. [2]

    Jax: composable transformations of python+numpy pro- grams,

    J. Bradbury, R. Frostig, P. Hawkins, M. Johnson, C. Leary, D. Maclau- rin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “Jax: composable transformations of python+numpy pro- grams,” https://github.com/google/jax, 2018

  3. [3]

    Cuda c++ programming guide,

    NVIDIA Corporation, “Cuda c++ programming guide,” https://docs. nvidia.com/cuda/, 2024

  4. [4]

    Cuda c++ best practices guide,

    ——, “Cuda c++ best practices guide,” https://docs.nvidia.com/cuda/ cuda-c-best-practices-guide/, 2024

  5. [6]

    de Berg, O

    M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars,Computa- tional Geometry: Algorithms and Applications. Springer, 2008. GPU-FRIENDLY KNN AND FOF WITH DUAL TREE W ALKS IN JAX/CUDA 13

  6. [7]

    Multidimensional binary search trees used for associative searching,

    J. L. Bentley, “Multidimensional binary search trees used for associative searching,”Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975

  7. [8]

    A hierarchical o(n log n) force-calculation algorithm,

    J. Barnes and P. Hut, “A hierarchical o(n log n) force-calculation algorithm,”Nature, vol. 324, pp. 446–449, 1986

  8. [9]

    A fast algorithm for particle simulations,

    L. Greengard and V . Rokhlin, “A fast algorithm for particle simulations,” Journal of Computational Physics, vol. 73, pp. 325–348, 1987

  9. [10]

    Maximizing parallelism in the construction of bvhs, octrees, and k-d trees,

    T. Karras, “Maximizing parallelism in the construction of bvhs, octrees, and k-d trees,”High Performance Graphics, 2012

  10. [11]

    Fast bvh construction on gpus,

    C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, “Fast bvh construction on gpus,” inEurographics, 2009

  11. [12]

    Real-time kd-tree construction on graphics hardware,

    K. Zhou, Q. Hou, R. Wang, and B. Guo, “Real-time kd-tree construction on graphics hardware,” inACM SIGGRAPH Asia, 2008

  12. [13]

    Optimizing lbvh-construction and hierarchy- traversal to accelerate knn queries on point clouds using the gpu,

    J. Jakob and M. Guthe, “Optimizing lbvh-construction and hierarchy- traversal to accelerate knn queries on point clouds using the gpu,” in Computer Graphics Forum, vol. 40, no. 1. Wiley Online Library, 2021, pp. 124–137

  13. [14]

    ’n-body’ problems in statistical learn- ing,

    A. G. Gray and A. W. Moore, “’n-body’ problems in statistical learn- ing,” inProceedings of the 14th International Conference on Neural Information Processing Systems, ser. NIPS’00. Cambridge, MA, USA: MIT Press, 2000, p. 500–506

  14. [15]

    Tree-independent dual-tree algorithms,

    R. R. Curtin, W. B. March, P. Ram, D. V . Anderson, A. G. Gray, and C. L. Isbell, “Tree-independent dual-tree algorithms,” inProceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ser. ICML’13. JMLR.org, 2013, p. III–1435–III–1443

  15. [16]

    R. W. Hockney and J. W. Eastwood,Computer Simulation Using Particles. CRC Press, 1988

  16. [17]

    M. P. Allen and D. J. Tildesley,Computer Simulation of Liquids. Oxford University Press, 2017

  17. [18]

    Fast k nearest neighbor search using gpu,

    V . Garcia, E. Debreuve, and M. Barlaud, “Fast k nearest neighbor search using gpu,” inIEEE CVPR Workshops, 2008

  18. [19]

    Billion-scale similarity search with gpus,

    J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with gpus,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2021

  19. [20]

    The Faiss library

    M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J ´egou, “The faiss library,” 2025. [Online]. Available: https://arxiv.org/abs/2401.08281

  20. [21]

    Product quantization for nearest neighbor search,

    H. J ´egou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011

  21. [22]

    Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,

    Y . A. Malkov and D. A. Yashunin, “Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 42, no. 4, pp. 824–836, 2020

  22. [23]

    Analyzing the Performance of Applications at Exascale,

    V . Kamel, H. Yan, and S. Chester, “Clover: A gpu-native, spatio- graph-based approach to exact knn,” inProceedings of the 39th ACM International Conference on Supercomputing, ser. ICS ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 236–249. [Online]. Available: https://doi.org/10.1145/3721145.3730415

  23. [24]

    A computer oriented geodetic data base and a new technique in file sequencing,

    G. M. Morton, “A computer oriented geodetic data base and a new technique in file sequencing,” IBM, Tech. Rep., 1966

  24. [25]

    Sagan,Space-Filling Curves

    H. Sagan,Space-Filling Curves. Springer, 1994

  25. [26]

    Bader,Space-Filling Curves: An Introduction with Applications in Scientific Computing

    M. Bader,Space-Filling Curves: An Introduction with Applications in Scientific Computing. Springer, 2013

  26. [27]

    Samet,Foundations of Multidimensional and Metric Data Structures

    H. Samet,Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006

  27. [28]

    Fast construction of k-nearest neighbor graphs for point clouds,

    M. Connor and P. Kumar, “Fast construction of k-nearest neighbor graphs for point clouds,”IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 4, pp. 599–608, 2010

  28. [29]

    Ieee standard for floating-point arithmetic,

    “Ieee standard for floating-point arithmetic,”IEEE Std 754-2019 (Revi- sion of IEEE 754-2008), pp. 1–84, 2019

  29. [30]

    Cub: Cuda unbound library,

    NVIDIA, “Cub: Cuda unbound library,” 2017. [Online]. Available: https://nvlabs.github.io/cub/

  30. [31]

    Parallel sorting by regular sampling,

    H. Shi and J. Schaeffer, “Parallel sorting by regular sampling,” Journal of Parallel and Distributed Computing, vol. 14, no. 4, pp. 361–372, 1992. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/074373159290075X

  31. [32]

    Samplesort: A sampling approach to minimal storage tree sorting,

    W. D. Frazer and A. C. McKellar, “Samplesort: A sampling approach to minimal storage tree sorting,”J. ACM, vol. 17, no. 3, p. 496–507, Jul. 1970. [Online]. Available: https://doi.org/10.1145/321592.321600

  32. [33]

    Super scalar sample sort,

    P. Sanders and S. Winkel, “Super scalar sample sort,” inAlgorithms – ESA 2004, S. Albers and T. Radzik, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 784–796

  33. [34]

    Mitzenmacher and E

    M. Mitzenmacher and E. Upfal,Probability and Computing: Random- ized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005

  34. [35]

    Linear-time algorithms for pairwise statistical problems,

    P. Ram, D. Lee, W. March, and A. Gray, “Linear-time algorithms for pairwise statistical problems,” inAdvances in Neural Information Processing Systems, Y . Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22. Curran Associates, Inc., 2009. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/2009/file/2421fcb126...

  35. [36]

    Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications,

    M. Turisini, M. Cestari, and G. Amati, “Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications,”Journal of Large-Scale Research Facilities, vol. 8, p. A186, 2024. [Online]. Available: https://doi.org/10.17815/jlsrf-8-186

  36. [37]

    Scipy 1.0: Fundamental algorithms for scientific computing in python,

    P. e. a. Virtanen, “Scipy 1.0: Fundamental algorithms for scientific computing in python,”Nature Methods, vol. 17, pp. 261–272, 2020

  37. [38]

    jaxkd: Minimal JAX implementation of k-nearest neighbors using a k-d tree,

    B. Dodge, “jaxkd: Minimal JAX implementation of k-nearest neighbors using a k-d tree,” Jul. 2024. [Online]. Available: https: //github.com/dodgebc/jaxkd

  38. [39]

    jaxkd-cuda: Custom CUDA kernels for JAX k-d tree operations,

    ——, “jaxkd-cuda: Custom CUDA kernels for JAX k-d tree operations,” 2024, used via the jaxkd interface. [Online]. Available: https://github.com/dodgebc/jaxkd-cuda

  39. [40]

    A stack-free traversal algorithm for left-balanced k-d trees,

    I. Wald, “A stack-free traversal algorithm for left-balanced k-d trees,” Journal of Computer Graphics Techniques (JCGT), vol. 14, no. 1, pp. 40–54, 2025. [Online]. Available: http://jcgt.org/published/0014/01/03/

  40. [41]

    DISCO-DJ II: a differentiable particle-mesh code for cosmology,

    F. List, O. Hahn, T. Fl ¨oss, and L. Winkler, “DISCO-DJ II: a differentiable particle-mesh code for cosmology,”arXiv e-prints, p. arXiv:2510.05206, Oct. 2025

  41. [42]

    Planck 2018 results. VI. Cosmological parameters,

    Planck Collaboration et al., “Planck 2018 results. VI. Cosmological parameters,”A&A, vol. 641, p. A6, Sep. 2020

  42. [43]

    The evolution of large-scale structure in a universe dominated by cold dark matter,

    M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White, “The evolution of large-scale structure in a universe dominated by cold dark matter,” The Astrophysical Journal, vol. 292, pp. 371–394, May 1985

  43. [44]

    Groups of Galaxies. I. Nearby groups,

    J. P. Huchra and M. J. Geller, “Groups of Galaxies. I. Nearby groups,” The Astrophysical Journal, vol. 257, pp. 423–437, Jun. 1982

  44. [45]

    Merger Rates in Hierarchical Models of Galaxy Formation - Part Two - Comparison with N-Body Simulations,

    C. Lacey and S. Cole, “Merger Rates in Hierarchical Models of Galaxy Formation - Part Two - Comparison with N-Body Simulations,”Monthly Notices of the Royal Astronomical Society, vol. 271, p. 676, Dec. 1994

  45. [46]

    Tree-less 3d friends-of-friends using spatial hashing,

    P. Creasey, “Tree-less 3d friends-of-friends using spatial hashing,” Astronomy and Computing, vol. 25, p. 159–167, Oct. 2018. [Online]. Available: http://dx.doi.org/10.1016/j.ascom.2018.09.010

  46. [47]

    Simulating cosmic structure formation with the GADGET-4 code,

    V . Springel, R. Pakmor, O. Zier, and M. Reinecke, “Simulating cosmic structure formation with the GADGET-4 code,”Monthly Notices of the Royal Astronomical Society, vol. 506, no. 2, pp. 2871–2949, Sep. 2021

  47. [48]

    jFoF: GPU Cluster Finding with Gradient Propagation,

    B. Horowitz and A. E. Bayer, “jFoF: GPU Cluster Finding with Gradient Propagation,”arXiv e-prints, p. arXiv:2510.26851, Oct. 2025