arxiv: 2604.05885 · v1 · submitted 2026-04-07 · 💻 cs.DC · astro-ph.CO· physics.comp-ph

Recognition: no theorem link

JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

Jens St\"ucker , Oliver Hahn , Lukas Winkler , Adrian Gutierrez Adame , Thomas Fl\"oss

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:30 UTC · model grok-4.3

classification 💻 cs.DC astro-ph.COphysics.comp-ph

keywords GPUtree traversalnearest neighborsfriends-of-friendsMorton orderspatial searchJAXCUDA

0 comments

The pith

Morton plane-based trees let dual-tree algorithms run efficiently on GPUs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a Morton plane-based tree hierarchy tailored for GPU architectures to overcome thread divergence and irregular memory access in traditional tree traversals. The flattened layout supports collaborative dual-tree walks across thread groups, enabling efficient exact k-nearest neighbour search and friends-of-friends clustering. For large datasets with over ten million points, these implementations show more than ten times better performance than other GPU libraries and scale strongly to multiple GPUs. A reader interested in high-performance spatial computing would care because these algorithms are foundational for simulations and data analysis in fields like astrophysics and machine learning.

Core claim

The authors establish that their Morton plane-based tree hierarchy creates a data layout allowing highly coalesced memory accesses and low divergence during dual-tree traversal with collaborative threads. This yields GPU implementations of k-nearest neighbour search and friends-of-friends that improve performance by more than an order of magnitude over competing libraries for large N and scale to distributed multi-GPU setups.

What carries the argument

Morton (z-order) plane-based tree hierarchy that flattens the spatial tree for GPU-optimized dual traversal.

If this is right

kNN search runs over 10x faster on GPUs for N greater than 10 million.
FoF clustering achieves similar speedups and multi-GPU scaling.
The framework can support a broad class of tree-based spatial algorithms on GPUs.
Strong scaling enables handling of even larger datasets across distributed systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This hierarchy might apply to other algorithms like Barnes-Hut n-body simulations for faster gravity calculations.
Performance could be further tested with varying data distributions to confirm robustness.
Integration in JAX opens possibilities for end-to-end differentiable spatial processing in ML models.

Load-bearing premise

The proposed Morton plane-based tree hierarchy will generate memory access patterns regular enough and thread divergence low enough on modern GPUs to achieve the reported speedups without major costs from building the tree or balancing the load.

What would settle it

Running a benchmark on a standard GPU with 10^7 particles where the k-nearest neighbor query time with JZ-Tree is not at least 10 times lower than with the best alternative GPU library.

Figures

Figures reproduced from arXiv: 2604.05885 by Adrian Gutierrez Adame, Jens St\"ucker, Lukas Winkler, Oliver Hahn, Thomas Fl\"oss.

**Figure 2.** Figure 2: The z-order curve for a regular grid (left) and for randomly distributed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the steps involved in building a split based tree structure. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Execution time in ms of different parts of the neighbour search [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: (a) Performance of JZ-TREE for finding k = 16 neighbours of different distributions in d = 3 dimensions. (b) Efficiency scaling for distributed computing for different numbers of GPUs as a function of the number of points per GPU. The method scales well across different problem setups and to large problem sizes on multi-GPU. SCIPY-CKDTREE turns out more than two orders of magnitude slower than JZ-TREE and … view at source ↗

**Figure 8.** Figure 8: Performance of the full FoF algorithm of [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Scaling of the kNN search for 106 points for (a) k = 16 neighbours with dimension (b) d = 3 with neighbour number and (c) k = 16 and d = 3 and given source count with varying query size. dimension number. However, that the performance gap to this brute-force approach is only a factor 10 makes it seem likely that per query point about 10% of all source points need to be checked. It is quite possible that th… view at source ↗

read the original abstract

Algorithms based on spatial tree traversal are widely regarded as among the most efficient and flexible approaches for many problems in CPU-based high-performance computing (HPC). However, directly transferring these algorithms to GPU architectures often yields substantially smaller performance gains than expected in light of the high computational throughput of modern GPUs. The branching nature of tree algorithms leads to thread divergence and irregular memory access patterns -- both of which may severely limit GPU performance. To address these challenges, we propose a Morton (z-order) 'plane-based tree hierarchy' that is specifically designed for GPU architectures. The resulting flattened data layout enables efficient dual-tree traversal with collaborative execution across thread groups, leading to highly coalesced memory access patterns. Based on this framework we present implementations of two important spatial algorithms -- exact $k$-nearest neighbour search and friends-of-friends (FoF) clustering. For both cases, we observe more than an order-of-magnitude performance improvement over the closest competing GPU libraries for large problem sizes ($N \gtrsim 10^7$), together with strong scaling to distributed multi-GPU systems. We provide an open-source implementation, 'JZ-Tree' (JAX z-order tree), which serves as a foundation for efficient GPU implementations of a broad class of tree-based algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JZ-Tree shows a working Morton plane-based GPU tree that delivers the claimed >10x speedups on kNN and FoF with open code and solid benchmarks.

read the letter

The main point is that this paper's Morton plane-based tree hierarchy actually produces the reported order-of-magnitude GPU gains for exact k-nearest neighbor search and friends-of-friends clustering. The flattened layout supports collaborative dual-tree traversal across thread groups, which improves coalescing and reduces divergence compared to standard approaches. They back this with pseudocode for the CUDA kernels, separate timing that shows tree construction stays sub-dominant, benchmarks up to N=10^8 particles, direct comparisons against nanoflann and cuML variants, and multi-GPU scaling curves. The open JAX+CUDA release lets anyone reproduce the numbers directly. That combination of new hierarchy, measured results, and available code is the real contribution here. The evidence lines up with the abstract claims once the full implementation details are included. Soft spots are limited. The tested particle distributions match typical cosmology workloads, so the load-balancing behavior is shown for those cases; more extreme clustering might expose additional imbalance, though the paper discusses mitigation steps. JAX overhead is present but already folded into the reported timings. No load-bearing gaps appear in the architecture description or the empirical data. This work is aimed at computational astrophysicists and cosmologists who run large-scale structure simulations and need faster spatial queries on GPU hardware. Anyone building or optimizing tree-based algorithms for GPUs will find the concrete layout and traversal method useful, plus the released code as a starting point. It deserves serious referee time because the central performance claims rest on reproducible implementation and explicit benchmarks rather than untested assumptions.

Referee Report

0 major / 3 minor

Summary. The paper introduces JZ-Tree, a Morton (z-order) plane-based tree hierarchy optimized for GPU architectures to support efficient collaborative dual-tree walks. It provides JAX+CUDA implementations of exact k-nearest neighbor search and friends-of-friends clustering, claiming >10x speedups over competing GPU libraries for N ≳ 10^7 together with strong multi-GPU scaling, and releases the code as open source.

Significance. If the performance and scaling results hold, the work would be a useful practical advance for GPU-accelerated spatial algorithms in HPC, addressing thread divergence and memory irregularity via the specialized flattened layout. Strengths include the detailed kernel pseudocode, separate timing of tree construction, benchmarks up to N=10^8 with explicit library comparisons, and the reproducible open-source release.

minor comments (3)

[§4.2] §4.2: the description of the collaborative dual-tree walk would benefit from an additional sentence clarifying how thread-group synchronization avoids divergence in the plane-based traversal.
[Figure 7] Figure 7: the multi-GPU scaling plot axes and legend are difficult to read at print size; consider increasing font size or splitting into separate panels.
[§5.1] §5.1: while tree-build costs are stated to be sub-dominant, a short table summarizing build vs. query times across the tested N range would make this explicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation for minor revision. The report accurately captures the key contributions of JZ-Tree, including the Morton plane-based hierarchy, collaborative dual-tree walks, performance claims for N ≳ 10^7, and the open-source release. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity; algorithmic implementation with external benchmarks

full rationale

The manuscript describes a Morton plane-based tree hierarchy, flattened layout, collaborative dual-tree walk, and CUDA kernels for kNN and FoF, supported by pseudocode, timing breakdowns, and direct comparisons to external libraries (nanoflann, cuML). Performance claims rest on empirical measurements for N up to 10^8 and multi-GPU scaling, not on any fitted parameters, self-referential predictions, or load-bearing self-citations. Tree construction costs are separately quantified and shown sub-dominant. The open-source JAX+CUDA release provides an independent reproduction path. No equations or uniqueness theorems reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The contribution rests on the assumption that Morton ordering plus a plane-based flattening yields coalesced accesses and low divergence on GPUs; no numerical free parameters are introduced in the abstract, and the new entity is the plane-based hierarchy itself.

axioms (1)

standard math Morton (z-order) curves preserve spatial locality sufficiently for tree-based neighbor queries
Invoked as the foundation for the flattened data layout and dual-tree traversal.

invented entities (1)

plane-based tree hierarchy no independent evidence
purpose: To produce a flattened layout that supports collaborative dual-tree walks with coalesced memory access on GPUs
New data structure proposed to overcome thread divergence and irregular access in conventional tree algorithms.

pith-pipeline@v0.9.0 · 5548 in / 1516 out tokens · 51395 ms · 2026-05-10T19:30:21.110743+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 9 canonical work pages · 1 internal anchor

[1]

2020, Proceedings of the National Academy of Science, 117, 30055, doi: 10.1073/pnas.1912789117

K. Cranmer, J. Brehmer, and G. Louppe, “The frontier of simulation- based inference,”Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 055–30 062, 2020. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.1912789117

work page doi:10.1073/pnas.1912789117 2020
[2]

Jax: composable transformations of python+numpy pro- grams,

J. Bradbury, R. Frostig, P. Hawkins, M. Johnson, C. Leary, D. Maclau- rin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “Jax: composable transformations of python+numpy pro- grams,” https://github.com/google/jax, 2018

2018
[3]

Cuda c++ programming guide,

NVIDIA Corporation, “Cuda c++ programming guide,” https://docs. nvidia.com/cuda/, 2024

2024
[4]

Cuda c++ best practices guide,

——, “Cuda c++ best practices guide,” https://docs.nvidia.com/cuda/ cuda-c-best-practices-guide/, 2024

2024
[6]

de Berg, O

M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars,Computa- tional Geometry: Algorithms and Applications. Springer, 2008. GPU-FRIENDLY KNN AND FOF WITH DUAL TREE W ALKS IN JAX/CUDA 13

2008
[7]

Multidimensional binary search trees used for associative searching,

J. L. Bentley, “Multidimensional binary search trees used for associative searching,”Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975

1975
[8]

A hierarchical o(n log n) force-calculation algorithm,

J. Barnes and P. Hut, “A hierarchical o(n log n) force-calculation algorithm,”Nature, vol. 324, pp. 446–449, 1986

1986
[9]

A fast algorithm for particle simulations,

L. Greengard and V . Rokhlin, “A fast algorithm for particle simulations,” Journal of Computational Physics, vol. 73, pp. 325–348, 1987

1987
[10]

Maximizing parallelism in the construction of bvhs, octrees, and k-d trees,

T. Karras, “Maximizing parallelism in the construction of bvhs, octrees, and k-d trees,”High Performance Graphics, 2012

2012
[11]

Fast bvh construction on gpus,

C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, “Fast bvh construction on gpus,” inEurographics, 2009

2009
[12]

Real-time kd-tree construction on graphics hardware,

K. Zhou, Q. Hou, R. Wang, and B. Guo, “Real-time kd-tree construction on graphics hardware,” inACM SIGGRAPH Asia, 2008

2008
[13]

Optimizing lbvh-construction and hierarchy- traversal to accelerate knn queries on point clouds using the gpu,

J. Jakob and M. Guthe, “Optimizing lbvh-construction and hierarchy- traversal to accelerate knn queries on point clouds using the gpu,” in Computer Graphics Forum, vol. 40, no. 1. Wiley Online Library, 2021, pp. 124–137

2021
[14]

’n-body’ problems in statistical learn- ing,

A. G. Gray and A. W. Moore, “’n-body’ problems in statistical learn- ing,” inProceedings of the 14th International Conference on Neural Information Processing Systems, ser. NIPS’00. Cambridge, MA, USA: MIT Press, 2000, p. 500–506

2000
[15]

Tree-independent dual-tree algorithms,

R. R. Curtin, W. B. March, P. Ram, D. V . Anderson, A. G. Gray, and C. L. Isbell, “Tree-independent dual-tree algorithms,” inProceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ser. ICML’13. JMLR.org, 2013, p. III–1435–III–1443

2013
[16]

R. W. Hockney and J. W. Eastwood,Computer Simulation Using Particles. CRC Press, 1988

1988
[17]

M. P. Allen and D. J. Tildesley,Computer Simulation of Liquids. Oxford University Press, 2017

2017
[18]

Fast k nearest neighbor search using gpu,

V . Garcia, E. Debreuve, and M. Barlaud, “Fast k nearest neighbor search using gpu,” inIEEE CVPR Workshops, 2008

2008
[19]

Billion-scale similarity search with gpus,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with gpus,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2021

2021
[20]

The Faiss library

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J ´egou, “The faiss library,” 2025. [Online]. Available: https://arxiv.org/abs/2401.08281

work page internal anchor Pith review arXiv 2025
[21]

Product quantization for nearest neighbor search,

H. J ´egou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011

2011
[22]

Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,

Y . A. Malkov and D. A. Yashunin, “Efficient and robust approxi- mate nearest neighbor search using hierarchical navigable small world graphs,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 42, no. 4, pp. 824–836, 2020

2020
[23]

Analyzing the Performance of Applications at Exascale,

V . Kamel, H. Yan, and S. Chester, “Clover: A gpu-native, spatio- graph-based approach to exact knn,” inProceedings of the 39th ACM International Conference on Supercomputing, ser. ICS ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 236–249. [Online]. Available: https://doi.org/10.1145/3721145.3730415

work page doi:10.1145/3721145.3730415 2025
[24]

A computer oriented geodetic data base and a new technique in file sequencing,

G. M. Morton, “A computer oriented geodetic data base and a new technique in file sequencing,” IBM, Tech. Rep., 1966

1966
[25]

Sagan,Space-Filling Curves

H. Sagan,Space-Filling Curves. Springer, 1994

1994
[26]

Bader,Space-Filling Curves: An Introduction with Applications in Scientific Computing

M. Bader,Space-Filling Curves: An Introduction with Applications in Scientific Computing. Springer, 2013

2013
[27]

Samet,Foundations of Multidimensional and Metric Data Structures

H. Samet,Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006

2006
[28]

Fast construction of k-nearest neighbor graphs for point clouds,

M. Connor and P. Kumar, “Fast construction of k-nearest neighbor graphs for point clouds,”IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 4, pp. 599–608, 2010

2010
[29]

Ieee standard for floating-point arithmetic,

“Ieee standard for floating-point arithmetic,”IEEE Std 754-2019 (Revi- sion of IEEE 754-2008), pp. 1–84, 2019

2019
[30]

Cub: Cuda unbound library,

NVIDIA, “Cub: Cuda unbound library,” 2017. [Online]. Available: https://nvlabs.github.io/cub/

2017
[31]

Parallel sorting by regular sampling,

H. Shi and J. Schaeffer, “Parallel sorting by regular sampling,” Journal of Parallel and Distributed Computing, vol. 14, no. 4, pp. 361–372, 1992. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/074373159290075X

work page arXiv 1992
[32]

Samplesort: A sampling approach to minimal storage tree sorting,

W. D. Frazer and A. C. McKellar, “Samplesort: A sampling approach to minimal storage tree sorting,”J. ACM, vol. 17, no. 3, p. 496–507, Jul. 1970. [Online]. Available: https://doi.org/10.1145/321592.321600

work page doi:10.1145/321592.321600 1970
[33]

Super scalar sample sort,

P. Sanders and S. Winkel, “Super scalar sample sort,” inAlgorithms – ESA 2004, S. Albers and T. Radzik, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 784–796

2004
[34]

Mitzenmacher and E

M. Mitzenmacher and E. Upfal,Probability and Computing: Random- ized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005

2005
[35]

Linear-time algorithms for pairwise statistical problems,

P. Ram, D. Lee, W. March, and A. Gray, “Linear-time algorithms for pairwise statistical problems,” inAdvances in Neural Information Processing Systems, Y . Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22. Curran Associates, Inc., 2009. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/2009/file/2421fcb126...

2009
[36]

Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications,

M. Turisini, M. Cestari, and G. Amati, “Leonardo: A pan-european pre-exascale supercomputer for hpc and ai applications,”Journal of Large-Scale Research Facilities, vol. 8, p. A186, 2024. [Online]. Available: https://doi.org/10.17815/jlsrf-8-186

work page doi:10.17815/jlsrf-8-186 2024
[37]

Scipy 1.0: Fundamental algorithms for scientific computing in python,

P. e. a. Virtanen, “Scipy 1.0: Fundamental algorithms for scientific computing in python,”Nature Methods, vol. 17, pp. 261–272, 2020

2020
[38]

jaxkd: Minimal JAX implementation of k-nearest neighbors using a k-d tree,

B. Dodge, “jaxkd: Minimal JAX implementation of k-nearest neighbors using a k-d tree,” Jul. 2024. [Online]. Available: https: //github.com/dodgebc/jaxkd

2024
[39]

jaxkd-cuda: Custom CUDA kernels for JAX k-d tree operations,

——, “jaxkd-cuda: Custom CUDA kernels for JAX k-d tree operations,” 2024, used via the jaxkd interface. [Online]. Available: https://github.com/dodgebc/jaxkd-cuda

2024
[40]

A stack-free traversal algorithm for left-balanced k-d trees,

I. Wald, “A stack-free traversal algorithm for left-balanced k-d trees,” Journal of Computer Graphics Techniques (JCGT), vol. 14, no. 1, pp. 40–54, 2025. [Online]. Available: http://jcgt.org/published/0014/01/03/

2025
[41]

DISCO-DJ II: a differentiable particle-mesh code for cosmology,

F. List, O. Hahn, T. Fl ¨oss, and L. Winkler, “DISCO-DJ II: a differentiable particle-mesh code for cosmology,”arXiv e-prints, p. arXiv:2510.05206, Oct. 2025

work page arXiv 2025
[42]

Planck 2018 results. VI. Cosmological parameters,

Planck Collaboration et al., “Planck 2018 results. VI. Cosmological parameters,”A&A, vol. 641, p. A6, Sep. 2020

2018
[43]

The evolution of large-scale structure in a universe dominated by cold dark matter,

M. Davis, G. Efstathiou, C. S. Frenk, and S. D. M. White, “The evolution of large-scale structure in a universe dominated by cold dark matter,” The Astrophysical Journal, vol. 292, pp. 371–394, May 1985

1985
[44]

Groups of Galaxies. I. Nearby groups,

J. P. Huchra and M. J. Geller, “Groups of Galaxies. I. Nearby groups,” The Astrophysical Journal, vol. 257, pp. 423–437, Jun. 1982

1982
[45]

Merger Rates in Hierarchical Models of Galaxy Formation - Part Two - Comparison with N-Body Simulations,

C. Lacey and S. Cole, “Merger Rates in Hierarchical Models of Galaxy Formation - Part Two - Comparison with N-Body Simulations,”Monthly Notices of the Royal Astronomical Society, vol. 271, p. 676, Dec. 1994

1994
[46]

Tree-less 3d friends-of-friends using spatial hashing,

P. Creasey, “Tree-less 3d friends-of-friends using spatial hashing,” Astronomy and Computing, vol. 25, p. 159–167, Oct. 2018. [Online]. Available: http://dx.doi.org/10.1016/j.ascom.2018.09.010

work page doi:10.1016/j.ascom.2018.09.010 2018
[47]

Simulating cosmic structure formation with the GADGET-4 code,

V . Springel, R. Pakmor, O. Zier, and M. Reinecke, “Simulating cosmic structure formation with the GADGET-4 code,”Monthly Notices of the Royal Astronomical Society, vol. 506, no. 2, pp. 2871–2949, Sep. 2021

2021
[48]

jFoF: GPU Cluster Finding with Gradient Propagation,

B. Horowitz and A. E. Bayer, “jFoF: GPU Cluster Finding with Gradient Propagation,”arXiv e-prints, p. arXiv:2510.26851, Oct. 2025

work page arXiv 2025