Fast geodesic shooting for landmark matching using CUDA
Pith reviewed 2026-05-24 23:36 UTC · model grok-4.3
The pith
A CUDA implementation with shared memory reduction speeds up geodesic shooting for landmark matching by nearly 100 times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that replacing the naive CPU pairwise kernel and marginalization steps with a shared-memory CUDA reduction yields nearly two orders of magnitude speedup, improved numerical accuracy, and better final registration results when solving the geodesic shooting problem for landmark sets of size N in the thousands.
What carries the argument
Shared-memory CUDA reduction of the pairwise kernel K and the intermediate terms that must be summed in the forward and backward passes of the geodesic shooting algorithm.
If this is right
- Registration of landmark sets with several thousand points becomes feasible on desktop GPUs within seconds rather than minutes.
- The same reduction technique can be applied to other O(N squared) kernel summations that appear in the Hamiltonian formulation.
- Improved numerical accuracy reduces the need for ad-hoc stabilization steps in the shooting procedure.
- Biomedical pipelines that repeatedly solve landmark matching can now run at interactive rates.
Where Pith is reading between the lines
- The approach could be combined with multi-GPU or distributed reductions to handle even larger point clouds.
- Because the method preserves the original optimization path it can serve as a drop-in replacement in existing diffeomorphic registration toolboxes.
- Similar shared-memory patterns may accelerate other optimal-control registration methods that rely on dense kernel matrices.
Load-bearing premise
The shared-memory CUDA reduction produces numerically identical or better results than the CPU version without introducing floating-point differences that change the optimization trajectory.
What would settle it
Running both implementations on the same set of 1000 landmarks with identical initial p0 and optimizer settings and checking whether the final registration error and the sequence of p0 updates match to machine precision.
read the original abstract
Landmark matching via geodesic shooting is a prerequisite task for numerous registration based applications in biomedicine. Geodesic shooting has been developed as one solution approach and formulates the diffeomorphic registration as an optimal control problem under the Hamiltonian framework. In this framework, with landmark positions q0 fixed, the problem solely depends on the initial momentum p0 and evolves through time steps according to a set of constraint equations. Given an initial p0, the algorithm flows q and p forward through time steps, calculates a loss based on point-set mismatch and kinetic energy, back-propagate through time to calculate gradient on p0 and update it. In the forward and backward pass, a pair-wise kernel on landmark points K and additional intermediate terms have to be calculated and marginalized, leading to O(N2) computational complexity, N being the number of points to be registered. For medical image applications, N maybe in the range of thousands, rendering this operation computationally expensive. In this work we ropose a CUDA implementation based on shared memory reduction. Our implementation achieves nearly 2 orders magnitude speed up compared to a naive CPU-based implementation, in addition to improved numerical accuracy as well as better registration results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a CUDA implementation with shared-memory reduction to accelerate the O(N²) pairwise kernel K(q,q) and intermediate term computations required for forward flow, loss evaluation, and back-propagation in Hamiltonian geodesic shooting for landmark-based diffeomorphic registration. The central claim is a nearly 100-fold wall-clock speedup over a naive CPU baseline together with improved numerical accuracy and superior registration results.
Significance. If the performance and numerical-equivalence claims are substantiated, the work would make geodesic shooting practical for the thousands of landmarks typical in biomedical registration tasks, extending the applicability of optimal-control diffeomorphic methods.
major comments (2)
- [Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.
- [Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the comments. We address the two major points below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.
Authors: We agree the claim in the abstract is currently unsupported by explicit evidence in the manuscript. Parallel reductions can alter floating-point summation order, so bit-identical results are not guaranteed. We will add a dedicated numerical validation subsection that directly compares kernel outputs (K(q,q) and intermediate terms), summed values, convergence trajectories of p0, and final mismatch between the reference CPU code and the CUDA version across multiple landmark sets. If the results show only negligible differences that do not change the converged solution, we will tone down or remove the 'improved numerical accuracy' phrasing; if they show systematic improvement we will document the conditions under which it occurs. revision: yes
-
Referee: [Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.
Authors: We acknowledge that the current manuscript does not supply the quantitative detail required to evaluate the speedup and accuracy claims. We will add a results section containing: (i) timing tables for landmark counts ranging from a few hundred to several thousand, (ii) wall-clock comparisons against the naive CPU baseline with error bars from repeated runs, (iii) descriptions of the synthetic and real biomedical datasets used, (iv) direct CPU-vs-CUDA comparisons of kernel outputs and convergence behavior, and (v) registration error metrics. These additions will allow the central claims to be evaluated. revision: yes
Circularity Check
No circularity: implementation paper reports empirical measurements with no derivation chain or self-referential predictions
full rationale
The manuscript presents a CUDA shared-memory implementation of an existing geodesic shooting algorithm for landmark matching. Its central claims concern measured wall-clock speedup, registration quality, and numerical behavior relative to a CPU baseline. No equations derive new theoretical quantities from fitted parameters, no predictions are generated by construction from inputs, and no self-citations are invoked to establish uniqueness or load-bearing premises. The reported results are direct experimental outcomes, rendering the work self-contained against external benchmarks with no reduction of claims to their own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diffeomorphic registration can be cast as an optimal-control problem under the Hamiltonian framework in which, with q0 fixed, the problem depends only on initial momentum p0.
Reference graph
Works this paper leans on
-
[1]
Landmark matching via large deformation diffeomorphisms
Joshi SC, Miller MI. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing. 2000 Aug;9(8):1357-70. 10
work page 2000
-
[2]
Geodesic shooting and diffeomorphic matching via textured meshes
Allassonnière S, Trouvé A, Younes L. Geodesic shooting and diffeomorphic matching via textured meshes. InEMMCVPR 2005 Nov 9 (pp. 365-381)
work page 2005
-
[3]
Geodesic shooting for computational anatomy
Miller MI, Trouvé A, Younes L. Geodesic shooting for computational anatomy. Journal of mathematical imaging and vision. 2006 Mar 8;24(2):209-28
work page 2006
-
[4]
Statistics on diffeomorphisms via tangent space representations
Vaillant M, Miller MI, Younes L, Trouvé A. Statistics on diffeomorphisms via tangent space representations. NeuroImage. 2004 Dec 31;23:S161-9
work page 2004
-
[5]
Statistics of shape via principal geodesic analysis on Lie groups
Fletcher PT, Lu C, Joshi S. Statistics of shape via principal geodesic analysis on Lie groups. InComputer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com - puter Society Conference on 2003 Jun 18 (V ol. 1, pp. I-I). IEEE
work page 2003
-
[6]
Zhang M, Wells III WM, Golland P . Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations. InInternational Conference on Medi - cal Image Computing and Computer-Assisted Intervention 2016 Oct 17 (pp. 166-173). Springer International Publishing
work page 2016
-
[7]
Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease
Shamonin DP, Bron EE, Lelieveldt BP, Smits M, Klein S, Staring M. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease. Fron - tiers in neuroinformatics. 2014 Jan 16;7:50
work page 2014
-
[8]
A survey of medical image registration on multicore and the GPU
Shams R, Sadeghi P , Kennedy RA, Hartley RI. A survey of medical image registration on multicore and the GPU. IEEE Signal Processing Magazine. 2010 Mar;27(2):50-60
work page 2010
-
[9]
Shams R, Sadeghi P , Kennedy R, Hartley R. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer methods and programs in biomedicine. 2010 Aug 31;99(2):133-46
work page 2010
-
[10]
On the limited memory BFGS method for large scale optimization
Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical programming. 1989 Aug 1;45(1):503-28
work page 1989
-
[11]
The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1
Johnson HJ, McCormick MM, Ibanez L. The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1. Kitware, Inc.; 2015 Jan 16
work page 2015
-
[12]
Nvidia, Nvidia CUDA C Programming Guide v.5.0, 2012
work page 2012
-
[13]
cuBLAS product page, https://developer.nvidia.com/, last accessed 2018 Jan 16
work page 2018
-
[14]
Optimizing parallel reduction in CUDA
Mark H. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK. 2008;2
work page 2008
-
[15]
Faster parallel reductions on Kepler
Luitjens J. Faster parallel reductions on Kepler. Online] NVidia Coorporation. 2014 Feb
work page 2014
-
[16]
Benchmarking the cost of thread divergence in CUDA
Bialas P , Strzelecki A. Benchmarking the cost of thread divergence in CUDA. InInterna - tional Conference on Parallel Processing and Applied Mathematics 2015 Sep 6 (pp. 570- 579). Springer International Publishing
work page 2015
-
[17]
Xie L, Wisse LE, Das SR, Wang H, Wolk DA, Manjón JV , Yushkevich PA. Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention 2016 Oct 17 (pp. 564-571). Springer International Publishing
work page 2016
-
[18]
Xie L, Pluta JB, Das SR, Wisse LE, Wang H, Mancuso L, Kliot D, Avants BB, Ding SL, Manjón JV , Wolk DA. Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability. NeuroImage. 2017 Jan 1;144:183-202
work page 2017
-
[19]
Efficient and flexible sampling with blue noise proper - ties of triangular meshes
Corsini M, Cignoni P, Scopigno R. Efficient and flexible sampling with blue noise proper - ties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics. 2012 Jun;18(6):914-24
work page 2012
-
[20]
Dryden I and Mardia K. Statistical shape analysis. John Wiley & Sons, New York, 1998
work page 1998
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.