Fast geodesic shooting for landmark matching using CUDA

Jiancong Wang

arxiv: 1907.04839 · v1 · pith:5Q534DMOnew · submitted 2019-07-10 · 💻 cs.CV

Fast geodesic shooting for landmark matching using CUDA

Jiancong Wang This is my paper

Pith reviewed 2026-05-24 23:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords geodesic shootinglandmark matchingCUDAdiffeomorphic registrationoptimal controlpairwise kernelGPU accelerationbiomedical image registration

0 comments

The pith

A CUDA implementation with shared memory reduction speeds up geodesic shooting for landmark matching by nearly 100 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a GPU method for geodesic shooting, a technique that solves diffeomorphic landmark registration by evolving initial momentum through Hamiltonian dynamics. The core computation involves repeated O(N squared) pairwise kernel evaluations and summations during forward flow, loss calculation, and back-propagation. By moving these operations into shared-memory CUDA reductions the implementation runs nearly two orders of magnitude faster than a direct CPU version while also producing lower registration error. For biomedical applications where thousands of landmarks are common, the speedup makes the optimal-control formulation practical on standard hardware.

Core claim

The authors show that replacing the naive CPU pairwise kernel and marginalization steps with a shared-memory CUDA reduction yields nearly two orders of magnitude speedup, improved numerical accuracy, and better final registration results when solving the geodesic shooting problem for landmark sets of size N in the thousands.

What carries the argument

Shared-memory CUDA reduction of the pairwise kernel K and the intermediate terms that must be summed in the forward and backward passes of the geodesic shooting algorithm.

If this is right

Registration of landmark sets with several thousand points becomes feasible on desktop GPUs within seconds rather than minutes.
The same reduction technique can be applied to other O(N squared) kernel summations that appear in the Hamiltonian formulation.
Improved numerical accuracy reduces the need for ad-hoc stabilization steps in the shooting procedure.
Biomedical pipelines that repeatedly solve landmark matching can now run at interactive rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with multi-GPU or distributed reductions to handle even larger point clouds.
Because the method preserves the original optimization path it can serve as a drop-in replacement in existing diffeomorphic registration toolboxes.
Similar shared-memory patterns may accelerate other optimal-control registration methods that rely on dense kernel matrices.

Load-bearing premise

The shared-memory CUDA reduction produces numerically identical or better results than the CPU version without introducing floating-point differences that change the optimization trajectory.

What would settle it

Running both implementations on the same set of 1000 landmarks with identical initial p0 and optimizer settings and checking whether the final registration error and the sequence of p0 updates match to machine precision.

read the original abstract

Landmark matching via geodesic shooting is a prerequisite task for numerous registration based applications in biomedicine. Geodesic shooting has been developed as one solution approach and formulates the diffeomorphic registration as an optimal control problem under the Hamiltonian framework. In this framework, with landmark positions q0 fixed, the problem solely depends on the initial momentum p0 and evolves through time steps according to a set of constraint equations. Given an initial p0, the algorithm flows q and p forward through time steps, calculates a loss based on point-set mismatch and kinetic energy, back-propagate through time to calculate gradient on p0 and update it. In the forward and backward pass, a pair-wise kernel on landmark points K and additional intermediate terms have to be calculated and marginalized, leading to O(N2) computational complexity, N being the number of points to be registered. For medical image applications, N maybe in the range of thousands, rendering this operation computationally expensive. In this work we ropose a CUDA implementation based on shared memory reduction. Our implementation achieves nearly 2 orders magnitude speed up compared to a naive CPU-based implementation, in addition to improved numerical accuracy as well as better registration results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CUDA implementation of geodesic shooting makes unsupported claims about speedup and accuracy due to missing benchmarks.

read the letter

This paper is a CUDA implementation of geodesic shooting for landmark matching that claims nearly 100x speedup along with better accuracy and registration results. The abstract gives no numbers or tables to support any of that. The contribution is the hardware-specific code using shared memory to reduce the pairwise kernel calculations. The algorithm itself is from prior literature on the Hamiltonian formulation. The description of the forward flow, loss, and back-propagation steps is clear, and the motivation for targeting medical image registration with thousands of points is reasonable. The soft spot is the missing evidence. Without benchmarks against a CPU version or other methods, and without any analysis of numerical differences from the reduction, the accuracy claim doesn't hold up. Floating point summation order matters here, and the paper needs to address whether the CUDA version produces the same or better results in practice. The stress test concern about equivalence is on point. If the shared memory reduction changes the values even slightly, it could alter the gradient updates and the final registration, making the improved claim hard to believe without data. Overall this is for practitioners who might want to try the code for faster registrations. It doesn't reorganize any theory or provide new insights into the method. I wouldn't bring it to a reading group or cite it. It doesn't deserve peer review until the results are added and verified. The authors would need to show actual timings, error metrics, and numerical comparisons first.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a CUDA implementation with shared-memory reduction to accelerate the O(N²) pairwise kernel K(q,q) and intermediate term computations required for forward flow, loss evaluation, and back-propagation in Hamiltonian geodesic shooting for landmark-based diffeomorphic registration. The central claim is a nearly 100-fold wall-clock speedup over a naive CPU baseline together with improved numerical accuracy and superior registration results.

Significance. If the performance and numerical-equivalence claims are substantiated, the work would make geodesic shooting practical for the thousands of landmarks typical in biomedical registration tasks, extending the applicability of optimal-control diffeomorphic methods.

major comments (2)

[Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.
[Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address the two major points below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.

Authors: We agree the claim in the abstract is currently unsupported by explicit evidence in the manuscript. Parallel reductions can alter floating-point summation order, so bit-identical results are not guaranteed. We will add a dedicated numerical validation subsection that directly compares kernel outputs (K(q,q) and intermediate terms), summed values, convergence trajectories of p0, and final mismatch between the reference CPU code and the CUDA version across multiple landmark sets. If the results show only negligible differences that do not change the converged solution, we will tone down or remove the 'improved numerical accuracy' phrasing; if they show systematic improvement we will document the conditions under which it occurs. revision: yes
Referee: [Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.

Authors: We acknowledge that the current manuscript does not supply the quantitative detail required to evaluate the speedup and accuracy claims. We will add a results section containing: (i) timing tables for landmark counts ranging from a few hundred to several thousand, (ii) wall-clock comparisons against the naive CPU baseline with error bars from repeated runs, (iii) descriptions of the synthetic and real biomedical datasets used, (iv) direct CPU-vs-CUDA comparisons of kernel outputs and convergence behavior, and (v) registration error metrics. These additions will allow the central claims to be evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation paper reports empirical measurements with no derivation chain or self-referential predictions

full rationale

The manuscript presents a CUDA shared-memory implementation of an existing geodesic shooting algorithm for landmark matching. Its central claims concern measured wall-clock speedup, registration quality, and numerical behavior relative to a CPU baseline. No equations derive new theoretical quantities from fitted parameters, no predictions are generated by construction from inputs, and no self-citations are invoked to establish uniqueness or load-bearing premises. The reported results are direct experimental outcomes, rendering the work self-contained against external benchmarks with no reduction of claims to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on the pre-existing Hamiltonian formulation of diffeomorphic registration; no new free parameters, axioms, or invented entities are introduced.

axioms (1)

domain assumption Diffeomorphic registration can be cast as an optimal-control problem under the Hamiltonian framework in which, with q0 fixed, the problem depends only on initial momentum p0.
Explicitly invoked in the abstract as the foundation of the forward/backward passes.

pith-pipeline@v0.9.0 · 5727 in / 1256 out tokens · 25836 ms · 2026-05-24T23:36:11.216005+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

Landmark matching via large deformation diffeomorphisms

Joshi SC, Miller MI. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing. 2000 Aug;9(8):1357-70. 10

work page 2000
[2]

Geodesic shooting and diffeomorphic matching via textured meshes

Allassonnière S, Trouvé A, Younes L. Geodesic shooting and diffeomorphic matching via textured meshes. InEMMCVPR 2005 Nov 9 (pp. 365-381)

work page 2005
[3]

Geodesic shooting for computational anatomy

Miller MI, Trouvé A, Younes L. Geodesic shooting for computational anatomy. Journal of mathematical imaging and vision. 2006 Mar 8;24(2):209-28

work page 2006
[4]

Statistics on diffeomorphisms via tangent space representations

Vaillant M, Miller MI, Younes L, Trouvé A. Statistics on diffeomorphisms via tangent space representations. NeuroImage. 2004 Dec 31;23:S161-9

work page 2004
[5]

Statistics of shape via principal geodesic analysis on Lie groups

Fletcher PT, Lu C, Joshi S. Statistics of shape via principal geodesic analysis on Lie groups. InComputer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com - puter Society Conference on 2003 Jun 18 (V ol. 1, pp. I-I). IEEE

work page 2003
[6]

Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations

Zhang M, Wells III WM, Golland P . Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations. InInternational Conference on Medi - cal Image Computing and Computer-Assisted Intervention 2016 Oct 17 (pp. 166-173). Springer International Publishing

work page 2016
[7]

Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease

Shamonin DP, Bron EE, Lelieveldt BP, Smits M, Klein S, Staring M. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease. Fron - tiers in neuroinformatics. 2014 Jan 16;7:50

work page 2014
[8]

A survey of medical image registration on multicore and the GPU

Shams R, Sadeghi P , Kennedy RA, Hartley RI. A survey of medical image registration on multicore and the GPU. IEEE Signal Processing Magazine. 2010 Mar;27(2):50-60

work page 2010
[9]

Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images

Shams R, Sadeghi P , Kennedy R, Hartley R. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer methods and programs in biomedicine. 2010 Aug 31;99(2):133-46

work page 2010
[10]

On the limited memory BFGS method for large scale optimization

Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical programming. 1989 Aug 1;45(1):503-28

work page 1989
[11]

The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1

Johnson HJ, McCormick MM, Ibanez L. The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1. Kitware, Inc.; 2015 Jan 16

work page 2015
[12]

Nvidia, Nvidia CUDA C Programming Guide v.5.0, 2012

work page 2012
[13]

cuBLAS product page, https://developer.nvidia.com/, last accessed 2018 Jan 16

work page 2018
[14]

Optimizing parallel reduction in CUDA

Mark H. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK. 2008;2

work page 2008
[15]

Faster parallel reductions on Kepler

Luitjens J. Faster parallel reductions on Kepler. Online] NVidia Coorporation. 2014 Feb

work page 2014
[16]

Benchmarking the cost of thread divergence in CUDA

Bialas P , Strzelecki A. Benchmarking the cost of thread divergence in CUDA. InInterna - tional Conference on Parallel Processing and Applied Mathematics 2015 Sep 6 (pp. 570- 579). Springer International Publishing

work page 2015
[17]

Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI

Xie L, Wisse LE, Das SR, Wang H, Wolk DA, Manjón JV , Yushkevich PA. Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention 2016 Oct 17 (pp. 564-571). Springer International Publishing

work page 2016
[18]

Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability

Xie L, Pluta JB, Das SR, Wisse LE, Wang H, Mancuso L, Kliot D, Avants BB, Ding SL, Manjón JV , Wolk DA. Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability. NeuroImage. 2017 Jan 1;144:183-202

work page 2017
[19]

Efficient and flexible sampling with blue noise proper - ties of triangular meshes

Corsini M, Cignoni P, Scopigno R. Efficient and flexible sampling with blue noise proper - ties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics. 2012 Jun;18(6):914-24

work page 2012
[20]

Statistical shape analysis

Dryden I and Mardia K. Statistical shape analysis. John Wiley & Sons, New York, 1998

work page 1998

[1] [1]

Landmark matching via large deformation diffeomorphisms

Joshi SC, Miller MI. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing. 2000 Aug;9(8):1357-70. 10

work page 2000

[2] [2]

Geodesic shooting and diffeomorphic matching via textured meshes

Allassonnière S, Trouvé A, Younes L. Geodesic shooting and diffeomorphic matching via textured meshes. InEMMCVPR 2005 Nov 9 (pp. 365-381)

work page 2005

[3] [3]

Geodesic shooting for computational anatomy

Miller MI, Trouvé A, Younes L. Geodesic shooting for computational anatomy. Journal of mathematical imaging and vision. 2006 Mar 8;24(2):209-28

work page 2006

[4] [4]

Statistics on diffeomorphisms via tangent space representations

Vaillant M, Miller MI, Younes L, Trouvé A. Statistics on diffeomorphisms via tangent space representations. NeuroImage. 2004 Dec 31;23:S161-9

work page 2004

[5] [5]

Statistics of shape via principal geodesic analysis on Lie groups

Fletcher PT, Lu C, Joshi S. Statistics of shape via principal geodesic analysis on Lie groups. InComputer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com - puter Society Conference on 2003 Jun 18 (V ol. 1, pp. I-I). IEEE

work page 2003

[6] [6]

Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations

Zhang M, Wells III WM, Golland P . Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations. InInternational Conference on Medi - cal Image Computing and Computer-Assisted Intervention 2016 Oct 17 (pp. 166-173). Springer International Publishing

work page 2016

[7] [7]

Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease

Shamonin DP, Bron EE, Lelieveldt BP, Smits M, Klein S, Staring M. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease. Fron - tiers in neuroinformatics. 2014 Jan 16;7:50

work page 2014

[8] [8]

A survey of medical image registration on multicore and the GPU

Shams R, Sadeghi P , Kennedy RA, Hartley RI. A survey of medical image registration on multicore and the GPU. IEEE Signal Processing Magazine. 2010 Mar;27(2):50-60

work page 2010

[9] [9]

Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images

Shams R, Sadeghi P , Kennedy R, Hartley R. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer methods and programs in biomedicine. 2010 Aug 31;99(2):133-46

work page 2010

[10] [10]

On the limited memory BFGS method for large scale optimization

Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical programming. 1989 Aug 1;45(1):503-28

work page 1989

[11] [11]

The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1

Johnson HJ, McCormick MM, Ibanez L. The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1. Kitware, Inc.; 2015 Jan 16

work page 2015

[12] [12]

Nvidia, Nvidia CUDA C Programming Guide v.5.0, 2012

work page 2012

[13] [13]

cuBLAS product page, https://developer.nvidia.com/, last accessed 2018 Jan 16

work page 2018

[14] [14]

Optimizing parallel reduction in CUDA

Mark H. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK. 2008;2

work page 2008

[15] [15]

Faster parallel reductions on Kepler

Luitjens J. Faster parallel reductions on Kepler. Online] NVidia Coorporation. 2014 Feb

work page 2014

[16] [16]

Benchmarking the cost of thread divergence in CUDA

Bialas P , Strzelecki A. Benchmarking the cost of thread divergence in CUDA. InInterna - tional Conference on Parallel Processing and Applied Mathematics 2015 Sep 6 (pp. 570- 579). Springer International Publishing

work page 2015

[17] [17]

Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI

Xie L, Wisse LE, Das SR, Wang H, Wolk DA, Manjón JV , Yushkevich PA. Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention 2016 Oct 17 (pp. 564-571). Springer International Publishing

work page 2016

[18] [18]

Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability

Xie L, Pluta JB, Das SR, Wisse LE, Wang H, Mancuso L, Kliot D, Avants BB, Ding SL, Manjón JV , Wolk DA. Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability. NeuroImage. 2017 Jan 1;144:183-202

work page 2017

[19] [19]

Efficient and flexible sampling with blue noise proper - ties of triangular meshes

Corsini M, Cignoni P, Scopigno R. Efficient and flexible sampling with blue noise proper - ties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics. 2012 Jun;18(6):914-24

work page 2012

[20] [20]

Statistical shape analysis

Dryden I and Mardia K. Statistical shape analysis. John Wiley & Sons, New York, 1998

work page 1998