pith. sign in

arxiv: 1907.04839 · v1 · pith:5Q534DMOnew · submitted 2019-07-10 · 💻 cs.CV

Fast geodesic shooting for landmark matching using CUDA

Pith reviewed 2026-05-24 23:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords geodesic shootinglandmark matchingCUDAdiffeomorphic registrationoptimal controlpairwise kernelGPU accelerationbiomedical image registration
0
0 comments X

The pith

A CUDA implementation with shared memory reduction speeds up geodesic shooting for landmark matching by nearly 100 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a GPU method for geodesic shooting, a technique that solves diffeomorphic landmark registration by evolving initial momentum through Hamiltonian dynamics. The core computation involves repeated O(N squared) pairwise kernel evaluations and summations during forward flow, loss calculation, and back-propagation. By moving these operations into shared-memory CUDA reductions the implementation runs nearly two orders of magnitude faster than a direct CPU version while also producing lower registration error. For biomedical applications where thousands of landmarks are common, the speedup makes the optimal-control formulation practical on standard hardware.

Core claim

The authors show that replacing the naive CPU pairwise kernel and marginalization steps with a shared-memory CUDA reduction yields nearly two orders of magnitude speedup, improved numerical accuracy, and better final registration results when solving the geodesic shooting problem for landmark sets of size N in the thousands.

What carries the argument

Shared-memory CUDA reduction of the pairwise kernel K and the intermediate terms that must be summed in the forward and backward passes of the geodesic shooting algorithm.

If this is right

  • Registration of landmark sets with several thousand points becomes feasible on desktop GPUs within seconds rather than minutes.
  • The same reduction technique can be applied to other O(N squared) kernel summations that appear in the Hamiltonian formulation.
  • Improved numerical accuracy reduces the need for ad-hoc stabilization steps in the shooting procedure.
  • Biomedical pipelines that repeatedly solve landmark matching can now run at interactive rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be combined with multi-GPU or distributed reductions to handle even larger point clouds.
  • Because the method preserves the original optimization path it can serve as a drop-in replacement in existing diffeomorphic registration toolboxes.
  • Similar shared-memory patterns may accelerate other optimal-control registration methods that rely on dense kernel matrices.

Load-bearing premise

The shared-memory CUDA reduction produces numerically identical or better results than the CPU version without introducing floating-point differences that change the optimization trajectory.

What would settle it

Running both implementations on the same set of 1000 landmarks with identical initial p0 and optimizer settings and checking whether the final registration error and the sequence of p0 updates match to machine precision.

read the original abstract

Landmark matching via geodesic shooting is a prerequisite task for numerous registration based applications in biomedicine. Geodesic shooting has been developed as one solution approach and formulates the diffeomorphic registration as an optimal control problem under the Hamiltonian framework. In this framework, with landmark positions q0 fixed, the problem solely depends on the initial momentum p0 and evolves through time steps according to a set of constraint equations. Given an initial p0, the algorithm flows q and p forward through time steps, calculates a loss based on point-set mismatch and kinetic energy, back-propagate through time to calculate gradient on p0 and update it. In the forward and backward pass, a pair-wise kernel on landmark points K and additional intermediate terms have to be calculated and marginalized, leading to O(N2) computational complexity, N being the number of points to be registered. For medical image applications, N maybe in the range of thousands, rendering this operation computationally expensive. In this work we ropose a CUDA implementation based on shared memory reduction. Our implementation achieves nearly 2 orders magnitude speed up compared to a naive CPU-based implementation, in addition to improved numerical accuracy as well as better registration results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a CUDA implementation with shared-memory reduction to accelerate the O(N²) pairwise kernel K(q,q) and intermediate term computations required for forward flow, loss evaluation, and back-propagation in Hamiltonian geodesic shooting for landmark-based diffeomorphic registration. The central claim is a nearly 100-fold wall-clock speedup over a naive CPU baseline together with improved numerical accuracy and superior registration results.

Significance. If the performance and numerical-equivalence claims are substantiated, the work would make geodesic shooting practical for the thousands of landmarks typical in biomedical registration tasks, extending the applicability of optimal-control diffeomorphic methods.

major comments (2)
  1. [Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.
  2. [Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address the two major points below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'improved numerical accuracy as well as better registration results' is unsupported; the manuscript must demonstrate that the shared-memory CUDA reduction of the pairwise kernel and intermediate terms produces results that are numerically identical (or at least do not alter the converged p0 or final mismatch) to the reference CPU implementation, because parallel reductions can change floating-point summation order and rounding.

    Authors: We agree the claim in the abstract is currently unsupported by explicit evidence in the manuscript. Parallel reductions can alter floating-point summation order, so bit-identical results are not guaranteed. We will add a dedicated numerical validation subsection that directly compares kernel outputs (K(q,q) and intermediate terms), summed values, convergence trajectories of p0, and final mismatch between the reference CPU code and the CUDA version across multiple landmark sets. If the results show only negligible differences that do not change the converged solution, we will tone down or remove the 'improved numerical accuracy' phrasing; if they show systematic improvement we will document the conditions under which it occurs. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative benchmarks, timing tables, error-bar analysis, dataset descriptions, landmark counts, or direct CPU-vs-CUDA comparison of kernel outputs, convergence trajectories, or registration errors are supplied, so the central speedup and accuracy assertions cannot be evaluated.

    Authors: We acknowledge that the current manuscript does not supply the quantitative detail required to evaluate the speedup and accuracy claims. We will add a results section containing: (i) timing tables for landmark counts ranging from a few hundred to several thousand, (ii) wall-clock comparisons against the naive CPU baseline with error bars from repeated runs, (iii) descriptions of the synthetic and real biomedical datasets used, (iv) direct CPU-vs-CUDA comparisons of kernel outputs and convergence behavior, and (v) registration error metrics. These additions will allow the central claims to be evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation paper reports empirical measurements with no derivation chain or self-referential predictions

full rationale

The manuscript presents a CUDA shared-memory implementation of an existing geodesic shooting algorithm for landmark matching. Its central claims concern measured wall-clock speedup, registration quality, and numerical behavior relative to a CPU baseline. No equations derive new theoretical quantities from fitted parameters, no predictions are generated by construction from inputs, and no self-citations are invoked to establish uniqueness or load-bearing premises. The reported results are direct experimental outcomes, rendering the work self-contained against external benchmarks with no reduction of claims to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on the pre-existing Hamiltonian formulation of diffeomorphic registration; no new free parameters, axioms, or invented entities are introduced.

axioms (1)
  • domain assumption Diffeomorphic registration can be cast as an optimal-control problem under the Hamiltonian framework in which, with q0 fixed, the problem depends only on initial momentum p0.
    Explicitly invoked in the abstract as the foundation of the forward/backward passes.

pith-pipeline@v0.9.0 · 5727 in / 1256 out tokens · 25836 ms · 2026-05-24T23:36:11.216005+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Landmark matching via large deformation diffeomorphisms

    Joshi SC, Miller MI. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing. 2000 Aug;9(8):1357-70. 10

  2. [2]

    Geodesic shooting and diffeomorphic matching via textured meshes

    Allassonnière S, Trouvé A, Younes L. Geodesic shooting and diffeomorphic matching via textured meshes. InEMMCVPR 2005 Nov 9 (pp. 365-381)

  3. [3]

    Geodesic shooting for computational anatomy

    Miller MI, Trouvé A, Younes L. Geodesic shooting for computational anatomy. Journal of mathematical imaging and vision. 2006 Mar 8;24(2):209-28

  4. [4]

    Statistics on diffeomorphisms via tangent space representations

    Vaillant M, Miller MI, Younes L, Trouvé A. Statistics on diffeomorphisms via tangent space representations. NeuroImage. 2004 Dec 31;23:S161-9

  5. [5]

    Statistics of shape via principal geodesic analysis on Lie groups

    Fletcher PT, Lu C, Joshi S. Statistics of shape via principal geodesic analysis on Lie groups. InComputer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com - puter Society Conference on 2003 Jun 18 (V ol. 1, pp. I-I). IEEE

  6. [6]

    Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations

    Zhang M, Wells III WM, Golland P . Low-Dimensional Statistics of Anatomical Variability via Compact Representation of Image Deformations. InInternational Conference on Medi - cal Image Computing and Computer-Assisted Intervention 2016 Oct 17 (pp. 166-173). Springer International Publishing

  7. [7]

    Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease

    Shamonin DP, Bron EE, Lelieveldt BP, Smits M, Klein S, Staring M. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease. Fron - tiers in neuroinformatics. 2014 Jan 16;7:50

  8. [8]

    A survey of medical image registration on multicore and the GPU

    Shams R, Sadeghi P , Kennedy RA, Hartley RI. A survey of medical image registration on multicore and the GPU. IEEE Signal Processing Magazine. 2010 Mar;27(2):50-60

  9. [9]

    Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images

    Shams R, Sadeghi P , Kennedy R, Hartley R. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer methods and programs in biomedicine. 2010 Aug 31;99(2):133-46

  10. [10]

    On the limited memory BFGS method for large scale optimization

    Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical programming. 1989 Aug 1;45(1):503-28

  11. [11]

    The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1

    Johnson HJ, McCormick MM, Ibanez L. The ITK Software Guide Book 1: Introduction and Development Guidelines-V olume 1. Kitware, Inc.; 2015 Jan 16

  12. [12]

    Nvidia, Nvidia CUDA C Programming Guide v.5.0, 2012

  13. [13]

    cuBLAS product page, https://developer.nvidia.com/, last accessed 2018 Jan 16

  14. [14]

    Optimizing parallel reduction in CUDA

    Mark H. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK. 2008;2

  15. [15]

    Faster parallel reductions on Kepler

    Luitjens J. Faster parallel reductions on Kepler. Online] NVidia Coorporation. 2014 Feb

  16. [16]

    Benchmarking the cost of thread divergence in CUDA

    Bialas P , Strzelecki A. Benchmarking the cost of thread divergence in CUDA. InInterna - tional Conference on Parallel Processing and Applied Mathematics 2015 Sep 6 (pp. 570- 579). Springer International Publishing

  17. [17]

    Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI

    Xie L, Wisse LE, Das SR, Wang H, Wolk DA, Manjón JV , Yushkevich PA. Accounting for the Confound of Meninges in Segmenting Entorhinal and Perirhinal Cortices in T1- Weighted MRI. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention 2016 Oct 17 (pp. 564-571). Springer International Publishing

  18. [18]

    Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability

    Xie L, Pluta JB, Das SR, Wisse LE, Wang H, Mancuso L, Kliot D, Avants BB, Ding SL, Manjón JV , Wolk DA. Multi-template analysis of human perirhinal cortex in brain MRI: Explicitly accounting for anatomical variability. NeuroImage. 2017 Jan 1;144:183-202

  19. [19]

    Efficient and flexible sampling with blue noise proper - ties of triangular meshes

    Corsini M, Cignoni P, Scopigno R. Efficient and flexible sampling with blue noise proper - ties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics. 2012 Jun;18(6):914-24

  20. [20]

    Statistical shape analysis

    Dryden I and Mardia K. Statistical shape analysis. John Wiley & Sons, New York, 1998