arxiv: 2605.00837 · v1 · submitted 2026-04-04 · 💻 cs.LG

Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

Hao Xiao

Pith reviewed 2026-05-13 18:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords optimal transport, Sinkhorn algorithm, log-domain, CUDA, GPU acceleration, entropic regularization, numerical stability

The pith

A native CUDA log-domain Sinkhorn solver runs 12x faster than POT on dense 8192-by-8192 optimal transport problems while remaining stable at epsilon=10^{-4}.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FastSinkhorn, a lightweight native CUDA implementation of the log-domain Sinkhorn algorithm for entropic regularized optimal transport. It uses warp-level shuffle reductions and shared-memory tiling to achieve high GPU utilization without the numerical instability or framework overhead of prior methods. The approach supports regularization parameters down to 10^{-4} where standard-domain solvers fail, delivering 12x speedup over the POT library and 5.9x over PyTorch baselines on n=m=8192 problems with only 256 MB memory use. A sympathetic reader would care because Sinkhorn-based optimal transport underpins many machine learning tasks such as color transfer and point cloud matching, and a stable, fast native kernel removes a practical bottleneck for scaling these methods.
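
The failure mode of standard-domain solvers at small regularization can be illustrated with a few lines of pure Python. This is a minimal sketch, not the paper's code: the cost values and the helper names `gibbs_row_sum` and `log_row_sum` are assumptions chosen for illustration. It shows that the Gibbs kernel terms exp(-C/epsilon) underflow to zero at epsilon = 10^{-4}, while the max-shifted log-sum-exp of the same terms stays finite.

```python
import math

def gibbs_row_sum(costs, eps):
    """Standard-domain row sum: sum_j exp(-C_ij / eps)."""
    return sum(math.exp(-c / eps) for c in costs)

def log_row_sum(costs, eps):
    """Log-domain equivalent: log sum_j exp(-C_ij / eps), shifted by the max."""
    scaled = [-c / eps for c in costs]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled))

# Illustrative cost entries (hypothetical, not taken from the paper).
costs = [1.0, 1.5, 2.0]

# Every term is exp(-10000) or smaller, which underflows to 0.0 in float64.
standard = gibbs_row_sum(costs, eps=1e-4)

# The shifted form only ever exponentiates non-positive differences.
stable = log_row_sum(costs, eps=1e-4)

print(standard)  # the scaling update would now divide by this sum
print(stable)    # finite, close to -10000
```

A standard-domain scaling update divides a marginal by a sum like `standard`, so a zero here is what turns into Inf or NaN downstream. The log-domain form avoids the division entirely.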

Core claim

The paper establishes that a carefully engineered native CUDA kernel for log-domain Sinkhorn iterations, built around warp-level shuffle reductions and shared-memory tiling, simultaneously delivers high throughput, low memory footprint, and numerical robustness on dense optimal transport problems, outperforming widely used libraries by substantial margins while preserving the mathematical properties of the algorithm.
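
For readers who want the iteration itself in front of them, the following is a minimal pure-Python sketch of log-domain Sinkhorn on a toy 3-by-3 problem. It assumes the common formulation in which the plan is P_ij = a_i b_j exp((f_i + g_j - C_ij)/epsilon); the function names, the toy data, and epsilon = 0.5 are illustrative assumptions, and nothing here reflects the paper's CUDA kernels or their performance.

```python
import math

def logsumexp(values):
    """Max-shifted log-sum-exp, so exp() only sees non-positive arguments."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn_log(C, a, b, eps, n_iters=100):
    """Alternate the two dual updates; returns the potentials (f, g)."""
    n, m = len(a), len(b)
    log_a = [math.log(v) for v in a]
    log_b = [math.log(v) for v in b]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iters):
        # f_i = -eps * LSE_j(log b_j + (g_j - C_ij) / eps)
        f = [-eps * logsumexp([log_b[j] + (g[j] - C[i][j]) / eps for j in range(m)])
             for i in range(n)]
        # g_j = -eps * LSE_i(log a_i + (f_i - C_ij) / eps)
        g = [-eps * logsumexp([log_a[i] + (f[i] - C[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return f, g

def transport_plan(C, a, b, f, g, eps):
    """Recover P_ij = a_i * b_j * exp((f_i + g_j - C_ij) / eps)."""
    return [[a[i] * b[j] * math.exp((f[i] + g[j] - C[i][j]) / eps)
             for j in range(len(b))] for i in range(len(a))]

# Toy problem (hypothetical data): three points on a line, squared distance cost.
x = [0.0, 0.5, 1.0]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
C = [[(xi - yj) ** 2 for yj in x] for xi in x]
eps = 0.5

f, g = sinkhorn_log(C, a, b, eps)
P = transport_plan(C, a, b, f, g, eps)

# L1 violation of the row and column marginal constraints.
row_err = sum(abs(sum(P[i]) - a[i]) for i in range(3))
col_err = sum(abs(sum(P[i][j] for i in range(3)) - b[j]) for j in range(3))
print(row_err, col_err)
```

Each update is one log-sum-exp per row or column of the cost matrix, which is the reduction a GPU implementation has to make fast.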

What carries the argument

Warp-level shuffle reductions combined with shared-memory tiling that realize the log-domain Sinkhorn matrix scaling iterations directly on the GPU.
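
The shape of a warp-level shuffle reduction can be simulated on the CPU. The sketch below models a 32-lane warp as a Python list and mimics the data movement of CUDA's `__shfl_xor_sync` intrinsic with a hypothetical helper `shfl_xor`. Whether the paper uses the xor (butterfly) variant or a shuffle-down variant is not stated in this summary, so the butterfly pattern here is an assumption.

```python
import math

WARP_SIZE = 32

def shfl_xor(lane_values, offset):
    """Simulated shuffle: lane i reads the value held by lane (i XOR offset)."""
    return [lane_values[i ^ offset] for i in range(len(lane_values))]

def butterfly(lane_values, combine):
    """Five exchange steps (offsets 16, 8, 4, 2, 1) leave the result in every lane."""
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        other = shfl_xor(vals, offset)
        vals = [combine(p, q) for p, q in zip(vals, other)]
        offset //= 2
    return vals

def warp_logsumexp(x):
    """Log-sum-exp over one warp: butterfly max, then butterfly sum of shifted exps."""
    assert len(x) == WARP_SIZE
    m = butterfly(x, max)                                 # every lane holds the max
    shifted = [math.exp(v - mv) for v, mv in zip(x, m)]   # all arguments are <= 0
    s = butterfly(shifted, lambda p, q: p + q)            # every lane holds the sum
    return [mv + math.log(sv) for mv, sv in zip(m, s)]

# Illustrative inputs (hypothetical): a wide spread of log-domain values.
x = [-(i % 7) * 100.0 - i * 0.01 for i in range(WARP_SIZE)]
per_lane = warp_logsumexp(x)

# Single-pass reference for comparison.
ref_max = max(x)
reference = ref_max + math.log(math.fsum(math.exp(v - ref_max) for v in x))
print(per_lane[0], reference)
```

On real hardware the exchange happens register to register, which is why this pattern needs no shared memory within a warp.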

If this is right

Enables stable Sinkhorn computation for regularization values as small as 10^{-4} on large dense problems.
Reduces GPU memory consumption to 256 MB for n=m=8192 instances while still achieving high throughput.
Provides 12x speedup over the POT library and 5.9x speedup over PyTorch GPU baselines on the tested dense problems.
Supports direct application to image color transfer and 3D point cloud matching without framework overhead.
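
The 256 MB figure is consistent with a single-precision dense cost matrix, matching the "4n^2 bytes" annotation in the scaling figure. The short check below works through that arithmetic. The count of four auxiliary length-n vectors is an assumption added for scale, not a number from the paper.

```python
n = m = 8192
bytes_per_float32 = 4

# Dense cost matrix stored in single precision: 4 * n * m bytes.
cost_matrix_bytes = n * m * bytes_per_float32
cost_matrix_mib = cost_matrix_bytes / 2**20

# Assumed auxiliary storage: two potentials and two marginals of length n.
vector_bytes = 4 * n * bytes_per_float32
vector_mib = vector_bytes / 2**20

print(cost_matrix_bytes)  # 268435456
print(cost_matrix_mib)    # 256.0
print(vector_mib)         # 0.125
```

Under these assumptions the vectors are negligible next to the matrix, so memory is dominated by the n^2 term.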

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same low-level GPU primitives could accelerate other iterative scaling algorithms that benefit from log-domain stability.
Native kernels of this style may allow optimal transport to be embedded directly inside larger GPU-resident machine learning pipelines without repeated host-device transfers.
The memory and speed profile suggests viability for real-time or near-real-time OT computations on mid-range GPUs.

Load-bearing premise

The warp-level reductions and shared-memory tiling correctly realize the mathematical log-domain Sinkhorn iterations without introducing additional numerical error or divergence beyond the claimed stability.

What would settle it

Running the solver on a dense 8192-by-8192 cost matrix at epsilon=10^{-4} and observing either divergence or a result differing from a reference high-precision log-domain implementation by more than floating-point roundoff would falsify the claim of correct stable realization.

Figures

Figures reproduced from arXiv: 2605.00837 by Hao Xiao.

**Figure 1.** Wall-clock time vs. problem size n = m on log-log scale (ε = 0.01). Our solver scales as O(n^2) with a smaller constant than framework-based approaches.

**Figure 2.** Scaling analysis of FastSinkhorn (ε = 0.01). (a) Computation time scales quadratically with n. (b) Peak GPU memory is dominated by the n^2 cost matrix. (c) Iteration count is approximately independent of n for fixed ε.

**Figure 3.** Ablation study: contribution of each optimization to overall performance. Warp-level …

**Figure 4.** Numerical stability comparison (n = 512). (a) Transport cost as a function of ε. The standard-domain solver produces NaN for ε < 0.005, while our log-domain solver remains stable down to ε = 10^{-4}. (b) Convergence status across (ε, max C) configurations: converged (green), diverged (yellow), NaN (red).

**Figure 5.** Convergence profiles for different ε values and problem sizes. All configurations converge to machine-precision marginal error within a small number of iterations, with smaller ε requiring more iterations.

**Figure 6.** Image color transfer via optimal transport. The source image's warm palette is transformed …

**Figure 7.** 3D point cloud matching. (a) Source and target point clouds. (b) OT correspondences …

Original abstract

Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log-domain, enabling robust computation for regularization parameters as small as epsilon = 10^{-4} where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves 12x speedup over the widely-used POT library and 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large-scale optimal transport computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents FastSinkhorn, a lightweight native CUDA implementation of the log-domain Sinkhorn algorithm for entropic regularized optimal transport. It combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization while maintaining stability for small regularization parameters (down to epsilon=10^{-4}). On dense problems with n=m=8192 the implementation is reported to deliver 12x speedup versus the POT library and 5.9x versus GPU PyTorch baselines while using only 256 MB of memory; validation is provided on color transfer, 3D point-cloud matching, and convergence behavior.

Significance. If the custom kernels produce iteration trajectories numerically equivalent to a standard log-domain reference, the work supplies a practical, high-performance primitive for large-scale OT that could accelerate downstream ML tasks requiring tight regularization. The low memory footprint and concrete speedups against widely used baselines constitute a clear engineering contribution.

major comments (1)

[Section 3, Algorithm 1] Section 3 and Algorithm 1: the two-stage warp-then-block log-sum-exp reduction is described at the level of CUDA intrinsics, but no explicit bound or empirical verification is given for the additional rounding error relative to a fused reference implementation. Because the central stability claim (robustness at epsilon=10^{-4}) rests on numerical equivalence, the absence of a marginal-error or iteration-count comparison against POT on ill-conditioned cost matrices leaves open the possibility that observed stability is problem-dependent rather than guaranteed by the kernel design.
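
The kind of check this comment asks for can be prototyped before touching the GPU. The sketch below is a CPU model under stated assumptions: it reduces 256 values as 8 chunks of 32 ("warps") and then reduces the per-chunk partials ("block"), carrying a (running max, shifted sum) pair through each merge, and compares the result with a single-pass reference. All names are hypothetical, and the merge rule is a standard way to combine partial log-sum-exp states rather than the paper's confirmed kernel logic. Python floats are double precision, so the float32 discrepancy on a real device would be larger than what this prints.

```python
import math
import random

WARP_SIZE = 32

def merge(p, q):
    """Combine two partial states (max, sum of exp shifted by that max)."""
    (m1, s1), (m2, s2) = p, q
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

def tree_reduce(states):
    """Pairwise tree reduction; the number of states must be a power of two."""
    states = list(states)
    while len(states) > 1:
        half = len(states) // 2
        states = [merge(states[i], states[i + half]) for i in range(half)]
    return states[0]

def two_stage_lse(x):
    """Stage 1 reduces each 32-value chunk; stage 2 reduces the partials."""
    assert len(x) % WARP_SIZE == 0
    partials = [tree_reduce([(v, 1.0) for v in x[w:w + WARP_SIZE]])
                for w in range(0, len(x), WARP_SIZE)]
    m, s = tree_reduce(partials)
    return m + math.log(s)

def fused_lse(x):
    """Single-pass reference using a global max and an exactly rounded sum."""
    m = max(x)
    return m + math.log(math.fsum(math.exp(v - m) for v in x))

# Synthetic inputs (hypothetical): costs in [0, 1] scaled by eps = 1e-4.
rng = random.Random(0)
x = [-rng.uniform(0.0, 1.0) / 1e-4 for _ in range(256)]

discrepancy = abs(two_stage_lse(x) - fused_lse(x))
print(discrepancy)
```

Repeating the comparison in single precision on the device, over ill-conditioned cost matrices, would give the empirical bound the comment requests.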

minor comments (1)

[Experiments] The experimental section would benefit from an explicit statement of the GPU model, CUDA version, and exact problem-generation procedure used for the n=8192 timing results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding numerical equivalence and additional rounding error in the two-stage reduction below, and will strengthen the manuscript with the requested empirical verification.

Point-by-point responses

Referee: [Section 3, Algorithm 1] Section 3 and Algorithm 1: the two-stage warp-then-block log-sum-exp reduction is described at the level of CUDA intrinsics, but no explicit bound or empirical verification is given for the additional rounding error relative to a fused reference implementation. Because the central stability claim (robustness at epsilon=10^{-4}) rests on numerical equivalence, the absence of a marginal-error or iteration-count comparison against POT on ill-conditioned cost matrices leaves open the possibility that observed stability is problem-dependent rather than guaranteed by the kernel design.

Authors: We agree that an explicit verification of numerical equivalence is important to fully support the stability claims. The log-domain formulation prevents underflow for small epsilon, but the custom warp-shuffle plus block-level log-sum-exp reduction can introduce small additional floating-point discrepancies relative to a single fused kernel. In the revised manuscript we will add an empirical comparison (new figure or table in Section 4 or an appendix) that reports iteration counts and L1 marginal errors against POT on ill-conditioned cost matrices, including high-condition-number instances generated from clustered Gaussians and noisy distance matrices. We will also quantify the maximum relative difference in the scaling vectors produced by our kernel versus a reference implementation. This will demonstrate that any extra rounding error remains negligible and does not compromise convergence or stability for the epsilon range and problem sizes considered. revision: yes
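
The two quantities named in this response, L1 marginal error and maximum relative difference between scaling vectors, are simple to compute once a plan and two sets of vectors are available. The helpers below are an illustrative sketch with hypothetical names and made-up numbers. They are not the authors' evaluation code.

```python
def l1_marginal_errors(P, a, b):
    """Return (row error, column error): L1 distance of P's marginals from a and b."""
    row_err = sum(abs(sum(row) - ai) for row, ai in zip(P, a))
    col_err = sum(abs(sum(P[i][j] for i in range(len(a))) - b[j])
                  for j in range(len(b)))
    return row_err, col_err

def max_relative_difference(u, v, floor=1e-300):
    """Largest elementwise |u_i - v_i| / max(|u_i|, |v_i|), guarded against 0/0."""
    return max(abs(p - q) / max(abs(p), abs(q), floor) for p, q in zip(u, v))

# Made-up 2-by-2 plan whose marginals match a and b up to roundoff.
P = [[0.2, 0.3],
     [0.1, 0.4]]
a = [0.5, 0.5]
b = [0.3, 0.7]
row_err, col_err = l1_marginal_errors(P, a, b)

# Made-up scaling vectors from a "kernel" and a "reference" run.
kernel_u = [1.0, 2.002]
reference_u = [1.0, 2.0]
rel_diff = max_relative_difference(kernel_u, reference_u)

print(row_err, col_err, rel_diff)
```

In a real comparison, `P` and the scaling vectors would come from the CUDA solver and from a reference such as POT run on the same cost matrix.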

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmarks

The manuscript presents an engineering implementation of log-domain Sinkhorn using warp-level shuffles and shared-memory tiling. All reported speedups (12x vs POT, 5.9x vs PyTorch) are direct wall-clock measurements on fixed problem sizes against independent external libraries. No mathematical derivation chain exists; there are no fitted parameters renamed as predictions, no self-definitional equations, and no load-bearing self-citations. Numerical stability is asserted from the standard log-domain formulation rather than derived from the paper's own kernels. The central claims are therefore falsifiable by re-running the same benchmarks on the cited baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, new mathematical axioms, or invented entities are introduced; the work relies on standard GPU hardware assumptions and the existing Sinkhorn algorithm.

axioms (1)

Domain assumption: NVIDIA warp shuffle instructions and shared-memory behavior match vendor documentation.
Implementation correctness depends on these low-level GPU features operating as specified.
