pith. sign in

arxiv: 2602.03067 · v3 · pith:TZRN6NGVnew · submitted 2026-02-03 · 💻 cs.LG · cs.AI· cs.NA· math.NA

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

Pith reviewed 2026-05-22 11:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NA
keywords entropic optimal transportSinkhorn algorithmGPU accelerationIO-aware kernelsTritonFlashAttentionpoint cloud processingtransport plan
0
0 comments X

The pith

FlashSinkhorn rewrites Sinkhorn iterations as fused row-wise LogSumExp reductions on biased dot products to cut GPU memory traffic for entropic optimal transport.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Entropic optimal transport via Sinkhorn iterations sees wide use in machine learning, yet current GPU implementations waste time moving large matrices between slow and fast memory. The paper demonstrates that for squared Euclidean costs the stabilized log-domain updates can be recast as row-wise LogSumExp operations on biased dot-product scores, exactly the pattern used in transformer attention. This reformulation permits FlashAttention-style tiling and fusion inside custom Triton kernels that keep each tile in on-chip SRAM, update the dual potentials in a single pass, and therefore reduce high-bandwidth memory traffic per iteration. Linear memory usage is preserved while the transport plan itself can be applied with additional streaming kernels. Experiments on A100 GPUs report up to 32 times faster forward passes and 161 times faster end-to-end optimization on point-cloud tasks.

Core claim

Stabilized Sinkhorn updates for squared Euclidean costs can be expressed as row-wise LogSumExp reductions of biased dot-product scores, enabling FlashAttention-style fused tiling in Triton kernels that stream tiles through SRAM, update dual potentials in one pass, and substantially lower HBM IO per iteration while retaining linear memory and numerical stability.

What carries the argument

Fused Triton kernels that perform row-wise LogSumExp reductions on biased dot-product scores with FlashAttention-style tiling and SRAM streaming to update dual potentials in a single pass.

If this is right

  • Up to 32 times faster forward passes than state-of-the-art online baselines on A100 GPUs for point-cloud optimal transport.
  • Up to 161 times faster end-to-end optimization when the transport plan is used inside first- or second-order methods.
  • Improved scalability for downstream tasks that repeatedly solve optimal transport problems.
  • Linear-memory streaming kernels for applying the transport plan without materializing dense matrices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same IO-fusion pattern could be applied to other iterative algorithms whose inner steps resemble attention or matrix-normalization operations.
  • Extending the kernels to additional cost functions would require only a change in the score computation while keeping the LogSumExp tiling intact.
  • Integration into larger machine-learning pipelines could reduce the wall-clock cost of using entropic OT as a regularizer or loss term.

Load-bearing premise

The cost matrix must be squared Euclidean distance so the kernel becomes a biased dot product, and the chosen tiling must preserve numerical stability of the log-domain updates without extra safeguards.

What would settle it

Measure forward-pass time and solution accuracy on a non-Euclidean cost matrix using the same fused kernels and check whether speedups vanish or accuracy degrades relative to a standard implementation.

Figures

Figures reproduced from arXiv: 2602.03067 by An Yu, Davis Wertheimer, Felix X.-F. Ye, Linsong Chu, Ming-Ching Chang, Xingjie Li.

Figure 1
Figure 1. Figure 1: FlashSinkhorn kernels via IO-aware tiling. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: FlashSinkhorn performance benchmarks. (Top) Forward and forward+backward timing versus n at d=64 and versus d at n=20k. (Bottom left) Memory scaling at d=1024: FlashSinkhorn maintains O(n) while tensorized methods scale as O(n 1.7–n 1.9 ). (Bottom right) HVP timing; FlashSinkhorn alone scales to n=50k and d≥256. A100-80GB, ε=0.1. Speed. FlashSinkhorn is the fastest method in our GPU baselines (Section 4.1)… view at source ↗
Figure 3
Figure 3. Figure 3: Scaling OTDD to large datasets (MNIST↔Fashion-MNIST, d=512). FlashSinkhorn vs tensorized baseline for (a,b) time and (c,d) memory on OTDD distance computation and gradient flow. 0 100 200 300 400 Iteration 10 0 2 × 10 0 3 × 10 0 4 × 10 0 L o s s (a) Loss: Adam vs Newton Adam Adam (cont.) Newton Escape 0 100 200 Iteration 10 2 10 1 10 0 (b) Gradient Norm Adam Newton Adam (cont.) Escape 0 50 100 150 200 Iter… view at source ↗
Figure 4
Figure 4. Figure 4: Saddle escape trajectory comparison: Adam vs Newton n = 40000, ε = 0.1 (a) Loss: Adam decay slowly in saddle region; post-escape Newton converges in 7 steps vs 90 for Adam continuation. (b) Gradient norm: Adam stalls near 0.1; Newton breaks through rapidly. (c) Minimum Hessian eigenvalue via Lanczos; negative←positive transition at step 200 triggers Newton switch. (d) Wall time: Adam+Newton (2423s) vs Adam… view at source ↗
Figure 5
Figure 5. Figure 5: FlashSinkhorn HVP peak memory vs. problem size ( [PITH_FULL_IMAGE:figures/full_fig_p037_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: No-label Sinkhorn divergence benchmark. Comparison of FlashSinkhorn against GeomLoss with KeOps (online) and Tensorized backends on MNIST (d = 784, debiased Sinkhorn, ε = 0.1). (a,b) Forward pass: FlashSinkhorn matches Tensorized speed while using 38× less mem￾ory. KeOps is 14-26× slower. (c,d) Gradient flow (forward + backward per step): FlashSinkhorn scales to n = 60000 while KeOps and Tensorized run out… view at source ↗
Figure 7
Figure 7. Figure 7: Multi-saddle escape in shuffled regression (n=40,000, ε=0.25, seed=0). (a) loss, (b) gradient norm, (c) estimated λmin(H). Green dashed lines mark escapes (λmin>0.001; switch to Newton) and orange dotted lines mark re-entries (λmin<0.001; fallback to Adam). Overall: 3 escapes, 2 re-entries; loss 3.76→1.77. 2. Steps 106-110 (Newton): Newton’s large steps cross into an adjacent shallow saddle (λmin = −0.0027… view at source ↗
read the original abstract

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present \textbf{FlashSinkhorn}, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FlashSinkhorn, an IO-aware GPU implementation of entropic optimal transport via Sinkhorn iterations for squared Euclidean costs. It rewrites the stabilized log-domain updates as row-wise LogSumExp reductions over biased dot-product scores (analogous to transformer attention), enabling FlashAttention-style fusion and tiling in Triton kernels that stream tiles through SRAM to reduce HBM traffic while keeping linear memory. Reported results include up to 32× forward-pass and 161× end-to-end speedups on A100 GPUs versus state-of-the-art online baselines for point-cloud OT, plus improved scalability for downstream OT tasks. The authors release open-source code.

Significance. If numerical equivalence to reference Sinkhorn is preserved, the work offers a practical advance in scaling OT computations in machine learning by cutting memory IO without sacrificing the core algorithm. The release of reproducible open-source Triton kernels is a clear strength that supports verification and adoption.

major comments (2)
  1. [§3] §3: The fused Triton kernels are stated to implement the stabilized log-domain Sinkhorn recurrence via per-tile LogSumExp on biased dot products, but the manuscript provides neither a proof of equivalence nor quantitative accuracy tables (e.g., marginal violation or OT-cost error versus POT baseline) at the largest reported problem sizes. This is load-bearing for the speedup claims, because any drift in the fixed-point dual potentials would render the performance numbers incomparable to the baselines.
  2. [Abstract] Abstract and experimental section: Speedup figures are presented without error bars, without ablation on numerical precision across tile sizes, and without explicit verification that the chosen fusion and max-subtraction strategy preserves stability under the reported scales.
minor comments (2)
  1. Clarify the precise scope of the squared-Euclidean assumption and whether the kernel design extends to other cost functions.
  2. Add a short table or figure panel reporting marginal violation and OT-cost error versus a reference implementation across the range of n and m tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of numerical validation. We agree that stronger evidence of equivalence and stability is needed to support the speedup claims and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3: The fused Triton kernels are stated to implement the stabilized log-domain Sinkhorn recurrence via per-tile LogSumExp on biased dot products, but the manuscript provides neither a proof of equivalence nor quantitative accuracy tables (e.g., marginal violation or OT-cost error versus POT baseline) at the largest reported problem sizes. This is load-bearing for the speedup claims, because any drift in the fixed-point dual potentials would render the performance numbers incomparable to the baselines.

    Authors: We acknowledge that the current manuscript lacks an explicit proof and quantitative accuracy tables at the largest scales. The per-tile LogSumExp with max-subtraction and bias is mathematically equivalent to the standard stabilized log-domain Sinkhorn because it performs identical row-wise normalization on the same biased scores. In the revised version we will add a short proof of equivalence in §3 and a new table in the experiments reporting marginal violation (L1 error on marginals) and OT-cost error versus the POT baseline for all reported problem sizes up to the largest n. These additions will confirm that dual potentials match within floating-point tolerance and that the reported speedups are directly comparable. revision: yes

  2. Referee: [Abstract] Abstract and experimental section: Speedup figures are presented without error bars, without ablation on numerical precision across tile sizes, and without explicit verification that the chosen fusion and max-subtraction strategy preserves stability under the reported scales.

    Authors: We agree that error bars, precision ablations, and explicit stability checks are missing. We will update the abstract and experimental sections to report mean speedups with standard deviation over five independent runs. We will also add an ablation subsection varying tile sizes (e.g., 32×32 to 128×128) and precision (FP32 vs. FP16), together with convergence plots showing that the max-subtraction strategy reaches the same fixed-point dual potentials as the reference implementation at all reported scales. These changes will directly address concerns about numerical reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation with external validation

full rationale

The paper describes an engineering contribution that rewrites existing stabilized Sinkhorn iterations into a FlashAttention-compatible form to enable kernel fusion and reduced HBM traffic. All performance numbers are measured speedups against independent external baselines (POT, online Sinkhorn implementations) on concrete hardware. No first-principles derivation, fitted parameter, or self-citation chain is invoked to produce the claimed results; the mathematical equivalence between log-domain Sinkhorn and row-wise LogSumExp on biased dot products is a standard algebraic identity, not a reduction to the paper's own outputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new free parameters, axioms, or invented entities. It relies on the standard mathematical properties of the Sinkhorn algorithm in log domain and the correctness of the Triton compiler for the fused kernels.

axioms (1)
  • standard math Stabilized log-domain Sinkhorn iterations converge to the same fixed point as the original multiplicative updates for positive costs.
    Invoked implicitly when the authors state that the rewritten LogSumExp form retains the original algorithm.

pith-pipeline@v0.9.0 · 5775 in / 1376 out tokens · 35526 ms · 2026-05-22T11:26:00.714713+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

    cs.LG 2026-05 conditional novelty 7.0

    ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inf...

  2. Spherical Harmonic Optimal Transport: Application to Climate Models Comparisons

    cs.LG 2026-05 unverdicted novelty 6.0

    Heat kernel Sinkhorn algorithm on the 2-sphere converges to OT cost with O(n) memory and O(n^{3/2}) time per iteration, retaining geometric properties and applied to climate model evaluation.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    Stochastic EM for Shuffled Linear Regression

    Abubakar Abid and James Zou. Stochastic em for shuffled linear regression.arXiv preprint arXiv:1804.00681, 2018

  2. [2]

    Near-linear time approxima- tion algorithms for optimal transport via sinkhorn iteration.Advances in neural information processing systems, 30, 2017

    Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approxima- tion algorithms for optimal transport via sinkhorn iteration.Advances in neural information processing systems, 30, 2017

  3. [3]

    Geometric dataset distances via optimal transport

    David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428–21439, 2020

  4. [4]

    Dataset dynamics via gradient flows in probability space

    David Alvarez-Melis and Nicol` o Fusi. Dataset dynamics via gradient flows in probability space. InInternational conference on machine learning, pages 219–230. PMLR, 2021

  5. [5]

    Wasserstein generative adversarial networks

    Martin Arjovsky, Soumith Chintala, and L´ eon Bottou. Wasserstein generative adversarial networks. InInternational Conference on Machine Learning, pages 214–223, 2017

  6. [6]

    Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.Advances in neural information processing systems, 27, 2014

    Austin R Benson, Jason D Lee, Bartek Rajwa, and David F Gleich. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.Advances in neural information processing systems, 27, 2014. 13

  7. [7]

    Nonlinear programming.Journal of the Operational Research Society, 48(3):334–334, 1997

    Dimitri P Bertsekas. Nonlinear programming.Journal of the Operational Research Society, 48(3):334–334, 1997

  8. [8]

    A Sinkhorn-Newton method for entropic optimal transport

    Christoph Brauer, Christian Clason, Dirk Lorenz, and Benedikt Wirth. A sinkhorn-newton method for entropic optimal transport.arXiv preprint arXiv:1710.06635, 2017

  9. [9]

    Optimal transport for single-cell and spatial omics.Nature Reviews Methods Primers, 4(1):58, 2024

    Charlotte Bunne, Geoffrey Schiebinger, Andreas Krause, Aviv Regev, and Marco Cuturi. Optimal transport for single-cell and spatial omics.Nature Reviews Methods Primers, 4(1):58, 2024

  10. [10]

    Kernel operations on the gpu, with autodiff, without memory overflows.Journal of Machine Learning Research, 22(74):1–6, 2021

    Benjamin Charlier, Jean Feydy, Joan Alexis Glaunes, Fran¸ cois-David Collin, and Ghislain Durif. Kernel operations on the gpu, with autodiff, without memory overflows.Journal of Machine Learning Research, 22(74):1–6, 2021

  11. [11]

    Sinkhorn distances: Lightspeed computation of optimal transport

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InAdvances in Neural Information Processing Systems, volume 26, 2013

  12. [12]

    Fast computation of Wasserstein barycenters

    Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. InInterna- tional Conference on Machine Learning, pages 685–693, 2014

  13. [13]

    Optimal transport tools (OTT): A JAX toolbox for all things Wasserstein

    Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. Optimal transport tools (OTT): A JAX toolbox for all things Wasserstein. arXiv preprint arXiv:2201.12324, 2022

  14. [14]

    Escaping saddles with stochastic gradients

    Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. InInternational Conference on Machine Learning, pages 1155–1164. PMLR, 2018

  15. [15]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  16. [16]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R´ e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022

  17. [17]

    The mnist database of handwritten digit images for machine learning research [best of the web].IEEE signal processing magazine, 29(6):141–142, 2012

    Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web].IEEE signal processing magazine, 29(6):141–142, 2012

  18. [18]

    A unified framework for implicit Sinkhorn differentiation

    Marvin Eisenberger, Aysim Toker, Laura Leal-Taix´ e, and Daniel Cremers. A unified framework for implicit Sinkhorn differentiation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 509–518, 2022

  19. [19]

    Geometric loss functions between sampled measures, images and volumes.URL https://www

    Jean Feydy. Geometric loss functions between sampled measures, images and volumes.URL https://www. kernel-operations. io/geomloss, 2019

  20. [20]

    Geometric data analysis, beyond convolutions.Applied Mathematics, 3, 2020

    Jean Feydy. Geometric data analysis, beyond convolutions.Applied Mathematics, 3, 2020

  21. [21]

    Interpolating between optimal transport and MMD using Sinkhorn divergences

    Jean Feydy, Thibault S´ ejourn´ e, Fran¸ cois-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyr´ e. Interpolating between optimal transport and MMD using Sinkhorn divergences. InInternational Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019

  22. [22]

    POT: Python optimal transport.Journal of Machine Learning Research, 22(78):1–8, 2021

    R´ emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aur´ elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. POT: Python optimal transport.Journal of Machine Learning Research, 22(78):1–8, 2021. 14

  23. [23]

    Escaping from saddle points—online stochastic gradient for tensor decomposition

    Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. InConference on learning theory, pages 797–

  24. [24]

    JHU press, 2013

    Gene H Golub and Charles F Van Loan.Matrix computations. JHU press, 2013

  25. [25]

    How to escape saddle points efficiently

    Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. InInternational conference on machine learning, pages 1724–1732. PMLR, 2017

  26. [26]

    Unsupervised domain adaptation within deep foundation latent spaces.arXiv preprint arXiv:2402.14976, 2024

    Dmitry Kangin and Plamen Angelov. Unsupervised domain adaptation within deep foundation latent spaces.arXiv preprint arXiv:2402.14976, 2024

  27. [27]

    A truncated new- ton method for optimal transport

    Mete Kemertas, Amir-massoud Farahmand, and Allan Douglas Jepson. A truncated new- ton method for optimal transport. InThe Thirteenth International Conference on Learning Representations, 2025

  28. [28]

    Coreset selection for the sinkhorn divergence and generic smooth divergences.arXiv preprint arXiv:2504.20194, 2025

    Alex Kokot and Alex Luedtke. Coreset selection for the sinkhorn divergence and generic smooth divergences.arXiv preprint arXiv:2504.20194, 2025

  29. [29]

    Tritonforge: Profiling-guided framework for automated triton kernel optimization.arXiv preprint arXiv:2512.09196, 2025

    Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, and Zhiyun Qian. Tritonforge: Profiling-guided framework for automated triton kernel optimization.arXiv preprint arXiv:2512.09196, 2025

  30. [30]

    Robust first-and second-order differentiation for regularized optimal transport.SIAM Journal on Scientific Computing, 47(3):C630–C654, 2025

    Xingjie Li, Fei Lu, Molei Tao, and Felix X-F Ye. Robust first-and second-order differentiation for regularized optimal transport.SIAM Journal on Scientific Computing, 47(3):C630–C654, 2025

  31. [31]

    Bringing regularized optimal transport to lightspeed: a splitting method adapted for gpus.Advances in Neural Information Processing Systems, 36:26845–26871, 2023

    Jacob Lindb¨ ack, Zesen Wang, and Mikael Johansson. Bringing regularized optimal transport to lightspeed: a splitting method adapted for gpus.Advances in Neural Information Processing Systems, 36:26845–26871, 2023

  32. [32]

    Scaled-dot-product attention as one-sided entropic optimal transport.arXiv preprint arXiv:2508.08369, 2025

    Elon Litman. Scaled-dot-product attention as one-sided entropic optimal transport.arXiv preprint arXiv:2508.08369, 2025

  33. [33]

    You need better attention priors.arXiv preprint arXiv:2601.15380, 2026

    Elon Litman and Gabe Guo. You need better attention priors.arXiv preprint arXiv:2601.15380, 2026

  34. [34]

    Deep learning via hessian-free optimization

    James Martens et al. Deep learning via hessian-free optimization. InIcml, volume 27, pages 735–742, 2010

  35. [35]

    Learning latent permu- tations with gumbel-sinkhorn networks

    Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permu- tations with gumbel-sinkhorn networks. InInternational Conference on Learning Representa- tions, 2018

  36. [36]

    Online sinkhorn: Optimal transport distances from sample streams.Advances in Neural Information Processing Systems, 33:1657–1667, 2020

    Arthur Mensch and Gabriel Peyr´ e. Online sinkhorn: Optimal transport distances from sample streams.Advances in Neural Information Processing Systems, 33:1657–1667, 2020

  37. [37]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax.arXiv preprint arXiv:1805.02867, 2018

  38. [38]

    Computational optimal transport.Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019

    Gabriel Peyr´ e and Marco Cuturi. Computational optimal transport.Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019. 15

  39. [39]

    Entropic estimation of optimal trans- port maps.arXiv preprint arXiv:2109.12004, 2021

    Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal trans- port maps.arXiv preprint arXiv:2109.12004, 2021

  40. [40]

    The anatomy of a triton attention kernel.arXiv preprint arXiv:2511.11581, 2025

    Burkhard Ringlein, Jan van Lunteren, Radu Stoica, and Thomas Parnell. The anatomy of a triton attention kernel.arXiv preprint arXiv:2511.11581, 2025

  41. [41]

    Sinkformers: Trans- formers with doubly stochastic attention

    Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyr´ e. Sinkformers: Trans- formers with doubly stochastic attention. InInternational Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022

  42. [42]

    Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming.Cell, 176(4):928–943, 2019

    Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berber, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming.Cell, 176(4):928–943, 2019

  43. [43]

    Stabilized sparse scaling algorithms for entropy regularized transport problems.SIAM Journal on Scientific Computing, 41(3):A1443–A1481, 2019

    Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems.SIAM Journal on Scientific Computing, 41(3):A1443–A1481, 2019

  44. [44]

    Sinkhorn natural gradient for generative models.Advances in Neural Information Processing Systems, 33:1646–1656, 2020

    Zebang Shen, Zhenfu Wang, Alejandro Ribeiro, and Hamed Hassani. Sinkhorn natural gradient for generative models.Advances in Neural Information Processing Systems, 33:1646–1656, 2020

  45. [45]

    Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains.ACM Transactions on Graphics, 34(4):1–11, 2015

    Justin Solomon, Fernando De Goes, Gabriel Peyr´ e, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains.ACM Transactions on Graphics, 34(4):1–11, 2015

  46. [46]

    Safe and sparse newton method for entropic-regularized optimal transport.Advances in Neural Information Processing Systems, 37:129914–129943, 2024

    Zihao Tang and Yixuan Qiu. Safe and sparse newton method for entropic-regularized optimal transport.Advances in Neural Information Processing Systems, 37:129914–129943, 2024

  47. [47]

    Sparse sinkhorn attention

    Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. InInternational conference on machine learning, pages 9438–9447. PMLR, 2020

  48. [48]

    Numerical optimization.Springer Science, 35(67-68):7, 1999

    Stephen Wright, Jorge Nocedal, et al. Numerical optimization.Springer Science, 35(67-68):7, 1999

  49. [49]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.arXiv preprint arXiv:1708.07747, 2017

  50. [50]

    A hypergradient approach to robust regression without correspondence.arXiv preprint arXiv:2012.00123, 2020

    Yujia Xie, Yixiu Mao, Simiao Zuo, Hongteng Xu, Xiaojing Ye, Tuo Zhao, and Hongyuan Zha. A hypergradient approach to robust regression without correspondence.arXiv preprint arXiv:2012.00123, 2020

  51. [51]

    Hessian-guided perturbed wasserstein gradient flows for escaping saddle points.arXiv preprint arXiv:2509.16974, 2025

    Naoya Yamamoto, Juno Kim, and Taiji Suzuki. Hessian-guided perturbed wasserstein gradient flows for escaping saddle points.arXiv preprint arXiv:2509.16974, 2025. 16 A Notation Table 4: Notation used throughout the paper. Symbol Meaning Discrete measures and cost X= [x 1;. . .;x n]∈R n×d Source points (rows),x i ∈R d. Y= [y 1;. . .;y m]∈R m×d Target points...

  52. [52]

    All methods use regularizationε= 0.1 and 10 Sinkhorn iterations for forward/backward benchmarks, or 100 Sinkhorn iterations for HVP benchmarks

    Points are sampled uniformly from [0,1] d withn∈ [5k,50k],d∈[4,1024] for forward/backward benchmarks andd∈[4,512] for HVP examples. All methods use regularizationε= 0.1 and 10 Sinkhorn iterations for forward/backward benchmarks, or 100 Sinkhorn iterations for HVP benchmarks. For HVP, the conjugate gradient solver uses K= 50 fixed iterations with dampingτ=...

  53. [53]

    is not benchmarked here since our evaluation targets GPU implementations. GeomLoss also providesbackend=’multiscale’, a fast octree-style multiscale routine intended for very large 31 Table 5: NCU profiling of the forward pass (n=m=10000,d=64, 10 Sinkhorn iterations, A100- 80GB).Note:Working set (5 MB) fits in A100 L2 cache (40 MB); HBM traffic reflects L...

  54. [54]

    Ground truth transformation:W ∗ ∈R 5×5 with entriesW ∗ ij ∼ N(0,1/5)

  55. [55]

    Clean targets:Y clean =XW ∗

  56. [56]

    Noisy targets:Y=Y clean +E, whereE ij ∼ N(0, σ 2) withσ= 0.05·std(Y clean)

  57. [57]

    Shuffled observations: eY= Π ∗(Y) for unknown permutation Π ∗. The optimization objective is min W L(W), where L(W) = OT ε   1 n nX i=1 δyi, 1 n nX j=1 δeyj   = min P∈Π( 1 n 1 n, 1 n 1 n) ⟨C(W), P⟩+εKL P∥ 1 n n ⊗ 1 n n wherey i =x iWandC ij(W) =∥y i −eyj∥2 2 Optimizer ConfigurationWe test three regularization strengthsε∈0.1,0.25,0.5. The solver uses e...