Linear Convergence of Gradient Descent for Quadratically Regularized Optimal Transport

Alberto Gonz\'alez-Sanz; Andr\'es Riveros Valdevenito; Marcel Nutz

arxiv: 2509.08547 · v3 · pith:SXCF3O3Vnew · submitted 2025-09-10 · 🧮 math.OC · math.AP· math.FA· math.PR

Linear Convergence of Gradient Descent for Quadratically Regularized Optimal Transport

Alberto Gonz\'alez-Sanz , Marcel Nutz , Andr\'es Riveros Valdevenito This is my paper

Pith reviewed 2026-05-21 22:47 UTC · model grok-4.3

classification 🧮 math.OC math.APmath.FAmath.PR

keywords optimal transportquadratic regularizationgradient descentlinear convergencedual problemspectral analysiscontinuous setting

0 comments

The pith

Gradient descent on the dual of quadratically regularized optimal transport converges linearly in L2 after a burn-in period.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent applied to the dual formulation of quadratically regularized optimal transport achieves linear convergence measured in the L2 norm. This means the distance from the current potentials to the optimal ones shrinks exponentially fast. The proof proceeds by linearizing the iteration around the solution and showing via spectral analysis that the resulting operator is a contraction mapping, with the nonlinear dynamics following suit after some initial steps. This result is significant because the dual problem lacks strong convexity, making standard convergence arguments unavailable, yet practical algorithms still succeed.

Core claim

We establish linear convergence in L², that is, the L² distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.

What carries the argument

The linearized gradient descent operator at the optimum, whose spectral radius less than one establishes it as a strict contraction in continuous function space.

If this is right

The dual gradient descent iteration converges exponentially fast in the L2 norm to the limiting potentials.
The nonlinear dynamics inherit the contraction property of the linearized operator after a finite number of steps.
This linear rate applies in a continuous function-space setting without relying on strong convexity of the dual objective.
The result supplies a convergence guarantee for computing quadratically regularized transport plans via gradient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectral contraction may hold for other dual problems that are convex but not strongly convex, suggesting a general technique for proving linear rates.
Numerical verification of the spectral radius for specific cost functions could serve as a practical check before running the algorithm.
The burn-in period length might be bounded in terms of problem parameters, allowing explicit error estimates from the start.

Load-bearing premise

The spectral radius of the linearized gradient descent operator at the optimum is strictly less than one in the continuous setting.

What would settle it

A numerical test on concrete measures and costs where the ratio of successive L2 errors between iterates fails to approach a fixed value strictly below one after the burn-in phase.

Figures

Figures reproduced from arXiv: 2509.08547 by Alberto Gonz\'alez-Sanz, Andr\'es Riveros Valdevenito, Marcel Nutz.

**Figure 1.** Figure 1: Transporting P = N(0, 1) to Q = N(1, 1); the measures are truncated to [-3,3] and [-2,4], respectively. Left: Dual solutions of QOTϵ and OT. Right: Supports of the optimal couplings are sparse and converge to the optimal transport map. approaches have been considered, mostly targeting the dual problem of (5), sup f,g:Rd→R Z f(x) + g(y) − 1 2ε f(x) + g(y) − 1 2 ∥x − y∥ 2 2 + d(P ⊗ Q)(x, y) (6) or its fir… view at source ↗

**Figure 2.** Figure 2: Gradient descent algorithm for three pairs of marginals: (top) [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗

**Figure 3.** Figure 3: Repeating the first experiment (top row of Figure [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

read the original abstract

In optimal transport, quadratic regularization is an alternative to entropic regularization when sparse couplings or small regularization parameters are desired. Quadratic regularization penalizes transport couplings by the squared $L^2$ norm of their density, or equivalently by the $\chi^2$ divergence. While a number of computational approaches have been shown to work in practice, the dual problem is not strongly convex and theoretical convergence results are scarce. We focus on the dual gradient descent algorithm in a continuous setting and establish linear convergence in $L^2$, that is, the $L^2$ distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proves linear L2 convergence for dual gradient descent on quadratically regularized OT via spectral analysis of the linearized operator, filling a gap left by entropic results, though the contraction hinges on a spectral radius claim that needs verification in general continuous settings.

read the letter

The main thing to know is that this paper establishes linear convergence in L2 for gradient descent on the dual of quadratically regularized optimal transport in continuous spaces. They linearize the iteration at the optimum, analyze the spectrum of that operator, and conclude it is a strict contraction so the nonlinear sequence inherits an exponential rate after a burn-in period. This is new relative to the entropic literature cited in the abstract, where quadratic penalties produce different operators and favor sparsity for small regularization parameters. The work does well by targeting a setting that matters in practice when entropic smoothing is too diffuse, and the spectral approach is a direct way to handle the lack of strong convexity in the dual. The argument stays focused on the continuous dual without jumping to discretizations first. The soft spot is the central spectral claim. The abstract asserts that the linearized gradient descent operator has spectral radius strictly less than one, which makes it a contraction in the function space. In infinite dimensions this is not automatic for arbitrary costs and marginals, and it is not obvious whether the radius stays below one uniformly or only under extra conditions such as interior positivity or compactness away from boundaries. If the proof requires hidden restrictions or if the radius can reach or exceed one for some data, the linear rate and the controlled burn-in would not follow in full generality. The paper is for researchers working on convergence theory for regularized optimal transport and related dual optimization methods. A reader who cares about rates for non-strongly convex problems or about alternatives to entropy would find the spectral technique useful. It deserves a serious referee to check the function-space assumptions and the explicit control on the spectral radius, even if revisions are needed to clarify the scope.

Referee Report

2 major / 1 minor

Summary. The manuscript establishes linear convergence in L² for the dual gradient-descent iterates of quadratically regularized optimal transport. The central argument proceeds by linearizing the gradient-descent map at the dual optimum, showing via spectral analysis that the resulting linear operator is a strict contraction on L², and then arguing that the nonlinear iteration inherits the linear rate after a finite burn-in period.

Significance. If the spectral-radius claim holds with a contraction constant independent of discretization and under standard assumptions on the cost and marginals, the result supplies the first rigorous linear-rate guarantee for gradient descent on the dual of quadratically regularized OT. The direct spectral approach (rather than any self-referential or fitted quantity) is a methodological strength that could extend to other non-strongly-convex regularizations.

major comments (2)

[proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.
[statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.

minor comments (1)

Notation for the linearized operator and its spectrum should be introduced with an explicit equation number at first use to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below with clarifications and indicate the revisions we will make.

read point-by-point responses

Referee: [proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.

Authors: We appreciate the referee drawing attention to the level of detail in the spectral analysis. The manuscript establishes the strict contraction by analyzing the eigenvalues of the linearized operator, which arise from the composition of multiplication by the cost function and the orthogonal projections enforcing the marginal constraints; under the standing assumptions of continuous cost and marginals on a compact domain, this yields a spectral radius strictly less than one for all step sizes below an explicit threshold depending on the sup-norms of the cost and marginal densities. To address the request for greater explicitness, we will revise the spectral-analysis section to include a dedicated lemma that states the precise step-size restriction and the resulting bound on the spectral radius in terms of these quantities. The inheritance by the nonlinear map then follows from standard local contraction arguments in Banach space. revision: yes
Referee: [statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.

Authors: The burn-in period is finite because the gradient map is continuously differentiable and the linearized operator is a contraction at the fixed point, so the iterates eventually enter a neighborhood in which the linear rate governs the behavior. We agree that uniformity with respect to discretization is desirable for practical relevance. The contraction constant itself is independent of discretization in the continuous setting; we will add a remark to the statement of the main theorem clarifying that the same constant carries over to consistent discretizations via operator-norm convergence of the discrete linearized operators to the continuous one. This preserves the linear rate while making the dependence on discretization explicit. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via direct spectral analysis

full rationale

The paper derives linear L² convergence of gradient descent for quadratically regularized OT by performing a spectral analysis of the linearized operator at the dual optimum and establishing that this operator is a strict contraction (with the nonlinear iteration inheriting the property after burn-in). This rests on explicit operator properties in the continuous function-space setting rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. No step reduces the claimed result to its own inputs by construction; the central argument supplies independent mathematical content through the spectrum bound.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the existence of an optimum in suitable Sobolev or L2 spaces and on the well-posedness of the linearized operator; these are standard domain assumptions rather than new postulates.

axioms (2)

domain assumption The dual problem admits a unique optimum in the chosen function space.
Required for the limiting potentials to exist and for linearization around that point.
domain assumption The linearized gradient descent operator is well-defined and bounded on L2.
Invoked to perform the spectral analysis.

pith-pipeline@v0.9.0 · 5684 in / 1242 out tokens · 39189 ms · 2026-05-21T22:47:45.303171+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Bayraktar, S

E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regu- larized optimal transport.Bernoulli, 31(1):213–239, 2025

work page 2025
[2]

Blondel, V

M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. volume 84 of Proceedings of Machine Learning Research, pages 880–889, 2018

work page 2018
[3]

H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011. 22

work page 2011
[4]

G. Carlier. On the linear convergence of the multimarginal Sinkhorn algorithm.SIAM J. Optim., 32(2):786–794, 2022

work page 2022
[5]

Chizat, A

L. Chizat, A. Delalande, and T. Vaškevičius. Sharper exponential convergence rates for Sinkhorn’s algorithm in continuous settings.Math. Program., 2025

work page 2025
[6]

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors,Advances in Neural Infor- mation Processing Systems, volume 26. Curran Associates, Inc., 2013

work page 2013
[7]

Eckstein and M

S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks.Appl. Math. Optim., 83(2):639–667, 2021

work page 2021
[8]

Eckstein and M

S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024

work page 2024
[9]

Essid and J

M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs.SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018

work page 2018
[10]

Franklin and J

J. Franklin and J. Lorenz. On the scaling of multidimensional matrices.Linear Algebra Appl., 114/115:717–735, 1989

work page 1989
[11]

Garriz-Molina, A

A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadrat- ically regularized optimal transport and its relation with the porous medium equation. arXiv:2407.21528, 2024

work page arXiv 2024
[12]

Genevay, M

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016

work page 2016
[13]

González-Sanz, E

A. González-Sanz, E. del Barrio, and M. Nutz. Sample complexity of quadratically regularized optimal transport. Forthcoming, 2025

work page 2025
[14]

González-Sanz, S

A. González-Sanz, S. Eckstein, and M. Nutz. Sparse regularized optimal transport without curse of dimensionality, 2025

work page 2025
[15]

González-Sanz and M

A. González-Sanz and M. Nutz. Sparsity of quadratically regularized optimal transport: Scalar case. arXiv:2410.03353, 2024

work page arXiv 2024
[16]

Gulrajani, F

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. InProceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017

work page 2017
[17]

N. Lahn, D. Mulchandani, and S. Raghvendra. A graph theoretic additive approximation of optimal transport. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13831–13841. Curran Associates, Inc., 2019

work page 2019
[18]

L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycen- ters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020

work page 2020
[19]

Lorenz and H

D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022

work page 2022
[20]

D. A. Lorenz and H. Mahler. Orlicz-space regularization for optimal transport and algorithms for quadratic regularization.arXiv:1909.06082, 2019

work page arXiv 1909
[21]

D. A. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021

work page 1919
[22]

Muzellec, R

B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. 23

work page 2017
[23]

Nesterov

Y. Nesterov. Introductory lectures on convex optimization, volume 87 ofApplied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course

work page 2004
[24]

M. Nutz. Quadratically regularized optimal transport: existence and multiplicity of potentials. SIAM J. Math. Anal., 57(3):2622–2649, 2025

work page 2025
[25]

Peyré and M

G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019

work page 2019
[26]

Schmitzer

B. Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput., 41(3):A1443–A1481, 2019

work page 2019
[27]

Seguy, B

V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. InInternational Conference on Learning Represen- tations, 2018

work page 2018
[28]

Wiesel and X

J. Wiesel and X. Xu. Sparsity of quadratically regularized optimal transport: Bounds on con- centration and bias.Preprint arXiv:2410.03425v1, 2024

work page arXiv 2024
[29]

Zhang, G

S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised optimal transport.arXiv:2307.09816, 2023. 24

work page arXiv 2023

[1] [1]

Bayraktar, S

E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regu- larized optimal transport.Bernoulli, 31(1):213–239, 2025

work page 2025

[2] [2]

Blondel, V

M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. volume 84 of Proceedings of Machine Learning Research, pages 880–889, 2018

work page 2018

[3] [3]

H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011. 22

work page 2011

[4] [4]

G. Carlier. On the linear convergence of the multimarginal Sinkhorn algorithm.SIAM J. Optim., 32(2):786–794, 2022

work page 2022

[5] [5]

Chizat, A

L. Chizat, A. Delalande, and T. Vaškevičius. Sharper exponential convergence rates for Sinkhorn’s algorithm in continuous settings.Math. Program., 2025

work page 2025

[6] [6]

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors,Advances in Neural Infor- mation Processing Systems, volume 26. Curran Associates, Inc., 2013

work page 2013

[7] [7]

Eckstein and M

S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks.Appl. Math. Optim., 83(2):639–667, 2021

work page 2021

[8] [8]

Eckstein and M

S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024

work page 2024

[9] [9]

Essid and J

M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs.SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018

work page 2018

[10] [10]

Franklin and J

J. Franklin and J. Lorenz. On the scaling of multidimensional matrices.Linear Algebra Appl., 114/115:717–735, 1989

work page 1989

[11] [11]

Garriz-Molina, A

A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadrat- ically regularized optimal transport and its relation with the porous medium equation. arXiv:2407.21528, 2024

work page arXiv 2024

[12] [12]

Genevay, M

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016

work page 2016

[13] [13]

González-Sanz, E

A. González-Sanz, E. del Barrio, and M. Nutz. Sample complexity of quadratically regularized optimal transport. Forthcoming, 2025

work page 2025

[14] [14]

González-Sanz, S

A. González-Sanz, S. Eckstein, and M. Nutz. Sparse regularized optimal transport without curse of dimensionality, 2025

work page 2025

[15] [15]

González-Sanz and M

A. González-Sanz and M. Nutz. Sparsity of quadratically regularized optimal transport: Scalar case. arXiv:2410.03353, 2024

work page arXiv 2024

[16] [16]

Gulrajani, F

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. InProceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017

work page 2017

[17] [17]

N. Lahn, D. Mulchandani, and S. Raghvendra. A graph theoretic additive approximation of optimal transport. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13831–13841. Curran Associates, Inc., 2019

work page 2019

[18] [18]

L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycen- ters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020

work page 2020

[19] [19]

Lorenz and H

D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022

work page 2022

[20] [20]

D. A. Lorenz and H. Mahler. Orlicz-space regularization for optimal transport and algorithms for quadratic regularization.arXiv:1909.06082, 2019

work page arXiv 1909

[21] [21]

D. A. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021

work page 1919

[22] [22]

Muzellec, R

B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. 23

work page 2017

[23] [23]

Nesterov

Y. Nesterov. Introductory lectures on convex optimization, volume 87 ofApplied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course

work page 2004

[24] [24]

M. Nutz. Quadratically regularized optimal transport: existence and multiplicity of potentials. SIAM J. Math. Anal., 57(3):2622–2649, 2025

work page 2025

[25] [25]

Peyré and M

G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019

work page 2019

[26] [26]

Schmitzer

B. Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput., 41(3):A1443–A1481, 2019

work page 2019

[27] [27]

Seguy, B

V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. InInternational Conference on Learning Represen- tations, 2018

work page 2018

[28] [28]

Wiesel and X

J. Wiesel and X. Xu. Sparsity of quadratically regularized optimal transport: Bounds on con- centration and bias.Preprint arXiv:2410.03425v1, 2024

work page arXiv 2024

[29] [29]

Zhang, G

S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised optimal transport.arXiv:2307.09816, 2023. 24

work page arXiv 2023