Linear Convergence of Gradient Descent for Quadratically Regularized Optimal Transport
Pith reviewed 2026-05-21 22:47 UTC · model grok-4.3
The pith
Gradient descent on the dual of quadratically regularized optimal transport converges linearly in L2 after a burn-in period.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish linear convergence in L², that is, the L² distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.
What carries the argument
The linearized gradient descent operator at the optimum, whose spectral radius less than one establishes it as a strict contraction in continuous function space.
If this is right
- The dual gradient descent iteration converges exponentially fast in the L2 norm to the limiting potentials.
- The nonlinear dynamics inherit the contraction property of the linearized operator after a finite number of steps.
- This linear rate applies in a continuous function-space setting without relying on strong convexity of the dual objective.
- The result supplies a convergence guarantee for computing quadratically regularized transport plans via gradient methods.
Where Pith is reading between the lines
- The same spectral contraction may hold for other dual problems that are convex but not strongly convex, suggesting a general technique for proving linear rates.
- Numerical verification of the spectral radius for specific cost functions could serve as a practical check before running the algorithm.
- The burn-in period length might be bounded in terms of problem parameters, allowing explicit error estimates from the start.
Load-bearing premise
The spectral radius of the linearized gradient descent operator at the optimum is strictly less than one in the continuous setting.
What would settle it
A numerical test on concrete measures and costs where the ratio of successive L2 errors between iterates fails to approach a fixed value strictly below one after the burn-in phase.
Figures
read the original abstract
In optimal transport, quadratic regularization is an alternative to entropic regularization when sparse couplings or small regularization parameters are desired. Quadratic regularization penalizes transport couplings by the squared $L^2$ norm of their density, or equivalently by the $\chi^2$ divergence. While a number of computational approaches have been shown to work in practice, the dual problem is not strongly convex and theoretical convergence results are scarce. We focus on the dual gradient descent algorithm in a continuous setting and establish linear convergence in $L^2$, that is, the $L^2$ distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes linear convergence in L² for the dual gradient-descent iterates of quadratically regularized optimal transport. The central argument proceeds by linearizing the gradient-descent map at the dual optimum, showing via spectral analysis that the resulting linear operator is a strict contraction on L², and then arguing that the nonlinear iteration inherits the linear rate after a finite burn-in period.
Significance. If the spectral-radius claim holds with a contraction constant independent of discretization and under standard assumptions on the cost and marginals, the result supplies the first rigorous linear-rate guarantee for gradient descent on the dual of quadratically regularized OT. The direct spectral approach (rather than any self-referential or fitted quantity) is a methodological strength that could extend to other non-strongly-convex regularizations.
major comments (2)
- [proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.
- [statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.
minor comments (1)
- Notation for the linearized operator and its spectrum should be introduced with an explicit equation number at first use to improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below with clarifications and indicate the revisions we will make.
read point-by-point responses
-
Referee: [proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.
Authors: We appreciate the referee drawing attention to the level of detail in the spectral analysis. The manuscript establishes the strict contraction by analyzing the eigenvalues of the linearized operator, which arise from the composition of multiplication by the cost function and the orthogonal projections enforcing the marginal constraints; under the standing assumptions of continuous cost and marginals on a compact domain, this yields a spectral radius strictly less than one for all step sizes below an explicit threshold depending on the sup-norms of the cost and marginal densities. To address the request for greater explicitness, we will revise the spectral-analysis section to include a dedicated lemma that states the precise step-size restriction and the resulting bound on the spectral radius in terms of these quantities. The inheritance by the nonlinear map then follows from standard local contraction arguments in Banach space. revision: yes
-
Referee: [statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.
Authors: The burn-in period is finite because the gradient map is continuously differentiable and the linearized operator is a contraction at the fixed point, so the iterates eventually enter a neighborhood in which the linear rate governs the behavior. We agree that uniformity with respect to discretization is desirable for practical relevance. The contraction constant itself is independent of discretization in the continuous setting; we will add a remark to the statement of the main theorem clarifying that the same constant carries over to consistent discretizations via operator-norm convergence of the discrete linearized operators to the continuous one. This preserves the linear rate while making the dependence on discretization explicit. revision: partial
Circularity Check
No significant circularity; derivation is self-contained via direct spectral analysis
full rationale
The paper derives linear L² convergence of gradient descent for quadratically regularized OT by performing a spectral analysis of the linearized operator at the dual optimum and establishing that this operator is a strict contraction (with the nonlinear iteration inheriting the property after burn-in). This rests on explicit operator properties in the continuous function-space setting rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. No step reduces the claimed result to its own inputs by construction; the central argument supplies independent mathematical content through the spectrum bound.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The dual problem admits a unique optimum in the chosen function space.
- domain assumption The linearized gradient descent operator is well-defined and bounded on L2.
Reference graph
Works this paper leans on
-
[1]
E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regu- larized optimal transport.Bernoulli, 31(1):213–239, 2025
work page 2025
-
[2]
M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. volume 84 of Proceedings of Machine Learning Research, pages 880–889, 2018
work page 2018
-
[3]
H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011. 22
work page 2011
-
[4]
G. Carlier. On the linear convergence of the multimarginal Sinkhorn algorithm.SIAM J. Optim., 32(2):786–794, 2022
work page 2022
- [5]
-
[6]
M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors,Advances in Neural Infor- mation Processing Systems, volume 26. Curran Associates, Inc., 2013
work page 2013
-
[7]
S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks.Appl. Math. Optim., 83(2):639–667, 2021
work page 2021
-
[8]
S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024
work page 2024
-
[9]
M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs.SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018
work page 2018
-
[10]
J. Franklin and J. Lorenz. On the scaling of multidimensional matrices.Linear Algebra Appl., 114/115:717–735, 1989
work page 1989
-
[11]
A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadrat- ically regularized optimal transport and its relation with the porous medium equation. arXiv:2407.21528, 2024
-
[12]
A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016
work page 2016
-
[13]
A. González-Sanz, E. del Barrio, and M. Nutz. Sample complexity of quadratically regularized optimal transport. Forthcoming, 2025
work page 2025
-
[14]
A. González-Sanz, S. Eckstein, and M. Nutz. Sparse regularized optimal transport without curse of dimensionality, 2025
work page 2025
-
[15]
A. González-Sanz and M. Nutz. Sparsity of quadratically regularized optimal transport: Scalar case. arXiv:2410.03353, 2024
-
[16]
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. InProceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017
work page 2017
-
[17]
N. Lahn, D. Mulchandani, and S. Raghvendra. A graph theoretic additive approximation of optimal transport. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13831–13841. Curran Associates, Inc., 2019
work page 2019
-
[18]
L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycen- ters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020
work page 2020
-
[19]
D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022
work page 2022
- [20]
-
[21]
D. A. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021
work page 1919
-
[22]
B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. 23
work page 2017
- [23]
-
[24]
M. Nutz. Quadratically regularized optimal transport: existence and multiplicity of potentials. SIAM J. Math. Anal., 57(3):2622–2649, 2025
work page 2025
-
[25]
G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019
work page 2019
- [26]
- [27]
-
[28]
J. Wiesel and X. Xu. Sparsity of quadratically regularized optimal transport: Bounds on con- centration and bias.Preprint arXiv:2410.03425v1, 2024
- [29]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.