From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

Razvan-Andrei Lascu; Taiji Suzuki

arxiv: 2605.17963 · v1 · pith:DKXPQSCXnew · submitted 2026-05-18 · 🧮 math.OC · stat.ML

From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

Razvan-Andrei Lascu , Taiji Suzuki This is my paper

Pith reviewed 2026-05-20 09:46 UTC · model grok-4.3

classification 🧮 math.OC stat.ML

keywords Wasserstein spacenon-convex optimizationsaddle point escapeNewton methodglobal minimalinear convergenceoptimal transportsecond-order methods

0 comments

The pith

The Wasserstein Saddle-Free Newton method escapes saddle points in polynomial time and converges linearly to global minimizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a second-order method for minimizing non-convex functionals defined over the Wasserstein space of probability measures. It preconditions the Wasserstein gradient using a regularized square root of the squared Wasserstein Hessian to repel iterates from directions of negative curvature while preserving attraction to positive curvature directions. Under regularity and benign landscape assumptions the method is shown to leave saddle regions and enter an alpha-neighborhood of a global minimizer in polynomial time with improved dependence on saddle parameters relative to first-order approaches. Once near a non-degenerate minimizer it converges linearly in L2-Wasserstein distance. The work also states second-order sufficient conditions for strict local minimality and supplies a particle-based practical implementation.

Core claim

We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an α-neighborhood of a global minimizer in polynomial tr

What carries the argument

Wasserstein Saddle-Free Newton (WSFN) that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian to induce repulsion along negative-curvature directions.

If this is right

Escapes saddle regions in polynomial time with improved dependence on saddle parameters compared to perturbed first-order methods.
Converges linearly in L2-Wasserstein distance to a non-degenerate global minimizer once inside the alpha-neighborhood.
Satisfies second-order sufficient optimality conditions for strict local minimality on Wasserstein space.
Admits a particle-based implementation that makes the dynamics practical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The preconditioning construction could be adapted to other Riemannian structures arising in optimal transport problems.
Applications in generative modeling or distribution learning might see faster escape from poor stationary points.
Extensions to stochastic or discrete-particle versions could be tested on benchmark functionals.
The same curvature-repulsion idea may combine with higher-order or quasi-Newton updates on Wasserstein space.

Load-bearing premise

The benign landscape assumptions together with the regularity conditions on the functional that are required for the polynomial-time escape guarantee and the linear convergence rate to hold.

What would settle it

A concrete non-convex functional on Wasserstein space that meets all stated regularity and benign landscape assumptions yet where WSFN iterates remain trapped near a saddle for super-polynomial time or fail to exhibit linear convergence once inside the alpha-neighborhood.

Figures

Figures reproduced from arXiv: 2605.17963 by Razvan-Andrei Lascu, Taiji Suzuki.

**Figure 1.** Figure 1: In-context feature learning experiment. We compare WGF, isotropic WGF, PWGF, and WSFN over five independent trials. The solid curves show the mean loss and the shaded regions show one standard deviation. WSFN decreases the loss much faster than the first-order baselines and reaches a substantially lower final value. the WSFN preconditioner is left for future work. Experiment 1: in-context feature learning.… view at source ↗

**Figure 2.** Figure 2: Matrix decomposition experiment. We compare WGF, isotropic WGF, PWGF, and WSFN over five independent trials. The solid curves show the mean loss and the shaded regions show one standard deviation. WSFN escapes the initial plateau much earlier than the first-order baselines and reaches a substantially lower final loss. Experiment 2: matrix decomposition. We next consider the matrix decomposition objective f… view at source ↗

**Figure 3.** Figure 3: Coulomb MMD experiment. We compare WGF, isotropic WGF, PWGF, and WSFN over five independent trials. WSFN rapidly decreases the loss and reaches the low loss regime significantly earlier than the firstorder baselines. basis. We denote by LS(L 2 µ ) and B(L 2 µ ) the spaces of linear symmetric and linear bounded operators from L 2 µ to L 2 µ , respectively. If Aµ ∈ B(L 2 µ ), then ∥Aµ∥op ≤ ∥Aµ∥HS. Appendix … view at source ↗

read the original abstract

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has showed that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $\alpha$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WSFN brings a saddle-free Newton preconditioner to Wasserstein space with polynomial escape and linear local rates, but the guarantees rest on strong benign-landscape assumptions.

read the letter

The main takeaway is that this paper gives a second-order method on Wasserstein space that preconditions the gradient by a regularized square root of the squared Hessian. The construction repels along negative curvature while preserving attraction to positive directions, which lets it escape saddles and then converge linearly in L2-Wasserstein distance once near a non-degenerate minimizer. They also state second-order sufficient conditions for strict local minimality in this setting. Under the regularity and benign landscape hypotheses, the escape time improves on prior perturbed first-order methods, and the particle implementation is offered as a practical approximation. The full manuscript check shows the argument is internally consistent with no obvious circularity or unjustified operator steps. This combination of explicit rates and the specific preconditioner looks new relative to the referenced Wasserstein Newton flows and perturbed gradient work. The paper does a clean job separating the continuous-space theory from the discrete implementation sketch. The soft spots are the load-bearing assumptions. The polynomial escape and linear rate both require the benign landscape plus regularity conditions on the functional; without them the claimed advantages do not hold. The infinite-dimensional Hessian is regularized to make the square root well-defined, but that step and the particle discretization introduce gaps between the formal guarantees and what can be run in practice. No numerical results are described, so stability and scaling remain open. This is for readers working on non-convex optimization over probability measures in machine learning or statistics. Anyone tracking second-order methods on metric spaces or Wasserstein geometry will find the rates and construction worth examining. I would send it for peer review because the claims are specific, the construction is distinct, and the internal logic holds up even if revisions on assumptions and experiments will be needed.

Referee Report

2 major / 3 minor

Summary. The paper introduces Wasserstein Saddle-Free Newton (WSFN), a second-order method on the Wasserstein space for minimizing non-convex functionals. It preconditions the Wasserstein gradient using a regularized square root of the squared Wasserstein Hessian to repel from negative-curvature directions while attracting to positive-curvature ones. Under regularity and benign-landscape assumptions, the method is shown to escape saddle regions and reach an α-neighborhood of a global minimizer in polynomial time (with improved dependence on saddle parameters relative to perturbed first-order methods), followed by linear convergence in L²-Wasserstein distance to a non-degenerate global minimizer. Second-order sufficient optimality conditions are established, and a particle-based implementation is presented.

Significance. If the central claims hold, the work meaningfully extends saddle-free Newton ideas from Euclidean space to the infinite-dimensional Wasserstein setting, delivering both global escape guarantees and fast local rates that improve on existing first-order perturbed Wasserstein gradient methods. Explicit credit is due for the adaptation of second-order analysis to the Wasserstein tangent space, the derivation of second-order sufficient optimality conditions, and the internally consistent escape-plus-linear-convergence argument under the stated hypotheses. The particle implementation provides a concrete practical bridge, though it is presented as an approximation rather than part of the formal guarantees.

major comments (2)

[§4] §4 (Escape analysis): the polynomial-time bound on escape from saddle regions improves the dependence on the saddle parameter relative to prior first-order work, but the proof sketch invokes the benign-landscape assumption without an explicit quantitative statement of how the regularization parameter for the Hessian square root enters the escape time; this constant must be tracked to confirm the claimed improvement is not absorbed into hidden factors.
[Theorem 5.2] Theorem 5.2 (linear convergence): the contraction rate in L²-Wasserstein distance is stated to be linear once inside the α-neighborhood, yet the argument relies on the smallest eigenvalue of the Wasserstein Hessian at the non-degenerate minimizer; an explicit lower bound on this eigenvalue (or a concrete test for verifying non-degeneracy in applications) is needed to make the rate fully operational.

minor comments (3)

[§3] Notation for the regularized square-root preconditioner is introduced in §3 but the precise functional-analytic setting (e.g., domain of the square-root operator on the tangent space) is only sketched; a short paragraph clarifying the Sobolev or L² regularity required would improve readability.
[Figure 1] Figure 1 (particle trajectories) lacks axis labels on the Wasserstein-distance plot and does not indicate the value of the regularization parameter used; adding these would make the numerical illustration self-contained.
[Abstract and Theorem 4.1] The abstract claims 'polynomial time' escape but does not specify the degree; the main theorem statement should include the explicit polynomial degree in the problem parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the precise comments, which help clarify the presentation of our results. We respond to each major comment below.

read point-by-point responses

Referee: [§4] §4 (Escape analysis): the polynomial-time bound on escape from saddle regions improves the dependence on the saddle parameter relative to prior first-order work, but the proof sketch invokes the benign-landscape assumption without an explicit quantitative statement of how the regularization parameter for the Hessian square root enters the escape time; this constant must be tracked to confirm the claimed improvement is not absorbed into hidden factors.

Authors: We agree that the dependence on the regularization parameter λ in the square-root Hessian preconditioner should be made fully explicit. In the revised manuscript we will augment the escape-time analysis in Section 4 with an additional lemma that isolates the contribution of λ, showing that the overall polynomial bound retains its improved scaling with respect to the negative-curvature threshold (relative to first-order perturbed methods) provided λ is chosen smaller than a landscape-dependent constant that is independent of the saddle parameters. The revised proof sketch will track this factor explicitly. revision: yes
Referee: [Theorem 5.2] Theorem 5.2 (linear convergence): the contraction rate in L²-Wasserstein distance is stated to be linear once inside the α-neighborhood, yet the argument relies on the smallest eigenvalue of the Wasserstein Hessian at the non-degenerate minimizer; an explicit lower bound on this eigenvalue (or a concrete test for verifying non-degeneracy in applications) is needed to make the rate fully operational.

Authors: The linear rate is indeed governed by the smallest eigenvalue λ_min of the Wasserstein Hessian at the target minimizer, which is necessarily functional-dependent and therefore cannot be bounded by a universal constant. In the revision we will add a short discussion after Theorem 5.2 that (i) states the contraction factor explicitly as 1−Θ(λ_min) and (ii) supplies a practical numerical test for non-degeneracy based on the particle discretization already introduced in the paper: the empirical Hessian spectrum can be computed from the particle system and checked for a positive lower bound. This renders the rate operational under the stated non-degeneracy hypothesis. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces WSFN as a regularized square-root preconditioned Newton method on the Wasserstein tangent space and derives polynomial-time saddle escape plus linear convergence under explicitly stated regularity conditions and benign landscape assumptions. These hypotheses are invoked as external inputs to the theorems rather than being constructed from the method's outputs or fitted parameters. No load-bearing step reduces by definition or self-citation to a quantity defined inside the paper; the second-order optimality conditions and escape analysis follow standard Riemannian geometry arguments adapted to Wasserstein space without circular reduction. The particle implementation is explicitly separated as a practical approximation outside the formal guarantees. The derivation remains self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard smoothness and landscape assumptions common in non-convex optimization; one regularization parameter is introduced to stabilize the Hessian square-root operator.

free parameters (1)

regularization parameter for Hessian square root
Introduced to ensure the preconditioner remains well-defined and to control repulsion strength along negative-curvature directions.

axioms (1)

domain assumption Regularity and benign landscape assumptions on the functional
Invoked to obtain the polynomial-time saddle escape and linear convergence results.

pith-pipeline@v0.9.0 · 5745 in / 1309 out tokens · 35216 ms · 2026-05-20T09:46:38.999553+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

regularized square root of the squared Wasserstein Hessian... (H²_μn + βI)^{−1/2} ∇_μ F
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

second-order sufficient optimality condition... λ_min K_μ* > 0

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

[1]

NIST Digital Library of Mathematical Functions.https://dlmf.nist.gov/, Release 1.2.6 of 2026- 03-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds

work page 2026
[2]

Agazzi and J

A. Agazzi and J. Lu. Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. InInternational Conference on Learning Representations, 2021

work page 2021
[3]

A. B. Aleksandrov and V. V. Peller. Operator Lipschitz functions.Russian Mathematical Surveys, 71(4):605, 2016

work page 2016
[4]

Ambrosio, N

L. Ambrosio, N. Gigli, and G. Savare.Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2008

work page 2008
[5]

Arbel, A

M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[6]

Arveson.A Short Course on Spectral Theory

W. Arveson.A Short Course on Spectral Theory. Graduate Texts in Mathematics. Springer New York, 2001

work page 2001
[7]

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017

work page 2017
[8]

Bonet, T

C. Bonet, T. Uscidda, A. David, P.-C. Aubin-Frankowski, and A. Korba. Mirror and preconditioned gradient descent in Wasserstein space. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[9]

B. Bonnet. A Pontryagin Maximum Principle in Wasserstein spaces for constrained optimal control problems.ESAIM: Control, Optimisation and Calculus of Variations, 25, 2019

work page 2019
[10]

Boufadene and F.-X

S. Boufadene and F.-X. Vialard. On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy.SIAM Journal on Mathematical Analysis, 57(4):4556–4587, 2025

work page 2025
[11]

Cardaliaguet, F

P. Cardaliaguet, F. Delarue, J. Lasry, and P. Lions.The Master Equation and the Convergence Problem in Mean Field Games. Annals of Mathematics Studies. Princeton University Press, 2019

work page 2019
[12]

R. A. Carmona and F. Delarue.Probabilistic Theory of Mean Field Games with Applications I: Mean Field FBSDEs, Control, and Games. Springer International Publishing, 2018

work page 2018
[13]

Chewi, J

S. Chewi, J. Niles-Weed, and P. Rigollet.Statistical optimal transport, volume 2364 ofLecture Notes in Mathematics. Springer, Cham, 2025. École d’Été de Probabilités de Saint-Flour XLIX – 2019. 12

work page 2025
[14]

L. Chizat. Mean-field Langevin dynamics: Exponential convergence and annealing.Transactions on Machine Learning Research, 2022

work page 2022
[15]

Chizat and F

L. Chizat and F. R. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. InNeurIPS, 2018

work page 2018
[16]

Quantita- tive convergence of wasserstein gradient flows of kernel mean discrepancies.arXiv preprint arXiv:2603.01977,

L. Chizat, M. Colombo, R. Colombo, and X. Fernández-Real. Quantitative convergence of Wasserstein gradient flows of Kernel Mean Discrepancies, 2026. arXiv:2603.01977

work page arXiv 2026
[17]

Chizat, M

L. Chizat, M. Colombo, and X. Fernández-Real. Convergence of drift-diffusion PDEs arising as Wasserstein gradient flows of convex functions, 2025. arXiv:2507.12385

work page arXiv 2025
[18]

C. Chu, J. Blanchet, and P. Glynn. Probability functional descent: A unifying perspective on GANs, Variational Inference, and Reinforcement Learning. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 1213–1222. PMLR, 09–15 Jun 2019

work page 2019
[19]

Conway.A Course in Functional Analysis

J. Conway.A Course in Functional Analysis. Graduate Texts in Mathematics. Springer New York, 1994

work page 1994
[20]

Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. InAdvances in Neural Information Processing Systems, 2014

work page 2014
[21]

E. B. Davies. Lipschitz continuity of functions of operators in the schatten classes.Journal of the London Mathematical Society, 37(1):148–157, 1988

work page 1988
[22]

R.M.Dudley.Real Analysis and Probability.CambridgeStudiesinAdvancedMathematics.Cambridge University Press, 2002

work page 2002
[23]

Figalli and F

A. Figalli and F. Glaudo.An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows. EMS Press, Berlin, 2023

work page 2023
[24]

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. InProceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 797–842. PMLR, 03–06 Jul 2015

work page 2015
[25]

Gustafson and I

S. Gustafson and I. Sigal.Mathematical Concepts of Quantum Mechanics. Universitext. Springer International Publishing, 2020

work page 2020
[26]

K. Hu, Z. Ren, D. Šiška, and Ł. Szpruch. Mean-field Langevin dynamics and energy landscape of neural networks.Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 57(4):2043 – 2065, 2021

work page 2043
[27]

Generative Modeling by Minimizing the Wasserstein-2 Loss

Y.-J. Huang and Z. Malik. Generative modeling by minimizing the Wasserstein-2 loss, 2024. arXiv:2406.13619

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1724–1732. PMLR, 06–11 Aug 2017

work page 2017
[29]

Jordan, D

R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998

work page 1998
[30]

Kato.Perturbation Theory for Linear Operators

T. Kato.Perturbation Theory for Linear Operators. Classics in Mathematics. Springer Berlin Heidelberg, 1995

work page 1995
[31]

Kim and T

J. Kim and T. Suzuki. Transformers learn nonlinear features in context. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024

work page 2024
[32]

Kissin and V

E. Kissin and V. S. Shulman. Classes of operator-smooth functions. I. operator-lipschitz functions. Proceedings of the Edinburgh Mathematical Society, 48(1):151–173, 2005

work page 2005
[33]

Korba, P.-C

A. Korba, P.-C. Aubin-Frankowski, S. Majewski, and P. Ablin. Kernel Stein discrepancy descent. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021

work page 2021
[34]

Lambert, S

M. Lambert, S. Chewi, F. R. Bach, S. Bonnabel, and P. Rigollet. Variational inference via Wasserstein gradient flows. InAdvances in Neural Information Processing Systems, volume 35, pages 14434–14447, 2022

work page 2022
[35]

Lanzetti, S

N. Lanzetti, S. Bolognani, and F. Dörfler. First-order conditions for optimization in the Wasserstein space.SIAM Journal on Mathematics of Data Science, 7(1):274–300, 2025

work page 2025
[36]

Lascu and M

R.-A. Lascu and M. B. Majka. Non-convex entropic mean-field optimization via Best Response flow. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[37]

Linear convergence of proximal descent schemes on the

R.-A. Lascu, M. B. Majka, D. Šiška, and Łukasz Szpruch. Linear convergence of proximal descent schemes on the Wasserstein space, 2024. arXiv:2411.15067

work page arXiv 2024
[38]

Leahy, B

J.-M. Leahy, B. Kerimkulov, D. Šiška, and Ł. Szpruch. Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. InProceedings of the 13 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 12222–12252. PMLR, 17–23 Jul 2022

work page 2022
[39]

Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. InAdvances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019
[40]

InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 6426–6436

Y.Lu, C.Ma, J.Lu, andL.Ying.Amean-fieldanalysisofDeepResNetandBeyond: TowardsProvable Optimization Via Overparameterization From Depth. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 6426–6436. PMLR, 2020

work page 2020
[41]

P. Malo, L. Viitasaari, A. Suominen, E. Vilkkumaa, and O. Tahvonen. Convex regularization and convergence of policy gradient flows under safety constraints, 2024. arXiv:2411.19193

work page arXiv 2024
[42]

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences of the United States of America, 115:E7665 – E7671, 2018

work page 2018
[43]

Nitanda, D

A. Nitanda, D. Wu, and T. Suzuki. Convex Analysis of the Mean Field Langevin Dynamics.arXiv e-prints, page arXiv:2201.10469, Jan. 2022

work page arXiv 2022
[44]

Otto and C

F. Otto and C. Villani. Generalization of an inequality by talagrand and links with the logarithmic sobolev inequality.Journal of Functional Analysis, 173(2):361–400, 2000

work page 2000
[45]

C. E. Rasmussen and C. K. I. Williams.Gaussian Processes for Machine Learning. The MIT Press, 2005

work page 2005
[46]

Reed and B

M. Reed and B. Simon.Methods of Modern Mathematical Physics. I: Functional Analysis. Academic Press, New York, 1980

work page 1980
[47]

Rotskoff and E

G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach.Communications on Pure and Applied Mathematics, 75(9):1889– 1935, 2022

work page 1935
[48]

Salim, A

A. Salim, A. Korba, and G. Luise. The Wasserstein proximal gradient algorithm. InAdvances in Neural Information Processing Systems, volume 33, pages 12356–12366. Curran Associates, Inc., 2020

work page 2020
[49]

Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling

F. Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015

work page 2015
[50]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020

work page 2020
[51]

Villani.Topics in optimal transportation

C. Villani.Topics in optimal transportation. Graduate studies in mathematics. American Mathemati- cal Society, 2003

work page 2003
[52]

Villani.Optimal Transport: Old and New

C. Villani.Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008

work page 2008
[53]

Y. Wang, P. Chen, and W. Li. Projected Wasserstein gradient descent for high-dimensional Bayesian inference.SIAM/ASA Journal on Uncertainty Quantification, 10(4):1513–1532, 2022

work page 2022
[54]

Wang and W

Y. Wang and W. Li. Information Newton’s flow: second-order optimization method in probability space, 2020. arXiv:2001.04341

work page arXiv 2020
[55]

Wibisono

A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. InProceedings of the 31st Conference On Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 2093–3027. PMLR, 06–09 Jul 2018

work page 2093
[56]

Yamamoto, K

K. Yamamoto, K. Oko, Z. Yang, and T. Suzuki. Mean field Langevin actor-critic: Faster convergence and global optimality beyond lazy learning. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 55706–55738. PMLR, 21–27 Jul 2024

work page 2024
[57]

Yamamoto, J

N. Yamamoto, J. Kim, and T. Suzuki. Hessian-guided perturbed Wasserstein gradient flows for escaping saddle points. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[58]

R. Yao, X. Chen, and Y. Yang. Wasserstein proximal coordinate gradient algorithms.Journal of Machine Learning Research, 25(269):1–66, 2024

work page 2024
[59]

Yao and Y

R. Yao and Y. Yang. Mean-field variational inference via Wasserstein gradient flow, 2023. arXiv:2207.08074

work page arXiv 2023
[60]

Convergence Analysis of the Wasserstein Proximal Algorithm beyond Geodesic Convexity

S. Zhu and X. Chen. Convergence analysis of the Wasserstein proximal algorithm beyond geodesic convexity, 2025. arXiv:2501.14993. 14 6.Appendix The appendices are organized to guide the reader from motivating examples and implementation to background material and, finally, the technical analysis. Appendix A presents benign non-convex objectives on Wassers...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Introduction . . . . . . . . . . . . . . . . . . . . . . . .1 1.1. Non-convex optimization on Wasserstein space . . . . . . . . . . . . . . . 1 1.2. Why first-order methods and Wasserstein Newton are insufficient . . . . . . 2 1.3. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

work page
[62]

Wasserstein geometry for second-order optimization . . . . . .3 2.1. Problem setup and minimal notation . . . . . . . . . . . . . . . . . . . 3 2.2. Wasserstein Hessian along transport curves . . . . . . . . . . . . . . . . 4 2.3. Why Wasserstein Newton fails near saddles . . . . . . . . . . . . . . . . 5

work page
[63]

Wasserstein Saddle-Free Newton . . . . . . . . . . . . . . .6 3.1. From Newton to saddle-free preconditioning . . . . . . . . . . . . . . . 6 3.2. Regularized WSFN update . . . . . . . . . . . . . . . . . . . . . . . 7 3.3. Second-order structure in the perturbation and preconditioner . . . . . . . 8

work page
[64]

Theoretical guarantees . . . . . . . . . . . . . . . . . . . .9 4.1. Second-order optimality and landscape assumptions . . . . . . . . . . . . 9 4.2. Global convergence to a neighborhood of a global minimizer . . . . . . . . 10 4.3. Local linear convergence to a non-degenerate global minimizer . . . . . . . . 11

work page
[65]

12 Acknowledgements

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . .12 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .12 References . . . . . . . . . . . . . . . . . . . . . . . . . . .12

work page
[66]

15 Appendix A

Appendix . . . . . . . . . . . . . . . . . . . . . . . . .15 Appendix A. Examples of benign objectives on Wasserstein space .16 Appendix B. Particle Implementation of WSFN . . . . . . . . .18 B.1. Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . 19 Appendix C. Additional notation for operators onL2 µ . . . . . . .22 15 Appendix D.L ...

work page
[67]

Using the property of adjoint,(AB)∗ =B ∗A∗, we get (H2 µ)∗ = (Hµ Hµ)∗ = H∗ µ H∗ µ = Hµ Hµ = H2 µ, hence H2 µ is self-adjoint

By [30, Problem 2.35], ⟨Hµ v, v⟩L2µ ≤ ⟨|H µ|v, v⟩ L2µ, for anyv∈L 2 µ. Using the property of adjoint,(AB)∗ =B ∗A∗, we get (H2 µ)∗ = (Hµ Hµ)∗ = H∗ µ H∗ µ = Hµ Hµ = H2 µ, hence H2 µ is self-adjoint. The identity operatorId×d is trivially self-adjoint and the sum of two self-adjoint operators is self-adjoint. Thus,H2 µ +βI d×d is self-adjoint. Furthermore, b...

work page
[68]

Summing fromk= 0tok=n−1yields F(µ 0)−F(µ n)≥ τ √β 2 n−1X k=0 ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk

Therefore, using thatH2 µk +βI d×d ⪰βI d×d implies( H2 µk +βI d×d)− 1 2 ⪯ β− 1 2 Id×d gives ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk ≤ 1√β D ∇µF(µ k),(H 2 µk +βI d×d)− 1 2 ∇µF(µ k) E L2 µk , Therefore, usingτ≤ √β(CM +C K)−1 gives F(µ k+1)≤F(µ k)−τ p β∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk + τ √β 2 ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk =F(µ k)− τ √β 2 ∥(H2 ...

work page
[69]

Using the Taylor expansionlog(1 + x) = Θ(x)2, for small enoughx > 0, we see thatnout = ˜O(˜δ−1)

To do so, we need to study the asymptotic behavior of the parameters η, nout and F0 as δ→ 0(which implies ˜δ→ 0), treating τ, κ,|c|, β, L H, CH and RF as fixed constants. Using the Taylor expansionlog(1 + x) = Θ(x)2, for small enoughx > 0, we see thatnout = ˜O(˜δ−1). Consequently, (38) implies F0 = ˜O(˜δ3). From ε RF√β + 2LH πβ ≤ ˜δ 3 2, we haveε = O(˜δ 3...

work page
[70]

stuck region

Multiplying∆by noutτ and substituting our definition ofF 1/2 0 into q τ nout β F 1/2 0 yields noutτ∆ = 4L H 1 2√β + 2CH πβ (τ nout)3/2 √β F 1/2 0 + 4LH 1 2√β + 2CH πβ τ noutηκ +ετ n out RF√β + 2LH πβ = 1 3 log 3 2 + ˜O(˜δ 1 2 ) + ˜O(˜δ). For sufficiently smallδ, the higher-order terms can be upper bounded by1 3 log 3 2 guaran- teeing thatn outτ∆≤log 3 2. ...

work page

[1] [1]

NIST Digital Library of Mathematical Functions.https://dlmf.nist.gov/, Release 1.2.6 of 2026- 03-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds

work page 2026

[2] [2]

Agazzi and J

A. Agazzi and J. Lu. Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. InInternational Conference on Learning Representations, 2021

work page 2021

[3] [3]

A. B. Aleksandrov and V. V. Peller. Operator Lipschitz functions.Russian Mathematical Surveys, 71(4):605, 2016

work page 2016

[4] [4]

Ambrosio, N

L. Ambrosio, N. Gigli, and G. Savare.Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2008

work page 2008

[5] [5]

Arbel, A

M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[6] [6]

Arveson.A Short Course on Spectral Theory

W. Arveson.A Short Course on Spectral Theory. Graduate Texts in Mathematics. Springer New York, 2001

work page 2001

[7] [7]

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017

work page 2017

[8] [8]

Bonet, T

C. Bonet, T. Uscidda, A. David, P.-C. Aubin-Frankowski, and A. Korba. Mirror and preconditioned gradient descent in Wasserstein space. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[9] [9]

B. Bonnet. A Pontryagin Maximum Principle in Wasserstein spaces for constrained optimal control problems.ESAIM: Control, Optimisation and Calculus of Variations, 25, 2019

work page 2019

[10] [10]

Boufadene and F.-X

S. Boufadene and F.-X. Vialard. On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy.SIAM Journal on Mathematical Analysis, 57(4):4556–4587, 2025

work page 2025

[11] [11]

Cardaliaguet, F

P. Cardaliaguet, F. Delarue, J. Lasry, and P. Lions.The Master Equation and the Convergence Problem in Mean Field Games. Annals of Mathematics Studies. Princeton University Press, 2019

work page 2019

[12] [12]

R. A. Carmona and F. Delarue.Probabilistic Theory of Mean Field Games with Applications I: Mean Field FBSDEs, Control, and Games. Springer International Publishing, 2018

work page 2018

[13] [13]

Chewi, J

S. Chewi, J. Niles-Weed, and P. Rigollet.Statistical optimal transport, volume 2364 ofLecture Notes in Mathematics. Springer, Cham, 2025. École d’Été de Probabilités de Saint-Flour XLIX – 2019. 12

work page 2025

[14] [14]

L. Chizat. Mean-field Langevin dynamics: Exponential convergence and annealing.Transactions on Machine Learning Research, 2022

work page 2022

[15] [15]

Chizat and F

L. Chizat and F. R. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. InNeurIPS, 2018

work page 2018

[16] [16]

Quantita- tive convergence of wasserstein gradient flows of kernel mean discrepancies.arXiv preprint arXiv:2603.01977,

L. Chizat, M. Colombo, R. Colombo, and X. Fernández-Real. Quantitative convergence of Wasserstein gradient flows of Kernel Mean Discrepancies, 2026. arXiv:2603.01977

work page arXiv 2026

[17] [17]

Chizat, M

L. Chizat, M. Colombo, and X. Fernández-Real. Convergence of drift-diffusion PDEs arising as Wasserstein gradient flows of convex functions, 2025. arXiv:2507.12385

work page arXiv 2025

[18] [18]

C. Chu, J. Blanchet, and P. Glynn. Probability functional descent: A unifying perspective on GANs, Variational Inference, and Reinforcement Learning. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 1213–1222. PMLR, 09–15 Jun 2019

work page 2019

[19] [19]

Conway.A Course in Functional Analysis

J. Conway.A Course in Functional Analysis. Graduate Texts in Mathematics. Springer New York, 1994

work page 1994

[20] [20]

Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. InAdvances in Neural Information Processing Systems, 2014

work page 2014

[21] [21]

E. B. Davies. Lipschitz continuity of functions of operators in the schatten classes.Journal of the London Mathematical Society, 37(1):148–157, 1988

work page 1988

[22] [22]

R.M.Dudley.Real Analysis and Probability.CambridgeStudiesinAdvancedMathematics.Cambridge University Press, 2002

work page 2002

[23] [23]

Figalli and F

A. Figalli and F. Glaudo.An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows. EMS Press, Berlin, 2023

work page 2023

[24] [24]

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. InProceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 797–842. PMLR, 03–06 Jul 2015

work page 2015

[25] [25]

Gustafson and I

S. Gustafson and I. Sigal.Mathematical Concepts of Quantum Mechanics. Universitext. Springer International Publishing, 2020

work page 2020

[26] [26]

K. Hu, Z. Ren, D. Šiška, and Ł. Szpruch. Mean-field Langevin dynamics and energy landscape of neural networks.Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 57(4):2043 – 2065, 2021

work page 2043

[27] [27]

Generative Modeling by Minimizing the Wasserstein-2 Loss

Y.-J. Huang and Z. Malik. Generative modeling by minimizing the Wasserstein-2 loss, 2024. arXiv:2406.13619

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1724–1732. PMLR, 06–11 Aug 2017

work page 2017

[29] [29]

Jordan, D

R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998

work page 1998

[30] [30]

Kato.Perturbation Theory for Linear Operators

T. Kato.Perturbation Theory for Linear Operators. Classics in Mathematics. Springer Berlin Heidelberg, 1995

work page 1995

[31] [31]

Kim and T

J. Kim and T. Suzuki. Transformers learn nonlinear features in context. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024

work page 2024

[32] [32]

Kissin and V

E. Kissin and V. S. Shulman. Classes of operator-smooth functions. I. operator-lipschitz functions. Proceedings of the Edinburgh Mathematical Society, 48(1):151–173, 2005

work page 2005

[33] [33]

Korba, P.-C

A. Korba, P.-C. Aubin-Frankowski, S. Majewski, and P. Ablin. Kernel Stein discrepancy descent. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021

work page 2021

[34] [34]

Lambert, S

M. Lambert, S. Chewi, F. R. Bach, S. Bonnabel, and P. Rigollet. Variational inference via Wasserstein gradient flows. InAdvances in Neural Information Processing Systems, volume 35, pages 14434–14447, 2022

work page 2022

[35] [35]

Lanzetti, S

N. Lanzetti, S. Bolognani, and F. Dörfler. First-order conditions for optimization in the Wasserstein space.SIAM Journal on Mathematics of Data Science, 7(1):274–300, 2025

work page 2025

[36] [36]

Lascu and M

R.-A. Lascu and M. B. Majka. Non-convex entropic mean-field optimization via Best Response flow. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[37] [37]

Linear convergence of proximal descent schemes on the

R.-A. Lascu, M. B. Majka, D. Šiška, and Łukasz Szpruch. Linear convergence of proximal descent schemes on the Wasserstein space, 2024. arXiv:2411.15067

work page arXiv 2024

[38] [38]

Leahy, B

J.-M. Leahy, B. Kerimkulov, D. Šiška, and Ł. Szpruch. Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. InProceedings of the 13 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 12222–12252. PMLR, 17–23 Jul 2022

work page 2022

[39] [39]

Z. Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. InAdvances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019

[40] [40]

InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 6426–6436

Y.Lu, C.Ma, J.Lu, andL.Ying.Amean-fieldanalysisofDeepResNetandBeyond: TowardsProvable Optimization Via Overparameterization From Depth. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 6426–6436. PMLR, 2020

work page 2020

[41] [41]

P. Malo, L. Viitasaari, A. Suominen, E. Vilkkumaa, and O. Tahvonen. Convex regularization and convergence of policy gradient flows under safety constraints, 2024. arXiv:2411.19193

work page arXiv 2024

[42] [42]

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences of the United States of America, 115:E7665 – E7671, 2018

work page 2018

[43] [43]

Nitanda, D

A. Nitanda, D. Wu, and T. Suzuki. Convex Analysis of the Mean Field Langevin Dynamics.arXiv e-prints, page arXiv:2201.10469, Jan. 2022

work page arXiv 2022

[44] [44]

Otto and C

F. Otto and C. Villani. Generalization of an inequality by talagrand and links with the logarithmic sobolev inequality.Journal of Functional Analysis, 173(2):361–400, 2000

work page 2000

[45] [45]

C. E. Rasmussen and C. K. I. Williams.Gaussian Processes for Machine Learning. The MIT Press, 2005

work page 2005

[46] [46]

Reed and B

M. Reed and B. Simon.Methods of Modern Mathematical Physics. I: Functional Analysis. Academic Press, New York, 1980

work page 1980

[47] [47]

Rotskoff and E

G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach.Communications on Pure and Applied Mathematics, 75(9):1889– 1935, 2022

work page 1935

[48] [48]

Salim, A

A. Salim, A. Korba, and G. Luise. The Wasserstein proximal gradient algorithm. InAdvances in Neural Information Processing Systems, volume 33, pages 12356–12366. Curran Associates, Inc., 2020

work page 2020

[49] [49]

Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling

F. Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015

work page 2015

[50] [50]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020

work page 2020

[51] [51]

Villani.Topics in optimal transportation

C. Villani.Topics in optimal transportation. Graduate studies in mathematics. American Mathemati- cal Society, 2003

work page 2003

[52] [52]

Villani.Optimal Transport: Old and New

C. Villani.Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008

work page 2008

[53] [53]

Y. Wang, P. Chen, and W. Li. Projected Wasserstein gradient descent for high-dimensional Bayesian inference.SIAM/ASA Journal on Uncertainty Quantification, 10(4):1513–1532, 2022

work page 2022

[54] [54]

Wang and W

Y. Wang and W. Li. Information Newton’s flow: second-order optimization method in probability space, 2020. arXiv:2001.04341

work page arXiv 2020

[55] [55]

Wibisono

A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. InProceedings of the 31st Conference On Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 2093–3027. PMLR, 06–09 Jul 2018

work page 2093

[56] [56]

Yamamoto, K

K. Yamamoto, K. Oko, Z. Yang, and T. Suzuki. Mean field Langevin actor-critic: Faster convergence and global optimality beyond lazy learning. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 55706–55738. PMLR, 21–27 Jul 2024

work page 2024

[57] [57]

Yamamoto, J

N. Yamamoto, J. Kim, and T. Suzuki. Hessian-guided perturbed Wasserstein gradient flows for escaping saddle points. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[58] [58]

R. Yao, X. Chen, and Y. Yang. Wasserstein proximal coordinate gradient algorithms.Journal of Machine Learning Research, 25(269):1–66, 2024

work page 2024

[59] [59]

Yao and Y

R. Yao and Y. Yang. Mean-field variational inference via Wasserstein gradient flow, 2023. arXiv:2207.08074

work page arXiv 2023

[60] [60]

Convergence Analysis of the Wasserstein Proximal Algorithm beyond Geodesic Convexity

S. Zhu and X. Chen. Convergence analysis of the Wasserstein proximal algorithm beyond geodesic convexity, 2025. arXiv:2501.14993. 14 6.Appendix The appendices are organized to guide the reader from motivating examples and implementation to background material and, finally, the technical analysis. Appendix A presents benign non-convex objectives on Wassers...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[61] [61]

Introduction . . . . . . . . . . . . . . . . . . . . . . . .1 1.1. Non-convex optimization on Wasserstein space . . . . . . . . . . . . . . . 1 1.2. Why first-order methods and Wasserstein Newton are insufficient . . . . . . 2 1.3. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

work page

[62] [62]

Wasserstein geometry for second-order optimization . . . . . .3 2.1. Problem setup and minimal notation . . . . . . . . . . . . . . . . . . . 3 2.2. Wasserstein Hessian along transport curves . . . . . . . . . . . . . . . . 4 2.3. Why Wasserstein Newton fails near saddles . . . . . . . . . . . . . . . . 5

work page

[63] [63]

Wasserstein Saddle-Free Newton . . . . . . . . . . . . . . .6 3.1. From Newton to saddle-free preconditioning . . . . . . . . . . . . . . . 6 3.2. Regularized WSFN update . . . . . . . . . . . . . . . . . . . . . . . 7 3.3. Second-order structure in the perturbation and preconditioner . . . . . . . 8

work page

[64] [64]

Theoretical guarantees . . . . . . . . . . . . . . . . . . . .9 4.1. Second-order optimality and landscape assumptions . . . . . . . . . . . . 9 4.2. Global convergence to a neighborhood of a global minimizer . . . . . . . . 10 4.3. Local linear convergence to a non-degenerate global minimizer . . . . . . . . 11

work page

[65] [65]

12 Acknowledgements

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . .12 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .12 References . . . . . . . . . . . . . . . . . . . . . . . . . . .12

work page

[66] [66]

15 Appendix A

Appendix . . . . . . . . . . . . . . . . . . . . . . . . .15 Appendix A. Examples of benign objectives on Wasserstein space .16 Appendix B. Particle Implementation of WSFN . . . . . . . . .18 B.1. Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . 19 Appendix C. Additional notation for operators onL2 µ . . . . . . .22 15 Appendix D.L ...

work page

[67] [67]

Using the property of adjoint,(AB)∗ =B ∗A∗, we get (H2 µ)∗ = (Hµ Hµ)∗ = H∗ µ H∗ µ = Hµ Hµ = H2 µ, hence H2 µ is self-adjoint

By [30, Problem 2.35], ⟨Hµ v, v⟩L2µ ≤ ⟨|H µ|v, v⟩ L2µ, for anyv∈L 2 µ. Using the property of adjoint,(AB)∗ =B ∗A∗, we get (H2 µ)∗ = (Hµ Hµ)∗ = H∗ µ H∗ µ = Hµ Hµ = H2 µ, hence H2 µ is self-adjoint. The identity operatorId×d is trivially self-adjoint and the sum of two self-adjoint operators is self-adjoint. Thus,H2 µ +βI d×d is self-adjoint. Furthermore, b...

work page

[68] [68]

Summing fromk= 0tok=n−1yields F(µ 0)−F(µ n)≥ τ √β 2 n−1X k=0 ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk

Therefore, using thatH2 µk +βI d×d ⪰βI d×d implies( H2 µk +βI d×d)− 1 2 ⪯ β− 1 2 Id×d gives ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk ≤ 1√β D ∇µF(µ k),(H 2 µk +βI d×d)− 1 2 ∇µF(µ k) E L2 µk , Therefore, usingτ≤ √β(CM +C K)−1 gives F(µ k+1)≤F(µ k)−τ p β∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk + τ √β 2 ∥(H2 µk +βI d×d)− 1 2 ∇µF(µ k)∥2 L2 µk =F(µ k)− τ √β 2 ∥(H2 ...

work page

[69] [69]

Using the Taylor expansionlog(1 + x) = Θ(x)2, for small enoughx > 0, we see thatnout = ˜O(˜δ−1)

To do so, we need to study the asymptotic behavior of the parameters η, nout and F0 as δ→ 0(which implies ˜δ→ 0), treating τ, κ,|c|, β, L H, CH and RF as fixed constants. Using the Taylor expansionlog(1 + x) = Θ(x)2, for small enoughx > 0, we see thatnout = ˜O(˜δ−1). Consequently, (38) implies F0 = ˜O(˜δ3). From ε RF√β + 2LH πβ ≤ ˜δ 3 2, we haveε = O(˜δ 3...

work page

[70] [70]

stuck region

Multiplying∆by noutτ and substituting our definition ofF 1/2 0 into q τ nout β F 1/2 0 yields noutτ∆ = 4L H 1 2√β + 2CH πβ (τ nout)3/2 √β F 1/2 0 + 4LH 1 2√β + 2CH πβ τ noutηκ +ετ n out RF√β + 2LH πβ = 1 3 log 3 2 + ˜O(˜δ 1 2 ) + ˜O(˜δ). For sufficiently smallδ, the higher-order terms can be upper bounded by1 3 log 3 2 guaran- teeing thatn outτ∆≤log 3 2. ...

work page