Heavy-ball Algorithms Always Escape Saddle Points
Pith reviewed 2026-05-24 17:57 UTC · model grok-4.3
The pith
Heavy-ball algorithms with random initialization always escape saddle points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a new mapping on an augmented space, the two-point heavy-ball recurrence is rewritten as an equivalent single-point iteration that inherits the saddle-escape property of gradient descent without adding spurious fixed points or altering the measure of escaping trajectories. This reformulation yields a proof that heavy-ball gradient descent escapes saddles from random initialization using larger stepsizes than plain gradient descent, and that the heavy-ball proximal point algorithm likewise always escapes saddles.
What carries the argument
The augmented-space mapping that converts the heavy-ball two-step update into an equivalent single-iteration dynamical system.
If this is right
- Heavy-ball gradient descent escapes saddles with larger stepsizes than standard gradient descent.
- The heavy-ball proximal point algorithm escapes saddle points from random initialization.
- Random initialization is sufficient for both algorithms to reach non-saddle points in nonconvex problems.
- The escape guarantee transfers directly because the mapping preserves the relevant dynamical properties.
Where Pith is reading between the lines
- The same lifting technique may apply to other momentum schemes whose updates depend on history.
- Practical stepsize rules for heavy-ball methods could be chosen using the escape bounds rather than convergence rates alone.
- Stochastic or noisy variants might inherit similar escape guarantees by extending the deterministic argument.
Load-bearing premise
The augmented mapping preserves the escape property without introducing spurious fixed points or changing the measure of escaping trajectories.
What would settle it
A concrete numerical trajectory of the heavy-ball method, started at a random point and using the stepsize allowed by the proof, that remains at a saddle point for arbitrarily many iterations.
read the original abstract
Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been showed that many first-order methods always avoid saddle points with random starting points. In this paper, we answer a question: can the nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Direct using the existing proof technique for the heavy-ball algorithms is hard due to that each iteration of the heavy-ball algorithm consists of current and last points. It is impossible to formulate the algorithms as iteration like xk+1= g(xk) under some mapping g. To this end, we design a new mapping on a new space. With some transfers, the heavy-ball algorithm can be interpreted as iterations after this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys larger stepsize than the gradient descent to escape saddle points to escape the saddle point. And the heavy-ball proximal point algorithm is also considered; we also proved that the algorithm can always escape the saddle point.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that heavy-ball gradient descent and heavy-ball proximal point algorithms with random initialization always escape saddle points in nonconvex optimization. Direct analysis is difficult because each iteration depends on both current and previous points, so the authors introduce a new mapping on an augmented state space. With this mapping, heavy-ball iterations are reinterpreted as standard first-order iterations, allowing transfer of known saddle-escape results; they further claim that heavy-ball permits a strictly larger step size than plain gradient descent for escape.
Significance. If the mapping is shown to be bijective (or measure-preserving on the relevant sets) and to map saddles exactly onto saddles without introducing spurious fixed points or altering the measure of escaping trajectories, the result would usefully extend saddle-avoidance theory to momentum methods. The explicit larger-stepsize claim would also be of practical interest. The manuscript supplies no machine-checked proofs or reproducible code, so the strength rests entirely on the analytic argument.
major comments (2)
- [mapping construction (abstract and main theoretical section)] The transfer argument (abstract and the section introducing the augmented mapping) rests on the claim that the new mapping allows heavy-ball dynamics to be viewed as iterations under a standard first-order map. No explicit verification is supplied that the mapping is bijective on the state space, preserves the measure of escaping trajectories, or maps fixed points exactly onto the original saddles without creating or destroying escape sets. This property is load-bearing for the entire proof strategy.
- [step-size comparison] The claim that heavy-ball enjoys a larger step size than gradient descent for escape (abstract) is stated without an explicit comparison of the admissible step-size intervals or the dependence on the Hessian eigenvalues at the saddle. The derivation of the admissible range under the mapped dynamics is not shown, so it is unclear whether the improvement is strict or merely non-inferior.
minor comments (1)
- [abstract] The abstract asserts a proof but contains no equations, assumptions on the objective, or statement of the precise escape measure; this makes the central claim difficult to assess from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will incorporate clarifications to strengthen the presentation.
read point-by-point responses
-
Referee: [mapping construction (abstract and main theoretical section)] The transfer argument (abstract and the section introducing the augmented mapping) rests on the claim that the new mapping allows heavy-ball dynamics to be viewed as iterations under a standard first-order map. No explicit verification is supplied that the mapping is bijective on the state space, preserves the measure of escaping trajectories, or maps fixed points exactly onto the original saddles without creating or destroying escape sets. This property is load-bearing for the entire proof strategy.
Authors: The augmented mapping is constructed explicitly as an affine isomorphism on the product space (x_k, x_{k-1}), with the transformation matrix being lower-triangular with unit diagonal entries and hence invertible with determinant 1. Fixed points of the mapped iteration correspond exactly to critical points of the original objective because the momentum term vanishes at equilibrium. Lebesgue measure on the augmented space is preserved because the Jacobian is constant with absolute determinant 1. We agree that an explicit lemma collecting these properties will improve readability and will add it in the revision. revision: yes
-
Referee: [step-size comparison] The claim that heavy-ball enjoys a larger step size than gradient descent for escape (abstract) is stated without an explicit comparison of the admissible step-size intervals or the dependence on the Hessian eigenvalues at the saddle. The derivation of the admissible range under the mapped dynamics is not shown, so it is unclear whether the improvement is strict or merely non-inferior.
Authors: The admissible step-size interval is obtained by requiring that the linearized mapped dynamics possess at least one eigenvalue of magnitude strictly greater than one along the negative-curvature direction of the Hessian. For a saddle with curvature eigenvalue λ < 0 the resulting upper bound on the step size α is strictly larger than the corresponding bound 2/L for plain gradient descent whenever the momentum parameter β > 0. We will add an explicit corollary displaying the two intervals side-by-side together with the dependence on λ to make the strict improvement transparent. revision: yes
Circularity Check
Augmented-space mapping is explicitly constructed and escape property proved directly on the lifted system
full rationale
The paper constructs an explicit new mapping on an augmented space to recast heavy-ball iterations as standard first-order updates, then states that it proves the escape property holds under this reformulation (abstract: 'we design a new mapping on a new space. With some transfers, the heavy-ball algorithm can be interpreted as iterations after this mapping. Theoretically, we prove...'). No quoted step reduces the central claim to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation whose content is itself unverified. The derivation is therefore self-contained: the mapping and transfer are presented as objects of direct proof rather than assumed by construction or imported without independent verification.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The objective function is twice continuously differentiable with Lipschitz gradient and Hessian.
- domain assumption Random initialization is drawn from a distribution with full support on the domain.
invented entities (1)
-
Augmented state-space mapping
no independent evidence
Reference graph
Works this paper leans on
-
[1]
[Alvarez and Attouch, 2001 ] Felipe Alvarez and Hedy At- touch. An inertial proximal method for maximal mono- tone operators via discretization of a nonlinear oscillato r with damping. Set-V alued Analysis, 9(1-2):3–11,
work page 2001
-
[2]
On the minimizing prop- erty of a second order dissipative system in hilbert spaces
[Alvarez, 2000] Felipe Alvarez. On the minimizing prop- erty of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization , 38(4):1102– 1119,
work page 2000
-
[3]
Simple, efficient, and neural algorithms for sparse coding
[Arora et al., 2015] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding
work page 2015
-
[4]
Proximal alternating linearized minimization for nonconvex and nonsmooth problems
[Bolte et al., 2014] J´ erˆ ome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Pro- gramming, 146(1-2):459–494,
work page 2014
-
[5]
Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow
[Cai et al., 2016] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow. The Annals of Statistics, 44(5):2221–2251,
work page 2016
-
[6]
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
[Chaudhari et al., 2016] Pratik Chaudhari, Anna Choroman- ska, Stefano Soatto, Y ann LeCun, Carlo Baldassi, Chris- tian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838 ,
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[7]
[Combettes and Glaudin, 2017 ] Patrick L. Combettes and Lilian E. Glaudin. Quasinonexpansive iterations on the affine hull of orbits: From mann’s mean value algorithm to inertial methods. Siam Journal on Optimization , 27(4),
work page 2017
-
[8]
Escaping from saddle points-online stochastic gra- dient for tensor decomposition
[Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Y ang Y uan. Escaping from saddle points-online stochastic gra- dient for tensor decomposition. In Conference on Learning Theory, pages 797–842,
work page 2015
-
[9]
Matrix completion has no spurious local minimum
[Ge et al., 2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Ad- vances in Neural Information Processing Systems , pages 2973–2981,
work page 2016
-
[10]
Global conver- gence of the heavy-ball method for convex optimization
[Ghadimi et al., 2015] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global conver- gence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European , pages 310–315. IEEE,
work page 2015
-
[11]
Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent
[Jin et al., 2017b] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Matrix completion from a few entries
[Keshavan et al., 2010] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory , 56(6):2980–2998,
work page 2010
-
[13]
On gradients of func- tions definable in o-minimal structures
[Kurdyka, 1998] Krzysztof Kurdyka. On gradients of func- tions definable in o-minimal structures. In Annales de l’institut F ourier, volume 48, pages 769–784. Chartres: L’Institut, 1950-,
work page 1998
-
[14]
[Lange, 2013] Kenneth Lange. Elementary optimization. In Optimization, pages 1–21. Springer,
work page 2013
-
[15]
Gradient descent only con- verges to minimizers
[Lee et al., 2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only con- verges to minimizers. In Conference on Learning Theory , pages 1246–1257,
work page 2016
-
[16]
First-order methods almost always avoid sad- dle points
[Lee et al., 2017] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Ben- jamin Recht. First-order methods almost always avoid sad- dle points. T o be appeared in Mathematical Progamming,
work page 2017
-
[17]
Linearly convergent stochastic heavy ball method for minimizing generalization error
[Loizou and Richt´ arik, 2017a] Nicolas Loizou and Peter Richt´ arik. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
[Loizou and Richt´ arik, 2017b] Nicolas Loizou and Peter Richt´ arik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Sur la g´ eom´ etrie semi-et sous-analytique
[Łojasiewicz, 1993] Stanislas Łojasiewicz. Sur la g´ eom´ etrie semi-et sous-analytique. Ann. Inst. F ourier, 43(5):1575– 1595,
work page 1993
-
[20]
Approximate inertial proximal methods using the enlargement of maximal monotone operators
[Moudafi and Elizabeth, 2003 ] Abdellatif Moudafi and E Elizabeth. Approximate inertial proximal methods using the enlargement of maximal monotone operators. International Journal of Pure and Applied Mathematics , 5(3):283–299,
work page 2003
-
[21]
ipiano: Inertial proximal algorithm for nonconvex optimization
[Ochs et al., 2014] Peter Ochs, Y unjin Chen, Thomas Brox, and Thomas Pock. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sci- ences, 7(2):1388–1419,
work page 2014
-
[22]
Nonconvergence to un- stable points in urn models and stochastic approximations
[Pemantle, 1990] Robin Pemantle. Nonconvergence to un- stable points in urn models and stochastic approximations. The Annals of Probability , 18(2):698–712,
work page 1990
-
[23]
[Rockafellar and Wets, 2009] R Tyrrell Rockafellar and Roger J-B Wets. V ariational analysis, volume
work page 2009
-
[24]
Global stability of dynamical systems
[Shub, 2013] Michael Shub. Global stability of dynamical systems. Springer Science & Business Media,
work page 2013
-
[25]
Com- plete dictionary recovery over the sphere i: Overview and the geometric picture
[Sun et al., 2017] Ju Sun, Qing Qu, and John Wright. Com- plete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884,
work page 2017
-
[26]
Non-ergodic con- vergence analysis of heavy-ball algorithms
[Sun et al., 2019] Tao Sun, Penghang Yin, Dongsheng Li, Chun Huang, Lei Guan, and Hao Jiang. Non-ergodic con- vergence analysis of heavy-ball algorithms. AAAI,
work page 2019
-
[27]
[Xu and Yin, 2013 ] Y angyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor fac- torization and completion. SIAM Journal on imaging sci- ences, 6(3):1758–1789,
work page 2013
-
[28]
Heavy-ball method in nonconvex optimization problems
[Zavriev and Kostyuk, 1993] SK Zavriev and FV Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993
work page 1993
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.