pith. sign in

arxiv: 1907.09697 · v1 · pith:FWDTLJMRnew · submitted 2019-07-23 · 🧮 math.OC · cs.LG· stat.ML

Heavy-ball Algorithms Always Escape Saddle Points

Pith reviewed 2026-05-24 17:57 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords heavy-ball methodsaddle point escapenonconvex optimizationmomentum gradient descentproximal point algorithmrandom initializationaugmented space mapping
0
0 comments X

The pith

Heavy-ball algorithms with random initialization always escape saddle points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that heavy-ball gradient descent and heavy-ball proximal point methods avoid saddle points when started randomly. Because each step depends on both the current and previous point, the authors lift the recurrence to an augmented space via a specially constructed mapping. Under this mapping the iteration becomes a standard single-step dynamical system whose escape behavior can be analyzed directly. The resulting guarantee shows that the gradient version permits strictly larger stepsizes than ordinary gradient descent while still escaping. Readers interested in nonconvex problems such as neural-network training would therefore expect these momentum methods to succeed from random starts without additional safeguards.

Core claim

By constructing a new mapping on an augmented space, the two-point heavy-ball recurrence is rewritten as an equivalent single-point iteration that inherits the saddle-escape property of gradient descent without adding spurious fixed points or altering the measure of escaping trajectories. This reformulation yields a proof that heavy-ball gradient descent escapes saddles from random initialization using larger stepsizes than plain gradient descent, and that the heavy-ball proximal point algorithm likewise always escapes saddles.

What carries the argument

The augmented-space mapping that converts the heavy-ball two-step update into an equivalent single-iteration dynamical system.

If this is right

  • Heavy-ball gradient descent escapes saddles with larger stepsizes than standard gradient descent.
  • The heavy-ball proximal point algorithm escapes saddle points from random initialization.
  • Random initialization is sufficient for both algorithms to reach non-saddle points in nonconvex problems.
  • The escape guarantee transfers directly because the mapping preserves the relevant dynamical properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting technique may apply to other momentum schemes whose updates depend on history.
  • Practical stepsize rules for heavy-ball methods could be chosen using the escape bounds rather than convergence rates alone.
  • Stochastic or noisy variants might inherit similar escape guarantees by extending the deterministic argument.

Load-bearing premise

The augmented mapping preserves the escape property without introducing spurious fixed points or changing the measure of escaping trajectories.

What would settle it

A concrete numerical trajectory of the heavy-ball method, started at a random point and using the stepsize allowed by the proof, that remains at a saddle point for arbitrarily many iterations.

read the original abstract

Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been showed that many first-order methods always avoid saddle points with random starting points. In this paper, we answer a question: can the nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Direct using the existing proof technique for the heavy-ball algorithms is hard due to that each iteration of the heavy-ball algorithm consists of current and last points. It is impossible to formulate the algorithms as iteration like xk+1= g(xk) under some mapping g. To this end, we design a new mapping on a new space. With some transfers, the heavy-ball algorithm can be interpreted as iterations after this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys larger stepsize than the gradient descent to escape saddle points to escape the saddle point. And the heavy-ball proximal point algorithm is also considered; we also proved that the algorithm can always escape the saddle point.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that heavy-ball gradient descent and heavy-ball proximal point algorithms with random initialization always escape saddle points in nonconvex optimization. Direct analysis is difficult because each iteration depends on both current and previous points, so the authors introduce a new mapping on an augmented state space. With this mapping, heavy-ball iterations are reinterpreted as standard first-order iterations, allowing transfer of known saddle-escape results; they further claim that heavy-ball permits a strictly larger step size than plain gradient descent for escape.

Significance. If the mapping is shown to be bijective (or measure-preserving on the relevant sets) and to map saddles exactly onto saddles without introducing spurious fixed points or altering the measure of escaping trajectories, the result would usefully extend saddle-avoidance theory to momentum methods. The explicit larger-stepsize claim would also be of practical interest. The manuscript supplies no machine-checked proofs or reproducible code, so the strength rests entirely on the analytic argument.

major comments (2)
  1. [mapping construction (abstract and main theoretical section)] The transfer argument (abstract and the section introducing the augmented mapping) rests on the claim that the new mapping allows heavy-ball dynamics to be viewed as iterations under a standard first-order map. No explicit verification is supplied that the mapping is bijective on the state space, preserves the measure of escaping trajectories, or maps fixed points exactly onto the original saddles without creating or destroying escape sets. This property is load-bearing for the entire proof strategy.
  2. [step-size comparison] The claim that heavy-ball enjoys a larger step size than gradient descent for escape (abstract) is stated without an explicit comparison of the admissible step-size intervals or the dependence on the Hessian eigenvalues at the saddle. The derivation of the admissible range under the mapped dynamics is not shown, so it is unclear whether the improvement is strict or merely non-inferior.
minor comments (1)
  1. [abstract] The abstract asserts a proof but contains no equations, assumptions on the objective, or statement of the precise escape measure; this makes the central claim difficult to assess from the summary alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will incorporate clarifications to strengthen the presentation.

read point-by-point responses
  1. Referee: [mapping construction (abstract and main theoretical section)] The transfer argument (abstract and the section introducing the augmented mapping) rests on the claim that the new mapping allows heavy-ball dynamics to be viewed as iterations under a standard first-order map. No explicit verification is supplied that the mapping is bijective on the state space, preserves the measure of escaping trajectories, or maps fixed points exactly onto the original saddles without creating or destroying escape sets. This property is load-bearing for the entire proof strategy.

    Authors: The augmented mapping is constructed explicitly as an affine isomorphism on the product space (x_k, x_{k-1}), with the transformation matrix being lower-triangular with unit diagonal entries and hence invertible with determinant 1. Fixed points of the mapped iteration correspond exactly to critical points of the original objective because the momentum term vanishes at equilibrium. Lebesgue measure on the augmented space is preserved because the Jacobian is constant with absolute determinant 1. We agree that an explicit lemma collecting these properties will improve readability and will add it in the revision. revision: yes

  2. Referee: [step-size comparison] The claim that heavy-ball enjoys a larger step size than gradient descent for escape (abstract) is stated without an explicit comparison of the admissible step-size intervals or the dependence on the Hessian eigenvalues at the saddle. The derivation of the admissible range under the mapped dynamics is not shown, so it is unclear whether the improvement is strict or merely non-inferior.

    Authors: The admissible step-size interval is obtained by requiring that the linearized mapped dynamics possess at least one eigenvalue of magnitude strictly greater than one along the negative-curvature direction of the Hessian. For a saddle with curvature eigenvalue λ < 0 the resulting upper bound on the step size α is strictly larger than the corresponding bound 2/L for plain gradient descent whenever the momentum parameter β > 0. We will add an explicit corollary displaying the two intervals side-by-side together with the dependence on λ to make the strict improvement transparent. revision: yes

Circularity Check

0 steps flagged

Augmented-space mapping is explicitly constructed and escape property proved directly on the lifted system

full rationale

The paper constructs an explicit new mapping on an augmented space to recast heavy-ball iterations as standard first-order updates, then states that it proves the escape property holds under this reformulation (abstract: 'we design a new mapping on a new space. With some transfers, the heavy-ball algorithm can be interpreted as iterations after this mapping. Theoretically, we prove...'). No quoted step reduces the central claim to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation whose content is itself unverified. The derivation is therefore self-contained: the mapping and transfer are presented as objects of direct proof rather than assumed by construction or imported without independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The proof rests on standard smoothness and non-degeneracy assumptions for the objective plus the correctness of the state-space lifting; no free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption The objective function is twice continuously differentiable with Lipschitz gradient and Hessian.
    Required for the saddle-escape analysis and the mapping to be well-defined; standard in the field but not stated in the abstract.
  • domain assumption Random initialization is drawn from a distribution with full support on the domain.
    Needed for the probability of escape to be 1; implicit in all random-init escape results.
invented entities (1)
  • Augmented state-space mapping no independent evidence
    purpose: To recast the two-step heavy-ball recurrence as a single map amenable to existing escape arguments.
    The abstract states that direct formulation as x_{k+1}=g(x_k) is impossible, so a new space is constructed; no independent evidence is given that the mapping preserves escape probability.

pith-pipeline@v0.9.0 · 5706 in / 1444 out tokens · 20245 ms · 2026-05-24T17:57:13.261770+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1]

    An inertial proximal method for maximal mono- tone operators via discretization of a nonlinear oscillato r with damping

    [Alvarez and Attouch, 2001 ] Felipe Alvarez and Hedy At- touch. An inertial proximal method for maximal mono- tone operators via discretization of a nonlinear oscillato r with damping. Set-V alued Analysis, 9(1-2):3–11,

  2. [2]

    On the minimizing prop- erty of a second order dissipative system in hilbert spaces

    [Alvarez, 2000] Felipe Alvarez. On the minimizing prop- erty of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization , 38(4):1102– 1119,

  3. [3]

    Simple, efficient, and neural algorithms for sparse coding

    [Arora et al., 2015] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding

  4. [4]

    Proximal alternating linearized minimization for nonconvex and nonsmooth problems

    [Bolte et al., 2014] J´ erˆ ome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Pro- gramming, 146(1-2):459–494,

  5. [5]

    Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow

    [Cai et al., 2016] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow. The Annals of Statistics, 44(5):2221–2251,

  6. [6]

    Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

    [Chaudhari et al., 2016] Pratik Chaudhari, Anna Choroman- ska, Stefano Soatto, Y ann LeCun, Carlo Baldassi, Chris- tian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838 ,

  7. [7]

    Combettes and Lilian E

    [Combettes and Glaudin, 2017 ] Patrick L. Combettes and Lilian E. Glaudin. Quasinonexpansive iterations on the affine hull of orbits: From mann’s mean value algorithm to inertial methods. Siam Journal on Optimization , 27(4),

  8. [8]

    Escaping from saddle points-online stochastic gra- dient for tensor decomposition

    [Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Y ang Y uan. Escaping from saddle points-online stochastic gra- dient for tensor decomposition. In Conference on Learning Theory, pages 797–842,

  9. [9]

    Matrix completion has no spurious local minimum

    [Ge et al., 2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Ad- vances in Neural Information Processing Systems , pages 2973–2981,

  10. [10]

    Global conver- gence of the heavy-ball method for convex optimization

    [Ghadimi et al., 2015] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global conver- gence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European , pages 310–315. IEEE,

  11. [11]

    Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

    [Jin et al., 2017b] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456,

  12. [12]

    Matrix completion from a few entries

    [Keshavan et al., 2010] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory , 56(6):2980–2998,

  13. [13]

    On gradients of func- tions definable in o-minimal structures

    [Kurdyka, 1998] Krzysztof Kurdyka. On gradients of func- tions definable in o-minimal structures. In Annales de l’institut F ourier, volume 48, pages 769–784. Chartres: L’Institut, 1950-,

  14. [14]

    Elementary optimization

    [Lange, 2013] Kenneth Lange. Elementary optimization. In Optimization, pages 1–21. Springer,

  15. [15]

    Gradient descent only con- verges to minimizers

    [Lee et al., 2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only con- verges to minimizers. In Conference on Learning Theory , pages 1246–1257,

  16. [16]

    First-order methods almost always avoid sad- dle points

    [Lee et al., 2017] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Ben- jamin Recht. First-order methods almost always avoid sad- dle points. T o be appeared in Mathematical Progamming,

  17. [17]

    Linearly convergent stochastic heavy ball method for minimizing generalization error

    [Loizou and Richt´ arik, 2017a] Nicolas Loizou and Peter Richt´ arik. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737,

  18. [18]

    Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods

    [Loizou and Richt´ arik, 2017b] Nicolas Loizou and Peter Richt´ arik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677 ,

  19. [19]

    Sur la g´ eom´ etrie semi-et sous-analytique

    [Łojasiewicz, 1993] Stanislas Łojasiewicz. Sur la g´ eom´ etrie semi-et sous-analytique. Ann. Inst. F ourier, 43(5):1575– 1595,

  20. [20]

    Approximate inertial proximal methods using the enlargement of maximal monotone operators

    [Moudafi and Elizabeth, 2003 ] Abdellatif Moudafi and E Elizabeth. Approximate inertial proximal methods using the enlargement of maximal monotone operators. International Journal of Pure and Applied Mathematics , 5(3):283–299,

  21. [21]

    ipiano: Inertial proximal algorithm for nonconvex optimization

    [Ochs et al., 2014] Peter Ochs, Y unjin Chen, Thomas Brox, and Thomas Pock. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sci- ences, 7(2):1388–1419,

  22. [22]

    Nonconvergence to un- stable points in urn models and stochastic approximations

    [Pemantle, 1990] Robin Pemantle. Nonconvergence to un- stable points in urn models and stochastic approximations. The Annals of Probability , 18(2):698–712,

  23. [23]

    V ariational analysis, volume

    [Rockafellar and Wets, 2009] R Tyrrell Rockafellar and Roger J-B Wets. V ariational analysis, volume

  24. [24]

    Global stability of dynamical systems

    [Shub, 2013] Michael Shub. Global stability of dynamical systems. Springer Science & Business Media,

  25. [25]

    Com- plete dictionary recovery over the sphere i: Overview and the geometric picture

    [Sun et al., 2017] Ju Sun, Qing Qu, and John Wright. Com- plete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884,

  26. [26]

    Non-ergodic con- vergence analysis of heavy-ball algorithms

    [Sun et al., 2019] Tao Sun, Penghang Yin, Dongsheng Li, Chun Huang, Lei Guan, and Hao Jiang. Non-ergodic con- vergence analysis of heavy-ball algorithms. AAAI,

  27. [27]

    A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor fac- torization and completion

    [Xu and Yin, 2013 ] Y angyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor fac- torization and completion. SIAM Journal on imaging sci- ences, 6(3):1758–1789,

  28. [28]

    Heavy-ball method in nonconvex optimization problems

    [Zavriev and Kostyuk, 1993] SK Zavriev and FV Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993