Local linear convergence of gradient methods for overparameterized Gaussian mixtures

Jingxing Wang; Maryam Fazel; Vasileios Charisopoulos

arxiv: 2605.30936 · v1 · pith:LC6E5D6Vnew · submitted 2026-05-29 · 💻 cs.LG · math.OC· stat.ML

Local linear convergence of gradient methods for overparameterized Gaussian mixtures

Jingxing Wang , Vasileios Charisopoulos , Maryam Fazel This is my paper

Pith reviewed 2026-06-28 23:38 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords Gaussian mixture modelsoverparameterizationgradient methodsPolyak stepsizelocal linear convergenceexpectation-maximizationloss landscape

0 comments

The pith

Gradient method with alternating short and long steps achieves locally linear convergence for overparameterized Gaussian mixtures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that overparameterization in Gaussian mixture models enables global recovery but slows local convergence of standard gradient methods. Under assumptions on the mixture weights, the divergence measure minimized by learning procedures has a manifold of slow growth where the Polyak stepsize produces geometric loss reduction. The authors design a method that alternates several short gradient descent steps to approach this manifold with long Polyak steps that contract the distance to minimizers, yielding locally linear convergence. The same method also reaches nearly optimal solutions up to a misspecification threshold when mixture weights are arbitrary. This indicates that slow local convergence is not inherent to overparameterization but can be addressed by using the loss landscape structure.

Core claim

Under certain assumptions on the mixture weights, a standard divergence measure possesses a manifold of slow growth on which the Polyak stepsize reduces the loss geometrically, and the designed gradient-based method converges to minimizers at a locally linear rate. Additionally, the method converges to nearly optimal solutions up to a natural misspecification threshold for mixtures with arbitrary weights. At a high level, the method alternates between several short gradient descent steps that approach the manifold and long Polyak steps that contract the distance to minimizers.

What carries the argument

The manifold of slow growth in the divergence measure, on which the Polyak stepsize produces geometric contraction once reached by short gradient steps.

If this is right

The designed method converges to minimizers at a locally linear rate under the stated mixture-weight assumptions.
The method reaches solutions that are nearly optimal up to a natural misspecification threshold when mixture weights are arbitrary.
Slow local convergence is not an intrinsic feature of overparameterization but follows from not exploiting the manifold structure in the loss landscape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar slow-growth manifolds may exist in the loss landscapes of other overparameterized statistical estimation problems.
The alternating short-and-long step strategy could be tested empirically on synthetic mixtures with controlled weight assumptions to measure observed convergence rates.
Relaxing the mixture-weight assumptions while preserving the manifold property would extend the linear-rate guarantee to a wider class of models.

Load-bearing premise

The existence of a manifold of slow growth in the divergence measure together with the mixture-weight assumptions that make the Polyak stepsize produce geometric contraction on that manifold.

What would settle it

An explicit construction of a mixture where the Polyak stepsize fails to produce geometric reduction in loss once on the identified manifold would falsify the local linear convergence claim.

Figures

Figures reproduced from arXiv: 2605.30936 by Jingxing Wang, Maryam Fazel, Vasileios Charisopoulos.

**Figure 1.** Figure 1: Local acceleration for loss and parameter distance. The square markers indicate iterates obtained with Polyak steps, which are interleaved with several steps of gradient descent. In this instance, we have dimension d = 5, m = 3 teacher components, and n = 20 student components. denote the so-called responsibility of the i th mixture component. We also write L(θ) for the population KL loss between p ⋆ and p… view at source ↗

**Figure 2.** Figure 2: Loss landscape for the example in (9). Figure 2a: the KL landscape is steep across the tangent space and flat along it. Figure 2b: the loss grows quadratically along the normal direction and quartically along the tangent direction. Our algorithm alternates gradient descent steps, which lead towards the ravine, with Polyak steps near the ravine to reduce the loss. Overparameterization, slowdown, and pruning… view at source ↗

**Figure 3.** Figure 3: Illustration of algorithmic behavior. The Polyak step, µ +, makes significant progress towards µ ⋆ along the ravine M but can escape the tube surrounding it. A few “short steps” of gradient descent rapidly restore proximity to M, wherein a new Polyak step can be attempted. Proof sketch. Our proof separates the neighborhood of the minimizer into three regions: (i) a tube T around the manifold M, where distM… view at source ↗

**Figure 4.** Figure 4: Experiment on an instance with ∆π = 0. Gradient EM struggles to reduce the intra-cluster dispersion, which contributes the majority of the loss near minimizers. Polyak steps (indicated by square marks) sharply reduce the dispersion, but slightly increase the cluster bias due to leaving the vicinity of the ravine; interleaved gradient descent steps restore proximity to the manifold. 10 [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 5.** Figure 5: Experiment with general mixtures. Both methods are initialized identically; as in [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Effect of weight mismatch ∆π on acceleration. (6a): small ∆π maintain the contraction rate and final loss essentially unaffected, while larger ∆π stall at higher loss floors. (6b): gradient EM rapidly reduces ∆π, which stabilizes within less than 15 iterations. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: shows the number of active components is typically larger than m = 3, especially in low dimensions. 3 5 7 9 11 13+ 0 5 10 > m d = 2 Active components Trials 3 5 7 9 11 13+ 0 5 10 d = 5 Active components 3 5 7 9 11 13+ 0 5 10 d = 10 Active components [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

We study the problem of learning Gaussian mixture models under overparameterization. Prior work has shown that while overparameterization is essential for avoiding spurious local optima and enables global recovery of the ground-truth model using the gradient-EM (expectation-maximization) algorithm, it can dramatically slow down the local rate of convergence. Under certain assumptions on the mixture weights, we show that a standard divergence measure minimized by statistical learning procedures possesses a manifold of slow growth on which the well-known Polyak stepsize reduces the loss geometrically, and design a gradient-based method that converges to minimizers at a locally linear rate. Additionally, we show that our method converges to nearly optimal solutions -- up to a natural misspecification threshold -- for mixtures with arbitrary weights. At a high level, the method alternates between several "short" gradient descent steps that approach the manifold and "long" Polyak steps that contract the distance to minimizers. Our results suggest that slow convergence is not an intrinsic challenge of overparameterization, but can be overcome by exploiting the favorable structure of the loss landscape.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main contribution is an alternating short-GD/long-Polyak procedure that targets a slow-growth manifold to recover local linear convergence in overparameterized GMMs under weight assumptions.

read the letter

The punchline is that overparameterization in Gaussian mixtures does not have to trap you in slow local convergence. The authors identify a manifold of slow growth in a standard divergence and show that Polyak steps contract geometrically on it, then build an alternating method that reaches the manifold with short gradient steps and makes progress with longer Polyak steps.

What is new is the specific alternating construction. Earlier gradient-EM results already established that overparameterization avoids bad local minima and allows global recovery, but local rates remained poor. This work turns the landscape structure into a concrete algorithmic fix rather than just another landscape analysis.

It also extends the claim to arbitrary weights, showing convergence to nearly optimal solutions up to a misspecification threshold. That broadens the result beyond the restrictive weight assumptions needed for the linear rate.

The soft spots are the weight assumptions that create the manifold in the first place. If those fail, the geometric contraction may not hold. The abstract states the claims cleanly but gives no explicit bounds or derivation steps, so the tightness of the local linear rate is hard to judge without the full proofs. No circularity is obvious from the high-level description.

This is for people working on non-convex statistical optimization and practical GMM fitting. It deserves a serious referee because it supplies a targeted algorithmic response to a documented practical slowdown, even if the assumptions limit immediate generality.

Referee Report

1 major / 0 minor

Summary. The paper studies the problem of learning Gaussian mixture models in the overparameterized regime using gradient methods. Under assumptions on the mixture weights, it identifies a manifold of slow growth in a standard divergence measure on which the Polyak stepsize yields geometric loss reduction. It designs an alternating procedure of short gradient descent steps (to approach the manifold) and long Polyak steps (to contract toward minimizers) that achieves locally linear convergence. The work also establishes convergence to nearly optimal solutions (up to a misspecification threshold) for mixtures with arbitrary weights.

Significance. If the derivations hold, the result is significant because it shows that the known slowdown in local convergence for overparameterized GMMs is not intrinsic but can be overcome by exploiting favorable loss-landscape structure. The explicit construction of a manifold admitting geometric contraction under Polyak steps, together with the alternating algorithm, supplies both a theoretical explanation and a practical gradient-based procedure with linear-rate guarantees. The extension to arbitrary weights further broadens applicability.

major comments (1)

Abstract (paragraph beginning 'Under certain assumptions on the mixture weights'): the central claim rests on the existence of a manifold of slow growth together with mixture-weight assumptions that make the Polyak stepsize produce geometric contraction; however, the abstract supplies no explicit construction, error bounds, or derivation of this manifold, so the load-bearing step cannot be verified from the provided text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful summary and for recognizing the potential significance of the results. We address the single major comment below.

read point-by-point responses

Referee: [—] Abstract (paragraph beginning 'Under certain assumptions on the mixture weights'): the central claim rests on the existence of a manifold of slow growth together with mixture-weight assumptions that make the Polyak stepsize produce geometric contraction; however, the abstract supplies no explicit construction, error bounds, or derivation of this manifold, so the load-bearing step cannot be verified from the provided text.

Authors: We agree that the abstract is deliberately high-level and does not contain the explicit construction, assumptions, or proof details; this is standard for abstracts to maintain brevity. The manifold of slow growth is explicitly constructed in Section 3 as the set of parameters where the gradient of the divergence aligns with the mixture-weight vector under Assumption 2. The geometric contraction property under the Polyak stepsize is stated and proved as Theorem 4.1, with the contraction factor and error bounds derived in the proof (see the local linear rate in Equation (15)). The alternating short-GD / long-Polyak procedure and its local linear convergence are developed in Section 5. Because the full manuscript supplies the requested construction, bounds, and derivation, we do not believe the abstract itself requires expansion, but we are happy to add a one-sentence pointer to the main theorem if the editor prefers. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation chain begins from external prior results on gradient-EM for overparameterized GMMs and proceeds by establishing (under stated mixture-weight assumptions) the existence of a slow-growth manifold on which Polyak steps contract geometrically; the alternating short-GD/long-Polyak procedure is then constructed from that landscape property. No step reduces by definition to a fitted quantity inside the paper, no self-citation is load-bearing for the central claim, and the uniqueness or ansatz elements are not imported from the authors' own prior work. The result is therefore self-contained against the cited external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on domain assumptions about mixture weights and the structural claim that a slow-growth manifold exists for the divergence; no free parameters or invented entities are visible in the abstract.

axioms (2)

domain assumption Certain assumptions on the mixture weights
Invoked to guarantee the manifold of slow growth and geometric contraction under Polyak steps.
ad hoc to paper Existence of a manifold of slow growth for the divergence measure
The paper asserts this manifold exists and that Polyak steps contract on it.

pith-pipeline@v0.9.1-grok · 5726 in / 1389 out tokens · 26366 ms · 2026-06-28T23:38:55.241701+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 10 canonical work pages

[1]

URLhttps://doi.org/10.1214/16-AOS1435

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45 0 (1), 2017. ISSN 0090-5364. doi:10.1214/16-aos1435

work page doi:10.1214/16-aos1435 2017
[2]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, G \'a bor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117 0 (48): 0 30063--30070, 2020. doi:10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020
[3]

Reconciling modern machine learning practice and the bias-variance trade-off

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences, 116 0 (32), 2019

2019
[4]

Two models of double descent for weak features

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. Proceedings of the National Academy of Sciences, 117 0 (51): 0 32153--32160, 2020. doi:10.1073/pnas.2006596117

work page doi:10.1073/pnas.2006596117 2020
[5]

Local minima structures in gaussian mixture models

Yudong Chen, Dogyoon Song, Xumei Xi, and Yuqian Zhang. Local minima structures in gaussian mixture models. IEEE Transactions on Information Theory, 70 0 (6): 0 4218--4257, 2024

2024
[6]

Ten steps of em suffice for mixtures of two gaussians

Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In Conference on Learning Theory, pages 704--710. PMLR, 2017

2017
[7]

Asymptotic normality and optimality in nonsmooth stochastic approximation

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Asymptotic normality and optimality in nonsmooth stochastic approximation. The Annals of Statistics, 52 0 (4): 0 1485--1508, 2024

2024
[8]

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization. Foundations of Computational Mathematics, 26 0 (2): 0 779--861, January 2025 a . ISSN 1615-3383. doi:10.1007/s10208-025-09691-0

work page doi:10.1007/s10208-025-09691-0 2025
[9]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth. Mathematical Programming, pages 1--66, 2025 b

2025
[10]

A. P. Dempster, N. M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39 0 (1): 0 1--38, 1977

1977
[11]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Optimality, identifiability, and sensitivity. Mathematical Programming, 147 0 (1-2): 0 467--498, November 2013. ISSN 1436-4646. doi:10.1007/s10107-013-0730-4

work page doi:10.1007/s10107-013-0730-4 2013
[12]

Sharp analysis of expectation-maximization for weakly identifiable models

Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin Wainwright, Michael Jordan, and Bin Yu. Sharp analysis of expectation-maximization for weakly identifiable models. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning R...

2020
[13]

Singularity, misspecification and the convergence rate of EM

Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J Wainwright, Michael I Jordan, and Bin Yu. Singularity, misspecification and the convergence rate of EM . The Annals of Statistics, 48 0 (6): 0 3161--3182, 2020 b

2020
[14]

Identifying active constraints via partial smoothness and prox-regularity

Warren L Hare and Adrian S Lewis. Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis, 11 0 (2): 0 251--266, 2004

2004
[15]

On strong identifiability and convergence rates of parameter estimation in finite mixtures

Nhat Ho and XuanLong Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10 0 (1), January 2016. ISSN 1935-7524. doi:10.1214/16-ejs1105

work page doi:10.1214/16-ejs1105 2016
[16]

Wainwright, Sivaraman Balakrishnan, and Michael I

Chi Jin, Yuchen Zhang, Martin J. Wainwright, Sivaraman Balakrishnan, and Michael I. Jordan. Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, 2016

2016
[17]

Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition. In Joint European conference on machine learning and knowledge discovery in databases, pages 795--811. Springer, 2016

2016
[18]

Information theory and statistics

Solomon Kullback. Information theory and statistics. Courier Corporation, 1997

1997
[19]

The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians

Jeongyeol Kwon and Constantine Caramanis. The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians . In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2425--2487. PMLR, 09--12 Jul 2020. URL https://proce...

2020
[20]

The U -Lagrangian of a Convex Function

Claude Lemar \'e chal, Fran c ois Oustry, and Claudia Sagastiz \'a bal. The U -Lagrangian of a Convex Function . Transactions of the American Mathematical Society, 352 0 (2): 0 711--729, 2000. ISSN 00029947. URL http://www.jstor.org/stable/118061

2000
[21]

Active sets, nonsmoothness, and sensitivity

Adrian S Lewis. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13 0 (3): 0 702--725, 2002

2002
[22]

A proximal method for composite minimization

Adrian S Lewis and Stephen J Wright. A proximal method for composite minimization. Mathematical Programming, 158 0 (1): 0 501--546, 2016

2016
[23]

A topological property of real analytic subsets

Stanislaw ojasiewicz. A topological property of real analytic subsets. Coll. du CNRS, Les \'e quations aux d \'e riv \'e es partielles , 117 0 (87-89): 0 2, 1963

1963
[24]

Proximal points are on the fast track

Robert Mifflin and Claudia Sagastiz \'a bal. Proximal points are on the fast track. Journal of Convex Analysis, 9 0 (2): 0 563--580, 2002

2002
[25]

Gradient methods for minimizing functionals

Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal vychislitel'noi matematiki i matematicheskoi fiziki, 3 0 (4): 0 643--653, 1963

1963
[26]

Minimization of unsmooth functionals

Boris Teodorovich Polyak. Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics, 9 0 (3): 0 14--29, 1969

1969
[27]

Mixture densities, maximum likelihood and the EM algorithm

Richard A Redner and Homer F Walker. Mixture densities, maximum likelihood and the EM algorithm . SIAM review, 26 0 (2): 0 195--239, 1984

1984
[28]

Identification in parametric models

Thomas J Rothenberg. Identification in parametric models. Econometrica: Journal of the Econometric Society, pages 577--591, 1971

1971
[29]

Asymptotic behaviour of the posterior distribution in overfitted mixture models

Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B, 73 0 (5): 0 689--710, 2011

2011
[30]

Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM

Nimrod Segol and Boaz Nadler. Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM . Electronic Journal of Statistics, 15 0 (2), January 2021. ISSN 1935-7524. doi:10.1214/21-ejs1905

work page doi:10.1214/21-ejs1905 2021
[31]

A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices

Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34 0 (4): 0 1265--1269, December 1963. ISSN 0003-4851. doi:10.1214/aoms/1177703862

work page doi:10.1214/aoms/1177703862 1963
[32]

Identifiable surfaces in constrained optimization

Stephen J Wright. Identifiable surfaces in constrained optimization. SIAM Journal on Control and Optimization, 31 0 (4): 0 1063--1079, 1993

1993
[33]

C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11 0 (1): 0 95--103, 1983

1983
[34]

Global analysis of expectation maximization for mixtures of two gaussians

Ji Xu, Daniel Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS'16, page 2684–2692, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819

2016
[35]

Weihang Xu and Simon S. Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of the Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 1155--1198. PMLR, 2023

2023
[36]

Toward global convergence of gradient em for over-parameterized gaussian mixture models

Weihang Xu, Maryam Fazel, and Simon S Du. Toward global convergence of gradient em for over-parameterized gaussian mixture models. Advances in Neural Information Processing Systems, 37: 0 10770--10800, 2024

2024
[37]

Convergence of gradient em on multi-component mixture of gaussians

Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient em on multi-component mixture of gaussians. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017
[38]

Electronic Journal of Statistics14(1), 632–660 (2020)

Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the em algorithm on gaussian mixture models. Electronic Journal of Statistics, 14 0 (1), January 2020. ISSN 1935-7524. doi:10.1214/19-ejs1660

work page doi:10.1214/19-ejs1660 2020
[39]

Mo Zhou, Weihang Xu, Maryam Fazel, and Simon S. Du. Global convergence of gradient EM for over-parameterized gaussian mixtures. arXiv preprint arXiv:2506.06584, 2025. URL https://arxiv.org/abs/2506.06584

work page arXiv 2025

[1] [1]

URLhttps://doi.org/10.1214/16-AOS1435

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45 0 (1), 2017. ISSN 0090-5364. doi:10.1214/16-aos1435

work page doi:10.1214/16-aos1435 2017

[2] [2]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, G \'a bor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117 0 (48): 0 30063--30070, 2020. doi:10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020

[3] [3]

Reconciling modern machine learning practice and the bias-variance trade-off

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences, 116 0 (32), 2019

2019

[4] [4]

Two models of double descent for weak features

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. Proceedings of the National Academy of Sciences, 117 0 (51): 0 32153--32160, 2020. doi:10.1073/pnas.2006596117

work page doi:10.1073/pnas.2006596117 2020

[5] [5]

Local minima structures in gaussian mixture models

Yudong Chen, Dogyoon Song, Xumei Xi, and Yuqian Zhang. Local minima structures in gaussian mixture models. IEEE Transactions on Information Theory, 70 0 (6): 0 4218--4257, 2024

2024

[6] [6]

Ten steps of em suffice for mixtures of two gaussians

Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In Conference on Learning Theory, pages 704--710. PMLR, 2017

2017

[7] [7]

Asymptotic normality and optimality in nonsmooth stochastic approximation

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Asymptotic normality and optimality in nonsmooth stochastic approximation. The Annals of Statistics, 52 0 (4): 0 1485--1508, 2024

2024

[8] [8]

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization. Foundations of Computational Mathematics, 26 0 (2): 0 779--861, January 2025 a . ISSN 1615-3383. doi:10.1007/s10208-025-09691-0

work page doi:10.1007/s10208-025-09691-0 2025

[9] [9]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth. Mathematical Programming, pages 1--66, 2025 b

2025

[10] [10]

A. P. Dempster, N. M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39 0 (1): 0 1--38, 1977

1977

[11] [11]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Optimality, identifiability, and sensitivity. Mathematical Programming, 147 0 (1-2): 0 467--498, November 2013. ISSN 1436-4646. doi:10.1007/s10107-013-0730-4

work page doi:10.1007/s10107-013-0730-4 2013

[12] [12]

Sharp analysis of expectation-maximization for weakly identifiable models

Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin Wainwright, Michael Jordan, and Bin Yu. Sharp analysis of expectation-maximization for weakly identifiable models. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning R...

2020

[13] [13]

Singularity, misspecification and the convergence rate of EM

Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J Wainwright, Michael I Jordan, and Bin Yu. Singularity, misspecification and the convergence rate of EM . The Annals of Statistics, 48 0 (6): 0 3161--3182, 2020 b

2020

[14] [14]

Identifying active constraints via partial smoothness and prox-regularity

Warren L Hare and Adrian S Lewis. Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis, 11 0 (2): 0 251--266, 2004

2004

[15] [15]

On strong identifiability and convergence rates of parameter estimation in finite mixtures

Nhat Ho and XuanLong Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10 0 (1), January 2016. ISSN 1935-7524. doi:10.1214/16-ejs1105

work page doi:10.1214/16-ejs1105 2016

[16] [16]

Wainwright, Sivaraman Balakrishnan, and Michael I

Chi Jin, Yuchen Zhang, Martin J. Wainwright, Sivaraman Balakrishnan, and Michael I. Jordan. Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, 2016

2016

[17] [17]

Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition. In Joint European conference on machine learning and knowledge discovery in databases, pages 795--811. Springer, 2016

2016

[18] [18]

Information theory and statistics

Solomon Kullback. Information theory and statistics. Courier Corporation, 1997

1997

[19] [19]

The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians

Jeongyeol Kwon and Constantine Caramanis. The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians . In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2425--2487. PMLR, 09--12 Jul 2020. URL https://proce...

2020

[20] [20]

The U -Lagrangian of a Convex Function

Claude Lemar \'e chal, Fran c ois Oustry, and Claudia Sagastiz \'a bal. The U -Lagrangian of a Convex Function . Transactions of the American Mathematical Society, 352 0 (2): 0 711--729, 2000. ISSN 00029947. URL http://www.jstor.org/stable/118061

2000

[21] [21]

Active sets, nonsmoothness, and sensitivity

Adrian S Lewis. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13 0 (3): 0 702--725, 2002

2002

[22] [22]

A proximal method for composite minimization

Adrian S Lewis and Stephen J Wright. A proximal method for composite minimization. Mathematical Programming, 158 0 (1): 0 501--546, 2016

2016

[23] [23]

A topological property of real analytic subsets

Stanislaw ojasiewicz. A topological property of real analytic subsets. Coll. du CNRS, Les \'e quations aux d \'e riv \'e es partielles , 117 0 (87-89): 0 2, 1963

1963

[24] [24]

Proximal points are on the fast track

Robert Mifflin and Claudia Sagastiz \'a bal. Proximal points are on the fast track. Journal of Convex Analysis, 9 0 (2): 0 563--580, 2002

2002

[25] [25]

Gradient methods for minimizing functionals

Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal vychislitel'noi matematiki i matematicheskoi fiziki, 3 0 (4): 0 643--653, 1963

1963

[26] [26]

Minimization of unsmooth functionals

Boris Teodorovich Polyak. Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics, 9 0 (3): 0 14--29, 1969

1969

[27] [27]

Mixture densities, maximum likelihood and the EM algorithm

Richard A Redner and Homer F Walker. Mixture densities, maximum likelihood and the EM algorithm . SIAM review, 26 0 (2): 0 195--239, 1984

1984

[28] [28]

Identification in parametric models

Thomas J Rothenberg. Identification in parametric models. Econometrica: Journal of the Econometric Society, pages 577--591, 1971

1971

[29] [29]

Asymptotic behaviour of the posterior distribution in overfitted mixture models

Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B, 73 0 (5): 0 689--710, 2011

2011

[30] [30]

Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM

Nimrod Segol and Boaz Nadler. Improved convergence guarantees for learning gaussian mixture models by EM and gradient EM . Electronic Journal of Statistics, 15 0 (2), January 2021. ISSN 1935-7524. doi:10.1214/21-ejs1905

work page doi:10.1214/21-ejs1905 2021

[31] [31]

A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices

Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34 0 (4): 0 1265--1269, December 1963. ISSN 0003-4851. doi:10.1214/aoms/1177703862

work page doi:10.1214/aoms/1177703862 1963

[32] [32]

Identifiable surfaces in constrained optimization

Stephen J Wright. Identifiable surfaces in constrained optimization. SIAM Journal on Control and Optimization, 31 0 (4): 0 1063--1079, 1993

1993

[33] [33]

C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11 0 (1): 0 95--103, 1983

1983

[34] [34]

Global analysis of expectation maximization for mixtures of two gaussians

Ji Xu, Daniel Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS'16, page 2684–2692, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819

2016

[35] [35]

Weihang Xu and Simon S. Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of the Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 1155--1198. PMLR, 2023

2023

[36] [36]

Toward global convergence of gradient em for over-parameterized gaussian mixture models

Weihang Xu, Maryam Fazel, and Simon S Du. Toward global convergence of gradient em for over-parameterized gaussian mixture models. Advances in Neural Information Processing Systems, 37: 0 10770--10800, 2024

2024

[37] [37]

Convergence of gradient em on multi-component mixture of gaussians

Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient em on multi-component mixture of gaussians. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017

[38] [38]

Electronic Journal of Statistics14(1), 632–660 (2020)

Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the em algorithm on gaussian mixture models. Electronic Journal of Statistics, 14 0 (1), January 2020. ISSN 1935-7524. doi:10.1214/19-ejs1660

work page doi:10.1214/19-ejs1660 2020

[39] [39]

Mo Zhou, Weihang Xu, Maryam Fazel, and Simon S. Du. Global convergence of gradient EM for over-parameterized gaussian mixtures. arXiv preprint arXiv:2506.06584, 2025. URL https://arxiv.org/abs/2506.06584

work page arXiv 2025