pith. machine review for the scientific record.

arxiv: 2604.05842 · v1 · submitted 2026-04-07 · 💻 cs.LG · cs.IT · math.IT · stat.ML

Recognition: 2 theorem links


Expectation Maximization (EM) Converges for General Agnostic Mixtures

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT · stat.ML
keywords expectation maximization · agnostic mixtures · parametric function fitting · mixed linear regression · gradient EM · convergence analysis · strongly convex loss · mixture models

The pith

Gradient EM converges exponentially to the population loss minimizers when fitting arbitrary parametric functions to data under a strongly convex and smooth loss, with no generative model assumed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a gradient form of Expectation Maximization can fit k arbitrary parametric functions to observed data points when no probabilistic generative model is assumed. It extends earlier guarantees for mixed linear regression to any strongly convex and smooth loss, a framework that covers mixed logistic regression, mixed support vector machines, and generalized linear models. A sympathetic reader would care because the result supplies convergence theory for EM-style methods in a wide range of practical fitting tasks that lack clean generative assumptions. Under suitable initialization and a separation condition on the target functions, the iterates reach the appropriately defined population loss minimizers exponentially fast and with high probability.

Core claim

In the agnostic mixture fitting problem, where k parametric functions are to be fit to data points by minimizing a strongly convex and smooth loss, the gradient EM algorithm converges exponentially fast to the population loss minimizers with high probability, provided the initialization is suitable and the parametric functions satisfy a separation condition.
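Schematically, and with notation assumed here rather than taken from the paper, a guarantee of this type has the familiar linear-convergence shape:

```latex
\|\theta_j^{(t)} - \theta_j^{*}\| \;\le\; \rho^{\,t}\,\|\theta_j^{(0)} - \theta_j^{*}\| \;+\; \varepsilon_n,
\qquad j = 1,\dots,k, \quad \rho \in (0,1),
```

where the $\theta_j^{*}$ are the population loss minimizers, the contraction factor $\rho$ depends on the strong-convexity and smoothness constants and on the separation, and $\varepsilon_n$ is a statistical error term shrinking with the sample size. The exact constants are the paper's; this display only fixes the shape of the claim.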

What carries the argument

Gradient EM procedure that alternates soft data-point assignments to components with gradient steps on the weighted component losses.
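One minimal reading of that alternation, sketched in Python with a quadratic loss on linear predictors for concreteness (the function name, the softmax-style weighting with inverse temperature `beta`, and all hyperparameters are our illustration, not the paper's specification; the paper allows any strongly convex, smooth loss):

```python
import numpy as np

def gradient_em(X, y, theta, step=0.5, beta=5.0, iters=300):
    """Sketch of gradient EM for agnostic mixture fitting.

    theta: (k, d) array of component parameters, updated in place.
    Quadratic loss on linear predictors is used purely as an example.
    """
    for _ in range(iters):
        preds = X @ theta.T                       # (n, k) per-component predictions
        losses = 0.5 * (preds - y[:, None]) ** 2  # per-point, per-component loss
        # E-step analogue: soft responsibilities, higher weight for lower loss.
        shifted = losses - losses.min(axis=1, keepdims=True)
        w = np.exp(-beta * shifted)
        w /= w.sum(axis=1, keepdims=True)
        # M-step analogue: one gradient step per component on its weighted loss.
        for j in range(theta.shape[0]):
            grad = ((w[:, j] * (preds[:, j] - y)) @ X) / len(y)
            theta[j] -= step * grad
    return theta
```

On a synthetic two-slope mixed linear regression, this loop drives a moderately close initialization toward the two underlying slopes, which is the regime the paper's theory addresses.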

If this is right

  • The same convergence result applies to regularized mixed linear regression and to mixed logistic regression.
  • Exponential convergence holds for any strongly convex smooth loss, not only quadratic loss on linear models.
  • The algorithm reaches the optimal population solution in non-generative agnostic data settings beyond linear regression.
  • High-probability guarantees are available for the sequence of iterates once the separation and initialization conditions hold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The analysis framework could be reused to study EM variants on mixtures of neural networks whenever the loss satisfies strong convexity and smoothness.
  • Similar separation-based arguments may yield convergence proofs for other iterative assignment methods in unsupervised parametric fitting.
  • Practitioners could test separation empirically before running gradient EM on unlabeled data to justify expecting fast convergence to the global minimizer.
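For the last point, one crude way to probe separation empirically is a pairwise distance between fitted parameter vectors (our proxy, not the paper's formal separation condition, which is stated on the parametric functions themselves):

```python
import numpy as np

def min_separation(theta):
    """Smallest pairwise Euclidean distance among the k fitted parameter
    vectors. A rough empirical proxy for a separation condition; the
    paper's exact condition is stated in its main theorem."""
    theta = np.asarray(theta, dtype=float)
    k = theta.shape[0]
    return min(np.linalg.norm(theta[i] - theta[j])
               for i in range(k) for j in range(i + 1, k))
```

A practitioner could compare this quantity across restarts: if it collapses toward zero, the fitted components are not well separated and fast convergence to the global minimizer should not be expected.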

Load-bearing premise

The loss must be strongly convex and smooth, the parametric functions must satisfy a separation condition, and the initialization must be sufficiently close to the target.
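The first two conditions are the standard ones from convex optimization; with $m$ and $L$ the usual constants, they read:

```latex
\text{$m$-strong convexity:}\quad
F(\theta') \;\ge\; F(\theta) + \langle \nabla F(\theta),\, \theta'-\theta\rangle + \tfrac{m}{2}\,\|\theta'-\theta\|^2,
\qquad
\text{$L$-smoothness:}\quad
\|\nabla F(\theta') - \nabla F(\theta)\| \;\le\; L\,\|\theta'-\theta\|.
```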

What would settle it

A dataset and initialization satisfying the separation and initialization conditions on which the gradient EM iterates nevertheless fail to converge exponentially to the population minimizers would disprove the stated guarantee. (A dataset violating separation on which the iterates still converge would not, since the conditions are only claimed to be sufficient.)

read the original abstract

Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using $k$ linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, in \cite{pal2022learning,ghosh_agnostic} the mixed linear regression problem is studied in the agnostic setting, where no generative model on data is assumed. Rather, given a set of data points, the objective is \emph{fit} $k$ lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM converges exponentially to appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of \emph{fitting} $k$ parametric functions to given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with quadratic loss, we consider any arbitrary parametric function fitting equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines) and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that with proper initialization and separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM type algorithm which converges to \emph{optimal} solution in the non-generative setup beyond mixture of linear regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript extends prior work on gradient EM for agnostic mixed linear regression to the general setting of fitting k arbitrary parametric functions to data under a strongly convex and smooth loss. It claims that, given suitable initialization and a separation condition on the parametric functions, the gradient EM iterates converge exponentially fast to appropriately defined population loss minimizers with high probability. The framework is shown to encompass regularized mixed linear regression, mixed logistic regression, mixed SVMs, and mixed GLMs by replacing the quadratic loss with a general surrogate while preserving the algorithmic structure.

Significance. If the convergence result holds, the work supplies a unified theoretical justification for EM-style algorithms in agnostic (non-generative) data-fitting problems across a broad family of parametric models. This is a meaningful generalization beyond linear regression, as the same analysis skeleton applies once strong convexity and smoothness are assumed, and it supplies explicit high-probability exponential rates that could inform practical use in classification and regression tasks.

minor comments (3)
  1. The separation condition (stated in the main theorem) is only described at a high level in the abstract and introduction; an explicit statement of the minimal distance required between the k parametric functions would improve readability.
  2. Notation for the population loss minimizer and the empirical surrogate is introduced without a dedicated preliminary section; a short table or paragraph collecting all symbols would reduce cross-referencing.
  3. The high-probability statement in the main theorem depends on sample size n and failure probability δ, but the dependence is not summarized in a single corollary or remark; adding such a summary would clarify the sample complexity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and for recommending minor revision. The assessment accurately captures the paper's contribution in extending gradient EM convergence guarantees to general agnostic parametric mixtures under strong convexity and smoothness assumptions.

Circularity Check

0 steps flagged

Minor self-citation to prior agnostic linear case; convergence follows from standard strongly convex gradient analysis

full rationale

The derivation applies the standard exponential convergence of gradient methods on strongly convex smooth objectives, augmented by a separation condition and suitable initialization, to the agnostic mixture setting for general parametric families. The extension from the cited mixed linear regression results (ghosh_agnostic, pal2022learning) preserves the same algorithmic skeleton and loss properties without redefining any minimizer or prediction in terms of fitted parameters. No step reduces by construction to its own inputs, and the assumptions are stated explicitly as external conditions rather than derived internally.
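Concretely, the standard strongly convex one-step argument the rationale invokes runs roughly as follows, with notation adapted from the paper's proof ($\theta_1$ a component iterate, $\theta_1^*$ its target, $\hat\gamma$ the step size, $p(\theta_1)$ a responsibility weight, $S_1^*$ the associated index set):

```latex
T_1^2 \;=\; \Big\|\theta_1 - \theta_1^* - \frac{\hat\gamma}{|S_1^*|}\sum_{i\in S_1^*} p(\theta_1)\,\nabla F_i(\theta_1)\Big\|^2
\;=\; \|\theta_1-\theta_1^*\|^2
- \frac{2\hat\gamma}{|S_1^*|}\sum_{i\in S_1^*} p(\theta_1)\,\big\langle \nabla F_i(\theta_1),\, \theta_1-\theta_1^*\big\rangle
+ \hat\gamma^2\,\Big\|\frac{1}{|S_1^*|}\sum_{i\in S_1^*} p(\theta_1)\,\nabla F_i(\theta_1)\Big\|^2,
```

and $m$-strong convexity of each $F_i$ bounds the cross term from below:

```latex
\big\langle \nabla F_i(\theta_1),\, \theta_1-\theta_1^*\big\rangle
\;\ge\; \big\langle \nabla F_i(\theta_1^*),\, \theta_1-\theta_1^*\big\rangle + \frac{m}{2}\,\|\theta_1-\theta_1^*\|^2
\;\ge\; \frac{m}{2}\,\|\theta_1-\theta_1^*\|^2 - \|\nabla F_i(\theta_1^*)\|\,\|\theta_1-\theta_1^*\|,
```

so the cross term forces contraction once the step size is small enough and the gradient at the target is controlled.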

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard optimization assumptions rather than new free parameters or invented entities.

axioms (2)
  • domain assumption The loss function is strongly convex and smooth
    Invoked to obtain exponential convergence rates for the gradient EM iterates.
  • domain assumption The parametric functions satisfy a separation condition
    Required so that the algorithm converges to the global population minimizers rather than local ones.

pith-pipeline@v0.9.0 · 5580 in / 1191 out tokens · 55363 ms · 2026-05-10T19:01:02.262064+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 8 canonical work pages

  1. [1]

    On learning mixture of linear regressions in the non-realizable setting

    Soumyabrata Pal, Arya Mazumdar, Rajat Sen, and Avishek Ghosh. On learning mixture of linear regressions in the non-realizable setting. In International Conference on Machine Learning, pages 17202–17220. PMLR, 2022

  2. [2]

    Agnostic learning of mixed linear regressions with em and am algorithms

    Avishek Ghosh and Arya Mazumdar. Agnostic learning of mixed linear regressions with em and am algorithms. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

  3. [3]

    Alternating minimization for mixed linear regression

    Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621. PMLR, 2014

  4. [4]

    Estimating the coefficients of a mixture of two linear regressions by expectation maximization

    Jason M Klusowski, Dana Yang, and WD Brinda. Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 65(6):3515–3524, 2019

  5. [5]

    Spectral experts for estimating mixtures of linear regressions

    Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048. PMLR, 2013

  6. [6]

    Solving a mixture of many random linear equations by tensor decomposition and alternating minimization

    Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016

  7. [7]

    Statistical guarantees for the em algorithm: From population to sample-based analysis

    Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017

  8. [8]

    Global convergence of em algorithm for mixtures of two component linear regression

    Jeongyeol Kwon and Constantine Caramanis. Global convergence of em algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018

  9. [9]

    Ten steps of em suffice for mixtures of two gaussians

    Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 704–710. PMLR, 07–10 Jul 2017

  10. [10]

    Faster and sample near-optimal algorithms for proper learning mixtures of gaussians

    Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, 2014

  11. [11]

    Optimization for Data Analysis

    Stephen J. Wright and Benjamin Recht. Optimization for Data Analysis. Cambridge University Press, 2022

  12. [12]

    SCAFFOLD: Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 51...

  13. [13]

    Byzantine-robust distributed learning: Towards optimal statistical rates

    Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 10–15 Jul 2018

  14. [14]

    Communication-efficient and byzantine-robust distributed learning with error feedback

    Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramchandran. Communication-efficient and byzantine-robust distributed learning with error feedback. IEEE Journal on Selected Areas in Information Theory, 2(3):942–953, 2021

  15. [15]

    Personalized federated learning: A meta-learning approach

    Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020

  16. [16]

    The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325–3334. PMLR, 2018

  17. [17]

    High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm

    Rongda Zhu, Lingxiao Wang, Chengxiang Zhai, and Quanquan Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In International Conference on Machine Learning, pages 4180–4188. PMLR, 2017

  18. [18]

    Differentially private (gradient) expectation maximization algorithm with statistical guarantees

    Di Wang, Jiahao Ding, Lijie Hu, Zejun Xie, Miao Pan, and Jinhui Xu. Differentially private (gradient) expectation maximization algorithm with statistical guarantees. arXiv preprint arXiv:2010.13520, 2020

  19. [19]

    Phase retrieval using alternating minimization

    Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. IEEE Transactions on Signal Processing, 63(18):4814–4826, 2015

  20. [20]

    Learning mixtures of linear classifiers

    Yuekai Sun, Stratis Ioannidis, and Andrea Montanari. Learning mixtures of linear classifiers. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 721–729, Beijing, China, 22–24 Jun 2014. PMLR

  21. [21]

    Alternating minimization converges super-linearly for mixed linear regression

    Avishek Ghosh and Ramchandran Kannan. Alternating minimization converges super-linearly for mixed linear regression. In International Conference on Artificial Intelligence and Statistics, pages 1093–1103. PMLR, 2020

  22. [22]

    Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval

    Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176:5–37, 2019

  23. [23]

    Low-rank matrix completion using alternating minimization

    Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674, 2013

  24. [24]

    Max-affine regression: Provable, tractable, and near-optimal statistical estimation

    Avishek Ghosh, Ashwin Pananjady, Adityanand Guntuboyina, and Kannan Ramchandran. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019

  25. [25]

    An efficient framework for clustered federated learning

    Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. arXiv preprint arXiv:2006.04088, 2020

  26. [26]

    Max-affine regression with universal parameter estimation for small-ball designs

    Avishek Ghosh, Ashwin Pananjady, Aditya Guntuboyina, and Kannan Ramchandran. Max-affine regression with universal parameter estimation for small-ball designs. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2706–2710. IEEE, 2020

  27. [27]

    Iterative least trimmed squares for mixed linear regression

    Yanyao Shen and Sujay Sanghavi. Iterative least trimmed squares for mixed linear regression. arXiv preprint arXiv:1902.03653, 2019

  28. [28]

    Sharp global convergence guarantees for iterative nonconvex optimization: A gaussian process perspective

    Kabir Aladin Chandrasekher, Ashwin Pananjady, and Christos Thrampoulidis. Sharp global convergence guarantees for iterative nonconvex optimization: A gaussian process perspective. arXiv preprint arXiv:2109.09859, 2021
