Tighter Performance Theory of FedExProx
Pith reviewed 2026-05-23 18:55 UTC · model grok-4.3
The pith
FedExProx achieves provably better total complexity than gradient descent on quadratic problems through a tighter analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedExProx, a distributed optimization method using extrapolation, has a tighter linear convergence rate than previously established, allowing it to provably outperform gradient descent in total computation-plus-communication complexity on quadratic problems and to achieve faster rates than prior strongly convex analyses when applied to Polyak-Lojasiewicz functions.
What carries the argument
the novel analysis framework that establishes tighter linear convergence rates for non-strongly convex quadratic problems while accounting for both computation and communication costs
If this is right
- FedExProx has lower total complexity than gradient descent on quadratics.
- Adaptive extrapolation strategies based on gradient diversity or Polyak stepsizes significantly outperform previous results under partial participation.
- The method achieves better convergence rates on Polyak-Lojasiewicz functions than under previous strongly convex analyses.
- Empirical results support the theoretical improvements.
Where Pith is reading between the lines
- This opens the door to applying similar tighter analyses to other extrapolation-based distributed methods.
- The benefits may extend to non-quadratic problems beyond the Polyak-Lojasiewicz class if similar frameworks can be developed.
- Practical implementations would need to verify that the adaptive parameter choices do not introduce hidden costs.
Load-bearing premise
The extrapolation parameters can be selected via gradient diversity or Polyak stepsizes in a way that meets the derived bounds without extra communication or prior knowledge of problem constants.
What would settle it
A direct comparison experiment on quadratic problems where the measured total computation and communication cost of FedExProx equals or exceeds that of gradient descent, or where the observed convergence rate matches the old bound rather than the tighter one.
Figures
read the original abstract
We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies - based on gradient diversity and Polyak stepsizes - again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits FedExProx and shows that its prior quadratic guarantees are no better than GD. It develops a new analysis framework that yields tighter linear convergence rates for non-strongly convex quadratics. Incorporating both computation and communication costs, it claims FedExProx provably outperforms GD. For partial participation it analyzes two adaptive extrapolation rules (gradient diversity and Polyak stepsizes) that again improve on prior results. The analysis is extended to PL functions under weaker assumptions than strong convexity, with empirical support.
Significance. If the tighter rates and complexity ordering hold under implementable parameter choices, the work supplies a stronger theoretical case for extrapolation-based methods in federated settings and relaxes the strong-convexity assumption for PL problems.
major comments (1)
- The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.
Simulated Author's Rebuttal
We thank the referee for the thorough review and insightful comments. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation.
read point-by-point responses
-
Referee: The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.
Authors: We agree that an explicit verification is necessary to support the complexity claims. In our partial-participation analysis, both the gradient-diversity measure and the Polyak stepsize are computed locally at each client using only the client's own gradient vector and the current model iterate; no additional server-client communication rounds or extra gradient evaluations are required beyond the standard local oracle calls already budgeted in the complexity model. We will add a dedicated paragraph (likely in Section 4.2) that explicitly states this locality property and confirms that the total communication-plus-computation count remains unchanged relative to the GD baseline, thereby preserving the reported ordering. revision: yes
Circularity Check
Derivation chain is self-contained; no circular reductions identified
full rationale
The paper presents a novel analysis framework establishing tighter linear convergence for FedExProx on non-strongly convex quadratics and PL functions, then incorporates computation-plus-communication costs to claim outperformance over GD. This rests on standard quadratic/PL assumptions and explicit bounds on extrapolation parameters rather than any self-definitional mapping, fitted-input prediction, or load-bearing self-citation chain. The adaptive strategies (gradient diversity, Polyak) are analyzed under the derived conditions without reducing the claimed rates to tautological inputs. The framework is independent of the quantities it bounds and does not rename known results or smuggle ansatzes via citation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FedExProx with α = 1/γLγ finds ¯x such that E[f(¯x)] − f(x∗) ≤ ε after O(Lγ/μ+γ log 1/ε) iterations
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
time complexity π(γ) = μ + τ(γLmax + 1)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Stabilized Proximal Point Method via Trust Region Control
A trust-region stabilized proximal point method enforces a displacement condition to achieve linear descent for general nonsmooth convex problems.
Reference graph
Works this paper leans on
-
[1]
First-Order Methods in Optimization
Amir Beck. First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2017. ISBN 1611974984
work page 2017
-
[2]
Incremental proximal methods for large scale convex optimization
Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical programming, 129 0 (2): 0 163--195, 2011
work page 2011
-
[3]
On biased compression for distributed learning
Aleksandr Beznosikov, Samuel Horv \'a th, Peter Richt \'a rik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020
-
[4]
P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26 0 (4): 0 2235--2260, 2016
work page 2016
- [5]
-
[6]
Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections
Patrick L Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6 0 (4): 0 493--506, 1997
work page 1997
-
[7]
L. Condat and P. Richt \'a rik. RandProx : P rimal-dual optimization algorithms with randomized proximal updates. In Proc.\ of Int. Conf. Learning Representations (ICLR) , 2023
work page 2023
-
[8]
Local SGD : Unified theory and new efficient methods
Eduard Gorbunov, Filip Hanzely, and Peter Richt \'a rik. Local SGD : Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pp.\ 3556--3564. PMLR, 2021
work page 2021
-
[9]
Kaja Gruntkowska, Alexander Tyurin, and Peter Richt \'a rik. Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity. arXiv preprint arXiv:2402.06412, 2024
-
[10]
Federated Learning for Mobile Keyboard Prediction
Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran c oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo \'e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
Adaptive learning rates for faster stochastic gradient methods
Samuel Horv \'a th, Konstantin Mishchenko, and Peter Richt \'a rik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022
-
[12]
Fed E x P : Speeding up federated averaging via extrapolation
Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fed E x P : Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604, 2023
-
[13]
Advances and open problems in federated learning
Peter Kairouz, H Brendan McMahan, Brendan Avent, Aur \'e lien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning , 14 0 (1--2): 0 1--210, 2021
work page 2021
-
[14]
Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp.\ 795--811. Springer, 2016
work page 2016
-
[15]
Scaffold: Stochastic controlled averaging for federated learning
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp.\ 5132--5143. PMLR, 2020
work page 2020
-
[16]
Faster federated optimization under second-order similarity
Ahmed Khaled and Chi Jin. Faster federated optimization under second-order similarity. arXiv preprint arXiv:2209.02257, 2022
-
[17]
First analysis of local GD on heterogeneous data
Ahmed Khaled, Konstantin Mishchenko, and Peter Richt \'a rik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019
-
[18]
A unified theory of decentralized SGD with changing topology and local updates
Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp.\ 5381--5393. PMLR, 2020
work page 2020
-
[19]
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Kone c n \'y , H Brendan McMahan, Felix X Yu, Peter Richt \'a rik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[20]
The power of extrapolation in federated learning
Hanmin Li, Kirill Acharya, and Peter Richtarik. The power of extrapolation in federated learning. arXiv preprint arXiv:2405.13766, 2024
-
[21]
Federated optimization in heterogeneous networks
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 0 429--450, 2020
work page 2020
-
[22]
On the limited memory BFGS method for large scale optimization
Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45 0 (1): 0 503--528, 1989
work page 1989
-
[23]
Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp.\ 3325--3334. PMLR, 2018
work page 2018
-
[24]
Communication-efficient learning of deep networks from decentralized data
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.\ 1273--1282. PMLR, 2017
work page 2017
-
[25]
Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richt \'a rik. Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769. PMLR, 2022
work page 2022
-
[26]
J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93: 0 273--299, 1965
work page 1965
-
[27]
Randomized projection methods for convex feasibility: Conditioning and convergence rates
Ion Necoara, Peter Richt \'a rik, and Andrei Patrascu. Randomized projection methods for convex feasibility: Conditioning and convergence rates. SIAM Journal on Optimization, 29 0 (4): 0 2814--2852, 2019
work page 2019
-
[28]
Lectures on convex optimization, volume 137
Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018
work page 2018
-
[29]
A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. Journal of Machine Learning Research, 18: 0 1--42, 2018
work page 2018
-
[30]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.\ 8821--8831. PMLR, 2021
work page 2021
-
[31]
Parallel coordinate descent methods for big data optimization
Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156: 0 433--484, 2016
work page 2016
-
[32]
Stochastic reformulations of linear systems: algorithms and convergence theory
Peter Richt \'a rik and Martin Tak \'a c . Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41 0 (2): 0 487--524, 2020
work page 2020
-
[33]
A unified theory of stochastic proximal point methods without smoothness
Peter Richt \'a rik, Abdurakhmon Sadiev, and Yury Demidovich. A unified theory of stochastic proximal point methods without smoothness. arXiv preprint arXiv:2405.15941, 2024
-
[34]
The future of digital health with federated learning
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3 0 (1): 0 1--7, 2020
work page 2020
-
[35]
R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14 0 (5): 0 877--898, 1976
work page 1976
-
[36]
Tyrrell Rockafellar and Roger J-B Wets
R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 1998
work page 1998
-
[37]
Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent
Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014
work page 2014
-
[38]
Stochastic proximal point methods for monotone inclusions under expected similarity
Abdurakhmon Sadiev, Laurent Condat, and Peter Richt \'a rik. Stochastic proximal point methods for monotone inclusions under expected similarity. arXiv preprint arXiv:2405.14255, 2024
-
[39]
Variance reduction techniques for stochastic proximal point algorithms
Cheik Traor \'e , Vassilis Apidopoulos, Saverio Salzo, and Silvia Villa. Variance reduction techniques for stochastic proximal point algorithms. Journal of Optimization Theory and Applications, pp.\ 1--30, 2024
work page 2024
-
[40]
Alexander Tyurin and Peter Richt \'a rik. A computation and communication efficient method for distributed nonconvex problems in the partial participation setting. Advances in Neural Information Processing Systems, 2023
work page 2023
-
[41]
Sharper rates and flexible framework for nonconvex SGD with client and data sampling
Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, and Peter Richt \'a rik. Sharper rates and flexible framework for nonconvex SGD with client and data sampling. arXiv preprint arXiv:2206.02275, 2022
-
[42]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[43]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[44]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[45]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.