pith. sign in

arxiv: 2410.15368 · v2 · submitted 2024-10-20 · 🧮 math.OC · cs.LG· stat.ML

Tighter Performance Theory of FedExProx

Pith reviewed 2026-05-23 18:55 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords FedExProxdistributed optimizationextrapolationquadratic problemsPolyak-Lojasiewicz conditionfederated learningconvergence analysiscommunication complexity
0
0 comments X

The pith

FedExProx achieves provably better total complexity than gradient descent on quadratic problems through a tighter analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper finds that earlier theory for FedExProx on quadratics offered no improvement over plain gradient descent. A new analysis framework delivers tighter linear convergence rates for non-strongly convex quadratics by incorporating computation and communication costs. This shows FedExProx can outperform gradient descent overall. The work also handles partial participation with adaptive extrapolation choices based on gradient diversity or Polyak stepsizes. It extends the improved rates to functions satisfying the Polyak-Lojasiewicz condition under weaker assumptions than strong convexity.

Core claim

FedExProx, a distributed optimization method using extrapolation, has a tighter linear convergence rate than previously established, allowing it to provably outperform gradient descent in total computation-plus-communication complexity on quadratic problems and to achieve faster rates than prior strongly convex analyses when applied to Polyak-Lojasiewicz functions.

What carries the argument

the novel analysis framework that establishes tighter linear convergence rates for non-strongly convex quadratic problems while accounting for both computation and communication costs

If this is right

  • FedExProx has lower total complexity than gradient descent on quadratics.
  • Adaptive extrapolation strategies based on gradient diversity or Polyak stepsizes significantly outperform previous results under partial participation.
  • The method achieves better convergence rates on Polyak-Lojasiewicz functions than under previous strongly convex analyses.
  • Empirical results support the theoretical improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This opens the door to applying similar tighter analyses to other extrapolation-based distributed methods.
  • The benefits may extend to non-quadratic problems beyond the Polyak-Lojasiewicz class if similar frameworks can be developed.
  • Practical implementations would need to verify that the adaptive parameter choices do not introduce hidden costs.

Load-bearing premise

The extrapolation parameters can be selected via gradient diversity or Polyak stepsizes in a way that meets the derived bounds without extra communication or prior knowledge of problem constants.

What would settle it

A direct comparison experiment on quadratic problems where the measured total computation and communication cost of FedExProx equals or exceeds that of gradient descent, or where the observed convergence rate matches the old bound rather than the tighter one.

Figures

Figures reproduced from arXiv: 2410.15368 by Alexander Tyurin, Kaja Gruntkowska, Peter Richt\'arik, Wojciech Anyszka.

Figure 1
Figure 1. Figure 1: Empirical time complexities of FedExProx on a quadratic optimization task. 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 3 10 4 10 5 Time 100.0 Total empirical time complexity with S=1 Total empirical time complexity with S=7 Total empirical time complexity with S=14 1 max max(Ai) ( / ) 1 max max(Ai) 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 4 10 5 Time 1000.0 Total empirical time complexity with S=1 To… view at source ↗
Figure 2
Figure 2. Figure 2: Empirical time complexities of FedExProx with partial client participation on a quadratic optimization task for S ∈ {1, 7, 14} clients participating in each round. 8 EXPERIMENTS To verify our theoretical results, we assess the empirical time complexity of FedExProx as a function of γ. We consider two types of optimization problems: synthetic quadratic optimization tasks and classification problems with a s… view at source ↗
Figure 3
Figure 3. Figure 3: Empirical time complexities of FedExProx on a classification task with smooth hinge loss. 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 3 10 4 10 5 Time 100.0 Total theoretical time complexity Total empirical time complexity 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 4 10 5 Time 1000.0 Total theoretical time complexity Total empirical time complexity 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 5… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of theoretical time complexity ( [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
read the original abstract

We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies - based on gradient diversity and Polyak stepsizes - again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper revisits FedExProx and shows that its prior quadratic guarantees are no better than GD. It develops a new analysis framework that yields tighter linear convergence rates for non-strongly convex quadratics. Incorporating both computation and communication costs, it claims FedExProx provably outperforms GD. For partial participation it analyzes two adaptive extrapolation rules (gradient diversity and Polyak stepsizes) that again improve on prior results. The analysis is extended to PL functions under weaker assumptions than strong convexity, with empirical support.

Significance. If the tighter rates and complexity ordering hold under implementable parameter choices, the work supplies a stronger theoretical case for extrapolation-based methods in federated settings and relaxes the strong-convexity assumption for PL problems.

major comments (1)
  1. The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.

    Authors: We agree that an explicit verification is necessary to support the complexity claims. In our partial-participation analysis, both the gradient-diversity measure and the Polyak stepsize are computed locally at each client using only the client's own gradient vector and the current model iterate; no additional server-client communication rounds or extra gradient evaluations are required beyond the standard local oracle calls already budgeted in the complexity model. We will add a dedicated paragraph (likely in Section 4.2) that explicitly states this locality property and confirms that the total communication-plus-computation count remains unchanged relative to the GD baseline, thereby preserving the reported ordering. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper presents a novel analysis framework establishing tighter linear convergence for FedExProx on non-strongly convex quadratics and PL functions, then incorporates computation-plus-communication costs to claim outperformance over GD. This rests on standard quadratic/PL assumptions and explicit bounds on extrapolation parameters rather than any self-definitional mapping, fitted-input prediction, or load-bearing self-citation chain. The adaptive strategies (gradient diversity, Polyak) are analyzed under the derived conditions without reducing the claimed rates to tautological inputs. The framework is independent of the quantities it bounds and does not rename known results or smuggle ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The analysis presumably relies on standard quadratic and PL function classes plus the ability to realize the stated adaptive extrapolation rules.

pith-pipeline@v0.9.0 · 5734 in / 1092 out tokens · 26018 ms · 2026-05-23T18:55:50.521946+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stabilized Proximal Point Method via Trust Region Control

    math.OC 2026-04 unverdicted novelty 6.0

    A trust-region stabilized proximal point method enforces a displacement condition to achieve linear descent for general nonsmooth convex problems.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    First-Order Methods in Optimization

    Amir Beck. First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2017. ISBN 1611974984

  2. [2]

    Incremental proximal methods for large scale convex optimization

    Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical programming, 129 0 (2): 0 163--195, 2011

  3. [3]

    On biased compression for distributed learning

    Aleksandr Beznosikov, Samuel Horv \'a th, Peter Richt \'a rik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020

  4. [4]

    P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26 0 (4): 0 2235--2260, 2016

  5. [5]

    P. L. Combettes and J. I. Madariaga. Randomly activated proximal methods for nonsmooth convex minimization. preprint arXiv:2403.10673, 2024

  6. [6]

    Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections

    Patrick L Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6 0 (4): 0 493--506, 1997

  7. [7]

    Condat and P

    L. Condat and P. Richt \'a rik. RandProx : P rimal-dual optimization algorithms with randomized proximal updates. In Proc.\ of Int. Conf. Learning Representations (ICLR) , 2023

  8. [8]

    Local SGD : Unified theory and new efficient methods

    Eduard Gorbunov, Filip Hanzely, and Peter Richt \'a rik. Local SGD : Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pp.\ 3556--3564. PMLR, 2021

  9. [9]

    Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity

    Kaja Gruntkowska, Alexander Tyurin, and Peter Richt \'a rik. Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity. arXiv preprint arXiv:2402.06412, 2024

  10. [10]

    Federated Learning for Mobile Keyboard Prediction

    Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran c oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo \'e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018

  11. [11]

    Adaptive learning rates for faster stochastic gradient methods

    Samuel Horv \'a th, Konstantin Mishchenko, and Peter Richt \'a rik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022

  12. [12]

    Fed E x P : Speeding up federated averaging via extrapolation

    Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fed E x P : Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604, 2023

  13. [13]

    Advances and open problems in federated learning

    Peter Kairouz, H Brendan McMahan, Brendan Avent, Aur \'e lien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning , 14 0 (1--2): 0 1--210, 2021

  14. [14]

    Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition

    Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp.\ 795--811. Springer, 2016

  15. [15]

    Scaffold: Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp.\ 5132--5143. PMLR, 2020

  16. [16]

    Faster federated optimization under second-order similarity

    Ahmed Khaled and Chi Jin. Faster federated optimization under second-order similarity. arXiv preprint arXiv:2209.02257, 2022

  17. [17]

    First analysis of local GD on heterogeneous data

    Ahmed Khaled, Konstantin Mishchenko, and Peter Richt \'a rik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019

  18. [18]

    A unified theory of decentralized SGD with changing topology and local updates

    Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp.\ 5381--5393. PMLR, 2020

  19. [19]

    Federated Learning: Strategies for Improving Communication Efficiency

    Jakub Kone c n \'y , H Brendan McMahan, Felix X Yu, Peter Richt \'a rik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016

  20. [20]

    The power of extrapolation in federated learning

    Hanmin Li, Kirill Acharya, and Peter Richtarik. The power of extrapolation in federated learning. arXiv preprint arXiv:2405.13766, 2024

  21. [21]

    Federated optimization in heterogeneous networks

    Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 0 429--450, 2020

  22. [22]

    On the limited memory BFGS method for large scale optimization

    Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45 0 (1): 0 503--528, 1989

  23. [23]

    The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp.\ 3325--3334. PMLR, 2018

  24. [24]

    Communication-efficient learning of deep networks from decentralized data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.\ 1273--1282. PMLR, 2017

  25. [25]

    Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769

    Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richt \'a rik. Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769. PMLR, 2022

  26. [26]

    J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93: 0 273--299, 1965

  27. [27]

    Randomized projection methods for convex feasibility: Conditioning and convergence rates

    Ion Necoara, Peter Richt \'a rik, and Andrei Patrascu. Randomized projection methods for convex feasibility: Conditioning and convergence rates. SIAM Journal on Optimization, 29 0 (4): 0 2814--2852, 2019

  28. [28]

    Lectures on convex optimization, volume 137

    Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018

  29. [29]

    Patrascu and I

    A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. Journal of Machine Learning Research, 18: 0 1--42, 2018

  30. [30]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.\ 8821--8831. PMLR, 2021

  31. [31]

    Parallel coordinate descent methods for big data optimization

    Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156: 0 433--484, 2016

  32. [32]

    Stochastic reformulations of linear systems: algorithms and convergence theory

    Peter Richt \'a rik and Martin Tak \'a c . Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41 0 (2): 0 487--524, 2020

  33. [33]

    A unified theory of stochastic proximal point methods without smoothness

    Peter Richt \'a rik, Abdurakhmon Sadiev, and Yury Demidovich. A unified theory of stochastic proximal point methods without smoothness. arXiv preprint arXiv:2405.15941, 2024

  34. [34]

    The future of digital health with federated learning

    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3 0 (1): 0 1--7, 2020

  35. [35]

    Tyrrell Rockafellar

    R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14 0 (5): 0 877--898, 1976

  36. [36]

    Tyrrell Rockafellar and Roger J-B Wets

    R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 1998

  37. [37]

    Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent

    Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014

  38. [38]

    Stochastic proximal point methods for monotone inclusions under expected similarity

    Abdurakhmon Sadiev, Laurent Condat, and Peter Richt \'a rik. Stochastic proximal point methods for monotone inclusions under expected similarity. arXiv preprint arXiv:2405.14255, 2024

  39. [39]

    Variance reduction techniques for stochastic proximal point algorithms

    Cheik Traor \'e , Vassilis Apidopoulos, Saverio Salzo, and Silvia Villa. Variance reduction techniques for stochastic proximal point algorithms. Journal of Optimization Theory and Applications, pp.\ 1--30, 2024

  40. [40]

    A computation and communication efficient method for distributed nonconvex problems in the partial participation setting

    Alexander Tyurin and Peter Richt \'a rik. A computation and communication efficient method for distributed nonconvex problems in the partial participation setting. Advances in Neural Information Processing Systems, 2023

  41. [41]

    Sharper rates and flexible framework for nonconvex SGD with client and data sampling

    Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, and Peter Richt \'a rik. Sharper rates and flexible framework for nonconvex SGD with client and data sampling. arXiv preprint arXiv:2206.02275, 2022

  42. [42]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  43. [43]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  44. [44]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  45. [45]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...