Tighter Performance Theory of FedExProx

Alexander Tyurin; Kaja Gruntkowska; Peter Richt\'arik; Wojciech Anyszka

arxiv: 2410.15368 · v2 · submitted 2024-10-20 · 🧮 math.OC · cs.LG· stat.ML

Tighter Performance Theory of FedExProx

Wojciech Anyszka , Kaja Gruntkowska , Alexander Tyurin , Peter Richt\'arik This is my paper

Pith reviewed 2026-05-23 18:55 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords FedExProxdistributed optimizationextrapolationquadratic problemsPolyak-Lojasiewicz conditionfederated learningconvergence analysiscommunication complexity

0 comments

The pith

FedExProx achieves provably better total complexity than gradient descent on quadratic problems through a tighter analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper finds that earlier theory for FedExProx on quadratics offered no improvement over plain gradient descent. A new analysis framework delivers tighter linear convergence rates for non-strongly convex quadratics by incorporating computation and communication costs. This shows FedExProx can outperform gradient descent overall. The work also handles partial participation with adaptive extrapolation choices based on gradient diversity or Polyak stepsizes. It extends the improved rates to functions satisfying the Polyak-Lojasiewicz condition under weaker assumptions than strong convexity.

Core claim

FedExProx, a distributed optimization method using extrapolation, has a tighter linear convergence rate than previously established, allowing it to provably outperform gradient descent in total computation-plus-communication complexity on quadratic problems and to achieve faster rates than prior strongly convex analyses when applied to Polyak-Lojasiewicz functions.

What carries the argument

the novel analysis framework that establishes tighter linear convergence rates for non-strongly convex quadratic problems while accounting for both computation and communication costs

If this is right

FedExProx has lower total complexity than gradient descent on quadratics.
Adaptive extrapolation strategies based on gradient diversity or Polyak stepsizes significantly outperform previous results under partial participation.
The method achieves better convergence rates on Polyak-Lojasiewicz functions than under previous strongly convex analyses.
Empirical results support the theoretical improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This opens the door to applying similar tighter analyses to other extrapolation-based distributed methods.
The benefits may extend to non-quadratic problems beyond the Polyak-Lojasiewicz class if similar frameworks can be developed.
Practical implementations would need to verify that the adaptive parameter choices do not introduce hidden costs.

Load-bearing premise

The extrapolation parameters can be selected via gradient diversity or Polyak stepsizes in a way that meets the derived bounds without extra communication or prior knowledge of problem constants.

What would settle it

A direct comparison experiment on quadratic problems where the measured total computation and communication cost of FedExProx equals or exceeds that of gradient descent, or where the observed convergence rate matches the old bound rather than the tighter one.

Figures

Figures reproduced from arXiv: 2410.15368 by Alexander Tyurin, Kaja Gruntkowska, Peter Richt\'arik, Wojciech Anyszka.

**Figure 1.** Figure 1: Empirical time complexities of FedExProx on a quadratic optimization task. 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 3 10 4 10 5 Time 100.0 Total empirical time complexity with S=1 Total empirical time complexity with S=7 Total empirical time complexity with S=14 1 max max(Ai) ( / ) 1 max max(Ai) 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 4 10 5 Time 1000.0 Total empirical time complexity with S=1 To… view at source ↗

**Figure 2.** Figure 2: Empirical time complexities of FedExProx with partial client participation on a quadratic optimization task for S ∈ {1, 7, 14} clients participating in each round. 8 EXPERIMENTS To verify our theoretical results, we assess the empirical time complexity of FedExProx as a function of γ. We consider two types of optimization problems: synthetic quadratic optimization tasks and classification problems with a s… view at source ↗

**Figure 3.** Figure 3: Empirical time complexities of FedExProx on a classification task with smooth hinge loss. 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 3 10 4 10 5 Time 100.0 Total theoretical time complexity Total empirical time complexity 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 4 10 5 Time 1000.0 Total theoretical time complexity Total empirical time complexity 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 10 5… view at source ↗

**Figure 4.** Figure 4: Comparison of theoretical time complexity ( [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies - based on gradient diversity and Polyak stepsizes - again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper fixes the prior FedExProx bounds to match GD on quadratics, then supplies a new analysis that counts communication and shows better total complexity plus an extension to PL functions, but the adaptive parameter choices look likely to require uncounted extra communication.

read the letter

The core contribution is spotting that earlier FedExProx guarantees on quadratics were no stronger than plain GD, then building a fresh analysis that includes communication rounds and obtains strictly better total complexity. The same framework covers partial participation via two adaptive extrapolation rules and extends the rates to PL functions under weaker assumptions than strong convexity. That is the actual new material: the equivalence observation, the comm-inclusive comparison, and the PL extension. The paper does this cleanly enough on paper to make the outperformance claim concrete where the original work did not. Empirical backing is mentioned but not central to the argument. The main soft spot sits with the adaptive strategies. Gradient-diversity and Polyak-style rules need either aggregated local gradient norms or knowledge of smoothness/PL constants; both typically demand extra rounds or oracles outside the cost model used to beat GD. If those costs are real, the complexity ordering reverses and the headline advantage does not hold. The abstract itself flags that the parameters must satisfy the derived bounds, so this is the load-bearing point. The assumptions stay standard quadratic and PL, with no obvious circularity or self-referential fitting. The work is aimed at people already working on federated proximal methods and extrapolation techniques. It is coherent on its own terms and deserves a serious referee to check the derivations and the exact accounting of the adaptive costs.

Referee Report

1 major / 0 minor

Summary. The paper revisits FedExProx and shows that its prior quadratic guarantees are no better than GD. It develops a new analysis framework that yields tighter linear convergence rates for non-strongly convex quadratics. Incorporating both computation and communication costs, it claims FedExProx provably outperforms GD. For partial participation it analyzes two adaptive extrapolation rules (gradient diversity and Polyak stepsizes) that again improve on prior results. The analysis is extended to PL functions under weaker assumptions than strong convexity, with empirical support.

Significance. If the tighter rates and complexity ordering hold under implementable parameter choices, the work supplies a stronger theoretical case for extrapolation-based methods in federated settings and relaxes the strong-convexity assumption for PL problems.

major comments (1)

The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses

Referee: The headline claim that FedExProx outperforms GD in total computation-plus-communication complexity (and the partial-participation adaptive results) rests on the extrapolation parameter satisfying the derived bounds via the two adaptive strategies. The manuscript must explicitly verify, in the partial-participation analysis, that selecting these parameters (gradient diversity or Polyak) incurs no extra communication rounds or oracle calls outside the model used to compare against GD; otherwise the complexity ordering reverses.

Authors: We agree that an explicit verification is necessary to support the complexity claims. In our partial-participation analysis, both the gradient-diversity measure and the Polyak stepsize are computed locally at each client using only the client's own gradient vector and the current model iterate; no additional server-client communication rounds or extra gradient evaluations are required beyond the standard local oracle calls already budgeted in the complexity model. We will add a dedicated paragraph (likely in Section 4.2) that explicitly states this locality property and confirms that the total communication-plus-computation count remains unchanged relative to the GD baseline, thereby preserving the reported ordering. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper presents a novel analysis framework establishing tighter linear convergence for FedExProx on non-strongly convex quadratics and PL functions, then incorporates computation-plus-communication costs to claim outperformance over GD. This rests on standard quadratic/PL assumptions and explicit bounds on extrapolation parameters rather than any self-definitional mapping, fitted-input prediction, or load-bearing self-citation chain. The adaptive strategies (gradient diversity, Polyak) are analyzed under the derived conditions without reducing the claimed rates to tautological inputs. The framework is independent of the quantities it bounds and does not rename known results or smuggle ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The analysis presumably relies on standard quadratic and PL function classes plus the ability to realize the stated adaptive extrapolation rules.

pith-pipeline@v0.9.0 · 5734 in / 1092 out tokens · 26018 ms · 2026-05-23T18:55:50.521946+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

FedExProx with α = 1/γLγ finds ¯x such that E[f(¯x)] − f(x∗) ≤ ε after O(Lγ/μ+γ log 1/ε) iterations
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

time complexity π(γ) = μ + τ(γLmax + 1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stabilized Proximal Point Method via Trust Region Control
math.OC 2026-04 unverdicted novelty 6.0

A trust-region stabilized proximal point method enforces a displacement condition to achieve linear descent for general nonsmooth convex problems.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2017. ISBN 1611974984

work page 2017
[2]

Incremental proximal methods for large scale convex optimization

Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical programming, 129 0 (2): 0 163--195, 2011

work page 2011
[3]

On biased compression for distributed learning

Aleksandr Beznosikov, Samuel Horv \'a th, Peter Richt \'a rik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020

work page arXiv 2002
[4]

P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26 0 (4): 0 2235--2260, 2016

work page 2016
[5]

P. L. Combettes and J. I. Madariaga. Randomly activated proximal methods for nonsmooth convex minimization. preprint arXiv:2403.10673, 2024

work page arXiv 2024
[6]

Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections

Patrick L Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6 0 (4): 0 493--506, 1997

work page 1997
[7]

Condat and P

L. Condat and P. Richt \'a rik. RandProx : P rimal-dual optimization algorithms with randomized proximal updates. In Proc.\ of Int. Conf. Learning Representations (ICLR) , 2023

work page 2023
[8]

Local SGD : Unified theory and new efficient methods

Eduard Gorbunov, Filip Hanzely, and Peter Richt \'a rik. Local SGD : Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pp.\ 3556--3564. PMLR, 2021

work page 2021
[9]

Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity

Kaja Gruntkowska, Alexander Tyurin, and Peter Richt \'a rik. Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity. arXiv preprint arXiv:2402.06412, 2024

work page arXiv 2024
[10]

Federated Learning for Mobile Keyboard Prediction

Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran c oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo \'e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Adaptive learning rates for faster stochastic gradient methods

Samuel Horv \'a th, Konstantin Mishchenko, and Peter Richt \'a rik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022

work page arXiv 2022
[12]

Fed E x P : Speeding up federated averaging via extrapolation

Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fed E x P : Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604, 2023

work page arXiv 2023
[13]

Advances and open problems in federated learning

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aur \'e lien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning , 14 0 (1--2): 0 1--210, 2021

work page 2021
[14]

Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp.\ 795--811. Springer, 2016

work page 2016
[15]

Scaffold: Stochastic controlled averaging for federated learning

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp.\ 5132--5143. PMLR, 2020

work page 2020
[16]

Faster federated optimization under second-order similarity

Ahmed Khaled and Chi Jin. Faster federated optimization under second-order similarity. arXiv preprint arXiv:2209.02257, 2022

work page arXiv 2022
[17]

First analysis of local GD on heterogeneous data

Ahmed Khaled, Konstantin Mishchenko, and Peter Richt \'a rik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019

work page arXiv 1909
[18]

A unified theory of decentralized SGD with changing topology and local updates

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp.\ 5381--5393. PMLR, 2020

work page 2020
[19]

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Kone c n \'y , H Brendan McMahan, Felix X Yu, Peter Richt \'a rik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

The power of extrapolation in federated learning

Hanmin Li, Kirill Acharya, and Peter Richtarik. The power of extrapolation in federated learning. arXiv preprint arXiv:2405.13766, 2024

work page arXiv 2024
[21]

Federated optimization in heterogeneous networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 0 429--450, 2020

work page 2020
[22]

On the limited memory BFGS method for large scale optimization

Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45 0 (1): 0 503--528, 1989

work page 1989
[23]

The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp.\ 3325--3334. PMLR, 2018

work page 2018
[24]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.\ 1273--1282. PMLR, 2017

work page 2017
[25]

Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richt \'a rik. Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769. PMLR, 2022

work page 2022
[26]

J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93: 0 273--299, 1965

work page 1965
[27]

Randomized projection methods for convex feasibility: Conditioning and convergence rates

Ion Necoara, Peter Richt \'a rik, and Andrei Patrascu. Randomized projection methods for convex feasibility: Conditioning and convergence rates. SIAM Journal on Optimization, 29 0 (4): 0 2814--2852, 2019

work page 2019
[28]

Lectures on convex optimization, volume 137

Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018

work page 2018
[29]

Patrascu and I

A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. Journal of Machine Learning Research, 18: 0 1--42, 2018

work page 2018
[30]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.\ 8821--8831. PMLR, 2021

work page 2021
[31]

Parallel coordinate descent methods for big data optimization

Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156: 0 433--484, 2016

work page 2016
[32]

Stochastic reformulations of linear systems: algorithms and convergence theory

Peter Richt \'a rik and Martin Tak \'a c . Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41 0 (2): 0 487--524, 2020

work page 2020
[33]

A unified theory of stochastic proximal point methods without smoothness

Peter Richt \'a rik, Abdurakhmon Sadiev, and Yury Demidovich. A unified theory of stochastic proximal point methods without smoothness. arXiv preprint arXiv:2405.15941, 2024

work page arXiv 2024
[34]

The future of digital health with federated learning

Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3 0 (1): 0 1--7, 2020

work page 2020
[35]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14 0 (5): 0 877--898, 1976

work page 1976
[36]

Tyrrell Rockafellar and Roger J-B Wets

R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 1998

work page 1998
[37]

Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent

Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014

work page 2014
[38]

Stochastic proximal point methods for monotone inclusions under expected similarity

Abdurakhmon Sadiev, Laurent Condat, and Peter Richt \'a rik. Stochastic proximal point methods for monotone inclusions under expected similarity. arXiv preprint arXiv:2405.14255, 2024

work page arXiv 2024
[39]

Variance reduction techniques for stochastic proximal point algorithms

Cheik Traor \'e , Vassilis Apidopoulos, Saverio Salzo, and Silvia Villa. Variance reduction techniques for stochastic proximal point algorithms. Journal of Optimization Theory and Applications, pp.\ 1--30, 2024

work page 2024
[40]

A computation and communication efficient method for distributed nonconvex problems in the partial participation setting

Alexander Tyurin and Peter Richt \'a rik. A computation and communication efficient method for distributed nonconvex problems in the partial participation setting. Advances in Neural Information Processing Systems, 2023

work page 2023
[41]

Sharper rates and flexible framework for nonconvex SGD with client and data sampling

Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, and Peter Richt \'a rik. Sharper rates and flexible framework for nonconvex SGD with client and data sampling. arXiv preprint arXiv:2206.02275, 2022

work page arXiv 2022
[42]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[43]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[44]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[45]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page

[1] [1]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization. SIAM-Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2017. ISBN 1611974984

work page 2017

[2] [2]

Incremental proximal methods for large scale convex optimization

Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical programming, 129 0 (2): 0 163--195, 2011

work page 2011

[3] [3]

On biased compression for distributed learning

Aleksandr Beznosikov, Samuel Horv \'a th, Peter Richt \'a rik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020

work page arXiv 2002

[4] [4]

P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26 0 (4): 0 2235--2260, 2016

work page 2016

[5] [5]

P. L. Combettes and J. I. Madariaga. Randomly activated proximal methods for nonsmooth convex minimization. preprint arXiv:2403.10673, 2024

work page arXiv 2024

[6] [6]

Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections

Patrick L Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6 0 (4): 0 493--506, 1997

work page 1997

[7] [7]

Condat and P

L. Condat and P. Richt \'a rik. RandProx : P rimal-dual optimization algorithms with randomized proximal updates. In Proc.\ of Int. Conf. Learning Representations (ICLR) , 2023

work page 2023

[8] [8]

Local SGD : Unified theory and new efficient methods

Eduard Gorbunov, Filip Hanzely, and Peter Richt \'a rik. Local SGD : Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pp.\ 3556--3564. PMLR, 2021

work page 2021

[9] [9]

Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity

Kaja Gruntkowska, Alexander Tyurin, and Peter Richt \'a rik. Improving the worst-case bidirectional communication complexity for nonconvex distributed optimization under function similarity. arXiv preprint arXiv:2402.06412, 2024

work page arXiv 2024

[10] [10]

Federated Learning for Mobile Keyboard Prediction

Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran c oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo \'e Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Adaptive learning rates for faster stochastic gradient methods

Samuel Horv \'a th, Konstantin Mishchenko, and Peter Richt \'a rik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022

work page arXiv 2022

[12] [12]

Fed E x P : Speeding up federated averaging via extrapolation

Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fed E x P : Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604, 2023

work page arXiv 2023

[13] [13]

Advances and open problems in federated learning

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aur \'e lien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning , 14 0 (1--2): 0 1--210, 2021

work page 2021

[14] [14]

Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the P olyak- ojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp.\ 795--811. Springer, 2016

work page 2016

[15] [15]

Scaffold: Stochastic controlled averaging for federated learning

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp.\ 5132--5143. PMLR, 2020

work page 2020

[16] [16]

Faster federated optimization under second-order similarity

Ahmed Khaled and Chi Jin. Faster federated optimization under second-order similarity. arXiv preprint arXiv:2209.02257, 2022

work page arXiv 2022

[17] [17]

First analysis of local GD on heterogeneous data

Ahmed Khaled, Konstantin Mishchenko, and Peter Richt \'a rik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019

work page arXiv 1909

[18] [18]

A unified theory of decentralized SGD with changing topology and local updates

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp.\ 5381--5393. PMLR, 2020

work page 2020

[19] [19]

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Kone c n \'y , H Brendan McMahan, Felix X Yu, Peter Richt \'a rik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] [20]

The power of extrapolation in federated learning

Hanmin Li, Kirill Acharya, and Peter Richtarik. The power of extrapolation in federated learning. arXiv preprint arXiv:2405.13766, 2024

work page arXiv 2024

[21] [21]

Federated optimization in heterogeneous networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 0 429--450, 2020

work page 2020

[22] [22]

On the limited memory BFGS method for large scale optimization

Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45 0 (1): 0 503--528, 1989

work page 1989

[23] [23]

The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pp.\ 3325--3334. PMLR, 2018

work page 2018

[24] [24]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.\ 1273--1282. PMLR, 2017

work page 2017

[25] [25]

Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richt \'a rik. Prox S kip: Yes! local gradient steps provably lead to communication acceleration! F inally! In International Conference on Machine Learning, pp.\ 15750--15769. PMLR, 2022

work page 2022

[26] [26]

J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93: 0 273--299, 1965

work page 1965

[27] [27]

Randomized projection methods for convex feasibility: Conditioning and convergence rates

Ion Necoara, Peter Richt \'a rik, and Andrei Patrascu. Randomized projection methods for convex feasibility: Conditioning and convergence rates. SIAM Journal on Optimization, 29 0 (4): 0 2814--2852, 2019

work page 2019

[28] [28]

Lectures on convex optimization, volume 137

Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018

work page 2018

[29] [29]

Patrascu and I

A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. Journal of Machine Learning Research, 18: 0 1--42, 2018

work page 2018

[30] [30]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp.\ 8821--8831. PMLR, 2021

work page 2021

[31] [31]

Parallel coordinate descent methods for big data optimization

Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156: 0 433--484, 2016

work page 2016

[32] [32]

Stochastic reformulations of linear systems: algorithms and convergence theory

Peter Richt \'a rik and Martin Tak \'a c . Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41 0 (2): 0 487--524, 2020

work page 2020

[33] [33]

A unified theory of stochastic proximal point methods without smoothness

Peter Richt \'a rik, Abdurakhmon Sadiev, and Yury Demidovich. A unified theory of stochastic proximal point methods without smoothness. arXiv preprint arXiv:2405.15941, 2024

work page arXiv 2024

[34] [34]

The future of digital health with federated learning

Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3 0 (1): 0 1--7, 2020

work page 2020

[35] [35]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14 0 (5): 0 877--898, 1976

work page 1976

[36] [36]

Tyrrell Rockafellar and Roger J-B Wets

R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin, Heidelberg, 1998

work page 1998

[37] [37]

Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent

Ernest K Ryu and Stephen Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014

work page 2014

[38] [38]

Stochastic proximal point methods for monotone inclusions under expected similarity

Abdurakhmon Sadiev, Laurent Condat, and Peter Richt \'a rik. Stochastic proximal point methods for monotone inclusions under expected similarity. arXiv preprint arXiv:2405.14255, 2024

work page arXiv 2024

[39] [39]

Variance reduction techniques for stochastic proximal point algorithms

Cheik Traor \'e , Vassilis Apidopoulos, Saverio Salzo, and Silvia Villa. Variance reduction techniques for stochastic proximal point algorithms. Journal of Optimization Theory and Applications, pp.\ 1--30, 2024

work page 2024

[40] [40]

A computation and communication efficient method for distributed nonconvex problems in the partial participation setting

Alexander Tyurin and Peter Richt \'a rik. A computation and communication efficient method for distributed nonconvex problems in the partial participation setting. Advances in Neural Information Processing Systems, 2023

work page 2023

[41] [41]

Sharper rates and flexible framework for nonconvex SGD with client and data sampling

Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, and Peter Richt \'a rik. Sharper rates and flexible framework for nonconvex SGD with client and data sampling. arXiv preprint arXiv:2206.02275, 2022

work page arXiv 2022

[42] [42]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[43] [43]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[44] [44]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[45] [45]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page