Adaptive Federated Optimization

H. Brendan McMahan; Jakub Konečný; Keith Rush; Manzil Zaheer; Sanjiv Kumar; Sashank Reddi; Zachary Charles; Zachary Garrett

arXiv: 2003.00295 · v5 · pith:IUIV74V2new · submitted 2020-02-29 · cs.LG · cs.DC · math.OC · stat.ML

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan

Pith reviewed 2026-05-21 10:25 UTC · model grok-4.3

keywords: federated learning, adaptive optimization, convergence analysis, non-convex optimization, heterogeneous data, Adam, Yogi

The pith

Federated versions of Adam and Yogi achieve convergence guarantees for non-convex objectives despite heterogeneous client data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces federated adaptations of adaptive gradient methods such as Adagrad, Adam, and Yogi to replace or augment Federated Averaging. These adaptations apply per-parameter learning rate adjustments at the central server to the averaged client updates, while clients continue to run local SGD steps. Convergence analysis is provided for general non-convex loss functions under arbitrary data heterogeneity, with explicit attention to how client differences affect the number of communication rounds needed. Experiments on benchmark tasks show that the adaptive federated methods reach higher accuracy in fewer rounds than standard approaches.

Core claim

The central claim is that adaptive optimization techniques can be lifted to the federated setting, yielding both practical performance gains and provable convergence rates for non-convex problems even when each client holds a statistically different subset of the data.

What carries the argument

Federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) that maintain separate per-coordinate statistics while performing periodic averaging of client updates.
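The per-coordinate server update described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `server_adaptive_step`, the default hyperparameters, and the exact form of the second-moment rules (the commonly stated Adagrad, Adam, and Yogi accumulators, applied to the averaged client update instead of a gradient) are assumptions made here.

```python
import math

def server_adaptive_step(x, delta, m, v, variant="adam",
                         eta=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One hypothetical server update on equal-length lists of floats.

    x     : current global model
    delta : average of client updates (client model minus x)
    m, v  : first- and second-moment accumulators kept on the server
    tau   : small constant controlling the degree of adaptivity
    """
    new_x, new_m, new_v = [], [], []
    for xj, dj, mj, vj in zip(x, delta, m, v):
        mj = beta1 * mj + (1.0 - beta1) * dj      # momentum on the averaged update
        sq = dj * dj
        if variant == "adagrad":
            vj = vj + sq                          # running sum of squares
        elif variant == "adam":
            vj = beta2 * vj + (1.0 - beta2) * sq  # exponential moving average
        elif variant == "yogi":
            sign = (vj > sq) - (vj < sq)          # -1, 0, or +1
            vj = vj - (1.0 - beta2) * sq * sign   # additive, sign-controlled change
        else:
            raise ValueError(f"unknown variant: {variant}")
        new_m.append(mj)
        new_v.append(vj)
        # Per-coordinate step: large accumulated magnitude means a smaller step.
        new_x.append(xj + eta * mj / (math.sqrt(vj) + tau))
    return new_x, new_m, new_v

# Example: two coordinates whose averaged updates differ in scale.
x, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
x, m, v = server_adaptive_step(x, [1.0, -2.0], m, v, variant="adam")
print(x)
```

Under these assumptions the three variants differ only in how `v` is accumulated; the division by `sqrt(v) + tau` is what gives each coordinate its own effective step size.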

If this is right

Convergence rates now account for both the degree of client heterogeneity and the frequency of communication.
Hyperparameter tuning becomes less critical because adaptive per-coordinate scaling compensates for varying gradient magnitudes across clients.
The methods remain compatible with existing federated protocols that already perform local SGD steps before averaging.
Communication efficiency can be traded against heterogeneity by adjusting the number of local steps without losing the adaptive benefit.
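The role of local steps in the points above can be illustrated with a toy communication round. The sketch below is an assumption-laden example, not the paper's experimental setup: each hypothetical client holds a quadratic loss with its own minimizer (a stand-in for heterogeneous data), uses exact gradients instead of stochastic ones, and the helper names `local_sgd`, `average_client_delta`, and `quadratic_client` are invented for illustration.

```python
def local_sgd(x, grad_fn, steps, client_lr):
    """Run a fixed number of local gradient steps starting from the global model x."""
    y = list(x)
    for _ in range(steps):
        g = grad_fn(y)
        y = [yj - client_lr * gj for yj, gj in zip(y, g)]
    return y

def average_client_delta(x, client_grad_fns, steps, client_lr):
    """Average of (local model - global model) over all participating clients."""
    deltas = []
    for grad_fn in client_grad_fns:
        y = local_sgd(x, grad_fn, steps, client_lr)
        deltas.append([yj - xj for yj, xj in zip(y, x)])
    n = len(deltas)
    return [sum(col) / n for col in zip(*deltas)]

def quadratic_client(center):
    """Gradient of 0.5 * ||y - center||^2; each client has its own center."""
    return lambda y: [yj - cj for yj, cj in zip(y, center)]

# Two heterogeneous clients whose local optima are 1.0 and 3.0.
clients = [quadratic_client([1.0]), quadratic_client([3.0])]
for k in (1, 2, 4):
    print(k, average_client_delta([0.0], clients, steps=k, client_lr=0.5))
```

With more local steps each client moves further toward its own optimum before averaging, so the averaged update grows while communication rounds are saved. A server optimizer, plain averaging or an adaptive rule, would then consume this averaged update.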

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same adaptive averaging idea may apply to other distributed settings such as decentralized learning without a central server.
Combining these optimizers with privacy mechanisms could preserve both utility and differential privacy guarantees under heterogeneity.
The interplay result suggests that extremely heterogeneous clients may still require more frequent communication even with adaptation.

Load-bearing premise

The benefits of per-parameter adaptation observed in centralized settings continue to hold after averaging across clients with arbitrarily different data distributions.

What would settle it

An experiment or counter-example in which one of the proposed methods fails to converge or is outperformed by plain FedAvg on a non-convex objective with strong client heterogeneity.

Original abstract

Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper ports adaptive optimizers to federated learning, adds a heterogeneity-aware convergence analysis, and shows practical gains over FedAvg on non-iid data.

The core contribution is straightforward: they define server-side adaptive updates (FedAdagrad, FedAdam, FedYogi) that keep per-coordinate learning rates while averaging client models, and they prove convergence for non-convex objectives when client gradients differ. The analysis tracks the extra drift from multiple local steps and partial participation, then shows the rates recover the usual centralized bounds once heterogeneity goes to zero. That interplay between heterogeneity and communication rounds is the part worth noting; it is not just a routine re-labeling of Adam for FL. Experiments on standard image and language tasks with common non-iid partitions back the claim that these methods need less tuning and reach target accuracy in fewer rounds than plain FedAvg. The proofs rely on the usual L-smoothness and bounded-variance assumptions, which are stated clearly and do not hide circular steps. The main limitation is that the heterogeneity measure (expected gradient dissimilarity) is still somewhat abstract, and the reported gains are tied to the specific partitions used; real deployments with more extreme client skew might need further checks. The work is aimed at people already running federated training who want better default optimizers rather than at theorists looking for entirely new proof ideas. It is solid enough on both the math and the empirical side to merit a full referee process.

Referee Report

0 major / 3 minor

Summary. The paper proposes federated versions of adaptive optimizers (FedAdagrad, FedAdam, FedYogi) that incorporate per-parameter adaptation at the server while performing local SGD steps on clients. It derives convergence guarantees for general non-convex objectives under client data heterogeneity, bounded gradient dissimilarity, and partial participation, showing that the methods recover known centralized rates when heterogeneity vanishes. Extensive experiments on image classification and language modeling tasks with non-iid partitions demonstrate faster convergence and higher accuracy relative to FedAvg.

Significance. If the stated bounds hold, the work is significant because it supplies the first explicit convergence analysis of server-side adaptive methods in the federated regime and quantifies the precise penalty imposed by client drift and heterogeneity on the adaptive rates. The proofs correctly track the additional terms arising from multiple local steps and partial client participation, and the experiments corroborate the predicted dependence of communication rounds on the heterogeneity measure. These contributions directly address a practical bottleneck in federated learning.

minor comments (3)

[§4.1] Algorithm 1: the server update for the first-moment estimate is written with a single global learning rate η; it would be clearer to introduce a separate server learning rate η_s and state its relation to the client learning rate η_c used in the local steps.
[Theorem 3.1] The final rate contains a term proportional to the heterogeneity constant G; the paper should explicitly note the regime in which this term dominates the usual 1/√T rate and whether the adaptive methods reduce the constant in front of G relative to FedAvg.
[Table 2] The reported test accuracies for FedAdam on the Shakespeare dataset lack standard deviations across the five random seeds; adding error bars would make the claimed gains over FedAvg statistically interpretable.
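The error bars requested in the last comment amount to reporting a mean and a sample standard deviation over per-seed results. A minimal sketch using Python's standard `statistics` module is shown below; the helper name `mean_and_std` is hypothetical and the accuracy values are placeholders, not numbers from the paper.

```python
import statistics

def mean_and_std(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Placeholder values for five seeds; NOT results from the paper.
placeholder_runs = [0.50, 0.52, 0.54, 0.56, 0.58]
mean_acc, std_acc = mean_and_std(placeholder_runs)
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting both numbers per method lets a reader judge whether a gap between two optimizers exceeds seed-to-seed variation.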

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. We appreciate the recognition of the significance of providing the first explicit convergence analysis for server-side adaptive methods under heterogeneity and partial participation, as well as the corroboration from our experiments. We will incorporate minor revisions to address any remaining presentation or clarification points.

Circularity Check

0 steps flagged

No significant circularity; derivation extends prior adaptive analyses independently

The paper introduces federated variants of Adagrad, Adam, and Yogi and derives convergence bounds for non-convex objectives under client heterogeneity. These bounds are obtained by extending standard centralized adaptive-optimizer proofs (with explicit tracking of additional client-drift and partial-participation terms) from well-known assumptions such as L-smoothness, bounded stochastic-gradient variance, and a heterogeneity measure based on expected gradient dissimilarity. The central claims do not reduce to any fitted parameter renamed as a prediction, nor do they rest on a load-bearing self-citation chain; the cited prior work on centralized adaptivity is external and the new federated rates recover the known non-federated bounds when heterogeneity vanishes. Experiments are presented separately and do not enter the theoretical derivation.
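For orientation, the three assumptions named above are often written in the following generic form. This is a textbook-style statement under notation chosen here (m clients, client losses F_i, stochastic gradients g_i, constants L, σ, σ_g); the paper's exact formulation, for example a per-coordinate version, may differ.

```latex
% Average objective over m clients (illustrative notation)
f(x) = \frac{1}{m} \sum_{i=1}^{m} F_i(x)

% L-smoothness of each client loss
\|\nabla F_i(x) - \nabla F_i(y)\| \le L \,\|x - y\| \quad \text{for all } x, y

% Bounded variance of the stochastic gradient g_i on client i
\mathbb{E}\,\|g_i(x) - \nabla F_i(x)\|^2 \le \sigma^2

% Bounded gradient dissimilarity across clients (heterogeneity)
\frac{1}{m} \sum_{i=1}^{m} \|\nabla F_i(x) - \nabla f(x)\|^2 \le \sigma_g^2
```

In this notation, "heterogeneity vanishes" corresponds to σ_g = 0, the case in which every client gradient coincides with the gradient of the average objective.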

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard domain assumptions from non-convex optimization and federated learning literature to support its convergence claims; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Standard bounded-variance or bounded-gradient assumptions typical for non-convex convergence analysis of adaptive methods.
Invoked to obtain convergence rates in the presence of heterogeneous client data.

pith-pipeline@v0.9.0 · 5688 in / 1212 out tokens · 51628 ms · 2026-05-21T10:25:45.945044+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG 2026-05 unverdicted novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
cs.LG 2026-05 unverdicted novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
math.OC 2026-05 unverdicted novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
cs.DC 2026-05 unverdicted novelty 7.0

FedQueue adds online queue-delay prediction, cutoff admission, and staleness-aware aggregation to federated learning, proving O(1/sqrt(R)) convergence under bounded staleness and reporting 20.5% real-world improvement...
FedBCGD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
cs.LG 2026-03 unverdicted novelty 7.0

FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.
Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
cs.LG 2026-02 unverdicted novelty 7.0

Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plu...
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
cs.CV 2025-10 conditional novelty 7.0

The FedSurg challenge benchmarks federated learning on appendectomy videos and finds only 26% F1 on unseen centers even with centralized data, plus extra penalties from decentralization, with spatiotemporal models per...
Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen
math.NA 2026-05 unverdicted novelty 6.0

Presents new synchronous and asynchronous parallel approaches for GCP tensor decomposition and evaluates computational cost and accuracy on synthetic and real-world datasets.
Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning
stat.ML 2026-05 unverdicted novelty 6.0

Introduces FedHybrid and FedNewton for DP federated M-estimation, with finite-sample MSE bounds, minimax lower bound, and evaluations on vision datasets.
FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
cs.DC 2026-05 unverdicted novelty 6.0

FedQueue predicts per-facility queue delays, buffers late arrivals via cutoffs, and uses staleness-aware aggregation to achieve O(1/sqrt(R)) convergence and 20.5% real-world improvement in cross-facility HPC federated...
FedSDR: Federated Self-Distillation with Rectification
cs.LG 2026-05 unverdicted novelty 5.0

FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
math.OC 2026-05 unverdicted novelty 5.0

Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
cs.LG 2026-05 unverdicted novelty 5.0

SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect on...
FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing
cs.LG 2026-05 unverdicted novelty 5.0

FedFrozen improves stability in heterogeneous federated Transformer training by warming up the full model then freezing the attention kernel (query/key) while optimizing the value block under a fixed kernel.
FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
cs.LG 2026-04 unverdicted novelty 5.0

PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
cs.LG 2025-08 unverdicted novelty 5.0

Loss-based clustering of clients enables robust federated learning against strong Byzantine attacks with bounded optimality gaps using only the server and one honest client.
Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
cs.LG 2026-04 unverdicted novelty 4.0

FedInit uses reverse personalized initialization in FL to reduce client drift effects, showing via excess risk that inconsistency impacts generalization error more than optimization error.
Accelerating Optimization and Machine Learning through Decentralization
cs.LG 2026-04 unverdicted novelty 3.0

Decentralized optimization can reach optimal solutions in fewer iterations than centralized methods for machine learning tasks.
A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions
cs.LG 2026-05 unverdicted novelty 2.0

Federated aggregation strategies show distinct performance trade-offs in accuracy, loss, and efficiency depending on whether client data distributions are homogeneous or heterogeneous.
Performance and Energy Trade-Off Analysis of Hierarchical Federated Learning for Plant Disease Classification
cs.DC 2026-04 unverdicted novelty 2.0

Hierarchical federated learning for plant-disease classification shows distinct accuracy-versus-energy trade-offs across EfficientNet-B0, ResNet-50, and MobileNetV3-Large paired with FedAvg, FedProx, and FedAvgM.

Reference graph

Works this paper leans on

232 extracted references · 232 canonical work pages · cited by 20 Pith papers · 44 internal anchors

[1]

Tensor F low F ederated

The TFF Authors. Tensor F low F ederated

work page
[2]

Introducing

Ingerman, Alex and Ostrowski, Krzys , Url =. Introducing

work page
[3]

Tensor F low F ederated Stack Overflow Dataset

The TensorFlow Federated Authors. Tensor F low F ederated Stack Overflow Dataset

work page
[4]

7th International Conference on Learning Representations,

Liangchen Luo and Yuanhao Xiong and Yan Liu and Xu Sun , title =. 7th International Conference on Learning Representations,. 2019 , url =

work page 2019
[5]

Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =

Brendan McMahan and Eider Moore and Daniel Ramage and Seth Hampson and Blaise Ag. Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =. 2017 , url =

work page 2017
[6]

Towards Federated Learning at Scale: System Design , url =

Bonawitz, Keith and Eichner, Hubert and Grieskamp, Wolfgang and Huba, Dzmitry and Ingerman, Alex and Ivanov, Vladimir and Kiddon, Chlo\'. Towards Federated Learning at Scale: System Design , url =. Proceedings of Machine Learning and Systems , editor =. 2019 , publisher =

work page 2019
[11]

Kingma and Jimmy Ba , title =

Diederik P. Kingma and Jimmy Ba , title =. 3rd International Conference on Learning Representations,. 2015 , timestamp =

work page 2015
[12]

Brendan McMahan and Matthew J

H. Brendan McMahan and Matthew J. Streeter , title =

work page
[13]

Journal of Machine Learning Research , volume=

Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=

work page
[14]

Karimireddy, Sai Praneeth and Kale, Satyen and Mohri, Mehryar and Reddi, Sashank J and Stich, Sebastian U and Suresh, Ananda Theertha , journal=

work page
[15]

The Non-

Hsieh, Kevin and Phanishayee, Amar and Mutlu, Onur and Gibbons, Phillip B , journal=. The Non-

work page
[16]

Qsparse-local-

Basu, Debraj and Data, Deepesh and Karakus, Can and Diggavi, Suhas , booktitle=. Qsparse-local-

work page
[17]

Xie, Cong and Koyejo, Oluwasanmi and Gupta, Indranil and Lin, Haibin , journal=. Local

work page
[18]

arXiv preprint arXiv:1905.10497 , year=

Fair resource allocation in federated learning , author=. arXiv preprint arXiv:1905.10497 , year=

work page arXiv 1905
[19]

Federated Learning for Mobile Keyboard Prediction

Federated learning for mobile keyboard prediction , author=. arXiv preprint arXiv:1811.03604 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Learning Differentially Private Recurrent Language Models

Learning differentially private recurrent language models , author=. arXiv preprint arXiv:1710.06963 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

2017 , organization=

Cohen, Gregory and Afshar, Saeed and Tapson, Jonathan and Van Schaik, Andre , booktitle=. 2017 , organization=

work page 2017
[22]

Pachinko allocation:

Li, Wei and McCallum, Andrew , booktitle=. Pachinko allocation:

work page
[24]

Advances in neural information processing systems , pages=

Parallelized stochastic gradient descent , author=. Advances in neural information processing systems , pages=

work page
[25]

Stich , booktitle=

Sebastian U. Stich , booktitle=. Local. 2019 , url=

work page 2019
[26]

arXiv preprint arXiv:1808.07217 , year=

Don't Use Large Mini-Batches, Use Local SGD , author=. arXiv preprint arXiv:1808.07217 , year=

work page arXiv
[27]

Cooperative

Wang, Jianyu and Joshi, Gauri , journal=. Cooperative

work page
[28]

Parallel restarted

Yu, Hao and Yang, Sen and Zhu, Shenghuo , booktitle=. Parallel restarted

work page
[29]

The error-feedback framework: Better rates for

Stich, Sebastian U and Karimireddy, Sai Praneeth , journal=. The error-feedback framework: Better rates for

work page
[30]

On the convergence of

Li, Xiang and Huang, Kaixuan and Yang, Wenhao and Wang, Shusen and Zhang, Zhihua , journal=. On the convergence of

work page
[31]

Zhang and James Lucas and Jimmy Ba and Geoffrey E

Michael R. Zhang and James Lucas and Jimmy Ba and Geoffrey E. Hinton , editor =. Lookahead Optimizer: k steps forward, 1 step back , booktitle =

work page
[32]

IEEE Journal on Selected Areas in Communications , volume=

Adaptive federated learning in resource constrained edge computing systems , author=. IEEE Journal on Selected Areas in Communications , volume=. 2019 , publisher=

work page 2019
[34]

Proceedings of the European Conference on Computer Vision (ECCV) , pages=

Group normalization , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

work page
[36]

One weird trick for parallelizing convolutional neural networks

One weird trick for parallelizing convolutional neural networks , author=. arXiv preprint arXiv:1404.5997 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

International conference on machine learning , pages=

On the importance of initialization and momentum in deep learning , author=. International conference on machine learning , pages=

work page
[38]

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

On the convergence of a class of adam-type algorithms for non-convex optimization , author=. arXiv preprint arXiv:1808.02941 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

arXiv preprint arXiv:1808.03408 , year=

On the convergence of weighted AdaGrad with momentum for training deep neural networks , author=. arXiv preprint arXiv:1808.03408 , year=

work page arXiv
[40]

AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods , author=. arXiv preprint arXiv:1810.00143 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

arXiv preprint arXiv:1806.02958 , year=

The case for full-matrix adaptive regularization , author=. arXiv preprint arXiv:1806.02958 , year=

work page arXiv
[42]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

A sufficient condition for convergences of adam and rmsprop , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[43]

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

A tail-index analysis of stochastic gradient noise in deep neural networks , author=. arXiv preprint arXiv:1901.06053 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1901
[44]

arXiv preprint arXiv:1905.11881 , year=

Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition , author=. arXiv preprint arXiv:1905.11881 , year=

work page arXiv 1905
[45]

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise , author=. arXiv preprint arXiv:1906.09069 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1906
[46]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Advances in Neural Information Processing Systems , pages=

The marginal value of adaptive gradient methods in machine learning , author=. Advances in Neural Information Processing Systems , pages=

work page
[48]

Minimization of functions having

Armijo, Larry , journal=. Minimization of functions having. 1966 , publisher=

work page 1966
[49]

Optimization Software , author=

Introduction to Optimization. Optimization Software , author=. Inc., Publications Division, New York , volume=

work page
[50]

Advances in neural information processing systems , pages=

Deep learning without poor local minima , author=. Advances in neural information processing systems , pages=

work page
[51]

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Adaptive gradient methods with dynamic bound of learning rate , author=. arXiv preprint arXiv:1902.09843 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1902
[52]

Advances in Neural Information Processing Systems , pages=

How does batch normalization help optimization? , author=. Advances in Neural Information Processing Systems , pages=

work page
[53]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[54]

2009 , institution=

Learning multiple layers of features from tiny images , author=. 2009 , institution=

work page 2009
[56]

Neural computation , keywords =

Hochreiter, Sepp and Schmidhuber, J. Neural computation , keywords =

work page
[57]

Journal of Machine Learning Research , year =

Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov , title =. Journal of Machine Learning Research , year =

work page
[58]

Regularization of Neural Networks using

Li Wan and Matthew Zeiler and Sixin Zhang and Yann Le Cun and Rob Fergus , booktitle =. Regularization of Neural Networks using. 2013 , editor =

work page 2013
[59]

Advances in neural information processing systems , pages=

Attention is all you need , author=. Advances in neural information processing systems , pages=

work page
[60]

Carbonell and Quoc V

Zihang Dai and Zhilin Yang and Yiming Yang and Jaime G. Carbonell and Quoc V. Le and Ruslan Salakhutdinov , title =. CoRR , volume =. 2019 , url =

work page 2019
[61]

Regularizing and Optimizing

Stephen Merity and Nitish Shirish Keskar and Richard Socher , booktitle=. Regularizing and Optimizing. 2018 , url=

work page 2018
[62]

2012 , url =

Martin Sundermeyer and Ralf Schl. 2012 , url =

work page 2012
[63]

CoRR , volume =

Tom Young and Devamanyu Hazarika and Soujanya Poria and Erik Cambria , title =. CoRR , volume =. 2017 , url =

work page 2017
[64]

Recurrent neural network based language model , booktitle =

Tomas Mikolov and Martin Karafi. Recurrent neural network based language model , booktitle =. 2010 , url =

work page 2010
[65]

Le , title =

Ilya Sutskever and Oriol Vinyals and Quoc V. Le , title =. Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada , pages =. 2014 , url =

work page 2014
[66]

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation , booktitle =

Cho, Kyunghyun and van Merri. Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation , booktitle =. 2014 , pages =

work page 2014
[67]

Mathematical Programming , volume=

Accelerated gradient methods for nonconvex nonlinear and stochastic programming , author=. Mathematical Programming , volume=. 2016 , publisher=

work page 2016
[68]

SIAM Journal on Optimization , volume=

Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework , author=. SIAM Journal on Optimization , volume=. 2012 , publisher=

work page 2012
[69]

Conference On Learning Theory , pages=

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent , author=. Conference On Learning Theory , pages=

work page
[70]

SIAM journal on imaging sciences , volume=

A fast iterative shrinkage-thresholding algorithm for linear inverse problems , author=. SIAM journal on imaging sciences , volume=. 2009 , publisher=

work page 2009
[71]

SIAM Journal on Optimization , volume=

Efficiency of coordinate descent methods on huge-scale optimization problems , author=. SIAM Journal on Optimization , volume=. 2012 , publisher=

work page 2012
[72]

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator , author=. arXiv preprint arXiv:1807.01695 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

A universal catalyst for first-order optimization. Advances in Neural Information Processing Systems.

[74]

Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. International Conference on Machine Learning.

[75]

Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 2018.

[76]

Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 2017.

[77]

Problem complexity and method efficiency in optimization. 1983.

[78]

Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.

[79]

Nesterov, Yurii. A method of solving a convex programming problem with convergence rate

[80]

A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187.

[81]

Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.

[82]

Dissipativity Theory for Nesterov's Accelerated Method. arXiv preprint arXiv:1706.04381.

[83]

Analysis of Optimization Algorithms via Integral Quadratic Constraints: Nonstrongly Convex Problems. arXiv preprint arXiv:1705.03615.

[84]

Regularized nonlinear acceleration. Advances in Neural Information Processing Systems.

[85]

A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Advances in Neural Information Processing Systems.

[86]

A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 2016.

[87]

A Lyapunov Analysis of Momentum Methods in Optimization. arXiv preprint arXiv:1611.02635.

[88]

The Approximate Duality Gap Technique: A Unified Theory of First-Order Methods. arXiv preprint arXiv:1712.02485.

