Adaptive Federated Optimization
Pith reviewed 2026-05-21 10:25 UTC · model grok-4.3
The pith
Federated versions of Adagrad, Adam, and Yogi come with convergence guarantees for non-convex objectives despite heterogeneous client data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that adaptive optimization techniques can be lifted to the federated setting, yielding both practical performance gains and provable convergence rates for non-convex problems even when each client holds a statistically different subset of the data.
What carries the argument
Federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) in which the server maintains per-coordinate adaptive statistics and applies them to the periodically averaged client updates.
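A minimal sketch of how such a server-side update might look, written on plain Python lists. It follows the commonly cited form of the Adagrad, Adam, and Yogi second-moment rules applied to the averaged client update; the function name `server_step`, its parameter names, and the default values are illustrative assumptions, not taken from the paper's code.

```python
import math


def server_step(x, m, v, delta, variant="adam",
                lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server update; returns new (x, m, v) as lists.

    x     : current global model (list of floats)
    m, v  : per-coordinate first- and second-moment statistics
    delta : averaged client update (local model minus global model)
    tau   : small constant controlling the degree of adaptivity (assumed name)
    """
    new_x, new_m, new_v = [], [], []
    for xj, mj, vj, dj in zip(x, m, v, delta):
        mj = beta1 * mj + (1.0 - beta1) * dj      # momentum on the averaged update
        d2 = dj * dj
        if variant == "adagrad":
            vj = vj + d2                          # running sum of squares
        elif variant == "adam":
            vj = beta2 * vj + (1.0 - beta2) * d2  # exponential moving average
        elif variant == "yogi":
            sign = (vj > d2) - (vj < d2)          # sign(v - delta^2)
            vj = vj - (1.0 - beta2) * d2 * sign   # additive, sign-controlled change
        else:
            raise ValueError(f"unknown variant: {variant}")
        new_x.append(xj + lr * mj / (math.sqrt(vj) + tau))
        new_m.append(mj)
        new_v.append(vj)
    return new_x, new_m, new_v


if __name__ == "__main__":
    tau = 1e-3
    x, m, v = [0.0, 0.0], [0.0, 0.0], [tau ** 2, tau ** 2]
    # Two coordinates whose averaged updates differ by two orders of magnitude.
    x, m, v = server_step(x, m, v, delta=[1.0, 0.01], variant="yogi", tau=tau)
    print(x)
```

The per-coordinate division by `sqrt(v) + tau` is what rescales coordinates whose averaged updates differ widely in magnitude; clients are untouched and still send ordinary model differences.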
If this is right
- Convergence rates now account for both the degree of client heterogeneity and the frequency of communication.
- Hyperparameter tuning becomes less critical because adaptive per-coordinate scaling compensates for varying gradient magnitudes across clients.
- The methods remain compatible with existing federated protocols that already perform local SGD steps before averaging.
- Communication efficiency can be traded against heterogeneity by adjusting the number of local steps without losing the adaptive benefit.
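The local-steps-then-average protocol referred to above can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's setup: each client holds a one-dimensional quadratic with its own minimizer (a stand-in for heterogeneous data), and the names `client_delta`, `run_round`, `client_lr`, and `server_lr` are hypothetical. The server here does plain averaging; an adaptive server would replace only the last line of `run_round`.

```python
def client_delta(x_global, center, local_steps, client_lr):
    """Run local gradient steps on F_i(x) = 0.5 * (x - center)**2.

    Returns the model difference (local model minus global model) that the
    client would send back to the server.
    """
    x = x_global
    for _ in range(local_steps):
        grad = x - center          # exact gradient of the toy quadratic
        x = x - client_lr * grad
    return x - x_global


def run_round(x_global, centers, local_steps, client_lr, server_lr):
    """One communication round: local steps on every client, then averaging."""
    deltas = [client_delta(x_global, c, local_steps, client_lr) for c in centers]
    avg_delta = sum(deltas) / len(deltas)
    # server_lr = 1.0 corresponds to plain averaging of the local models.
    return x_global + server_lr * avg_delta


if __name__ == "__main__":
    centers = [-2.0, 0.5, 4.0]     # heterogeneous clients: different minimizers
    for k in (1, 5, 20):
        deltas = [client_delta(0.0, c, k, 0.1) for c in centers]
        spread = max(deltas) - min(deltas)
        print(f"local_steps={k:2d}  spread of client updates={spread:.3f}")
```

With more local steps each client moves further toward its own minimizer, so the individual updates spread apart even though fewer rounds are needed; that spread is the drift the convergence rates have to account for.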
Where Pith is reading between the lines
- The same adaptive averaging idea may apply to other distributed settings such as decentralized learning without a central server.
- Combining these optimizers with privacy mechanisms could preserve both utility and differential privacy guarantees under heterogeneity.
- The interplay result suggests that extremely heterogeneous clients may still require more frequent communication even with adaptation.
Load-bearing premise
The benefits of per-parameter adaptation observed in centralized settings continue to hold after averaging across clients with arbitrarily different data distributions.
What would settle it
An experiment or counter-example in which one of the proposed methods fails to converge or is outperformed by plain FedAvg on a non-convex objective with strong client heterogeneity.
Original abstract
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes federated versions of adaptive optimizers (FedAdagrad, FedAdam, FedYogi) that incorporate per-parameter adaptation at the server while performing local SGD steps on clients. It derives convergence guarantees for general non-convex objectives under client data heterogeneity, bounded gradient dissimilarity, and partial participation, showing that the methods recover known centralized rates when heterogeneity vanishes. Extensive experiments on image classification and language modeling tasks with non-iid partitions demonstrate faster convergence and higher accuracy relative to FedAvg.
Significance. If the stated bounds hold, the work is significant because it supplies the first explicit convergence analysis of server-side adaptive methods in the federated regime and quantifies the precise penalty imposed by client drift and heterogeneity on the adaptive rates. The proofs correctly track the additional terms arising from multiple local steps and partial client participation, and the experiments corroborate the predicted dependence of communication rounds on the heterogeneity measure. These contributions directly address a practical bottleneck in federated learning.
minor comments (3)
- [§4.1] §4.1, Algorithm 1: the server update for the first-moment estimate is written with a single global learning rate η; it would be clearer to introduce a separate server learning rate η_s and state its relation to the client learning rate η_c used in the local steps.
- [Theorem 3.1] Theorem 3.1: the final rate contains a term proportional to the heterogeneity constant G; the paper should explicitly note the regime in which this term dominates the usual 1/√T rate and whether the adaptive methods reduce the constant in front of G relative to FedAvg.
- [Table 2] Table 2: the reported test accuracies for FedAdam on the Shakespeare dataset lack standard deviations across the five random seeds; adding error bars would make the claimed gains over FedAvg statistically interpretable.
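The error bars requested in the last comment could be computed along these lines. This is a small sketch using Python's standard `statistics` module; the accuracy values are made-up placeholders, not results from the paper, and `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds for a sample standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


if __name__ == "__main__":
    # Hypothetical per-seed test accuracies (placeholders, NOT from the paper).
    fedadam_runs = [0.571, 0.568, 0.574, 0.570, 0.566]
    mean, std = summarize(fedadam_runs)
    print(f"FedAdam: {mean:.3f} +/- {std:.3f} (n={len(fedadam_runs)})")
```

`statistics.stdev` uses the sample (n - 1) estimator, which is the usual choice when only a handful of seeds is available.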
Simulated Author's Rebuttal
We thank the referee for the positive and constructive review. We appreciate the recognition of the significance of providing the first explicit convergence analysis for server-side adaptive methods under heterogeneity and partial participation, as well as the corroboration from our experiments. We will incorporate minor revisions to address any remaining presentation or clarification points.
Circularity Check
No significant circularity; derivation extends prior adaptive analyses independently
full rationale
The paper introduces federated variants of Adagrad, Adam, and Yogi and derives convergence bounds for non-convex objectives under client heterogeneity. These bounds are obtained by extending standard centralized adaptive-optimizer proofs (with explicit tracking of additional client-drift and partial-participation terms) from well-known assumptions such as L-smoothness, bounded stochastic-gradient variance, and a heterogeneity measure based on expected gradient dissimilarity. The central claims do not reduce to any fitted parameter renamed as a prediction, nor do they rest on a load-bearing self-citation chain; the cited prior work on centralized adaptivity is external and the new federated rates recover the known non-federated bounds when heterogeneity vanishes. Experiments are presented separately and do not enter the theoretical derivation.
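For orientation, the three assumptions named above are often written as follows. This is an assumed standard formulation; the paper's exact statements may differ (for example, by being stated per coordinate). The symbols are illustrative: $f(x) = \frac{1}{m}\sum_{i=1}^{m} F_i(x)$ is the global objective over $m$ clients, $g_i(x)$ is a stochastic gradient of $F_i$, and $\sigma_l$, $\sigma_g$ denote the local and global variance constants.

```latex
% Assumed standard forms; notation is illustrative and may not match the paper.
\begin{align*}
\|\nabla F_i(x) - \nabla F_i(y)\| &\le L\,\|x - y\|
  && \text{($L$-smoothness)} \\
\mathbb{E}\,\|g_i(x) - \nabla F_i(x)\|^2 &\le \sigma_l^2
  && \text{(bounded stochastic-gradient variance)} \\
\frac{1}{m}\sum_{i=1}^{m} \|\nabla F_i(x) - \nabla f(x)\|^2 &\le \sigma_g^2
  && \text{(bounded gradient dissimilarity)}
\end{align*}
```

In this form, $\sigma_g = 0$ means every client's gradient coincides with the global gradient, which is the regime in which the federated rates are said to recover the non-federated ones.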
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard bounded-variance or bounded-gradient assumptions typical for non-convex convergence analysis of adaptive methods.
Forward citations
Cited by 21 Pith papers
- LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
  LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
- Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
  Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
- Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
  Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
- FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
  FedQueue adds online queue-delay prediction, cutoff admission, and staleness-aware aggregation to federated learning, proving O(1/sqrt(R)) convergence under bounded staleness and reporting 20.5% real-world improvement...
- FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
  FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.
- Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
  Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plu...
- Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
  The FedSurg challenge benchmarks federated learning on appendectomy videos and finds only 26% F1 on unseen centers even with centralized data, plus extra penalties from decentralization, with spatiotemporal models per...
- Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen
  Presents new synchronous and asynchronous parallel approaches for GCP tensor decomposition and evaluates computational cost and accuracy on synthetic and real-world datasets.
- Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning
  Introduces FedHybrid and FedNewton for DP federated M-estimation, with finite-sample MSE bounds, minimax lower bound, and evaluations on vision datasets.
- FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
  FedQueue predicts per-facility queue delays, buffers late arrivals via cutoffs, and uses staleness-aware aggregation to achieve O(1/sqrt(R)) convergence and 20.5% real-world improvement in cross-facility HPC federated...
- FedSDR: Federated Self-Distillation with Rectification
  FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.
- Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
  Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
- Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
  SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect on...
- FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing
  FedFrozen improves stability in heterogeneous federated Transformer training by warming up the full model then freezing the attention kernel (query/key) while optimizing the value block under a fixed kernel.
- FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
  FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...
- PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
  PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
- Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
  Loss-based clustering of clients enables robust federated learning against strong Byzantine attacks with bounded optimality gaps using only the server and one honest client.
- Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
  FedInit uses reverse personalized initialization in FL to reduce client drift effects, showing via excess risk that inconsistency impacts generalization error more than optimization error.
- Accelerating Optimization and Machine Learning through Decentralization
  Decentralized optimization can reach optimal solutions in fewer iterations than centralized methods for machine learning tasks.
- A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions
  Federated aggregation strategies show distinct performance trade-offs in accuracy, loss, and efficiency depending on whether client data distributions are homogeneous or heterogeneous.
- Performance and Energy Trade-Off Analysis of Hierarchical Federated Learning for Plant Disease Classification
  Hierarchical federated learning for plant-disease classification shows distinct accuracy-versus-energy trade-offs across EfficientNet-B0, ResNet-50, and MobileNetV3-Large paired with FedAvg, FedProx, and FedAvgM.