pith. sign in

arxiv: 2003.00295 · v5 · pith:IUIV74V2new · submitted 2020-02-29 · 💻 cs.LG · cs.DC· math.OC· stat.ML

Adaptive Federated Optimization

Pith reviewed 2026-05-21 10:25 UTC · model grok-4.3

classification 💻 cs.LG cs.DCmath.OCstat.ML
keywords federated learningadaptive optimizationconvergence analysisnon-convex optimizationheterogeneous dataAdamYogi
0
0 comments X

The pith

Federated versions of Adam and Yogi achieve convergence guarantees for non-convex objectives despite heterogeneous client data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces federated adaptations of adaptive gradient methods such as Adagrad, Adam, and Yogi to replace or augment Federated Averaging. These adaptations apply per-parameter learning rate adjustments across clients while averaging model updates at a central server. Convergence analysis is provided for general non-convex loss functions under arbitrary data heterogeneity, with explicit attention to how client differences affect the number of communication rounds needed. Experiments on benchmark tasks show that the adaptive federated methods reach higher accuracy with fewer rounds than standard approaches.

Core claim

The central claim is that adaptive optimization techniques can be lifted to the federated setting, yielding both practical performance gains and provable convergence rates for non-convex problems even when each client holds a statistically different subset of the data.

What carries the argument

Federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) that maintain separate per-coordinate statistics while performing periodic averaging of client updates.

If this is right

  • Convergence rates now account for both the degree of client heterogeneity and the frequency of communication.
  • Hyperparameter tuning becomes less critical because adaptive per-coordinate scaling compensates for varying gradient magnitudes across clients.
  • The methods remain compatible with existing federated protocols that already perform local SGD steps before averaging.
  • Communication efficiency can be traded against heterogeneity by adjusting the number of local steps without losing the adaptive benefit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive averaging idea may apply to other distributed settings such as decentralized learning without a central server.
  • Combining these optimizers with privacy mechanisms could preserve both utility and differential privacy guarantees under heterogeneity.
  • The interplay result suggests that extremely heterogeneous clients may still require more frequent communication even with adaptation.

Load-bearing premise

The benefits of per-parameter adaptation observed in centralized settings continue to hold after averaging across clients with arbitrarily different data distributions.

What would settle it

An experiment or counter-example in which one of the proposed methods fails to converge or is outperformed by plain FedAvg on a non-convex objective with strong client heterogeneity.

read the original abstract

Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes federated versions of adaptive optimizers (FedAdagrad, FedAdam, FedYogi) that incorporate per-parameter adaptation at the server while performing local SGD steps on clients. It derives convergence guarantees for general non-convex objectives under client data heterogeneity, bounded gradient dissimilarity, and partial participation, showing that the methods recover known centralized rates when heterogeneity vanishes. Extensive experiments on image classification and language modeling tasks with non-iid partitions demonstrate faster convergence and higher accuracy relative to FedAvg.

Significance. If the stated bounds hold, the work is significant because it supplies the first explicit convergence analysis of server-side adaptive methods in the federated regime and quantifies the precise penalty imposed by client drift and heterogeneity on the adaptive rates. The proofs correctly track the additional terms arising from multiple local steps and partial client participation, and the experiments corroborate the predicted dependence of communication rounds on the heterogeneity measure. These contributions directly address a practical bottleneck in federated learning.

minor comments (3)
  1. [§4.1] §4.1, Algorithm 1: the server update for the first-moment estimate is written with a single global learning rate η; it would be clearer to introduce a separate server learning rate η_s and state its relation to the client learning rate η_c used in the local steps.
  2. [Theorem 3.1] Theorem 3.1: the final rate contains a term proportional to the heterogeneity constant G; the paper should explicitly note the regime in which this term dominates the usual 1/√T rate and whether the adaptive methods reduce the constant in front of G relative to FedAvg.
  3. [Table 2] Table 2: the reported test accuracies for FedAdam on the Shakespeare dataset lack standard deviations across the five random seeds; adding error bars would make the claimed gains over FedAvg statistically interpretable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. We appreciate the recognition of the significance of providing the first explicit convergence analysis for server-side adaptive methods under heterogeneity and partial participation, as well as the corroboration from our experiments. We will incorporate minor revisions to address any remaining presentation or clarification points.

Circularity Check

0 steps flagged

No significant circularity; derivation extends prior adaptive analyses independently

full rationale

The paper introduces federated variants of Adagrad, Adam, and Yogi and derives convergence bounds for non-convex objectives under client heterogeneity. These bounds are obtained by extending standard centralized adaptive-optimizer proofs (with explicit tracking of additional client-drift and partial-participation terms) from well-known assumptions such as L-smoothness, bounded stochastic-gradient variance, and a heterogeneity measure based on expected gradient dissimilarity. The central claims do not reduce to any fitted parameter renamed as a prediction, nor do they rest on a load-bearing self-citation chain; the cited prior work on centralized adaptivity is external and the new federated rates recover the known non-federated bounds when heterogeneity vanishes. Experiments are presented separately and do not enter the theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard domain assumptions from non-convex optimization and federated learning literature to support its convergence claims; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Standard bounded-variance or bounded-gradient assumptions typical for non-convex convergence analysis of adaptive methods.
    Invoked to obtain convergence rates in the presence of heterogeneous client data.

pith-pipeline@v0.9.0 · 5688 in / 1212 out tokens · 51628 ms · 2026-05-21T10:25:45.945044+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

    cs.LG 2026-05 unverdicted novelty 7.0

    LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

  2. Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

    cs.LG 2026-05 unverdicted novelty 7.0

    Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.

  3. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  4. FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training

    cs.DC 2026-05 unverdicted novelty 7.0

    FedQueue adds online queue-delay prediction, cutoff admission, and staleness-aware aggregation to federated learning, proving O(1/sqrt(R)) convergence under bounded staleness and reporting 20.5% real-world improvement...

  5. FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

    cs.LG 2026-03 unverdicted novelty 7.0

    FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.

  6. Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

    cs.LG 2026-02 unverdicted novelty 7.0

    Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plu...

  7. Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

    cs.CV 2025-10 conditional novelty 7.0

    The FedSurg challenge benchmarks federated learning on appendectomy videos and finds only 26% F1 on unseen centers even with centralized data, plus extra penalties from decentralization, with spatiotemporal models per...

  8. Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen

    math.NA 2026-05 unverdicted novelty 6.0

    Presents new synchronous and asynchronous parallel approaches for GCP tensor decomposition and evaluates computational cost and accuracy on synthetic and real-world datasets.

  9. Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning

    stat.ML 2026-05 unverdicted novelty 6.0

    Introduces FedHybrid and FedNewton for DP federated M-estimation, with finite-sample MSE bounds, minimax lower bound, and evaluations on vision datasets.

  10. FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training

    cs.DC 2026-05 unverdicted novelty 6.0

    FedQueue predicts per-facility queue delays, buffers late arrivals via cutoffs, and uses staleness-aware aggregation to achieve O(1/sqrt(R)) convergence and 20.5% real-world improvement in cross-facility HPC federated...

  11. FedSDR: Federated Self-Distillation with Rectification

    cs.LG 2026-05 unverdicted novelty 5.0

    FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.

  12. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  13. Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

    cs.LG 2026-05 unverdicted novelty 5.0

    SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect on...

  14. FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing

    cs.LG 2026-05 unverdicted novelty 5.0

    FedFrozen improves stability in heterogeneous federated Transformer training by warming up the full model then freezing the attention kernel (query/key) while optimizing the value block under a fixed kernel.

  15. FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...

  16. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  17. Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering

    cs.LG 2025-08 unverdicted novelty 5.0

    Loss-based clustering of clients enables robust federated learning against strong Byzantine attacks with bounded optimality gaps using only the server and one honest client.

  18. Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization

    cs.LG 2026-04 unverdicted novelty 4.0

    FedInit uses reverse personalized initialization in FL to reduce client drift effects, showing via excess risk that inconsistency impacts generalization error more than optimization error.

  19. Accelerating Optimization and Machine Learning through Decentralization

    cs.LG 2026-04 unverdicted novelty 3.0

    Decentralized optimization can reach optimal solutions in fewer iterations than centralized methods for machine learning tasks.

  20. A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions

    cs.LG 2026-05 unverdicted novelty 2.0

    Federated aggregation strategies show distinct performance trade-offs in accuracy, loss, and efficiency depending on whether client data distributions are homogeneous or heterogeneous.

  21. Performance and Energy Trade-Off Analysis of Hierarchical Federated Learning for Plant Disease Classification

    cs.DC 2026-04 unverdicted novelty 2.0

    Hierarchical federated learning for plant-disease classification shows distinct accuracy-versus-energy trade-offs across EfficientNet-B0, ResNet-50, and MobileNetV3-Large paired with FedAvg, FedProx, and FedAvgM.

Reference graph

Works this paper leans on

232 extracted references · 232 canonical work pages · cited by 20 Pith papers · 44 internal anchors

  1. [1]

    Tensor F low F ederated

    The TFF Authors. Tensor F low F ederated

  2. [2]

    Introducing

    Ingerman, Alex and Ostrowski, Krzys , Url =. Introducing

  3. [3]

    Tensor F low F ederated Stack Overflow Dataset

    The TensorFlow Federated Authors. Tensor F low F ederated Stack Overflow Dataset

  4. [4]

    7th International Conference on Learning Representations,

    Liangchen Luo and Yuanhao Xiong and Yan Liu and Xu Sun , title =. 7th International Conference on Learning Representations,. 2019 , url =

  5. [5]

    Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =

    Brendan McMahan and Eider Moore and Daniel Ramage and Seth Hampson and Blaise Ag. Communication-Efficient Learning of Deep Networks from Decentralized Data , booktitle =. 2017 , url =

  6. [6]

    Towards Federated Learning at Scale: System Design , url =

    Bonawitz, Keith and Eichner, Hubert and Grieskamp, Wolfgang and Huba, Dzmitry and Ingerman, Alex and Ivanov, Vladimir and Kiddon, Chlo\'. Towards Federated Learning at Scale: System Design , url =. Proceedings of Machine Learning and Systems , editor =. 2019 , publisher =

  7. [11]

    Kingma and Jimmy Ba , title =

    Diederik P. Kingma and Jimmy Ba , title =. 3rd International Conference on Learning Representations,. 2015 , timestamp =

  8. [12]

    Brendan McMahan and Matthew J

    H. Brendan McMahan and Matthew J. Streeter , title =

  9. [13]

    Journal of Machine Learning Research , volume=

    Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=

  10. [14]

    Karimireddy, Sai Praneeth and Kale, Satyen and Mohri, Mehryar and Reddi, Sashank J and Stich, Sebastian U and Suresh, Ananda Theertha , journal=

  11. [15]

    The Non-

    Hsieh, Kevin and Phanishayee, Amar and Mutlu, Onur and Gibbons, Phillip B , journal=. The Non-

  12. [16]

    Qsparse-local-

    Basu, Debraj and Data, Deepesh and Karakus, Can and Diggavi, Suhas , booktitle=. Qsparse-local-

  13. [17]

    Xie, Cong and Koyejo, Oluwasanmi and Gupta, Indranil and Lin, Haibin , journal=. Local

  14. [18]

    arXiv preprint arXiv:1905.10497 , year=

    Fair resource allocation in federated learning , author=. arXiv preprint arXiv:1905.10497 , year=

  15. [19]

    Federated Learning for Mobile Keyboard Prediction

    Federated learning for mobile keyboard prediction , author=. arXiv preprint arXiv:1811.03604 , year=

  16. [20]

    Learning Differentially Private Recurrent Language Models

    Learning differentially private recurrent language models , author=. arXiv preprint arXiv:1710.06963 , year=

  17. [21]

    2017 , organization=

    Cohen, Gregory and Afshar, Saeed and Tapson, Jonathan and Van Schaik, Andre , booktitle=. 2017 , organization=

  18. [22]

    Pachinko allocation:

    Li, Wei and McCallum, Andrew , booktitle=. Pachinko allocation:

  19. [24]

    Advances in neural information processing systems , pages=

    Parallelized stochastic gradient descent , author=. Advances in neural information processing systems , pages=

  20. [25]

    Stich , booktitle=

    Sebastian U. Stich , booktitle=. Local. 2019 , url=

  21. [26]

    arXiv preprint arXiv:1808.07217 , year=

    Don't Use Large Mini-Batches, Use Local SGD , author=. arXiv preprint arXiv:1808.07217 , year=

  22. [27]

    Cooperative

    Wang, Jianyu and Joshi, Gauri , journal=. Cooperative

  23. [28]

    Parallel restarted

    Yu, Hao and Yang, Sen and Zhu, Shenghuo , booktitle=. Parallel restarted

  24. [29]

    The error-feedback framework: Better rates for

    Stich, Sebastian U and Karimireddy, Sai Praneeth , journal=. The error-feedback framework: Better rates for

  25. [30]

    On the convergence of

    Li, Xiang and Huang, Kaixuan and Yang, Wenhao and Wang, Shusen and Zhang, Zhihua , journal=. On the convergence of

  26. [31]

    Zhang and James Lucas and Jimmy Ba and Geoffrey E

    Michael R. Zhang and James Lucas and Jimmy Ba and Geoffrey E. Hinton , editor =. Lookahead Optimizer: k steps forward, 1 step back , booktitle =

  27. [32]

    IEEE Journal on Selected Areas in Communications , volume=

    Adaptive federated learning in resource constrained edge computing systems , author=. IEEE Journal on Selected Areas in Communications , volume=. 2019 , publisher=

  28. [34]

    Proceedings of the European Conference on Computer Vision (ECCV) , pages=

    Group normalization , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

  29. [36]

    One weird trick for parallelizing convolutional neural networks

    One weird trick for parallelizing convolutional neural networks , author=. arXiv preprint arXiv:1404.5997 , year=

  30. [37]

    International conference on machine learning , pages=

    On the importance of initialization and momentum in deep learning , author=. International conference on machine learning , pages=

  31. [38]

    On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

    On the convergence of a class of adam-type algorithms for non-convex optimization , author=. arXiv preprint arXiv:1808.02941 , year=

  32. [39]

    arXiv preprint arXiv:1808.03408 , year=

    On the convergence of weighted AdaGrad with momentum for training deep neural networks , author=. arXiv preprint arXiv:1808.03408 , year=

  33. [40]

    AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

    AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods , author=. arXiv preprint arXiv:1810.00143 , year=

  34. [41]

    arXiv preprint arXiv:1806.02958 , year=

    The case for full-matrix adaptive regularization , author=. arXiv preprint arXiv:1806.02958 , year=

  35. [42]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

    A sufficient condition for convergences of adam and rmsprop , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

  36. [43]

    A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

    A tail-index analysis of stochastic gradient noise in deep neural networks , author=. arXiv preprint arXiv:1901.06053 , year=

  37. [44]

    arXiv preprint arXiv:1905.11881 , year=

    Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition , author=. arXiv preprint arXiv:1905.11881 , year=

  38. [45]

    First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

    First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise , author=. arXiv preprint arXiv:1906.09069 , year=

  39. [46]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

  40. [47]

    Advances in Neural Information Processing Systems , pages=

    The marginal value of adaptive gradient methods in machine learning , author=. Advances in Neural Information Processing Systems , pages=

  41. [48]

    Minimization of functions having

    Armijo, Larry , journal=. Minimization of functions having. 1966 , publisher=

  42. [49]

    Optimization Software , author=

    Introduction to Optimization. Optimization Software , author=. Inc., Publications Division, New York , volume=

  43. [50]

    Advances in neural information processing systems , pages=

    Deep learning without poor local minima , author=. Advances in neural information processing systems , pages=

  44. [51]

    Adaptive Gradient Methods with Dynamic Bound of Learning Rate

    Adaptive gradient methods with dynamic bound of learning rate , author=. arXiv preprint arXiv:1902.09843 , year=

  45. [52]

    Advances in Neural Information Processing Systems , pages=

    How does batch normalization help optimization? , author=. Advances in Neural Information Processing Systems , pages=

  46. [53]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  47. [54]

    2009 , institution=

    Learning multiple layers of features from tiny images , author=. 2009 , institution=

  48. [56]

    Neural computation , keywords =

    Hochreiter, Sepp and Schmidhuber, J. Neural computation , keywords =

  49. [57]

    Journal of Machine Learning Research , year =

    Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov , title =. Journal of Machine Learning Research , year =

  50. [58]

    Regularization of Neural Networks using

    Li Wan and Matthew Zeiler and Sixin Zhang and Yann Le Cun and Rob Fergus , booktitle =. Regularization of Neural Networks using. 2013 , editor =

  51. [59]

    Advances in neural information processing systems , pages=

    Attention is all you need , author=. Advances in neural information processing systems , pages=

  52. [60]

    Carbonell and Quoc V

    Zihang Dai and Zhilin Yang and Yiming Yang and Jaime G. Carbonell and Quoc V. Le and Ruslan Salakhutdinov , title =. CoRR , volume =. 2019 , url =

  53. [61]

    Regularizing and Optimizing

    Stephen Merity and Nitish Shirish Keskar and Richard Socher , booktitle=. Regularizing and Optimizing. 2018 , url=

  54. [62]

    2012 , url =

    Martin Sundermeyer and Ralf Schl. 2012 , url =

  55. [63]

    CoRR , volume =

    Tom Young and Devamanyu Hazarika and Soujanya Poria and Erik Cambria , title =. CoRR , volume =. 2017 , url =

  56. [64]

    Recurrent neural network based language model , booktitle =

    Tomas Mikolov and Martin Karafi. Recurrent neural network based language model , booktitle =. 2010 , url =

  57. [65]

    Le , title =

    Ilya Sutskever and Oriol Vinyals and Quoc V. Le , title =. Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada , pages =. 2014 , url =

  58. [66]

    Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation , booktitle =

    Cho, Kyunghyun and van Merri. Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation , booktitle =. 2014 , pages =

  59. [67]

    Mathematical Programming , volume=

    Accelerated gradient methods for nonconvex nonlinear and stochastic programming , author=. Mathematical Programming , volume=. 2016 , publisher=

  60. [68]

    SIAM Journal on Optimization , volume=

    Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework , author=. SIAM Journal on Optimization , volume=. 2012 , publisher=

  61. [69]

    Conference On Learning Theory , pages=

    Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent , author=. Conference On Learning Theory , pages=

  62. [70]

    SIAM journal on imaging sciences , volume=

    A fast iterative shrinkage-thresholding algorithm for linear inverse problems , author=. SIAM journal on imaging sciences , volume=. 2009 , publisher=

  63. [71]

    SIAM Journal on Optimization , volume=

    Efficiency of coordinate descent methods on huge-scale optimization problems , author=. SIAM Journal on Optimization , volume=. 2012 , publisher=

  64. [72]

    SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

    SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator , author=. arXiv preprint arXiv:1807.01695 , year=

  65. [73]

    Advances in Neural Information Processing Systems , pages=

    A universal catalyst for first-order optimization , author=. Advances in Neural Information Processing Systems , pages=

  66. [74]

    International Conference on Machine Learning , pages=

    Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization , author=. International Conference on Machine Learning , pages=

  67. [75]

    SIAM Journal on Optimization , volume=

    Accelerated methods for nonconvex optimization , author=. SIAM Journal on Optimization , volume=. 2018 , publisher=

  68. [76]

    The Journal of Machine Learning Research , volume=

    Katyusha: The first direct acceleration of stochastic gradient methods , author=. The Journal of Machine Learning Research , volume=. 2017 , publisher=

  69. [77]

    1983 , publisher=

    Problem complexity and method efficiency in optimization , author=. 1983 , publisher=

  70. [78]

    USSR Computational Mathematics and Mathematical Physics , volume=

    Some methods of speeding up the convergence of iteration methods , author=. USSR Computational Mathematics and Mathematical Physics , volume=. 1964 , publisher=

  71. [79]

    A method of solving a convex programming problem with convergence rate

    Nesterov, Yurii , booktitle=. A method of solving a convex programming problem with convergence rate

  72. [80]

    A geometric alternative to Nesterov's accelerated gradient descent

    A geometric alternative to Nesterov's accelerated gradient descent , author=. arXiv preprint arXiv:1506.08187 , year=

  73. [81]

    SIAM Journal on Optimization , volume=

    Analysis and design of optimization algorithms via integral quadratic constraints , author=. SIAM Journal on Optimization , volume=. 2016 , publisher=

  74. [82]

    Dissipativity Theory for Nesterov's Accelerated Method

    Dissipativity Theory for Nesterov's Accelerated Method , author=. arXiv preprint arXiv:1706.04381 , year=

  75. [83]

    Analysis of Optimization Algorithms via Integral Quadratic Constraints: Nonstrongly Convex Problems

    Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems , author=. arXiv preprint arXiv:1705.03615 , year=

  76. [84]

    Advances In Neural Information Processing Systems , pages=

    Regularized nonlinear acceleration , author=. Advances In Neural Information Processing Systems , pages=

  77. [85]

    Advances in Neural Information Processing Systems , pages=

    A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights , author=. Advances in Neural Information Processing Systems , pages=

  78. [86]

    Proceedings of the National Academy of Sciences , volume=

    A variational perspective on accelerated methods in optimization , author=. Proceedings of the National Academy of Sciences , volume=. 2016 , publisher=

  79. [87]

    A Lyapunov Analysis of Momentum Methods in Optimization

    A lyapunov analysis of momentum methods in optimization , author=. arXiv preprint arXiv:1611.02635 , year=

  80. [88]

    arXiv preprint arXiv:1712.02485 , year=

    The Approximate Duality Gap Technique: A Unified Theory of First-Order Methods , author=. arXiv preprint arXiv:1712.02485 , year=

Showing first 80 references.