Communication-Efficient Decentralized Stochastic Minimax Optimization
Pith reviewed 2026-05-22 00:15 UTC · model grok-4.3
The pith
Integrating accelerated momentum with local updates reduces local steps to O(κ ε^{-1}) per round while attaining the best-known communication complexity O(κ² ε^{-2}) for decentralized stochastic nonconvex PL minimax problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The algorithm integrates accelerated momentum into multiple local updates performed between gossip rounds; this change yields Õ(κ ε^{-1}) local updates per communication round together with the communication complexity O(κ² ε^{-2}), which is the best rate known for stochastic nonconvex PL minimax problems and improves on the sample complexity of plain local gradient descent-ascent.
What carries the argument
Local decentralized minimax iteration that embeds accelerated momentum inside blocks of local updates to bound drift between gossip steps.
If this is right
- Local-update count drops from Õ(κ² ε^{-2}) to Õ(κ ε^{-1}) per round relative to local GDA.
- Communication complexity reaches O(κ² ε^{-2}), matching the best previously reported rate.
- Sample complexity improves over the local gradient descent-ascent baseline.
- Practical runs on robust logistic regression show fewer total communication rounds to target accuracy.
Where Pith is reading between the lines
- The same momentum-local-update pattern may transfer to other decentralized nonconvex problems where drift control is the bottleneck.
- Lower local-update counts free bandwidth for larger networks or faster-changing objectives.
- Topology dependence beyond the current gossip model could be tested by varying the mixing matrix in experiments.
Load-bearing premise
Accelerated momentum is enough to keep local drift from spoiling the reduced local-update schedule or the target communication complexity.
What would settle it
A concrete numerical run or proof instance in which Õ(κ ε^{-1}) local updates without momentum produce communication rounds scaling worse than O(κ² ε^{-2}).
Figures
read the original abstract
In this work, we study decentralized stochastic nonconvex Polyak--{\L}ojasiewicz minimax problems and propose a communication-efficient algorithm. Motivated by the efficiency of local SGD in federated learning, we investigate decentralized minimax algorithms that perform multiple local updates between gossip rounds to improve communication efficiency. Existing results show that the local decentralized gradient descent-ascent algorithm requires an excessive number of local updates, on the order of $\tilde{\mathcal{O}}(\kappa^2\varepsilon^{-2})$ per communication round, to achieve the communication complexity $\tilde{\mathcal{O}}(\kappa^3\varepsilon^{-2})$, where $\varepsilon$ denotes the target accuracy and $\kappa>1$ is the condition number. However, such a large number of local updates can be impractical: it can underexploit available communication resources and exacerbate local drift, as agents may move toward their own local optima. To address this limitation, we develop a local decentralized minimax method that integrates accelerated momentum with local updates. Our method reduces the number of local updates to $\tilde{\mathcal{O}}(\kappa\varepsilon^{-1})$ per communication round while achieving the best-known communication complexity $\mathcal{O}(\kappa^2\varepsilon^{-2})$. Compared with the local gradient descent-ascent method, the proposed method also achieves an enhanced sample complexity. Experiments on robust logistic regression with real-world datasets demonstrate that our method achieves superior communication efficiency over several existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies decentralized stochastic nonconvex Polyak-Łojasiewicz minimax optimization and proposes a local-update algorithm that augments gradient descent-ascent with accelerated momentum. It claims that performing Õ(κ ε^{-1}) local steps per gossip round suffices to achieve the communication complexity O(κ² ε^{-2}), improving on prior local GD-A results that required Õ(κ² ε^{-2}) local steps for a worse Õ(κ³ ε^{-2}) communication bound. The work also asserts an improved sample complexity and reports empirical gains on robust logistic regression.
Significance. If the drift-control argument holds, the result meaningfully advances communication-efficient decentralized minimax methods by showing that momentum can safely reduce local steps without inflating the number of communication rounds. This addresses a practical bottleneck in existing analyses and could influence algorithm design in federated or distributed minimax settings. The combination of theoretical rates and real-data experiments strengthens the contribution.
major comments (2)
- [§4.2] §4.2 (Local Drift Analysis): the bound on the consensus error after K = Θ(κ ε^{-1}) momentum-augmented steps must be shown to remain O(ε) or smaller without introducing hidden factors of κ or 1/ε. The standard stochastic drift term proportional to K² σ² / η (or its momentum analogue) needs an explicit inequality demonstrating that it is absorbed into the existing O(κ² ε^{-2}) communication budget; otherwise the claimed reduction in local updates cannot be maintained.
- [Theorem 4.1] Theorem 4.1 (or the main convergence theorem): the choice of momentum parameter is stated to be independent of K, yet the proof sketch must verify that this choice simultaneously controls both the PL-driven convergence and the local-to-global deviation under the given stochastic assumptions. If the momentum coefficient implicitly depends on κ or ε, the communication complexity reverts to prior bounds.
minor comments (2)
- [Algorithm 1] Notation for the momentum coefficient and the local step-size schedule should be unified across the algorithm box, the theorem statements, and the proof appendix to avoid reader confusion.
- [§5] The experimental section would benefit from reporting the exact number of communication rounds and total local updates used by each baseline to make the efficiency comparison quantitative rather than qualitative.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and indicating the revisions we will incorporate to strengthen the analysis.
read point-by-point responses
-
Referee: [§4.2] §4.2 (Local Drift Analysis): the bound on the consensus error after K = Θ(κ ε^{-1}) momentum-augmented steps must be shown to remain O(ε) or smaller without introducing hidden factors of κ or 1/ε. The standard stochastic drift term proportional to K² σ² / η (or its momentum analogue) needs an explicit inequality demonstrating that it is absorbed into the existing O(κ² ε^{-2}) communication budget; otherwise the claimed reduction in local updates cannot be maintained.
Authors: We agree that the local drift analysis in §4.2 would benefit from a more explicit bound. In the revised manuscript we will insert a new supporting lemma (Lemma 4.3) that derives the consensus error after exactly K = Θ(κ ε^{-1}) momentum-augmented local steps. The lemma shows that the momentum-augmented drift term is bounded by O(ε) under the given step-size and momentum schedules, without extra factors of κ or 1/ε. This O(ε) term is then absorbed directly into the existing O(κ² ε^{-2}) communication-complexity budget via a standard telescoping argument already present in the proof of Theorem 4.1. The revised appendix will contain the full inequality chain. revision: yes
-
Referee: [Theorem 4.1] Theorem 4.1 (or the main convergence theorem): the choice of momentum parameter is stated to be independent of K, yet the proof sketch must verify that this choice simultaneously controls both the PL-driven convergence and the local-to-global deviation under the given stochastic assumptions. If the momentum coefficient implicitly depends on κ or ε, the communication complexity reverts to prior bounds.
Authors: The momentum coefficient is chosen as β = 1 − Θ(1/κ), which depends only on the condition number κ and is explicitly independent of both the local-update count K and the target accuracy ε. In the expanded proof of Theorem 4.1 we will add a dedicated paragraph that separates the analysis into two parts: (i) the PL-driven convergence rate, which follows from the standard momentum-augmented descent lemma and is unaffected by K, and (ii) the local-to-global deviation bound, which is controlled by the same β through a contraction argument that holds uniformly under the stochastic assumptions. Because β does not depend on K or ε, the claimed communication complexity O(κ² ε^{-2}) is preserved. The revised proof sketch will make this separation explicit. revision: yes
Circularity Check
No circularity: rates derived from momentum-augmented analysis under standard PL assumptions
full rationale
The paper proposes a new decentralized algorithm that augments local updates with accelerated momentum to control drift in stochastic nonconvex PL minimax problems. The claimed reduction in local steps to Õ(κ ε^{-1}) per round while preserving O(κ² ε^{-2}) communication rounds follows from the convergence analysis of the proposed method, which is presented as independent of prior fitted quantities or self-referential definitions. No load-bearing step reduces by construction to a parameter fit, a self-citation chain, or an ansatz smuggled from the authors' own prior work; the derivation remains self-contained against external stochastic assumptions and standard gossip-based consensus bounds.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The nonconvex minimax objective satisfies the Polyak-Łojasiewicz condition with parameter μ > 0.
Reference graph
Works this paper leans on
-
[1]
Gradient descent ascent for minimax problems on riemannian manifolds,
F. Huang and S. Gao, “Gradient descent ascent for minimax problems on riemannian manifolds,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8466–8476, 2023
work page 2023
-
[2]
Theoretical analysis of adversarial learning: A minimax approach,
Z. Tu, J. Zhang, and D. Tao, “Theoretical analysis of adversarial learning: A minimax approach,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019
work page 2019
-
[3]
Improved Adversarial Learning for Fair Classification
L. E. Celis and V . Keswani, “Improved adversarial learning for fair classification,” 2019. arXiv:1901.10443
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[4]
A minimax framework for classification with applications to images and high dimensional data,
Q. Cheng, H. Zhou, J. Cheng, and H. Li, “A minimax framework for classification with applications to images and high dimensional data,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2117–2130, 2014
work page 2014
-
[5]
Communication-efficient learning of deep networks from decentralized data,
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), (Fort Lauderdale, FL, USA), pp. 1273–1282, PMLR, Apr 2017
work page 2017
-
[6]
Subgradient methods for saddle-point problems,
A. Nedi ´c and A. Ozdaglar, “Subgradient methods for saddle-point problems,” J. Optim. Theory Appl., vol. 142, no. 1, pp. 205–228, 2009
work page 2009
-
[7]
Adaptation, learning, and optimization over networks,
A. H. Sayed, “Adaptation, learning, and optimization over networks,”Found. Trends Mach. Learn., vol. 7, no. 4– 5, pp. 311–801, 2014
work page 2014
-
[8]
Advances and open problems in federated learning,
P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cor- mode, R. Cummings, et al., “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, no. 1–2, pp. 1–210, 2021
work page 2021
-
[9]
SCAFFOLD: Stochastic controlled averaging for federated learning,
S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 119, (Virtual), pp. 5132–5143, PMLR, Jul 2020
work page 2020
-
[10]
On the influence of bias-correction on distributed stochas- tic optimization,
K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed, “On the influence of bias-correction on distributed stochas- tic optimization,”IEEE Trans. Signal Process., vol. 68, pp. 4352–4367, 2020
work page 2020
-
[11]
A unified theory of decentralized sgd with changing topology and local updates,
A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized sgd with changing topology and local updates,” inProc. Int. Conf. Mach. Learn. (ICML), vol. 119, (Virtual), pp. 5381–5393, PMLR, Jul 2020
work page 2020
-
[12]
E. Vlatakis-Gkaragkounis, L. Flokas, and G. Piliouras, “Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019
work page 2019
-
[13]
The complexity of constrained min-max optimization,
C. Daskalakis, S. Skoulakis, and M. Zampetakis, “The complexity of constrained min-max optimization,” in Proc. ACM Symp. Theory Comput. (STOC), pp. 1466–1478, 2021
work page 2021
-
[14]
Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile
P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V . Chandrasekhar, and G. Piliouras, “Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile,” arXiv:1807.02629, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
Efficient methods for structured nonconvex-nonconcave min- max optimization,
J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min- max optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 2746–2754, PMLR, 2021
work page 2021
-
[16]
On gradient descent ascent for nonconvex-concave minimax problems,
T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 6083–6093, PMLR, 2020
work page 2020
-
[17]
L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradient descent ascent for stochastic nonconvex- strongly-concave minimax problems,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , vol. 33, pp. 20566– 20577, 2020
work page 2020
-
[18]
Faster single-loop algorithms for minimax optimization without strong concavity,
J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algorithms for minimax optimization without strong concavity,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 5485–5517, PMLR, 2022
work page 2022
-
[19]
Enhanced adaptive gradient algorithms for nonsmooth nonconvex-PL minimax optimization,
F. Huang, X. Wang, S. Zhang, C. Xuan, and S. Chen, “Enhanced adaptive gradient algorithms for nonsmooth nonconvex-PL minimax optimization,” inThe 28th International Conference on Artificial Intelligence and Statis- tics, 2025. 17 arXiv Template A PREPRINT
work page 2025
-
[20]
Y . Xu, “Decentralized gradient descent maximization method for composite nonconvex strongly-concave mini- max problems,”SIAM J. Optim., vol. 34, no. 1, pp. 1006–1044, 2024
work page 2024
-
[21]
Solving a class of non-convex min-max games using iterative first order methods,
M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019
work page 2019
-
[22]
What is local optimality in nonconvex-nonconcave minimax optimiza- tion?,
C. Jin, P. Netrapalli, and M. I. Jordan, “What is local optimality in nonconvex-nonconcave minimax optimiza- tion?,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 4880–4889, PMLR, 2020
work page 2020
-
[23]
A faster decentralized algorithm for nonconvex minimax prob- lems,
W. Xian, F. Huang, Y . Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax prob- lems,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 25865–25877, 2021
work page 2021
-
[24]
Near-optimal decentralized momentum method for nonconvex-pl minimax problems,
F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-pl minimax problems,” arXiv:2304.10902, 2023
-
[25]
Decentralized stochastic gradient descent ascent for finite-sum minimax problems,
H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,”arXiv:2212.02724, 2022
-
[26]
An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,
L. Chen, H. Ye, and L. Luo, “An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1990–1998, PMLR, 2024
work page 1990
-
[27]
X. Zhang, G. Mancino-Ball, N. S. Aybat, and Y . Xu, “Jointly improving the sample and communication com- plexities in decentralized stochastic minimax optimization,” inProceedings of the AAAI Conference on Artificial Intelligence, no. 18, pp. 20865–20873, 2024
work page 2024
-
[28]
Diffusion stochastic optimization for min-max problems,
H. Cai, S. A. Alghunaim, and A. H. Sayed, “Diffusion stochastic optimization for min-max problems,” IEEE Trans. Signal Process., vol. 73, pp. 259–274, 2025
work page 2025
-
[29]
Federated minimax optimization: Improved convergence anal- yses and algorithms,
P. Sharma, R. Panda, G. Joshi, and P. Varshney, “Federated minimax optimization: Improved convergence anal- yses and algorithms,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 19683–19730, PMLR, 2022
work page 2022
-
[30]
Stochastic smoothed gradient descent ascent for federated minimax optimization,
W. Shen, M. Huang, J. Zhang, and C. Shen, “Stochastic smoothed gradient descent ascent for federated minimax optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 3988–3996, PMLR, 2024
work page 2024
-
[31]
Solving a class of non-convex minimax optimization in federated learning,
X. Wu, J. Sun, Z. Hu, A. Zhang, and H. Huang, “Solving a class of non-convex minimax optimization in federated learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2024
work page 2024
-
[32]
Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,
Y . Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1387–1395, 2021
work page 2021
-
[33]
Federated learning: Challenges, methods, and future directions,
T. Li, A. K. Sahu, A. Talwalkar, and V . Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020
work page 2020
-
[34]
Accelerated stochastic min-max optimization based on bias-corrected momentum,
H. Cai, S. A. Alghunaim, and A. H. Sayed, “Accelerated stochastic min-max optimization based on bias-corrected momentum,”arXiv:2406.13041, 2024
-
[35]
Momentum improves normalized sgd,
A. Cutkosky and H. Mehta, “Momentum improves normalized sgd,” in Proc. Int. Conf. Mach. Learn. (ICML) , pp. 2260–2268, PMLR, 2020
work page 2020
-
[36]
Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,
X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,”IEEE Trans. Pattern Anal. Mach. Intell., 2024
work page 2024
-
[37]
Better sgd using second-order momentum,
H. Tran and A. Cutkosky, “Better sgd using second-order momentum,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 3530–3541, 2022
work page 2022
-
[38]
Accelerated methods for nonconvex optimization,
Y . Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Accelerated methods for nonconvex optimization,”SIAM J. Optim., vol. 28, no. 2, pp. 1751–1772, 2018
work page 2018
-
[39]
Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization
X. Li, J. Yang, and N. He, “Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,” arXiv:2210.17478, 2022
-
[40]
M. Dagréou, P. Ablin, S. Vaiter, and T. Moreau, “A framework for bilevel optimization that enables stochastic and global variance reduction algorithms,” in Adv. Neural Inf. Process. Syst. (NeurIPS) , vol. 35, pp. 26698–26710, 2022
work page 2022
-
[41]
Bilevel optimization: Convergence analysis and enhanced design,
K. Ji, J. Yang, and Y . Liang, “Bilevel optimization: Convergence analysis and enhanced design,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 4882–4892, PMLR, 2021
work page 2021
-
[42]
Robust decentralized learning with local updates and gradient tracking,
S. Ghiasvand, A. Reisizadeh, M. Alizadeh, and R. Pedarsani, “Robust decentralized learning with local updates and gradient tracking,”IEEE Transactions on Networking, 2025
work page 2025
-
[43]
Decentralized gradient tracking with local steps,
Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,”Optim. Methods Softw., pp. 1–28, 2024. 18 arXiv Template A PREPRINT
work page 2024
-
[44]
On the performance of gradient tracking with local updates,
E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” inProc. IEEE Conf. Decision and Control (CDC), pp. 4309–4313, 2023
work page 2023
-
[45]
Local exact-diffusion for decentralized optimization and learning,
S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” IEEE Trans. Autom. Con- trol, vol. 69, pp. 7371–7386, 2024
work page 2024
-
[46]
Fast decentralized gradient tracking for federated minimax optimization with local updates,
C. J. Li, “Fast decentralized gradient tracking for federated minimax optimization with local updates,” arXiv:2405.04566, 2024
-
[47]
Nest your adaptive algorithm for parameter-agnostic nonconvex minimax optimiza- tion,
J. Yang, X. Li, and N. He, “Nest your adaptive algorithm for parameter-agnostic nonconvex minimax optimiza- tion,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 11202–11216, 2022
work page 2022
-
[48]
Tight analysis of extra-gradient and optimistic gradient meth- ods for nonconvex minimax problems,
P. Mahdavinia, Y . Deng, H. Li, and M. Mahdavi, “Tight analysis of extra-gradient and optimistic gradient meth- ods for nonconvex minimax problems,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 31213–31225, 2022
work page 2022
-
[49]
A single-loop smoothed gradient descent-ascent algorithm for nonconvex- concave min-max problems,
J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex- concave min-max problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 7377–7389, 2020
work page 2020
-
[50]
Efficient search of first-order nash equilibria in nonconvex- concave smooth min-max problems,
D. M. Ostrovskii, A. Lowy, and M. Razaviyayn, “Efficient search of first-order nash equilibria in nonconvex- concave smooth min-max problems,”SIAM J. Optim., vol. 31, no. 4, pp. 2508–2538, 2021
work page 2021
-
[51]
An accelerated inexact proximal point method for solving nonconvex-concave min-max problems,
W. Kong and R. D. Monteiro, “An accelerated inexact proximal point method for solving nonconvex-concave min-max problems,”SIAM J. Optim., vol. 31, no. 4, pp. 2558–2585, 2021
work page 2021
-
[52]
Sapd+: An accelerated stochastic method for nonconvex- concave minimax problems,
X. Zhang, N. S. Aybat, and M. Gurbuzbalaban, “Sapd+: An accelerated stochastic method for nonconvex- concave minimax problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 21668–21681, 2022
work page 2022
-
[53]
Z. Xu, H. Zhang, Y . Xu, and G. Lan, “A unified single-loop alternating gradient projection algorithm for nonconvex–concave and convex–nonconcave minimax problems,” Math. Program., vol. 201, no. 1, pp. 635– 706, 2023
work page 2023
-
[54]
Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax prob- lems,
R. I. Bo¸ t and A. Böhm, “Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax prob- lems,”SIAM J. Optim., vol. 33, no. 3, pp. 1884–1913, 2023
work page 1913
-
[55]
Global convergence and variance reduction for a class of nonconvex- nonconcave minimax problems,
J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex- nonconcave minimax problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 1153–1165, 2020
work page 2020
-
[56]
Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,
H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” inProc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases (ECML PKDD), pp. 795–811, 2016
work page 2016
-
[57]
Sarah: A novel method for machine learning problems using stochastic recursive gradient,
L. M. Nguyen, J. Liu, K. Scheinberg, and M. Taká ˇc, “Sarah: A novel method for machine learning problems using stochastic recursive gradient,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 2613–2621, PMLR, 2017
work page 2017
-
[58]
Accelerating stochastic gradient descent using predictive variance reduction,
R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 26, 2013
work page 2013
-
[59]
Stochastic variance reduction for nonconvex optimiza- tion,
S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimiza- tion,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 314–323, PMLR, 2016
work page 2016
-
[60]
Momentum-based variance reduction in non-convex sgd,
A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019
work page 2019
-
[61]
Storm+: Fully adaptive sgd with recursive momentum for nonconvex opti- mization,
K. Levy, A. Kavis, and V . Cevher, “Storm+: Fully adaptive sgd with recursive momentum for nonconvex opti- mization,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 20571–20582, 2021
work page 2021
-
[62]
Reducing the variance in online optimization by transporting past gradients,
S. Arnold, P. A. Manzagol, R. Babanezhad Harikandeh, I. Mitliagkas, and N. Le Roux, “Reducing the variance in online optimization by transporting past gradients,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019
work page 2019
-
[63]
Complexity lower bounds for nonconvex-strongly-concave min-max optimization,
H. Li, Y . Tian, J. Zhang, and A. Jadbabaie, “Complexity lower bounds for nonconvex-strongly-concave min-max optimization,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 1792–1804, 2021
work page 2021
-
[64]
Near-optimal algorithms for minimax optimization,
T. Lin, C. Jin, and M. I. Jordan, “Near-optimal algorithms for minimax optimization,” in Proc. Conf. Learn. Theory (COLT), pp. 2738–2779, PMLR, 2020
work page 2020
-
[65]
A catalyst framework for minimax optimization,
J. Yang, S. Zhang, N. Kiyavash, and N. He, “A catalyst framework for minimax optimization,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 5667–5678, 2020
work page 2020
-
[66]
Fast exact multiplication by the hessian,
B. A. Pearlmutter, “Fast exact multiplication by the hessian,” Neural Comput., vol. 6, no. 1, pp. 147–160, 1994
work page 1994
-
[67]
Automatic differentiation in pytorch,
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” inNIPS Workshop on Autodiff, 2017. 19 arXiv Template A PREPRINT
work page 2017
-
[68]
N. Andrei, “Accelerated conjugate gradient algorithm with finite difference hessian/vector product approximation for unconstrained optimization,”J. Comput. Appl. Math., vol. 230, no. 2, pp. 570–582, 2009
work page 2009
-
[69]
Adaptive subgradient methods for online learning and stochastic optimiza- tion,
J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradient methods for online learning and stochastic optimiza- tion,”J. Mach. Learn. Res., vol. 12, no. 7, 2011
work page 2011
-
[70]
Adam: A Method for Stochastic Optimization
D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[71]
A. H. Sayed, Inference and Learning from Data. Cambridge University Press, 2022
work page 2022
-
[72]
Super-adam: Faster and universal framework of adaptive gradients,
F. Huang, J. Li, and H. Huang, “Super-adam: Faster and universal framework of adaptive gradients,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 9074–9085, 2021
work page 2021
-
[73]
On the convergence of decentralized adaptive gradient methods,
X. Chen, B. Karimi, W. Zhao, and P. Li, “On the convergence of decentralized adaptive gradient methods,” in Proc. Asian Conf. Mach. Learn. (ACML), pp. 217–232, PMLR, 2023
work page 2023
-
[74]
Y . Yan, Y . Xu, Q. Lin, L. Zhang, and T. Yang, “Stochastic primal-dual algorithms with faster convergence than O(1/ √ T ) for problems without bilinear structure,”arXiv:1904.10112, 2019
-
[75]
M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 4615–4625, 2019
work page 2019
-
[76]
On the Convergence of Adam and Beyond
S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv:1904.09237, 2019. 20 arXiv Template A PREPRINT The supplemental material hereafter provides missing algorithmic details for the single-agent algorithm AdaHMM and proof details for Theorems 4.2 and 4.3. We begin our analysis by presenting the convergence proof for the single...
work page internal anchor Pith review Pith/arXiv arXiv 1904
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.