pith. sign in

arxiv: 2507.21901 · v2 · pith:P2O7CXH7new · submitted 2025-07-29 · 🧮 math.OC

Communication-Efficient Decentralized Stochastic Minimax Optimization

Pith reviewed 2026-05-22 00:15 UTC · model grok-4.3

classification 🧮 math.OC
keywords decentralized optimizationstochastic minimaxcommunication efficiencyPolyak-Lojasiewiczaccelerated momentumlocal updatesgradient descent-ascentrobust logistic regression
0
0 comments X

The pith

Integrating accelerated momentum with local updates reduces local steps to O(κ ε^{-1}) per round while attaining the best-known communication complexity O(κ² ε^{-2}) for decentralized stochastic nonconvex PL minimax problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets communication efficiency in decentralized stochastic minimax optimization under the nonconvex Polyak-Łojasiewicz condition. Existing local gradient descent-ascent requires too many local updates per gossip round, which wastes communication opportunities and increases local drift. By combining accelerated momentum with those local steps, the new method lowers the local-update count to Õ(κ ε^{-1}) while preserving the leading communication rate O(κ² ε^{-2}) and improving sample complexity. This directly tackles the practical mismatch between available communication rounds and the large local-update budgets demanded by prior analyses. Real-data experiments on robust logistic regression confirm faster convergence in communication rounds compared with baselines.

Core claim

The algorithm integrates accelerated momentum into multiple local updates performed between gossip rounds; this change yields Õ(κ ε^{-1}) local updates per communication round together with the communication complexity O(κ² ε^{-2}), which is the best rate known for stochastic nonconvex PL minimax problems and improves on the sample complexity of plain local gradient descent-ascent.

What carries the argument

Local decentralized minimax iteration that embeds accelerated momentum inside blocks of local updates to bound drift between gossip steps.

If this is right

  • Local-update count drops from Õ(κ² ε^{-2}) to Õ(κ ε^{-1}) per round relative to local GDA.
  • Communication complexity reaches O(κ² ε^{-2}), matching the best previously reported rate.
  • Sample complexity improves over the local gradient descent-ascent baseline.
  • Practical runs on robust logistic regression show fewer total communication rounds to target accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same momentum-local-update pattern may transfer to other decentralized nonconvex problems where drift control is the bottleneck.
  • Lower local-update counts free bandwidth for larger networks or faster-changing objectives.
  • Topology dependence beyond the current gossip model could be tested by varying the mixing matrix in experiments.

Load-bearing premise

Accelerated momentum is enough to keep local drift from spoiling the reduced local-update schedule or the target communication complexity.

What would settle it

A concrete numerical run or proof instance in which Õ(κ ε^{-1}) local updates without momentum produce communication rounds scaling worse than O(κ² ε^{-2}).

Figures

Figures reproduced from arXiv: 2507.21901 by Ali H. Sayed, Haoyuan Cai, Sulaiman A. Alghunaim.

Figure 1
Figure 1. Figure 1: Illustration of the nonsmooth function (7) when [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Simulation results on the federated learning setup. The figures represent the results on the datasets "mush [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Simulation results on fully decentralized setup. The figures represent the results on the datasets "mush [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Simulation results with varying degrees of sparsity, where [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A comparison of worst-case accuracy over iterations in training a fair neural network classifier. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

In this work, we study decentralized stochastic nonconvex Polyak--{\L}ojasiewicz minimax problems and propose a communication-efficient algorithm. Motivated by the efficiency of local SGD in federated learning, we investigate decentralized minimax algorithms that perform multiple local updates between gossip rounds to improve communication efficiency. Existing results show that the local decentralized gradient descent-ascent algorithm requires an excessive number of local updates, on the order of $\tilde{\mathcal{O}}(\kappa^2\varepsilon^{-2})$ per communication round, to achieve the communication complexity $\tilde{\mathcal{O}}(\kappa^3\varepsilon^{-2})$, where $\varepsilon$ denotes the target accuracy and $\kappa>1$ is the condition number. However, such a large number of local updates can be impractical: it can underexploit available communication resources and exacerbate local drift, as agents may move toward their own local optima. To address this limitation, we develop a local decentralized minimax method that integrates accelerated momentum with local updates. Our method reduces the number of local updates to $\tilde{\mathcal{O}}(\kappa\varepsilon^{-1})$ per communication round while achieving the best-known communication complexity $\mathcal{O}(\kappa^2\varepsilon^{-2})$. Compared with the local gradient descent-ascent method, the proposed method also achieves an enhanced sample complexity. Experiments on robust logistic regression with real-world datasets demonstrate that our method achieves superior communication efficiency over several existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies decentralized stochastic nonconvex Polyak-Łojasiewicz minimax optimization and proposes a local-update algorithm that augments gradient descent-ascent with accelerated momentum. It claims that performing Õ(κ ε^{-1}) local steps per gossip round suffices to achieve the communication complexity O(κ² ε^{-2}), improving on prior local GD-A results that required Õ(κ² ε^{-2}) local steps for a worse Õ(κ³ ε^{-2}) communication bound. The work also asserts an improved sample complexity and reports empirical gains on robust logistic regression.

Significance. If the drift-control argument holds, the result meaningfully advances communication-efficient decentralized minimax methods by showing that momentum can safely reduce local steps without inflating the number of communication rounds. This addresses a practical bottleneck in existing analyses and could influence algorithm design in federated or distributed minimax settings. The combination of theoretical rates and real-data experiments strengthens the contribution.

major comments (2)
  1. [§4.2] §4.2 (Local Drift Analysis): the bound on the consensus error after K = Θ(κ ε^{-1}) momentum-augmented steps must be shown to remain O(ε) or smaller without introducing hidden factors of κ or 1/ε. The standard stochastic drift term proportional to K² σ² / η (or its momentum analogue) needs an explicit inequality demonstrating that it is absorbed into the existing O(κ² ε^{-2}) communication budget; otherwise the claimed reduction in local updates cannot be maintained.
  2. [Theorem 4.1] Theorem 4.1 (or the main convergence theorem): the choice of momentum parameter is stated to be independent of K, yet the proof sketch must verify that this choice simultaneously controls both the PL-driven convergence and the local-to-global deviation under the given stochastic assumptions. If the momentum coefficient implicitly depends on κ or ε, the communication complexity reverts to prior bounds.
minor comments (2)
  1. [Algorithm 1] Notation for the momentum coefficient and the local step-size schedule should be unified across the algorithm box, the theorem statements, and the proof appendix to avoid reader confusion.
  2. [§5] The experimental section would benefit from reporting the exact number of communication rounds and total local updates used by each baseline to make the efficiency comparison quantitative rather than qualitative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and indicating the revisions we will incorporate to strengthen the analysis.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Local Drift Analysis): the bound on the consensus error after K = Θ(κ ε^{-1}) momentum-augmented steps must be shown to remain O(ε) or smaller without introducing hidden factors of κ or 1/ε. The standard stochastic drift term proportional to K² σ² / η (or its momentum analogue) needs an explicit inequality demonstrating that it is absorbed into the existing O(κ² ε^{-2}) communication budget; otherwise the claimed reduction in local updates cannot be maintained.

    Authors: We agree that the local drift analysis in §4.2 would benefit from a more explicit bound. In the revised manuscript we will insert a new supporting lemma (Lemma 4.3) that derives the consensus error after exactly K = Θ(κ ε^{-1}) momentum-augmented local steps. The lemma shows that the momentum-augmented drift term is bounded by O(ε) under the given step-size and momentum schedules, without extra factors of κ or 1/ε. This O(ε) term is then absorbed directly into the existing O(κ² ε^{-2}) communication-complexity budget via a standard telescoping argument already present in the proof of Theorem 4.1. The revised appendix will contain the full inequality chain. revision: yes

  2. Referee: [Theorem 4.1] Theorem 4.1 (or the main convergence theorem): the choice of momentum parameter is stated to be independent of K, yet the proof sketch must verify that this choice simultaneously controls both the PL-driven convergence and the local-to-global deviation under the given stochastic assumptions. If the momentum coefficient implicitly depends on κ or ε, the communication complexity reverts to prior bounds.

    Authors: The momentum coefficient is chosen as β = 1 − Θ(1/κ), which depends only on the condition number κ and is explicitly independent of both the local-update count K and the target accuracy ε. In the expanded proof of Theorem 4.1 we will add a dedicated paragraph that separates the analysis into two parts: (i) the PL-driven convergence rate, which follows from the standard momentum-augmented descent lemma and is unaffected by K, and (ii) the local-to-global deviation bound, which is controlled by the same β through a contraction argument that holds uniformly under the stochastic assumptions. Because β does not depend on K or ε, the claimed communication complexity O(κ² ε^{-2}) is preserved. The revised proof sketch will make this separation explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: rates derived from momentum-augmented analysis under standard PL assumptions

full rationale

The paper proposes a new decentralized algorithm that augments local updates with accelerated momentum to control drift in stochastic nonconvex PL minimax problems. The claimed reduction in local steps to Õ(κ ε^{-1}) per round while preserving O(κ² ε^{-2}) communication rounds follows from the convergence analysis of the proposed method, which is presented as independent of prior fitted quantities or self-referential definitions. No load-bearing step reduces by construction to a parameter fit, a self-citation chain, or an ansatz smuggled from the authors' own prior work; the derivation remains self-contained against external stochastic assumptions and standard gossip-based consensus bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the Polyak-Łojasiewicz condition for the minimax objective and standard assumptions on stochastic gradients and network connectivity; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption The nonconvex minimax objective satisfies the Polyak-Łojasiewicz condition with parameter μ > 0.
    Invoked to obtain linear convergence rates in the nonconvex setting.

pith-pipeline@v0.9.0 · 5787 in / 1097 out tokens · 51567 ms · 2026-05-22T00:15:40.937051+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 4 internal anchors

  1. [1]

    Gradient descent ascent for minimax problems on riemannian manifolds,

    F. Huang and S. Gao, “Gradient descent ascent for minimax problems on riemannian manifolds,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8466–8476, 2023

  2. [2]

    Theoretical analysis of adversarial learning: A minimax approach,

    Z. Tu, J. Zhang, and D. Tao, “Theoretical analysis of adversarial learning: A minimax approach,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  3. [3]

    Improved Adversarial Learning for Fair Classification

    L. E. Celis and V . Keswani, “Improved adversarial learning for fair classification,” 2019. arXiv:1901.10443

  4. [4]

    A minimax framework for classification with applications to images and high dimensional data,

    Q. Cheng, H. Zhou, J. Cheng, and H. Li, “A minimax framework for classification with applications to images and high dimensional data,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2117–2130, 2014

  5. [5]

    Communication-efficient learning of deep networks from decentralized data,

    H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), (Fort Lauderdale, FL, USA), pp. 1273–1282, PMLR, Apr 2017

  6. [6]

    Subgradient methods for saddle-point problems,

    A. Nedi ´c and A. Ozdaglar, “Subgradient methods for saddle-point problems,” J. Optim. Theory Appl., vol. 142, no. 1, pp. 205–228, 2009

  7. [7]

    Adaptation, learning, and optimization over networks,

    A. H. Sayed, “Adaptation, learning, and optimization over networks,”Found. Trends Mach. Learn., vol. 7, no. 4– 5, pp. 311–801, 2014

  8. [8]

    Advances and open problems in federated learning,

    P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cor- mode, R. Cummings, et al., “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, no. 1–2, pp. 1–210, 2021

  9. [9]

    SCAFFOLD: Stochastic controlled averaging for federated learning,

    S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 119, (Virtual), pp. 5132–5143, PMLR, Jul 2020

  10. [10]

    On the influence of bias-correction on distributed stochas- tic optimization,

    K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed, “On the influence of bias-correction on distributed stochas- tic optimization,”IEEE Trans. Signal Process., vol. 68, pp. 4352–4367, 2020

  11. [11]

    A unified theory of decentralized sgd with changing topology and local updates,

    A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized sgd with changing topology and local updates,” inProc. Int. Conf. Mach. Learn. (ICML), vol. 119, (Virtual), pp. 5381–5393, PMLR, Jul 2020

  12. [12]

    Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games,

    E. Vlatakis-Gkaragkounis, L. Flokas, and G. Piliouras, “Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  13. [13]

    The complexity of constrained min-max optimization,

    C. Daskalakis, S. Skoulakis, and M. Zampetakis, “The complexity of constrained min-max optimization,” in Proc. ACM Symp. Theory Comput. (STOC), pp. 1466–1478, 2021

  14. [14]

    Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile

    P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V . Chandrasekhar, and G. Piliouras, “Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile,” arXiv:1807.02629, 2018

  15. [15]

    Efficient methods for structured nonconvex-nonconcave min- max optimization,

    J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min- max optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 2746–2754, PMLR, 2021

  16. [16]

    On gradient descent ascent for nonconvex-concave minimax problems,

    T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 6083–6093, PMLR, 2020

  17. [17]

    Stochastic recursive gradient descent ascent for stochastic nonconvex- strongly-concave minimax problems,

    L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradient descent ascent for stochastic nonconvex- strongly-concave minimax problems,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , vol. 33, pp. 20566– 20577, 2020

  18. [18]

    Faster single-loop algorithms for minimax optimization without strong concavity,

    J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algorithms for minimax optimization without strong concavity,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 5485–5517, PMLR, 2022

  19. [19]

    Enhanced adaptive gradient algorithms for nonsmooth nonconvex-PL minimax optimization,

    F. Huang, X. Wang, S. Zhang, C. Xuan, and S. Chen, “Enhanced adaptive gradient algorithms for nonsmooth nonconvex-PL minimax optimization,” inThe 28th International Conference on Artificial Intelligence and Statis- tics, 2025. 17 arXiv Template A PREPRINT

  20. [20]

    Decentralized gradient descent maximization method for composite nonconvex strongly-concave mini- max problems,

    Y . Xu, “Decentralized gradient descent maximization method for composite nonconvex strongly-concave mini- max problems,”SIAM J. Optim., vol. 34, no. 1, pp. 1006–1044, 2024

  21. [21]

    Solving a class of non-convex min-max games using iterative first order methods,

    M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  22. [22]

    What is local optimality in nonconvex-nonconcave minimax optimiza- tion?,

    C. Jin, P. Netrapalli, and M. I. Jordan, “What is local optimality in nonconvex-nonconcave minimax optimiza- tion?,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 4880–4889, PMLR, 2020

  23. [23]

    A faster decentralized algorithm for nonconvex minimax prob- lems,

    W. Xian, F. Huang, Y . Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax prob- lems,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 25865–25877, 2021

  24. [24]

    Near-optimal decentralized momentum method for nonconvex-pl minimax problems,

    F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-pl minimax problems,” arXiv:2304.10902, 2023

  25. [25]

    Decentralized stochastic gradient descent ascent for finite-sum minimax problems,

    H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,”arXiv:2212.02724, 2022

  26. [26]

    An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,

    L. Chen, H. Ye, and L. Luo, “An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1990–1998, PMLR, 2024

  27. [27]

    Jointly improving the sample and communication com- plexities in decentralized stochastic minimax optimization,

    X. Zhang, G. Mancino-Ball, N. S. Aybat, and Y . Xu, “Jointly improving the sample and communication com- plexities in decentralized stochastic minimax optimization,” inProceedings of the AAAI Conference on Artificial Intelligence, no. 18, pp. 20865–20873, 2024

  28. [28]

    Diffusion stochastic optimization for min-max problems,

    H. Cai, S. A. Alghunaim, and A. H. Sayed, “Diffusion stochastic optimization for min-max problems,” IEEE Trans. Signal Process., vol. 73, pp. 259–274, 2025

  29. [29]

    Federated minimax optimization: Improved convergence anal- yses and algorithms,

    P. Sharma, R. Panda, G. Joshi, and P. Varshney, “Federated minimax optimization: Improved convergence anal- yses and algorithms,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 19683–19730, PMLR, 2022

  30. [30]

    Stochastic smoothed gradient descent ascent for federated minimax optimization,

    W. Shen, M. Huang, J. Zhang, and C. Shen, “Stochastic smoothed gradient descent ascent for federated minimax optimization,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 3988–3996, PMLR, 2024

  31. [31]

    Solving a class of non-convex minimax optimization in federated learning,

    X. Wu, J. Sun, Z. Hu, A. Zhang, and H. Huang, “Solving a class of non-convex minimax optimization in federated learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2024

  32. [32]

    Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,

    Y . Deng and M. Mahdavi, “Local stochastic gradient descent ascent: Convergence analysis and communication efficiency,” inProc. Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1387–1395, 2021

  33. [33]

    Federated learning: Challenges, methods, and future directions,

    T. Li, A. K. Sahu, A. Talwalkar, and V . Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020

  34. [34]

    Accelerated stochastic min-max optimization based on bias-corrected momentum,

    H. Cai, S. A. Alghunaim, and A. H. Sayed, “Accelerated stochastic min-max optimization based on bias-corrected momentum,”arXiv:2406.13041, 2024

  35. [35]

    Momentum improves normalized sgd,

    A. Cutkosky and H. Mehta, “Momentum improves normalized sgd,” in Proc. Int. Conf. Mach. Learn. (ICML) , pp. 2260–2268, PMLR, 2020

  36. [36]

    Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,

    X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,”IEEE Trans. Pattern Anal. Mach. Intell., 2024

  37. [37]

    Better sgd using second-order momentum,

    H. Tran and A. Cutkosky, “Better sgd using second-order momentum,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 3530–3541, 2022

  38. [38]

    Accelerated methods for nonconvex optimization,

    Y . Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Accelerated methods for nonconvex optimization,”SIAM J. Optim., vol. 28, no. 2, pp. 1751–1772, 2018

  39. [39]

    Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization

    X. Li, J. Yang, and N. He, “Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,” arXiv:2210.17478, 2022

  40. [40]

    A framework for bilevel optimization that enables stochastic and global variance reduction algorithms,

    M. Dagréou, P. Ablin, S. Vaiter, and T. Moreau, “A framework for bilevel optimization that enables stochastic and global variance reduction algorithms,” in Adv. Neural Inf. Process. Syst. (NeurIPS) , vol. 35, pp. 26698–26710, 2022

  41. [41]

    Bilevel optimization: Convergence analysis and enhanced design,

    K. Ji, J. Yang, and Y . Liang, “Bilevel optimization: Convergence analysis and enhanced design,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 4882–4892, PMLR, 2021

  42. [42]

    Robust decentralized learning with local updates and gradient tracking,

    S. Ghiasvand, A. Reisizadeh, M. Alizadeh, and R. Pedarsani, “Robust decentralized learning with local updates and gradient tracking,”IEEE Transactions on Networking, 2025

  43. [43]

    Decentralized gradient tracking with local steps,

    Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,”Optim. Methods Softw., pp. 1–28, 2024. 18 arXiv Template A PREPRINT

  44. [44]

    On the performance of gradient tracking with local updates,

    E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” inProc. IEEE Conf. Decision and Control (CDC), pp. 4309–4313, 2023

  45. [45]

    Local exact-diffusion for decentralized optimization and learning,

    S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” IEEE Trans. Autom. Con- trol, vol. 69, pp. 7371–7386, 2024

  46. [46]

    Fast decentralized gradient tracking for federated minimax optimization with local updates,

    C. J. Li, “Fast decentralized gradient tracking for federated minimax optimization with local updates,” arXiv:2405.04566, 2024

  47. [47]

    Nest your adaptive algorithm for parameter-agnostic nonconvex minimax optimiza- tion,

    J. Yang, X. Li, and N. He, “Nest your adaptive algorithm for parameter-agnostic nonconvex minimax optimiza- tion,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 11202–11216, 2022

  48. [48]

    Tight analysis of extra-gradient and optimistic gradient meth- ods for nonconvex minimax problems,

    P. Mahdavinia, Y . Deng, H. Li, and M. Mahdavi, “Tight analysis of extra-gradient and optimistic gradient meth- ods for nonconvex minimax problems,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 31213–31225, 2022

  49. [49]

    A single-loop smoothed gradient descent-ascent algorithm for nonconvex- concave min-max problems,

    J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex- concave min-max problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 7377–7389, 2020

  50. [50]

    Efficient search of first-order nash equilibria in nonconvex- concave smooth min-max problems,

    D. M. Ostrovskii, A. Lowy, and M. Razaviyayn, “Efficient search of first-order nash equilibria in nonconvex- concave smooth min-max problems,”SIAM J. Optim., vol. 31, no. 4, pp. 2508–2538, 2021

  51. [51]

    An accelerated inexact proximal point method for solving nonconvex-concave min-max problems,

    W. Kong and R. D. Monteiro, “An accelerated inexact proximal point method for solving nonconvex-concave min-max problems,”SIAM J. Optim., vol. 31, no. 4, pp. 2558–2585, 2021

  52. [52]

    Sapd+: An accelerated stochastic method for nonconvex- concave minimax problems,

    X. Zhang, N. S. Aybat, and M. Gurbuzbalaban, “Sapd+: An accelerated stochastic method for nonconvex- concave minimax problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 21668–21681, 2022

  53. [53]

    A unified single-loop alternating gradient projection algorithm for nonconvex–concave and convex–nonconcave minimax problems,

    Z. Xu, H. Zhang, Y . Xu, and G. Lan, “A unified single-loop alternating gradient projection algorithm for nonconvex–concave and convex–nonconcave minimax problems,” Math. Program., vol. 201, no. 1, pp. 635– 706, 2023

  54. [54]

    Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax prob- lems,

    R. I. Bo¸ t and A. Böhm, “Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax prob- lems,”SIAM J. Optim., vol. 33, no. 3, pp. 1884–1913, 2023

  55. [55]

    Global convergence and variance reduction for a class of nonconvex- nonconcave minimax problems,

    J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex- nonconcave minimax problems,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 1153–1165, 2020

  56. [56]

    Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,

    H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” inProc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases (ECML PKDD), pp. 795–811, 2016

  57. [57]

    Sarah: A novel method for machine learning problems using stochastic recursive gradient,

    L. M. Nguyen, J. Liu, K. Scheinberg, and M. Taká ˇc, “Sarah: A novel method for machine learning problems using stochastic recursive gradient,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 2613–2621, PMLR, 2017

  58. [58]

    Accelerating stochastic gradient descent using predictive variance reduction,

    R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 26, 2013

  59. [59]

    Stochastic variance reduction for nonconvex optimiza- tion,

    S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimiza- tion,” inProc. Int. Conf. Mach. Learn. (ICML), pp. 314–323, PMLR, 2016

  60. [60]

    Momentum-based variance reduction in non-convex sgd,

    A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  61. [61]

    Storm+: Fully adaptive sgd with recursive momentum for nonconvex opti- mization,

    K. Levy, A. Kavis, and V . Cevher, “Storm+: Fully adaptive sgd with recursive momentum for nonconvex opti- mization,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 20571–20582, 2021

  62. [62]

    Reducing the variance in online optimization by transporting past gradients,

    S. Arnold, P. A. Manzagol, R. Babanezhad Harikandeh, I. Mitliagkas, and N. Le Roux, “Reducing the variance in online optimization by transporting past gradients,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019

  63. [63]

    Complexity lower bounds for nonconvex-strongly-concave min-max optimization,

    H. Li, Y . Tian, J. Zhang, and A. Jadbabaie, “Complexity lower bounds for nonconvex-strongly-concave min-max optimization,” inAdv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 1792–1804, 2021

  64. [64]

    Near-optimal algorithms for minimax optimization,

    T. Lin, C. Jin, and M. I. Jordan, “Near-optimal algorithms for minimax optimization,” in Proc. Conf. Learn. Theory (COLT), pp. 2738–2779, PMLR, 2020

  65. [65]

    A catalyst framework for minimax optimization,

    J. Yang, S. Zhang, N. Kiyavash, and N. He, “A catalyst framework for minimax optimization,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 5667–5678, 2020

  66. [66]

    Fast exact multiplication by the hessian,

    B. A. Pearlmutter, “Fast exact multiplication by the hessian,” Neural Comput., vol. 6, no. 1, pp. 147–160, 1994

  67. [67]

    Automatic differentiation in pytorch,

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” inNIPS Workshop on Autodiff, 2017. 19 arXiv Template A PREPRINT

  68. [68]

    Accelerated conjugate gradient algorithm with finite difference hessian/vector product approximation for unconstrained optimization,

    N. Andrei, “Accelerated conjugate gradient algorithm with finite difference hessian/vector product approximation for unconstrained optimization,”J. Comput. Appl. Math., vol. 230, no. 2, pp. 570–582, 2009

  69. [69]

    Adaptive subgradient methods for online learning and stochastic optimiza- tion,

    J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradient methods for online learning and stochastic optimiza- tion,”J. Mach. Learn. Res., vol. 12, no. 7, 2011

  70. [70]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014

  71. [71]

    A. H. Sayed, Inference and Learning from Data. Cambridge University Press, 2022

  72. [72]

    Super-adam: Faster and universal framework of adaptive gradients,

    F. Huang, J. Li, and H. Huang, “Super-adam: Faster and universal framework of adaptive gradients,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, pp. 9074–9085, 2021

  73. [73]

    On the convergence of decentralized adaptive gradient methods,

    X. Chen, B. Karimi, W. Zhao, and P. Li, “On the convergence of decentralized adaptive gradient methods,” in Proc. Asian Conf. Mach. Learn. (ACML), pp. 217–232, PMLR, 2023

  74. [74]

    Stochastic primal-dual algorithms with faster convergence than O(1/ √ T ) for problems without bilinear structure,

    Y . Yan, Y . Xu, Q. Lin, L. Zhang, and T. Yang, “Stochastic primal-dual algorithms with faster convergence than O(1/ √ T ) for problems without bilinear structure,”arXiv:1904.10112, 2019

  75. [75]

    Agnostic federated learning,

    M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 4615–4625, 2019

  76. [76]

    On the Convergence of Adam and Beyond

    S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv:1904.09237, 2019. 20 arXiv Template A PREPRINT The supplemental material hereafter provides missing algorithmic details for the single-agent algorithm AdaHMM and proof details for Theorems 4.2 and 4.3. We begin our analysis by presenting the convergence proof for the single...