Distributed Learning with Adversarial Gradient Perturbations

Nawapon Sangsiri; Yufei Tao

arxiv: 2605.03313 · v1 · submitted 2026-05-05 · 💻 cs.LG

Distributed Learning with Adversarial Gradient Perturbations

Nawapon Sangsiri , Yufei Tao This is my paper

Pith reviewed 2026-05-07 17:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords distributed learningadversarial gradient perturbationsub-optimality gapquery complexityconvex optimizationfederated learningrobust optimization

0 comments

The pith

Adversarial but distance-bounded gradient perturbations in distributed learning impose a tight lower bound on the achievable sub-optimality gap for convex L-smooth functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines distributed optimization where clients may return gradients that deviate from the true ones by at most a fixed distance, modeling privacy or adversarial behavior. It determines the smallest excess error that any algorithm can guarantee despite such responses, and shows this bound is tight. Matching upper bounds come from algorithms whose query complexity to the server is provably bounded. A reader cares because this quantifies the unavoidable cost of robustness in settings like federated learning where full trust in client data is unrealistic.

Core claim

For convex and L-smooth objective functions, when each gradient reply can be adversarially altered within a distance bound, there exist tight feasibility thresholds for the sub-optimality gap. Algorithms are provided that achieve these thresholds, with explicit guarantees on the number of queries needed to reach a target gap.

What carries the argument

The minimax analysis that derives matching lower and upper bounds on the sub-optimality gap as a function of the perturbation radius and smoothness parameter, together with algorithms attaining the bound.

If this is right

The minimal achievable sub-optimality gap scales directly with the maximum allowed perturbation distance.
Algorithms exist that attain this exact gap using a number of server queries bounded in terms of the target accuracy and problem parameters.
No algorithm can guarantee a smaller gap than the established threshold against worst-case bounded perturbations.
The query complexity results give explicit communication budgets for reaching a prescribed accuracy level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In practice, systems might add verification steps to keep perturbations below the threshold that preserves desired accuracy.
Similar threshold analyses could be applied to non-convex objectives to see how the error floor changes.
The query bounds suggest ways to allocate communication rounds in noisy multi-agent optimization settings.
These limits highlight the value of designing incentives or checks that reduce effective perturbation size.

Load-bearing premise

The objective functions must be convex and L-smooth, and every adversarial perturbation is bounded in distance from the true gradient.

What would settle it

An algorithm that achieves strictly smaller sub-optimality than the derived threshold while using only the claimed number of queries, or a perturbation strategy within the distance bound that forces every algorithm to exceed the threshold.

Figures

Figures reproduced from arXiv: 2605.03313 by Nawapon Sangsiri, Yufei Tao.

**Figure 1.** Figure 1: Behavior of optimization under an ϵ-AGP oracle • Opposing: the oracle perturbs ∇f(w) by distance ϵ toward the direction opposite to that of ∇f(w); • Amplifying: the oracle perturbs ∇f(w) by distance ϵ in the same direction as ∇f(w); • Fixed-Direction: the oracle perturbs ∇f(w) by distance ϵ toward a fixed direction, chosen to be the negative direction of the first dimension of R d view at source ↗

**Figure 2.** Figure 2: Effects of query allocation in distributed learning view at source ↗

read the original abstract

Privacy concerns in distributed learning often lead clients to return intentionally altered gradient information. We consider the problem of learning convex and $L$-smooth functions under adversarial gradient perturbation, where a client's gradient reply to a server query can deviate arbitrarily from the true gradient subject to a distance bound. Our study focuses on two fundamental questions: (i) what is the smallest achievable sub-optimality gap (i.e., excess error in optimization) under such responses, and (ii) how many queries are sufficient to guarantee a given sub-optimality gap? We establish tight feasibility thresholds on the sub-optimality gap and provide algorithms that achieve these thresholds with provable query complexity guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper pins down tight sub-optimality thresholds and query bounds for convex distributed optimization when each returned gradient can deviate by a fixed distance.

read the letter

The main takeaway is that even when clients return gradients shifted by at most a fixed distance, convex L-smooth optimization still has a positive but bounded excess error, and the paper gives matching lower and upper bounds plus algorithms that hit them with explicit query counts. This is new in the distributed setting: prior work on adversarial gradients or privacy usually allows unbounded changes or uses different models like differential privacy, so the exact distance-bounded per-query case with tight feasibility thresholds does not appear earlier. The paper does well by delivering both the impossibility side and constructive algorithms that achieve the thresholds, which makes the result complete rather than one-sided. The assumptions are stated plainly as convexity, L-smoothness, and a uniform distance bound on perturbations, and these are enough to support standard minimax arguments without circularity. Soft spots are minor. The query complexity counts server queries but may understate total communication volume across clients, and the model assumes the server can issue repeated queries to the same clients, which is reasonable but not universal. Without the full proofs it is impossible to check every constant or edge case in the analysis, but the structure shows no load-bearing gaps. This work is for theorists working on robust or privacy-aware distributed optimization. A reader who needs concrete limits on excess error under bounded gradient tampering will find usable numbers here. It deserves a serious referee because the claims are specific, the assumptions are minimal, and the approach is grounded in convex analysis.

Referee Report

0 major / 1 minor

Summary. The paper studies distributed optimization of convex L-smooth objectives in which each gradient response to a server query may be adversarially altered subject to a fixed distance bound. It poses two questions: the smallest achievable sub-optimality gap under such perturbations and the query complexity sufficient to guarantee a target gap. The central claims are that tight feasibility thresholds exist for the sub-optimality gap and that matching algorithms attain these thresholds with explicit query-complexity bounds.

Significance. If the stated thresholds and matching bounds are correct, the work supplies clean minimax characterizations for convex optimization under per-query bounded adversarial gradient noise. Such results would be useful for understanding fundamental limits in privacy-aware or robust federated learning. The assumptions (convexity, L-smoothness, and a uniform perturbation radius) are standard and sufficient for the usual lower-bound constructions via convex analysis.

minor comments (1)

The abstract states that tight thresholds and provable query-complexity guarantees are established, yet the provided text contains no explicit statements of the thresholds, the algorithms, or the proof sketches. A reader cannot assess tightness without these details.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for summarizing our work on distributed convex L-smooth optimization under bounded adversarial gradient perturbations and for noting its potential relevance to privacy-aware and robust federated learning. The recommendation of 'uncertain' appears to stem from a desire to confirm the correctness of the claimed tight feasibility thresholds and matching query-complexity bounds. We maintain that these characterizations are correct, as they follow from standard minimax arguments via convex analysis for the lower bounds and explicit algorithm constructions (with explicit query bounds) for the upper bounds. No specific major comments were listed in the report, so we have no point-by-point items to address.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives tight feasibility thresholds on sub-optimality and matching query-complexity bounds for convex L-smooth objectives under bounded adversarial gradient perturbations. These results rest on explicitly stated external assumptions (convexity, L-smoothness, and a fixed perturbation distance bound) and standard minimax arguments from convex analysis. No derivation step reduces by construction to a quantity defined in terms of the target result itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on a self-citation chain or imported uniqueness theorem. The argument structure is self-contained against the declared assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two standard domain assumptions from optimization theory; no free parameters or new entities are introduced.

axioms (2)

domain assumption The objective function is convex and L-smooth.
Invoked to obtain convergence rates and to define the optimization setting.
domain assumption Adversarial gradient perturbations are bounded in Euclidean distance from the true gradient.
Defines the adversarial model under which the feasibility thresholds are derived.

pith-pipeline@v0.9.0 · 5401 in / 1230 out tokens · 115909 ms · 2026-05-07T17:25:42.800014+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1]

Bartlett, Pradeep Ravikumar, and Martin J

Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex opti- mization.IEEE Transactions on Information Theory, 58(5):3235–3249, 2012

work page 2012
[2]

Byzantine stochastic gradient descent

Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Proceedings of Neural Information Processing Systems (NIPS), pages 4618–4628, 2018

work page 2018
[3]

Machine learning with adversaries: Byzantine tolerant gradient descent

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. InProceedings of Neural Information Processing Systems (NIPS), pages 119–129, 2017

work page 2017
[4]

Stochastic first-order methods for convex and nonconvex functional constrained optimization.Mathematical Programming, 197(1):215– 279, 2023

Digvijay Boob, Qi Deng, and Guanghui Lan. Stochastic first-order methods for convex and nonconvex functional constrained optimization.Mathematical Programming, 197(1):215– 279, 2023

work page 2023
[5]

Convex optimization: Algorithms and complexity.Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015

Sebastien Bubeck. Convex optimization: Algorithms and complexity.Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015

work page 2015
[6]

Efficient coreset selection with cluster-based methods

Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang. Efficient coreset selection with cluster-based methods. InProceedings of ACM Knowledge Discovery and Data Mining (SIGKDD), pages 167–178, 2023

work page 2023
[7]

Gradient descent: Robustness to adversarial corruption

Fu-Chieh Chang, Farhang Nabiei, Pei-Yuan Wu, Alexandru Cioba, Sattar Vakili, and Al- berto Bernacchia. Gradient descent: Robustness to adversarial corruption. InProceedings of International OPT Workshop on Optimization for Machine Learning, 2022

work page 2022
[8]

Smooth optimization with approximate gradient.SIAM Journal on Optimization, 19(3):1171–1183, 2008

Alexandre d’Aspremont. Smooth optimization with approximate gradient.SIAM Journal on Optimization, 19(3):1171–1183, 2008

work page 2008
[9]

Optimal distributed online prediction using mini-batches.Journal of Machine Learning Research (JMLR), 13:165–202, 2012

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches.Journal of Machine Learning Research (JMLR), 13:165–202, 2012

work page 2012
[10]

Nesterov

Olivier Devolder, Francois Glineur, and Yurii E. Nesterov. First-order methods of smooth convex optimization with inexact oracle.Mathematical Programming, 146(1-2):37–75, 2014

work page 2014
[11]

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochas- tic) gradient methods.ArXiv, abs/2301.11235, 2024

work page arXiv 2024
[12]

A study of first-order methods with a determinis- tic relative-error gradient oracle

Nadav Hallak and Kfir Yehuda Levy. A study of first-order methods with a determinis- tic relative-error gradient oracle. InProceedings of International Conference on Machine Learning (ICML), 2024

work page 2024
[13]

Freris, and Hu Ding

Jiawei Huang, Ruomin Huang, Wenjie Liu, Nikolaos M. Freris, and Hu Ding. A novel sequential coreset method for gradient descent algorithms. InProceedings of International Conference on Machine Learning (ICML), pages 4412–4422, 2021. 12

work page 2021
[14]

How to learn when data gradually reacts to your model

Zachary Izzo, James Zou, and Lexing Ying. How to learn when data gradually reacts to your model. InProceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3998–4035, 2022

work page 2022
[15]

Revisiting frank-wolfe: Projection-free sparse convex optimization

Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. InPro- ceedings of International Conference on Machine Learning (ICML), pages 427–435, 2013

work page 2013
[16]

Learning from history for byzantine robust optimization

Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. InProceedings of International Conference on Machine Learning (ICML), pages 5311–5319, 2021

work page 2021
[17]

Kerger, Marco Molinaro, Hongyi Jiang, and Amitabh Basu

Phillip A. Kerger, Marco Molinaro, Hongyi Jiang, and Amitabh Basu. A universal transfer theorem for convex optimization algorithms using inexact first-order oracles. InProceedings of International Conference on Machine Learning (ICML), 2024

work page 2024
[18]

Sub-sampled cubic regularization for non-convex optimization

Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. InProceedings of International Conference on Machine Learning (ICML), pages 1895–1904, 2017

work page 1904
[19]

Springer, 2020

Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020

work page 2020
[20]

Learning rates for stochastic gradient descent with nonconvex objectives.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(12):4505–4511, 2021

Yunwen Lei and Ke Tang. Learning rates for stochastic gradient descent with nonconvex objectives.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(12):4505–4511, 2021

work page 2021
[21]

Federated learning: Challenges, methods, and future directions.IEEE Signal Processing Magazine, 37(3):50– 60, 2020

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions.IEEE Signal Processing Magazine, 37(3):50– 60, 2020

work page 2020
[22]

Bilmes, and Jure Leskovec

Baharan Mirzasoleiman, Jeff A. Bilmes, and Jure Leskovec. Coresets for data-efficient train- ing of machine learning models. InProceedings of International Conference on Machine Learning (ICML), pages 6950–6960, 2020

work page 2020
[23]

MIT Press, 2018

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT Press, 2018

work page 2018
[24]

Juditsky, Guanghui Lan, and Alexander Shapiro

Arkadi Nemirovski, Anatoli B. Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Opti- mization, 19(4):1574–1609, 2009

work page 2009
[25]

John Wiley & Sons, 1983

Arkadi Nemirovsky and David Berkovich Yudin.Problem Complexity and Method Effi- ciency in Optimization. John Wiley & Sons, 1983

work page 1983
[26]

Nesterov

Yurii E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM Journal on Optimization, 22(2):341–362, 2012

work page 2012
[27]

Tsybakov

Alexandre B. Tsybakov. Optimal rates of aggregation. In Bernhard Scholkopf and Man- fred K. Warmuth, editors,Proceedings of Conference on Learning Theory (COLT), volume 2777, pages 303–313, 2003

work page 2003
[28]

Shuche Wang and Vincent Y. F. Tan. Robust distributed gradient descent to corruption over noisy channels. InIEEE International Symposium on Information Theory, pages 2520–2525, 2024

work page 2024
[29]

Shuche Wang and Vincent Y. F. Tan. A mirror descent-based algorithm for corruption- tolerant distributed gradient descent.IEEE Transactions on Signal Processing, 73:827–842, 2025. 13

work page 2025
[30]

if-branch

Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter L. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. InProceedings of International Conference on Machine Learning (ICML), pages 5636–5645, 2018. Appendix A Completing the Proofs of Theorems 1 and 2 This section discusses the functionfconstructed in the proof of Theorem...

work page 2018
[31]

by magic

studied the behavior of gradient descent (GD) when it runs under anϵ-AGP, as described in the pseudocode below: algorithm GD(K) 1.w 0 ←0,k←0 2.whilek < Kdo 3.g k ←oracle(w k) /* output of theϵ-AGP oracle */ 4.w k+1 ←w k − 1 2L gk 5.k←k+ 1 6.returnw k For anyK≥1, [14, Lemma 5] proved K min k=1 ∥∇f(w k)∥=O r L·U K +Bϵ (35) where U= K max k=1 f(w k) 18 B= K ...

work page
[32]

9”, “13”, “17

did not analyze the sub-optimality gap ofw +. Nevertheless, we can do an analysis for them by resorting to our theory. For this purpose, let us augment their algorithm by requiring it to return the currentw k as soon as∥g k∥falls below 4ϵ(as in Line 4 ofAGP-opt). If the algorithm terminates this way, then our Lemma 4 applies; otherwise, the sub-optimality...

work page

[1] [1]

Bartlett, Pradeep Ravikumar, and Martin J

Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex opti- mization.IEEE Transactions on Information Theory, 58(5):3235–3249, 2012

work page 2012

[2] [2]

Byzantine stochastic gradient descent

Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Proceedings of Neural Information Processing Systems (NIPS), pages 4618–4628, 2018

work page 2018

[3] [3]

Machine learning with adversaries: Byzantine tolerant gradient descent

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. InProceedings of Neural Information Processing Systems (NIPS), pages 119–129, 2017

work page 2017

[4] [4]

Stochastic first-order methods for convex and nonconvex functional constrained optimization.Mathematical Programming, 197(1):215– 279, 2023

Digvijay Boob, Qi Deng, and Guanghui Lan. Stochastic first-order methods for convex and nonconvex functional constrained optimization.Mathematical Programming, 197(1):215– 279, 2023

work page 2023

[5] [5]

Convex optimization: Algorithms and complexity.Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015

Sebastien Bubeck. Convex optimization: Algorithms and complexity.Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015

work page 2015

[6] [6]

Efficient coreset selection with cluster-based methods

Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang. Efficient coreset selection with cluster-based methods. InProceedings of ACM Knowledge Discovery and Data Mining (SIGKDD), pages 167–178, 2023

work page 2023

[7] [7]

Gradient descent: Robustness to adversarial corruption

Fu-Chieh Chang, Farhang Nabiei, Pei-Yuan Wu, Alexandru Cioba, Sattar Vakili, and Al- berto Bernacchia. Gradient descent: Robustness to adversarial corruption. InProceedings of International OPT Workshop on Optimization for Machine Learning, 2022

work page 2022

[8] [8]

Smooth optimization with approximate gradient.SIAM Journal on Optimization, 19(3):1171–1183, 2008

Alexandre d’Aspremont. Smooth optimization with approximate gradient.SIAM Journal on Optimization, 19(3):1171–1183, 2008

work page 2008

[9] [9]

Optimal distributed online prediction using mini-batches.Journal of Machine Learning Research (JMLR), 13:165–202, 2012

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches.Journal of Machine Learning Research (JMLR), 13:165–202, 2012

work page 2012

[10] [10]

Nesterov

Olivier Devolder, Francois Glineur, and Yurii E. Nesterov. First-order methods of smooth convex optimization with inexact oracle.Mathematical Programming, 146(1-2):37–75, 2014

work page 2014

[11] [11]

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochas- tic) gradient methods.ArXiv, abs/2301.11235, 2024

work page arXiv 2024

[12] [12]

A study of first-order methods with a determinis- tic relative-error gradient oracle

Nadav Hallak and Kfir Yehuda Levy. A study of first-order methods with a determinis- tic relative-error gradient oracle. InProceedings of International Conference on Machine Learning (ICML), 2024

work page 2024

[13] [13]

Freris, and Hu Ding

Jiawei Huang, Ruomin Huang, Wenjie Liu, Nikolaos M. Freris, and Hu Ding. A novel sequential coreset method for gradient descent algorithms. InProceedings of International Conference on Machine Learning (ICML), pages 4412–4422, 2021. 12

work page 2021

[14] [14]

How to learn when data gradually reacts to your model

Zachary Izzo, James Zou, and Lexing Ying. How to learn when data gradually reacts to your model. InProceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3998–4035, 2022

work page 2022

[15] [15]

Revisiting frank-wolfe: Projection-free sparse convex optimization

Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. InPro- ceedings of International Conference on Machine Learning (ICML), pages 427–435, 2013

work page 2013

[16] [16]

Learning from history for byzantine robust optimization

Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. InProceedings of International Conference on Machine Learning (ICML), pages 5311–5319, 2021

work page 2021

[17] [17]

Kerger, Marco Molinaro, Hongyi Jiang, and Amitabh Basu

Phillip A. Kerger, Marco Molinaro, Hongyi Jiang, and Amitabh Basu. A universal transfer theorem for convex optimization algorithms using inexact first-order oracles. InProceedings of International Conference on Machine Learning (ICML), 2024

work page 2024

[18] [18]

Sub-sampled cubic regularization for non-convex optimization

Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. InProceedings of International Conference on Machine Learning (ICML), pages 1895–1904, 2017

work page 1904

[19] [19]

Springer, 2020

Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020

work page 2020

[20] [20]

Learning rates for stochastic gradient descent with nonconvex objectives.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(12):4505–4511, 2021

Yunwen Lei and Ke Tang. Learning rates for stochastic gradient descent with nonconvex objectives.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(12):4505–4511, 2021

work page 2021

[21] [21]

Federated learning: Challenges, methods, and future directions.IEEE Signal Processing Magazine, 37(3):50– 60, 2020

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions.IEEE Signal Processing Magazine, 37(3):50– 60, 2020

work page 2020

[22] [22]

Bilmes, and Jure Leskovec

Baharan Mirzasoleiman, Jeff A. Bilmes, and Jure Leskovec. Coresets for data-efficient train- ing of machine learning models. InProceedings of International Conference on Machine Learning (ICML), pages 6950–6960, 2020

work page 2020

[23] [23]

MIT Press, 2018

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT Press, 2018

work page 2018

[24] [24]

Juditsky, Guanghui Lan, and Alexander Shapiro

Arkadi Nemirovski, Anatoli B. Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Opti- mization, 19(4):1574–1609, 2009

work page 2009

[25] [25]

John Wiley & Sons, 1983

Arkadi Nemirovsky and David Berkovich Yudin.Problem Complexity and Method Effi- ciency in Optimization. John Wiley & Sons, 1983

work page 1983

[26] [26]

Nesterov

Yurii E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM Journal on Optimization, 22(2):341–362, 2012

work page 2012

[27] [27]

Tsybakov

Alexandre B. Tsybakov. Optimal rates of aggregation. In Bernhard Scholkopf and Man- fred K. Warmuth, editors,Proceedings of Conference on Learning Theory (COLT), volume 2777, pages 303–313, 2003

work page 2003

[28] [28]

Shuche Wang and Vincent Y. F. Tan. Robust distributed gradient descent to corruption over noisy channels. InIEEE International Symposium on Information Theory, pages 2520–2525, 2024

work page 2024

[29] [29]

Shuche Wang and Vincent Y. F. Tan. A mirror descent-based algorithm for corruption- tolerant distributed gradient descent.IEEE Transactions on Signal Processing, 73:827–842, 2025. 13

work page 2025

[30] [30]

if-branch

Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter L. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. InProceedings of International Conference on Machine Learning (ICML), pages 5636–5645, 2018. Appendix A Completing the Proofs of Theorems 1 and 2 This section discusses the functionfconstructed in the proof of Theorem...

work page 2018

[31] [31]

by magic

studied the behavior of gradient descent (GD) when it runs under anϵ-AGP, as described in the pseudocode below: algorithm GD(K) 1.w 0 ←0,k←0 2.whilek < Kdo 3.g k ←oracle(w k) /* output of theϵ-AGP oracle */ 4.w k+1 ←w k − 1 2L gk 5.k←k+ 1 6.returnw k For anyK≥1, [14, Lemma 5] proved K min k=1 ∥∇f(w k)∥=O r L·U K +Bϵ (35) where U= K max k=1 f(w k) 18 B= K ...

work page

[32] [32]

9”, “13”, “17

did not analyze the sub-optimality gap ofw +. Nevertheless, we can do an analysis for them by resorting to our theory. For this purpose, let us augment their algorithm by requiring it to return the currentw k as soon as∥g k∥falls below 4ϵ(as in Line 4 ofAGP-opt). If the algorithm terminates this way, then our Lemma 4 applies; otherwise, the sub-optimality...

work page