Unveiling High-Probability Generalization in Decentralized SGD

Dongsheng Li; Jiahuan Wang; Ping Luo; Tao Sun; Ziqing Wen

arxiv: 2605.10205 · v1 · submitted 2026-05-11 · 💻 cs.LG

Unveiling High-Probability Generalization in Decentralized SGD

Jiahuan Wang , Ping Luo , Ziqing Wen , Dongsheng Li , Tao Sun This is my paper

Pith reviewed 2026-05-12 05:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords decentralized SGDhigh-probability generalizationpointwise uniform stabilitydistributed learningconvex optimizationnon-convex optimizationgeneralization boundscommunication overhead

0 comments

The pith

Decentralized SGD achieves optimal high-probability generalization bounds at O(1/sqrt(mn) log(1/δ))

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a high-probability learning theory for decentralized stochastic gradient descent that matches the optimal rate known for standard SGD. Prior D-SGD analyses gave expected bounds but high-probability versions carried a worse dependence on the failure probability δ. By refining the analysis with pointwise uniform stability, a weaker condition than full uniform stability, the authors obtain the improved rate across convex, strongly convex, and non-convex regimes. They also bound optimization error, excess risk, and gradient measures while folding in communication costs. Readers care because the result supplies reliable worst-case guarantees for large-scale distributed training.

Core claim

We develop a high-probability learning theory for D-SGD aiming for the optimal O(1/sqrt(mn) log(1/δ)) rate. We refine bounds using pointwise uniform stability in distributed learning and analyze them in convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures where only local minima exist and derive optimization error and excess risk bounds while accounting for communication overhead in time-varying frameworks.

What carries the argument

Pointwise uniform stability in distributed learning, which bounds the effect of changing one sample on the output model to yield high-probability generalization control under decentralized communication.

If this is right

High-probability excess risk bounds reach the optimal O(1/sqrt(mn) log(1/δ)) rate in convex settings.
Non-convex cases obtain high-probability bounds on gradient norms at local minima.
Optimization error and excess risk are controlled jointly with the generalization term.
Local models retain generalization bounds in time-varying communication frameworks after accounting for overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Decentralization itself need not degrade high-probability guarantees once pointwise stability is established.
The same stability technique may extend directly to other decentralized first-order methods.
Empirical checks on heterogeneous data partitions would reveal whether the derived rates remain tight in practice.

Load-bearing premise

Pointwise uniform stability remains a valid and sufficiently strong notion under the decentralized communication constraints and loss-function assumptions of the D-SGD setting.

What would settle it

An empirical run of D-SGD on a fixed dataset and network where the fraction of trials exceeding the claimed generalization error bound is substantially larger than δ would falsify the result.

read the original abstract

Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{\delta \sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning-a weaker notion than uniform stability-and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a high-probability generalization theory for decentralized SGD (D-SGD). It refines bounds using pointwise uniform stability (a weaker notion than uniform stability) to achieve the optimal rate O(1/sqrt(mn) log(1/δ)) across convex, strongly convex, and non-convex settings. Additional results cover high-probability bounds for gradient-based measures at local minima, optimization error, excess risk, and generalization for local models under time-varying communication graphs while accounting for communication overhead.

Significance. If the central claims hold, the work would close an important gap between expected-risk analyses and high-probability guarantees for D-SGD, matching the optimal rates known for centralized SGD. The extension to non-convex regimes and explicit treatment of communication overhead could inform algorithm design in large-scale distributed training.

major comments (2)

[Abstract and §3] Abstract and §3 (Stability Definition): the claim that pointwise uniform stability suffices for the optimal O(1/sqrt(mn) log(1/δ)) rate is load-bearing, yet the provided description does not show how a local perturbation at one worker propagates through repeated averaging steps on a (possibly time-varying) graph. Without an explicit bound on the effective stability parameter that incorporates the mixing time or spectral gap, the derived high-probability bound may acquire extra factors polynomial in m or the number of communication rounds.
[§5] §5 (Non-convex Analysis): the high-probability results for gradient-based measures at local minima rely on the same stability argument; if the cross-worker dependence is not controlled, the excess-risk bound will not be parameter-free with respect to the communication topology as stated.

minor comments (2)

[Abstract] Abstract: the rate notation O(1/sqrt(mn) log(1/δ)) should explicitly define m (number of workers) and n (local sample size) on first use.
[Throughout] Notation: ensure δ consistently denotes the failure probability across all theorems and corollaries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer exposition on perturbation propagation and cross-worker dependence in our stability analysis. We address each major comment below with references to the existing proofs and indicate where we will add clarifications to the main text.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Stability Definition): the claim that pointwise uniform stability suffices for the optimal O(1/sqrt(mn) log(1/δ)) rate is load-bearing, yet the provided description does not show how a local perturbation at one worker propagates through repeated averaging steps on a (possibly time-varying) graph. Without an explicit bound on the effective stability parameter that incorporates the mixing time or spectral gap, the derived high-probability bound may acquire extra factors polynomial in m or the number of communication rounds.

Authors: We agree the main-text description in §3 is brief. The appendix proofs (e.g., Lemma 3.2 and Theorem 3.1) explicitly track perturbation propagation: a single-sample change at one worker affects its local gradient step, after which the mixing matrices W_t contract the difference vector by the spectral gap factor (1-λ) per round. Summing the geometric series over T rounds and averaging over m workers yields an effective pointwise stability parameter of O(1/sqrt(mn)) with no polynomial dependence on m or T (the log(1/δ) arises solely from the high-probability tail). We will revise §3 to include a short paragraph summarizing this contraction argument and citing the relevant appendix lemmas. revision: yes
Referee: [§5] §5 (Non-convex Analysis): the high-probability results for gradient-based measures at local minima rely on the same stability argument; if the cross-worker dependence is not controlled, the excess-risk bound will not be parameter-free with respect to the communication topology as stated.

Authors: In §5 the gradient-norm bounds at local minima are obtained by applying the same pointwise stability to the averaged model sequence. Cross-worker dependence is controlled because stability is measured on the global average iterate; the uniform mixing assumption (bounded second-largest eigenvalue of the time-varying graphs) ensures the averaged model’s sensitivity remains O(1/sqrt(mn)) independently of any particular topology beyond the mixing condition. The union bound over m workers contributes only an extra log m factor, which is absorbed into the log(1/δ) term. We will add a clarifying remark in §5 stating this topology independence under the stated mixing assumption. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation relies on independent stability analysis

full rationale

The paper presents its high-probability generalization bounds for D-SGD as derived from pointwise uniform stability analysis applied to the decentralized setting. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The claimed O(1/sqrt(mn) log(1/δ)) rate is positioned as a refinement obtained by extending stability arguments across convex/non-convex cases while accounting for communication, without reducing to a tautology or prior result by the same authors. The analysis is self-contained against external benchmarks of stability-based generalization.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; standard domain assumptions for generalization bounds are likely invoked but not enumerated.

axioms (2)

domain assumption Loss functions satisfy convexity or smoothness conditions appropriate to each setting (convex, strongly convex, non-convex).
Standard background assumption for stability-based generalization analysis.
domain assumption The decentralized communication graph permits the D-SGD update rule.
Implicit in the problem formulation.

pith-pipeline@v0.9.0 · 5548 in / 1094 out tokens · 62635 ms · 2026-05-12T05:05:56.894236+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages

[1]

Proceedings of the National Academy of Sciences , volume=

Theoretical issues in deep networks , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020
[2]

Proceedings of the National Academy of Sciences , volume=

Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

work page 2019
[3]

International Conference on Machine Learning (ICML) , year=

Stability and generalization analysis of decentralized SGD: sharper bounds beyond lipschitzness and smoothness , author=. International Conference on Machine Learning (ICML) , year=

work page
[4]

SIAM Journal on optimization , volume=

Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on optimization , volume=. 2009 , publisher=

work page 2009
[5]

International Conference on Artificial Intelligence and Statistics , pages=

Generalization bounds of nonconvex-(strongly)-concave stochastic minimax optimization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

work page 2024
[6]

Mathematics of Operations Research , volume=

Graphical convergence of subgradients in nonconvex optimization and learning , author=. Mathematics of Operations Research , volume=

work page
[7]

The Twelfth International Conference on Learning Representations , year=

General stability analysis for zeroth-order optimization algorithms , author=. The Twelfth International Conference on Learning Representations , year=

work page
[8]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Stability and Generalization of Zeroth-Order Decentralized Stochastic Gradient Descent with Changing Topology , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[9]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

On the loss landscape of adversarial training: Identifying challenges and how to overcome them , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page
[10]

Neural Networks , volume=

Data-dependent stability analysis of adversarial training , author=. Neural Networks , volume=. 2025 , publisher=

work page 2025
[11]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Stability analysis and generalization bounds of adversarial training , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page
[12]

Don’t use large mini-b atches, use local SGD,

Don't use large mini-batches, use local sgd , author=. arXiv preprint arXiv:1808.07217 , year=

work page arXiv
[13]

2024 , journal=

On the generalization of stochastic gradient descent with momentum , author=. 2024 , journal=

work page 2024
[14]

The Eleventh International Conference on Learning Representations (ICLR) , year=

Exponential generalization bounds with near-optimal rates for L_q -stable algorithms , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=

work page
[15]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

L_2 -Uniform stability of randomized learning algorithms: Sharper generalization bounds and confidence boosting , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page
[16]

2018 , publisher=

High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

work page 2018
[17]

SIAM Journal on Optimization , volume=

Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM Journal on Optimization , volume=. 2013 , publisher=

work page 2013
[18]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Learning rates for stochastic gradient descent with nonconvex objectives , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

work page 2021
[19]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Uniform convergence of gradients for non-convex learning and optimization , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page
[20]

International Conference on Machine Learning (ICML) , pages=

Stability and generalization of learning algorithms that converge to global optima , author=. International Conference on Machine Learning (ICML) , pages=. 2018 , organization=

work page 2018
[21]

The collected works of Wassily Hoeffding , pages=

Probability inequalities for sums of bounded random variables , author=. The collected works of Wassily Hoeffding , pages=. 1994 , publisher=

work page 1994
[22]

SIAM Journal on Optimization , volume=

Achieving geometric convergence for distributed optimization over time-varying graphs , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

work page 2017
[23]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

work page 2016
[24]

International Conference on Machine Learning (ICML) , pages=

A unified theory of decentralized sgd with changing topology and local updates , author=. International Conference on Machine Learning (ICML) , pages=. 2020 , organization=

work page 2020
[25]

International Conference on Artificial Intelligence and Statistics , pages=

Refined convergence and topology learning for decentralized sgd with heterogeneous data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

work page 2023
[26]

Journal of Machine Learning Research , volume=

Removing data heterogeneity influence enhances network topology dependence of decentralized sgd , author=. Journal of Machine Learning Research , volume=

work page
[27]

SIAM Journal on Optimization , volume=

Extra: An exact first-order algorithm for decentralized consensus optimization , author=. SIAM Journal on Optimization , volume=. 2015 , publisher=

work page 2015
[28]

SIAM Journal on Optimization , volume=

On the convergence of decentralized gradient descent , author=. SIAM Journal on Optimization , volume=. 2016 , publisher=

work page 2016
[29]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Towards stability and generalization bounds in decentralized minibatch stochastic gradient descent , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[30]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Stability and generalization of the decentralized stochastic gradient descent ascent algorithm , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page
[31]

Applied and Computational Harmonic Analysis , volume=

High-probability generalization bounds for pointwise uniformly stable algorithms , author=. Applied and Computational Harmonic Analysis , volume=

work page
[32]

International Conference on Learning Representations (ICLR) , year=

Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond , author=. International Conference on Learning Representations (ICLR) , year=

work page
[33]

Meaningful lower-bound of√ a2 +b −a when a≫b >0

Unified optimal analysis of the (stochastic) gradient method , author=. 1907.04232 , archivePrefix=

work page arXiv 1907
[34]

International Conference on Machine Learning (ICML) , pages=

Data-dependent stability of stochastic gradient descent , author=. International Conference on Machine Learning (ICML) , pages=

work page
[35]

IEEE Transactions on Neural Networks and Learning Systems , volume=

Zeroth-order method for distributed optimization with approximate projections , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=

work page
[36]

International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

Communication-efficient decentralized zeroth-order method on heterogeneous data , author=. International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

work page
[37]

IEEE Transactions on Signal Processing , volume=

Communication-efficient stochastic zeroth-order optimization for federated learning , author=. IEEE Transactions on Signal Processing , volume=

work page
[38]

Foundations of Computational Mathematics , volume=

Random gradient-free minimization of convex functions , author=. Foundations of Computational Mathematics , volume=

work page
[39]

IEEE Transactions on Information Theory , volume=

Optimal rates for zero-order convex optimization: The power of two function evaluations , author=. IEEE Transactions on Information Theory , volume=

work page
[40]

Journal of Machine Learning Research , volume=

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback , author=. Journal of Machine Learning Research , volume=

work page
[41]

Conference on Learning Theory (COLT) , pages=

Highly-smooth zero-th order online optimization , author=. Conference on Learning Theory (COLT) , pages=

work page
[42]

The Annals of Mathematical Statistics , volume=

A relationship between arbitrary positive matrices and doubly stochastic matrices , author=. The Annals of Mathematical Statistics , volume=

work page
[43]

Pacific Journal of Mathematics , volume=

Concerning nonnegative matrices and doubly stochastic matrices , author=. Pacific Journal of Mathematics , volume=

work page
[44]

2310.01139 , archivePrefix=

Stability and Generalization for Minibatch SGD and Local SGD , author=. 2310.01139 , archivePrefix=

work page arXiv
[45]

Decentralized stochastic gradient tracking for non-convex empirical risk minimization

Decentralized stochastic gradient tracking for non-convex empirical risk minimization , author=. 1909.02712 , archivePrefix=

work page arXiv 1909
[46]

IEEE Transactions on Signal Processing , volume=

An improved convergence analysis for decentralized online stochastic non-convex optimization , author=. IEEE Transactions on Signal Processing , volume=

work page
[47]

1910.09126 , archivePrefix=

Communication-efficient local decentralized SGD methods , author=. 1910.09126 , archivePrefix=

work page arXiv 1910
[48]

Decentralized SGD with asynchronous, local and quantized updates , author=

work page
[49]

International Conference on Machine Learning (ICML) , pages=

Decentralized training over decentralized data , author=. International Conference on Machine Learning (ICML) , pages=

work page
[50]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Relaysum for decentralized deep learning on heterogeneous data , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page
[51]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Random reshuffling: Simple analysis with vast improvements , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page
[52]

IEEE International Conference on Big Data , pages=

Consensus optimization with delayed and stochastic gradients on decentralized networks , author=. IEEE International Conference on Big Data , pages=

work page
[53]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

A (DP)2 SGD: Asynchronous decentralized parallel stochastic gradient descent With differential privacy , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

work page
[54]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Asynchronous decentralized SGD with quantized and local updates , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page
[55]

2305.02247 , archivePrefix=

Select without fear: almost all mini-batch schedules generalize optimally , author=. 2305.02247 , archivePrefix=

work page arXiv
[56]

Advances in Neural Information Processing Systems(NeurIPS) , year=

Convergence rates of inexact proximal-gradient methods for convex optimization , author=. Advances in Neural Information Processing Systems(NeurIPS) , year=

work page
[57]

Advances in Neural Information Processing Systems(NeurIPS) , pages=

Better mini-batch algorithms via accelerated gradient methods , author=. Advances in Neural Information Processing Systems(NeurIPS) , pages=

work page
[58]

, author=

Optimal Distributed Online Prediction Using Mini-Batches. , author=. Journal of Machine Learning Research , volume=

work page
[59]

International Conference on Knowledge Discovery and Data Mining , pages=

Efficient mini-batch training for stochastic optimization , author=. International Conference on Knowledge Discovery and Data Mining , pages=

work page
[60]

Annual Allerton Conference on Communication, Control, and Computing , pages=

Distributed stochastic optimization and learning , author=. Annual Allerton Conference on Communication, Control, and Computing , pages=

work page
[61]

International Conference on Artificial Intelligence and Statistics , pages=

Gradient diversity: a key ingredient for scalable distributed learning , author=. International Conference on Artificial Intelligence and Statistics , pages=

work page
[62]

International Conference on Machine Learning(ICML) , pages=

SGD: General analysis and improved rates , author=. International Conference on Machine Learning(ICML) , pages=

work page
[63]

International Conference on Machine Learning(ICML) , pages=

Is local SGD better than minibatch SGD? , author=. International Conference on Machine Learning(ICML) , pages=

work page
[64]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Minibatch vs local sgd for heterogeneous distributed learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page
[65]

Mathematical Programming , volume=

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization , author=. Mathematical Programming , volume=

work page
[66]

1984 , school=

Problems in decentralized decision making and computation , author=. 1984 , school=

work page 1984
[67]

IEEE Transactions on Automatic Control , volume=

Distributed asynchronous deterministic and stochastic gradient optimization algorithms , author=. IEEE Transactions on Automatic Control , volume=

work page
[68]

IEEE Transactions on Automatic Control , volume=

Distributed subgradient methods for multi-agent optimization , author=. IEEE Transactions on Automatic Control , volume=

work page
[69]

Journal of Optimization Theory and Applications , volume=

Distributed stochastic subgradient projection algorithms for convex optimization , author=. Journal of Optimization Theory and Applications , volume=

work page
[70]

Mathematical Programming , pages=

Communication-efficient algorithms for decentralized and stochastic optimization , author=. Mathematical Programming , pages=

work page
[71]

IEEE Signal Processing Magazine , volume=

Distributed learning in wireless sensor networks , author=. IEEE Signal Processing Magazine , volume=

work page
[72]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Distributed delayed stochastic optimization , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[73]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Parallelized stochastic gradient descent , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page
[74]

Journal of Machine Learning Research , volume=

Graph-dependent implicit regularisation for distributed stochastic subgradient descent , author=. Journal of Machine Learning Research , volume=

work page
[75]

Conference on Learning Theory (COLT) , pages=

Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems , author=. Conference on Learning Theory (COLT) , pages=

work page
[76]

Conference on Learning Theory (COLT) , pages=

Tight analyses for non-smooth stochastic gradient descent , author=. Conference on Learning Theory (COLT) , pages=

work page
[77]

Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,

Beyond lipschitz: Sharp generalization and excess risk bounds for full-batch gd , author=. 2204.12446 , archivePrefix=

work page arXiv
[78]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Black-box generalization: Stability of zeroth-order learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page
[79]

Stability of stochastic gradient descent on nonsmooth convex losses , year =

Bassily, Raef and Feldman, Vitaly and Guzm\'. Stability of stochastic gradient descent on nonsmooth convex losses , year =. Advances in Neural Information Processing Systems (NeurIPS) , pages =

work page
[80]

Advances in Neural Information Processing Systems (NeurIPS) , pages =

Smoothness, low noise and fast rates , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages =

work page

Showing first 80 references.

[1] [1]

Proceedings of the National Academy of Sciences , volume=

Theoretical issues in deep networks , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020

[2] [2]

Proceedings of the National Academy of Sciences , volume=

Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

work page 2019

[3] [3]

International Conference on Machine Learning (ICML) , year=

Stability and generalization analysis of decentralized SGD: sharper bounds beyond lipschitzness and smoothness , author=. International Conference on Machine Learning (ICML) , year=

work page

[4] [4]

SIAM Journal on optimization , volume=

Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on optimization , volume=. 2009 , publisher=

work page 2009

[5] [5]

International Conference on Artificial Intelligence and Statistics , pages=

Generalization bounds of nonconvex-(strongly)-concave stochastic minimax optimization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

work page 2024

[6] [6]

Mathematics of Operations Research , volume=

Graphical convergence of subgradients in nonconvex optimization and learning , author=. Mathematics of Operations Research , volume=

work page

[7] [7]

The Twelfth International Conference on Learning Representations , year=

General stability analysis for zeroth-order optimization algorithms , author=. The Twelfth International Conference on Learning Representations , year=

work page

[8] [8]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Stability and Generalization of Zeroth-Order Decentralized Stochastic Gradient Descent with Changing Topology , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[9] [9]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

On the loss landscape of adversarial training: Identifying challenges and how to overcome them , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page

[10] [10]

Neural Networks , volume=

Data-dependent stability analysis of adversarial training , author=. Neural Networks , volume=. 2025 , publisher=

work page 2025

[11] [11]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Stability analysis and generalization bounds of adversarial training , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page

[12] [12]

Don’t use large mini-b atches, use local SGD,

Don't use large mini-batches, use local sgd , author=. arXiv preprint arXiv:1808.07217 , year=

work page arXiv

[13] [13]

2024 , journal=

On the generalization of stochastic gradient descent with momentum , author=. 2024 , journal=

work page 2024

[14] [14]

The Eleventh International Conference on Learning Representations (ICLR) , year=

Exponential generalization bounds with near-optimal rates for L_q -stable algorithms , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=

work page

[15] [15]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

L_2 -Uniform stability of randomized learning algorithms: Sharper generalization bounds and confidence boosting , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page

[16] [16]

2018 , publisher=

High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

work page 2018

[17] [17]

SIAM Journal on Optimization , volume=

Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM Journal on Optimization , volume=. 2013 , publisher=

work page 2013

[18] [18]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Learning rates for stochastic gradient descent with nonconvex objectives , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

work page 2021

[19] [19]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Uniform convergence of gradients for non-convex learning and optimization , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page

[20] [20]

International Conference on Machine Learning (ICML) , pages=

Stability and generalization of learning algorithms that converge to global optima , author=. International Conference on Machine Learning (ICML) , pages=. 2018 , organization=

work page 2018

[21] [21]

The collected works of Wassily Hoeffding , pages=

Probability inequalities for sums of bounded random variables , author=. The collected works of Wassily Hoeffding , pages=. 1994 , publisher=

work page 1994

[22] [22]

SIAM Journal on Optimization , volume=

Achieving geometric convergence for distributed optimization over time-varying graphs , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

work page 2017

[23] [23]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

work page 2016

[24] [24]

International Conference on Machine Learning (ICML) , pages=

A unified theory of decentralized sgd with changing topology and local updates , author=. International Conference on Machine Learning (ICML) , pages=. 2020 , organization=

work page 2020

[25] [25]

International Conference on Artificial Intelligence and Statistics , pages=

Refined convergence and topology learning for decentralized sgd with heterogeneous data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

work page 2023

[26] [26]

Journal of Machine Learning Research , volume=

Removing data heterogeneity influence enhances network topology dependence of decentralized sgd , author=. Journal of Machine Learning Research , volume=

work page

[27] [27]

SIAM Journal on Optimization , volume=

Extra: An exact first-order algorithm for decentralized consensus optimization , author=. SIAM Journal on Optimization , volume=. 2015 , publisher=

work page 2015

[28] [28]

SIAM Journal on Optimization , volume=

On the convergence of decentralized gradient descent , author=. SIAM Journal on Optimization , volume=. 2016 , publisher=

work page 2016

[29] [29]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Towards stability and generalization bounds in decentralized minibatch stochastic gradient descent , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[30] [30]

Advances in Neural Information Processing Systems (NeuIPS) , volume=

Stability and generalization of the decentralized stochastic gradient descent ascent algorithm , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

work page

[31] [31]

Applied and Computational Harmonic Analysis , volume=

High-probability generalization bounds for pointwise uniformly stable algorithms , author=. Applied and Computational Harmonic Analysis , volume=

work page

[32] [32]

International Conference on Learning Representations (ICLR) , year=

Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond , author=. International Conference on Learning Representations (ICLR) , year=

work page

[33] [33]

Meaningful lower-bound of√ a2 +b −a when a≫b >0

Unified optimal analysis of the (stochastic) gradient method , author=. 1907.04232 , archivePrefix=

work page arXiv 1907

[34] [34]

International Conference on Machine Learning (ICML) , pages=

Data-dependent stability of stochastic gradient descent , author=. International Conference on Machine Learning (ICML) , pages=

work page

[35] [35]

IEEE Transactions on Neural Networks and Learning Systems , volume=

Zeroth-order method for distributed optimization with approximate projections , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=

work page

[36] [36]

International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

Communication-efficient decentralized zeroth-order method on heterogeneous data , author=. International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

work page

[37] [37]

IEEE Transactions on Signal Processing , volume=

Communication-efficient stochastic zeroth-order optimization for federated learning , author=. IEEE Transactions on Signal Processing , volume=

work page

[38] [38]

Foundations of Computational Mathematics , volume=

Random gradient-free minimization of convex functions , author=. Foundations of Computational Mathematics , volume=

work page

[39] [39]

IEEE Transactions on Information Theory , volume=

Optimal rates for zero-order convex optimization: The power of two function evaluations , author=. IEEE Transactions on Information Theory , volume=

work page

[40] [40]

Journal of Machine Learning Research , volume=

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback , author=. Journal of Machine Learning Research , volume=

work page

[41] [41]

Conference on Learning Theory (COLT) , pages=

Highly-smooth zero-th order online optimization , author=. Conference on Learning Theory (COLT) , pages=

work page

[42] [42]

The Annals of Mathematical Statistics , volume=

A relationship between arbitrary positive matrices and doubly stochastic matrices , author=. The Annals of Mathematical Statistics , volume=

work page

[43] [43]

Pacific Journal of Mathematics , volume=

Concerning nonnegative matrices and doubly stochastic matrices , author=. Pacific Journal of Mathematics , volume=

work page

[44] [44]

2310.01139 , archivePrefix=

Stability and Generalization for Minibatch SGD and Local SGD , author=. 2310.01139 , archivePrefix=

work page arXiv

[45] [45]

Decentralized stochastic gradient tracking for non-convex empirical risk minimization

Decentralized stochastic gradient tracking for non-convex empirical risk minimization , author=. 1909.02712 , archivePrefix=

work page arXiv 1909

[46] [46]

IEEE Transactions on Signal Processing , volume=

An improved convergence analysis for decentralized online stochastic non-convex optimization , author=. IEEE Transactions on Signal Processing , volume=

work page

[47] [47]

1910.09126 , archivePrefix=

Communication-efficient local decentralized SGD methods , author=. 1910.09126 , archivePrefix=

work page arXiv 1910

[48] [48]

Decentralized SGD with asynchronous, local and quantized updates , author=

work page

[49] [49]

International Conference on Machine Learning (ICML) , pages=

Decentralized training over decentralized data , author=. International Conference on Machine Learning (ICML) , pages=

work page

[50] [50]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Relaysum for decentralized deep learning on heterogeneous data , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page

[51] [51]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Random reshuffling: Simple analysis with vast improvements , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page

[52] [52]

IEEE International Conference on Big Data , pages=

Consensus optimization with delayed and stochastic gradients on decentralized networks , author=. IEEE International Conference on Big Data , pages=

work page

[53] [53]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

A (DP)2 SGD: Asynchronous decentralized parallel stochastic gradient descent With differential privacy , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

work page

[54] [54]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Asynchronous decentralized SGD with quantized and local updates , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page

[55] [55]

2305.02247 , archivePrefix=

Select without fear: almost all mini-batch schedules generalize optimally , author=. 2305.02247 , archivePrefix=

work page arXiv

[56] [56]

Advances in Neural Information Processing Systems(NeurIPS) , year=

Convergence rates of inexact proximal-gradient methods for convex optimization , author=. Advances in Neural Information Processing Systems(NeurIPS) , year=

work page

[57] [57]

Advances in Neural Information Processing Systems(NeurIPS) , pages=

Better mini-batch algorithms via accelerated gradient methods , author=. Advances in Neural Information Processing Systems(NeurIPS) , pages=

work page

[58] [58]

, author=

Optimal Distributed Online Prediction Using Mini-Batches. , author=. Journal of Machine Learning Research , volume=

work page

[59] [59]

International Conference on Knowledge Discovery and Data Mining , pages=

Efficient mini-batch training for stochastic optimization , author=. International Conference on Knowledge Discovery and Data Mining , pages=

work page

[60] [60]

Annual Allerton Conference on Communication, Control, and Computing , pages=

Distributed stochastic optimization and learning , author=. Annual Allerton Conference on Communication, Control, and Computing , pages=

work page

[61] [61]

International Conference on Artificial Intelligence and Statistics , pages=

Gradient diversity: a key ingredient for scalable distributed learning , author=. International Conference on Artificial Intelligence and Statistics , pages=

work page

[62] [62]

International Conference on Machine Learning(ICML) , pages=

SGD: General analysis and improved rates , author=. International Conference on Machine Learning(ICML) , pages=

work page

[63] [63]

International Conference on Machine Learning(ICML) , pages=

Is local SGD better than minibatch SGD? , author=. International Conference on Machine Learning(ICML) , pages=

work page

[64] [64]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Minibatch vs local sgd for heterogeneous distributed learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page

[65] [65]

Mathematical Programming , volume=

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization , author=. Mathematical Programming , volume=

work page

[66] [66]

1984 , school=

Problems in decentralized decision making and computation , author=. 1984 , school=

work page 1984

[67] [67]

IEEE Transactions on Automatic Control , volume=

Distributed asynchronous deterministic and stochastic gradient optimization algorithms , author=. IEEE Transactions on Automatic Control , volume=

work page

[68] [68]

IEEE Transactions on Automatic Control , volume=

Distributed subgradient methods for multi-agent optimization , author=. IEEE Transactions on Automatic Control , volume=

work page

[69] [69]

Journal of Optimization Theory and Applications , volume=

Distributed stochastic subgradient projection algorithms for convex optimization , author=. Journal of Optimization Theory and Applications , volume=

work page

[70] [70]

Mathematical Programming , pages=

Communication-efficient algorithms for decentralized and stochastic optimization , author=. Mathematical Programming , pages=

work page

[71] [71]

IEEE Signal Processing Magazine , volume=

Distributed learning in wireless sensor networks , author=. IEEE Signal Processing Magazine , volume=

work page

[72] [72]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Distributed delayed stochastic optimization , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page

[73] [73]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Parallelized stochastic gradient descent , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

work page

[74] [74]

Journal of Machine Learning Research , volume=

Graph-dependent implicit regularisation for distributed stochastic subgradient descent , author=. Journal of Machine Learning Research , volume=

work page

[75] [75]

Conference on Learning Theory (COLT) , pages=

Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems , author=. Conference on Learning Theory (COLT) , pages=

work page

[76] [76]

Conference on Learning Theory (COLT) , pages=

Tight analyses for non-smooth stochastic gradient descent , author=. Conference on Learning Theory (COLT) , pages=

work page

[77] [77]

Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,

Beyond lipschitz: Sharp generalization and excess risk bounds for full-batch gd , author=. 2204.12446 , archivePrefix=

work page arXiv

[78] [78]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Black-box generalization: Stability of zeroth-order learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page

[79] [79]

Stability of stochastic gradient descent on nonsmooth convex losses , year =

Bassily, Raef and Feldman, Vitaly and Guzm\'. Stability of stochastic gradient descent on nonsmooth convex losses , year =. Advances in Neural Information Processing Systems (NeurIPS) , pages =

work page

[80] [80]

Advances in Neural Information Processing Systems (NeurIPS) , pages =

Smoothness, low noise and fast rates , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages =

work page