pith. sign in

arxiv: 2605.10205 · v1 · submitted 2026-05-11 · 💻 cs.LG

Unveiling High-Probability Generalization in Decentralized SGD

Pith reviewed 2026-05-12 05:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords decentralized SGDhigh-probability generalizationpointwise uniform stabilitydistributed learningconvex optimizationnon-convex optimizationgeneralization boundscommunication overhead
0
0 comments X

The pith

Decentralized SGD achieves optimal high-probability generalization bounds at O(1/sqrt(mn) log(1/δ))

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a high-probability learning theory for decentralized stochastic gradient descent that matches the optimal rate known for standard SGD. Prior D-SGD analyses gave expected bounds but high-probability versions carried a worse dependence on the failure probability δ. By refining the analysis with pointwise uniform stability, a weaker condition than full uniform stability, the authors obtain the improved rate across convex, strongly convex, and non-convex regimes. They also bound optimization error, excess risk, and gradient measures while folding in communication costs. Readers care because the result supplies reliable worst-case guarantees for large-scale distributed training.

Core claim

We develop a high-probability learning theory for D-SGD aiming for the optimal O(1/sqrt(mn) log(1/δ)) rate. We refine bounds using pointwise uniform stability in distributed learning and analyze them in convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures where only local minima exist and derive optimization error and excess risk bounds while accounting for communication overhead in time-varying frameworks.

What carries the argument

Pointwise uniform stability in distributed learning, which bounds the effect of changing one sample on the output model to yield high-probability generalization control under decentralized communication.

If this is right

  • High-probability excess risk bounds reach the optimal O(1/sqrt(mn) log(1/δ)) rate in convex settings.
  • Non-convex cases obtain high-probability bounds on gradient norms at local minima.
  • Optimization error and excess risk are controlled jointly with the generalization term.
  • Local models retain generalization bounds in time-varying communication frameworks after accounting for overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Decentralization itself need not degrade high-probability guarantees once pointwise stability is established.
  • The same stability technique may extend directly to other decentralized first-order methods.
  • Empirical checks on heterogeneous data partitions would reveal whether the derived rates remain tight in practice.

Load-bearing premise

Pointwise uniform stability remains a valid and sufficiently strong notion under the decentralized communication constraints and loss-function assumptions of the D-SGD setting.

What would settle it

An empirical run of D-SGD on a fixed dataset and network where the fraction of trials exceeding the claimed generalization error bound is substantially larger than δ would falsify the result.

read the original abstract

Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{\delta \sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning-a weaker notion than uniform stability-and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a high-probability generalization theory for decentralized SGD (D-SGD). It refines bounds using pointwise uniform stability (a weaker notion than uniform stability) to achieve the optimal rate O(1/sqrt(mn) log(1/δ)) across convex, strongly convex, and non-convex settings. Additional results cover high-probability bounds for gradient-based measures at local minima, optimization error, excess risk, and generalization for local models under time-varying communication graphs while accounting for communication overhead.

Significance. If the central claims hold, the work would close an important gap between expected-risk analyses and high-probability guarantees for D-SGD, matching the optimal rates known for centralized SGD. The extension to non-convex regimes and explicit treatment of communication overhead could inform algorithm design in large-scale distributed training.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Stability Definition): the claim that pointwise uniform stability suffices for the optimal O(1/sqrt(mn) log(1/δ)) rate is load-bearing, yet the provided description does not show how a local perturbation at one worker propagates through repeated averaging steps on a (possibly time-varying) graph. Without an explicit bound on the effective stability parameter that incorporates the mixing time or spectral gap, the derived high-probability bound may acquire extra factors polynomial in m or the number of communication rounds.
  2. [§5] §5 (Non-convex Analysis): the high-probability results for gradient-based measures at local minima rely on the same stability argument; if the cross-worker dependence is not controlled, the excess-risk bound will not be parameter-free with respect to the communication topology as stated.
minor comments (2)
  1. [Abstract] Abstract: the rate notation O(1/sqrt(mn) log(1/δ)) should explicitly define m (number of workers) and n (local sample size) on first use.
  2. [Throughout] Notation: ensure δ consistently denotes the failure probability across all theorems and corollaries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer exposition on perturbation propagation and cross-worker dependence in our stability analysis. We address each major comment below with references to the existing proofs and indicate where we will add clarifications to the main text.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Stability Definition): the claim that pointwise uniform stability suffices for the optimal O(1/sqrt(mn) log(1/δ)) rate is load-bearing, yet the provided description does not show how a local perturbation at one worker propagates through repeated averaging steps on a (possibly time-varying) graph. Without an explicit bound on the effective stability parameter that incorporates the mixing time or spectral gap, the derived high-probability bound may acquire extra factors polynomial in m or the number of communication rounds.

    Authors: We agree the main-text description in §3 is brief. The appendix proofs (e.g., Lemma 3.2 and Theorem 3.1) explicitly track perturbation propagation: a single-sample change at one worker affects its local gradient step, after which the mixing matrices W_t contract the difference vector by the spectral gap factor (1-λ) per round. Summing the geometric series over T rounds and averaging over m workers yields an effective pointwise stability parameter of O(1/sqrt(mn)) with no polynomial dependence on m or T (the log(1/δ) arises solely from the high-probability tail). We will revise §3 to include a short paragraph summarizing this contraction argument and citing the relevant appendix lemmas. revision: yes

  2. Referee: [§5] §5 (Non-convex Analysis): the high-probability results for gradient-based measures at local minima rely on the same stability argument; if the cross-worker dependence is not controlled, the excess-risk bound will not be parameter-free with respect to the communication topology as stated.

    Authors: In §5 the gradient-norm bounds at local minima are obtained by applying the same pointwise stability to the averaged model sequence. Cross-worker dependence is controlled because stability is measured on the global average iterate; the uniform mixing assumption (bounded second-largest eigenvalue of the time-varying graphs) ensures the averaged model’s sensitivity remains O(1/sqrt(mn)) independently of any particular topology beyond the mixing condition. The union bound over m workers contributes only an extra log m factor, which is absorbed into the log(1/δ) term. We will add a clarifying remark in §5 stating this topology independence under the stated mixing assumption. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation relies on independent stability analysis

full rationale

The paper presents its high-probability generalization bounds for D-SGD as derived from pointwise uniform stability analysis applied to the decentralized setting. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The claimed O(1/sqrt(mn) log(1/δ)) rate is positioned as a refinement obtained by extending stability arguments across convex/non-convex cases while accounting for communication, without reducing to a tautology or prior result by the same authors. The analysis is self-contained against external benchmarks of stability-based generalization.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; standard domain assumptions for generalization bounds are likely invoked but not enumerated.

axioms (2)
  • domain assumption Loss functions satisfy convexity or smoothness conditions appropriate to each setting (convex, strongly convex, non-convex).
    Standard background assumption for stability-based generalization analysis.
  • domain assumption The decentralized communication graph permits the D-SGD update rule.
    Implicit in the problem formulation.

pith-pipeline@v0.9.0 · 5548 in / 1094 out tokens · 62635 ms · 2026-05-12T05:05:56.894236+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages

  1. [1]

    Proceedings of the National Academy of Sciences , volume=

    Theoretical issues in deep networks , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

  2. [2]

    Proceedings of the National Academy of Sciences , volume=

    Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

  3. [3]

    International Conference on Machine Learning (ICML) , year=

    Stability and generalization analysis of decentralized SGD: sharper bounds beyond lipschitzness and smoothness , author=. International Conference on Machine Learning (ICML) , year=

  4. [4]

    SIAM Journal on optimization , volume=

    Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on optimization , volume=. 2009 , publisher=

  5. [5]

    International Conference on Artificial Intelligence and Statistics , pages=

    Generalization bounds of nonconvex-(strongly)-concave stochastic minimax optimization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

  6. [6]

    Mathematics of Operations Research , volume=

    Graphical convergence of subgradients in nonconvex optimization and learning , author=. Mathematics of Operations Research , volume=

  7. [7]

    The Twelfth International Conference on Learning Representations , year=

    General stability analysis for zeroth-order optimization algorithms , author=. The Twelfth International Conference on Learning Representations , year=

  8. [8]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Stability and Generalization of Zeroth-Order Decentralized Stochastic Gradient Descent with Changing Topology , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  9. [9]

    Advances in Neural Information Processing Systems (NeuIPS) , volume=

    On the loss landscape of adversarial training: Identifying challenges and how to overcome them , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

  10. [10]

    Neural Networks , volume=

    Data-dependent stability analysis of adversarial training , author=. Neural Networks , volume=. 2025 , publisher=

  11. [11]

    Advances in Neural Information Processing Systems (NeuIPS) , volume=

    Stability analysis and generalization bounds of adversarial training , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

  12. [12]

    Don’t use large mini-b atches, use local SGD,

    Don't use large mini-batches, use local sgd , author=. arXiv preprint arXiv:1808.07217 , year=

  13. [13]

    2024 , journal=

    On the generalization of stochastic gradient descent with momentum , author=. 2024 , journal=

  14. [14]

    The Eleventh International Conference on Learning Representations (ICLR) , year=

    Exponential generalization bounds with near-optimal rates for L_q -stable algorithms , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=

  15. [15]

    Advances in Neural Information Processing Systems (NeuIPS) , volume=

    L_2 -Uniform stability of randomized learning algorithms: Sharper generalization bounds and confidence boosting , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

  16. [16]

    2018 , publisher=

    High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

  17. [17]

    SIAM Journal on Optimization , volume=

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming , author=. SIAM Journal on Optimization , volume=. 2013 , publisher=

  18. [18]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Learning rates for stochastic gradient descent with nonconvex objectives , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

  19. [19]

    Advances in Neural Information Processing Systems (NeuIPS) , volume=

    Uniform convergence of gradients for non-convex learning and optimization , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

  20. [20]

    International Conference on Machine Learning (ICML) , pages=

    Stability and generalization of learning algorithms that converge to global optima , author=. International Conference on Machine Learning (ICML) , pages=. 2018 , organization=

  21. [21]

    The collected works of Wassily Hoeffding , pages=

    Probability inequalities for sums of bounded random variables , author=. The collected works of Wassily Hoeffding , pages=. 1994 , publisher=

  22. [22]

    SIAM Journal on Optimization , volume=

    Achieving geometric convergence for distributed optimization over time-varying graphs , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

  23. [23]

    Linear convergence of gradient and proximal-gradient methods under the polyak-

    Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

  24. [24]

    International Conference on Machine Learning (ICML) , pages=

    A unified theory of decentralized sgd with changing topology and local updates , author=. International Conference on Machine Learning (ICML) , pages=. 2020 , organization=

  25. [25]

    International Conference on Artificial Intelligence and Statistics , pages=

    Refined convergence and topology learning for decentralized sgd with heterogeneous data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

  26. [26]

    Journal of Machine Learning Research , volume=

    Removing data heterogeneity influence enhances network topology dependence of decentralized sgd , author=. Journal of Machine Learning Research , volume=

  27. [27]

    SIAM Journal on Optimization , volume=

    Extra: An exact first-order algorithm for decentralized consensus optimization , author=. SIAM Journal on Optimization , volume=. 2015 , publisher=

  28. [28]

    SIAM Journal on Optimization , volume=

    On the convergence of decentralized gradient descent , author=. SIAM Journal on Optimization , volume=. 2016 , publisher=

  29. [29]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Towards stability and generalization bounds in decentralized minibatch stochastic gradient descent , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  30. [30]

    Advances in Neural Information Processing Systems (NeuIPS) , volume=

    Stability and generalization of the decentralized stochastic gradient descent ascent algorithm , author=. Advances in Neural Information Processing Systems (NeuIPS) , volume=

  31. [31]

    Applied and Computational Harmonic Analysis , volume=

    High-probability generalization bounds for pointwise uniformly stable algorithms , author=. Applied and Computational Harmonic Analysis , volume=

  32. [32]

    International Conference on Learning Representations (ICLR) , year=

    Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond , author=. International Conference on Learning Representations (ICLR) , year=

  33. [33]

    Meaningful lower-bound of√ a2 +b −a when a≫b >0

    Unified optimal analysis of the (stochastic) gradient method , author=. 1907.04232 , archivePrefix=

  34. [34]

    International Conference on Machine Learning (ICML) , pages=

    Data-dependent stability of stochastic gradient descent , author=. International Conference on Machine Learning (ICML) , pages=

  35. [35]

    IEEE Transactions on Neural Networks and Learning Systems , volume=

    Zeroth-order method for distributed optimization with approximate projections , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=

  36. [36]

    International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

    Communication-efficient decentralized zeroth-order method on heterogeneous data , author=. International Conference on Wireless Communications and Signal Processing (WCSP) , pages=

  37. [37]

    IEEE Transactions on Signal Processing , volume=

    Communication-efficient stochastic zeroth-order optimization for federated learning , author=. IEEE Transactions on Signal Processing , volume=

  38. [38]

    Foundations of Computational Mathematics , volume=

    Random gradient-free minimization of convex functions , author=. Foundations of Computational Mathematics , volume=

  39. [39]

    IEEE Transactions on Information Theory , volume=

    Optimal rates for zero-order convex optimization: The power of two function evaluations , author=. IEEE Transactions on Information Theory , volume=

  40. [40]

    Journal of Machine Learning Research , volume=

    An optimal algorithm for bandit and zero-order convex optimization with two-point feedback , author=. Journal of Machine Learning Research , volume=

  41. [41]

    Conference on Learning Theory (COLT) , pages=

    Highly-smooth zero-th order online optimization , author=. Conference on Learning Theory (COLT) , pages=

  42. [42]

    The Annals of Mathematical Statistics , volume=

    A relationship between arbitrary positive matrices and doubly stochastic matrices , author=. The Annals of Mathematical Statistics , volume=

  43. [43]

    Pacific Journal of Mathematics , volume=

    Concerning nonnegative matrices and doubly stochastic matrices , author=. Pacific Journal of Mathematics , volume=

  44. [44]

    2310.01139 , archivePrefix=

    Stability and Generalization for Minibatch SGD and Local SGD , author=. 2310.01139 , archivePrefix=

  45. [45]

    Decentralized stochastic gradient tracking for non-convex empirical risk minimization

    Decentralized stochastic gradient tracking for non-convex empirical risk minimization , author=. 1909.02712 , archivePrefix=

  46. [46]

    IEEE Transactions on Signal Processing , volume=

    An improved convergence analysis for decentralized online stochastic non-convex optimization , author=. IEEE Transactions on Signal Processing , volume=

  47. [47]

    1910.09126 , archivePrefix=

    Communication-efficient local decentralized SGD methods , author=. 1910.09126 , archivePrefix=

  48. [48]

    Decentralized SGD with asynchronous, local and quantized updates , author=

  49. [49]

    International Conference on Machine Learning (ICML) , pages=

    Decentralized training over decentralized data , author=. International Conference on Machine Learning (ICML) , pages=

  50. [50]

    Advances in Neural Information Processing Systems (NeurIPS) , pages=

    Relaysum for decentralized deep learning on heterogeneous data , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

  51. [51]

    Advances in Neural Information Processing Systems (NeurIPS) , pages=

    Random reshuffling: Simple analysis with vast improvements , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

  52. [52]

    IEEE International Conference on Big Data , pages=

    Consensus optimization with delayed and stochastic gradients on decentralized networks , author=. IEEE International Conference on Big Data , pages=

  53. [53]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    A (DP)2 SGD: Asynchronous decentralized parallel stochastic gradient descent With differential privacy , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

  54. [54]

    Advances in Neural Information Processing Systems (NeurIPS) , volume=

    Asynchronous decentralized SGD with quantized and local updates , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

  55. [55]

    2305.02247 , archivePrefix=

    Select without fear: almost all mini-batch schedules generalize optimally , author=. 2305.02247 , archivePrefix=

  56. [56]

    Advances in Neural Information Processing Systems(NeurIPS) , year=

    Convergence rates of inexact proximal-gradient methods for convex optimization , author=. Advances in Neural Information Processing Systems(NeurIPS) , year=

  57. [57]

    Advances in Neural Information Processing Systems(NeurIPS) , pages=

    Better mini-batch algorithms via accelerated gradient methods , author=. Advances in Neural Information Processing Systems(NeurIPS) , pages=

  58. [58]

    , author=

    Optimal Distributed Online Prediction Using Mini-Batches. , author=. Journal of Machine Learning Research , volume=

  59. [59]

    International Conference on Knowledge Discovery and Data Mining , pages=

    Efficient mini-batch training for stochastic optimization , author=. International Conference on Knowledge Discovery and Data Mining , pages=

  60. [60]

    Annual Allerton Conference on Communication, Control, and Computing , pages=

    Distributed stochastic optimization and learning , author=. Annual Allerton Conference on Communication, Control, and Computing , pages=

  61. [61]

    International Conference on Artificial Intelligence and Statistics , pages=

    Gradient diversity: a key ingredient for scalable distributed learning , author=. International Conference on Artificial Intelligence and Statistics , pages=

  62. [62]

    International Conference on Machine Learning(ICML) , pages=

    SGD: General analysis and improved rates , author=. International Conference on Machine Learning(ICML) , pages=

  63. [63]

    International Conference on Machine Learning(ICML) , pages=

    Is local SGD better than minibatch SGD? , author=. International Conference on Machine Learning(ICML) , pages=

  64. [64]

    Advances in Neural Information Processing Systems (NeurIPS) , volume=

    Minibatch vs local sgd for heterogeneous distributed learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

  65. [65]

    Mathematical Programming , volume=

    Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization , author=. Mathematical Programming , volume=

  66. [66]

    1984 , school=

    Problems in decentralized decision making and computation , author=. 1984 , school=

  67. [67]

    IEEE Transactions on Automatic Control , volume=

    Distributed asynchronous deterministic and stochastic gradient optimization algorithms , author=. IEEE Transactions on Automatic Control , volume=

  68. [68]

    IEEE Transactions on Automatic Control , volume=

    Distributed subgradient methods for multi-agent optimization , author=. IEEE Transactions on Automatic Control , volume=

  69. [69]

    Journal of Optimization Theory and Applications , volume=

    Distributed stochastic subgradient projection algorithms for convex optimization , author=. Journal of Optimization Theory and Applications , volume=

  70. [70]

    Mathematical Programming , pages=

    Communication-efficient algorithms for decentralized and stochastic optimization , author=. Mathematical Programming , pages=

  71. [71]

    IEEE Signal Processing Magazine , volume=

    Distributed learning in wireless sensor networks , author=. IEEE Signal Processing Magazine , volume=

  72. [72]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Distributed delayed stochastic optimization , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  73. [73]

    Advances in Neural Information Processing Systems (NeurIPS) , volume=

    Parallelized stochastic gradient descent , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

  74. [74]

    Journal of Machine Learning Research , volume=

    Graph-dependent implicit regularisation for distributed stochastic subgradient descent , author=. Journal of Machine Learning Research , volume=

  75. [75]

    Conference on Learning Theory (COLT) , pages=

    Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems , author=. Conference on Learning Theory (COLT) , pages=

  76. [76]

    Conference on Learning Theory (COLT) , pages=

    Tight analyses for non-smooth stochastic gradient descent , author=. Conference on Learning Theory (COLT) , pages=

  77. [77]

    Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,

    Beyond lipschitz: Sharp generalization and excess risk bounds for full-batch gd , author=. 2204.12446 , archivePrefix=

  78. [78]

    Advances in Neural Information Processing Systems (NeurIPS) , pages=

    Black-box generalization: Stability of zeroth-order learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

  79. [79]

    Stability of stochastic gradient descent on nonsmooth convex losses , year =

    Bassily, Raef and Feldman, Vitaly and Guzm\'. Stability of stochastic gradient descent on nonsmooth convex losses , year =. Advances in Neural Information Processing Systems (NeurIPS) , pages =

  80. [80]

    Advances in Neural Information Processing Systems (NeurIPS) , pages =

    Smoothness, low noise and fast rates , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages =

Showing first 80 references.