Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Alexander Rakhlin; Karthik Sridharan; Ohad Shamir

arxiv: 1109.5647 · v7 · pith:EUCLAXZZnew · submitted 2011-09-26 · 💻 cs.LG · math.OC

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Alexander Rakhlin , Ohad Shamir , Karthik Sridharan This is my paper

classification 💻 cs.LG math.OC

keywords problemsratestochasticalgorithmoptimalaveragingconvergenceconvex

0 comments

read the original abstract

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(\log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be \Omega(\log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo
cs.LG 2026-05 unverdicted novelty 7.0

Derives joint asymptotic jump-diffusion limit for global parameters and latent variables in SGLD-Gibbs under space-time rescaling, yielding explicit hyperparameter tuning guidance for calibrated uncertainty quantification.
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR 2026-05 unverdicted novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...
Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
stat.ML 2026-05 unverdicted novelty 7.0

Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...
Characterizing and Correcting Effective Target Shift in Online Learning
stat.ML 2026-05 unverdicted novelty 7.0

Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD
math.OC 2026-04 unverdicted novelty 7.0

Combining random reshuffling and Richardson-Romberg extrapolation yields cubic bias refinement and better MSE for constant-step SGD on structured non-monotone variational inequalities.
SGD at the Edge of Stability: Stochastic Stabilization with Large Learning Rates
stat.ML 2026-06 unverdicted novelty 6.0

SGD on multiclass cross-entropy loss alternates between curvature-driven oscillations and stable regimes but self-stabilizes to enable best-iterate convergence with large learning rates for linear and two-layer models.
Cost-Aware Learning
cs.LG 2026-04 unverdicted novelty 6.0

Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
Communication-Efficient Approximate Gradient Coding for Distributed Learning in Heterogeneous Systems
eess.SY 2026-05 unverdicted novelty 5.0

Derives a closed-form optimal gradient coding structure and bit allocation strategy to minimize residual error under an unbiasedness constraint for communication-efficient distributed learning in heterogeneous systems.
Cost-Aware Learning
cs.LG 2026-04 unverdicted novelty 5.0

Cost-Aware SGD samples by gradient-norm-to-cost ratio and is instantiated as Cost-Aware GRPO for length-dependent policy gradients, reducing tokens used in LLM RL while matching baseline accuracy.
Stochastic versus Deterministic in Stochastic Gradient Descent
math.OC 2025-09 unverdicted novelty 5.0

Treating stochastic and deterministic gradients separately in mini-batch SGD yields faster convergence and smaller error radius than uniform treatment, with further gains under strong convexity.