Recognition: unknown
Improving Generalization Performance by Switching from Adam to SGD
Original abstract
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad, or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks: ResNet, SENet, DenseNet, and PyramidNet on the CIFAR-10 and CIFAR-100 data sets, ResNet on the Tiny-ImageNet data set, and language modeling with recurrent networks on the PTB and WT2 data sets. The results show that our strategy closes the generalization gap between SGD and Adam on a majority of these tasks.
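The triggering condition described in the abstract can be made concrete with a small sketch: run Adam while projecting each Adam step onto the current gradient to estimate an equivalent SGD learning rate, and hand over to SGD once that estimate stabilizes. The PyTorch-style code below is only a minimal illustration of that pattern, not the paper's exact algorithm; the function signature (swats_like_training, loss_fn, data), the tolerance tol, and the reuse of beta2 to smooth the estimate are assumptions made for the example.

```python
"""Minimal sketch of a SWATS-style Adam-to-SGD switch, assuming the standard
bias-corrected Adam update. Hyperparameters tol and the smoothing scheme are
illustrative placeholders; consult the paper for the authoritative rule."""
import torch


def swats_like_training(params, loss_fn, data, lr=1e-3, beta1=0.9,
                        beta2=0.999, eps=1e-9, tol=1e-5):
    m = [torch.zeros_like(p) for p in params]
    v = [torch.zeros_like(p) for p in params]
    lam, switched, sgd_lr = 0.0, False, None

    for step, batch in enumerate(data, start=1):
        loss = loss_fn(params, batch)
        grads = torch.autograd.grad(loss, params)

        if switched:
            # Phase 2: plain SGD with the learning rate estimated at the switch.
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p -= sgd_lr * g
            continue

        # Phase 1: Adam, while monitoring the projection of the Adam step
        # onto the gradient to estimate an equivalent SGD learning rate.
        pk_dot_g, pk_dot_pk = 0.0, 0.0
        with torch.no_grad():
            for i, (p, g) in enumerate(zip(params, grads)):
                m[i] = beta1 * m[i] + (1 - beta1) * g
                v[i] = beta2 * v[i] + (1 - beta2) * g * g
                m_hat = m[i] / (1 - beta1 ** step)
                v_hat = v[i] / (1 - beta2 ** step)
                pk = -lr * m_hat / (v_hat.sqrt() + eps)  # Adam step
                p += pk
                pk_dot_g += torch.dot(pk.flatten(), g.flatten()).item()
                pk_dot_pk += torch.dot(pk.flatten(), pk.flatten()).item()

        # Only update the estimate when the Adam step is a descent direction;
        # edge cases are glossed over in this sketch.
        if pk_dot_g < 0.0:
            gamma = pk_dot_pk / -pk_dot_g        # projection-based SGD-rate estimate
            lam = beta2 * lam + (1 - beta2) * gamma
            lam_hat = lam / (1 - beta2 ** step)  # bias-corrected running estimate
            if step > 1 and abs(lam_hat - gamma) < tol:
                switched, sgd_lr = True, lam_hat  # estimate has stabilized: switch
    return params
```

In this sketch the stabilized estimate lam_hat doubles as the SGD learning rate after the switch, so no new hyperparameters are introduced beyond Adam's, consistent with the abstract's claim that monitoring the condition adds very little overhead.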
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
- Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
  Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
- Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy
  SignSGD with pre-sign dithering and a calibrated hybrid switch to SGD reaches 92.18% accuracy on CIFAR-10 with ResNet-18, outperforming both pure SGD and pure SignSGD, and also surpasses Adam on CIFAR-100.
- Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
  DualOpt decouples optimization, applying real-time layer-wise weight decay when training from scratch and weight rollback when fine-tuning, to improve convergence and generalization and to reduce knowledge forgetting.