pith. machine review for the scientific record.

arxiv: 1712.07628 · v1 · submitted 2017-12-20 · 💻 cs.LG · math.OC

Recognition: unknown

Improving Generalization Performance by Switching from Adam to SGD

Authors on Pith: no claims yet
classification: 💻 cs.LG · math.OC
keywords: adam · training · condition · data · strategy · adaptive · generalization · gradient
Original abstract

Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad, or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks: ResNet, SENet, DenseNet, and PyramidNet on the CIFAR-10 and CIFAR-100 data sets; ResNet on the Tiny-ImageNet data set; and language modeling with recurrent networks on the PTB and WT2 data sets. The results show that our strategy is capable of closing the generalization gap between SGD and Adam on a majority of the tasks.
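The abstract's switching condition can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the EMA coefficient `beta`, and the tolerance `tol` are assumptions for illustration. The idea is that each Adam step `p_k` is projected onto the current gradient `g_k` to recover the SGD learning rate that would produce an equivalent step; when a bias-corrected running average of that estimate stabilizes, the optimizer can switch to SGD with the averaged rate.

```python
import numpy as np

def swats_switch_monitor(steps, grads, beta=0.9, tol=1e-4):
    """Sketch of a SWATS-style switching criterion.

    steps: sequence of Adam update vectors p_k (the actual parameter deltas).
    grads: sequence of gradients g_k at the same iterates.
    Returns (switch_iteration, estimated_sgd_lr), or (None, None)
    if the estimate never stabilizes.
    """
    lam = 0.0
    for k, (p, g) in enumerate(zip(steps, grads), start=1):
        pg = p.dot(g)
        if pg == 0.0:
            # Adam step orthogonal to the gradient: no usable projection.
            continue
        # Learning rate gamma such that projecting the SGD step -gamma * g
        # onto the Adam step p recovers p itself.
        gamma = p.dot(p) / (-pg)
        # Exponential moving average of gamma with bias correction,
        # mirroring Adam's own moment estimates.
        lam = beta * lam + (1 - beta) * gamma
        lam_hat = lam / (1 - beta ** k)
        if k > 1 and abs(lam_hat - gamma) < tol:
            return k, lam_hat
    return None, None
```

As a sanity check, if every Adam step happens to be an exact SGD step `p = -0.1 * g`, the projection recovers `gamma = 0.1` at every iteration and the monitor triggers almost immediately with that rate. The negative sign in the projection matters: Adam steps point (roughly) against the gradient, so `p.dot(g)` is typically negative and `gamma` comes out positive.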

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

    cs.LG 2026-03 unverdicted novelty 7.0

    Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.

  2. Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

    cs.LG 2026-04 unverdicted novelty 5.0

    SignSGD with pre-sign dithering and a calibrated hybrid switch to SGD achieves 92.18% accuracy on CIFAR-10 with ResNet-18, outperforming pure SGD and SignSGD, plus better results than Adam on CIFAR-100.

  3. Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

    cs.CV 2026-04 unverdicted novelty 3.0

    DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.