Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
In this paper, we describe a phenomenon, which we named "super-convergence", where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. A primary insight that allows super-convergence training is that large learning rates regularize the training, hence requiring a reduction of all other forms of regularization in order to preserve an optimal regularization balance. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. Experiments demonstrate super-convergence for Cifar-10/100, MNIST and Imagenet datasets, and resnet, wide-resnet, densenet, and inception architectures. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence. See http://www.fast.ai/2018/04/30/dawnbench-fastai/ for an application of super-convergence to win the DAWNBench challenge (see https://dawn.cs.stanford.edu/benchmark/).
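The "one learning rate cycle" with a large maximum learning rate mentioned in the abstract can be pictured as a schedule that ramps the learning rate up to a peak and then back down over a single training run. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `one_cycle_lr`, its parameters, the default values, and the purely piecewise-linear shape are all hypothetical choices made for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr=1.0, min_lr=0.1, warmup_frac=0.5):
    """Return a learning rate for `step` on a single up-then-down linear cycle.

    Illustrative only: names, defaults, and the exact shape are assumptions,
    not values or code taken from the paper.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    last = total_steps - 1
    # Index of the step at which the learning rate reaches its maximum.
    peak = max(1, int(warmup_frac * last))
    if step <= peak:
        # Rising half of the cycle: min_lr -> max_lr.
        frac = step / peak
        return min_lr + frac * (max_lr - min_lr)
    # Falling half of the cycle: max_lr -> min_lr.
    frac = (step - peak) / (last - peak)
    return max_lr - frac * (max_lr - min_lr)


if __name__ == "__main__":
    # Print the schedule for a short, 11-step run.
    for s in range(11):
        print(s, round(one_cycle_lr(s, 11), 3))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each update. Since the abstract states that large learning rates themselves regularize training, other regularization (for example weight decay or dropout) would be reduced alongside such a schedule.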
Forward citations
Cited by 13 Pith papers
- Generative models on phase space
  Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.
- OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers
  OmniMol transfers a billion-jet pre-trained PET foundation model from HEP to molecular dynamics via an interaction-matrix attention bias, delivering strong performance on the oMol dataset with minimal fine-tuning and ...
- Data-Driven Calibration of Large Liquid Detectors with Unsupervised Learning
  Unsupervised deep learning with a simplified optical photon transport model in the loss function extracts three PMT calibration constants per tube from background events in the SNO+ detector.
- Bayesian Modeling and Prediction of Generalized Contact Matrices
  A Bayesian model for multi-feature contact matrices that uses tensor structures and contingency table theory to satisfy structural constraints and impute missing contact features, validated on simulations and US/Germa...
- Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training
  A momentum schedule from critical damping speeds convergence and yields an optimizer-invariant diagnostic for locating and correcting specific underperforming layers in trained networks.
- Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X
  A roadside-camera plus infrastructure-vehicle LiDAR fusion model reaches 0.85 mAP on TUMTraf V2X but 44 of 50 test frames overlap with the released train and validation splits.
- From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments
  Gaussian and linear cropping strategies for large point clouds improve 3D neural network performance over spherical crops, especially in outdoor scenes, and achieve new state-of-the-art results.
- From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments
  Gaussian and related cropping strategies for point cloud subclouds improve 3D neural network performance over spherical cropping on large outdoor scenes.
- Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
  An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.
- Learning Minimal Representations of Many-Body Physics from Snapshots of a Quantum Simulator
  A VAE learns a minimal latent representation from noisy quantum simulator snapshots that correlates with the sine-Gordon equilibrium parameter and detects anomalous post-quench dynamics including frozen-in solitons.
- Heterogeneous and Adept Snapshot Distillation for 3D Semantic Segmentation
  HAS-KD combines information-oriented heterogeneous distillation from multi-modal models with adept snapshot distillation from training checkpoints to reach SOTA 3D semantic segmentation on ScanNetV2 and S3DIS without ...
- Staged Factorial Screening for Budget-Constrained Micro-Pretraining
  Staged factorial screening recovers stable early penalties from total batch, depth, and width in 2-10 minute pretraining runs and supports a bridge-centered recommendation through 24-hour continuations on two hosts.
- Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics
  A comprehensive review of deep learning techniques for computational mechanics, including LSTM for constitutive modeling, PINNs for PDE solving, optimizers, and kernel methods.