Three Mechanisms of Weight Decay Regularization
Abstract
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization. Literal weight decay has been shown to outperform $L_2$ regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
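The abstract's distinction between literal weight decay and $L_2$ regularization can be made concrete with a minimal sketch of a single Adam step. The two coincide for plain SGD, but for Adam the $L_2$ penalty gradient passes through the adaptive rescaling while decoupled (AdamW-style) weight decay shrinks the weights directly. The function names and the toy parameter values below are illustrative, not from the paper.

```python
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam update on weights w given gradient g."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def step_l2(w, g, m, v, lam=0.1, **kw):
    # L2 regularization: the penalty gradient lam*w is added to the loss
    # gradient, so it is rescaled by Adam's adaptive denominator.
    return adam_step(w, g + lam * w, m, v, **kw)

def step_decoupled(w, g, m, v, lam=0.1, lr=1e-3, **kw):
    # Literal (decoupled) weight decay: take the Adam step on the raw
    # gradient, then shrink the weights directly, bypassing the rescaling.
    w_new, m, v = adam_step(w, g, m, v, lr=lr, **kw)
    return w_new - lr * lam * w, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.1])
m = np.zeros(2)
v = np.zeros(2)
w_l2, _, _ = step_l2(w, g, m, v, lam=0.1, t=1)
w_dec, _, _ = step_decoupled(w, g, m, v, lam=0.1, t=1)
print(w_l2, w_dec)  # the two updates differ for Adam
```

With `lam=0` the two rules coincide; with any nonzero decay they diverge, which is the sense in which the abstract says the methods differ "for optimizers for which they differ."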
Forward citations
Cited by 3 Pith papers
- Vibrational infrared and Raman spectra of the methanol molecule with equivariant neural-network property surfaces: Equivariant neural networks produce dipole and polarizability surfaces for methanol that enable variational computation of vibrational IR and Raman spectra agreeing with experiment to 2.2 cm^{-1} RMSD on fundamentals.
- Demystifying Manifold Constraints in LLM Pre-training: Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
- Dante: An Open Source Model Pre-Training and Fine-Tuning Tool for the Dafne Federated Framework for Medical Image Segmentation: Dante is a new open-source backend for the Dafne ecosystem that implements configurable training from scratch, layer freezing, and channel-wise LoRA for medical image segmentation, with validation showing faster conve...