arxiv: 1706.05350 · v1 · pith:NNWZ43PUnew · submitted 2017-06-16 · 💻 cs.LG · stat.ML

L2 Regularization versus Batch and Weight Normalization

Twan van Laarhoven

keywords: normalization, regularization, batch, influence, learning, networks, neural, rate

Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
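
The scale argument in the abstract can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a single linear layer followed by batch normalization without learned scale or shift, a squared-error loss, and hypothetical helper names (`normalized_layer`, `loss`, `numerical_grad`). It shows that multiplying the weights by a constant leaves the layer output unchanged while the gradient shrinks by the same constant, which is why the weight scale behaves like a rescaling of the learning rate.

```python
import numpy as np

def batch_norm(z):
    # Normalize each output unit over the batch axis (no learned scale/shift; assumption).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0))

def normalized_layer(w, x):
    # Linear layer followed by batch normalization.
    return batch_norm(x @ w)

def loss(w, x, target):
    # Squared-error loss, chosen only for illustration.
    return 0.5 * np.mean((normalized_layer(w, x) - target) ** 2)

def numerical_grad(f, w, h=1e-6):
    # Central finite differences, one weight entry at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = h
        g[idx] = (f(w + step) - f(w - step)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))        # batch of 64 inputs
target = rng.normal(size=(64, 3))   # arbitrary regression targets
w = rng.normal(size=(5, 3))
alpha = 10.0                        # rescaling factor for the weights

# 1) The normalized output does not depend on the scale of w.
out_same = np.allclose(normalized_layer(w, x), normalized_layer(alpha * w, x))

# 2) The gradient at alpha * w is 1/alpha times the gradient at w.
f = lambda v: loss(v, x, target)
g = numerical_grad(f, w)
g_scaled = numerical_grad(f, alpha * w)
grad_ratio = np.linalg.norm(g) / np.linalg.norm(g_scaled)

# 3) The L2 penalty does change with scale, although the data loss does not.
penalty_ratio = np.sum((alpha * w) ** 2) / np.sum(w ** 2)

print(out_same, grad_ratio, penalty_ratio)
```

Under these assumptions, an L2 penalty can be reduced simply by shrinking the weights without changing what the network computes, so its main effect is on the weight norm. A gradient step of fixed size then moves the weight direction further when the norm is small, which is the sense in which regularization influences the effective learning rate.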

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins
q-bio.QM · 2025-12 · unverdicted · novelty 7.0

StructBioReasoner is a scalable multi-agent system that designs IDP-targeting biologics, with over 50% of 787 candidates for Der f 21 showing better binding free energy than human-designed references.

How does the optimizer implicitly bias the model merging loss landscape?
cs.LG · 2025-10 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

Progressive Growing of GANs for Improved Quality, Stability, and Variation
cs.NE · 2017-10 · accept · novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.

Demystifying Manifold Constraints in LLM Pre-training
cs.LG · 2026-05 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG · 2026-04 · unverdicted · novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?
cs.LG · 2025-11 · unverdicted · novelty 6.0

A thermodynamic framework maps SGD stationary distributions in scale-invariant networks to ideal-gas behavior, with training hyperparameters acting as thermodynamic variables.

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
cs.LG · 2026-05 · unverdicted · novelty 5.0

XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-t...

Adaptive Norm-Based Regularization for Neural Networks
stat.ML · 2026-04 · unverdicted · novelty 5.0

Covariance-aware ridge and combined l1-l2 regularizers for neural networks yield better predictive performance and complexity control than standard penalties in simulations and applications to cooling-load prediction ...