L2 Regularization versus Batch and Weight Normalization

Twan van Laarhoven · 2017 · cs.LG · arXiv 1706.05350

18 Pith papers cite this work. Polarity classification is still indexing.

abstract

Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
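The abstract's central claim, that weight scale does not change a normalized layer's function but does change the effective learning rate, can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the single normalized unit, the squared-error loss, and all names (`normalized_output`, `loss`, `numerical_grad`, `alpha`) are illustrative assumptions. It shows that scaling the weights by `alpha` leaves the output unchanged while the gradient shrinks by a factor of `1/alpha`.

```python
import numpy as np

def normalized_output(w, X):
    # One linear unit followed by batch normalization over the batch axis
    # (no learned gain or bias, no epsilon; an assumption made for clarity).
    z = X @ w
    return (z - z.mean()) / np.sqrt(z.var())

def loss(w, X, t):
    # Hypothetical squared-error loss on the normalized output.
    y = normalized_output(w, X)
    return 0.5 * np.mean((y - t) ** 2)

def numerical_grad(f, w, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # batch of 64 inputs with 5 features
t = rng.normal(size=64)        # arbitrary targets
w = rng.normal(size=5)
alpha = 3.0                    # positive rescaling of the weights

# 1. The normalized output does not depend on the weight scale.
out_same = np.allclose(normalized_output(w, X), normalized_output(alpha * w, X))

# 2. The gradient at alpha * w is the gradient at w divided by alpha.
g = numerical_grad(lambda v: loss(v, X, t), w)
g_scaled = numerical_grad(lambda v: loss(v, X, t), alpha * w)
grad_ratio_ok = np.allclose(g_scaled, g / alpha, rtol=1e-4, atol=1e-7)

# 3. The gradient has no component along w, so the data loss alone
#    never pulls the weight norm toward zero.
radial = float(np.dot(g, w))

print(out_same, grad_ratio_ok, abs(radial) < 1e-6)
```

Because the gradient scales as 1/alpha while the weights scale as alpha, a fixed-size SGD step changes the weight direction by an amount proportional to 1/alpha squared. Under these assumptions, an L2 penalty that shrinks the weight norm therefore acts like an increase in the effective learning rate rather than a constraint on the function being learned.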

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 3 · use method: 1

representative citing papers

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions compared to AdamW and Muon.

Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Neural networks admit large families of approximately equivalent solutions via neuron identifiability even without structural symmetry, enabling linear low-loss merging paths without prior alignment.

Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

q-bio.QM · 2025-12-17 · unverdicted · novelty 7.0

StructBioReasoner is a scalable multi-agent system that designs IDP-targeting biologics, with over 50% of 787 candidates for Der f 21 showing better binding free energy than human-designed references.

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

Progressive Growing of GANs for Improved Quality, Stability, and Variation

cs.NE · 2017-10-27 · accept · novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.

On the Nonlinearity of Learning Rate Scaling for LLM Training

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.

Muown Implicitly Performs Angular Step-size Decay

cs.LG · 2026-06-22 · conditional · novelty 6.0

Muown's update implicitly decays angular step size via magnitude modulation; AngularMuown decouples and schedules angular steps explicitly, yielding better empirical results.

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

cs.LG · 2026-06-11 · accept · novelty 6.0

Derives three-force decomposition of squared weight norm under AdamW and validates it on Pythia-70M models, plus spline recovery of alignment force from checkpoints.

Preserving Plasticity in Continual Learning via Dynamical Isometry

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Dynamical isometry (Jacobian singular values near 1) preserves plasticity in continual learning; an isometry-promoting regularizer and decoupled AdamO optimizer match or beat prior methods on supervised and RL benchmarks.

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · novelty 6.0

Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · novelty 6.0 · 2 refs

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

cs.LG · 2026-06-24 · conditional · novelty 5.0

Splitting weight matrices into a fixed-norm direction and learnable per-row/column magnitudes improves LLM training over AdamW/Muon, removes weight decay and warmup, and transfers the optimal LR across width.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

ScheduleFree+ scales schedule-free learning to LLMs with fixes for large batches and models, outperforming Warmup-Stable-Decay schedules by up to 31% at 1000 tokens per parameter.

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-to-data ratios.

Adaptive Norm-Based Regularization for Neural Networks

stat.ML · 2026-04-30 · unverdicted · novelty 5.0

Covariance-aware ridge and combined l1-l2 regularizers for neural networks yield better predictive performance and complexity control than standard penalties in simulations and applications to cooling-load prediction and leukemia classification.

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?

cs.LG · 2025-11-10

citing papers explorer

Showing 18 of 18 citing papers.

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

cs.LG · 2026-06-28 · unverdicted · none · ref 33 · internal anchor

Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability

cs.LG · 2026-06-03 · unverdicted · none · ref 75 · internal anchor

Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

q-bio.QM · 2025-12-17 · unverdicted · none · ref 63 · internal anchor

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · none · ref 10 · internal anchor

Progressive Growing of GANs for Improved Quality, Stability, and Variation

cs.NE · 2017-10-27 · accept · none · ref 50

On the Nonlinearity of Learning Rate Scaling for LLM Training

cs.LG · 2026-06-28 · unverdicted · none · ref 37 · internal anchor

Muown Implicitly Performs Angular Step-size Decay

cs.LG · 2026-06-22 · conditional · none · ref 1 · internal anchor

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

cs.LG · 2026-06-11 · accept · none · ref 17 · internal anchor

Preserving Plasticity in Continual Learning via Dynamical Isometry

cs.LG · 2026-06-08 · unverdicted · none · ref 15 · internal anchor

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · none · ref 25 · internal anchor

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · none · ref 88 · 2 links · internal anchor

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · none · ref 60

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

cs.LG · 2026-06-24 · conditional · none · ref 75 · internal anchor

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · none · ref 74 · internal anchor

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

cs.LG · 2026-05-18 · unverdicted · none · ref 35 · internal anchor

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

cs.LG · 2026-05-11 · unverdicted · none · ref 40

Adaptive Norm-Based Regularization for Neural Networks

stat.ML · 2026-04-30 · unverdicted · none · ref 20

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?

cs.LG · 2025-11-10 · unreviewed · ref 6 · internal anchor
