Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Kaifeng Lyu; Sanjeev Arora; Zhiyuan Li

arxiv: 1812.03981 · v1 · pith:W2P2XATXnew · submitted 2018-12-10 · 💻 cs.LG · stat.ML

classification 💻 cs.LG stat.ML

keywords gradient descent learning rate theoretical analysis batch normalization

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
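The scale-invariance property mentioned in the abstract can be illustrated with a small sketch. It is not the paper's experiment: the toy loss `f(w) = -<w, a> / ||w||`, the target direction `a`, the starting point, and the step count are all assumptions chosen for illustration, and only the constant learning rate 0.3 is taken from the abstract. For any scale-invariant loss the gradient is orthogonal to `w`, so each plain gradient step can only increase `||w||`, and the quantity `lr / ||w||^2` (often called the effective learning rate) shrinks on its own even though `lr` stays fixed.

```python
import math

def loss_and_grad(w, a):
    """Toy scale-invariant loss f(w) = -<w, a> / ||w|| and its gradient.

    f(c * w) = f(w) for every c > 0, mimicking weights followed by BN.
    (Hypothetical loss chosen for illustration, not taken from the paper.)
    """
    norm = math.sqrt(sum(x * x for x in w))
    u = [x / norm for x in w]                    # direction of w
    ua = sum(ui * ai for ui, ai in zip(u, a))    # <u, a>
    loss = -ua
    # Gradient is the tangential part of -a, scaled by 1 / ||w||.
    grad = [-(ai - ua * ui) / norm for ui, ai in zip(u, a)]
    return loss, grad

def run_gd(w0, a, lr=0.3, steps=200):
    """Plain gradient descent with a fixed learning rate."""
    w = list(w0)
    history = []  # entries: (loss, ||w||^2, ||grad||^2, <w, grad>)
    for _ in range(steps):
        loss, grad = loss_and_grad(w, a)
        sq_norm = sum(x * x for x in w)
        grad_sq = sum(g * g for g in grad)
        inner = sum(x * g for x, g in zip(w, grad))
        history.append((loss, sq_norm, grad_sq, inner))
        w = [x - lr * g for x, g in zip(w, grad)]
    return w, history

a = [0.6, 0.8, 0.0]            # unit-norm target direction (assumed)
w0 = [0.1, -0.05, 0.02]        # small initial norm, so lr = 0.3 starts out "too large"
w_final, history = run_gd(w0, a)

for t in (0, 1, 10, 199):
    loss, sq_norm, grad_sq, _ = history[t]
    print(f"t={t:3d}  loss={loss:+.6f}  ||w||^2={sq_norm:.4f}  "
          f"effective lr={0.3 / sq_norm:.4f}  ||grad||={math.sqrt(grad_sq):.2e}")
```

In this sketch the first step is large because `||w||` starts small, which inflates the norm and in turn lowers the effective learning rate for all later steps. This is only a qualitative picture of the auto-tuning behaviour; the convergence rates quoted in the abstract come from the paper's analysis, not from this toy run.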

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks
cs.LG 2026-06 unverdicted novelty 7.0

Symmetry-quotiented SGD dynamics in neural networks satisfy viscous Hamilton-Jacobi and Burgers-type equations after local-entropy coarse-graining, with rigorous shock formation under a gradient-field assumption.

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View
cs.LG 2026-06 unverdicted novelty 7.0

Low-Rank Decay induces effective-rank collapse in Query/Key matrices and widens the grokking regime on modular arithmetic tasks in scale-invariant Transformers.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.