Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

2018 · cs.LG · arXiv 1812.03981

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
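
The mechanism behind the fixed-learning-rate claim can be illustrated numerically. The following is a minimal sketch, not the paper's setup: it uses a hypothetical toy loss that depends on `w` only through its direction `w / ||w||`, standing in for the weights of a layer followed by BN. The names `loss`, `grad`, `run_gd`, the target direction `a`, and the starting point `w0` are all assumptions made for illustration. For any such scale-invariant loss the gradient is orthogonal to `w`, so each gradient descent step can only increase `||w||`, and since the gradient scales as `1 / ||w||`, the step taken on the direction of `w` shrinks roughly like `lr / ||w||^2` even though `lr` itself stays fixed at 0.3.

```python
import numpy as np

def loss(w, a):
    """Toy scale-invariant loss: depends on w only through w / ||w||."""
    u = w / np.linalg.norm(w)
    return 1.0 - float(u @ a)

def grad(w, a):
    """Analytic gradient of `loss`; orthogonal to w and scales as 1 / ||w||."""
    n = np.linalg.norm(w)
    u = w / n
    return -(a - (u @ a) * u) / n

def run_gd(w0, a, lr=0.3, steps=200):
    """Plain gradient descent with a constant learning rate."""
    w = w0.copy()
    norms = [np.linalg.norm(w)]
    grad_norms = []
    for _ in range(steps):
        g = grad(w, a)
        grad_norms.append(np.linalg.norm(g))
        w = w - lr * g
        norms.append(np.linalg.norm(w))
    return w, np.array(norms), np.array(grad_norms)

# Hypothetical target direction and starting point, chosen only for illustration.
a = np.ones(5) / np.sqrt(5.0)
w0 = np.array([1.0, 2.0, -1.0, 0.5, 0.0])

w, norms, grad_norms = run_gd(w0, a)

print("loss is scale-invariant:", np.isclose(loss(3.0 * w0, a), loss(w0, a)))
print("gradient is orthogonal to w:", np.isclose(grad(w0, a) @ w0, 0.0))
print("weight norm never decreases:", bool(np.all(np.diff(norms) >= -1e-12)))
print("gradient norm shrinks:", bool(grad_norms[-1] < grad_norms[0]))
```

The sketch only demonstrates the two geometric facts (orthogonal gradient, non-decreasing norm) on a toy problem; it does not reproduce the $T^{-1/2}$ or $T^{-1/4}$ rates stated in the abstract.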

representative citing papers

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Low-Rank Decay induces effective-rank collapse in Query/Key matrices and widens the grokking regime on modular arithmetic tasks in scale-invariant Transformers.

citing papers explorer

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

cs.LG · 2026-06-03 · unverdicted · none · ref 5 · internal anchor

Low-Rank Decay induces effective-rank collapse in Query/Key matrices and widens the grokking regime on modular arithmetic tasks in scale-invariant Transformers.
