Traditional and Heavy-Tailed Self Regularization in Neural Network Models

Charles H. Martin; Michael W. Mahoney

arxiv: 1901.08276 · v1 · pith:B54UWSXAnew · submitted 2019-01-24 · 💻 cs.LG · stat.ML

keywords models · self-regularization · DNNs · heavy-tailed · implicit · matrices · regularization

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of regularization, such as Dropout or Weight Norm constraints. Building on recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit Self-Regularization can depend strongly on the many knobs of the training process. By exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size.
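
The "size scale" idea in the abstract can be illustrated with a toy experiment. The sketch below is not the authors' code and does not touch a real DNN: it assumes a purely Gaussian random matrix W as a stand-in for an untrained layer, forms X = W^T W / N, and compares the largest eigenvalue of X with the Marchenko-Pastur bulk edge from standard RMT. It then plants a hypothetical rank-one "signal" in W and shows the top eigenvalue moving outside the bulk. All names (mp_bulk_edges, correlation_matrix, largest_eigenvalue), the matrix sizes, and the signal strength are illustrative assumptions. Only the standard library is used, so the top eigenvalue is estimated by power iteration rather than a full eigendecomposition.

```python
import math
import random


def mp_bulk_edges(n_rows, n_cols, sigma2=1.0):
    """Marchenko-Pastur bulk edges for X = W^T W / n_rows.

    Assumes W has i.i.d. entries with variance sigma2 and n_rows >= n_cols.
    """
    ratio = n_cols / n_rows
    lower = sigma2 * (1.0 - math.sqrt(ratio)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(ratio)) ** 2
    return lower, upper


def correlation_matrix(w):
    """Return X = W^T W / N as an M x M list of lists."""
    n, m = len(w), len(w[0])
    x = [[0.0] * m for _ in range(m)]
    for row in w:
        for i in range(m):
            ri, xi = row[i], x[i]
            for j in range(m):
                xi[j] += ri * row[j]
    return [[value / n for value in xi] for xi in x]


def largest_eigenvalue(x, iters=300, seed=1):
    """Estimate the top eigenvalue of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    m = len(x)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    estimate = 0.0
    for _ in range(iters):
        y = [sum(xi[j] * v[j] for j in range(m)) for xi in x]
        estimate = sum(a * b for a, b in zip(v, y))  # Rayleigh quotient
        norm = math.sqrt(sum(c * c for c in y))
        v = [c / norm for c in y]
    return estimate


# Toy sizes (assumed): N x M matrix with aspect ratio N / M = 4.
N, M = 200, 50
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]

# Hypothetical planted signal: rank-one term s * u v^T with unit vectors u, v.
s = 3.0 * math.sqrt(N)
u = [1.0 / math.sqrt(N)] * N
v = [1.0 / math.sqrt(M)] * M
spiked = [[noise[i][j] + s * u[i] * v[j] for j in range(M)] for i in range(N)]

lam_minus, lam_plus = mp_bulk_edges(N, M)
top_noise = largest_eigenvalue(correlation_matrix(noise))
top_spiked = largest_eigenvalue(correlation_matrix(spiked))

print(f"MP bulk edges:            [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"top eigenvalue (noise):   {top_noise:.3f}")
print(f"top eigenvalue (spiked):  {top_spiked:.3f}")
```

In this toy setting the pure-noise eigenvalue should sit near the upper bulk edge, while the planted signal produces an eigenvalue well above it; that separation is the kind of scale the abstract compares to Tikhonov regularization. For real layer matrices a full eigendecomposition (for example numpy.linalg.eigvalsh) would be the more practical choice, and the heavy-tailed regime described in the abstract is not captured by this Gaussian sketch.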

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Selectivity and Shape in the Design of Forward-Forward Goodness Functions
cs.LG 2026-03 unverdicted novelty 7.0
Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.

Spectral phase transitions and trainability in neural network learning dynamics
cond-mat.dis-nn 2026-06 unverdicted novelty 6.0
SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.

Patnaik-Pearson intrinsic dimension for internal representations of neural networks
math.ST 2026-06 unverdicted novelty 6.0
Introduces the Patnaik-Pearson intrinsic dimension estimator, proves some of its properties, relates it to HTSR/SETOL for Pareto spectra, and applies it to track embedding dimension evolution in BERT-base and DeepSeek...

Patnaik-Pearson intrinsic dimension for internal representations of neural networks
math.ST 2026-06 unverdicted novelty 6.0
Introduces the Patnaik-Pearson intrinsic dimension estimator, relates it to HTSR/SETOL for Pareto spectral densities, and applies it to measure embedding dimension evolution in BERT-base and DeepSeek-R1-Distill-Qwen-1.

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
cs.LG 2026-05 conditional novelty 6.0
Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
cs.LG 2026-05 unverdicted novelty 6.0
A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning
cs.LG 2024-11 unverdicted novelty 6.0
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
cs.LG 2026-05 unverdicted novelty 5.0
LLR uses heavy-tailed self-regularization theory to set per-layer learning rates in Transformers, yielding faster convergence and higher zero-shot accuracy than uniform rates across model scales.