In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur; Nathan Srebro; Ryota Tomioka

arXiv: 1412.6614 · v4 · pith:KGHMGNU5new · submitted 2014-12-20 · 💻 cs.LG · cs.AI · cs.CV · stat.ML

classification 💻 cs.LG cs.AI cs.CV stat.ML

keywords learning bias deep inductive role analogy argue capacity

We present experiments demonstrating that some other form of capacity control, different from network size, plays a central role in learning multilayer feed-forward networks. We argue, partially through analogy to matrix factorization, that this is an inductive bias that can help shed light on deep learning.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding deep learning requires rethinking generalization
cs.LG 2016-11 accept novelty 8.0

State-of-the-art convolutional networks easily memorize random labels and unstructured noise images, indicating that generalization in deep learning cannot be explained by traditional capacity or regularization arguments.
Estimating Implicit Regularization in Deep Learning
stat.ML 2026-05 unverdicted novelty 7.0

Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
New Equivalences Between Interpolation and SVMs: Kernels and Structured Features
stat.ML 2023-05 unverdicted novelty 7.0

New conditions for support vector proliferation (SVP) in RKHS for bounded orthonormal systems and sub-Gaussian features, yielding generalization bounds for kernel SVMs beyond prior restrictive assumptions.
Memorisation, convergence and generalisation in generative models
stat.ML 2026-05 unverdicted novelty 6.0

Linear generative models memorize at small data loads but converge continuously once samples scale linearly with dimension; this convergence is insensitive to sharp recovery of principal latent factors.
Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and fo...
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
cs.LG 2026-04 unverdicted novelty 6.0

Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
Deep sequence models tend to memorize geometrically; it is unclear why
cs.LG 2025-10 unverdicted novelty 6.0

Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
cs.LG 2024-01 unverdicted novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
stat.ML 2022-06 unverdicted novelty 6.0

For orthogonal inputs, gradient flow on shallow ReLU nets with MSE loss at small init converges to zero loss, exhibits min-variation-norm bias, initial alignment, and saddle-to-saddle dynamics.
Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari
cs.LG 2026-05 unverdicted novelty 5.0

Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score ...
(How) Learning Rates Regulate Catastrophic Overtraining
cs.LG 2026-04 unverdicted novelty 5.0

Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.
On improving deep learning generalization with adaptive sparse connectivity
cs.NE 2019-06 unverdicted novelty 4.0

Sparse MLPs trained via SET plus neuron pruning achieve competitive performance on 15 datasets while pruning ~50% of hidden neurons and keeping parameter count linear in neuron count.