On the Generalization Mystery in Deep Learning

Piotr Zielinski; Satrajit Chatterjee

arxiv: 2203.10036 · v3 · pith:DCCKEZTUnew · submitted 2022-03-18 · 💻 cs.LG

keywords deep, generalization, learning, datasets, well, argue, different, examples

The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy to compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore, it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue that the proposed approach is the most viable one on this basis.
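To make the idea of "well-aligned per-example gradients" concrete, here is a minimal sketch of one way such alignment could be scored. The abstract does not state the formula of the coherence metric, so the ratio used below (squared norm of the mean gradient over the mean squared gradient norm) is an assumption chosen for illustration, and the function name `alignment_score` and the toy gradient lists are hypothetical.

```python
# Hypothetical illustration of a gradient-alignment score.
# Assumption: the abstract does not give the metric's formula; this ratio
# is one plausible way to quantify how well per-example gradients agree.

def alignment_score(per_example_grads):
    """Return ||mean gradient||^2 / mean(||gradient||^2).

    per_example_grads: non-empty list of equal-length lists of floats,
    one gradient vector per training example.
    The score is 1.0 when all gradients are identical, exactly 1/m when
    the m gradients are mutually orthogonal with equal norm, and 0.0 when
    they cancel out (or are all zero).
    """
    m = len(per_example_grads)
    dim = len(per_example_grads[0])
    # Average the gradients coordinate by coordinate.
    mean_grad = [sum(g[i] for g in per_example_grads) / m for i in range(dim)]
    sq_norm_of_mean = sum(x * x for x in mean_grad)
    # Average of the squared norms of the individual gradients.
    mean_sq_norm = sum(sum(x * x for x in g) for g in per_example_grads) / m
    if mean_sq_norm == 0.0:
        return 0.0  # all gradients are zero; define the score as 0
    return sq_norm_of_mean / mean_sq_norm


# Toy gradients (made up for illustration only).
aligned = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
opposing = [[1.0, 0.0], [-1.0, 0.0]]

score_aligned = alignment_score(aligned)
score_orthogonal = alignment_score(orthogonal)
score_opposing = alignment_score(opposing)
```

Under this assumed definition, gradients that reinforce each other score near 1, while gradients that point in unrelated or opposing directions score near 1/m or 0, which matches the intuition in the abstract that coherent gradients behave differently from incoherent ones.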

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
cond-mat.dis-nn 2026-05 unverdicted novelty 5.0

RBMs using exponential activation functions can represent and learn data structures with strong higher-order interactions better than linear, step or ReLU activations, but only inside an analytically determined parame...
Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
cond-mat.dis-nn 2026-05 unverdicted novelty 4.0

RBMs with Gaussian weights rarely induce or easily learn distributions with strong higher-order interactions on visible units, except when the hidden-unit activation function is Exponential.
Sharpness-Aware Minimization with Z-Score Gradient Filtering
cs.LG 2025-05 unverdicted novelty 4.0

Z-Score Filtered SAM retains only high absolute Z-score gradient components per layer during the ascent step and reports higher test accuracy than standard SAM on CIFAR and Tiny-ImageNet benchmarks.