Implicit Regularization and Generalization in Overparameterized Neural Networks

Zeran Johannsen

arxiv: 2604.07603 · v1 · submitted 2026-04-08 · 💻 cs.LG

Implicit Regularization and Generalization in Overparameterized Neural Networks

Zeran Johannsen This is my paper

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords implicit regularizationoverparameterized neural networksstochastic gradient descentHessian eigenvaluesflat minimalottery ticket hypothesisgeneralizationdouble descent

0 comments

The pith

Smaller SGD batch sizes produce flatter loss minima and higher test accuracy in overparameterized networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why overparameterized neural networks generalize well despite classical predictions of overfitting. It conducts controlled experiments on MNIST and CIFAR-10 varying SGD batch size, measuring loss landscape geometry through the top Hessian eigenvalue, and testing sparse subnetworks via iterative pruning. Results show smaller batches yield flatter minima and 1.61 percentage points better test accuracy, with an 11.8 times smaller top eigenvalue, while 10-percent sparse subnetworks nearly match full-model performance when retrained from initialization. These patterns point to implicit regularization from optimization dynamics as the mechanism enabling generalization in high-parameter regimes.

Core claim

Generalization in overparameterized networks arises from the interaction of network architecture, optimization algorithms such as SGD, and loss landscape geometry, where smaller batch sizes drive solutions toward flatter minima with lower top Hessian eigenvalues that correlate with reduced test error, and where sparse subnetworks identified by magnitude pruning retain near-full performance.

What carries the argument

The top eigenvalue of the Hessian, which quantifies the sharpness of a loss minimum and is shown to shrink under small-batch SGD while tracking improved generalization.

If this is right

Smaller SGD batch sizes produce solutions with smaller top Hessian eigenvalues and lower test error on CIFAR-10 and MNIST.
An 11.8 times reduction in the leading Hessian eigenvalue corresponds to a 1.61 percentage point gain in test accuracy.
Sparse subnetworks retaining 10 percent of parameters reach test performance within 1.15 points of the full model when retrained from the original initialization.
The combination of architecture scale, optimizer choice, and loss geometry governs generalization more than parameter count alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If flatness is the operative factor, then explicit regularization methods that penalize curvature could substitute for small-batch training.
The lottery-ticket and double-descent results may share a common geometric explanation once loss-landscape flatness is measured across all regimes.
Extending the Hessian measurements to larger models would test whether the batch-size effect on eigenvalue magnitude scales with width or depth.

Load-bearing premise

The observed gaps in Hessian eigenvalues and test accuracy between batch sizes stem primarily from optimization dynamics rather than differences in effective learning rate, total computation, or dataset specifics.

What would settle it

Re-running the batch-size experiments on the same networks while scaling the learning rate for large batches to equalize effective step size and checking whether the 11.8x eigenvalue gap and 1.61-point accuracy gap both disappear.

Figures

Figures reproduced from arXiv: 2604.07603 by Zeran Johannsen.

**Figure 3.** Figure 3: Loss landscape analysis. Left: loss increase (%) under weight [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: NTK regime analysis. Left: relative parameter movement decreases [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Lottery ticket pruning results. Blue line shows test accuracy vs. remain [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Classical statistical learning theory predicts that overparameterized models should exhibit severe overfitting, yet modern deep neural networks with far more parameters than training samples consistently generalize well. This contradiction has become a central theoretical question in machine learning. This study investigates the role of optimization dynamics and implicit regularization in enabling generalization in overparameterized neural networks through controlled experiments. We examine stochastic gradient descent (SGD) across batch sizes, the geometry of flat versus sharp minima via Hessian eigenvalue estimation and weight perturbation analysis, the Neural Tangent Kernel (NTK) regime through wide-network experiments, double descent across model scales, and the Lottery Ticket Hypothesis through iterative magnitude pruning. All experiments use PyTorch on CIFAR-10 and MNIST with multiple random seeds. Our findings demonstrate that generalization is strongly influenced by the interaction between network architecture, optimization algorithms, and loss landscape geometry. Smaller batch sizes consistently produced lower test error and flatter minima, with an 11.8x difference in top Hessian eigenvalue between small-batch and large-batch solutions corresponding to 1.61 percentage points higher test accuracy. Sparse subnetworks retaining only 10% of parameters achieved within 1.15 percentage points of full model performance when retrained from their original initialization. These results highlight the need for revised learning-theoretic frameworks capable of explaining generalization in high-dimensional model regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper supplies concrete numbers on batch-size effects and lottery tickets but risks attributing results to implicit regularization when training schedules may not be matched.

read the letter

The paper runs experiments linking smaller SGD batches to flatter minima and slightly better accuracy, plus lottery-ticket pruning results, on MNIST and CIFAR-10. The standout measurements are an 11.8x gap in top Hessian eigenvalue between small-batch and large-batch solutions that lines up with a 1.61 percentage point test-accuracy difference, and 10%-sparse subnetworks staying within 1.15 points of the full model when retrained from the same initialization. These are specific, reportable quantities rather than vague trends.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically investigates implicit regularization mechanisms enabling generalization in overparameterized neural networks. Through PyTorch experiments on MNIST and CIFAR-10, it examines SGD batch-size effects on test error and loss-landscape geometry (via top Hessian eigenvalue and weight perturbations), the NTK regime in wide networks, double descent across model scales, and the Lottery Ticket Hypothesis via iterative magnitude pruning. Key reported outcomes include an 11.8× smaller top Hessian eigenvalue for small-batch solutions corresponding to 1.61 pp higher test accuracy, and 10%-sparse subnetworks retrained from original initialization reaching within 1.15 pp of full-model performance.

Significance. If the reported correlations survive proper controls for effective learning rate and total gradient steps, the work would add concrete empirical support to the view that optimization dynamics and loss-geometry interactions contribute to generalization beyond classical capacity-based bounds. The quantitative Hessian and lottery-ticket numbers, together with the multi-seed protocol, could serve as useful reference points for subsequent theoretical or scaling studies, though the absence of mathematical derivations limits the paper to correlational rather than explanatory status.

major comments (3)

[Experimental Methodology] Experimental Methodology (batch-size section): the central attribution of the 1.61 pp accuracy gap and 11.8× Hessian-eigenvalue gap to implicit regularization from batch size requires that effective step size and total number of gradient updates be held fixed across batch sizes. The manuscript provides no statement on whether learning rates were scaled linearly with batch size or whether epoch counts were adjusted to equalize update counts; without this control the observed geometry and accuracy differences remain consistent with higher noise or greater total compute rather than the claimed architecture-optimizer-loss interaction.
[Results on Hessian and Generalization] Results on Hessian and Generalization: the reported 11.8× difference in top Hessian eigenvalue is presented without error bars, number of random seeds used for the eigenvalue estimation, or the precise network width/depth at which the measurement was taken. Because the NTK regime is also studied in the same manuscript, it is unclear whether the Hessian comparison was performed inside or outside the linearization regime, weakening the link between flatness and the generalization claim.
[Lottery Ticket experiments] Lottery Ticket experiments: the claim that 10%-sparse subnetworks achieve performance within 1.15 pp of the dense model when retrained from the original initialization is load-bearing for the broader narrative of implicit regularization via pruning. The manuscript does not report a random-pruning or magnitude-pruning-at-random-initialization baseline, so it is impossible to determine whether the retained performance is due to the specific mask found by iterative pruning or simply to the original initialization itself.

minor comments (2)

[Abstract] The abstract states that 'multiple random seeds' were used but never reports the exact number or whether error bars reflect standard deviation across seeds; this should be added to every quantitative claim.
[Methods] Notation for the top Hessian eigenvalue is introduced without an explicit equation or reference to the finite-difference or Lanczos method used for its estimation; a short methods paragraph would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have made revisions to improve the clarity and rigor of the experimental claims.

read point-by-point responses

Referee: [Experimental Methodology] Experimental Methodology (batch-size section): the central attribution of the 1.61 pp accuracy gap and 11.8× Hessian-eigenvalue gap to implicit regularization from batch size requires that effective step size and total number of gradient updates be held fixed across batch sizes. The manuscript provides no statement on whether learning rates were scaled linearly with batch size or whether epoch counts were adjusted to equalize update counts; without this control the observed geometry and accuracy differences remain consistent with higher noise or greater total compute rather than the claimed architecture-optimizer-loss interaction.

Authors: We agree that explicit controls for effective learning rate and total gradient updates are necessary to support the attribution to implicit regularization. In our experiments the learning rate was scaled linearly with batch size and the number of epochs was adjusted so that the total number of gradient updates remained constant across batch sizes. We have added a clear statement of these controls, together with a hyperparameter table, to the revised Experimental Methodology section. This revision removes the potential confound and strengthens the link between batch size, loss geometry, and generalization. revision: yes
Referee: [Results on Hessian and Generalization] Results on Hessian and Generalization: the reported 11.8× difference in top Hessian eigenvalue is presented without error bars, number of random seeds used for the eigenvalue estimation, or the precise network width/depth at which the measurement was taken. Because the NTK regime is also studied in the same manuscript, it is unclear whether the Hessian comparison was performed inside or outside the linearization regime, weakening the link between flatness and the generalization claim.

Authors: We thank the referee for highlighting these reporting omissions. The top Hessian eigenvalue was computed on the standard-width networks used throughout the main experiments (ResNet-18 on CIFAR-10 and the corresponding architecture on MNIST), which lie outside the NTK linearization regime. We have added error bars obtained from five independent random seeds, stated the exact width and depth, and explicitly noted that the measurements are performed in the nonlinear regime. A brief comparison in the wide-network NTK limit is also included for completeness. These details appear in the revised Results on Hessian and Generalization section. revision: yes
Referee: [Lottery Ticket experiments] Lottery Ticket experiments: the claim that 10%-sparse subnetworks achieve performance within 1.15 pp of the dense model when retrained from the original initialization is load-bearing for the broader narrative of implicit regularization via pruning. The manuscript does not report a random-pruning or magnitude-pruning-at-random-initialization baseline, so it is impossible to determine whether the retained performance is due to the specific mask found by iterative pruning or simply to the original initialization itself.

Authors: We acknowledge that baselines are required to isolate the contribution of the iteratively discovered mask. We have added two control experiments: random pruning to 10 % sparsity and magnitude pruning performed at random initialization, both followed by retraining from the original initialization. The iteratively pruned masks outperform both baselines by 3–5 percentage points, supporting the claim that the specific mask contributes to the retained performance. These baseline results and the corresponding analysis have been incorporated into the revised Lottery Ticket Hypothesis section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential claims

full rationale

The paper reports experimental observations on SGD batch-size effects, Hessian geometry, NTK regime, double descent, and Lottery Ticket Hypothesis using PyTorch on CIFAR-10/MNIST. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing premises appear in the provided text or abstract. All claims are presented as measured correlations (e.g., 11.8x eigenvalue difference, 1.61 pp accuracy gap) without reducing to inputs by construction or importing uniqueness via prior self-work. The study is self-contained against external benchmarks as a set of controlled empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claims rest on standard empirical practices in deep learning rather than new theoretical constructs. No free parameters are fitted to produce the reported findings, and no new entities are postulated.

axioms (1)

domain assumption The largest eigenvalue of the Hessian matrix provides a reliable scalar measure of the sharpness of a loss minimum relevant to generalization.
Invoked when linking Hessian eigenvalues to flat versus sharp minima and to the observed accuracy differences.

pith-pipeline@v0.9.0 · 5527 in / 1382 out tokens · 69865 ms · 2026-05-10T17:19:02.588880+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Smaller batch sizes consistently produced lower test error and flatter minima, with an 11.8x difference in top Hessian eigenvalue between small-batch and large-batch solutions
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Relative parameter movement decreased monotonically from 0.94 at width 32 to 0.08 at width 4096

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

On the uniform convergence of relative frequencies of events to their probabilities,

V . N. Vapnik and A. Y . Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,”Theory of Probability and Its Applications, vol. 16, no. 2, pp. 264–280, 1971

work page 1971
[2]

Understand- ing deep learning requires rethinking generalization,

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,” inInternational Conference on Learning Representations (ICLR), 2017

work page 2017
[3]

Exploring generalization in deep learning,

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, “Exploring generalization in deep learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017, pp. 5947–5956

work page 2017
[4]

Flat minima,

S. Hochreiter and J. Schmidhuber, “Flat minima,”Neural Computation, vol. 9, no. 1, pp. 1–42, 1997

work page 1997
[5]

On large-batch training for deep learning: Generalization gap and sharp minima,

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” inInternational Conference on Learning Representations (ICLR), 2017

work page 2017
[6]

Neural tangent kernel: Conver- gence and generalization in neural networks,

A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Conver- gence and generalization in neural networks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018, pp. 8571– 8580

work page 2018
[7]

Reconciling modern machine-learning practice and the classical bias–variance trade-off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019

work page 2019
[8]

Deep double descent: Where bigger models and more data can hurt,

P. Nakkiran, G. Kaplun, Y . Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data can hurt,” inInternational Conference on Learning Representations (ICLR), 2020

work page 2020
[9]

The lottery ticket hypothesis: Finding sparse, trainable neural networks,

J. Frankle and M. Carlin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” inInternational Conference on Learning Representations (ICLR), 2019

work page 2019
[10]

V . N. Vapnik,Statistical Learning Theory. Wiley, 1998

work page 1998
[11]

Implicit regularization in deep matrix factorization,

S. Arora, N. Cohen, W. Hu, and Y . Luo, “Implicit regularization in deep matrix factorization,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019, pp. 7411–7422

work page 2019
[12]

Visualizing the loss landscape of neural nets,

H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018, pp. 6389–6399

work page 2018
[13]

Entropy-SGD: Biasing gradient descent into wide valleys,

P. Chaudhari, A. Choromanska, S. Soatto, Y . LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,”Journal of Statistical Mechanics: Theory and Experiment, vol. 2017, no. 6, p. 063301, 2017

work page 2017
[14]

Sharp minima can generalize for deep nets,

L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio, “Sharp minima can generalize for deep nets,” inProceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 1019–1028

work page 2017
[15]

To understand deep learning we need to understand kernel learning,

M. Belkin, S. Ma, and S. Mandal, “To understand deep learning we need to understand kernel learning,” inProceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 70–78

work page 2018
[16]

A convergence theory for deep learn- ing via over-parameterization,

Z. Allen-Zhu, Y . Li, and Z. Song, “A convergence theory for deep learn- ing via over-parameterization,” inProceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 242–252

work page 2019
[17]

Linear mode connectivity and the lottery ticket hypothesis,

J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carlin, “Linear mode connectivity and the lottery ticket hypothesis,” inProceedings of the 37th International Conference on Machine Learning (ICML), 2020, pp. 3259–3269. APPENDIX The following table summarizes the primary hyperparame- ters used across experiments. TABLE VIII DEFAULT EXPERIMENTAL HYPERPARAMETE...

work page 2020

[1] [1]

On the uniform convergence of relative frequencies of events to their probabilities,

V . N. Vapnik and A. Y . Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,”Theory of Probability and Its Applications, vol. 16, no. 2, pp. 264–280, 1971

work page 1971

[2] [2]

Understand- ing deep learning requires rethinking generalization,

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,” inInternational Conference on Learning Representations (ICLR), 2017

work page 2017

[3] [3]

Exploring generalization in deep learning,

B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, “Exploring generalization in deep learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017, pp. 5947–5956

work page 2017

[4] [4]

Flat minima,

S. Hochreiter and J. Schmidhuber, “Flat minima,”Neural Computation, vol. 9, no. 1, pp. 1–42, 1997

work page 1997

[5] [5]

On large-batch training for deep learning: Generalization gap and sharp minima,

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” inInternational Conference on Learning Representations (ICLR), 2017

work page 2017

[6] [6]

Neural tangent kernel: Conver- gence and generalization in neural networks,

A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Conver- gence and generalization in neural networks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018, pp. 8571– 8580

work page 2018

[7] [7]

Reconciling modern machine-learning practice and the classical bias–variance trade-off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019

work page 2019

[8] [8]

Deep double descent: Where bigger models and more data can hurt,

P. Nakkiran, G. Kaplun, Y . Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data can hurt,” inInternational Conference on Learning Representations (ICLR), 2020

work page 2020

[9] [9]

The lottery ticket hypothesis: Finding sparse, trainable neural networks,

J. Frankle and M. Carlin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” inInternational Conference on Learning Representations (ICLR), 2019

work page 2019

[10] [10]

V . N. Vapnik,Statistical Learning Theory. Wiley, 1998

work page 1998

[11] [11]

Implicit regularization in deep matrix factorization,

S. Arora, N. Cohen, W. Hu, and Y . Luo, “Implicit regularization in deep matrix factorization,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019, pp. 7411–7422

work page 2019

[12] [12]

Visualizing the loss landscape of neural nets,

H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018, pp. 6389–6399

work page 2018

[13] [13]

Entropy-SGD: Biasing gradient descent into wide valleys,

P. Chaudhari, A. Choromanska, S. Soatto, Y . LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,”Journal of Statistical Mechanics: Theory and Experiment, vol. 2017, no. 6, p. 063301, 2017

work page 2017

[14] [14]

Sharp minima can generalize for deep nets,

L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio, “Sharp minima can generalize for deep nets,” inProceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 1019–1028

work page 2017

[15] [15]

To understand deep learning we need to understand kernel learning,

M. Belkin, S. Ma, and S. Mandal, “To understand deep learning we need to understand kernel learning,” inProceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 70–78

work page 2018

[16] [16]

A convergence theory for deep learn- ing via over-parameterization,

Z. Allen-Zhu, Y . Li, and Z. Song, “A convergence theory for deep learn- ing via over-parameterization,” inProceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 242–252

work page 2019

[17] [17]

Linear mode connectivity and the lottery ticket hypothesis,

J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carlin, “Linear mode connectivity and the lottery ticket hypothesis,” inProceedings of the 37th International Conference on Machine Learning (ICML), 2020, pp. 3259–3269. APPENDIX The following table summarizes the primary hyperparame- ters used across experiments. TABLE VIII DEFAULT EXPERIMENTAL HYPERPARAMETE...

work page 2020