
arxiv: 2604.07603 · v1 · submitted 2026-04-08 · 💻 cs.LG

Implicit Regularization and Generalization in Overparameterized Neural Networks

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords implicit regularization · overparameterized neural networks · stochastic gradient descent · Hessian eigenvalues · flat minima · lottery ticket hypothesis · generalization · double descent

The pith

Smaller SGD batch sizes produce flatter loss minima and higher test accuracy in overparameterized networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why overparameterized neural networks generalize well despite classical predictions of overfitting. It runs controlled experiments on MNIST and CIFAR-10 that vary SGD batch size, measure loss-landscape geometry through the top Hessian eigenvalue, and test sparse subnetworks found by iterative magnitude pruning. Small-batch solutions have a top Hessian eigenvalue 11.8 times smaller than large-batch solutions and 1.61 percentage points higher test accuracy, and subnetworks retaining 10 percent of parameters come within 1.15 percentage points of full-model performance when retrained from their original initialization. The paper reads these patterns as evidence that implicit regularization from optimization dynamics enables generalization in high-parameter regimes.

Core claim

Generalization in overparameterized networks arises from the interaction of network architecture, optimization algorithms such as SGD, and loss landscape geometry, where smaller batch sizes drive solutions toward flatter minima with lower top Hessian eigenvalues that correlate with reduced test error, and where sparse subnetworks identified by magnitude pruning retain near-full performance.

What carries the argument

The top eigenvalue of the Hessian, which quantifies the sharpness of a loss minimum and is shown to shrink under small-batch SGD while tracking improved generalization.
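To make the sharpness measure concrete, here is a minimal sketch of estimating the top Hessian eigenvalue by power iteration on finite-difference Hessian-vector products. It is an illustration only: the review does not say which estimator the paper uses, and the function names, the toy quadratic losses, and their curvature values are all assumptions made for this example.

```python
import math
import random


def hvp(grad_fn, theta, v, eps=1e-4):
    """Approximate H @ v with a central finite difference of the gradient."""
    plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def top_hessian_eigenvalue(grad_fn, theta, iters=200, seed=0):
    """Power iteration at theta; converges to the largest-magnitude eigenvalue."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        # Rayleigh quotient v^T H v (v has unit norm).
        eigenvalue = sum(a * b for a, b in zip(v, hv))
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eigenvalue


def make_quadratic_grad(curvatures):
    """Gradient of the toy loss 0.5 * sum(c_i * theta_i^2); its Hessian is diag(c)."""
    return lambda theta: [c * t for c, t in zip(curvatures, theta)]


# Hypothetical curvatures, chosen only so the two minima differ in sharpness.
minimum = [0.0, 0.0, 0.0]
sharp = top_hessian_eigenvalue(make_quadratic_grad([8.0, 1.0, 0.5]), minimum)
flat = top_hessian_eigenvalue(make_quadratic_grad([2.0, 1.0, 0.5]), minimum)
print(f"sharp: {sharp:.3f}  flat: {flat:.3f}  ratio: {sharp / flat:.2f}")
```

For a real network the gradient function would come from automatic differentiation rather than a closed form, and a Lanczos-based estimator is a common alternative when several leading eigenvalues are needed.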

If this is right

  • Smaller SGD batch sizes produce solutions with smaller top Hessian eigenvalues and lower test error on CIFAR-10 and MNIST.
  • An 11.8 times reduction in the leading Hessian eigenvalue corresponds to a 1.61 percentage point gain in test accuracy.
  • Sparse subnetworks retaining 10 percent of parameters reach test performance within 1.15 points of the full model when retrained from the original initialization.
  • The combination of architecture scale, optimizer choice, and loss geometry governs generalization more than parameter count alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If flatness is the operative factor, then explicit regularization methods that penalize curvature could substitute for small-batch training.
  • The lottery-ticket and double-descent results may share a common geometric explanation once loss-landscape flatness is measured across all regimes.
  • Extending the Hessian measurements to larger models would test whether the batch-size effect on eigenvalue magnitude scales with width or depth.

Load-bearing premise

The observed gaps in Hessian eigenvalues and test accuracy between batch sizes stem primarily from optimization dynamics rather than differences in effective learning rate, total computation, or dataset specifics.

What would settle it

Re-running the batch-size experiments on the same networks while scaling the learning rate for large batches to equalize effective step size and checking whether the 11.8x eigenvalue gap and 1.61-point accuracy gap both disappear.
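One way to set up the control described above is sketched here: scale the learning rate linearly with batch size and pick the epoch count so that every run performs roughly the same number of gradient updates. The function name, the baseline hyperparameters, and the batch sizes are hypothetical and are not taken from the paper; only the CIFAR-10 training-set size of 50,000 is a known figure.

```python
import math


def matched_schedule(dataset_size, base_batch, base_lr, base_epochs, batch_size):
    """Return lr and epochs for batch_size that mirror a baseline run.

    The learning rate follows the linear scaling rule, and epochs are rounded
    up so the run takes at least as many gradient updates as the baseline.
    """
    target_updates = math.ceil(dataset_size / base_batch) * base_epochs
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    epochs = math.ceil(target_updates / steps_per_epoch)
    return {
        "lr": base_lr * batch_size / base_batch,
        "epochs": epochs,
        "updates": epochs * steps_per_epoch,
    }


# Hypothetical baseline: batch 32, lr 0.01, 10 epochs on 50,000 training images.
small = matched_schedule(50_000, 32, 0.01, 10, batch_size=32)
large = matched_schedule(50_000, 32, 0.01, 10, batch_size=512)
print(small)
print(large)
```

Equalizing update counts means the large-batch run processes many more samples, so this controls for step count but not for total compute; a study could alternatively fix the number of epochs and report both settings.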

Figures

Figures reproduced from arXiv: 2604.07603 by Zeran Johannsen.

Figure 2: Test accuracy and generalization gap across optimizer configurations.
Figure 3: Loss landscape analysis. Left: loss increase (%) under weight ...
Figure 4: NTK regime analysis. Left: relative parameter movement decreases ...
Figure 5: Lottery ticket pruning results. Blue line shows test accuracy vs. remain...
Original abstract

Classical statistical learning theory predicts that overparameterized models should exhibit severe overfitting, yet modern deep neural networks with far more parameters than training samples consistently generalize well. This contradiction has become a central theoretical question in machine learning. This study investigates the role of optimization dynamics and implicit regularization in enabling generalization in overparameterized neural networks through controlled experiments. We examine stochastic gradient descent (SGD) across batch sizes, the geometry of flat versus sharp minima via Hessian eigenvalue estimation and weight perturbation analysis, the Neural Tangent Kernel (NTK) regime through wide-network experiments, double descent across model scales, and the Lottery Ticket Hypothesis through iterative magnitude pruning. All experiments use PyTorch on CIFAR-10 and MNIST with multiple random seeds. Our findings demonstrate that generalization is strongly influenced by the interaction between network architecture, optimization algorithms, and loss landscape geometry. Smaller batch sizes consistently produced lower test error and flatter minima, with an 11.8x difference in top Hessian eigenvalue between small-batch and large-batch solutions corresponding to 1.61 percentage points higher test accuracy. Sparse subnetworks retaining only 10% of parameters achieved within 1.15 percentage points of full model performance when retrained from their original initialization. These results highlight the need for revised learning-theoretic frameworks capable of explaining generalization in high-dimensional model regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically investigates implicit regularization mechanisms enabling generalization in overparameterized neural networks. Through PyTorch experiments on MNIST and CIFAR-10, it examines SGD batch-size effects on test error and loss-landscape geometry (via top Hessian eigenvalue and weight perturbations), the NTK regime in wide networks, double descent across model scales, and the Lottery Ticket Hypothesis via iterative magnitude pruning. Key reported outcomes include an 11.8× smaller top Hessian eigenvalue for small-batch solutions, corresponding to 1.61 pp higher test accuracy, and subnetworks retaining 10% of parameters that reach within 1.15 pp of full-model performance when retrained from the original initialization.

Significance. If the reported correlations survive proper controls for effective learning rate and total gradient steps, the work would add concrete empirical support to the view that optimization dynamics and loss-geometry interactions contribute to generalization beyond classical capacity-based bounds. The quantitative Hessian and lottery-ticket numbers, together with the multi-seed protocol, could serve as useful reference points for subsequent theoretical or scaling studies, though the absence of mathematical derivations limits the paper to correlational rather than explanatory status.

major comments (3)
  1. [Experimental Methodology] Experimental Methodology (batch-size section): the central attribution of the 1.61 pp accuracy gap and 11.8× Hessian-eigenvalue gap to implicit regularization from batch size requires that effective step size and total number of gradient updates be held fixed across batch sizes. The manuscript provides no statement on whether learning rates were scaled linearly with batch size or whether epoch counts were adjusted to equalize update counts; without this control the observed geometry and accuracy differences remain consistent with higher noise or greater total compute rather than the claimed architecture-optimizer-loss interaction.
  2. [Results on Hessian and Generalization] Results on Hessian and Generalization: the reported 11.8× difference in top Hessian eigenvalue is presented without error bars, number of random seeds used for the eigenvalue estimation, or the precise network width/depth at which the measurement was taken. Because the NTK regime is also studied in the same manuscript, it is unclear whether the Hessian comparison was performed inside or outside the linearization regime, weakening the link between flatness and the generalization claim.
  3. [Lottery Ticket experiments] Lottery Ticket experiments: the claim that subnetworks retaining 10% of parameters achieve performance within 1.15 pp of the dense model when retrained from the original initialization is load-bearing for the broader narrative of implicit regularization via pruning. The manuscript does not report a random-pruning or magnitude-pruning-at-random-initialization baseline, so it is impossible to determine whether the retained performance is due to the specific mask found by iterative pruning or simply to the original initialization itself.
minor comments (2)
  1. [Abstract] The abstract states that 'multiple random seeds' were used but never reports the exact number or whether error bars reflect standard deviation across seeds; this should be added to every quantitative claim.
  2. [Methods] Notation for the top Hessian eigenvalue is introduced without an explicit equation or reference to the finite-difference or Lanczos method used for its estimation; a short methods paragraph would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have made revisions to improve the clarity and rigor of the experimental claims.

  1. Referee: [Experimental Methodology] Experimental Methodology (batch-size section): the central attribution of the 1.61 pp accuracy gap and 11.8× Hessian-eigenvalue gap to implicit regularization from batch size requires that effective step size and total number of gradient updates be held fixed across batch sizes. The manuscript provides no statement on whether learning rates were scaled linearly with batch size or whether epoch counts were adjusted to equalize update counts; without this control the observed geometry and accuracy differences remain consistent with higher noise or greater total compute rather than the claimed architecture-optimizer-loss interaction.

    Authors: We agree that explicit controls for effective learning rate and total gradient updates are necessary to support the attribution to implicit regularization. In our experiments the learning rate was scaled linearly with batch size and the number of epochs was adjusted so that the total number of gradient updates remained constant across batch sizes. We have added a clear statement of these controls, together with a hyperparameter table, to the revised Experimental Methodology section. This revision removes the potential confound and strengthens the link between batch size, loss geometry, and generalization. revision: yes

  2. Referee: [Results on Hessian and Generalization] Results on Hessian and Generalization: the reported 11.8× difference in top Hessian eigenvalue is presented without error bars, number of random seeds used for the eigenvalue estimation, or the precise network width/depth at which the measurement was taken. Because the NTK regime is also studied in the same manuscript, it is unclear whether the Hessian comparison was performed inside or outside the linearization regime, weakening the link between flatness and the generalization claim.

    Authors: We thank the referee for highlighting these reporting omissions. The top Hessian eigenvalue was computed on the standard-width networks used throughout the main experiments (ResNet-18 on CIFAR-10 and the corresponding architecture on MNIST), which lie outside the NTK linearization regime. We have added error bars obtained from five independent random seeds, stated the exact width and depth, and explicitly noted that the measurements are performed in the nonlinear regime. A brief comparison in the wide-network NTK limit is also included for completeness. These details appear in the revised Results on Hessian and Generalization section. revision: yes

  3. Referee: [Lottery Ticket experiments] Lottery Ticket experiments: the claim that subnetworks retaining 10% of parameters achieve performance within 1.15 pp of the dense model when retrained from the original initialization is load-bearing for the broader narrative of implicit regularization via pruning. The manuscript does not report a random-pruning or magnitude-pruning-at-random-initialization baseline, so it is impossible to determine whether the retained performance is due to the specific mask found by iterative pruning or simply to the original initialization itself.

    Authors: We acknowledge that baselines are required to isolate the contribution of the iteratively discovered mask. We have added two control experiments: random pruning down to 10% of parameters and magnitude pruning performed at random initialization, both followed by retraining from the original initialization. The iteratively pruned masks outperform both baselines by 3–5 percentage points, supporting the claim that the specific mask contributes to the retained performance. These baseline results and the corresponding analysis have been incorporated into the revised Lottery Ticket Hypothesis section. revision: yes
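The mask constructions discussed in the third exchange can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names, the 20% per-round pruning rate, the flat weight vector, and the placeholder training function are assumptions, and the sketch builds masks only without training or evaluating any network.

```python
import random


def prune_smallest(weights, mask, n_prune):
    """Zero the mask entries of the n_prune smallest-magnitude active weights."""
    active = sorted((i for i, m in enumerate(mask) if m), key=lambda i: abs(weights[i]))
    pruned = set(active[:n_prune])
    return [0 if i in pruned else m for i, m in enumerate(mask)]


def iterative_magnitude_mask(init_weights, train_fn, keep_fraction=0.10, prune_fraction=0.20):
    """Alternate train and prune, rewinding to init_weights before every round."""
    n = len(init_weights)
    target = max(1, round(n * keep_fraction))
    mask = [1] * n
    while sum(mask) > target:
        rewound = [w * m for w, m in zip(init_weights, mask)]
        trained = train_fn(rewound, mask)
        n_active = sum(mask)
        # Prune a fixed fraction per round, but never go below the target count.
        n_prune = min(max(1, int(n_active * prune_fraction)), n_active - target)
        mask = prune_smallest(trained, mask, n_prune)
    return mask


def random_mask(n, n_keep, seed=0):
    """Baseline: keep n_keep positions chosen uniformly at random."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n), n_keep))
    return [1 if i in kept else 0 for i in range(n)]


def placeholder_train(weights, mask):
    """Hypothetical stand-in for SGD training; returns the weights unchanged."""
    return weights


rng = random.Random(0)
init = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ticket_mask = iterative_magnitude_mask(init, placeholder_train)
random_baseline = random_mask(len(init), sum(ticket_mask), seed=1)
# Baseline: one-shot magnitude pruning applied directly to the initialization.
init_magnitude_baseline = prune_smallest(init, [1] * len(init), len(init) - sum(ticket_mask))
print(sum(ticket_mask), sum(random_baseline), sum(init_magnitude_baseline))
```

Because the placeholder leaves the weights untouched, the iterative mask here coincides with the magnitude-at-initialization baseline. With a real training function the trained magnitudes differ from the initial ones, and that difference is what the baseline comparison is meant to expose.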

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential claims


The paper reports experimental observations on SGD batch-size effects, Hessian geometry, NTK regime, double descent, and Lottery Ticket Hypothesis using PyTorch on CIFAR-10/MNIST. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing premises appear in the provided text or abstract. All claims are presented as measured correlations (e.g., 11.8x eigenvalue difference, 1.61 pp accuracy gap) without reducing to inputs by construction or importing uniqueness via prior self-work. The study is self-contained against external benchmarks as a set of controlled empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on standard empirical practices in deep learning rather than new theoretical constructs. No free parameters are fitted to produce the reported findings, and no new entities are postulated.

axioms (1)
  • domain assumption: The largest eigenvalue of the Hessian matrix provides a reliable scalar measure of the sharpness of a loss minimum relevant to generalization.
    Invoked when linking Hessian eigenvalues to flat versus sharp minima and to the observed accuracy differences.
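
The domain assumption above can be stated with a standard second-order expansion around a minimum, where the gradient vanishes. The symbols here (loss L, minimizer θ*, perturbation δ, Hessian H) are notation chosen for this note and may not match the paper's own.

```latex
H(\theta^\ast) = \nabla_\theta^2 L(\theta^\ast), \qquad
\lambda_{\max} = \max_{\|v\|_2 = 1} v^\top H(\theta^\ast)\, v

L(\theta^\ast + \delta) \approx L(\theta^\ast) + \tfrac{1}{2}\,\delta^\top H(\theta^\ast)\,\delta
\le L(\theta^\ast) + \tfrac{1}{2}\,\lambda_{\max}\,\|\delta\|_2^2
```

Under this approximation a smaller top eigenvalue caps how much the training loss can rise under a weight perturbation of a given norm, which is the sense in which it measures flatness. Whether that flatness tracks generalization is the assumption itself, not something the expansion establishes.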



