
arxiv: 2606.29043 · v1 · pith:XPUI4QA4new · submitted 2026-06-27 · 💻 cs.LG

How Far Can Sharpness and Complexity Jointly Explain Generalization?

Pith reviewed 2026-06-30 09:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords generalization · sharpness · complexity · deep neural networks · function space · linear regression · Pareto analysis

The pith

Function-oriented sharpness and complexity jointly explain generalization in neural networks more broadly than parameter-level versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the limits of using sharpness and complexity together to account for why deep neural networks generalize from training data to new examples. It applies linear regression and Pareto analysis to measure how much of the observed generalization these two factors can explain when taken jointly. Shifting the definitions of both quantities from raw parameter values to the actual input-output behavior of the network increases the portion of generalization that the pair can cover. This outcome indicates that the two-factor perspective remains useful across varied network architectures and tasks, even though some generalization behavior stays outside its reach.

Core claim

Function-oriented realizations of sharpness and complexity expand the explanatory scope of the two-factor view beyond what is achieved by existing parameter-level metrics, as shown by linear regression and Pareto-based analysis on multiple datasets. The results support the sharpness-complexity perspective as an informative lens for understanding generalization across diverse settings, while the remaining unexplained cases leave open whether this view can serve as a complete theory.

What carries the argument

Pareto-based analysis applied to linear regression models that combine function-oriented sharpness and complexity measures
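The regression half of this machinery can be illustrated with a minimal sketch. The page does not give the paper's exact regression specification, so this assumes ordinary least squares with an intercept, the generalization gap as the target, and R² as the measure of explanatory power. The function names (`fit_two_factor`, `solve_linear_system`) and all numbers are hypothetical, chosen for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_two_factor(sharpness, complexity, gap):
    """OLS fit of gap ~ b0 + b_s * S + b_c * C. Returns (coefficients, R^2)."""
    rows = [[1.0, s, c] for s, c in zip(sharpness, complexity)]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, gap)) for i in range(3)]
    beta = solve_linear_system(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_gap = sum(gap) / len(gap)
    ss_res = sum((y - p) ** 2 for y, p in zip(gap, pred))
    ss_tot = sum((y - mean_gap) ** 2 for y in gap)
    return beta, 1.0 - ss_res / ss_tot


# Made-up values for five hypothetical trained models (not from the paper).
S = [0.1, 0.4, 0.5, 0.9, 1.2]
C = [2.0, 1.0, 3.0, 2.5, 4.0]
# The gap is exactly linear in (S, C) by construction, so the fit is perfect.
GAP = [0.02 + 0.05 * s + 0.01 * c for s, c in zip(S, C)]

beta, r2 = fit_two_factor(S, C, GAP)
print(beta, r2)
```

On real measurements the fit would not be exact. The quantity of interest would be how far R² rises when one metric pair is swapped for another on the same set of trained models.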

If this is right

  • Function-oriented definitions increase the share of generalization cases covered by the sharpness-complexity pair relative to parameter-level definitions.
  • The two-factor perspective supplies an informative account across diverse network settings and tasks.
  • Unexplained cases persist, indicating the view cannot yet serve as a complete account of generalization.
  • The joint analysis framework quantifies explanatory power through regression coefficients and Pareto dominance.
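The Pareto-dominance idea in the last bullet can be sketched in a few lines. This is not the paper's PCR, whose definition is not reproduced on this page. It is a hypothetical disagreement rate, assuming that lower sharpness and lower complexity should both go with a smaller generalization gap: among all pairs where one model Pareto-dominates another in the (S, C) plane, it counts the fraction where the dominating model nevertheless has the larger gap. Model names and values are made up.

```python
from itertools import permutations


def dominates(a, b):
    """True if a is no worse than b in both S and C and strictly better in one."""
    no_worse = a["S"] <= b["S"] and a["C"] <= b["C"]
    strictly_better = a["S"] < b["S"] or a["C"] < b["C"]
    return no_worse and strictly_better


def pareto_disagreement(models):
    """Fraction of Pareto-comparable pairs whose gap ordering contradicts (S, C)."""
    comparable = [(a, b) for a, b in permutations(models, 2) if dominates(a, b)]
    if not comparable:
        return None  # every pair is incomparable, so the rate is undefined
    violations = sum(1 for a, b in comparable if a["gap"] > b["gap"])
    return violations / len(comparable)


# Four hypothetical trained models with illustrative values (not from the paper).
models = [
    {"name": "A", "S": 0.2, "C": 1.0, "gap": 0.03},
    {"name": "B", "S": 0.5, "C": 2.0, "gap": 0.05},
    {"name": "C", "S": 0.3, "C": 3.0, "gap": 0.02},
    {"name": "D", "S": 0.8, "C": 3.5, "gap": 0.09},
]

rate = pareto_disagreement(models)
print(rate)
```

Here B and C are incomparable, since each is better on one axis, so they contribute nothing. Of the five comparable pairs, only (A, C) is a violation. Unlike regression, a check of this kind uses only orderings and assumes no linear relationship.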

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regression-plus-Pareto method could be reapplied after introducing a third factor to test whether unexplained variance shrinks further.
  • One could examine whether the function-oriented measures remain predictive when the training procedure or loss function changes.
  • The approach offers a template for comparing any pair of candidate generalization factors on equal quantitative footing.

Load-bearing premise

Linear regression combined with Pareto analysis on the chosen datasets and measures provides a reliable quantitative assessment of joint explanatory power.

What would settle it

An experiment on the same datasets where parameter-level sharpness and complexity achieve equal or higher joint explanatory power under the same regression and Pareto procedure would falsify the claimed expansion of scope.

Figures

Figures reproduced from arXiv: 2606.29043 by Longxiu Huang, Rongrong Wang, Xitong Zhang, Ziyu Cheng.

Figure 1: Illustration of parameterization ambiguity. The layer-wise rescaling symmetries in positively …
Figure 2: Each point denotes a trained model in the ( …
Figure 3: Illustration of computing PCR on an artificial collection of trained models {A, B, C, D}. Each point represents a trained model in the sharpness–complexity plane and is annotated by its generalization gap. Pareto analysis checks whether the (S, C)-ordering indicated by point locations agrees with the generalization ordering annotated by generalization gaps; the degree of disagreement is quantified by PCR. …
Figure 4: Pareto analysis of the existing baseline metric pair (adapS, path norm), complementing the regression results shown in …
Figure 5: Pareto analysis of the proposed function-oriented metrics (bayesS, func norm) for the upper and middle blocks of …
Figure 6: Pareto analysis of the existing baseline metrics (adapS, path norm) and the proposed metrics (bayesS, func norm) for the lower block (Mixed settings) of …
Figure 7: Pareto analysis comparing: (i) two implementations of the proposed functional complexity metric, func KL (implementation A, stochastic approximation (18)) and func norm (implementation B, deterministic proxy (19)); (ii) the use of isotropic vs. anisotropic posterior; (iii) two sharpness metrics, adapS and bayesS. For posterior-related metrics, the bayesS and func KL, with suffix ‘iso’ is computed using is…
Figure 8: Explanatory power, rated failed/poor/moderate/good/excellent. CIFAR-100, ViT. False: (bayesS, func norm), 0.88, (−, +), 20.8%. True: (bayesS, func norm), 0.89, (+, +), 15.2%.

Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the joint explanatory power of sharpness and complexity largely unexplored. This work studies how far sharpness and complexity can jointly explain generalization. We use linear regression and introduce a Pareto-based analysis to quantitatively evaluate the joint explanatory power of these two factors. Beyond the existing parameter-level definitions, we further propose realizations of sharpness and complexity that are closer to function space and less dependent on raw parameter representations. We find that function-oriented definitions of these two quantities expand the explanatory scope of the two-factor view beyond what is achieved by existing parameter-level metrics. Overall, our results support the sharpness-complexity perspective as an informative lens for understanding generalization across diverse settings. At the same time, the remaining failures indicate that whether this two-factor view can serve as a complete theory of generalization remains open.
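The motivation for measures "less dependent on raw parameter representations" can be illustrated with the layer-wise rescaling symmetry named in the Figure 1 caption. The sketch below assumes a two-layer ReLU network without biases (ReLU is positively homogeneous). The weights and the helper names (`forward`, `rescale`, `squared_norm`) are hypothetical. Scaling the first layer by a positive factor and the second by its inverse leaves the function unchanged while changing a parameter-level quantity such as the squared weight norm.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def forward(w1, w2, x):
    """Two-layer ReLU network without biases: f(x) = w2 . relu(W1 x)."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w1])
    return sum(w * h for w, h in zip(w2, hidden))


def rescale(w1, w2, alpha):
    """Multiply layer 1 by alpha > 0 and layer 2 by 1 / alpha."""
    return [[alpha * w for w in row] for row in w1], [w / alpha for w in w2]


def squared_norm(w1, w2):
    """Squared L2 norm of all weights, a simple parameter-level quantity."""
    return sum(w * w for row in w1 for w in row) + sum(w * w for w in w2)


# Illustrative weights and input (not from the paper).
W1 = [[1.0, -2.0], [0.5, 1.5]]
W2 = [2.0, -1.0]
X = [1.0, 0.25]

W1_scaled, W2_scaled = rescale(W1, W2, alpha=4.0)

print(forward(W1, W2, X), forward(W1_scaled, W2_scaled, X))        # same output
print(squared_norm(W1, W2), squared_norm(W1_scaled, W2_scaled))    # different norms
```

Any measure computed from raw weights can therefore take different values on two parameterizations of the same function. A measure defined on the network's input-output behavior does not face that ambiguity.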

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that sharpness and complexity jointly explain generalization in deep neural networks, and that function-oriented definitions (closer to function space and less dependent on raw parameters) expand the explanatory scope beyond existing parameter-level metrics. It supports this across diverse settings with two tools: linear regression, which quantifies the variance explained in generalization error, and a Pareto-based analysis, which assesses joint coverage. The work concludes that the two-factor view is informative but not necessarily complete, given the remaining unexplained failures.

Significance. If the empirical results hold under scrutiny, the work offers a quantitative two-factor lens for generalization that could guide measure design and highlight when sharpness and complexity suffice versus when other factors dominate. The introduction of Pareto analysis for joint coverage is a methodological strength worth building on. The explicit acknowledgment of remaining failures is a credit to the paper's balanced framing.

major comments (1)
  1. [Methodology paragraph (abstract) and corresponding experimental sections] The central claim that function-oriented definitions expand explanatory scope rests on linear regression (to measure variance explained) and Pareto analysis (to assess joint coverage). The abstract's methodology paragraph provides no robustness details on linearity assumptions, multicollinearity between the two factors, or sensitivity to dataset/measure choice. If relationships are nonlinear or unmodeled factors dominate, the reported expansion relative to parameter-level metrics could be overstated. This is load-bearing for the main result.
minor comments (1)
  1. [Abstract] The abstract packs the methodology and results into dense sentences; splitting the description of the regression and Pareto steps would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive critique of our methodology. The concern regarding robustness is well-taken and we will strengthen the manuscript accordingly.

  1. Referee: [Methodology paragraph (abstract) and corresponding experimental sections] The central claim that function-oriented definitions expand explanatory scope rests on linear regression (to measure variance explained) and Pareto analysis (to assess joint coverage). The abstract's methodology paragraph provides no robustness details on linearity assumptions, multicollinearity between the two factors, or sensitivity to dataset/measure choice. If relationships are nonlinear or unmodeled factors dominate, the reported expansion relative to parameter-level metrics could be overstated. This is load-bearing for the main result.

    Authors: We agree that explicit robustness details strengthen the central claim. The abstract is space-constrained, but the full experimental sections already span multiple datasets, architectures, and measure variants to probe sensitivity. In revision we will add: (i) variance inflation factor (VIF) diagnostics for all reported regressions to quantify multicollinearity between sharpness and complexity; (ii) side-by-side results from nonlinear models (e.g., random-forest regression and kernel ridge) to test whether the reported R² gains persist; and (iii) a concise sensitivity table summarizing how the function-space advantage varies across the main experimental axes. These additions will appear in the methodology and results sections and will be referenced from the abstract if space permits. If any check materially weakens the expansion claim we will qualify the conclusions. revision: yes
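The VIF diagnostic promised in item (i) has a simple closed form in the two-predictor case. The variance inflation factor of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the others. With only sharpness and complexity in the model, that R² is the squared Pearson correlation between them, so both predictors share one VIF. The sketch below uses made-up values, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF shared by both predictors in a two-predictor regression."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)


# Illustrative sharpness and complexity values (not from the paper).
sharpness = [1.0, 2.0, 3.0, 4.0]
complexity = [2.0, 1.0, 4.0, 3.0]

vif = vif_two_predictors(sharpness, complexity)
print(pearson(sharpness, complexity), vif)
```

A VIF near 1 indicates nearly uncorrelated predictors. A large value means the individual regression coefficients are poorly determined, even when the joint R² is stable.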

Circularity Check

0 steps flagged

No circularity: empirical regression and Pareto analysis are external to the proposed measures


The paper evaluates joint explanatory power via linear regression (variance explained in generalization error) and Pareto analysis applied to observed sharpness, complexity, and error values across datasets. These statistical procedures operate on independently computed quantities and do not reduce any reported result to a fitted parameter or self-referential definition by construction. Function-oriented realizations are introduced as new proposals and then tested against parameter-level baselines without invoking self-citations, uniqueness theorems, or ansatzes that presuppose the target conclusion. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full text would be needed to audit these.


