How Far Can Sharpness and Complexity Jointly Explain Generalization?

Longxiu Huang; Rongrong Wang; Xitong Zhang; Ziyu Cheng

arxiv: 2606.29043 · v1 · pith:XPUI4QA4new · submitted 2026-06-27 · 💻 cs.LG

Pith reviewed 2026-06-30 09:23 UTC · model grok-4.3

classification 💻 cs.LG

keywords: generalization, sharpness, complexity, deep neural networks, function space, linear regression, Pareto analysis

The pith

Function-oriented sharpness and complexity jointly explain generalization in neural networks more broadly than parameter-level versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the limits of using sharpness and complexity together to account for why deep neural networks generalize from training data to new examples. It applies linear regression and Pareto analysis to measure how much of the observed generalization these two factors can explain when taken jointly. Shifting the definitions of both quantities from raw parameter values to the actual input-output behavior of the network increases the portion of generalization that the pair can cover. This outcome indicates that the two-factor perspective remains useful across varied network architectures and tasks, even though some generalization behavior stays outside its reach.

Core claim

Function-oriented realizations of sharpness and complexity expand the explanatory scope of the two-factor view beyond what is achieved by existing parameter-level metrics, as shown by linear regression and Pareto-based analysis on multiple datasets. The results support the sharpness-complexity perspective as an informative lens for understanding generalization across diverse settings, while the remaining unexplained cases leave open whether this view can serve as a complete theory.

What carries the argument

Linear regression and a Pareto-based analysis, both applied to pairs of sharpness and complexity measures, including the proposed function-oriented ones.
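
As a rough illustration of the regression half of this machinery, the sketch below fits the generalization gap on a (sharpness, complexity) pair by ordinary least squares and reports R². The helper name `joint_r2`, the synthetic data, and the use of plain R² as the score are assumptions made for illustration; the paper's exact regression specification is not given on this page.

```python
import numpy as np

def joint_r2(sharpness, complexity, gap):
    """Fit gap ~ b0 + bS*sharpness + bC*complexity by least squares.

    Returns (coefficients, R^2). Hypothetical helper, not the paper's code.
    """
    s = np.asarray(sharpness, dtype=float)
    c = np.asarray(complexity, dtype=float)
    g = np.asarray(gap, dtype=float)
    X = np.column_stack([np.ones_like(s), s, c])  # intercept + two factors
    beta, *_ = np.linalg.lstsq(X, g, rcond=None)
    resid = g - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((g - g.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-in for a collection of trained models (made-up numbers).
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=50)
C = rng.uniform(0.0, 1.0, size=50)
gap = 0.1 + 0.5 * S + 0.3 * C + 0.01 * rng.normal(size=50)

beta, r2 = joint_r2(S, C, gap)
print("coefficients:", beta, "R^2:", r2)
```

Running the same function on a parameter-level pair and on a function-oriented pair, over the same set of models, would give two R² values that can be compared directly.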

If this is right

Function-oriented definitions increase the share of generalization cases covered by the sharpness-complexity pair relative to parameter-level definitions.
The two-factor perspective supplies an informative account across diverse network settings and tasks.
Unexplained cases persist, indicating the view cannot yet serve as a complete account of generalization.
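The joint analysis framework quantifies explanatory power through linear-regression fit and through agreement between the Pareto ordering in the sharpness-complexity plane and the ordering of generalization gaps.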

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regression-plus-Pareto method could be reapplied after introducing a third factor to test whether unexplained variance shrinks further.
One could examine whether the function-oriented measures remain predictive when the training procedure or loss function changes.
The approach offers a template for comparing any pair of candidate generalization factors on equal quantitative footing.

Load-bearing premise

Linear regression combined with Pareto analysis on the chosen datasets and measures provides a reliable quantitative assessment of joint explanatory power.

What would settle it

An experiment on the same datasets where parameter-level sharpness and complexity achieve equal or higher joint explanatory power under the same regression and Pareto procedure would falsify the claimed expansion of scope.

Figures

Figures reproduced from arXiv: 2606.29043 by Longxiu Huang, Rongrong Wang, Xitong Zhang, Ziyu Cheng.

**Figure 2.** Each point denotes a trained model in the ( …

**Figure 3.** Illustration of computing PCR on an artificial collection of trained models {A, B, C, D}. Each point represents a trained model in the sharpness–complexity plane and is annotated by its generalization gap. Pareto analysis checks whether the (S, C)-ordering indicated by point locations agrees with the generalization ordering annotated by generalization gaps; the degree of disagreement is quantified by PCR. …

**Figure 4.** Pareto analysis of the existing baseline metric pair (adapS, path norm), complementing the regression results shown in …

**Figure 5.** Pareto analysis of the proposed function-oriented metrics (bayesS, func norm) for the upper and middle block of …

**Figure 6.** Pareto analysis of the existing baseline metrics (adapS, path norm) and the proposed metrics (bayesS, func norm) for the lower block (Mixed settings) of …

**Figure 7.** Pareto analysis comparing: (i) two implementations of the proposed functional complexity metric, func KL (implementation A, stochastic approximation (18)) and func norm (implementation B, deterministic proxy (19)); (ii) the use of isotropic vs. anisotropic posterior; (iii) two sharpness metrics, adapS and bayesS. For posterior-related metrics, the bayesS and func KL, with suffix ‘iso’ is computed using is…

**Figure 8.** Explanatory power, rated failed / poor / moderate / good / excellent, for CIFAR-100 with ViT. Extracted panel entries: False: (bayesS, func norm), 0.88, (−, +), 20.8%; True: (bayesS, func norm), 0.89, (+, +), 15.2%.
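
The Figure 3 caption describes checking whether the (S, C)-ordering of trained models agrees with the ordering of their generalization gaps. A minimal sketch of one way to run such a check follows. The definition used here (the fraction of Pareto-comparable pairs whose gap ordering contradicts dominance) and the name `pareto_conflict_rate` are assumptions for illustration; this page does not give the paper's definition of PCR, which may differ.

```python
from itertools import permutations

def pareto_conflict_rate(points):
    """points: list of (sharpness, complexity, gap) tuples, one per model.

    Model i dominates model j if it is no larger on both axes and strictly
    smaller on at least one. A dominated pair counts as a conflict when the
    dominating model nevertheless has the larger generalization gap.
    Assumed definition for illustration; the paper's PCR may differ.
    """
    comparable = 0
    conflicts = 0
    for (s_i, c_i, g_i), (s_j, c_j, g_j) in permutations(points, 2):
        dominates = s_i <= s_j and c_i <= c_j and (s_i < s_j or c_i < c_j)
        if dominates:
            comparable += 1
            if g_i > g_j:
                conflicts += 1
    return conflicts / comparable if comparable else 0.0

# Four made-up models, loosely echoing the {A, B, C, D} toy collection.
models = {
    "A": (0.1, 0.2, 0.05),
    "B": (0.3, 0.4, 0.10),
    "C": (0.2, 0.6, 0.04),
    "D": (0.5, 0.7, 0.20),
}
rate = pareto_conflict_rate(list(models.values()))
print(f"conflict rate: {rate:.2f}")
```

In this toy collection five pairs are Pareto-comparable (B and C are not, since each is better on one axis) and one of them, A versus C, conflicts, so the rate is 0.20.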

Original abstract

Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the joint explanatory power of sharpness and complexity largely unexplored. This work studies how far sharpness and complexity can jointly explain generalization. We use linear regression and introduce a Pareto-based analysis to quantitatively evaluate the joint explanatory power of these two factors. Beyond the existing parameter-level definitions, we further propose realizations of sharpness and complexity that are closer to function space and less dependent on raw parameter representations. We find that function-oriented definitions of these two quantities expand the explanatory scope of the two-factor view beyond what is achieved by existing parameter-level metrics. Overall, our results support the sharpness-complexity perspective as an informative lens for understanding generalization across diverse settings. At the same time, the remaining failures indicate that whether this two-factor view can serve as a complete theory of generalization remains open.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Function-space versions of sharpness and complexity improve joint regression fit to generalization error over parameter versions, but the linear model and Pareto setup leave the strength of that improvement unclear.

The paper's main point is that moving sharpness and complexity definitions closer to function space, instead of staying with raw parameters, increases how much the two factors together account for generalization error. They back this with linear regression on the variance explained and a Pareto analysis to check joint coverage across settings.

The new pieces are the joint quantitative evaluation itself and the function-oriented reformulations. Prior work mostly looked at one measure at a time; combining them this way and testing whether the function versions expand the reach is a direct extension. The abstract shows they get better explanatory numbers with the new definitions, which is the concrete advance.

The soft spot is the methodology. Linear regression assumes additive linear effects and limited multicollinearity between the two factors. The stress-test note flags exactly this, and the abstract gives no robustness checks, error bars, or tests for nonlinearity. If other unmodeled factors dominate or the relationships curve, the reported gain in scope could shrink. Without the full regression specs or dataset details, it's hard to judge how much the numbers actually move the needle.
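
One simple way to probe the linearity concern raised above is to compare the additive linear fit with a fit that also includes squared and interaction terms, using adjusted R² so the extra terms are penalized. The sketch below does this on synthetic data with a deliberately multiplicative effect. All names and numbers are hypothetical, and nothing on this page indicates that the paper ran this particular check.

```python
import numpy as np

def r2_adjusted(X, y):
    """Adjusted R^2 of an OLS fit of y on X (X already includes an intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def linear_vs_quadratic(s, c, gap):
    """Return adjusted R^2 for the additive model and the quadratic model."""
    ones = np.ones_like(s)
    X_lin = np.column_stack([ones, s, c])
    X_quad = np.column_stack([ones, s, c, s * s, c * c, s * c])
    return r2_adjusted(X_lin, gap), r2_adjusted(X_quad, gap)

# Synthetic data where sharpness and complexity interact multiplicatively.
rng = np.random.default_rng(1)
s = rng.uniform(0.0, 1.0, 200)
c = rng.uniform(0.0, 1.0, 200)
gap = 0.2 * s + 1.0 * s * c + 0.02 * rng.normal(size=200)

adj_lin, adj_quad = linear_vs_quadratic(s, c, gap)
print(f"adjusted R^2, linear: {adj_lin:.3f}; with quadratic terms: {adj_quad:.3f}")
```

A large gap between the two scores would suggest that the additive linear model understates what the two factors jointly explain; similar scores would suggest the linear specification is adequate for that set of models.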

This is for people already working on generalization diagnostics in deep nets who want to see whether the sharpness-complexity lens can be tightened empirically. It does not claim a full theory and explicitly leaves that question open.

Send it to peer review. The joint angle and the function-space shift are worth referee time even if the regression assumptions need more scrutiny.

Referee Report

1 major / 1 minor

Summary. The paper claims that sharpness and complexity jointly explain generalization in deep neural networks, with function-oriented definitions (closer to function space and less dependent on raw parameters) expanding the explanatory scope beyond existing parameter-level metrics. It supports this via linear regression to quantify variance explained in generalization error combined with a Pareto-based analysis to assess joint coverage, across diverse settings. The work concludes that the two-factor view is informative but not necessarily complete, given remaining unexplained failures.

Significance. If the empirical results hold under scrutiny, the work offers a quantitative two-factor lens for generalization that could guide measure design and highlight when sharpness and complexity suffice versus when other factors dominate. The introduction of Pareto analysis for joint coverage is a methodological strength worth building on. The explicit acknowledgment of remaining failures is a credit to the paper's balanced framing.

major comments (1)

[Methodology paragraph (abstract) and corresponding experimental sections] The central claim that function-oriented definitions expand explanatory scope rests on linear regression (to measure variance explained) and Pareto analysis (to assess joint coverage). The abstract's methodology paragraph provides no robustness details on linearity assumptions, multicollinearity between the two factors, or sensitivity to dataset/measure choice. If relationships are nonlinear or unmodeled factors dominate, the reported expansion relative to parameter-level metrics could be overstated. This is load-bearing for the main result.

minor comments (1)

[Abstract] The abstract packs the methodology and results into dense sentences; splitting the description of the regression and Pareto steps would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive critique of our methodology. The concern regarding robustness is well-taken and we will strengthen the manuscript accordingly.

Referee: [Methodology paragraph (abstract) and corresponding experimental sections] The central claim that function-oriented definitions expand explanatory scope rests on linear regression (to measure variance explained) and Pareto analysis (to assess joint coverage). The abstract's methodology paragraph provides no robustness details on linearity assumptions, multicollinearity between the two factors, or sensitivity to dataset/measure choice. If relationships are nonlinear or unmodeled factors dominate, the reported expansion relative to parameter-level metrics could be overstated. This is load-bearing for the main result.

Authors: We agree that explicit robustness details strengthen the central claim. The abstract is space-constrained, but the full experimental sections already span multiple datasets, architectures, and measure variants to probe sensitivity. In revision we will add: (i) variance inflation factor (VIF) diagnostics for all reported regressions to quantify multicollinearity between sharpness and complexity; (ii) side-by-side results from nonlinear models (e.g., random-forest regression and kernel ridge) to test whether the reported R² gains persist; and (iii) a concise sensitivity table summarizing how the function-space advantage varies across the main experimental axes. These additions will appear in the methodology and results sections and will be referenced from the abstract if space permits. If any check materially weakens the expansion claim we will qualify the conclusions.

Revision: yes
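
The VIF diagnostic promised in item (i) can be sketched in a few lines. For each predictor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing that predictor on the remaining ones. The function name `vif` and the synthetic, deliberately correlated factors below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Made-up factors: complexity is constructed to correlate with sharpness.
rng = np.random.default_rng(2)
sharpness = rng.normal(size=100)
complexity = 0.8 * sharpness + 0.6 * rng.normal(size=100)
factors = np.column_stack([sharpness, complexity])
print("VIF:", vif(factors))
```

With only two predictors both VIF values coincide and equal 1 / (1 − r²), where r is the Pearson correlation between sharpness and complexity, so the diagnostic reduces to reporting that correlation.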

Circularity Check

0 steps flagged

No circularity: empirical regression and Pareto analysis are external to the proposed measures

The paper evaluates joint explanatory power via linear regression (variance explained in generalization error) and Pareto analysis applied to observed sharpness, complexity, and error values across datasets. These statistical procedures operate on independently computed quantities and do not reduce any reported result to a fitted parameter or self-referential definition by construction. Function-oriented realizations are introduced as new proposals and then tested against parameter-level baselines without invoking self-citations, uniqueness theorems, or ansatzes that presuppose the target conclusion. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full text would be needed to audit these.
