A Rigorous, Tractable Measure of Model Complexity

Oskar Allerbo; Thomas B. Sch\"on

arxiv: 2605.21167 · v1 · pith:AA5KJ5W7new · submitted 2026-05-20 · 📊 stat.ML · cs.LG

A Rigorous, Tractable Measure of Model Complexity

Oskar Allerbo , Thomas B. Sch\"on This is my paper

Pith reviewed 2026-05-21 01:47 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords model complexitygradient similaritydouble descentmodel selectionparametric modelskernel methodsrandom forestsgeneralization

0 comments

The pith

A complexity measure based on similarities between model gradients across inputs generalizes specific measures like polynomial degree and number of trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a measure of model complexity defined through the similarities of a model's gradients evaluated at different input points. This definition is mathematically rigorous and remains computationally feasible for both parametric models and kernel-based nonparametric ones. It is proven to recover established complexity notions in standard settings, including the degree for polynomials, the length scale for Matérn kernels, the neighbor count for k-nearest neighbors, the split count for decision trees, and the tree count for random forests. The same measure is applied to derive fresh observations about the double descent behavior observed when training random Fourier features, random forests, neural networks, and gradient boosting.

Core claim

Defining complexity via the degree of similarity among model gradients at different inputs produces a single, well-defined quantity that applies uniformly across model classes and exactly reproduces the usual complexity parameters in each special case.

What carries the argument

Similarity between model gradients across inputs, which quantifies how much the model's local behavior varies with small changes in the input.

If this is right

The measure supplies a common yardstick for comparing complexity across entirely different model families.
It yields a concrete way to track how complexity evolves during training and thereby clarifies the double descent curve.
Because the measure is defined for any differentiable model, it extends naturally to neural networks and gradient boosting without new heuristics.
It offers a tractable penalty term that could be inserted directly into model-selection criteria.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gradient-similarity view may allow complexity to be monitored continuously during training rather than only at the end.
It could be used to design regularization that explicitly penalizes low gradient similarity.
The same construction might connect model complexity to geometric properties of the learned function without requiring separate analysis for each architecture.

Load-bearing premise

That similarities between model gradients across inputs provide a faithful and general definition of complexity that is meaningful for both parametric and kernel-based models.

What would settle it

A direct computation showing that the proposed measure fails to increase with polynomial degree in polynomial regression, or fails to decrease with increasing kernel length scale in a Matérn kernel model.

Figures

Figures reproduced from arXiv: 2605.21167 by Oskar Allerbo, Thomas B. Sch\"on.

**Figure 2.** Figure 2: Inferred functions of Gaussian kernel ridge regression for two different length scales with different regulariza [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Different complexity measures for k-nearest neighbors, decision trees, and random forests, for [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Revisiting the double descent experiments of Belkin et al. (2019) for four different models on two different [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: The same setting as in Figure 1 for two additional complexity measures, including the first and third quartile [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: The same setting as in Figure 3 for vNE, including the first and third quartile over the 100 realizations (dotted [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: The same setting as in Figure 4, including the first and ninth decile over 10 (or 100) realizations of the [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Analog of the second row of Figure 4 with bootstrap resampling enabled. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

An accurate assessment of a model's complexity is crucial for topics such as interpretation, generalization, and model selection. However, most existing complexity measures either rely on heuristic assumptions or are computationally prohibitive. In this paper, we present a mathematically rigorous yet easy-to-compute measure of model complexity that is based on the similarities between the model gradients across inputs. It is thus well-defined for any parametric model, but also for kernel-based non-parametric models. We prove that our measure of complexity generalizes model-specific complexity measures such as polynomial degree (for polynomial regression), kernel length scale (for Mat\'ern kernels), number of neighbors (for k-nearest neighbors), number of splits (for decision trees), and number of trees (for random forests). We also use our measure to obtain new insights into the double descent phenomenon for random Fourier features, random forests, neural networks, and gradient boosting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The gradient similarity measure unifies complexity across model classes with proofs for several cases, but its application to non-differentiable models like trees and kNN probably depends on smoothing choices that need verification.

read the letter

The main thing to know is that this work defines model complexity using similarities between gradients at different inputs and proves it recovers things like polynomial degree, kernel length scale, number of neighbors, number of splits, and number of trees. It then uses the measure to analyze double descent in several model classes. The unification across types is what stands out. For parametric and kernel models the construction is direct and the matching proofs seem reasonable. That part gives the measure a solid foundation and makes it more than just another heuristic. The softer spot is the non-differentiable models. Decision trees and k-nearest neighbors have gradients that are zero almost everywhere. To compute a gradient similarity the authors must introduce smoothing or a surrogate. The key question is whether the complexity value they get is independent of the smoothing choice and exactly equals the discrete parameter. If the proofs only hold for a particular smoothing, then the generalization is not as clean as stated. The abstract leaves this detail open, so the full proofs need to be examined for how they handle the limit or approximation. The double descent insights for random Fourier features, random forests, neural networks, and gradient boosting are a practical application. If the measure correlates with the observed behavior, it adds value. This paper is for readers interested in theoretical aspects of model selection and generalization. Someone looking for a general complexity score that works for many model families would find it relevant. It shows clear thinking on the differentiable cases. I recommend sending it to peer review. The core idea and the proofs for the main cases are worth a referee's attention, even if the non-smooth extension requires some clarification.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a new measure of model complexity defined via similarities between model gradients at different input points. It asserts that this measure is rigorous, tractable to compute, applies to parametric and non-parametric kernel models, and exactly generalizes several standard complexity measures: polynomial degree in regression, length scale in Matérn kernels, number of neighbors in kNN, number of splits in decision trees, and number of trees in random forests. The measure is further used to analyze the double descent curve in random Fourier features, random forests, neural networks, and gradient boosting.

Significance. Should the central claims hold, particularly the exact generalization to discrete parameters in non-differentiable models, the work would offer a valuable unified framework for quantifying complexity. This could impact model selection, interpretation, and theoretical analysis of generalization, including explanations for double descent phenomena. The tractability is a notable strength if the computation avoids expensive operations.

major comments (2)

Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.
Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The two major comments highlight important points about rigor for non-differentiable models. We address each below and indicate the revisions we will make.

read point-by-point responses

Referee: Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.

Authors: We agree this is a central issue for the claimed generality. The manuscript introduces a smoothed surrogate gradient for piecewise-constant models (via a small Gaussian perturbation whose bandwidth is taken to zero after the similarity computation). In the supplementary proofs we show that the resulting complexity value converges exactly to the number of neighbors or splits. To satisfy the referee's request we will add an explicit definition of the surrogate in the main text together with a short theorem establishing that the recovered integer is independent of the particular smoothing kernel (under standard Lipschitz and compact-support assumptions on the kernel). revision: yes
Referee: Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.

Authors: The existing derivations already obtain the exact parameter value in the zero-bandwidth limit. We will revise the relevant sections to state this limit explicitly, to name the auxiliary smoothing parameter, and to include a brief argument (or reference to a standard result) that the final integer complexity is invariant to the choice of smoothing kernel. This change will be marked as a clarification rather than a new technical result. revision: partial

Circularity Check

0 steps flagged

No circularity: new gradient-similarity measure with independent proofs for special cases

full rationale

The paper defines a complexity measure from similarities of model gradients across inputs. It then proves this recovers polynomial degree, Matérn length scale, kNN neighbors, tree splits, and forest tree count as special cases. No quoted equations or self-citations reduce the central definition or proofs to tautology or fitted inputs by construction. Generalization claims rest on explicit derivations for each model class rather than renaming or self-referential fitting. The approach is self-contained against external benchmarks for the listed model families.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5677 in / 941 out tokens · 23690 ms · 2026-05-21T01:47:59.988773+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

K(ˆf) := 1−E[ (φ(x)ᵀφ(x′) / (‖φ(x)‖‖φ(x′)‖))² ] … equals the normalized linear entropy LE(¯K) of the normalized NTK/kernel matrix
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that our measure … generalizes … polynomial degree … kernel length scale … number of neighbors … number of splits … number of trees

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 3 internal anchors

[1]

arXiv preprint arXiv:2505.11006 , year=

Supervised Models Can Generalize Also When Trained on Random Label , author=. arXiv preprint arXiv:2505.11006 , year=

work page arXiv
[2]

Proceedings of the National Academy of Sciences , volume=

Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

work page 2019
[3]

Journal of the American Statistical Association , volume=

On measuring and correcting the effects of data mining and model selection , author=. Journal of the American Statistical Association , volume=. 1998 , publisher=

work page 1998
[4]

Journal of the American statistical Association , volume=

How biased is the apparent error rate of a prediction rule? , author=. Journal of the American statistical Association , volume=. 1986 , publisher=

work page 1986
[5]

Journal of the American Statistical Association , volume=

The estimation of prediction error: covariance penalties and cross-validation , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=

work page 2004
[6]

Degrees of Freedom in Deep Neural Networks

Degrees of freedom in deep neural networks , author=. arXiv preprint arXiv:1603.09260 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2602.13442 , year=

Measuring Neural Network Complexity via Effective Degrees of Freedom , author=. arXiv preprint arXiv:2602.13442 , year=

work page arXiv
[8]

Journal of the American Statistical Association , year=

From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation , author=. Journal of the American Statistical Association , year=

work page
[9]

arXiv preprint arXiv:2106.15682 , year=

Predictive model degrees of freedom in linear regression , author=. arXiv preprint arXiv:2106.15682 , year=

work page arXiv
[10]

Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. A

work page
[11]

arXiv preprint arXiv:2410.01259 , year=

Revisiting optimism and model complexity in the wake of overparameterized machine learning , author=. arXiv preprint arXiv:2410.01259 , year=

work page arXiv
[12]

Why do random forests work?

Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. Why do random forests work?

work page
[13]

Biometrika , volume=

Effective degrees of freedom: a flawed metaphor , author=. Biometrika , volume=. 2015 , publisher=

work page 2015
[14]

Advances in Neural Information Processing Systems , volume=

Deep learning through a telescoping lens: A simple model provides empirical insights on grokking, gradient boosting & beyond , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

Rademacher and

Bartlett, Peter L and Mendelson, Shahar , journal=. Rademacher and

work page
[16]

Bartlett, Peter L and Bousquet, Olivier and Mendelson, Shahar , journal=. Local. 2005 , publisher=

work page 2005
[17]

International conference on machine learning , pages=

Rademacher complexity for adversarially robust generalization , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Rademacher Complexity for Distributionally Robust Learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[19]

Learning kernels using local

Cortes, Corinna and Kloft, Marius and Mohri, Mehryar , journal=. Learning kernels using local

work page
[20]

Early Stopping Strategy using Neural Tangent Kernel Theory and

Xavier, Daniel Martin and Chamoin, Ludovic and Fribourg, Laurent , booktitle=. Early Stopping Strategy using Neural Tangent Kernel Theory and. 2025 , organization=

work page 2025
[21]

Truong, Lan V , journal=. On

work page
[22]

Adversarial

Xiao, Jiancong and Fan, Yanbo and Sun, Ruoyu and Luo, Zhi-Quan , journal=. Adversarial

work page
[23]

Risk bounds and

Duan, Yaqi and Jin, Chi and Li, Zhiyuan , booktitle=. Risk bounds and. 2021 , organization=

work page 2021
[24]

Selective

K. Selective. Journal of Machine Learning Research , volume=

work page
[25]

A support vector machine with a hybrid kernel and minimal

Tan, Ying and Wang, Jun , journal=. A support vector machine with a hybrid kernel and minimal. 2004 , publisher=

work page 2004
[26]

Xie, Yaqi and Ma, Will and Xin, Linwei , journal=

work page
[27]

Sepliarskaia, Anna and Langer, Sophie and Schmidt-Hieber, Johannes , journal=. On the

work page
[28]

2013 , publisher=

The nature of statistical learning theory , author=. 2013 , publisher=

work page 2013
[29]

Do we really need the

Bartl, Daniel and Mendelson, Shahar , journal=. Do we really need the

work page
[30]

Nearly optimal

Yang, Yahong and Yang, Haizhao and Xiang, Yang , journal=. Nearly optimal

work page
[31]

Robust subgaussian estimation with

Depersin, Jules , booktitle=. Robust subgaussian estimation with. 2024 , organization=

work page 2024
[32]

Calculating the

Aslan, Ozlem and Yildiz, Olcay Taner and Alpaydin, Ethem , booktitle=. Calculating the. 2009 , organization=

work page 2009
[33]

Advances in neural information processing systems , volume=

On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=

work page
[34]

IEEE transactions on neural networks and learning systems , volume=

On the complexity of neural network classifiers: A comparison between shallow and deep architectures , author=. IEEE transactions on neural networks and learning systems , volume=. 2014 , publisher=

work page 2014
[35]

international conference on machine learning , pages=

On the expressive power of deep neural networks , author=. international conference on machine learning , pages=. 2017 , organization=

work page 2017
[36]

Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James , booktitle=. Fisher-. 2019 , organization=

work page 2019
[37]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

work page
[38]

Physica D: Nonlinear Phenomena , pages=

The complexity dynamics of grokking , author=. Physica D: Nonlinear Phenomena , pages=. 2025 , publisher=

work page 2025
[39]

Advances in Neural Information Processing Systems , volume=

On the expected complexity of maxout networks , author=. Advances in Neural Information Processing Systems , volume=

work page
[40]

Advances in Neural Information Processing Systems , volume=

Improved bounds on neural complexity for representing piecewise linear functions , author=. Advances in Neural Information Processing Systems , volume=

work page
[41]

International Conference on Machine Learning , pages=

Complexity of linear regions in deep networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

work page 2019
[42]

SIAM Journal on Mathematics of Data Science , volume=

Spectral complexity of deep neural networks , author=. SIAM Journal on Mathematics of Data Science , volume=. 2025 , publisher=

work page 2025
[43]

arXiv e-prints , pages=

Changing the Kernel During Training Leads to Double Descent in Kernel Regression , author=. arXiv e-prints , pages=

work page
[44]

Understanding deep learning requires rethinking generalization

Understanding deep learning requires rethinking generalization , author =. arXiv preprint arXiv:1611.03530 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Advances in neural information processing systems , volume=

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate , author=. Advances in neural information processing systems , volume=

work page
[47]

Proceedings of the National Academy of Sciences , volume=

Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020
[48]

Annals of statistics , volume=

Surprises in high-dimensional ridgeless least squares interpolation , author=. Annals of statistics , volume=. 2022 , publisher=

work page 2022
[49]

Communications on Pure and Applied Mathematics , volume=

The generalization error of random features regression: Precise asymptotics and the double descent curve , author=. Communications on Pure and Applied Mathematics , volume=. 2022 , publisher=

work page 2022
[50]

Information and Inference: A Journal of the IMA , volume=

A model of double descent for high-dimensional binary linear classification , author=. Information and Inference: A Journal of the IMA , volume=. 2022 , publisher=

work page 2022
[51]

Forty-first International Conference on Machine Learning , year=

No Double Descent in Principal Component Regression: A High-Dimensional Analysis , author=. Forty-first International Conference on Machine Learning , year=

work page
[52]

arXiv preprint arXiv:2409.18842 , year=

Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs , author=. arXiv preprint arXiv:2409.18842 , year=

work page arXiv
[53]

International Conference on Artificial Intelligence and Statistics , pages=

Kernel regression in high dimensions: Refined analysis beyond double descent , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

work page 2021
[54]

arXiv preprint arXiv:1912.06190 , year=

Double descent in the condition number , author=. arXiv preprint arXiv:1912.06190 , year=

work page arXiv 1912
[55]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Scaling description of generalization with number of parameters in deep learning , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2020 , publisher=

work page 2020
[56]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Deep double descent: Where bigger models and more data hurt , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

work page 2021
[57]

International Conference on Machine Learning , pages=

On the generalization power of overfitted two-layer neural tangent kernel models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[58]

SIAM Journal on Mathematics of Data Science , volume=

High-dimensional analysis of double descent for linear regression with random projections , author=. SIAM Journal on Mathematics of Data Science , volume=. 2024 , publisher=

work page 2024
[59]

Advances in Neural Information Processing Systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in Neural Information Processing Systems , volume=

work page
[60]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998
[61]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[62]

Advances in neural information processing systems , volume=

A simple weight decay can improve generalization , author=. Advances in neural information processing systems , volume=

work page

[1] [1]

arXiv preprint arXiv:2505.11006 , year=

Supervised Models Can Generalize Also When Trained on Random Label , author=. arXiv preprint arXiv:2505.11006 , year=

work page arXiv

[2] [2]

Proceedings of the National Academy of Sciences , volume=

Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

work page 2019

[3] [3]

Journal of the American Statistical Association , volume=

On measuring and correcting the effects of data mining and model selection , author=. Journal of the American Statistical Association , volume=. 1998 , publisher=

work page 1998

[4] [4]

Journal of the American statistical Association , volume=

How biased is the apparent error rate of a prediction rule? , author=. Journal of the American statistical Association , volume=. 1986 , publisher=

work page 1986

[5] [5]

Journal of the American Statistical Association , volume=

The estimation of prediction error: covariance penalties and cross-validation , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=

work page 2004

[6] [6]

Degrees of Freedom in Deep Neural Networks

Degrees of freedom in deep neural networks , author=. arXiv preprint arXiv:1603.09260 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

arXiv preprint arXiv:2602.13442 , year=

Measuring Neural Network Complexity via Effective Degrees of Freedom , author=. arXiv preprint arXiv:2602.13442 , year=

work page arXiv

[8] [8]

Journal of the American Statistical Association , year=

From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation , author=. Journal of the American Statistical Association , year=

work page

[9] [9]

arXiv preprint arXiv:2106.15682 , year=

Predictive model degrees of freedom in linear regression , author=. arXiv preprint arXiv:2106.15682 , year=

work page arXiv

[10] [10]

Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. A

work page

[11] [11]

arXiv preprint arXiv:2410.01259 , year=

Revisiting optimism and model complexity in the wake of overparameterized machine learning , author=. arXiv preprint arXiv:2410.01259 , year=

work page arXiv

[12] [12]

Why do random forests work?

Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. Why do random forests work?

work page

[13] [13]

Biometrika , volume=

Effective degrees of freedom: a flawed metaphor , author=. Biometrika , volume=. 2015 , publisher=

work page 2015

[14] [14]

Advances in Neural Information Processing Systems , volume=

Deep learning through a telescoping lens: A simple model provides empirical insights on grokking, gradient boosting & beyond , author=. Advances in Neural Information Processing Systems , volume=

work page

[15] [15]

Rademacher and

Bartlett, Peter L and Mendelson, Shahar , journal=. Rademacher and

work page

[16] [16]

Bartlett, Peter L and Bousquet, Olivier and Mendelson, Shahar , journal=. Local. 2005 , publisher=

work page 2005

[17] [17]

International conference on machine learning , pages=

Rademacher complexity for adversarially robust generalization , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019

[18] [18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Rademacher Complexity for Distributionally Robust Learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[19] [19]

Learning kernels using local

Cortes, Corinna and Kloft, Marius and Mohri, Mehryar , journal=. Learning kernels using local

work page

[20] [20]

Early Stopping Strategy using Neural Tangent Kernel Theory and

Xavier, Daniel Martin and Chamoin, Ludovic and Fribourg, Laurent , booktitle=. Early Stopping Strategy using Neural Tangent Kernel Theory and. 2025 , organization=

work page 2025

[21] [21]

Truong, Lan V , journal=. On

work page

[22] [22]

Adversarial

Xiao, Jiancong and Fan, Yanbo and Sun, Ruoyu and Luo, Zhi-Quan , journal=. Adversarial

work page

[23] [23]

Risk bounds and

Duan, Yaqi and Jin, Chi and Li, Zhiyuan , booktitle=. Risk bounds and. 2021 , organization=

work page 2021

[24] [24]

Selective

K. Selective. Journal of Machine Learning Research , volume=

work page

[25] [25]

A support vector machine with a hybrid kernel and minimal

Tan, Ying and Wang, Jun , journal=. A support vector machine with a hybrid kernel and minimal. 2004 , publisher=

work page 2004

[26] [26]

Xie, Yaqi and Ma, Will and Xin, Linwei , journal=

work page

[27] [27]

Sepliarskaia, Anna and Langer, Sophie and Schmidt-Hieber, Johannes , journal=. On the

work page

[28] [28]

2013 , publisher=

The nature of statistical learning theory , author=. 2013 , publisher=

work page 2013

[29] [29]

Do we really need the

Bartl, Daniel and Mendelson, Shahar , journal=. Do we really need the

work page

[30] [30]

Nearly optimal

Yang, Yahong and Yang, Haizhao and Xiang, Yang , journal=. Nearly optimal

work page

[31] [31]

Robust subgaussian estimation with

Depersin, Jules , booktitle=. Robust subgaussian estimation with. 2024 , organization=

work page 2024

[32] [32]

Calculating the

Aslan, Ozlem and Yildiz, Olcay Taner and Alpaydin, Ethem , booktitle=. Calculating the. 2009 , organization=

work page 2009

[33] [33]

Advances in neural information processing systems , volume=

On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=

work page

[34] [34]

IEEE transactions on neural networks and learning systems , volume=

On the complexity of neural network classifiers: A comparison between shallow and deep architectures , author=. IEEE transactions on neural networks and learning systems , volume=. 2014 , publisher=

work page 2014

[35] [35]

international conference on machine learning , pages=

On the expressive power of deep neural networks , author=. international conference on machine learning , pages=. 2017 , organization=

work page 2017

[36] [36]

Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James , booktitle=. Fisher-. 2019 , organization=

work page 2019

[37] [37]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

work page

[38] [38]

Physica D: Nonlinear Phenomena , pages=

The complexity dynamics of grokking , author=. Physica D: Nonlinear Phenomena , pages=. 2025 , publisher=

work page 2025

[39] [39]

Advances in Neural Information Processing Systems , volume=

On the expected complexity of maxout networks , author=. Advances in Neural Information Processing Systems , volume=

work page

[40] [40]

Advances in Neural Information Processing Systems , volume=

Improved bounds on neural complexity for representing piecewise linear functions , author=. Advances in Neural Information Processing Systems , volume=

work page

[41] [41]

International Conference on Machine Learning , pages=

Complexity of linear regions in deep networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

work page 2019

[42] [42]

SIAM Journal on Mathematics of Data Science , volume=

Spectral complexity of deep neural networks , author=. SIAM Journal on Mathematics of Data Science , volume=. 2025 , publisher=

work page 2025

[43] [43]

arXiv e-prints , pages=

Changing the Kernel During Training Leads to Double Descent in Kernel Regression , author=. arXiv e-prints , pages=

work page

[44] [44]

Understanding deep learning requires rethinking generalization

Understanding deep learning requires rethinking generalization , author =. arXiv preprint arXiv:1611.03530 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

Advances in neural information processing systems , volume=

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate , author=. Advances in neural information processing systems , volume=

work page

[47] [47]

Proceedings of the National Academy of Sciences , volume=

Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020

[48] [48]

Annals of statistics , volume=

Surprises in high-dimensional ridgeless least squares interpolation , author=. Annals of statistics , volume=. 2022 , publisher=

work page 2022

[49] [49]

Communications on Pure and Applied Mathematics , volume=

The generalization error of random features regression: Precise asymptotics and the double descent curve , author=. Communications on Pure and Applied Mathematics , volume=. 2022 , publisher=

work page 2022

[50] [50]

Information and Inference: A Journal of the IMA , volume=

A model of double descent for high-dimensional binary linear classification , author=. Information and Inference: A Journal of the IMA , volume=. 2022 , publisher=

work page 2022

[51] [51]

Forty-first International Conference on Machine Learning , year=

No Double Descent in Principal Component Regression: A High-Dimensional Analysis , author=. Forty-first International Conference on Machine Learning , year=

work page

[52] [52]

arXiv preprint arXiv:2409.18842 , year=

Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs , author=. arXiv preprint arXiv:2409.18842 , year=

work page arXiv

[53] [53]

International Conference on Artificial Intelligence and Statistics , pages=

Kernel regression in high dimensions: Refined analysis beyond double descent , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

work page 2021

[54] [54]

arXiv preprint arXiv:1912.06190 , year=

Double descent in the condition number , author=. arXiv preprint arXiv:1912.06190 , year=

work page arXiv 1912

[55] [55]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Scaling description of generalization with number of parameters in deep learning , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2020 , publisher=

work page 2020

[56] [56]

Journal of Statistical Mechanics: Theory and Experiment , volume=

Deep double descent: Where bigger models and more data hurt , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

work page 2021

[57] [57]

International Conference on Machine Learning , pages=

On the generalization power of overfitted two-layer neural tangent kernel models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[58] [58]

SIAM Journal on Mathematics of Data Science , volume=

High-dimensional analysis of double descent for linear regression with random projections , author=. SIAM Journal on Mathematics of Data Science , volume=. 2024 , publisher=

work page 2024

[59] [59]

Advances in Neural Information Processing Systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in Neural Information Processing Systems , volume=

work page

[60] [60]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998

[61] [61]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009

[62] [62]

Advances in neural information processing systems , volume=

A simple weight decay can improve generalization , author=. Advances in neural information processing systems , volume=

work page