A Rigorous, Tractable Measure of Model Complexity
Pith reviewed 2026-05-21 01:47 UTC · model grok-4.3
The pith
A complexity measure based on similarities between model gradients across inputs generalizes specific measures like polynomial degree and number of trees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Defining complexity via the degree of similarity among model gradients at different inputs produces a single, well-defined quantity that applies uniformly across model classes and exactly reproduces the usual complexity parameters in each special case.
What carries the argument
Similarity between model gradients across inputs, which quantifies how much the model's local behavior varies with small changes in the input.
If this is right
- The measure supplies a common yardstick for comparing complexity across entirely different model families.
- It yields a concrete way to track how complexity evolves during training and thereby clarifies the double descent curve.
- Because the measure is defined for any differentiable model, it extends naturally to neural networks and gradient boosting without new heuristics.
- It offers a tractable penalty term that could be inserted directly into model-selection criteria.
Where Pith is reading between the lines
- The gradient-similarity view may allow complexity to be monitored continuously during training rather than only at the end.
- It could be used to design regularization that explicitly penalizes low gradient similarity.
- The same construction might connect model complexity to geometric properties of the learned function without requiring separate analysis for each architecture.
Load-bearing premise
That similarities between model gradients across inputs provide a faithful and general definition of complexity that is meaningful for both parametric and kernel-based models.
What would settle it
A direct computation showing that the proposed measure fails to increase with polynomial degree in polynomial regression, or fails to decrease with increasing kernel length scale in a Matérn kernel model.
Figures
read the original abstract
An accurate assessment of a model's complexity is crucial for topics such as interpretation, generalization, and model selection. However, most existing complexity measures either rely on heuristic assumptions or are computationally prohibitive. In this paper, we present a mathematically rigorous yet easy-to-compute measure of model complexity that is based on the similarities between the model gradients across inputs. It is thus well-defined for any parametric model, but also for kernel-based non-parametric models. We prove that our measure of complexity generalizes model-specific complexity measures such as polynomial degree (for polynomial regression), kernel length scale (for Mat\'ern kernels), number of neighbors (for k-nearest neighbors), number of splits (for decision trees), and number of trees (for random forests). We also use our measure to obtain new insights into the double descent phenomenon for random Fourier features, random forests, neural networks, and gradient boosting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a new measure of model complexity defined via similarities between model gradients at different input points. It asserts that this measure is rigorous, tractable to compute, applies to parametric and non-parametric kernel models, and exactly generalizes several standard complexity measures: polynomial degree in regression, length scale in Matérn kernels, number of neighbors in kNN, number of splits in decision trees, and number of trees in random forests. The measure is further used to analyze the double descent curve in random Fourier features, random forests, neural networks, and gradient boosting.
Significance. Should the central claims hold, particularly the exact generalization to discrete parameters in non-differentiable models, the work would offer a valuable unified framework for quantifying complexity. This could impact model selection, interpretation, and theoretical analysis of generalization, including explanations for double descent phenomena. The tractability is a notable strength if the computation avoids expensive operations.
major comments (2)
- Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.
- Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The two major comments highlight important points about rigor for non-differentiable models. We address each below and indicate the revisions we will make.
read point-by-point responses
-
Referee: Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.
Authors: We agree this is a central issue for the claimed generality. The manuscript introduces a smoothed surrogate gradient for piecewise-constant models (via a small Gaussian perturbation whose bandwidth is taken to zero after the similarity computation). In the supplementary proofs we show that the resulting complexity value converges exactly to the number of neighbors or splits. To satisfy the referee's request we will add an explicit definition of the surrogate in the main text together with a short theorem establishing that the recovered integer is independent of the particular smoothing kernel (under standard Lipschitz and compact-support assumptions on the kernel). revision: yes
-
Referee: Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.
Authors: The existing derivations already obtain the exact parameter value in the zero-bandwidth limit. We will revise the relevant sections to state this limit explicitly, to name the auxiliary smoothing parameter, and to include a brief argument (or reference to a standard result) that the final integer complexity is invariant to the choice of smoothing kernel. This change will be marked as a clarification rather than a new technical result. revision: partial
Circularity Check
No circularity: new gradient-similarity measure with independent proofs for special cases
full rationale
The paper defines a complexity measure from similarities of model gradients across inputs. It then proves this recovers polynomial degree, Matérn length scale, kNN neighbors, tree splits, and forest tree count as special cases. No quoted equations or self-citations reduce the central definition or proofs to tautology or fitted inputs by construction. Generalization claims rest on explicit derivations for each model class rather than renaming or self-referential fitting. The approach is self-contained against external benchmarks for the listed model families.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel (J uniqueness) echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
K(ˆf) := 1−E[ (φ(x)ᵀφ(x′) / (‖φ(x)‖‖φ(x′)‖))² ] … equals the normalized linear entropy LE(¯K) of the normalized NTK/kernel matrix
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove that our measure … generalizes … polynomial degree … kernel length scale … number of neighbors … number of splits … number of trees
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2505.11006 , year=
Supervised Models Can Generalize Also When Trained on Random Label , author=. arXiv preprint arXiv:2505.11006 , year=
-
[2]
Proceedings of the National Academy of Sciences , volume=
Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=
work page 2019
-
[3]
Journal of the American Statistical Association , volume=
On measuring and correcting the effects of data mining and model selection , author=. Journal of the American Statistical Association , volume=. 1998 , publisher=
work page 1998
-
[4]
Journal of the American statistical Association , volume=
How biased is the apparent error rate of a prediction rule? , author=. Journal of the American statistical Association , volume=. 1986 , publisher=
work page 1986
-
[5]
Journal of the American Statistical Association , volume=
The estimation of prediction error: covariance penalties and cross-validation , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=
work page 2004
-
[6]
Degrees of Freedom in Deep Neural Networks
Degrees of freedom in deep neural networks , author=. arXiv preprint arXiv:1603.09260 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
arXiv preprint arXiv:2602.13442 , year=
Measuring Neural Network Complexity via Effective Degrees of Freedom , author=. arXiv preprint arXiv:2602.13442 , year=
-
[8]
Journal of the American Statistical Association , year=
From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation , author=. Journal of the American Statistical Association , year=
-
[9]
arXiv preprint arXiv:2106.15682 , year=
Predictive model degrees of freedom in linear regression , author=. arXiv preprint arXiv:2106.15682 , year=
-
[10]
Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. A
-
[11]
arXiv preprint arXiv:2410.01259 , year=
Revisiting optimism and model complexity in the wake of overparameterized machine learning , author=. arXiv preprint arXiv:2410.01259 , year=
-
[12]
Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. Why do random forests work?
-
[13]
Effective degrees of freedom: a flawed metaphor , author=. Biometrika , volume=. 2015 , publisher=
work page 2015
-
[14]
Advances in Neural Information Processing Systems , volume=
Deep learning through a telescoping lens: A simple model provides empirical insights on grokking, gradient boosting & beyond , author=. Advances in Neural Information Processing Systems , volume=
- [15]
-
[16]
Bartlett, Peter L and Bousquet, Olivier and Mendelson, Shahar , journal=. Local. 2005 , publisher=
work page 2005
-
[17]
International conference on machine learning , pages=
Rademacher complexity for adversarially robust generalization , author=. International conference on machine learning , pages=. 2019 , organization=
work page 2019
-
[18]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Rademacher Complexity for Distributionally Robust Learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[19]
Cortes, Corinna and Kloft, Marius and Mohri, Mehryar , journal=. Learning kernels using local
-
[20]
Early Stopping Strategy using Neural Tangent Kernel Theory and
Xavier, Daniel Martin and Chamoin, Ludovic and Fribourg, Laurent , booktitle=. Early Stopping Strategy using Neural Tangent Kernel Theory and. 2025 , organization=
work page 2025
-
[21]
Truong, Lan V , journal=. On
-
[22]
Xiao, Jiancong and Fan, Yanbo and Sun, Ruoyu and Luo, Zhi-Quan , journal=. Adversarial
-
[23]
Duan, Yaqi and Jin, Chi and Li, Zhiyuan , booktitle=. Risk bounds and. 2021 , organization=
work page 2021
- [24]
-
[25]
A support vector machine with a hybrid kernel and minimal
Tan, Ying and Wang, Jun , journal=. A support vector machine with a hybrid kernel and minimal. 2004 , publisher=
work page 2004
-
[26]
Xie, Yaqi and Ma, Will and Xin, Linwei , journal=
-
[27]
Sepliarskaia, Anna and Langer, Sophie and Schmidt-Hieber, Johannes , journal=. On the
-
[28]
The nature of statistical learning theory , author=. 2013 , publisher=
work page 2013
-
[29]
Bartl, Daniel and Mendelson, Shahar , journal=. Do we really need the
- [30]
-
[31]
Robust subgaussian estimation with
Depersin, Jules , booktitle=. Robust subgaussian estimation with. 2024 , organization=
work page 2024
-
[32]
Aslan, Ozlem and Yildiz, Olcay Taner and Alpaydin, Ethem , booktitle=. Calculating the. 2009 , organization=
work page 2009
-
[33]
Advances in neural information processing systems , volume=
On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=
-
[34]
IEEE transactions on neural networks and learning systems , volume=
On the complexity of neural network classifiers: A comparison between shallow and deep architectures , author=. IEEE transactions on neural networks and learning systems , volume=. 2014 , publisher=
work page 2014
-
[35]
international conference on machine learning , pages=
On the expressive power of deep neural networks , author=. international conference on machine learning , pages=. 2017 , organization=
work page 2017
-
[36]
Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James , booktitle=. Fisher-. 2019 , organization=
work page 2019
-
[37]
Advances in neural information processing systems , volume=
Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=
-
[38]
Physica D: Nonlinear Phenomena , pages=
The complexity dynamics of grokking , author=. Physica D: Nonlinear Phenomena , pages=. 2025 , publisher=
work page 2025
-
[39]
Advances in Neural Information Processing Systems , volume=
On the expected complexity of maxout networks , author=. Advances in Neural Information Processing Systems , volume=
-
[40]
Advances in Neural Information Processing Systems , volume=
Improved bounds on neural complexity for representing piecewise linear functions , author=. Advances in Neural Information Processing Systems , volume=
-
[41]
International Conference on Machine Learning , pages=
Complexity of linear regions in deep networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=
work page 2019
-
[42]
SIAM Journal on Mathematics of Data Science , volume=
Spectral complexity of deep neural networks , author=. SIAM Journal on Mathematics of Data Science , volume=. 2025 , publisher=
work page 2025
-
[43]
Changing the Kernel During Training Leads to Double Descent in Kernel Regression , author=. arXiv e-prints , pages=
-
[44]
Understanding deep learning requires rethinking generalization
Understanding deep learning requires rethinking generalization , author =. arXiv preprint arXiv:1611.03530 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Advances in neural information processing systems , volume=
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate , author=. Advances in neural information processing systems , volume=
-
[47]
Proceedings of the National Academy of Sciences , volume=
Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=
work page 2020
-
[48]
Annals of statistics , volume=
Surprises in high-dimensional ridgeless least squares interpolation , author=. Annals of statistics , volume=. 2022 , publisher=
work page 2022
-
[49]
Communications on Pure and Applied Mathematics , volume=
The generalization error of random features regression: Precise asymptotics and the double descent curve , author=. Communications on Pure and Applied Mathematics , volume=. 2022 , publisher=
work page 2022
-
[50]
Information and Inference: A Journal of the IMA , volume=
A model of double descent for high-dimensional binary linear classification , author=. Information and Inference: A Journal of the IMA , volume=. 2022 , publisher=
work page 2022
-
[51]
Forty-first International Conference on Machine Learning , year=
No Double Descent in Principal Component Regression: A High-Dimensional Analysis , author=. Forty-first International Conference on Machine Learning , year=
-
[52]
arXiv preprint arXiv:2409.18842 , year=
Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs , author=. arXiv preprint arXiv:2409.18842 , year=
-
[53]
International Conference on Artificial Intelligence and Statistics , pages=
Kernel regression in high dimensions: Refined analysis beyond double descent , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=
work page 2021
-
[54]
arXiv preprint arXiv:1912.06190 , year=
Double descent in the condition number , author=. arXiv preprint arXiv:1912.06190 , year=
-
[55]
Journal of Statistical Mechanics: Theory and Experiment , volume=
Scaling description of generalization with number of parameters in deep learning , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2020 , publisher=
work page 2020
-
[56]
Journal of Statistical Mechanics: Theory and Experiment , volume=
Deep double descent: Where bigger models and more data hurt , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=
work page 2021
-
[57]
International Conference on Machine Learning , pages=
On the generalization power of overfitted two-layer neural tangent kernel models , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[58]
SIAM Journal on Mathematics of Data Science , volume=
High-dimensional analysis of double descent for linear regression with random projections , author=. SIAM Journal on Mathematics of Data Science , volume=. 2024 , publisher=
work page 2024
-
[59]
Advances in Neural Information Processing Systems , volume=
Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in Neural Information Processing Systems , volume=
-
[60]
Proceedings of the IEEE , volume=
Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=
work page 1998
-
[61]
Learning multiple layers of features from tiny images , author=. 2009 , publisher=
work page 2009
-
[62]
Advances in neural information processing systems , volume=
A simple weight decay can improve generalization , author=. Advances in neural information processing systems , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.