pith. sign in

arxiv: 2605.21167 · v1 · pith:AA5KJ5W7new · submitted 2026-05-20 · 📊 stat.ML · cs.LG

A Rigorous, Tractable Measure of Model Complexity

Pith reviewed 2026-05-21 01:47 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords model complexitygradient similaritydouble descentmodel selectionparametric modelskernel methodsrandom forestsgeneralization
0
0 comments X

The pith

A complexity measure based on similarities between model gradients across inputs generalizes specific measures like polynomial degree and number of trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a measure of model complexity defined through the similarities of a model's gradients evaluated at different input points. This definition is mathematically rigorous and remains computationally feasible for both parametric models and kernel-based nonparametric ones. It is proven to recover established complexity notions in standard settings, including the degree for polynomials, the length scale for Matérn kernels, the neighbor count for k-nearest neighbors, the split count for decision trees, and the tree count for random forests. The same measure is applied to derive fresh observations about the double descent behavior observed when training random Fourier features, random forests, neural networks, and gradient boosting.

Core claim

Defining complexity via the degree of similarity among model gradients at different inputs produces a single, well-defined quantity that applies uniformly across model classes and exactly reproduces the usual complexity parameters in each special case.

What carries the argument

Similarity between model gradients across inputs, which quantifies how much the model's local behavior varies with small changes in the input.

If this is right

  • The measure supplies a common yardstick for comparing complexity across entirely different model families.
  • It yields a concrete way to track how complexity evolves during training and thereby clarifies the double descent curve.
  • Because the measure is defined for any differentiable model, it extends naturally to neural networks and gradient boosting without new heuristics.
  • It offers a tractable penalty term that could be inserted directly into model-selection criteria.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gradient-similarity view may allow complexity to be monitored continuously during training rather than only at the end.
  • It could be used to design regularization that explicitly penalizes low gradient similarity.
  • The same construction might connect model complexity to geometric properties of the learned function without requiring separate analysis for each architecture.

Load-bearing premise

That similarities between model gradients across inputs provide a faithful and general definition of complexity that is meaningful for both parametric and kernel-based models.

What would settle it

A direct computation showing that the proposed measure fails to increase with polynomial degree in polynomial regression, or fails to decrease with increasing kernel length scale in a Matérn kernel model.

Figures

Figures reproduced from arXiv: 2605.21167 by Oskar Allerbo, Thomas B. Sch\"on.

Figure 1
Figure 1. Figure 1: Different complexity measures for linear, polynomial, and Gaussian kernel regression. In the top row, we [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Inferred functions of Gaussian kernel ridge regression for two different length scales with different regulariza [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Different complexity measures for k-nearest neighbors, decision trees, and random forests, for [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Revisiting the double descent experiments of Belkin et al. (2019) for four different models on two different [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The same setting as in Figure 1 for two additional complexity measures, including the first and third quartile [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The same setting as in Figure 3 for vNE, including the first and third quartile over the 100 realizations (dotted [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The same setting as in Figure 4, including the first and ninth decile over 10 (or 100) realizations of the [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Analog of the second row of Figure 4 with bootstrap resampling enabled. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
read the original abstract

An accurate assessment of a model's complexity is crucial for topics such as interpretation, generalization, and model selection. However, most existing complexity measures either rely on heuristic assumptions or are computationally prohibitive. In this paper, we present a mathematically rigorous yet easy-to-compute measure of model complexity that is based on the similarities between the model gradients across inputs. It is thus well-defined for any parametric model, but also for kernel-based non-parametric models. We prove that our measure of complexity generalizes model-specific complexity measures such as polynomial degree (for polynomial regression), kernel length scale (for Mat\'ern kernels), number of neighbors (for k-nearest neighbors), number of splits (for decision trees), and number of trees (for random forests). We also use our measure to obtain new insights into the double descent phenomenon for random Fourier features, random forests, neural networks, and gradient boosting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a new measure of model complexity defined via similarities between model gradients at different input points. It asserts that this measure is rigorous, tractable to compute, applies to parametric and non-parametric kernel models, and exactly generalizes several standard complexity measures: polynomial degree in regression, length scale in Matérn kernels, number of neighbors in kNN, number of splits in decision trees, and number of trees in random forests. The measure is further used to analyze the double descent curve in random Fourier features, random forests, neural networks, and gradient boosting.

Significance. Should the central claims hold, particularly the exact generalization to discrete parameters in non-differentiable models, the work would offer a valuable unified framework for quantifying complexity. This could impact model selection, interpretation, and theoretical analysis of generalization, including explanations for double descent phenomena. The tractability is a notable strength if the computation avoids expensive operations.

major comments (2)
  1. Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.
  2. Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The two major comments highlight important points about rigor for non-differentiable models. We address each below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: Abstract: The claim that the gradient-similarity measure generalizes the number of splits for decision trees and the number of neighbors for k-nearest neighbors is load-bearing for the paper's generality. However, both model classes are discontinuous or piecewise constant, so their gradients are zero almost everywhere. The manuscript must explicitly define the gradient (or surrogate) used in these cases and demonstrate that the resulting complexity measure recovers the discrete parameter exactly, without the recovered value depending on the specific smoothing or approximation chosen.

    Authors: We agree this is a central issue for the claimed generality. The manuscript introduces a smoothed surrogate gradient for piecewise-constant models (via a small Gaussian perturbation whose bandwidth is taken to zero after the similarity computation). In the supplementary proofs we show that the resulting complexity value converges exactly to the number of neighbors or splits. To satisfy the referee's request we will add an explicit definition of the surrogate in the main text together with a short theorem establishing that the recovered integer is independent of the particular smoothing kernel (under standard Lipschitz and compact-support assumptions on the kernel). revision: yes

  2. Referee: Proofs for non-parametric models: For the generalization claims to hold rigorously, the derivations should show that the measure reduces to the model-specific parameter in a limit or exactly, independent of any auxiliary parameters introduced for differentiability. If the proofs rely on a particular smoothing kernel or limit, this should be stated and the independence proven.

    Authors: The existing derivations already obtain the exact parameter value in the zero-bandwidth limit. We will revise the relevant sections to state this limit explicitly, to name the auxiliary smoothing parameter, and to include a brief argument (or reference to a standard result) that the final integer complexity is invariant to the choice of smoothing kernel. This change will be marked as a clarification rather than a new technical result. revision: partial

Circularity Check

0 steps flagged

No circularity: new gradient-similarity measure with independent proofs for special cases

full rationale

The paper defines a complexity measure from similarities of model gradients across inputs. It then proves this recovers polynomial degree, Matérn length scale, kNN neighbors, tree splits, and forest tree count as special cases. No quoted equations or self-citations reduce the central definition or proofs to tautology or fitted inputs by construction. Generalization claims rest on explicit derivations for each model class rather than renaming or self-referential fitting. The approach is self-contained against external benchmarks for the listed model families.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5677 in / 941 out tokens · 23690 ms · 2026-05-21T01:47:59.988773+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 3 internal anchors

  1. [1]

    arXiv preprint arXiv:2505.11006 , year=

    Supervised Models Can Generalize Also When Trained on Random Label , author=. arXiv preprint arXiv:2505.11006 , year=

  2. [2]

    Proceedings of the National Academy of Sciences , volume=

    Reconciling modern machine-learning practice and the classical bias--variance trade-off , author=. Proceedings of the National Academy of Sciences , volume=. 2019 , publisher=

  3. [3]

    Journal of the American Statistical Association , volume=

    On measuring and correcting the effects of data mining and model selection , author=. Journal of the American Statistical Association , volume=. 1998 , publisher=

  4. [4]

    Journal of the American statistical Association , volume=

    How biased is the apparent error rate of a prediction rule? , author=. Journal of the American statistical Association , volume=. 1986 , publisher=

  5. [5]

    Journal of the American Statistical Association , volume=

    The estimation of prediction error: covariance penalties and cross-validation , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=

  6. [6]

    Degrees of Freedom in Deep Neural Networks

    Degrees of freedom in deep neural networks , author=. arXiv preprint arXiv:1603.09260 , year=

  7. [7]

    arXiv preprint arXiv:2602.13442 , year=

    Measuring Neural Network Complexity via Effective Degrees of Freedom , author=. arXiv preprint arXiv:2602.13442 , year=

  8. [8]

    Journal of the American Statistical Association , year=

    From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation , author=. Journal of the American Statistical Association , year=

  9. [9]

    arXiv preprint arXiv:2106.15682 , year=

    Predictive model degrees of freedom in linear regression , author=. arXiv preprint arXiv:2106.15682 , year=

  10. [10]

    Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. A

  11. [11]

    arXiv preprint arXiv:2410.01259 , year=

    Revisiting optimism and model complexity in the wake of overparameterized machine learning , author=. arXiv preprint arXiv:2410.01259 , year=

  12. [12]

    Why do random forests work?

    Curth, Alicia and Jeffares, Alan and van der Schaar, Mihaela , journal=. Why do random forests work?

  13. [13]

    Biometrika , volume=

    Effective degrees of freedom: a flawed metaphor , author=. Biometrika , volume=. 2015 , publisher=

  14. [14]

    Advances in Neural Information Processing Systems , volume=

    Deep learning through a telescoping lens: A simple model provides empirical insights on grokking, gradient boosting & beyond , author=. Advances in Neural Information Processing Systems , volume=

  15. [15]

    Rademacher and

    Bartlett, Peter L and Mendelson, Shahar , journal=. Rademacher and

  16. [16]

    Bartlett, Peter L and Bousquet, Olivier and Mendelson, Shahar , journal=. Local. 2005 , publisher=

  17. [17]

    International conference on machine learning , pages=

    Rademacher complexity for adversarially robust generalization , author=. International conference on machine learning , pages=. 2019 , organization=

  18. [18]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Rademacher Complexity for Distributionally Robust Learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  19. [19]

    Learning kernels using local

    Cortes, Corinna and Kloft, Marius and Mohri, Mehryar , journal=. Learning kernels using local

  20. [20]

    Early Stopping Strategy using Neural Tangent Kernel Theory and

    Xavier, Daniel Martin and Chamoin, Ludovic and Fribourg, Laurent , booktitle=. Early Stopping Strategy using Neural Tangent Kernel Theory and. 2025 , organization=

  21. [21]

    Truong, Lan V , journal=. On

  22. [22]

    Adversarial

    Xiao, Jiancong and Fan, Yanbo and Sun, Ruoyu and Luo, Zhi-Quan , journal=. Adversarial

  23. [23]

    Risk bounds and

    Duan, Yaqi and Jin, Chi and Li, Zhiyuan , booktitle=. Risk bounds and. 2021 , organization=

  24. [24]

    Selective

    K. Selective. Journal of Machine Learning Research , volume=

  25. [25]

    A support vector machine with a hybrid kernel and minimal

    Tan, Ying and Wang, Jun , journal=. A support vector machine with a hybrid kernel and minimal. 2004 , publisher=

  26. [26]

    Xie, Yaqi and Ma, Will and Xin, Linwei , journal=

  27. [27]

    Sepliarskaia, Anna and Langer, Sophie and Schmidt-Hieber, Johannes , journal=. On the

  28. [28]

    2013 , publisher=

    The nature of statistical learning theory , author=. 2013 , publisher=

  29. [29]

    Do we really need the

    Bartl, Daniel and Mendelson, Shahar , journal=. Do we really need the

  30. [30]

    Nearly optimal

    Yang, Yahong and Yang, Haizhao and Xiang, Yang , journal=. Nearly optimal

  31. [31]

    Robust subgaussian estimation with

    Depersin, Jules , booktitle=. Robust subgaussian estimation with. 2024 , organization=

  32. [32]

    Calculating the

    Aslan, Ozlem and Yildiz, Olcay Taner and Alpaydin, Ethem , booktitle=. Calculating the. 2009 , organization=

  33. [33]

    Advances in neural information processing systems , volume=

    On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=

  34. [34]

    IEEE transactions on neural networks and learning systems , volume=

    On the complexity of neural network classifiers: A comparison between shallow and deep architectures , author=. IEEE transactions on neural networks and learning systems , volume=. 2014 , publisher=

  35. [35]

    international conference on machine learning , pages=

    On the expressive power of deep neural networks , author=. international conference on machine learning , pages=. 2017 , organization=

  36. [36]

    Liang, Tengyuan and Poggio, Tomaso and Rakhlin, Alexander and Stokes, James , booktitle=. Fisher-. 2019 , organization=

  37. [37]

    Advances in neural information processing systems , volume=

    Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

  38. [38]

    Physica D: Nonlinear Phenomena , pages=

    The complexity dynamics of grokking , author=. Physica D: Nonlinear Phenomena , pages=. 2025 , publisher=

  39. [39]

    Advances in Neural Information Processing Systems , volume=

    On the expected complexity of maxout networks , author=. Advances in Neural Information Processing Systems , volume=

  40. [40]

    Advances in Neural Information Processing Systems , volume=

    Improved bounds on neural complexity for representing piecewise linear functions , author=. Advances in Neural Information Processing Systems , volume=

  41. [41]

    International Conference on Machine Learning , pages=

    Complexity of linear regions in deep networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  42. [42]

    SIAM Journal on Mathematics of Data Science , volume=

    Spectral complexity of deep neural networks , author=. SIAM Journal on Mathematics of Data Science , volume=. 2025 , publisher=

  43. [43]

    arXiv e-prints , pages=

    Changing the Kernel During Training Leads to Double Descent in Kernel Regression , author=. arXiv e-prints , pages=

  44. [44]

    Understanding deep learning requires rethinking generalization

    Understanding deep learning requires rethinking generalization , author =. arXiv preprint arXiv:1611.03530 , year =

  45. [45]

    Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

    Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

  46. [46]

    Advances in neural information processing systems , volume=

    Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate , author=. Advances in neural information processing systems , volume=

  47. [47]

    Proceedings of the National Academy of Sciences , volume=

    Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

  48. [48]

    Annals of statistics , volume=

    Surprises in high-dimensional ridgeless least squares interpolation , author=. Annals of statistics , volume=. 2022 , publisher=

  49. [49]

    Communications on Pure and Applied Mathematics , volume=

    The generalization error of random features regression: Precise asymptotics and the double descent curve , author=. Communications on Pure and Applied Mathematics , volume=. 2022 , publisher=

  50. [50]

    Information and Inference: A Journal of the IMA , volume=

    A model of double descent for high-dimensional binary linear classification , author=. Information and Inference: A Journal of the IMA , volume=. 2022 , publisher=

  51. [51]

    Forty-first International Conference on Machine Learning , year=

    No Double Descent in Principal Component Regression: A High-Dimensional Analysis , author=. Forty-first International Conference on Machine Learning , year=

  52. [52]

    arXiv preprint arXiv:2409.18842 , year=

    Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs , author=. arXiv preprint arXiv:2409.18842 , year=

  53. [53]

    International Conference on Artificial Intelligence and Statistics , pages=

    Kernel regression in high dimensions: Refined analysis beyond double descent , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

  54. [54]

    arXiv preprint arXiv:1912.06190 , year=

    Double descent in the condition number , author=. arXiv preprint arXiv:1912.06190 , year=

  55. [55]

    Journal of Statistical Mechanics: Theory and Experiment , volume=

    Scaling description of generalization with number of parameters in deep learning , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2020 , publisher=

  56. [56]

    Journal of Statistical Mechanics: Theory and Experiment , volume=

    Deep double descent: Where bigger models and more data hurt , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

  57. [57]

    International Conference on Machine Learning , pages=

    On the generalization power of overfitted two-layer neural tangent kernel models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  58. [58]

    SIAM Journal on Mathematics of Data Science , volume=

    High-dimensional analysis of double descent for linear regression with random projections , author=. SIAM Journal on Mathematics of Data Science , volume=. 2024 , publisher=

  59. [59]

    Advances in Neural Information Processing Systems , volume=

    Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in Neural Information Processing Systems , volume=

  60. [60]

    Proceedings of the IEEE , volume=

    Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

  61. [61]

    2009 , publisher=

    Learning multiple layers of features from tiny images , author=. 2009 , publisher=

  62. [62]

    Advances in neural information processing systems , volume=

    A simple weight decay can improve generalization , author=. Advances in neural information processing systems , volume=