
arXiv: 2510.19206 · v2 · submitted 2025-10-22 · 🧮 math.ST · stat.ML · stat.TH

Shrinkage to Infinity: Reducing Test Error by Inflating the Minimum Norm Interpolator in Linear Models

Pith reviewed 2026-05-18 05:29 UTC · model grok-4.3

classification 🧮 math.ST · stat.ML · stat.TH
keywords high-dimensional regression · minimum norm interpolator · anisotropic covariance · generalization error · inflation · ridge regularization · overparameterized models

The pith

Inflating the minimum norm interpolator by a constant greater than one reduces generalization error in high-dimensional linear regression with anisotropic covariates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In high-dimensional linear regression where the number of dimensions greatly exceeds the number of samples, when the covariates have highly anisotropic covariance and the true coefficient vector aligns with the directions of largest variance, simply multiplying the minimum l2-norm interpolating solution by a fixed number larger than one can lower the expected prediction error on new points. This finding stands in direct opposition to the usual recommendation to shrink coefficient estimates toward zero. The authors establish the result by deriving matching upper and lower bounds on the expectations of certain Gaussian random projections that control the excess risk, and they also construct a practical data-splitting estimator that achieves nearly the same performance without knowing the optimal scaling in advance.

Core claim

When the covariates are highly anisotropic, β aligns with the eigendirections of the population covariance that carry the largest eigenvalues, and d/n diverges to infinity, scaling the minimum l2-norm interpolator up by a constant greater than one reduces its generalization error. Traditional ridge regularization, by contrast, shrinks it.

What carries the argument

The inflated minimum l2-norm interpolator, formed by multiplying the minimum-norm least-squares solution by a scalar strictly larger than one, which rebalances bias and variance under anisotropic population covariance as dimension grows faster than sample size.
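A minimal NumPy sketch of the object described above, assuming a design matrix with full row rank. The helper names `min_norm_interpolator` and `inflate`, the sizes, and the factor 1.2 are illustrative choices, not taken from the paper.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X b = y, assuming X (n x d, d > n) has full row rank."""
    return X.T @ np.linalg.solve(X @ X.T, y)

def inflate(b, c):
    """Scale an estimate by a constant c (c > 1 inflates, c < 1 shrinks)."""
    return c * b

rng = np.random.default_rng(0)
n, d = 20, 400                      # hypothetical sizes with d much larger than n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

b_mn = min_norm_interpolator(X, y)
b_inf = inflate(b_mn, 1.2)

# The min-norm solution reproduces the training responses exactly,
# while the inflated estimate predicts 1.2 * y on the training set.
fits_exactly = np.allclose(X @ b_mn, y)
inflated_fit = np.allclose(X @ b_inf, 1.2 * y)
print(fits_exactly, inflated_fit)
```

Note that the inflated estimate no longer interpolates the training data; any benefit claimed for it concerns error on new points only.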

If this is right

  • In the diverging d/n anisotropic regime, ridge shrinkage is suboptimal and inflation can be preferable.
  • Data splitting produces consistent estimators whose generalization error matches that of the optimally inflated min-norm interpolator.
  • The improvement is proved for a broad class of anisotropic covariances by matching bounds on Gaussian projection expectations.
  • Empirical simulations confirm that moderate inflation lowers test error relative to both the unscaled interpolator and ridge.
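The data-splitting bullet above can be illustrated with a hypothetical construction: fit the min-norm interpolator on one half of the sample, then pick the scaling by one-dimensional least squares on the held-out half. This is a sketch of the general idea only. The paper's actual estimator may differ, and the spiked spectrum, sample sizes, and noise level are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 100, 2000, 0.1        # illustrative values, not from the paper

# Assumed spiked spectrum: one dominant direction, flat low-variance tail.
eigs = np.full(d, 0.05)
eigs[0] = 1.0
beta = np.zeros(d)
beta[0] = 1.0                       # signal aligned with the leading eigenvector

X = rng.standard_normal((n, d)) * np.sqrt(eigs)
y = X @ beta + sigma * rng.standard_normal(n)

# Split: first half fits the interpolator, second half tunes the scale.
X_fit, y_fit = X[:50], y[:50]
X_val, y_val = X[50:], y[50:]

b_mn = X_fit.T @ np.linalg.solve(X_fit @ X_fit.T, y_fit)

# Least-squares choice of c on held-out data: argmin_c sum (y_i - c * x_i^T b_mn)^2.
pred = X_val @ b_mn
c_hat = float((pred @ y_val) / (pred @ pred))
b_split = c_hat * b_mn

print(f"estimated inflation factor: {c_hat:.2f}")
```

Using a held-out half matters here because the interpolator fits its own training data exactly, so the same least-squares choice made on the fitting half would always return c = 1.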

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that the default bias toward shrinkage may need re-examination whenever features exhibit heterogeneous scales and signals concentrate on the strongest ones.
  • Analogous inflation effects could appear in other overparameterized linear or kernel estimators under similar alignment conditions.
  • Developing fully data-driven methods to choose the inflation factor without sample splitting would increase practical applicability.

Load-bearing premise

The covariates must have strongly anisotropic covariance with the signal aligned to the largest-variance directions and the dimension-to-sample ratio must diverge to infinity.

What would settle it

Generate data from a covariance whose eigenvalues decay rapidly, align beta with the leading eigenvectors, draw n samples with d much larger than n, fit the min-norm solution, multiply its coefficients by 1.2, and check whether the resulting test error on fresh data is strictly smaller than that of the unscaled min-norm solution.
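The check described above could be sketched as follows. The spectrum (a single spike over a flat tail, standing in for rapid eigenvalue decay), the sizes, and the noise level are assumptions made for illustration. The population excess risk is computed in closed form, which equals the expected squared prediction error on fresh data minus the noise variance, so no test sample needs to be drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 2000, 0.1          # d/n = 40; all values are illustrative

# Assumed covariance: diagonal, one dominant eigenvalue and a low-variance tail.
eigs = np.full(d, 0.05)
eigs[0] = 1.0
beta = np.zeros(d)
beta[0] = 1.0                        # signal aligned with the leading eigenvector

X = rng.standard_normal((n, d)) * np.sqrt(eigs)
y = X @ beta + sigma * rng.standard_normal(n)

# Minimum l2-norm interpolator (X has full row rank almost surely).
b_mn = X.T @ np.linalg.solve(X @ X.T, y)

def excess_risk(b):
    """Population excess risk E[(x^T b - x^T beta)^2] for the diagonal covariance above."""
    return float(np.sum(eigs * (b - beta) ** 2))

risk_plain = excess_risk(b_mn)
risk_inflated = excess_risk(1.2 * b_mn)
print(f"c = 1.0: {risk_plain:.4f}   c = 1.2: {risk_inflated:.4f}")
```

A single seed under one hand-picked spectrum is only a sanity check. Settling the claim would require repeating the comparison over many draws and over the covariance classes the paper actually covers.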

Figures

Figures reproduced from arXiv: 2510.19206 by Jake Freeman.

Figure 1: This figure illustrates how the Inflation Property relates to …

Original abstract

Hastie et al. (2022) found that ridge regularization is essential in high dimensional linear regression $y=\beta^Tx + \epsilon$ with isotropic co-variates $x\in \mathbb{R}^d$ and $n$ samples at fixed $d/n$. However, Hastie et al. (2022) also notes that when the co-variates are anisotropic and $\beta$ is aligned with the top eigenvalues of population covariance, the "situation is qualitatively different." In the present article, we make precise this observation for linear regression with highly anisotropic covariances and diverging $d/n$. We find (both theoretically and empirically) that simply scaling up (or inflating) the minimum $\ell_2$ norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions. Moreover, we use a data-splitting technique to produce consistent estimators that achieve generalization error comparable to that of the optimally inflated minimum-norm interpolator. Our proof relies on matching upper and lower bounds for expectations of Gaussian random projections for a general class of anisotropic covariance matrices when $d/n\rightarrow \infty$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript studies linear regression in the high-dimensional regime with anisotropic covariates and signal aligned to the leading principal components of the covariance. It argues that the minimum-norm least squares interpolator underestimates the coefficients in the strong directions, and that multiplying it by a constant inflation factor greater than one yields lower test error. Matching upper and lower bounds are provided for the relevant Gaussian projection expectations under general anisotropic Σ as d/n diverges, and a data-splitting method is introduced to consistently estimate the optimal inflation factor without knowledge of the population parameters.

Significance. This finding is significant because it identifies a regime where the usual bias-variance trade-off via shrinkage is reversed, suggesting that 'shrinkage to infinity' can be beneficial. The theoretical bounds and the construction of a practical estimator strengthen the result. If the bounds are as tight as claimed and the empirical results hold, this could shift perspectives on regularization in overparameterized anisotropic models.

major comments (1)
  1. [§4.1, Eq. (12)] The derivation of the risk for the inflated interpolator from the projection bounds is sketched, but the step showing that the optimal c > 1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm it is not an artifact of the asymptotic approximation.
minor comments (3)
  1. [Abstract] The phrase 'shrinkage to infinity' is evocative but could be clarified in the introduction to avoid confusion with standard shrinkage estimators.
  2. [Notation] The definition of the population covariance Σ and its eigenvalue alignment with β should be stated more explicitly early on.
  3. [Figure 1] The caption could include the specific value of d/n used in the simulation, for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of our manuscript and for the constructive comment. We address the point raised below and have revised the manuscript accordingly to improve clarity.

Point-by-point responses
  1. Referee: [§4.1, Eq. (12)] The derivation of the risk for the inflated interpolator from the projection bounds is sketched, but the step showing that the optimal c > 1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm it is not an artifact of the asymptotic approximation.

    Authors: We agree that expanding this step will strengthen the presentation. In the revised manuscript we insert a detailed calculation immediately after the statement of the matching bounds. Let R(c) denote the asymptotic risk of the c-inflated estimator. From the lower bound on the Gaussian projection expectation we obtain R'(1) < 0 whenever the lower bound is strictly less than 1, which is guaranteed by the conditions on the eigenvalue decay and the alignment of β. We then show that R(c) is convex for c > 0, so the negative derivative at c = 1 implies that the minimizer lies strictly above 1. The argument uses only the explicit form of the bounds already proved in the section and does not rely on further asymptotic approximations beyond those already stated. The expanded derivation appears in the new version of §4.1.

    revision: yes
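The convexity step in the response can be made concrete for the excess risk of a fixed estimate $\hat\beta$. This is a sketch for the finite-sample quantity rather than the paper's asymptotic risk; the shorthand $A$, $B$, $C$ is introduced here, and it is not the authors' expanded calculation.

$$
R(c) = (c\hat\beta - \beta)^T \Sigma\, (c\hat\beta - \beta) = c^2 A - 2cB + C,
\qquad A = \hat\beta^T \Sigma \hat\beta,\quad B = \hat\beta^T \Sigma \beta,\quad C = \beta^T \Sigma \beta .
$$

$$
R'(c) = 2(cA - B), \qquad R''(c) = 2A \ge 0, \qquad c^* = \frac{B}{A} \quad (A > 0).
$$

Since $R'(1) = 2(A - B)$, the slope at $c = 1$ is negative exactly when $B > A$, which is the same condition as $c^* > 1$. Taking expectations over the data preserves the quadratic form in $c$, so the same reasoning applies to the expected risk with $A$ and $B$ replaced by their expectations.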

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives its central result on inflation of the minimum-norm interpolator from matching upper and lower bounds on expectations of Gaussian random projections under anisotropic covariances with d/n diverging. These bounds are obtained directly from properties of the population covariance and signal alignment, without reference to fitted parameters or post-hoc adjustment of the target risk. The data-splitting construction is shown to yield a consistent estimator whose risk tracks the inflated interpolator via standard concentration arguments independent of the specific inflation factor chosen. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear; the reference to Hastie et al. (2022) is used only to motivate the regime, while the new bounds supply independent support. The argument remains falsifiable outside any fitted values and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of highly anisotropic covariances with signal alignment to top eigenvalues and the technical assumption of diverging d/n for the Gaussian projection bounds.

free parameters (1)
  • inflation constant
    The factor greater than one used to scale the minimum-norm interpolator; its optimal value is estimated via data splitting but is not derived from first principles in the abstract.
axioms (1)
  • domain assumption: The covariates' covariance matrix belongs to a general anisotropic class, with β aligned with its top eigendirections
    Invoked as the condition making the situation qualitatively different and enabling the Gaussian projection bounds when d/n → ∞.

pith-pipeline@v0.9.0 · 5738 in / 1416 out tokens · 48044 ms · 2026-05-18T05:29:41.580607+00:00 · methodology



Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off

    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, July 2019a. ISSN 1091-6490. doi: 10.1073/pnas.1903070116. URL http://dx.doi.org/10.1073/pnas.1903070116. Mikhail Belkin, Alexander Rakhlin, a...

  2. [2]

    Rick Durrett. Probability: Theory and Examples

    URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a12c999be280372b157294e72a4bbc8b-Paper-Conference.pdf. Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5 edition,

  3. [3]

    URL https://doi.org/10.1214/21-AOS2133

    doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133. Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2 edition,

  4. [4]

    On Some Statistical Properties of Dynamical Systems

    University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200512173. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models,

  5. [5]

    Scaling Laws for Neural Language Models

    URL https://arxiv.org/abs/2001.08361. Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(169):1–16,

  6. [6]

    ridgeless

    ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2007.05.006. URL https://www.sciencedirect.com/science/article/pii/S0047259X07000814. Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347,

  7. [7]

    URL https://www.jstor.org/stable/26931513

    ISSN 00905364, 21688966. URL https://www.jstor.org/stable/26931513. Nicole Muecke, Enrico Reiss, Jonas Rungenhagen, and Markus Klein. Data-splitting improves statistical performance in overparameterized regimes. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelli...

  8. [8]

    Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin

    URL https://arxiv.org/abs/2404.01233. Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin. Bagging regularizes, March

  9. [9]

    Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco

    URL https://dspace.mit.edu/bitstream/handle/1721.1/7268/AIM-2002-003.pdf?sequence=2&isAllowed=y. Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statis...

  10. [10]

    Hanson-Wright inequality and sub-gaussian concentration

    URL https://arxiv.org/abs/1306.2872. Jack Sherman and Winifred J. Morrison. Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. The Annals of Mathematical Statistics, 21(1):124–127,

  11. [11]

    On information and sufficiency

    doi: 10.1214/aoms/1177729893. URL https://doi.org/10.1214/aoms/1177729893. Michiaki Taniguchi and Volker Tresp. Averaging regularized estimators. Neural Computation, 9(5):1163–1178, 07

  12. [12]

    doi: 10.1162/neco.1997.9.5.1163

    ISSN 0899-7667. doi: 10.1162/neco.1997.9.5.1163. URL https://doi.org/10.1162/neco.1997.9.5.1163. Y. L. Tong. The multivariate normal distribution. Springer-Verlag, New York,

  13. [13]

    Denny Wu and Ji Xu

    URL https://proceedings.neurips.cc/paper_files/paper/2021/file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf. Denny Wu and Ji Xu. On the optimal weighted $\ell_2$ regularization in overparameterized linear regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 10...

  14. [14]

    Yuchen Zhang, John Duchi, and Martin Wainwright

    URL https://proceedings.neurips.cc/paper_files/paper/2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning ...