Shrinkage to Infinity: Reducing Test Error by Inflating the Minimum Norm Interpolator in Linear Models

Jake Freeman

arXiv: 2510.19206 · v2 · submitted 2025-10-22 · math.ST · stat.ML · stat.TH

Pith reviewed 2026-05-18 05:29 UTC · model grok-4.3

classification math.ST · stat.ML · stat.TH

keywords high-dimensional regression · minimum norm interpolator · anisotropic covariance · generalization error · inflation · ridge regularization · overparameterized models

The pith

Inflating the minimum norm interpolator by a constant greater than one reduces generalization error in high-dimensional linear regression with anisotropic covariates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In high-dimensional linear regression where the number of dimensions greatly exceeds the number of samples, when the covariates have highly anisotropic covariance and the true coefficient vector aligns with the directions of largest variance, simply multiplying the minimum l2-norm interpolating solution by a fixed number larger than one can lower the expected prediction error on new points. This finding runs directly against the usual recommendation to shrink coefficient estimates toward zero. The author establishes the result by deriving matching upper and lower bounds on the expectations of certain Gaussian random projections that control the excess risk, and also constructs a practical data-splitting estimator that achieves nearly the same performance without knowing the optimal scaling in advance.

Core claim

When covariates are highly anisotropic, beta aligns with the eigenvectors belonging to the top eigenvalues of the population covariance, and d/n diverges to infinity, the generalization error of the minimum l2-norm interpolator is reduced by scaling it up by a constant greater than one, in contrast to traditional ridge regularization, which shrinks it.

What carries the argument

The inflated minimum l2-norm interpolator, formed by multiplying the minimum-norm least-squares solution by a scalar strictly larger than one, which rebalances bias and variance under anisotropic population covariance as dimension grows faster than sample size.
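For concreteness, the estimator can be written out as follows. This is a sketch in standard notation that is assumed here rather than taken from the paper: $X \in \mathbb{R}^{n \times d}$ is the design matrix with $d > n$ and $XX^\top$ invertible, $y \in \mathbb{R}^n$ is the response vector, and $c$ is the inflation factor.

$$
\hat\beta \;=\; \arg\min_{b \,:\, Xb = y} \|b\|_2 \;=\; X^\top (X X^\top)^{-1} y,
\qquad
\hat\beta_c \;=\; c\,\hat\beta, \quad c > 1 .
$$

Since $X\hat\beta_c = c\,y$, the inflated estimator no longer interpolates the training data when $c \neq 1$; it keeps the direction of the interpolator and changes only its length.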

If this is right

In the diverging d/n anisotropic regime, ridge shrinkage is suboptimal and inflation can be preferable.
Data splitting produces consistent estimators whose generalization error matches that of the optimally inflated min-norm interpolator.
The improvement is proved for a broad class of anisotropic covariances by matching bounds on Gaussian projection expectations.
Empirical simulations confirm that moderate inflation lowers test error relative to both the unscaled interpolator and ridge.
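The data-splitting idea in the list above can be pictured with a small sketch. The procedure below is an assumption made for illustration and is not claimed to be the paper's construction: it fits the minimum-norm solution on one part of the data and picks the scalar by least squares on the held-out part. The function name `split_inflation_factor` and the parameters `frac` and `seed` are hypothetical.

```python
import numpy as np


def split_inflation_factor(X, y, frac=0.5, seed=0):
    """Estimate a scalar inflation factor by sample splitting (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    n_fit = int(frac * n)
    fit_idx, val_idx = perm[:n_fit], perm[n_fit:]

    # Minimum-norm least-squares solution on the fitting part.
    beta_mn = np.linalg.pinv(X[fit_idx]) @ y[fit_idx]

    # Held-out predictions; the scalar c minimizing ||y_val - c * preds||^2
    # has the closed form <preds, y_val> / <preds, preds>.
    preds = X[val_idx] @ beta_mn
    denom = float(preds @ preds)
    if denom == 0.0:
        raise ValueError("held-out predictions are all zero; c is undefined")
    c_hat = float(preds @ y[val_idx]) / denom
    return c_hat, c_hat * beta_mn
```

A value of `c_hat` above one on held-out data would be consistent with the inflation effect described here, while a value below one would point toward shrinkage; the sketch itself takes no position on which occurs.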

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests that the default bias toward shrinkage may need re-examination whenever features exhibit heterogeneous scales and signals concentrate on the strongest ones.
Analogous inflation effects could appear in other overparameterized linear or kernel estimators under similar alignment conditions.
Developing fully data-driven methods to choose the inflation factor without sample splitting would increase practical applicability.

Load-bearing premise

The covariates must have strongly anisotropic covariance with the signal aligned to the largest-variance directions and the dimension-to-sample ratio must diverge to infinity.

What would settle it

Generate data from a covariance whose eigenvalues decay rapidly, align beta with the leading eigenvectors, draw n samples with d much larger than n, fit the min-norm solution, multiply its coefficients by 1.2, and check whether the resulting test error on fresh data is strictly smaller than that of the unscaled min-norm solution.
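The check described above can be sketched in a few lines of NumPy. Everything specific in this sketch is an assumption chosen for illustration, not a setting from the paper: the diagonal covariance with eigenvalues j^(-decay), the signal spread evenly over the top `k` coordinates, the noise level, and the function name `simulate_inflation`. Because the covariance is known in a simulation, the sketch evaluates the exact excess risk instead of drawing a fresh test set.

```python
import numpy as np


def simulate_inflation(n=50, d=2000, k=5, decay=2.0, noise=0.5, c=1.2, seed=0):
    """Excess risk of the min-norm interpolator and of its c-inflated version.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    # Assumed diagonal covariance with polynomially decaying eigenvalues.
    eigvals = np.arange(1, d + 1, dtype=float) ** (-decay)
    # Signal supported on the k leading eigen-directions, unit Euclidean norm.
    beta = np.zeros(d)
    beta[:k] = 1.0 / np.sqrt(k)

    # Gaussian covariates with covariance diag(eigvals), plus Gaussian noise.
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
    y = X @ beta + noise * rng.standard_normal(n)

    # Minimum l2-norm interpolator via the Moore-Penrose pseudoinverse.
    beta_mn = np.linalg.pinv(X) @ y

    def excess_risk(b):
        # (b - beta)^T Sigma (b - beta) for diagonal Sigma; this equals the
        # expected squared prediction error on a fresh point minus noise**2.
        diff = b - beta
        return float(np.sum(eigvals * diff**2))

    return {
        "risk_min_norm": excess_risk(beta_mn),
        "risk_inflated": excess_risk(c * beta_mn),
        "train_residual": float(np.max(np.abs(X @ beta_mn - y))),
    }
```

Calling `simulate_inflation(c=c)` for several values of `c` and several seeds, then comparing `risk_inflated` with `risk_min_norm`, gives a direct read on whether inflation helps in a given configuration. A single draw is noisy, so averaging over seeds is advisable before drawing conclusions.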

Figures

Figures reproduced from arXiv: 2510.19206 by Jake Freeman.

Original abstract

Hastie et al. (2022) found that ridge regularization is essential in high dimensional linear regression $y=\beta^Tx + \epsilon$ with isotropic covariates $x\in \mathbb{R}^d$ and $n$ samples at fixed $d/n$. However, Hastie et al. (2022) also note that when the covariates are anisotropic and $\beta$ is aligned with the top eigenvalues of population covariance, the "situation is qualitatively different." In the present article, we make precise this observation for linear regression with highly anisotropic covariances and diverging $d/n$. We find (both theoretically and empirically) that simply scaling up (or inflating) the minimum $\ell_2$ norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions. Moreover, we use a data-splitting technique to produce consistent estimators that achieve generalization error comparable to that of the optimally inflated minimum-norm interpolator. Our proof relies on matching upper and lower bounds for expectations of Gaussian random projections for a general class of anisotropic covariance matrices when $d/n\rightarrow \infty$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that, in the specific regime of strongly anisotropic covariates with d/n diverging and beta aligned with the top eigen-directions, inflating the min-norm interpolator by a fixed factor c > 1 reduces risk. It also gives matching bounds and a data-splitting estimator of the inflation factor.

The colleague should know two things up front. First, this paper turns the qualitative remark in Hastie et al. (2022) into a precise statement: when covariates are highly anisotropic, d/n goes to infinity, and beta lines up with the leading eigenvectors of Sigma, simply scaling the minimum l2-norm interpolator upward by a constant c > 1 lowers asymptotic risk. Second, the paper supplies a data-splitting procedure that consistently estimates a good inflation factor without knowing the population quantities in advance.

Referee Report

1 major / 3 minor

Summary. The manuscript studies linear regression in the high-dimensional regime with anisotropic covariates and signal aligned to the leading principal components of the covariance. It argues that the minimum-norm least squares interpolator underestimates the coefficients in the strong directions, and that multiplying it by a constant inflation factor greater than one yields lower test error. Matching upper and lower bounds are provided for the relevant Gaussian projection expectations under general anisotropic Σ as d/n diverges, and a data-splitting method is introduced to consistently estimate the optimal inflation factor without knowledge of the population parameters.

Significance. This finding is significant because it identifies a regime where the usual bias-variance trade-off via shrinkage is reversed, suggesting that 'shrinkage to infinity' can be beneficial. The theoretical bounds and the construction of a practical estimator strengthen the result. If the bounds are as tight as claimed and the empirical results hold, this could shift perspectives on regularization in overparameterized anisotropic models.

major comments (1)

[§4.1, Eq. (12)] The derivation of the risk for the inflated interpolator from the projection bounds is sketched, but the step showing that the optimal c > 1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm it is not an artifact of the asymptotic approximation.

minor comments (3)

[Abstract] The phrase 'shrinkage to infinity' is evocative but could be clarified in the introduction to avoid confusion with standard shrinkage estimators.
[Notation] The definition of the population covariance Σ and its eigenvalue alignment with β should be stated more explicitly early on.
[Figure 1] The caption could include the specific value of d/n used in the simulation for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of our manuscript and for the constructive comment. We address the point raised below and have revised the manuscript accordingly to improve clarity.

Referee: [§4.1, Eq. (12)] §4.1, Eq. (12): the derivation of the risk for the inflated interpolator from the projection bounds is sketched but the step showing that the optimal c >1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm it is not an artifact of the asymptotic approximation.

Authors: We agree that expanding this step will strengthen the presentation. In the revised manuscript we insert a detailed calculation immediately after the statement of the matching bounds. Let R(c) denote the asymptotic risk of the c-inflated estimator. From the lower bound on the Gaussian projection expectation we obtain R'(1) < 0 whenever the lower bound is strictly less than 1, which is guaranteed by the conditions on the eigenvalue decay and the alignment of β. We then show that R(c) is convex for c > 0, so the negative derivative at c = 1 implies that the minimizer lies strictly above 1. The argument uses only the explicit form of the bounds already proved in the section and does not rely on further asymptotic approximations beyond those already stated. The expanded derivation appears in the new version of §4.1.

revision: yes
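The shape of this argument can be illustrated with a short calculation. It is a sketch under assumed notation, with the excess risk of a fixed scalar multiple defined as $R(c) = \mathbb{E}\,(c\hat\beta - \beta)^\top \Sigma\,(c\hat\beta - \beta)$; it is not a reproduction of the paper's Eq. (12) or of its asymptotic bounds.

$$
R(c) \;=\; c^2 A \;-\; 2cB \;+\; C,
\qquad
A = \mathbb{E}\big[\hat\beta^\top \Sigma\, \hat\beta\big],\quad
B = \mathbb{E}\big[\hat\beta^\top \Sigma\, \beta\big],\quad
C = \beta^\top \Sigma\, \beta .
$$

$$
R'(1) \;=\; 2\,(A - B),
\qquad
c^\star \;=\; \arg\min_{c} R(c) \;=\; \frac{B}{A} .
$$

Under this form $R$ is a convex quadratic whenever $A > 0$, so $R'(1) < 0$ is equivalent to $B > A$, which in turn is equivalent to $c^\star > 1$.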

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper derives its central result on inflation of the minimum-norm interpolator from matching upper and lower bounds on expectations of Gaussian random projections under anisotropic covariances with d/n diverging. These bounds are obtained directly from properties of the population covariance and signal alignment, without reference to fitted parameters or post-hoc adjustment of the target risk. The data-splitting construction is shown to yield a consistent estimator whose risk tracks the inflated interpolator via standard concentration arguments independent of the specific inflation factor chosen. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear; the reference to Hastie et al. (2022) is used only to motivate the regime, while the new bounds supply independent support. The argument remains falsifiable outside any fitted values and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of highly anisotropic covariances with signal alignment to top eigenvalues and the technical assumption of diverging d/n for the Gaussian projection bounds.

free parameters (1)

inflation constant: The factor greater than one used to scale the minimum-norm interpolator; its optimal value is estimated via data splitting but is not derived from first principles in the abstract.

axioms (1)

domain assumption: Covariates are drawn from a general class of anisotropic covariance matrices with β aligned to top eigenvalues. Invoked as the condition making the situation qualitatively different and enabling the Gaussian projection bounds when d/n → ∞.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

simply scaling up (or inflating) the minimum ℓ2 norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, July 2019a. ISSN 1091-6490. doi: 10.1073/pnas.1903070116. URL http://dx.doi.org/10.1073/pnas.1903070116. Mikhail Belkin, Alexander Rakhlin, a...
[2]

Rick Durrett. Probability: Theory and Examples

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a12c999be280372b157294e72a4bbc8b-Paper-Conference.pdf. Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5 edition,
[3]

URL https://doi.org/10.1214/21-AOS2133

doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133. Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2 edition,
[4]

On Some Statistical Properties of Dynamical Systems

University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200512173. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models,
[5]

Scaling Laws for Neural Language Models

URL https://arxiv.org/abs/2001.08361. Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(169):1–16,
[6]

Just interpolate: Kernel “ridgeless” regression can generalize

ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2007.05.006. URL https://www.sciencedirect.com/science/article/pii/S0047259X07000814. Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):pp. 1329–1347,
[7]

URL https://www.jstor.org/stable/26931513

ISSN 00905364, 21688966. URL https://www.jstor.org/stable/26931513. Nicole Muecke, Enrico Reiss, Jonas Rungenhagen, and Markus Klein. Data-splitting improves statistical performance in overparameterized regimes. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelli...
[8]

Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin

URL https://arxiv.org/abs/2404.01233. Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin. Bagging regularizes, March
[9]

Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco

URL https://dspace.mit.edu/bitstream/handle/1721.1/7268/AIM-2002-003.pdf?sequence=2&isAllowed=y. Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statis...
[10]

Hanson-Wright inequality and sub-gaussian concentration

URL https://arxiv.org/abs/1306.2872. Jack Sherman and Winifred J. Morrison. Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. The Annals of Mathematical Statistics, 21(1):124–127,
[11]

On information and sufficiency,

doi: 10.1214/aoms/1177729893. URL https://doi.org/10.1214/aoms/1177729893. Michiaki Taniguchi and Volker Tresp. Averaging regularized estimators. Neural Computation, 9(5):1163–1178, 07
[12]

doi: 10.1162/neco.1997.9.5.1163

ISSN 0899-7667. doi: 10.1162/neco.1997.9.5.1163. URL https://doi.org/10.1162/neco.1997.9.5.1163. Y. L. Tong. The multivariate normal distribution. Springer-Verlag, New York,
[13]

Denny Wu and Ji Xu

URL https://proceedings.neurips.cc/paper_files/paper/2021/file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf. Denny Wu and Ji Xu. On the optimal weighted $\ell_2$ regularization in overparameterized linear regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 10...
[14]

Yuchen Zhang, John Duchi, and Martin Wainwright

URL https://proceedings.neurips.cc/paper_files/paper/2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning ...
