Shrinkage to Infinity: Reducing Test Error by Inflating the Minimum Norm Interpolator in Linear Models
Pith reviewed 2026-05-18 05:29 UTC · model grok-4.3
The pith
Inflating the minimum norm interpolator by a constant greater than one reduces generalization error in high-dimensional linear regression with anisotropic covariates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When covariates are highly anisotropic, β is aligned with the eigenvectors belonging to the top eigenvalues of the population covariance, and d/n diverges to infinity, scaling up the minimum ℓ2-norm interpolator by a constant greater than one reduces its generalization error, in contrast to traditional ridge regularization, which shrinks it.
What carries the argument
The inflated minimum l2-norm interpolator, formed by multiplying the minimum-norm least-squares solution by a scalar strictly larger than one, which rebalances bias and variance under anisotropic population covariance as dimension grows faster than sample size.
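A minimal sketch of the object described above, assuming NumPy and a design matrix with more columns than rows and full row rank. The helper names `min_norm_interpolator` and `inflate` are illustrative, not taken from the paper. It also shows a simple consequence: once inflated by c, the estimator no longer interpolates, and its training residual is exactly (c - 1) y.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X b = y (assumes d >= n and full row rank)."""
    return np.linalg.pinv(X) @ y

def inflate(beta_hat, c):
    """Scale the interpolator by a constant c (c > 1 is the regime of interest)."""
    return c * beta_hat

# Toy isotropic data, used only to illustrate the definitions.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200))   # n = 20 samples, d = 200 features
y = rng.standard_normal(20)

beta_hat = min_norm_interpolator(X, y)
beta_infl = inflate(beta_hat, 1.5)

# The unscaled solution fits the training data exactly ...
print(np.allclose(X @ beta_hat, y))
# ... while the inflated one overshoots each label by (c - 1) * y.
print(np.allclose(X @ beta_infl - y, 0.5 * y))
```

Any benefit of inflation is therefore a statement about test error only; on the training set, inflation strictly worsens the fit.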
If this is right
- In the diverging d/n anisotropic regime, ridge shrinkage is suboptimal and inflation can be preferable.
- Data splitting produces consistent estimators whose generalization error matches that of the optimally inflated min-norm interpolator.
- The improvement is proved for a broad class of anisotropic covariances by matching bounds on Gaussian projection expectations.
- Empirical simulations confirm that moderate inflation lowers test error relative to both the unscaled interpolator and ridge.
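The data-splitting idea in the list above can be sketched as follows. This is a hypothetical construction, not the estimator defined in the paper: the min-norm solution is fitted on one half of the sample, and the scale is chosen by one-dimensional least squares on the held-out half. The function name `split_inflation_factor`, the single-spike covariance, and all numeric settings are assumptions made for illustration.

```python
import numpy as np

def split_inflation_factor(X, y, rng):
    """Fit min-norm on one half, pick the scale c by least squares on the other half."""
    n = X.shape[0]
    perm = rng.permutation(n)
    fit_idx, hold_idx = perm[: n // 2], perm[n // 2:]
    X_fit, y_fit = X[fit_idx], y[fit_idx]
    # Min-norm interpolator on the fitting half: X^T (X X^T)^{-1} y.
    beta_hat = X_fit.T @ np.linalg.solve(X_fit @ X_fit.T, y_fit)
    # Held-out predictions; c_hat minimizes sum_i (y_i - c * pred_i)^2 over c.
    pred = X[hold_idx] @ beta_hat
    c_hat = float(pred @ y[hold_idx] / (pred @ pred))
    return beta_hat, c_hat

# Toy anisotropic data (assumed): one strong direction carrying the signal, flat tail.
rng = np.random.default_rng(1)
n, d = 100, 2000
eigvals = np.ones(d)
eigvals[0] = 100.0
beta = np.zeros(d)
beta[0] = 1.0
X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_hat, c_hat = split_inflation_factor(X, y, rng)
print(f"held-out inflation factor: {c_hat:.3f}")
```

The held-out ratio targets the scale that minimizes the population risk of the half-sample interpolator, which is why no knowledge of Σ or β is needed in this sketch.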
Where Pith is reading between the lines
- The result suggests that the default bias toward shrinkage may need re-examination whenever features exhibit heterogeneous scales and signals concentrate on the strongest ones.
- Analogous inflation effects could appear in other overparameterized linear or kernel estimators under similar alignment conditions.
- Developing fully data-driven methods to choose the inflation factor without sample splitting would increase practical applicability.
Load-bearing premise
The covariates must have strongly anisotropic covariance with the signal aligned to the largest-variance directions and the dimension-to-sample ratio must diverge to infinity.
What would settle it
Generate data from a covariance whose eigenvalues decay rapidly, align beta with the leading eigenvectors, draw n samples with d much larger than n, fit the min-norm solution, multiply its coefficients by 1.2, and check whether the resulting test error on fresh data is strictly smaller than that of the unscaled min-norm solution.
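The check described above can be sketched in a few lines of NumPy. Several simplifications are assumptions of this sketch rather than the paper's setup: the covariance is diagonal with a single spike followed by a flat tail (a crude stand-in for rapid eigenvalue decay), β is the first coordinate vector, and the excess test risk is computed in closed form from the known Σ instead of from fresh samples. The function name `simulate` and every numeric default are hypothetical.

```python
import numpy as np

def simulate(n=50, d=2000, spike=100.0, noise_sd=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Diagonal population covariance: one large eigenvalue, flat tail (assumed design).
    eigvals = np.ones(d)
    eigvals[0] = spike
    # Signal aligned with the leading eigenvector.
    beta = np.zeros(d)
    beta[0] = 1.0
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    # Minimum l2-norm interpolator: X^T (X X^T)^{-1} y.
    beta_hat = X.T @ np.linalg.solve(X @ X.T, y)

    def excess_risk(c):
        # (c * beta_hat - beta)^T Sigma (c * beta_hat - beta) for diagonal Sigma.
        diff = c * beta_hat - beta
        return float(diff @ (eigvals * diff))

    # Oracle scale minimizing the quadratic excess risk (uses the true beta and Sigma).
    c_star = float((eigvals * beta_hat) @ beta / ((eigvals * beta_hat) @ beta_hat))
    return X, y, beta_hat, excess_risk, c_star

X, y, beta_hat, excess_risk, c_star = simulate()
print("interpolates training data:", np.allclose(X @ beta_hat, y))
print("excess risk at c = 1.0:", excess_risk(1.0))
print("excess risk at c = 1.2:", excess_risk(1.2))
print("oracle inflation factor:", c_star)
```

Because the excess risk is a convex quadratic in c, inflation by 1.2 helps exactly when the oracle factor exceeds 1.1. Repeating the run with an isotropic covariance (`spike=1.0`) is a natural control.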
Original abstract
Hastie et al. (2022) found that ridge regularization is essential in high dimensional linear regression $y=\beta^Tx + \epsilon$ with isotropic co-variates $x\in \mathbb{R}^d$ and $n$ samples at fixed $d/n$. However, Hastie et al. (2022) also notes that when the co-variates are anisotropic and $\beta$ is aligned with the top eigenvalues of population covariance, the "situation is qualitatively different." In the present article, we make precise this observation for linear regression with highly anisotropic covariances and diverging $d/n$. We find (both theoretically and empirically) that simply scaling up (or inflating) the minimum $\ell_2$ norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions. Moreover, we use a data-splitting technique to produce consistent estimators that achieve generalization error comparable to that of the optimally inflated minimum-norm interpolator. Our proof relies on matching upper and lower bounds for expectations of Gaussian random projections for a general class of anisotropic covariance matrices when $d/n\rightarrow \infty$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies linear regression in the high-dimensional regime with anisotropic covariates and signal aligned to the leading principal components of the covariance. It argues that the minimum-norm least squares interpolator underestimates the coefficients in the strong directions, and that multiplying it by a constant inflation factor greater than one yields lower test error. Matching upper and lower bounds are provided for the relevant Gaussian projection expectations under general anisotropic Σ as d/n diverges, and a data-splitting method is introduced to consistently estimate the optimal inflation factor without knowledge of the population parameters.
Significance. This finding is significant because it identifies a regime where the usual bias-variance trade-off via shrinkage is reversed, suggesting that 'shrinkage to infinity' can be beneficial. The theoretical bounds and the construction of a practical estimator strengthen the result. If the bounds are as tight as claimed and the empirical results hold, this could shift perspectives on regularization in overparameterized anisotropic models.
major comments (1)
- [§4.1, Eq. (12)] The derivation of the risk for the inflated interpolator from the projection bounds is sketched, but the step showing that the optimal c > 1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm that it is not an artifact of the asymptotic approximation.
minor comments (3)
- [Abstract] The phrase 'shrinkage to infinity' is evocative but could be clarified in the introduction to avoid confusion with standard shrinkage estimators.
- [Notation] The definition of the population covariance Σ and its eigenvalue alignment with β should be stated more explicitly early on.
- [Figure 1] The caption could include the specific value of d/n used in the simulation for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of our manuscript and for the constructive comment. We address the point raised below and have revised the manuscript accordingly to improve clarity.
- Referee: [§4.1, Eq. (12)] The derivation of the risk for the inflated interpolator from the projection bounds is sketched, but the step showing that the optimal c > 1 follows from the lower bound being less than 1 would benefit from an expanded calculation to confirm that it is not an artifact of the asymptotic approximation.
Authors: We agree that expanding this step will strengthen the presentation. In the revised manuscript we insert a detailed calculation immediately after the statement of the matching bounds. Let R(c) denote the asymptotic risk of the c-inflated estimator. From the lower bound on the Gaussian projection expectation we obtain R'(1) < 0 whenever the lower bound is strictly less than 1, which is guaranteed by the conditions on the eigenvalue decay and the alignment of β. We then show that R(c) is convex for c > 0, so the negative derivative at c = 1 implies that the minimizer lies strictly above 1. The argument uses only the explicit form of the bounds already proved in the section and does not rely on further asymptotic approximations beyond those already stated. The expanded derivation appears in the new version of §4.1.
Revision: yes
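The convexity step in this response can be illustrated with the generic excess-risk identity for a rescaled estimator. The symbols A, B, C and c^* below are introduced here for illustration; this is a sketch of the standard quadratic argument, not a reproduction of the paper's Eq. (12) or of how its projection bounds control A and B.

```latex
\begin{align*}
R(c) &= \mathbb{E}\big[(c\hat\beta - \beta)^{T}\Sigma\,(c\hat\beta - \beta)\big]
      = c^{2}A - 2cB + C, \\
A &= \mathbb{E}\big[\hat\beta^{T}\Sigma\hat\beta\big], \qquad
B = \mathbb{E}\big[\hat\beta^{T}\Sigma\beta\big], \qquad
C = \beta^{T}\Sigma\beta, \\
R'(c) &= 2cA - 2B, \qquad R'(1) = 2(A - B), \qquad R''(c) = 2A \ge 0, \\
c^{*} &= \frac{B}{A} \quad (A > 0), \qquad
c^{*} > 1 \iff B > A \iff R'(1) < 0 .
\end{align*}
```

Under this reading, showing that inflation helps reduces to showing A < B, that is, that the interpolator's Σ-weighted squared norm is smaller than its Σ-weighted inner product with β.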
Circularity Check
No significant circularity; derivation is self-contained
The paper derives its central result on inflation of the minimum-norm interpolator from matching upper and lower bounds on expectations of Gaussian random projections under anisotropic covariances with d/n diverging. These bounds are obtained directly from properties of the population covariance and signal alignment, without reference to fitted parameters or post-hoc adjustment of the target risk. The data-splitting construction is shown to yield a consistent estimator whose risk tracks the inflated interpolator via standard concentration arguments independent of the specific inflation factor chosen. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear; the reference to Hastie et al. (2022) is used only to motivate the regime, while the new bounds supply independent support. The argument remains falsifiable outside any fitted values and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- inflation constant
axioms (1)
- domain assumption: Covariates are drawn from a general class of anisotropic covariance matrices with β aligned to top eigenvalues
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "simply scaling up (or inflating) the minimum ℓ2 norm interpolator by a constant greater than one can improve the generalization error. This is in sharp contrast to traditional regularization/shrinkage prescriptions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Proceedings of the National Academy of Sciences , publisher=
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, July 2019a. ISSN 1091-6490. doi: 10.1073/pnas.1903070116. URL http://dx.doi.org/10.1073/pnas.1903070116. Mikhail Belkin, Alexander Rakhlin, a...
-
[2]
Rick Durrett. Probability: Theory and Examples
URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a12c999be280372b157294e72a4bbc8b-Paper-Conference.pdf. Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5 edition,
-
[3]
URL https://doi.org/10.1214/21-AOS2133
doi: 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133. Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2 edition,
-
[4]
On Some Statistical Properties of Dynamical Systems
University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200512173. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models,
-
[5]
Scaling Laws for Neural Language Models
URL https://arxiv.org/abs/2001.08361. Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(169):1–16,
-
[6]
ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2007.05.006. URL https://www.sciencedirect.com/science/article/pii/S0047259X07000814. Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):pp. 1329–1347,
-
[7]
URL https://www.jstor.org/stable/26931513
ISSN 00905364, 21688966. URL https://www.jstor.org/stable/26931513. Nicole Muecke, Enrico Reiss, Jonas Rungenhagen, and Markus Klein. Data-splitting improves statistical performance in overparameterized regimes. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelli...
-
[8]
Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin
URL https://arxiv.org/abs/2404.01233. Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Alex Rakhlin. Bagging regularizes, March
-
[9]
Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco
URL https://dspace.mit.edu/bitstream/handle/1721.1/7268/AIM-2002-003.pdf?sequence=2&isAllowed=y. Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statis...
-
[10]
Hanson-Wright inequality and sub-gaussian concentration
URL https://arxiv.org/abs/1306.2872. Jack Sherman and Winifred J. Morrison. Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. The Annals of Mathematical Statistics, 21(1):124–127,
-
[11]
On information and sufficiency,
doi: 10.1214/aoms/1177729893. URL https://doi.org/10.1214/aoms/1177729893. Michiaki Taniguchi and Volker Tresp. Averaging regularized estimators. Neural Computation, 9(5):1163–1178, 07
-
[12]
doi: 10.1162/neco.1997.9.5.1163
ISSN 0899-7667. doi: 10.1162/neco.1997.9.5.1163. URL https://doi.org/10.1162/neco.1997.9.5.1163. Y. L. Tong. The multivariate normal distribution. Springer-Verlag, New York,
-
[13]
URL https://proceedings.neurips.cc/paper_files/paper/2021/file/caaa29eab72b231b0af62fbdff89bfce-Paper.pdf. Denny Wu and Ji Xu. On the optimal weighted $\ell_2$ regularization in overparameterized linear regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 10...
-
[14]
Yuchen Zhang, John Duchi, and Martin Wainwright
URL https://proceedings.neurips.cc/paper_files/paper/2020/file/72e6d3238361fe70f22fb0ac624a7072-Paper.pdf. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning ...