
arxiv: 2312.05593 · v3 · submitted 2023-12-09 · 💰 econ.EM · stat.ME

Benign Overfitting in Economic Forecasting via Noise Regularization

Pith reviewed 2026-05-24 04:48 UTC · model grok-4.3

classification 💰 econ.EM stat.ME
keywords benign overfitting · noise regularization · economic forecasting · ridgeless regression · latent factors · high-dimensional predictors · dense linear models · forecast accuracy

The pith

A ridgeless regression augmented with noise predictors matches the asymptotic forecast accuracy of an oracle that knows the true factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when both the target variable and high-dimensional predictors are generated by a small number of latent factors, the best linear forecast is a dense model rather than a sparse one. Adding predictors that contain only noise regularizes a ridgeless least-squares estimator by shrinking the eigenvalues of the sample Gram matrix, which lowers out-of-sample variance. This approach reaches the same limiting mean-squared forecast error as an oracle that knows the factors exactly, without ever estimating the factors or requiring them to be strong. In contrast, removing the noise variables through perfect selection can increase forecast error when the number of retained predictors is comparable to sample size. The result is shown both theoretically under the factor structure and empirically in U.S. inflation, international GDP growth, and equity risk premium series.

Core claim

When the outcome and the high-dimensional predictors share a low-dimensional factor structure, the population best linear predictor is dense. A ridgeless regression that deliberately augments the predictor matrix with pure noise variables attains the same asymptotic out-of-sample mean squared error as an oracle regression on the true factors. The mechanism is eigenvalue shrinkage of the design matrix, which reduces the variance term in the forecast error decomposition without any factor estimation or strong-factor assumption.
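
The estimator described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the use of standard normal noise, and the choice to draw fresh noise for the forecast period are all assumptions made here.

```python
import numpy as np

def ridgeless_fit(X, y):
    """Minimum-norm least-squares coefficients via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X) @ y

def noise_augmented_forecast(X_train, y_train, X_test, n_noise, rng):
    """Append n_noise i.i.d. N(0, 1) columns, fit ridgeless regression, and predict.

    Assumption of this sketch: the test rows receive freshly drawn noise columns.
    """
    Z_train = rng.standard_normal((X_train.shape[0], n_noise))
    Z_test = rng.standard_normal((X_test.shape[0], n_noise))
    W_train = np.hstack([X_train, Z_train])   # augmented training design
    W_test = np.hstack([X_test, Z_test])      # augmented test design
    beta_hat = ridgeless_fit(W_train, y_train)
    return W_test @ beta_hat
```

With `n_noise = 0` and more observations than predictors, the function reduces to ordinary least squares; once the total number of columns exceeds the sample size, the fit interpolates the training data.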

What carries the argument

Ridgeless regression augmented with noise predictors, which shrinks the eigenvalues of the Gram matrix and thereby controls out-of-sample variance.
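
A small numerical check of one reading of this mechanism (an illustration, not the paper's proof): appending a noise block Z to an n × p design X with p ≥ n replaces XX' by XX' + ZZ', whose eigenvalues are no smaller, so the pseudoinverse that maps y to the coefficients has a smaller spectral norm. The sizes below and the helper name `pinv_spectral_norm` are hypothetical.

```python
import numpy as np

def pinv_spectral_norm(W):
    """Largest singular value of pinv(W), i.e. 1 / smallest nonzero singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return 1.0 / s[s > 1e-12 * s.max()].min()

rng = np.random.default_rng(0)
n, p = 50, 60                               # hypothetical sizes with p >= n
X = rng.standard_normal((n, p))
Z_full = rng.standard_normal((n, 1000))     # pool of pure-noise columns

norms = []
for m in (0, 100, 1000):
    # Nested noise blocks: each design contains the previous one's columns.
    W = np.hstack([X, Z_full[:, :m]])
    norms.append(pinv_spectral_norm(W))
```

Because the noise blocks are nested, the sequence in `norms` is non-increasing by construction: adding columns can only raise the smallest eigenvalue of WW'.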

If this is right

  • Forecasts achieve oracle accuracy without estimating or even identifying the latent factors.
  • Perfect variable selection that discards noise variables can increase forecast error when the retained dimension is close to sample size.
  • The same noise-augmented procedure improves and stabilizes predictions for U.S. inflation, international GDP growth, and the equity risk premium.
  • The gain is produced by a reduction in the variance component of forecast error rather than by bias reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization may be useful in other high-dimensional economic series that exhibit approximate factor structure.
  • It offers a simple alternative to explicit factor extraction or penalized sparse methods when the goal is pure forecasting.
  • The finding raises the question of how much deliberate noise is optimal when the factor dimension is unknown.

Load-bearing premise

Both the outcome variable and the high-dimensional predictors are generated by a small number of latent factors, which forces the linear forecast model to be dense.
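
To see why the premise implies a dense model, the population coefficients can be worked out in a simplified special case. The derivation is a sketch under assumptions not taken from the paper: the per-observation model is x_t = Λ f_t + e_t and y_t = γ'f_t + u_t, with Cov(f_t) = I_K, Cov(e_t) = σ² I_p, and f_t, e_t, u_t mutually uncorrelated.

```latex
\begin{aligned}
\beta &= \operatorname{Var}(x_t)^{-1}\operatorname{Cov}(x_t, y_t)
       = (\Lambda\Lambda' + \sigma^2 I_p)^{-1}\Lambda\gamma \\
      &= \Lambda(\Lambda'\Lambda + \sigma^2 I_K)^{-1}\gamma,
\qquad\text{so}\qquad
\beta_i = \lambda_i'(\Lambda'\Lambda + \sigma^2 I_K)^{-1}\gamma .
\end{aligned}
```

Here λ_i' denotes the i-th row of Λ, and the second equality uses the identity (ΛΛ' + σ² I_p)Λ = Λ(Λ'Λ + σ² I_K). Every predictor whose loading is not orthogonal to (Λ'Λ + σ² I_K)⁻¹γ receives a nonzero coefficient, which is the sense in which the forecast model is dense rather than sparse.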

What would settle it

A Monte Carlo design in which the true factors are known and the mean squared forecast error of the noise-augmented ridgeless estimator is strictly larger than that of the oracle factor regression for large samples.
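
A Monte Carlo check of this kind might be set up as follows. The design is hypothetical: the sample sizes, loadings, unit factor coefficients, and standard normal noise are choices made for this sketch, not the paper's simulation settings.

```python
import numpy as np

def simulate_once(rng, n=100, n_test=50, k=3, p0=90, n_noise=400, sigma_u=1.0):
    """One draw from a hypothetical factor design.

    Returns (oracle test MSE, noise-augmented ridgeless test MSE).
    """
    n_all = n + n_test
    F = rng.standard_normal((n_all, k))           # true latent factors
    Lam = rng.standard_normal((p0, k))            # factor loadings
    gamma = np.ones(k)                            # factor coefficients for y
    X = F @ Lam.T + rng.standard_normal((n_all, p0))
    y = F @ gamma + sigma_u * rng.standard_normal(n_all)
    Z = rng.standard_normal((n_all, n_noise))     # pure-noise predictors
    W = np.hstack([X, Z])
    tr, te = slice(0, n), slice(n, n_all)

    # Oracle: OLS of y on the true factors.
    b_oracle = np.linalg.lstsq(F[tr], y[tr], rcond=None)[0]
    mse_oracle = np.mean((y[te] - F[te] @ b_oracle) ** 2)

    # Noise-augmented ridgeless: minimum-norm least squares on [X, Z].
    b_aug = np.linalg.pinv(W[tr]) @ y[tr]
    mse_aug = np.mean((y[te] - W[te] @ b_aug) ** 2)
    return mse_oracle, mse_aug

def monte_carlo(n_rep=200, seed=0, **kwargs):
    """Average the two test MSEs over n_rep independent draws."""
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_once(rng, **kwargs) for _ in range(n_rep)])
    return draws.mean(axis=0)                     # (oracle, noise-augmented)
```

Calling `monte_carlo()` returns the two averaged test errors. Since the claim is asymptotic, the informative comparison is how the gap between them behaves as `n`, `p0`, and `n_noise` grow together, not its value at one fixed sample size.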

Figures

Figures reproduced from arXiv: 2312.05593 by Andreas Neuhierl, Xinjie Ma, Yuan Liao, Zhentao Shi.

Figure 1. Theoretical predictive variance and squared bias (left panel) and MSE (right panel), averaged over 500 replications. The horizontal axis is the number of predictors increasing from 3 to 500, and we fix n = 100. The first p0 = min{p, 0.9n} are informative predictors, generated using a 3-factor model of strong factors. The remaining p − p0 are i.i.d. Gaussian noises. The vertical dashed line is where p equal…
Figure 2. Predictive MSE Σ_{j=1}^{50} (y_j − ŷ_j)² averaged over 50 replications as the number of predictors p increases. The vertical red dashed line indicates the number of informative predictors p0; the black dashed line indicates the sample size n = 100.
Figure 3. Predictive MSE Σ_{j=1}^{50} (y_j − ŷ_j)² averaged over 50 replications as the number of predictors p increases. The vertical red dashed line indicates the number of informative predictors p0; the vertical black dashed line indicates the sample size n = 100. The vertical blue dashed line in the last panel indicates the average p chosen by cross-validation.
Figure 4. Predictive MSE of the pseudo-OLS, CV-Ridge and CV-Lasso. For the pseudo-OLS, we set the maximum value for p as p_max = C × n√p0, and choose C using cross-validation in a range so that p_max varies from 2,000 to 15,037 (equals 0.5 × n√p0). The result shows that the MSE of pseudo-OLS starts to decrease when the total number of predictors exceeds 1250, and surpasses that of Ridge and Lasso when p = 2…
Figure 5. Out-of-sample R² for predicting the U.S. equity premium, using the dataset described by Welch and Goyal (2008) and updated on Amit Goyal's webpage. The yearly data span 1948 to 2015, with p0 = 16 original predictors. We use rolling windows of n = 17 years for one-year-horizon forecasts. The vertical axis is OOS R². The horizontal axis is plotted as log(p) and ticked using p. Regardless of p, bo…
Figure 6. Predictive MSE using 123 macroeconomic variables from McCracken and Ng (2016). The data span May 1960 to December 2019, with p0 = 123 predictors. We use rolling windows of n = 120 months for one-month-horizon forecasts. The vertical axis is Σ_n (y_{n+1} − ŷ_{n+1})², the horizontal axis is log(p), and the horizontal tick is p. Regardless of p, the PCA, CV-Lasso and CV-Ridge use the p0 macrovariables, whereas the ps…
Figure 7. Predictive MSE using 60 socio-economic and geographical characteristics from Barro and Lee (1994). The data are GDP growth rates for 90 countries. We estimate the model on a randomly selected sample of n = 45 countries and evaluate its predictions for the remaining 45 countries. We repeat this exercise 100 times. The vertical axis is Σ_n (y_{n+1} − ŷ_{n+1})², the horizontal axis is log(p), and the horizonta…
Figure 8. Predictive MSE Σ_{j=1}^{50} (y_j − ŷ_j)² averaged over 10 replications as the number of predictors p increases. The number of informative predictors is p0 = 0.5p.
Original abstract

This paper studies linear overparameterized models in economic forecasting and highlights that including noise variables (regressors with no predictive power) regularizes the estimator. We consider a setting where both the outcome variable and the high-dimensional predictors are driven by a small number of latent factors, and show that the linear forecast model is dense rather than sparse. It turns out that a ridgeless regression augmented with noise predictors attains the same asymptotic forecast accuracy as an oracle with known true factors, without estimating the factors or assuming them to be strong. The gain comes from shrinkage of the eigenvalues of the design matrix, which reduces the out-of-sample variance. In contrast, perfect variable selection that removes noise variables can worsen forecasts when the number of retained predictors is comparable to the sample size. Empirically, we apply this approach to forecasting U.S. inflation, international GDP growth, and the U.S. equity risk premium, finding that noise regularization improves and stabilizes predictive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that in economic forecasting settings where both the outcome y and high-dimensional predictors X are driven by a small number of latent factors (making the population linear projection dense rather than sparse), a ridgeless regression augmented with noise predictors (regressors with no predictive power) attains the same asymptotic out-of-sample forecast accuracy as an oracle that knows the true factors. The mechanism is eigenvalue shrinkage of the design matrix that reduces variance; this is contrasted with perfect variable selection, which can worsen performance when the number of retained predictors is comparable to sample size. The result is supported by theory under the factor model and by empirical applications to U.S. inflation, international GDP growth, and the U.S. equity risk premium.

Significance. If the central asymptotic equivalence holds, the paper supplies a practical regularization device for high-dimensional economic forecasting that avoids explicit factor estimation and does not require strong-factor assumptions. It explicitly credits the theoretical equivalence result and the empirical finding that noise augmentation improves and stabilizes predictive performance relative to selection-based alternatives.

major comments (2)
  1. [Abstract and §3] Theoretical setup: the oracle-equivalence claim rests on the population coefficient vector β being dense under X = ΛF + e and y = γ'F + u. The manuscript states this density result but supplies no explicit rate conditions on the number of added noise variables relative to n, p, or factor strength that would keep the equivalence intact when factors are weak or loadings heterogeneous; without such conditions, the eigenvalue-shrinkage benefit need not dominate selection-based alternatives.
  2. [Empirical applications] Forecasting tables for inflation, GDP, and equity premium: the reported gains in accuracy and stability are presented without standard errors, confidence bands, or robustness checks on the number of noise variables, all of which are needed to assess whether the finite-sample improvements are statistically distinguishable from the oracle benchmark.
minor comments (1)
  1. [Notation and estimator definition] The definition of the ridgeless estimator after noise augmentation would benefit from an explicit equation (e.g., the augmented design matrix and the resulting β̂) placed in the main text rather than only in an appendix.
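
One possible form of the requested equation, offered as a sketch: the symbols Z (the n × m block of pure-noise regressors), m, W, and w_new are notation assumed here, not the manuscript's, and the closed form assumes p + m ≥ n with WW' invertible.

```latex
W = [\,X \;\; Z\,] \in \mathbb{R}^{n \times (p+m)}, \qquad
\hat\beta = W^{+} y = W'(WW')^{-1} y, \qquad
\hat y_{\mathrm{new}} = w_{\mathrm{new}}'\hat\beta .
```

Here W⁺ is the Moore–Penrose pseudoinverse, so β̂ is the minimum-norm vector among all coefficient vectors that fit the training sample exactly.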

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the theoretical scope and committing to empirical enhancements where appropriate.

  1. Referee: [Abstract and §3] Theoretical setup: the oracle-equivalence claim rests on the population coefficient vector β being dense under X = ΛF + e and y = γ'F + u. The manuscript states this density result but supplies no explicit rate conditions on the number of added noise variables relative to n, p, or factor strength that would keep the equivalence intact when factors are weak or loadings heterogeneous; without such conditions, the eigenvalue-shrinkage benefit need not dominate selection-based alternatives.

    Authors: The density of β follows immediately from the factor model assumptions in Section 3 (Assumptions 1–3), which allow weak factors and heterogeneous loadings without requiring strong-factor conditions. Theorems 1–2 derive the asymptotic equivalence by showing that noise augmentation induces eigenvalue shrinkage that matches the oracle variance term, and the proofs hold under the stated rates on p/n and the factor structure; no additional rate restrictions on the number of noise variables are needed beyond those already implicit in the high-dimensional regime. We will add a clarifying paragraph in §3.2 explicitly noting that the equivalence continues to hold for weak factors provided the loadings satisfy the moment conditions in Assumption 2, thereby addressing the concern about dominance over selection methods.

    Revision: partial

  2. Referee: [Empirical applications] Forecasting tables for inflation, GDP, and equity premium: the reported gains in accuracy and stability are presented without standard errors, confidence bands, or robustness checks on the number of noise variables, all of which are needed to assess whether the finite-sample improvements are statistically distinguishable from the oracle benchmark.

    Authors: We agree that standard errors and robustness checks would strengthen the empirical section. In the revision we will (i) report bootstrap standard errors for the out-of-sample R² and MSFE differences relative to the oracle benchmark, (ii) add a new table (or appendix figure) showing results for a range of noise-variable counts around the values used in the main tables, and (iii) include Diebold–Mariano tests where feasible. These additions will allow readers to assess statistical distinguishability.

    Revision: yes
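
For reference, the Diebold–Mariano comparison mentioned in the response can be sketched as below. This is a generic textbook version under squared-error loss with a rectangular-kernel long-run variance and a normal reference distribution; the function name and these choices are assumptions of the sketch, not the authors' planned implementation.

```python
import math
import numpy as np

def diebold_mariano(e1, e2, h=1):
    """Diebold-Mariano statistic for equal squared-error loss.

    e1, e2: forecast errors of two competing methods on the same targets.
    h: forecast horizon; autocovariances up to lag h - 1 enter the variance.
    Returns (statistic, two-sided p-value under a standard normal reference).
    """
    d = np.asarray(e1, dtype=float) ** 2 - np.asarray(e2, dtype=float) ** 2
    T = d.size
    d_bar = d.mean()
    dc = d - d_bar
    # Long-run variance of the loss differential (rectangular kernel).
    # For h > 1 this estimate can be non-positive in small samples.
    lrv = dc @ dc / T
    for lag in range(1, h):
        lrv += 2.0 * (dc[lag:] @ dc[:-lag]) / T
    stat = d_bar / math.sqrt(lrv / T)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value
```

A positive statistic indicates that the first method has the larger average squared error. In rolling-window exercises with short windows, small-sample corrections may matter more than this plain version suggests.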

Circularity Check

0 steps flagged

No significant circularity; derivation is model-based asymptotic analysis


The paper derives density of the population projection coefficients from the shared latent factor structure (X = ΛF + e, y = γ'F + u) and shows asymptotic equivalence of ridgeless regression plus noise to the oracle that uses F directly. These steps are explicit mathematical results under the maintained assumptions rather than reductions by construction, fitted-parameter renamings, or load-bearing self-citations. The oracle benchmark is internal to the factor model but is not tautological; the equivalence is obtained via eigenvalue shrinkage arguments that are independent of the target risk quantity. No self-citation chains or ansatz smuggling are indicated in the provided text. The density claim follows directly from the factor loadings without redefining the target quantity in terms of itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that both outcome and predictors are generated by a small number of latent factors; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: both the outcome variable and the high-dimensional predictors are driven by a small number of latent factors.
    Explicitly stated as the setting considered in the abstract.



Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. Arora, S., N. Cohen, W. Hu, and Y. Luo (2019). Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems 32, 7413–7424.
  2. Atanasov, V., S. V. Møller, and R. Priestley (2020). Consumption fluctuations and expected returns. The Journal of Finance 75(3), 1677–1713.
  3. Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
  4. Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  5. Bai, J. and S. Ng (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74(4), 1133–1150.
  6. Bai, Z. and Y. Yin (1993). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability 21(3), 1275–1294.
  7. Ball, R. and V. V. Nikolaev (2022). On earnings and cash flows as predictors of future cash flows. Journal of Accounting and Economics 73(1), 101430.
  8. Barro, R. J. and J.-W. Lee (1994). Sources of economic growth. In Carnegie-Rochester Conference Series on Public Policy, Volume 40, pp. 1–46. Elsevier.
  9. Bekaert, G. and M. Hoerova (2014). The VIX, the variance premium and stock market volatility. Journal of Econometrics 183(2), 181–192.
  10. Belkin, M., D. Hsu, S. Ma, and S. Mandal (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854.
  11. Belkin, M., D. Hsu, and J. Xu (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science 2(4), 1167–1180.
  12. Chao, J. C. and N. R. Swanson (2022). Selecting the relevant variables for factor estimation in FAVAR models. Available at SSRN 4308280.
  13. Chava, S., M. Gallmeyer, and H. Park (2015). Credit conditions and stock return predictability. Journal of Monetary Economics 74, 117–132.
  14. Chen, X., Y. H. Cho, Y. Dou, and B. Lev (2022). Predicting future earnings changes using machine learning and detailed financial data. Journal of Accounting Research 60(2), 467–515.
  15. Chen, Y., G. W. Eaton, and B. S. Paye (2018). Micro (structure) before macro? The predictive power of aggregate illiquidity for stock returns and economic activity. Journal of Financial Economics 130(1), 48–73.
  16. Chernozhukov, V., C. Hansen, and Y. Liao (2017). A lava attack on the recovery of sums of dense and sparse signals. The Annals of Statistics 45(1), 39–76.
  17. Chinot, G., M. Löffler, and S. van de Geer (2022). On the robustness of minimum norm interpolators and regularized empirical risk minimizers. The Annals of Statistics 50(4), 2306–2333.
  18. Colacito, R., E. Ghysels, J. Meng, and W. Siwasarit (2016). Skewness in expected macro fundamentals and the predictability of equity returns: Evidence and theory. The Review of Financial Studies 29(8), 2069–2109.
  19. Connor, G. and R. A. Korajczyk (1988). Risk and return in an equilibrium APT: Application of a new test methodology. Journal of Financial Economics 21(2), 255–289.
  20. Didisheim, A., S. B. Ke, B. T. Kelly, and S. Malamud (2023). Complexity in factor pricing models. Technical report, National Bureau of Economic Research.
  21. Fairfield, P. M., R. J. Sweeney, and T. L. Yohn (1996). Accounting classification and the predictive content of earnings. Accounting Review, 337–355.
  22. Fan, J., Y. Ke, and K. Wang (2020). Factor-adjusted regularized model selection. Journal of Econometrics 216(1), 71–85.
  23. Fan, J., Z. T. Ke, Y. Liao, and A. Neuhierl (2022). Structural deep learning in conditional asset pricing. Available at SSRN 4117882.
  24. Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements (with discussion). Journal of the Royal Statistical Society, Series B 75, 603–680.
  25. Feltham, G. A. and J. A. Ohlson (1995). Valuation and clean surplus accounting for operating and financial activities. Contemporary Accounting Research 11(2), 689–731.
  26. Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2005). The generalized dynamic factor model: one-sided estimation and forecasting. Journal of the American Statistical Association 100(471), 830–840.
  27. Giannone, D., M. Lenza, and G. E. Primiceri (2021). Economic predictions with big data: The illusion of sparsity. Econometrica 89(5), 2409–2437.
  28. Giglio, S., D. Xiu, and D. Zhang (2023). Prediction when factors are weak. University of Chicago, Becker Friedman Institute for Economics Working Paper (2023-47).
  29. Goyal, A., I. Welch, and A. Zafirov (2023). A comprehensive 2021 look at the empirical performance of equity premium prediction II. Swiss Finance Institute Research Paper (21-85).
  30. Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. The Review of Financial Studies 33(5), 2223–2273.
  31. Hansen, C. and Y. Liao (2018). The factor-lasso and k-step bootstrap approach for inference in high-dimensional economic applications. Econometric Theory, 1–45.
  32. Hastie, T., A. Montanari, S. Rosset, and R. J. Tibshirani (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics 50(2), 949.
  33. He, Y. (2023). Ridge regression under dense factor augmented models. Journal of the American Statistical Association, 1–13.
  34. Hirshleifer, D., K. Hou, and S. H. Teoh (2009). Accruals, cash flows, and aggregate stock returns. Journal of Financial Economics 91(3), 389–406.
  35. Huang, D., F. Jiang, J. Tu, and G. Zhou (2015). Investor sentiment aligned: A powerful predictor of stock returns. The Review of Financial Studies 28(3), 791–837.
  36. Jondeau, E., Q. Zhang, and X. Zhu (2019). Average skewness matters. Journal of Financial Economics 134(1), 29–47.
  37. Jones, C. S. and S. Tuzel (2013). New orders and asset prices. The Review of Financial Studies 26(1), 115–157.
  38. Kelly, B. and S. Pruitt (2013). Market expectations in the cross-section of present values. The Journal of Finance 68(5), 1721–1756.
  39. Kelly, B. T., S. Malamud, and K. Zhou (2022). The virtue of complexity in return prediction. Technical report, National Bureau of Economic Research.
  40. Lee, S. and S. Lee (2023). The mean squared error of the ridgeless least squares estimator under general assumptions on regression errors. arXiv preprint arXiv:2305.12883.
  41. Marchenko, V. A. and L. A. Pastur (1967). Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik 114(4), 507–536.
  42. Martin, I. (2017). What is the expected return on the market? The Quarterly Journal of Economics 132(1), 367–433.
  43. McCracken, M. W. and S. Ng (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics 34(4), 574–589.
  44. Mei, S. and A. Montanari (2019). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics.
  45. Møller, S. V. and J. Rangvid (2015). End-of-the-year economic growth and time-varying expected returns. Journal of Financial Economics 115(1), 136–154.
  46. Ng, S. (2013). Variable selection in predictive regressions. Handbook of Economic Forecasting 2, 752–789.
  47. Nissim, D. and S. H. Penman (2001). Ratio analysis and equity valuation: From research to practice. Review of Accounting Studies 6, 109–154.
  48. Ohlson, J. A. (1995). Earnings, book values, and dividends in equity valuation. Contemporary Accounting Research 11(2), 661–687.
  49. Penman, S. H. (1998). A synthesis of equity valuation techniques and the terminal value calculation for the dividend discount model. Review of Accounting Studies 2, 303–323.
  50. Penman, S. H. and T. Sougiannis (1998). A comparison of dividend, cash flow, and earnings approaches to equity valuation. Contemporary Accounting Research 15(3), 343–383.
  51. Rapach, D. E., M. C. Ringgenberg, and G. Zhou (2016). Short interest and aggregate stock returns. Journal of Financial Economics 121(1), 46–65.
  52. So, E. C. (2013). A new approach to predicting analyst forecast errors: Do investors overweight analyst forecasts? Journal of Financial Economics 108(3), 615–640.
  53. Spiess, J., G. Imbens, and A. Venugopal (2023). Double and single descent in causal inference with an application to high-dimensional synthetic control. arXiv preprint arXiv:2305.00700.
  54. Stock, J. and M. Watson (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
  55. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288.
  56. Welch, I. and A. Goyal (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies 21(4), 1455–1508.