Missingness-Adaptive Factor Identification in High-Dimensional Data
Pith reviewed 2026-05-10 03:31 UTC · model grok-4.3
The pith
The Missingness-Adaptive Thresholding Estimator determines the number of factors in high-dimensional data with missing observations without requiring imputation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Missingness-Adaptive Thresholding Estimator (MATE) consistently estimates the number of identifiable factors in high-dimensional factor models with incomplete observations, accommodating both homogeneous and heterogeneous missingness without imputation and without restrictive assumptions on factor strength.
What carries the argument
The Missingness-Adaptive Thresholding Estimator (MATE), which selects the number of factors by applying a data-driven threshold adjusted for the observed missingness.
Load-bearing premise
The data must satisfy structural conditions that make certain factors identifiable despite the missing entries and allow the thresholding to separate signal from noise.
What would settle it
Observing that MATE persistently selects an incorrect number of factors on data with a known factor structure and a high missingness rate, in a setting that satisfies the theorem's conditions, would contradict the consistency result.
Original abstract
Determining the number of factors in high-dimensional factor models remains a fundamental challenge, particularly when data are incomplete. This paper introduces the concept of identifiable factors, those that can be reliably recovered despite missing observations, and proposes the Missingness-Adaptive Thresholding Estimator (MATE). To our knowledge, MATE is the first missingness-adaptive framework for factor number determination that accommodates both homogeneous and heterogeneous missingness without imposing restrictive assumptions on factor strength. Notably, it operates without data imputation, circumventing the computational burden associated with most existing approaches. We establish a rigorous theoretical foundation for MATE, proving its consistency under a range of structural conditions. Extensive simulations and real-world applications demonstrate that MATE consistently outperforms state-of-the-art methods, exhibiting superior robustness in settings with high missingness rates and weak factor signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Missingness-Adaptive Thresholding Estimator (MATE) for determining the number of identifiable factors in high-dimensional factor models under missing data. MATE applies adaptive thresholding directly to the observed Gram matrix entries and is claimed to handle both homogeneous and heterogeneous missingness without imputation or restrictive assumptions on factor strength. Consistency is established in Theorem 3.1 under conditions including an eigenvalue gap, bounded moments, and a positive lower bound on per-variable observation probabilities (which may depend on the missingness pattern). Simulations at missingness rates up to 70% and real-data examples are reported to show outperformance relative to existing methods.
Significance. If the consistency result in Theorem 3.1 holds under the stated conditions, the work provides a computationally lightweight, imputation-free approach to factor-number selection that adapts to the observed missingness pattern. This addresses a practical need in high-dimensional settings where complete-data methods fail and imputation is costly. The explicit separation of identifiable factors from those lost to missingness is a useful conceptual contribution.
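The mechanism described in the summary (thresholding applied to quantities built only from observed entries, with no imputation) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the simulated panel, the pairwise-observed covariance, and in particular the threshold rule and the constant `c` are placeholders, not the estimator or tuning defined in the paper.

```python
import numpy as np


def simulate_panel(n=2000, p=40, r=3, obs_prob=0.8, seed=0):
    """Simulate X = F L^T + E with entries observed independently (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    factors = rng.standard_normal((n, r))
    loadings = rng.standard_normal((p, r))
    noise = rng.standard_normal((n, p))
    x = factors @ loadings.T + noise
    mask = rng.random((n, p)) < obs_prob  # True where an entry is observed
    return x, mask


def pairwise_covariance(x, mask):
    """Covariance from jointly observed pairs only; missing entries are never imputed."""
    m = mask.astype(float)
    counts = np.maximum(m.sum(axis=0), 1.0)
    means = np.where(mask, x, 0.0).sum(axis=0) / counts
    # Centre observed entries; unobserved entries contribute nothing to the sums.
    xc = np.where(mask, x - means, 0.0)
    # Number of rows in which both variables of a pair are observed.
    pair_counts = np.maximum(m.T @ m, 1.0)
    return (xc.T @ xc) / pair_counts, pair_counts


def count_factors(x, mask, c=4.0):
    """Count eigenvalues above a placeholder threshold (NOT the paper's rule)."""
    p = x.shape[1]
    cov, pair_counts = pairwise_covariance(x, mask)
    eigvals = np.linalg.eigvalsh(cov)  # ascending order
    n_eff = pair_counts.min()          # smallest joint sample size across pairs
    noise_level = np.median(eigvals)   # rough location of the noise bulk
    # Assumed threshold: the bulk location inflated by a term that grows as
    # the jointly observed sample size shrinks.
    threshold = c * noise_level * (1.0 + np.sqrt(p / n_eff)) ** 2
    return int(np.sum(eigvals > threshold))


if __name__ == "__main__":
    x, mask = simulate_panel()
    print("estimated number of factors:", count_factors(x, mask))
```

The threshold in this sketch depends on the mask only through the smallest pairwise observation count, which is one simple way to make a cutoff adapt to missingness. The paper's threshold may be constructed quite differently.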
minor comments (3)
- [Introduction] The literature review would benefit from a concise table or paragraph explicitly contrasting MATE with the closest prior estimators for factor selection under missingness (e.g., those based on imputed PCA or EM-type procedures).
- [Simulations] In the simulation section, the precise construction of the heterogeneous missingness mechanism (e.g., how the per-variable probabilities are drawn and whether they are fixed or random) should be stated more explicitly to facilitate exact replication.
- [Real-data applications] Figure captions for the real-data examples could include the estimated number of factors returned by each comparator method for direct visual comparison.
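The replication concern in the simulations comment can be made concrete with a short Python sketch of the two choices the referee mentions: per-variable observation probabilities that are drawn at random versus supplied as fixed values. The function names and the Uniform(low, high) draw are assumptions for illustration; the manuscript's actual mechanism is not specified in this report.

```python
import numpy as np


def homogeneous_mask(n, p, obs_prob, rng):
    """Every entry is observed independently with the same probability."""
    return rng.random((n, p)) < obs_prob


def heterogeneous_mask(n, p, rng, low=0.3, high=0.9, probs=None):
    """Each variable (column) has its own observation probability.

    If `probs` is None, the probabilities are drawn once from Uniform(low, high)
    (random design, an assumed choice). Passing `probs` keeps them fixed across
    replications (fixed design).
    """
    if probs is None:
        probs = rng.uniform(low, high, size=p)
    probs = np.asarray(probs, dtype=float)
    mask = rng.random((n, p)) < probs  # broadcasts probs across rows
    return mask, probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask, probs = heterogeneous_mask(1000, 5, rng)
    print("target probabilities:", np.round(probs, 2))
    print("observed fractions:  ", np.round(mask.mean(axis=0), 2))
```

Reporting which of the two designs was used, together with the seed and the range of probabilities, would be enough for an exact rerun of a simulation built this way.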
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work on the Missingness-Adaptive Thresholding Estimator (MATE) and for recommending minor revision. The recognition of MATE's imputation-free approach, consistency under the stated conditions, and practical utility in high-dimensional settings with missing data is appreciated. No specific major comments were provided in the report.
Circularity Check
No significant circularity detected
The derivation chain for MATE is self-contained. The estimator is defined directly via adaptive thresholding on the observed Gram matrix entries without reducing to a fitted parameter renamed as a prediction. Theorem 3.1 states consistency under explicitly enumerated conditions (eigenvalue gap, moment bounds, per-variable observation probability bounded away from zero) that are independent of the target result and do not incorporate the estimator's output by construction. No self-citation chain is invoked to justify uniqueness or the core ansatz; the proof proceeds from standard concentration inequalities applied to the missingness-adjusted matrix. Simulations and real-data examples serve as external validation rather than internal tautologies. The central claim therefore rests on independent mathematical content rather than definitional equivalence or load-bearing self-reference.