Enhancing Signal Proportion Estimation Through Leveraging Arbitrary Covariance Structures

Jingtian Bai; Xinge Jessie Jeng

arxiv: 2507.11922 · v2 · pith:P5THBJ7Snew · submitted 2025-07-16 · 🧮 math.ST · stat.ME · stat.ML · stat.TH


Pith reviewed 2026-05-19 04:59 UTC · model grok-4.3

classification 🧮 math.ST · stat.ME · stat.ML · stat.TH

keywords signal proportion estimation · covariance dependence · principal factor approximation · sparsity · confidence bounds · statistical consistency · dependence structures


The pith

A new estimator for signal proportions incorporates arbitrary covariance dependence to improve accuracy and detect weaker signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a signal proportion estimator that uses information about arbitrary covariances among variables instead of assuming independence. It extends earlier lower-confidence-bound methods by adding a principal factor approximation step to capture dependence. Theoretical results compare consistency conditions before and after this adjustment, showing when dependence information helps. Simulations then test the estimator across many sparsity levels and dependence patterns, where it yields higher accuracy than prior approaches. The work therefore supplies both a practical tool and guidance on when dependence adjustment is worthwhile.

Core claim

By folding the principal factor approximation into the estimation of signal proportions, the method produces tighter and more reliable lower bounds on the proportion of true signals while remaining consistent under a broader range of sparsity and dependence conditions than independence-based estimators.

What carries the argument

Principal factor approximation integrated into the signal-proportion lower-bound procedure to account for general covariance dependence.
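To make the mechanism concrete, here is a minimal sketch of a principal factor adjustment in the general form described by Fan, Han, and Gu (2012). It assumes the covariance matrix `sigma` is known and that the mean vector is sparse. The function name `principal_factor_adjust`, the plain least-squares factor estimate, and the user-supplied `k` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def principal_factor_adjust(z, sigma, k):
    """Remove the top-k principal factors of `sigma` from test statistics `z`.

    Assumes z ~ N(mu, sigma) with sparse mu and k >= 1.
    Returns the adjusted statistics (rescaled to unit residual variance)
    and the p x k loading matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(sigma)              # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]                 # indices of the k largest
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])  # columns sqrt(lambda_i) * gamma_i

    # Least-squares estimate of the latent factors (an illustrative choice;
    # a robust regression may be preferable when signals are not very sparse).
    w_hat, *_ = np.linalg.lstsq(loadings, z, rcond=None)

    # Diagonal of the residual covariance sigma - B B^T, floored for stability.
    resid_var = np.diag(sigma) - np.sum(loadings ** 2, axis=1)
    resid_var = np.clip(resid_var, 1e-8, None)

    z_adj = (z - loadings @ w_hat) / np.sqrt(resid_var)
    return z_adj, loadings
```

The adjusted statistics `z_adj` would then be passed to a proportion estimator in place of the raw `z`. How well this works depends on how much dependence remains after the first `k` factors are removed.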

If this is right

The estimator remains consistent for a wider set of dependence structures than independence-assuming methods.
Weaker signals become detectable because dependence information sharpens the lower bounds.
Performance gains hold across low, moderate, and high sparsity regimes in simulations.
Theoretical comparisons directly quantify the improvement from adding the dependence adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on real high-dimensional data such as gene-expression matrices where dependence is known to exist.
One could examine whether the same principal-factor step can be combined with other proportion estimators beyond the original lower-bound framework.
If the approximation quality can be monitored, the method might include a diagnostic for when the dependence adjustment is safe to apply.

Load-bearing premise

The principal factor approximation recovers the dependence structure accurately enough that it does not bias the extended confidence bounds.
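One way to state this premise formally uses the standard principal factor decomposition. The notation below is an assumption, chosen to follow Fan, Han, and Gu (2012) rather than taken from the paper itself.

```latex
\begin{align*}
\Sigma &= \sum_{i=1}^{p} \lambda_i \gamma_i \gamma_i^{\top},
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \\
Z &= \mu + B W + K, \qquad
  B = \bigl(\sqrt{\lambda_1}\,\gamma_1, \dots, \sqrt{\lambda_k}\,\gamma_k\bigr), \quad
  W \sim N(0, I_k), \quad K \sim N(0, A), \\
A &= \Sigma - B B^{\top} = \sum_{i=k+1}^{p} \lambda_i \gamma_i \gamma_i^{\top},
  \qquad
  \lVert A \rVert_F = \Bigl(\sum_{i=k+1}^{p} \lambda_i^{2}\Bigr)^{1/2}.
\end{align*}
```

Here $W$ and $K$ are independent. In this notation the premise says that, once $BW$ is removed, the remaining noise $K$ is only weakly dependent, so the off-diagonal part of $A$ is small. If $\lVert A \rVert_F$ decays slowly in $k$, the adjusted bounds would plausibly be most at risk.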

What would settle it

A controlled simulation in which the true covariance matrix is known but the principal-factor step produces a poor approximation, resulting in coverage failure or inflated error for the proportion estimator.
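A first ingredient of such a check can be sketched as follows: measure how much off-diagonal covariance survives after the top `k` principal factors are removed, for one structure where the approximation is nearly exact and one where it is poor. The helper names and the two example covariances are hypothetical choices for illustration. This probes only the approximation quality, not the coverage of the proportion estimator itself.

```python
import numpy as np

def residual_offdiag_ratio(sigma, k):
    """Off-diagonal Frobenius mass left after removing the top-k principal
    factors, relative to the off-diagonal mass of `sigma` itself."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1][:k]
    low_rank = (eigvecs[:, order] * eigvals[order]) @ eigvecs[:, order].T
    resid = sigma - low_rank

    def offdiag(m):
        return m - np.diag(np.diag(m))

    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(offdiag(resid)) / np.linalg.norm(offdiag(sigma))

def equicorrelated(p, rho):
    """One-factor structure: a single factor captures nearly all dependence."""
    return np.full((p, p), rho) + (1 - rho) * np.eye(p)

def ar1(p, rho):
    """Autoregressive structure: eigenvalues decay slowly, so a few factors
    leave substantial off-diagonal mass behind."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

if __name__ == "__main__":
    print("equicorrelated:", residual_offdiag_ratio(equicorrelated(200, 0.5), k=1))
    print("AR(1):         ", residual_offdiag_ratio(ar1(200, 0.9), k=1))
```

A full version of the test would then run the estimator on data drawn from the poorly approximated covariance and record how often the lower bound exceeds the true proportion.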

Figures

Figures reproduced from arXiv: 2507.11922 by Jingtian Bai, Xinge Jessie Jeng.

**Figure 2.** Boxplots of $\hat{\pi}/\pi$ for the (i) Gene Network case across four methods: Traditional ($\hat{\pi}_\lambda$), Adaptive ($\hat{\pi}(x)$), Adjust-L ($\hat{\pi}_L(\widehat{z})$), and Adjust-A ($\hat{\pi}_A(\widehat{z})$). The top row has $\pi = 0.02$, while the bottom row has $\pi = 0.1$. The three columns correspond to signal intensity levels 1, 2, and 3, respectively.

**Figure 3.** Boxplots of $\hat{\pi}/\pi$ for the (ii) SNP LD case across four methods. The notations and row/column assignments follow those in …

**Figure 4.** Boxplots of $\hat{\pi}/\pi$ for the (iii) Factor Model case across four methods. The notations and row/column assignments follow those in …

**Figure 5.** Boxplots of $\hat{\pi}/\pi$ for the (iv) Block Model case across four methods. The notations and row/column assignments follow those in …

**Figure 6.** Boxplots of $\hat{\pi}/\pi$ for the (v) Autocorrelation case across four methods. The notations and row/column assignments follow those in …

**Figure 7.** Boxplots of $\hat{\pi}/\pi$ for the (vi) Small Blocks case across four methods. The notations and row/column assignments follow those in …

Original abstract

Accurately estimating the proportion of true signals among a large number of variables is crucial for enhancing the precision and reliability of scientific research. Traditional signal proportion estimators often assume independence among variables and specific signal sparsity conditions, limiting their applicability in real-world scenarios where such assumptions may not hold. This paper introduces a novel signal proportion estimator that leverages arbitrary covariance dependence information among variables, thereby improving performance across a wide range of sparsity levels and dependence structures. Building on previous work that provides lower confidence bounds for signal proportions, we extend this approach by incorporating the principal factor approximation procedure to account for variable dependence. Our theoretical insights offer a deeper understanding of how signal sparsity, signal intensity, and covariance dependence interact. By comparing the conditions for estimation consistency before and after dependence adjustment, we highlight the advantages of integrating dependence information across different contexts. This theoretical foundation not only validates the effectiveness of the new estimator but also guides its practical application, ensuring reliable use in diverse scenarios. Through extensive simulations, we demonstrate that our method outperforms state-of-the-art estimators in both estimation accuracy and the detection of weaker signals that might otherwise go undetected.
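For orientation, the kind of lower confidence bound that the abstract builds on can be sketched in the spirit of Meinshausen and Rice (2006). This is the independence-based baseline, not the paper's dependence-adjusted estimator. The sketch assumes independent p-values that are uniform under the null. The bounding function $\delta(t) = \sqrt{t(1-t)}$, the Monte Carlo calibration, and all function names are illustrative assumptions.

```python
import numpy as np

_EPS = 1e-12  # keeps p-values away from 0 and 1 to avoid division by zero

def _sup_stat(pvals):
    """sup over observed p-values of (ECDF(t) - t) / sqrt(t (1 - t))."""
    t = np.clip(np.sort(pvals), _EPS, 1 - _EPS)
    ecdf = np.arange(1, len(t) + 1) / len(t)
    return np.max((ecdf - t) / np.sqrt(t * (1 - t)))

def calibrate_bound(m, alpha=0.1, n_rep=200, rng=None):
    """Monte Carlo (1 - alpha) quantile of the sup statistic under a
    global null of m independent uniform p-values."""
    rng = np.random.default_rng() if rng is None else rng
    stats = [_sup_stat(rng.uniform(size=m)) for _ in range(n_rep)]
    return float(np.quantile(stats, 1 - alpha))

def lower_bound_proportion(pvals, c):
    """max over observed t of (ECDF(t) - t - c * delta(t)) / (1 - t), floored at 0."""
    t = np.clip(np.sort(pvals), _EPS, 1 - _EPS)
    ecdf = np.arange(1, len(t) + 1) / len(t)
    delta = np.sqrt(t * (1 - t))
    vals = (ecdf - t - c * delta) / (1 - t)
    return float(max(0.0, vals.max()))
```

Under dependence the null calibration above is no longer valid, which is the gap a dependence adjustment is meant to close.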

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Extends lower-bound signal proportion estimators to arbitrary covariances via principal factor approximation, with simulation gains, but residuals from the approximation could affect the bounds.


The main takeaway is an extension of lower confidence bound estimators for signal proportions that now incorporates arbitrary covariances through principal factor approximation, with some simulation evidence of better performance. The paper does well by laying out how the consistency conditions change once dependence is accounted for, which shows the benefits in different sparsity and intensity regimes. The simulations also cover a broad set of dependence structures and demonstrate gains in accuracy and in detecting weaker signals.

A soft spot is the reliance on the principal factor approximation to capture the full dependence. For arbitrary covariances, residuals after extracting the factors could still influence the variance estimator used in the bounds. If those residuals are not negligible, the extension might not hold as cleanly as claimed, though the simulations may provide some reassurance on this.

This kind of work is useful for researchers working on multiple testing procedures that must deal with correlated variables, such as in genomics or neuroimaging. A reader looking for methods that relax independence assumptions while keeping theoretical grounding would benefit from both the theoretical insights and the empirical results. I recommend sending it for peer review so the derivations and any unshown steps in the approximation can be examined closely.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a novel estimator for the proportion of true signals among many variables that incorporates arbitrary covariance dependence structures. It extends prior lower-confidence-bound methods by applying a principal factor approximation to account for variable dependence, provides theoretical comparisons of consistency conditions before versus after this adjustment, and reports simulation results claiming superior accuracy and weaker-signal detection across sparsity levels and dependence structures.

Significance. If the central claims hold, the work would be significant for applications with correlated high-dimensional data (e.g., genomics or finance), where independence assumptions fail. The explicit pre/post-adjustment consistency comparison supplies useful guidance on when dependence information helps, and the breadth of the simulation study offers empirical grounding. Credit is due for attempting to relax a restrictive assumption while retaining a theoretical consistency framework.

major comments (2)

[§3] (Theoretical extension of lower confidence bounds) The central claim that the principal-factor-adjusted estimator preserves or improves consistency for arbitrary covariances rests on the unverified premise that the low-rank factor model plus diagonal noise fully captures the dependence structure. For truly arbitrary covariances the residual matrix after factor extraction can retain off-diagonal mass comparable to signal intensity; any such residual directly perturbs the variance estimator inside the bound. This is load-bearing for the pre- versus post-adjustment consistency comparison and requires either an explicit residual bound or a counter-example showing when the approximation fails.
[Simulation section] (Results supporting outperformance) The reported gains in accuracy and weaker-signal detection are used to validate the theoretical extension, yet the manuscript does not specify data-exclusion rules, fitting choices for the factor model, or how the covariance matrices were generated to ensure they are truly arbitrary. Without these details the simulation evidence cannot be assessed as independent confirmation of the approximation's validity.

minor comments (2)

[§2] Notation for the principal factor approximation and the resulting variance estimator should be introduced with an explicit equation number and contrasted with the independence case to improve readability.
[Abstract and §4] The abstract states the method 'outperforms state-of-the-art estimators'; the manuscript should add a brief table or paragraph listing the exact competing methods and the precise performance metric (e.g., MSE or coverage) used for each comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.


Referee: [§3] (Theoretical extension of lower confidence bounds) The central claim that the principal-factor-adjusted estimator preserves or improves consistency for arbitrary covariances rests on the unverified premise that the low-rank factor model plus diagonal noise fully captures the dependence structure. For truly arbitrary covariances the residual matrix after factor extraction can retain off-diagonal mass comparable to signal intensity; any such residual directly perturbs the variance estimator inside the bound. This is load-bearing for the pre- versus post-adjustment consistency comparison and requires either an explicit residual bound or a counter-example showing when the approximation fails.

Authors: We appreciate the referee's observation regarding the assumptions underlying our principal factor approximation. While the manuscript focuses on covariances that can be reasonably approximated by a low-rank factor model plus diagonal noise, we recognize the need to address cases where residuals may persist. In the revised manuscript, we will provide an explicit bound on the effect of the residual matrix on the variance estimator used in the lower confidence bound. This will clarify the conditions under which the consistency is preserved or improved, and we will include a brief discussion of scenarios where the approximation may be less effective. We believe this will strengthen the theoretical comparison.
Revision: yes

Referee: [Simulation section] (Results supporting outperformance) The reported gains in accuracy and weaker-signal detection are used to validate the theoretical extension, yet the manuscript does not specify data-exclusion rules, fitting choices for the factor model, or how the covariance matrices were generated to ensure they are truly arbitrary. Without these details the simulation evidence cannot be assessed as independent confirmation of the approximation's validity.

Authors: We agree that the simulation section would benefit from greater transparency. In the revised manuscript, we will add detailed descriptions of: the procedure for generating covariance matrices to represent arbitrary dependence (including the use of factor models with varying ranks and added noise to simulate residuals), the method for fitting the principal factor approximation (such as eigenvalue-based factor selection), and any rules for data exclusion or handling of simulation replicates. These enhancements will better support the empirical validation of our theoretical results.
Revision: yes
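The two ingredients named in this response can be sketched as follows, under stated assumptions: a covariance built from a low-rank factor part plus a noise term that leaves residual dependence, and an eigenvalue-based rule for choosing the number of factors. The function names, the noise construction, and the threshold `eps` are hypothetical. The selection rule mirrors a criterion of the type used by Fan, Han, and Gu (2012) and is not confirmed to be the one the authors use.

```python
import numpy as np

def make_factor_covariance(p, rank, noise_scale, rng):
    """Correlation matrix from a rank-`rank` factor part, a dense noise
    part that leaves residual dependence, and an identity term."""
    B = rng.normal(size=(p, rank))
    E = rng.normal(size=(p, p)) * noise_scale
    cov = B @ B.T + E @ E.T / p + np.eye(p)   # sum of PSD terms plus identity
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)               # rescale to unit diagonal

def select_num_factors(sigma, eps=0.05):
    """Smallest k such that sqrt(sum_{i>k} lambda_i^2) / sum_i lambda_i < eps."""
    lam = np.sort(np.linalg.eigvalsh(sigma))[::-1]   # descending eigenvalues
    total = lam.sum()
    # tail_sq[k] = sum of squared eigenvalues left after removing the k largest
    tail_sq = np.cumsum((lam ** 2)[::-1])[::-1]
    for k in range(len(lam)):
        if np.sqrt(tail_sq[k]) / total < eps:
            return k
    return len(lam)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for rank in (1, 3, 10):
        sigma = make_factor_covariance(p=200, rank=rank, noise_scale=0.5, rng=rng)
        print(rank, select_num_factors(sigma, eps=0.05))
```

Reporting the selected `k` next to the true rank for each simulated covariance would be one way to document the fitting choices the referee asks about.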

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper extends prior lower confidence bounds for signal proportions by incorporating the principal factor approximation to account for arbitrary covariance structures. Theoretical comparisons of consistency conditions before and after the dependence adjustment, combined with simulation results across sparsity levels and dependence structures, supply independent content. No equations or steps reduce by construction to fitted inputs, self-definitions, or unverified self-citations; the central estimator and its performance claims rest on the explicit extension procedure and external benchmarks rather than tautological renaming or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the principal factor approximation is treated as an existing procedure imported from prior literature rather than newly postulated here.



