Using Importance Sampling to Estimate p-values in All-Subset Meta-Analysis, with Applications to Single-Cell eQTL Mapping
Pith reviewed 2026-05-08 07:56 UTC · model grok-4.3
The pith
Importance sampling yields accurate estimates of extremely small p-values for ASSET all-subset meta-analysis even when normality fails.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a computationally efficient importance-sampling (IS) algorithm that provides accurate ASSET p-value estimates for both independent and overlapping studies, achieving substantial efficiency gains over naïve Monte Carlo, particularly for very small p-values. Using IS, we show that ASSET's analytic approximation is highly accurate across nearly the entire p-value range when normality holds. In contrast, when normality is violated (due to small sample sizes, low-frequency variants, or non-normal traits), ASSET p-values can be inflated or deflated by orders of magnitude, whereas our IS approach remains accurate.
What carries the argument
An importance sampling distribution constructed to oversample the extreme tail of the null distribution of the ASSET test statistic, combined with appropriate reweighting to recover unbiased small p-value estimates.
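To make the oversample-then-reweight idea concrete, here is a minimal one-dimensional sketch. It is not the paper's algorithm: it assumes a single standard-normal statistic and a mean-shifted normal proposal, and the function names `tail_prob_is` and `tail_prob_exact` are hypothetical. The ASSET statistic is a maximum over subsets and would need a more elaborate proposal.

```python
import math
import numpy as np

def tail_prob_is(t, n_draws=100_000, seed=0):
    """Estimate P(Z > t) for Z ~ N(0, 1) using the proposal N(t, 1).

    Assumption for illustration only: a plain mean shift to the threshold.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=t, scale=1.0, size=n_draws)  # draws from the proposal
    log_w = -t * x + 0.5 * t * t                    # log(null density / proposal density)
    contrib = np.where(x > t, np.exp(log_w), 0.0)   # reweighted tail indicator
    est = contrib.mean()                            # unbiased for P(Z > t)
    se = contrib.std(ddof=1) / math.sqrt(n_draws)   # Monte Carlo standard error
    return est, se

def tail_prob_exact(t):
    """Closed-form normal tail probability, used only as a check."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

if __name__ == "__main__":
    est, se = tail_prob_is(6.0)
    print(est, se, tail_prob_exact(6.0))
```

About half of the proposal draws land beyond the threshold, so the estimate stabilises with far fewer draws than sampling from the null directly would need.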
If this is right
- The original analytic approximation can be used safely for moderate p-values when normality holds but must be replaced for extreme tails or non-normal regimes.
- Single-cell eQTL mapping and similar large-scale genetic studies can now obtain trustworthy p-values from exhaustive subset searches without prohibitive computation.
- Meta-analyses involving overlapping studies gain reliable type I error control at stringent significance thresholds.
- The efficiency gain scales with how small the target p-value is, making genome-wide scans with millions of variants feasible.
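The last point follows from a standard binomial-variance argument rather than from anything specific to the paper: a naive Monte Carlo estimate of a probability p from n draws has relative standard error sqrt((1 - p) / (n p)), so the draws needed for a fixed relative error grow like 1/p. The helper below is a hypothetical illustration of that arithmetic.

```python
import math

def naive_mc_draws(p, rel_se=0.1):
    """Draws needed so a naive Monte Carlo estimate of p has relative
    standard error rel_se, using Var(indicator) = p * (1 - p)."""
    return math.ceil((1.0 - p) / (p * rel_se ** 2))

if __name__ == "__main__":
    for p in (1e-3, 1e-6, 1e-8):
        print(p, naive_mc_draws(p))
```

At a 10% relative error, a p-value near 1e-8 needs on the order of 1e10 naive draws per test, which is what makes genome-wide use impractical without variance reduction.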
Where Pith is reading between the lines
- The same importance sampling construction could be adapted to other all-subset or model-selection procedures in high-dimensional genomics beyond ASSET.
- Integration into standard genetic analysis software would allow routine use of exhaustive subset searches instead of pre-specified subsets.
- When non-normality is detected, the method supplies a practical route to calibrated inference without requiring larger samples or data transformations.
Load-bearing premise
An effective importance sampling distribution must be constructible that covers the far tail of the null distribution of the ASSET statistic.
What would settle it
A direct comparison in which a known small p-value computed from billions of naive Monte Carlo draws differs by more than sampling error from the importance sampling estimate on the same null model.
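One way to read "differs by more than sampling error" is as a two-sample z-comparison of the two estimates. The sketch below assumes the two estimators are independent and approximately normal; the cutoff of 3 standard errors and the function names are illustrative choices, not taken from the paper.

```python
import math

def naive_mc_se(p_hat, n_draws):
    """Binomial standard error of a naive Monte Carlo p-value estimate."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_draws)

def estimates_agree(p_a, se_a, p_b, se_b, z_crit=3.0):
    """Return (agree, z), where z is the difference of the two estimates
    in units of their combined standard error."""
    z = (p_a - p_b) / math.hypot(se_a, se_b)
    return abs(z) <= z_crit, z

if __name__ == "__main__":
    p_mc = 1.0e-8                     # hypothetical naive estimate
    se_mc = naive_mc_se(p_mc, 5_000_000_000)
    print(estimates_agree(1.1e-8, 2.0e-10, p_mc, se_mc))
```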
Original abstract
Pooling genome-wide association studies of multiple related traits can substantially increase power for detecting genetic variants with pleiotropic effects. ASSET, which exhaustively searches all subsets of studies for association signals, has been widely used to detect modest effects and improve interpretability. Under a normality assumption, ASSET computes p-values via an analytic approximation that accounts for multiple testing. However, this approximation has been evaluated only in limited scenarios and for p-values no smaller than $10^{-3}$. A systematic assessment in the extreme tail is therefore needed, yet naïve Monte Carlo methods would require prohibitively many simulations. We develop a computationally efficient importance-sampling (IS) algorithm that provides accurate ASSET p-value estimates for both independent and overlapping studies, achieving substantial efficiency gains over naïve Monte Carlo, particularly for very small p-values. Using IS, we show that ASSET's analytic approximation is highly accurate across nearly the entire p-value range when normality holds. In contrast, when normality is violated (due to small sample sizes, low-frequency variants, or non-normal traits), ASSET p-values can be inflated or deflated by orders of magnitude, whereas our IS approach remains accurate. We illustrate the method through applications to single-cell eQTL mapping using peripheral blood mononuclear cells from the OneK1K cohort and lung cells from a Korean population.
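For readers unfamiliar with all-subset searches, the sketch below shows a simplified one-sided version of the idea: combine study z-scores within each non-empty subset, take the maximum, and calibrate it by naive Monte Carlo under independent standard-normal nulls. The square-root sample-size weighting, the independence assumption, and all function names are assumptions made for illustration; this is neither the ASSET software nor the paper's IS procedure.

```python
from itertools import combinations
import numpy as np

def subset_weights(n_sizes):
    """One row per non-empty subset S, with entries
    sqrt(n_k / sum_{j in S} n_j) for k in S and 0 elsewhere."""
    n_sizes = np.asarray(n_sizes, dtype=float)
    k = len(n_sizes)
    rows = []
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            idx = list(subset)
            w = np.zeros(k)
            w[idx] = np.sqrt(n_sizes[idx] / n_sizes[idx].sum())
            rows.append(w)
    return np.array(rows)

def max_subset_stat(z, weights):
    """Maximum over subsets of the weighted z-score sum, per row of z."""
    return (np.atleast_2d(z) @ weights.T).max(axis=1)

def naive_mc_pvalue(z_obs, n_sizes, n_draws=200_000, seed=0):
    """Naive Monte Carlo p-value for the max-subset statistic,
    assuming independent N(0, 1) study z-scores under the null."""
    rng = np.random.default_rng(seed)
    weights = subset_weights(n_sizes)
    t_obs = max_subset_stat(z_obs, weights)[0]
    null_z = rng.standard_normal((n_draws, len(n_sizes)))
    t_null = max_subset_stat(null_z, weights)
    # Add-one correction keeps the estimate strictly positive.
    return (1 + np.sum(t_null >= t_obs)) / (n_draws + 1)

if __name__ == "__main__":
    print(naive_mc_pvalue([2.0, 2.0, -1.0], [100, 100, 100]))
```

With K studies there are 2^K - 1 subsets, and the add-one estimate can never fall below 1/(n_draws + 1); both facts illustrate why the extreme tail is out of reach for this naive approach.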
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to develop a computationally efficient importance sampling (IS) algorithm to estimate p-values for the ASSET all-subset meta-analysis procedure, applicable to both independent and overlapping studies. It reports that the existing analytic approximation is highly accurate under normality across nearly the full p-value range but can inflate or deflate p-values by orders of magnitude when normality fails (small samples, low-frequency variants, non-normal traits). The IS method achieves substantial efficiency gains over naive Monte Carlo, especially in the extreme tail, and is illustrated via applications to single-cell eQTL mapping in PBMC (OneK1K cohort) and lung cells (Korean population).
Significance. If the central claims hold, the work supplies a practical, scalable solution for accurate tail-probability estimation in multi-trait meta-analysis, a setting where genome-wide scans routinely require reliable p-values far below 10^{-3} and normality is frequently violated. The efficiency gains, explicit handling of study overlap through joint null covariance, and validation under both normal and non-normal regimes constitute a clear methodological advance with immediate utility for pleiotropy detection in genomics.
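The "joint null covariance" for overlapping studies can be pictured as drawing correlated rather than independent null z-scores. A minimal sketch, assuming the between-study correlation matrix is already known (the values below are hypothetical) and using a Cholesky factor to impose it:

```python
import numpy as np

def draw_correlated_null(corr, n_draws, seed=0):
    """Draw null z-score vectors with the given correlation matrix.

    corr must be symmetric positive definite; shared samples between
    studies would show up as non-zero off-diagonal entries.
    """
    corr = np.asarray(corr, dtype=float)
    chol = np.linalg.cholesky(corr)              # corr = chol @ chol.T
    rng = np.random.default_rng(seed)
    indep = rng.standard_normal((n_draws, corr.shape[0]))
    return indep @ chol.T                        # rows have covariance corr

if __name__ == "__main__":
    corr = [[1.0, 0.3, 0.0],
            [0.3, 1.0, 0.2],
            [0.0, 0.2, 1.0]]                     # hypothetical overlap structure
    z = draw_correlated_null(corr, 200_000)
    print(np.corrcoef(z, rowvar=False).round(2))
```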
minor comments (3)
- [Abstract] Abstract: the statement that the analytic approximation is 'highly accurate across nearly the entire p-value range' under normality would be strengthened by citing the specific simulation ranges and error metrics (e.g., relative error or coverage) that support this claim.
- [Methods] Methods (IS proposal construction): the description of how the importance distribution is chosen to cover the extreme tail under the joint null for overlapping studies should include an explicit statement of the effective sample size achieved for p-values < 10^{-8} to confirm the reported efficiency gains are not limited to moderate tails.
- [Results] Results (non-normality experiments): the claim that ASSET p-values 'can be inflated or deflated by orders of magnitude' would benefit from a table or figure panel that directly contrasts IS estimates against the analytic approximation for the same non-normal simulation settings, including the magnitude of discrepancy at the smallest p-values examined.
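The effective sample size requested in the Methods comment is commonly computed with the Kish formula, (sum of weights)^2 / (sum of squared weights). Whether the paper uses this exact definition is an assumption here; the sketch works from log-weights for numerical stability, and the function name is hypothetical.

```python
import numpy as np

def effective_sample_size(log_weights):
    """Kish effective sample size from importance log-weights.

    The formula is scale-invariant, so subtracting the maximum
    log-weight avoids overflow without changing the result.
    """
    lw = np.asarray(log_weights, dtype=float)
    w = np.exp(lw - lw.max())
    return w.sum() ** 2 / np.sum(w ** 2)

if __name__ == "__main__":
    print(effective_sample_size(np.zeros(1000)))            # equal weights
    print(effective_sample_size([0.0, -50.0, -50.0, -50.0]))  # one dominant weight
```

An effective sample size close to the number of draws indicates evenly spread weights, while a value near 1 signals that a handful of draws dominate the estimate.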
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and the recommendation for minor revision. The referee's description accurately reflects the manuscript's contributions regarding the importance sampling approach for ASSET p-value estimation under normality and non-normality assumptions.
Circularity Check
No significant circularity in the importance sampling algorithm
full rationale
The paper introduces a new importance-sampling algorithm for estimating extreme-tail ASSET p-values under both normal and non-normal regimes. This is a direct methodological construction from standard IS principles (proposal distribution, joint null covariance for overlapping studies, effective sample size monitoring) rather than a derivation that reduces to fitted parameters, self-citations, or prior ansatzes. Validation proceeds via independent Monte Carlo comparisons and analytic checks that do not presuppose the target result. No load-bearing step equates the output estimator to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ability to simulate from the null distribution of the ASSET statistic under both normal and non-normal data regimes.