Active Hypothesis Testing under Computational Budgets with Applications to GWAS and LLM
Pith reviewed 2026-05-17 03:27 UTC · model grok-4.3
The pith
A budget-constrained testing method probabilistically mixes exact statistics and cheap proxies to deliver valid p-values or e-values while using exactly the allotted compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central construction is a data-adaptive rule that, given a global budget B and a family of inexpensive auxiliary statistics, selects for each hypothesis a randomization probability q_i such that the expected cost equals B exactly; when the exact statistic is chosen it is used directly, otherwise a transformed proxy is substituted in a way that preserves the validity of the reported p-value or e-value. The authors prove that this procedure attains the minimal possible expected type-I error (or maximal expected e-value) among all budget-feasible rules under independence, and remains admissible without independence.
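A minimal sketch of the randomized rule for e-values, assuming the two expressions quoted for Definition 1 in the "Lean theorems connected to this paper" section are the two branches of a single rule. The function names, the strictly positive scores `u`, the cap below 1, and the toy exponential null e-values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def allocation_probs(u, budget, eps=1e-6):
    """Probability of paying for the exact statistic, per hypothesis.

    u      : strictly positive scores built from the cheap auxiliary statistics (assumed)
    budget : expected number of exact computations allowed (hypothetical name)
    """
    u = np.asarray(u, dtype=float)
    q = budget * u / u.sum()
    # Cap just below 1 so both branches stay well defined.
    # The cap can only lower the expected cost, so sum(q) <= budget.
    return np.minimum(q, 1.0 - eps)

def active_e_value(e_exact, q, beta, rng):
    """Two-branch randomized e-value.

    e_exact is passed as a full array for simplicity; in practice it would
    only be computed where the coin flip says so.
    """
    q = np.asarray(q, dtype=float)
    e_exact = np.asarray(e_exact, dtype=float)
    computed = rng.random(q.shape) < q            # coin flip per hypothesis
    skipped_value = beta / (1.0 - q)              # reported when the exact statistic is skipped
    exact_value = (1.0 - beta) / q * e_exact      # rescaled exact e-value
    return np.where(computed, exact_value, skipped_value), computed

rng = np.random.default_rng(0)
n, budget, beta = 1000, 200, 0.5
u = rng.uniform(0.5, 1.5, size=n)
q = allocation_probs(u, budget)
e_null = rng.exponential(1.0, size=n)             # toy null e-values with mean 1 (assumption)
e_active, computed = active_e_value(e_null, q, beta, rng)
print(q.sum(), computed.sum(), e_active.mean())
```

In this sketch the expected cost equals the budget only when no probability hits the cap; otherwise it falls below the budget. Averaged over the coin flip, each reported value equals `beta + (1 - beta) * e_exact`, which is why the rescaling keeps the expectation at or below 1 under the null.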
What carries the argument
The randomized decision rule that uses auxiliary statistics to allocate exact versus proxy computation while enforcing an exact global budget constraint.
If this is right
- Under a fixed compute budget the method returns more discoveries than uniform allocation of the same budget.
- The same framework applies unchanged to any setting where an exact test statistic can be replaced by a cheaper but still valid proxy.
- Optimality for e-values holds regardless of dependence among hypotheses.
- Admissibility for p-values continues to hold when the auxiliary statistics are correlated with the exact statistics.
- The procedure can be applied directly to existing GWAS pipelines or LLM scoring tasks without altering the underlying test statistic.
Where Pith is reading between the lines
- The same randomization idea could be extended to sequential testing where the budget is revealed gradually rather than fixed in advance.
- If auxiliary statistics are themselves expensive to compute, a two-stage hierarchy of proxies might further reduce total cost.
- In settings where multiple testing corrections are applied after the active procedure, the dependence structure between the active decisions and the correction step would need separate analysis.
- The framework suggests a general template for any resource-constrained inference problem in which cheap proxies can be certified to preserve validity when substituted probabilistically.
Load-bearing premise
Inexpensive auxiliary statistics exist that carry enough information to guide the probabilistic choice without destroying the validity of the final p-value or e-value.
What would settle it
Run the procedure on a collection of independent null hypotheses with known exact p-value distributions; if the empirical distribution of reported p-values deviates systematically from uniform or if the realized total compute exceeds the declared budget, the guarantee fails.
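The check described above can be sketched as a small harness. It assumes the reported p-values and the indicators of which exact statistics were computed are already available as arrays. Because validity only requires p-values to be no smaller than uniform, the sketch measures the one-sided gap sup_t [ECDF(t) - t] and compares it with the one-sided Dvoretzky-Kiefer-Wolfowitz bound. All function names are hypothetical.

```python
import numpy as np

def anticonservative_gap(pvals):
    """sup_t [ECDF(t) - t]: how far the reported p-values fall below the uniform line."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    return float(np.max(np.arange(1, n + 1) / n - p))

def dkw_threshold(n, alpha=0.01):
    """One-sided DKW bound: for independent, valid p-values the gap exceeds this
    with probability at most alpha."""
    return float(np.sqrt(np.log(1.0 / alpha) / (2.0 * n)))

def check_run(pvals, computed, budget, alpha=0.01):
    """Return (validity_ok, budget_ok) for one run on known null hypotheses."""
    pvals = np.asarray(pvals, dtype=float)
    validity_ok = anticonservative_gap(pvals) <= dkw_threshold(pvals.size, alpha)
    budget_ok = int(np.sum(computed)) <= budget
    return bool(validity_ok), bool(budget_ok)

n = 1000
uniform_grid = (np.arange(1, n + 1) - 0.5) / n   # idealised valid p-values
too_small = uniform_grid ** 2                    # anti-conservative p-values
computed = np.zeros(n, dtype=bool)
computed[:200] = True                            # 200 exact computations were paid for

print(check_run(uniform_grid, computed, budget=200))
print(check_run(too_small, computed, budget=150))
```

The first call passes both checks; the second fails both, since squaring pushes the p-values below the uniform line and 200 computations exceed a budget of 150.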
Original abstract
In large-scale hypothesis testing, computing exact $p$-values or $e$-values is often resource-intensive, creating a need for budget-aware inferential methods. We propose a general framework for active hypothesis testing that leverages inexpensive auxiliary statistics to allocate a global computational budget. For each hypothesis, our data-adaptive procedure probabilistically decides whether to compute the exact test statistic or a transformed proxy, guaranteeing a valid $p$-value or $e$-value while satisfying the exact budget constraint. Theoretical guarantees are established for our constructions, showing that the procedure achieves optimality for $e$-values and for $p$-values under independence, and admissibility for $p$-values under general dependence. Empirical results from simulations and two real-world applications, including a large-scale genome-wide association study (GWAS) and a clinical prediction task leveraging large language models (LLM), demonstrate that our framework improves statistical efficiency under fixed resource limits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a general framework for active hypothesis testing under a fixed computational budget. It uses inexpensive auxiliary statistics to make per-hypothesis probabilistic decisions on whether to compute an exact test statistic or a transformed proxy, with a global adjustment mechanism to enforce the exact budget constraint. The authors claim that the resulting p-values or e-values remain valid, that the procedure is optimal for e-values and for p-values under independence, and admissible for p-values under general dependence. These claims are supported by theoretical constructions and demonstrated via simulations plus applications to a large-scale GWAS and an LLM-based clinical prediction task.
Significance. If the validity, optimality, and admissibility results hold after accounting for the exact budget mechanism, the framework would provide a principled method for improving statistical efficiency in resource-limited multiple testing settings. This has clear relevance for genomics and machine-learning inference tasks where exact computations are costly. The empirical applications add practical weight, though the overall significance depends on whether the theoretical guarantees survive the dependence introduced by global budget enforcement.
major comments (1)
- [Theoretical guarantees section (admissibility under dependence)] The admissibility result for p-values under general dependence (abstract and the section presenting the theoretical guarantees) appears to treat the allocation indicators as exogenous or independent of the test statistics. The global coupling mechanism required to enforce the exact (not merely expected) budget constraint correlates these indicators across hypotheses. This correlation risks violating the conditions needed for admissibility or validity unless an explicit martingale, conditional-independence, or coupling argument is supplied; the current derivation does not obviously contain this step.
minor comments (2)
- [Abstract] The abstract refers to a 'transformed proxy' without defining its construction or properties; an early concrete example or equation would aid readability.
- [Applications and simulations] The empirical sections would benefit from explicit pseudocode or a small table showing how the global budget constraint is realized in the GWAS and LLM implementations.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying a key point about the dependence structure induced by the exact budget constraint. We address the comment below and will strengthen the theoretical section accordingly.
- Referee: [Theoretical guarantees section (admissibility under dependence)] The admissibility result for p-values under general dependence (abstract and the section presenting the theoretical guarantees) appears to treat the allocation indicators as exogenous or independent of the test statistics. The global coupling mechanism required to enforce the exact (not merely expected) budget constraint correlates these indicators across hypotheses. This correlation risks violating the conditions needed for admissibility or validity unless an explicit martingale, conditional-independence, or coupling argument is supplied; the current derivation does not obviously contain this step.
  Authors: We agree that the global mechanism enforcing the exact budget introduces correlation among allocation indicators and that the current derivation does not supply an explicit martingale or coupling argument to address this. The procedure first computes per-hypothesis inclusion probabilities from the auxiliary statistics and then applies a global adjustment (via randomized rounding or rejection sampling) to meet the exact total budget. This adjustment preserves the marginal inclusion probability for each hypothesis conditional on its auxiliary statistic, which is sufficient for marginal validity of the resulting p-values. For admissibility under arbitrary dependence, the argument compares the expected performance of our procedure against any other budget-feasible rule; because the comparison is in terms of the joint distribution of discoveries and the marginal validity is already guaranteed, the dependence does not invalidate the dominance. Nevertheless, to make the reasoning fully rigorous we will insert a new subsection that constructs an explicit coupling between the allocation vector and the test statistics and verifies the martingale property of the adjusted p-values. This addition will appear in the revised theoretical guarantees section.
  Revision: yes
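To make the "global adjustment" step concrete, here is one standard way to hit an exact count while keeping every marginal inclusion probability: systematic sampling. This is a swapped-in textbook technique used for illustration; the rebuttal names randomized rounding or rejection sampling and does not say which scheme the authors use. The sketch assumes the probabilities lie in [0, 1] and sum to an integer budget, and that the ordering of hypotheses does not depend on the exact statistics.

```python
import numpy as np

def systematic_sample(probs, rng):
    """Select items so that item i is included with probability probs[i]
    and exactly round(sum(probs)) items are selected."""
    probs = np.asarray(probs, dtype=float)
    k = int(round(probs.sum()))
    # Item i owns the interval [edges[i], edges[i + 1]) of length probs[i] <= 1.
    edges = np.concatenate(([0.0], np.cumsum(probs)))
    # One shared uniform shift, then unit spacing: each interval catches at most one point.
    points = rng.random() + np.arange(k)
    idx = np.searchsorted(edges, points, side="right") - 1
    idx = np.minimum(idx, probs.size - 1)   # guard against float round-off at the top edge
    chosen = np.zeros(probs.size, dtype=bool)
    chosen[idx] = True
    return chosen

rng = np.random.default_rng(1)
n, budget = 50, 10
w = rng.uniform(0.1, 0.3, size=n)
probs = budget * w / w.sum()                # inclusion probabilities summing to the budget
chosen = systematic_sample(probs, rng)
print(chosen.sum())
```

The single shared shift is exactly what correlates the indicators across hypotheses, which is the dependence the referee asks the authors to handle explicitly.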
Circularity Check
No circularity: derivations rely on standard validity arguments and external statistical concepts
The paper's abstract and described framework establish validity, optimality, and admissibility guarantees for a budget-constrained active testing procedure using auxiliary statistics and probabilistic allocation. No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations are visible in the provided text that would reduce the claimed optimality or validity results to tautological constructions. The central claims build on classical p-value/e-value properties and budget constraints without evident reduction to inputs by definition. The global coupling mechanism for exact budget satisfaction is presented as a construction that preserves validity, with theoretical results stated as established rather than derived circularly from the same assumptions. This is the common honest case of a self-contained statistical proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions ensuring that p-values and e-values remain valid under the probabilistic allocation rule
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Definition 1 (Active e-value) and Corollary 1: $E_{\text{active}} = \beta/(1-h(E^a))$ or $\frac{1-\beta}{h(E^a)} \cdot E$, with $\sup a(1-h) \le \beta$ and $\sup b\,h \le 1-\beta$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Normalized allocation $h_i(X^a) = nb \cdot u_i(X^a_i) / \sum_j u_j$ and $E[\sum_i C_i] \le nb$
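A short worked calculation shows why the scaling constants in the Definition 1 passage are natural. It reads the two quoted expressions as the two branches of one randomized rule, which is an interpretation since the full definition is not shown here. The symbol $\xi$ is hypothetical: it denotes the indicator that the exact e-value $E$ is computed, drawn with $P(\xi = 1 \mid E^a, E) = h(E^a) \in (0, 1)$. The cap at 1 in the last line and the identification of $C_i$ with the per-hypothesis indicator are likewise assumptions.

```latex
\begin{aligned}
E_{\text{active}} &= (1-\xi)\,\frac{\beta}{1-h(E^a)} \;+\; \xi\,\frac{1-\beta}{h(E^a)}\,E, \\
\mathbb{E}\!\left[E_{\text{active}} \mid E^a, E\right]
  &= \bigl(1-h(E^a)\bigr)\frac{\beta}{1-h(E^a)} + h(E^a)\,\frac{1-\beta}{h(E^a)}\,E
   = \beta + (1-\beta)\,E, \\
\mathbb{E}\!\left[E_{\text{active}}\right] &= \beta + (1-\beta)\,\mathbb{E}[E] \;\le\; 1
  \quad \text{whenever } \mathbb{E}[E] \le 1, \\
\mathbb{E}\!\left[\sum_i C_i\right] &= \mathbb{E}\!\left[\sum_i \min\{1,\, h_i(X^a)\}\right]
  \;\le\; \mathbb{E}\!\left[\sum_i h_i(X^a)\right] = nb .
\end{aligned}
```

Under this reading the budget bound holds with equality when no $h_i$ exceeds 1, and the validity step uses only that the coin is drawn independently of $E$ given $E^a$.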
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Learning U-Statistics with Active Inference: Active inference framework for U-statistics using augmented IPW to optimize label queries and minimize variance under budget constraints.