arxiv: 2512.01423 · v2 · submitted 2025-12-01 · 📊 stat.ME

Active Hypothesis Testing under Computational Budgets with Applications to GWAS and LLM

Pith reviewed 2026-05-17 03:27 UTC · model grok-4.3

classification 📊 stat.ME
keywords active hypothesis testing · computational budget · p-values · e-values · genome-wide association study · large language models · data-adaptive procedures · resource-constrained inference

The pith

A budget-constrained testing method probabilistically mixes exact statistics and cheap proxies to deliver valid p-values or e-values while using exactly the allotted compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In large-scale testing problems the cost of computing precise p-values or e-values often exceeds available resources. The paper introduces an active procedure that, for each hypothesis, draws on inexpensive auxiliary statistics to decide probabilistically whether to evaluate the exact statistic or a transformed proxy. The rule is constructed so that the resulting p-value or e-value remains valid and the total computational outlay exactly matches the preset budget. Theory shows the construction is optimal for e-values and for p-values when hypotheses are independent, and admissible more generally. Simulations and real applications to genome-wide association studies and LLM-based clinical prediction illustrate that the same statistical power can be achieved with substantially less total computation.
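To make the mixing idea concrete, here is a minimal Python sketch of one way to combine an exact e-value with a cheap proxy so that the output is still an e-value. It is an illustration under stated assumptions, not the paper's construction: the names `query_probability`, `active_e_value`, `conditional_mean`, and the tuning constant `gamma` are hypothetical, and no global budget is enforced here. The validity argument is short. Conditional on the proxy F and the exact e-value E, the expected output is at most gamma(1 - gamma) + (1 - gamma)E, so its null expectation is at most 1 - gamma^2 whenever E has null expectation at most 1.

```python
import random


def query_probability(proxy, gamma):
    """Probability of paying for the exact e-value, given a nonnegative proxy."""
    if proxy <= gamma:
        return 0.0
    return 1.0 - gamma / proxy


def active_e_value(proxy, exact_fn, gamma, rng):
    """Return (e_value, queried) for one hypothesis.

    proxy    : nonnegative number built from cheap auxiliary statistics (assumed).
    exact_fn : zero-argument callable that computes the expensive exact e-value.
    gamma    : hypothetical tuning constant in (0, 1).
    rng      : a random.Random instance supplying the randomization.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if proxy < 0.0:
        raise ValueError("proxy must be nonnegative")
    if rng.random() < query_probability(proxy, gamma):
        return (1.0 - gamma) * exact_fn(), True   # exact statistic, shrunk
    return (1.0 - gamma) * proxy, False           # transformed proxy, no query


def conditional_mean(proxy, exact, gamma):
    """Expected output given the proxy and the exact e-value (averaging the coin)."""
    q = query_probability(proxy, gamma)
    return (1.0 - q) * (1.0 - gamma) * proxy + q * (1.0 - gamma) * exact
```

Because the bound holds conditionally on both F and E, this sketch does not need the proxy to be independent of the exact statistic; the price is the shrinkage factor 1 - gamma applied to every reported value.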

Core claim

The central construction is a data-adaptive rule that, given a global budget B and a family of inexpensive auxiliary statistics, selects for each hypothesis a randomization probability q_i such that the expected cost equals B exactly. When the exact statistic is chosen it is used directly; otherwise a transformed proxy is substituted in a way that preserves the validity of the reported p-value or e-value. The authors prove that, among budget-feasible rules, the procedure is optimal for e-values and, under independence, for p-values, and that it remains admissible for p-values under general dependence.
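The budget-matching step can be sketched as follows. This is a minimal illustration, assuming each hypothesis has a nonnegative priority score computed from its auxiliary statistic and that every exact evaluation costs one unit; the function name `calibrate_probabilities` and the capped-proportional form q_i = min(1, c * w_i) are assumptions made for the example, not the rule derived in the paper. Bisection on the scale c makes the probabilities sum to B, so the expected number of exact evaluations equals the budget.

```python
def calibrate_probabilities(weights, budget, iterations=200):
    """Return q_i = min(1, c * w_i) with sum(q) equal to budget (up to rounding).

    weights : nonnegative priority scores from auxiliary statistics (assumed).
    budget  : target expected number of exact evaluations, at unit cost each.
    """
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be nonnegative")
    positive = sum(1 for w in weights if w > 0.0)
    if not 0.0 < budget <= positive:
        raise ValueError("budget must be positive and at most the number of positive weights")

    def expected_cost(c):
        return sum(min(1.0, c * w) for w in weights)

    # Bracket the scale: expected_cost is continuous and nondecreasing in c.
    lo, hi = 0.0, 1.0
    while expected_cost(hi) < budget:
        hi *= 2.0
    # Bisection keeps expected_cost(lo) < budget <= expected_cost(hi).
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_cost(mid) < budget:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * w) for w in weights]
```

For example, with scores `[1, 2, 3, 4]` and a budget of 2 the scale settles at 0.2, giving probabilities close to `[0.2, 0.4, 0.6, 0.8]`. Matching the budget in expectation is weaker than matching it on every run; that stronger requirement needs a coupled draw across hypotheses.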

What carries the argument

The randomized decision rule that uses auxiliary statistics to allocate exact versus proxy computation while enforcing an exact global budget constraint.

If this is right

  • Under a fixed compute budget the method returns more discoveries than uniform allocation of the same budget.
  • The same framework applies unchanged to any setting where an exact test statistic can be replaced by a cheaper but still valid proxy.
  • Optimality for e-values holds regardless of dependence among hypotheses.
  • Admissibility for p-values continues to hold when the auxiliary statistics are correlated with the exact statistics.
  • The procedure can be applied directly to existing GWAS pipelines or LLM scoring tasks without altering the underlying test statistic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same randomization idea could be extended to sequential testing where the budget is revealed gradually rather than fixed in advance.
  • If auxiliary statistics are themselves expensive to compute, a two-stage hierarchy of proxies might further reduce total cost.
  • In settings where multiple testing corrections are applied after the active procedure, the dependence structure between the active decisions and the correction step would need separate analysis.
  • The framework suggests a general template for any resource-constrained inference problem in which cheap proxies can be certified to preserve validity when substituted probabilistically.

Load-bearing premise

Inexpensive auxiliary statistics exist that carry enough information to guide the probabilistic choice without destroying the validity of the final p-value or e-value.

What would settle it

Run the procedure on a collection of independent null hypotheses with known exact p-value distributions; if the empirical distribution of reported p-values deviates systematically from uniform or if the realized total compute exceeds the declared budget, the guarantee fails.
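The check described above can be sketched as a small audit harness. This is a hedged illustration: the function names are hypothetical, the p-values and query indicators are assumed to come from whatever implementation is being audited, and the threshold uses the one-sided Dvoretzky-Kiefer-Wolfowitz inequality, which assumes independent and identically distributed null p-values. Since validity only requires P(p <= t) <= t, the harness flags the anti-conservative direction only.

```python
import math
import random


def superuniformity_gap(pvalues):
    """Largest amount by which the empirical CDF exceeds the uniform CDF."""
    xs = sorted(pvalues)
    n = len(xs)
    return max((i + 1) / n - x for i, x in enumerate(xs))


def dkw_threshold(n, delta):
    """Gap exceeded with probability at most delta under exact uniformity."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def audit(pvalues, query_indicators, budget, delta=1e-6):
    """Summarize validity and budget compliance for one run on null data."""
    gap = superuniformity_gap(pvalues)
    return {
        "gap": gap,
        "valid_looking": gap <= dkw_threshold(len(pvalues), delta),
        "within_budget": sum(query_indicators) <= budget,
    }


if __name__ == "__main__":
    rng = random.Random(1)
    null_p = [rng.random() for _ in range(5000)]      # stand-in for reported p-values
    queried = [1] * 400 + [0] * 4600                  # stand-in for query indicators
    print(audit(null_p, queried, budget=500))
```

A failed `valid_looking` flag on repeated null runs, or any run with `within_budget` false, would be evidence against the guarantee; a pass is only consistent with it.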

Figures

Figures reproduced from arXiv: 2512.01423 by Bowen Gang, Qi Kuang, Yin Xia.

Figure 1: Performance comparison as a function of π with a budget of nb = 500. All methods successfully control the FDR at α = 0.1. Active-Default achieves the highest efficiency. The results in …
Figure 2: Performance comparison as a function of the hyperparameter …
Figure 3: Performance comparison as a function of π, with a fixed σ = 1. [Panels include: Queries (p-value), Efficiency (p-value), Queries (e-value), Efficiency (e-value); vertical axes: average number of queries, efficiency.]
Figure 4: Performance comparison as a function of σ, with a fixed π = 0.1.
… perfectly adhere to the nb = 500 query limit. The Active-Default method again emerges as the most efficient, with its advantage growing as the proportion of true signals increases. The second analysis, presented in …
Figure 5: Performance comparison as a function of π, with a fixed ρ = 0.5. [Panels include: Queries (p-value), Efficiency (p-value), Queries (e-value), Efficiency (e-value); vertical axes: average number of queries, efficiency.]
Figure 6: Performance comparison as a function of ρ, with a fixed π = 0.1.
… behavior. As ρ increases, the auxiliary statistic becomes a more faithful proxy for the gold-standard statistic. This increased information quality allows all active inference methods to improve their power and efficiency. However, Active-Default demonstrates the most significant gains. Its efficiency curve rises more steeply than those of th…
Figure 7: Performance on the GWAS data analysis. (Left) The number of queried MI …
Figure 8: Performance on the MI Complications data analysis. All methods control the FDR, …
Original abstract

In large-scale hypothesis testing, computing exact $p$-values or $e$-values is often resource-intensive, creating a need for budget-aware inferential methods. We propose a general framework for active hypothesis testing that leverages inexpensive auxiliary statistics to allocate a global computational budget. For each hypothesis, our data-adaptive procedure probabilistically decides whether to compute the exact test statistic or a transformed proxy, guaranteeing a valid $p$-value or $e$-value while satisfying the exact budget constraint. Theoretical guarantees are established for our constructions, showing that the procedure achieves optimality for $e$-values and for $p$-values under independence, and admissibility for $p$-values under general dependence. Empirical results from simulations and two real-world applications, including a large-scale genome-wide association study (GWAS) and a clinical prediction task leveraging large language models (LLM), demonstrate that our framework improves statistical efficiency under fixed resource limits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a general framework for active hypothesis testing under a fixed computational budget. It uses inexpensive auxiliary statistics to make per-hypothesis probabilistic decisions on whether to compute an exact test statistic or a transformed proxy, with a global adjustment mechanism to enforce the exact budget constraint. The authors claim that the resulting p-values or e-values remain valid, that the procedure is optimal for e-values and for p-values under independence, and admissible for p-values under general dependence. These claims are supported by theoretical constructions and demonstrated via simulations plus applications to a large-scale GWAS and an LLM-based clinical prediction task.

Significance. If the validity, optimality, and admissibility results hold after accounting for the exact budget mechanism, the framework would provide a principled method for improving statistical efficiency in resource-limited multiple testing settings. This has clear relevance for genomics and machine-learning inference tasks where exact computations are costly. The empirical applications add practical weight, though the overall significance depends on whether the theoretical guarantees survive the dependence introduced by global budget enforcement.

major comments (1)
  1. [Theoretical guarantees section (admissibility under dependence)] The admissibility result for p-values under general dependence (abstract and the section presenting the theoretical guarantees) appears to treat the allocation indicators as exogenous or independent of the test statistics. The global coupling mechanism required to enforce the exact (not merely expected) budget constraint correlates these indicators across hypotheses. This correlation risks violating the conditions needed for admissibility or validity unless an explicit martingale, conditional-independence, or coupling argument is supplied; the current derivation does not obviously contain this step.
minor comments (2)
  1. [Abstract] The abstract refers to a 'transformed proxy' without defining its construction or properties; an early concrete example or equation would aid readability.
  2. [Applications and simulations] The empirical sections would benefit from explicit pseudocode or a small table showing how the global budget constraint is realized in the GWAS and LLM implementations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying a key point about the dependence structure induced by the exact budget constraint. We address the comment below and will strengthen the theoretical section accordingly.

Point-by-point responses
  1. Referee: [Theoretical guarantees section (admissibility under dependence)] The admissibility result for p-values under general dependence (abstract and the section presenting the theoretical guarantees) appears to treat the allocation indicators as exogenous or independent of the test statistics. The global coupling mechanism required to enforce the exact (not merely expected) budget constraint correlates these indicators across hypotheses. This correlation risks violating the conditions needed for admissibility or validity unless an explicit martingale, conditional-independence, or coupling argument is supplied; the current derivation does not obviously contain this step.

    Authors: We agree that the global mechanism enforcing the exact budget introduces correlation among allocation indicators and that the current derivation does not supply an explicit martingale or coupling argument to address this. The procedure first computes per-hypothesis inclusion probabilities from the auxiliary statistics and then applies a global adjustment (via randomized rounding or rejection sampling) to meet the exact total budget. This adjustment preserves the marginal inclusion probability for each hypothesis conditional on its auxiliary statistic, which is sufficient for marginal validity of the resulting p-values. For admissibility under arbitrary dependence, the argument compares the expected performance of our procedure against any other budget-feasible rule; because the comparison is in terms of the joint distribution of discoveries and the marginal validity is already guaranteed, the dependence does not invalidate the dominance. Nevertheless, to make the reasoning fully rigorous we will insert a new subsection that constructs an explicit coupling between the allocation vector and the test statistics and verifies the martingale property of the adjusted p-values. This addition will appear in the revised theoretical guarantees section.

    revision: yes
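A global adjustment that meets the budget exactly while preserving each marginal inclusion probability can be sketched with systematic sampling, a classical randomized-rounding scheme. This is a minimal illustration, not the mechanism used in the paper: the function name `systematic_round` is hypothetical, every probability is assumed to lie in [0, 1], and the probabilities are assumed to sum to an integer budget. A single uniform offset is shared by all hypotheses, which is exactly the kind of cross-hypothesis coupling the referee's comment is concerned with.

```python
import math
import random


def systematic_round(q, rng):
    """Return 0/1 indicators with sum equal to round(sum(q)) and P(ind[i] = 1) = q[i].

    q   : inclusion probabilities in [0, 1] whose sum is an integer (assumed).
    rng : a random.Random instance; one uniform draw couples all indicators.
    """
    if any(not 0.0 <= qi <= 1.0 for qi in q):
        raise ValueError("each probability must lie in [0, 1]")
    total = sum(q)
    budget = round(total)
    if abs(total - budget) > 1e-9:
        raise ValueError("probabilities must sum to an integer budget")

    u = rng.random()          # shared offset in [0, 1)
    indicators = []
    running = 0.0
    prev_floor = 0            # floor(0 + u) is 0
    for i, qi in enumerate(q):
        running += qi
        if i == len(q) - 1:
            cur_floor = budget                                # guard against float drift
        else:
            cur_floor = min(math.floor(running + u), budget)
        # Hypothesis i is selected when an integer falls in (C_{i-1} + u, C_i + u].
        indicators.append(cur_floor - prev_floor)
        prev_floor = cur_floor
    return indicators
```

Each interval has length q_i <= 1, so it contains at most one integer and contains one with probability q_i; the counts telescope to the budget. The indicators are strongly dependent across hypotheses, so marginal validity follows but joint properties need a separate argument.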

Circularity Check

0 steps flagged

No circularity: derivations rely on standard validity arguments and external statistical concepts

full rationale

The paper's abstract and described framework establish validity, optimality, and admissibility guarantees for a budget-constrained active testing procedure using auxiliary statistics and probabilistic allocation. No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations are visible in the provided text that would reduce the claimed optimality or validity results to tautological constructions. The central claims build on classical p-value/e-value properties and budget constraints without evident reduction to inputs by definition. The global coupling mechanism for exact budget satisfaction is presented as a construction that preserves validity, with theoretical results stated as established rather than derived circularly from the same assumptions. This is the common honest case of a self-contained statistical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard probability theory for p-value and e-value validity plus the existence of informative auxiliary statistics; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced.

axioms (1)
  • [standard math] Standard assumptions ensuring that p-values and e-values remain valid under the probabilistic allocation rule
    The procedure is claimed to guarantee valid p-values or e-values, which presupposes underlying measure-theoretic probability results for multiple testing.

pith-pipeline@v0.9.0 · 5454 in / 1418 out tokens · 70905 ms · 2026-05-17T03:27:44.710115+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning U-Statistics with Active Inference

    stat.ML · 2026-05 · unverdicted · novelty 6.0

    Active inference framework for U-statistics using augmented IPW to optimize label queries and minimize variance under budget constraints.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 1 internal anchor
