Unified Conformalized Multiple Testing with Full Data Efficiency
Pith reviewed 2026-05-22 12:56 UTC · model grok-4.3
The pith
A unified framework uses all available data to raise power in conformalized multiple testing while controlling the false discovery rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing non-conformity scores from the entire collection of null, alternative, and unlabelled points and then calibrating p-values through a full permutation of the combined scores, the procedure controls the false discovery rate at the target level and simultaneously increases the number of discoveries that can be made.
What carries the argument
The full permutation strategy applied to non-conformity scores built from the pooled dataset, which simultaneously improves score quality and enlarges the calibration sample.
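For orientation, the baseline that this strategy extends can be sketched as follows. This is a minimal sketch of the standard split-conformal p-value followed by the Benjamini-Hochberg step-up rule, not the paper's full-permutation procedure, whose details the summary does not give. The function names `conformal_pvalues` and `benjamini_hochberg` are illustrative, and the sketch assumes larger scores indicate stronger evidence against the null.

```python
import numpy as np


def conformal_pvalues(cal_scores, test_scores):
    """Split-conformal p-values from null calibration scores.

    Assumption: a larger score means stronger evidence against the null.
    """
    cal_scores = np.sort(np.asarray(cal_scores, dtype=float))
    test_scores = np.asarray(test_scores, dtype=float)
    n = cal_scores.size
    # Count calibration scores that are >= each test score.
    n_ge = n - np.searchsorted(cal_scores, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(pvals, alpha):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(pvals[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        # Reject every hypothesis up to the largest sorted index that passes.
        reject[order[: passed[-1] + 1]] = True
    return reject
```

In this baseline only the held-out null scores enter the calibration step, which is the data inefficiency the pooled construction is meant to remove. Note that the smallest attainable p-value is 1/(n+1), so a small calibration set directly limits how many hypotheses can be rejected.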
If this is right
- The same dataset yields more true discoveries at any fixed false-discovery-rate target.
- A practitioner can compare several candidate score functions on the full data and retain the one that produces the largest number of rejections without extra splitting.
- The method applies directly to mixtures of labelled and unlabelled observations without requiring separate calibration samples.
- Power gains appear across both low- and high-dimensional settings provided the exchangeability condition on the scores is preserved.
Where Pith is reading between the lines
- The same full-permutation logic could be tested on streaming or sequentially arriving data to see whether power remains stable as the sample grows.
- Replacing the hand-crafted scores with scores learned from a neural network trained on the pooled data might further enlarge the power advantage.
- The automatic selection rule inside the framework could be applied to other conformal tasks such as constructing prediction sets rather than testing.
Load-bearing premise
The full permutation of scores built from the combined data is assumed to deliver valid false-discovery-rate control, which holds only when the observations satisfy exchangeability or the chosen scores obey the required symmetry properties.
What would settle it
Generate data that deliberately violates exchangeability, such as a sequence with a clear time trend, run the procedure at a nominal false-discovery-rate level of 0.05, and check whether the realized proportion of false discoveries exceeds 0.05 by a statistically significant margin.
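A check of this kind could be sketched as below. The simulation design is entirely hypothetical: Gaussian scores, a linear drift on the test points standing in for the time trend, and the standard split-conformal p-value with Benjamini-Hochberg standing in for the paper's procedure, which would need to be substituted in to test the actual claim. The names `realized_fdp` and `mean_fdp` and all default parameters are illustrative assumptions.

```python
import numpy as np


def conformal_pvalues(cal, test):
    # Standard split-conformal p-values; larger score = more extreme.
    cal = np.sort(cal)
    n_ge = cal.size - np.searchsorted(cal, test, side="left")
    return (1.0 + n_ge) / (cal.size + 1.0)


def bh_reject(p, alpha):
    # Benjamini-Hochberg step-up rule, returned as a boolean mask.
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True
    return reject


def realized_fdp(trend, rng, n_cal=200, n_test=200, pi1=0.2, shift=3.0, alpha=0.05):
    """One replication: false discovery proportion under a drift of size `trend`."""
    cal = rng.normal(0.0, 1.0, n_cal)               # null calibration scores
    is_alt = rng.random(n_test) < pi1               # true alternatives among test points
    drift = trend * np.linspace(0.0, 1.0, n_test)   # hypothetical time trend
    test = rng.normal(0.0, 1.0, n_test) + drift + shift * is_alt
    reject = bh_reject(conformal_pvalues(cal, test), alpha)
    false_rejections = np.sum(reject & ~is_alt)
    return false_rejections / max(reject.sum(), 1)


def mean_fdp(trend, n_rep=200, seed=0):
    """Monte Carlo estimate of the FDR: average realized FDP over replications."""
    rng = np.random.default_rng(seed)
    return float(np.mean([realized_fdp(trend, rng) for _ in range(n_rep)]))
```

Setting `trend=0.0` keeps calibration and test nulls exchangeable and serves as the control arm; a positive `trend` breaks exchangeability. Comparing `mean_fdp(0.0)` with `mean_fdp(2.0)` against the nominal 0.05, together with a standard error across replications, gives the significance check the text describes.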
Original abstract
Conformalized multiple testing offers a model-free way to control predictive uncertainty in decision-making. Existing methods typically use only part of the available data to build score functions tailored to specific settings. We propose a unified framework that puts data utilisation at the centre: it uses all available data (null, alternative, and unlabelled) to construct scores and calibrate p-values through a full permutation strategy. This unified use of all available data significantly improves power by enhancing non-conformity score quality and maximising calibration set size while rigorously controlling the false discovery rate. Crucially, our framework provides a systematic design principle for conformal testing and enables automatic selection of the best conformal procedure among candidates without extra data splitting. Extensive numerical experiments demonstrate that our enhanced methods deliver superior efficiency and adaptability across diverse scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a unified framework for conformalized multiple testing that uses all available data (null, alternative, and unlabelled) to construct non-conformity scores and calibrate p-values via a full permutation strategy. This is claimed to improve power through better score quality and larger calibration sets while rigorously controlling the false discovery rate, and to enable automatic selection of the optimal conformal procedure among candidates without additional data splitting. Numerical experiments are presented to demonstrate superior efficiency and adaptability.
Significance. If the FDR control is rigorously established, the framework could meaningfully advance conformal multiple testing by addressing data efficiency, a common practical limitation in existing methods. The systematic design principle and automatic selection feature would be valuable contributions, particularly if the approach generalizes across diverse scenarios as suggested by the experiments.
major comments (1)
- The central FDR control claim depends on the full permutation strategy remaining valid when non-conformity scores are constructed from the combined dataset including alternatives. The proof (in the theoretical guarantees section) must explicitly show how exchangeability or a suitable bound is preserved for true-null hypotheses despite the score function's dependence on alternative data; without this, the symmetry required for valid p-values may not hold.
minor comments (1)
- The abstract and introduction could more explicitly reference the specific prior conformal multiple testing methods being unified, to better situate the contribution.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of the unified conformalized multiple testing framework. We address the major comment below in detail.
- Referee: The central FDR control claim depends on the full permutation strategy remaining valid when non-conformity scores are constructed from the combined dataset including alternatives. The proof (in the theoretical guarantees section) must explicitly show how exchangeability or a suitable bound is preserved for true-null hypotheses despite the score function's dependence on alternative data; without this, the symmetry required for valid p-values may not hold.
Authors: We thank the referee for this precise observation on the theoretical foundation. We agree that greater explicitness is warranted to clarify how exchangeability is preserved for true-null hypotheses when the score function depends on the full dataset (including alternatives). In the revised manuscript we will expand the theoretical guarantees section with a new dedicated paragraph and supporting lemma. The argument proceeds by conditioning on the alternative observations (which are fixed) and showing that the full-permutation distribution over the remaining indices still induces uniform ranks for the true-null p-values, because the score function is applied symmetrically across all permuted assignments. This establishes the required marginal validity without relying on full joint exchangeability of the entire sample. We believe this addition will fully resolve the concern while leaving the main results unchanged.
Revision: yes
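For reference, the marginal validity at issue has a standard form in the split-conformal setting, stated here as a hedged reminder rather than as the authors' lemma, which the source does not reproduce. The notation is assumed for illustration: $S_1, \dots, S_n$ are calibration scores from true nulls and $S_{n+j}$ is the score of the $j$-th test point.

```latex
p_j \;=\; \frac{1 + \sum_{i=1}^{n} \mathbf{1}\{S_i \ge S_{n+j}\}}{n + 1},
\qquad
\text{and if } (S_1, \dots, S_n, S_{n+j}) \text{ is exchangeable, then }
\Pr(p_j \le t) \;\le\; t \quad \text{for all } t \in [0, 1].
```

The referee's concern amounts to asking whether the exchangeability hypothesis in this statement survives when the score function itself is fitted on data that include alternatives; the promised lemma would need to establish that conditionally on those alternatives.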
Circularity Check
Framework rests on standard conformal and permutation principles with no load-bearing self-definition or fitted-input prediction
The paper's central construction, which uses the full dataset (null, alternative, and unlabelled) to build non-conformity scores and then applies a full-permutation calibration, directly invokes the classical exchangeability assumption of conformal prediction and the permutation test for FDR control. No equation or theorem reduces a claimed prediction to a parameter fitted from the same quantity by construction, nor does any uniqueness result or ansatz depend on a self-citation chain that itself lacks independent verification. The derivation therefore remains self-contained against external benchmarks of conformal validity and permutation-based multiple testing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The data points satisfy exchangeability conditions sufficient for the permutation-based calibration to control FDR.
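The two notions in this axiom can be written out using their textbook definitions. The symbols are assumed for illustration and do not come from the paper: $Z_1, \dots, Z_n$ are the data points, $V$ is the number of false rejections, $R$ is the total number of rejections, and $\alpha$ is the target level.

```latex
\text{Exchangeability:}\quad
(Z_1, \dots, Z_n) \;\overset{d}{=}\; (Z_{\sigma(1)}, \dots, Z_{\sigma(n)})
\quad \text{for every permutation } \sigma \text{ of } \{1, \dots, n\}.

\text{FDR control:}\quad
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right] \;\le\; \alpha .
```

Independent and identically distributed data are exchangeable, but the converse does not hold, so the assumption is weaker than i.i.d. sampling.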