Unified Conformalized Multiple Testing with Full Data Efficiency

Changliang Zou; Haojie Ren; Xiaoyang Wu; Yuyang Huo

arxiv: 2508.12085 · v2 · pith:TWEPOFCWnew · submitted 2025-08-16 · 📊 stat.ME · stat.ML

Pith reviewed 2026-05-22 12:56 UTC · model grok-4.3

keywords: conformalized multiple testing · false discovery rate · full permutation · data efficiency · non-conformity scores · unlabelled data · power improvement

The pith

A unified framework uses all available data to raise power in conformalized multiple testing while controlling the false discovery rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a single framework that folds null, alternative, and unlabelled observations together when building non-conformity scores and when calibrating p-values. It replaces partial-data splits with a full permutation strategy that keeps every observation in play for both steps. This produces stronger scores and a larger effective calibration set, which in turn lifts detection power while still guaranteeing false-discovery-rate control. The same construction supplies a rule for picking the strongest procedure from a list of candidates without setting aside extra data.

Core claim

By constructing non-conformity scores from the entire collection of null, alternative, and unlabelled points and then calibrating p-values through a full permutation of the combined scores, the procedure controls the false discovery rate at the target level and simultaneously increases the number of discoveries that can be made.
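
As a point of reference, the two generic ingredients named in the claim (conformal p-values and a false-discovery-rate step) can be sketched as follows. This is a minimal sketch of the standard split-calibration baseline followed by the Benjamini-Hochberg procedure, not the paper's full-permutation calibration, which this page does not specify. The function names and the convention that larger scores indicate stronger evidence against the null are assumptions made for illustration.

```python
from typing import List, Sequence


def conformal_p_values(calib_scores: Sequence[float],
                       test_scores: Sequence[float]) -> List[float]:
    """Conformal p-value of each test score against null calibration scores.

    Assumption: a larger score means stronger evidence against the null.
    """
    n = len(calib_scores)
    # p = (1 + number of calibration scores at least as large) / (n + 1)
    return [(1 + sum(c >= s for c in calib_scores)) / (n + 1)
            for s in test_scores]


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        # Track the largest rank whose sorted p-value is under its threshold.
        if p_values[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy example with integer scores: 99 calibration scores, 4 test scores.
calib = list(range(1, 100))
pvals = conformal_p_values(calib, [200, 300, 50, 3])
rejected = benjamini_hochberg(pvals, alpha=0.1)  # → [0, 1]
```

In this baseline the calibration set contains only held-out null scores; the paper's stated contribution is to avoid that restriction by using all observations for both score construction and calibration.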

What carries the argument

The argument rests on the full permutation strategy applied to non-conformity scores built from the pooled dataset, which simultaneously improves score quality and enlarges the calibration sample.

If this is right

The same dataset yields more true discoveries at any fixed false-discovery-rate target.
A practitioner can compare several candidate score functions on the full data and retain the one that produces the largest number of rejections without extra splitting.
The method applies directly to mixtures of labelled and unlabelled observations without requiring separate calibration samples.
Power gains appear across both low- and high-dimensional settings provided the exchangeability condition on the scores is preserved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same full-permutation logic could be tested on streaming or sequentially arriving data to see whether power remains stable as the sample grows.
Replacing the hand-crafted scores with scores learned from a neural network trained on the pooled data might further enlarge the power advantage.
The automatic selection rule inside the framework could be applied to other conformal tasks such as constructing prediction sets rather than testing.

Load-bearing premise

The full permutation of scores built from the combined data is assumed to deliver valid false-discovery-rate control, which holds only when the observations satisfy exchangeability or the chosen scores obey the required symmetry properties.
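
For orientation, the standard validity statement that exchangeability buys in the split-calibration setting can be written as below. The notation is an assumption introduced here, not taken from the paper: $S_1,\dots,S_n$ are null calibration scores, $S_{n+j}$ is the score of test point $j$, and larger scores indicate stronger evidence against the null. The paper's full-permutation p-values are a different construction whose exact form this page does not give.

```latex
p_j \;=\; \frac{1 + \sum_{i=1}^{n} \mathbf{1}\{S_i \ge S_{n+j}\}}{n+1},
\qquad
\mathbb{P}\bigl(p_j \le t\bigr) \;\le\; t
\quad \text{for all } t \in [0,1],
```

where the inequality holds whenever test point $j$ is a true null and $S_1,\dots,S_n,S_{n+j}$ are exchangeable. If the score function is fitted in a way that treats the test point differently from the calibration points, this exchangeability can fail, which is the concern the premise above describes.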

What would settle it

Generate data that deliberately violates exchangeability, such as a sequence with a clear time trend, run the procedure at a nominal false-discovery-rate level of 0.05 over many independent replications, and check whether the average realized proportion of false discoveries exceeds 0.05 by a statistically significant margin.
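
A check of this kind could be sketched as follows. This is a hedged illustration only: it uses split-conformal p-values with Benjamini-Hochberg as a stand-in for the paper's procedure, models the time trend as a constant mean shift `drift` between calibration and test nulls, and every name, sample size, and distribution below is an assumption chosen for the example.

```python
import random
from bisect import bisect_left


def conformal_bh(calib, test, alpha):
    """Split-conformal p-values followed by Benjamini-Hochberg.

    Stand-in procedure (an assumption), not the paper's full-permutation method.
    Returns the indices of rejected test points.
    """
    calib_sorted = sorted(calib)
    n, m = len(calib_sorted), len(test)
    # n - bisect_left(...) counts calibration scores >= the test score.
    pvals = [(1 + n - bisect_left(calib_sorted, s)) / (n + 1) for s in test]
    order = sorted(range(m), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= alpha * rank / m:
            k = rank
    return order[:k]


def mean_fdp(drift, alpha=0.05, n_calib=200, n_null=80, n_alt=20,
             reps=200, seed=0):
    """Average false discovery proportion over independent replications.

    drift = 0 keeps calibration and test nulls exchangeable; drift > 0
    shifts the test-time scores, mimicking a time trend.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        calib = [rng.gauss(0.0, 1.0) for _ in range(n_calib)]
        nulls = [rng.gauss(drift, 1.0) for _ in range(n_null)]
        alts = [rng.gauss(drift + 3.0, 1.0) for _ in range(n_alt)]
        rejected = conformal_bh(calib, nulls + alts, alpha)
        # Test indices below n_null are true nulls, so they are false discoveries.
        false = sum(1 for j in rejected if j < n_null)
        total += false / max(len(rejected), 1)
    return total / reps


if __name__ == "__main__":
    print("exchangeable nulls:", mean_fdp(drift=0.0))
    print("drifted nulls:     ", mean_fdp(drift=1.5))
```

Comparing the two averages against the nominal level, with a standard error computed across replications, gives the significance check described above; swapping `conformal_bh` for an implementation of the paper's procedure would turn the sketch into the actual test.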

Original abstract

Conformalized multiple testing offers a model-free way to control predictive uncertainty in decision-making. Existing methods typically use only part of the available data to build score functions tailored to specific settings. We propose a unified framework that puts data utilisation at the centre: it uses all available data (null, alternative, and unlabelled) to construct scores and calibrate p-values through a full permutation strategy. This unified use of all available data significantly improves power by enhancing non-conformity score quality and maximising calibration set size while rigorously controlling the false discovery rate. Crucially, our framework provides a systematic design principle for conformal testing and enables automatic selection of the best conformal procedure among candidates without extra data splitting. Extensive numerical experiments demonstrate that our enhanced methods deliver superior efficiency and adaptability across diverse scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a unified framework for conformalized multiple testing that uses all available data (null, alternative, and unlabelled) to construct non-conformity scores and calibrate p-values via a full permutation strategy. This is claimed to improve power through better score quality and larger calibration sets while rigorously controlling the false discovery rate, and to enable automatic selection of the optimal conformal procedure among candidates without additional data splitting. Numerical experiments are presented to demonstrate superior efficiency and adaptability.

Significance. If the FDR control is rigorously established, the framework could meaningfully advance conformal multiple testing by addressing data efficiency, a common practical limitation in existing methods. The systematic design principle and automatic selection feature would be valuable contributions, particularly if the approach generalizes across diverse scenarios as suggested by the experiments.

major comments (1)

The central FDR control claim depends on the full permutation strategy remaining valid when non-conformity scores are constructed from the combined dataset including alternatives. The proof (in the theoretical guarantees section) must explicitly show how exchangeability or a suitable bound is preserved for true-null hypotheses despite the score function's dependence on alternative data; without this, the symmetry required for valid p-values may not hold.

minor comments (1)

The abstract and introduction could more explicitly reference the specific prior conformal multiple testing methods being unified, to better situate the contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the unified conformalized multiple testing framework. We address the major comment below in detail.

Referee: The central FDR control claim depends on the full permutation strategy remaining valid when non-conformity scores are constructed from the combined dataset including alternatives. The proof (in the theoretical guarantees section) must explicitly show how exchangeability or a suitable bound is preserved for true-null hypotheses despite the score function's dependence on alternative data; without this, the symmetry required for valid p-values may not hold.

Authors: We thank the referee for this precise observation on the theoretical foundation. We agree that greater explicitness is warranted to clarify how exchangeability is preserved for true-null hypotheses when the score function depends on the full dataset (including alternatives). In the revised manuscript we will expand the theoretical guarantees section with a new dedicated paragraph and supporting lemma. The argument proceeds by conditioning on the alternative observations (which are fixed) and showing that the full-permutation distribution over the remaining indices still induces uniform ranks for the true-null p-values, because the score function is applied symmetrically across all permuted assignments. This establishes the required marginal validity without relying on full joint exchangeability of the entire sample. We believe this addition will fully resolve the concern while leaving the main results unchanged.

Revision: yes

Circularity Check

0 steps flagged

The framework rests on standard conformal and permutation principles, with no load-bearing self-definition or fitted-input prediction.

The paper's central construction, which uses the full dataset (null, alternative, and unlabelled) to build non-conformity scores and then applies a full-permutation calibration, directly invokes the classical exchangeability assumption of conformal prediction and the permutation test for FDR control. No equation or theorem reduces a claimed prediction to a parameter fitted from the same quantity by construction, nor does any uniqueness result or ansatz depend on a self-citation chain that itself lacks independent verification. The derivation therefore remains self-contained against external benchmarks of conformal validity and permutation-based multiple testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from conformal prediction and permutation testing rather than introducing many new fitted parameters or invented entities.

axioms (1)

Domain assumption: The data points satisfy exchangeability conditions sufficient for the permutation-based calibration to control FDR. Invoked implicitly when claiming rigorous FDR control via full permutation on the combined data.

pith-pipeline@v0.9.0 · 5658 in / 1073 out tokens · 46657 ms · 2026-05-22T12:56:08.412620+00:00 · methodology
