pith. machine review for the scientific record.

arxiv: 2604.09779 · v1 · submitted 2026-04-10 · 📊 stat.ME

Recognition: unknown

Inference conditional on selection: a review

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3

classification 📊 stat.ME
keywords selective inference · post-selection inference · conditional inference · data-driven selection · confidence intervals · clustering · regression trees

The pith

When the target of inference is chosen from the data, classical methods lose their coverage guarantees, but conditioning on the selection event restores validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review examines selective inference, the setting in which the statistical question itself depends on the observed data, for example inference on the group with the largest observed mean or on a region chosen by a fitted regression tree. Standard confidence intervals and hypothesis tests no longer achieve their advertised properties in these cases because the selection step introduces dependence between the target and the data. The authors present concrete examples involving a data-driven winner, regression tree regions, and cluster comparisons to show the breakdown and argue that conditional guarantees, given the selection, are scientifically relevant. They connect several existing approaches that deliver such guarantees and illustrate the methods on simulated data as well as an application to single-cell RNA sequencing data.

Core claim

The paper establishes that when an inference target such as the mean of a selected winner or the difference between two data-chosen clusters is a function of the data, unconditional classical procedures do not attain nominal coverage or type I error control. Conditioning the inference on the event that produced the selection restores these guarantees, and the review surveys and links the main techniques that achieve this conditional validity.
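The failure mode is easy to reproduce. The Monte Carlo sketch below is not taken from the paper; the arm count, seed, and known unit variance are illustrative assumptions. It checks how often the classical 95 percent interval, centered at a data-selected winner, covers that winner's true mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n_arms = 2000, 10
covered = 0
for _ in range(n_rep):
    x = rng.normal(0.0, 1.0, size=n_arms)  # every arm's true mean is 0, sd 1 known
    winner = x.max()                       # the inference target is chosen from the data
    covered += abs(winner - 0.0) <= 1.96   # does the classical 95% interval cover 0?
print(covered / n_rep)  # noticeably below the nominal 0.95
```

Because the winner is the maximum of ten draws, it is biased upward, and the interval centered at it frequently fails to reach down to the true mean.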

What carries the argument

Conditioning on the selection event, which adjusts the sampling distribution to account for the data-dependent choice of the inference target.
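For a single Gaussian coordinate, the adjustment has a closed form: conditioning on a one-sided selection event turns the sampling distribution into a truncated normal. The notation below is a standard sketch, not taken from the paper:

```latex
% X ~ N(\mu, \sigma^2), reported only when the selection event {X >= c} occurs.
% The conditional density is the normal density renormalized to that region:
f(x \mid X \ge c;\, \mu) \;=\;
  \frac{\tfrac{1}{\sigma}\,\phi\!\big((x - \mu)/\sigma\big)}
       {1 - \Phi\!\big((c - \mu)/\sigma\big)},
  \qquad x \ge c,
% where \phi and \Phi are the standard normal density and CDF. Intervals and
% tests built from this truncated distribution retain their nominal guarantees
% conditional on selection.
```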

If this is right

  • Confidence intervals for the mean of a data-selected winner attain nominal coverage.
  • Means of regions chosen by a regression tree can be estimated with valid intervals.
  • Tests for differences between data-chosen clusters control type I error at the nominal level.
  • The same conditional procedures apply to high-dimensional genomic data sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conditioning idea could apply to variable selection in regression or feature selection in prediction models.
  • Routine use might reduce overconfidence in exploratory analyses across many fields.
  • Software that automates these conditional calculations would make the approach practical for everyday data work.

Load-bearing premise

That scientists want and can use inference statements that are valid only after conditioning on how the target was selected from the data.

What would settle it

Repeatedly simulate data from a known distribution, apply a data-dependent selection rule such as picking the largest sample mean, compute both unconditional and conditional 95 percent intervals for the true parameter of the selected target, and check whether only the conditional intervals achieve approximately 95 percent coverage.
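That check can be sketched directly. Nothing in the example below comes from the paper: the `conditional_ci` helper, arm count, seed, and known unit variance are illustrative assumptions. The conditional interval inverts the truncated-normal pivot obtained by conditioning on the winner exceeding the runner-up, in the spirit of the truncation-based approaches the review surveys:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def conditional_ci(x_obs, cutoff, alpha=0.05):
    """Interval for mu when X ~ N(mu, 1) is reported only because X >= cutoff."""
    def pivot(mu):
        # P(X >= x_obs | X >= cutoff; mu), computed on the log scale for stability;
        # this probability is increasing in mu.
        return np.exp(norm.logsf(x_obs - mu) - norm.logsf(cutoff - mu))
    left = x_obs - 10.0
    while pivot(left) > alpha / 2:       # widen until the bracket straddles both roots
        left -= 10.0
    lo = brentq(lambda m: pivot(m) - alpha / 2, left, x_obs + 10.0)
    hi = brentq(lambda m: pivot(m) - (1 - alpha / 2), left, x_obs + 10.0)
    return lo, hi

rng = np.random.default_rng(1)
n_rep, n_arms = 500, 5
cover_uncond = cover_cond = 0
for _ in range(n_rep):
    x = rng.normal(0.0, 1.0, size=n_arms)   # all true arm means are 0
    j = int(np.argmax(x))
    runner_up = np.partition(x, -2)[-2]     # selection event: x[j] >= runner_up
    cover_uncond += abs(x[j]) <= 1.96       # classical interval ignores the selection
    lo, hi = conditional_ci(x[j], runner_up)
    cover_cond += lo <= 0.0 <= hi
# unconditional coverage lands well below 0.95; conditional coverage is near 0.95
print(cover_uncond / n_rep, cover_cond / n_rep)
```

Because the other arms are independent of the winner, conditioning on the winner's identity and the runner-up's value makes the winner exactly truncated normal, so the conditional pivot is exact and the conditional intervals recover the nominal rate.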

Figures

Figures reproduced from arXiv: 2604.09779 by Anna Neufeld, Daniela Witten, Ronan Perry.

Figure 1. The results of the simulation study of Section 5. For each of 1,000 simulated datasets, […] (view at source ↗)
Figure 2. The left panel shows uniform QQ plots of the 1000 p-values resulting from each of the five […] (view at source ↗)
read the original abstract

In this article, we review selective inference, a set of techniques for inference when the statistical question asked is a function of the data. This setting often arises in contemporary scientific workflows, where hypotheses and parameters may be selected from the data, rather than specified in advance. In this setting, classical inferential techniques do not achieve "classical" guarantees, such as nominal coverage of confidence intervals. We consider three examples for which selective inference solutions are required: inference on a "winner", inference on the mean of a region in a regression tree, and inference on the difference in means between a pair of clusters. We argue that conditional guarantees are of scientific interest in such settings. We then review and draw connections between several approaches that provide such guarantees. Finally, we illustrate these approaches in simulation and through an application to single-cell RNA sequencing data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reviews selective inference, a framework for valid inference when the statistical target (e.g., a parameter or hypothesis) is chosen as a function of the data. It demonstrates that classical unconditional methods lose their nominal guarantees (such as coverage of confidence intervals) in this setting, using three concrete examples: inference on a 'winner,' the mean of a selected region in a regression tree, and the difference in means between a pair of selected clusters. The manuscript argues that conditional guarantees (conditioning on the selection event) are scientifically relevant, surveys and connects several existing approaches that deliver such guarantees, and illustrates the methods through simulation studies and an application to single-cell RNA sequencing data.

Significance. If the connections between approaches are accurately drawn and the examples are representative, the review would be a useful synthesis for a field that has grown rapidly in response to data-driven workflows. The provision of concrete, reproducible illustrations (simulations plus real-data example) and the explicit framing of conditional inference as a modeling choice rather than an unexamined default add practical value for applied statisticians.

minor comments (3)
  1. [Abstract] The claim that classical methods 'do not achieve classical guarantees' is correct in principle but would be strengthened by a one-sentence indication of the typical magnitude of undercoverage observed in the three examples.
  2. The manuscript states that it 'draws connections between several approaches'; a short comparative table (or diagram) summarizing the assumptions, computational requirements, and scope of each reviewed method would make these connections more immediately usable.
  3. In the single-cell RNA-seq application, the precise definition of the selection event and the conditioning set should be stated explicitly so that readers can verify how the conditional guarantee is implemented in that setting.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. The referee's summary correctly identifies the core contributions: the review of selective inference techniques that deliver conditional guarantees, the concrete examples (winner inference, regression trees, clustering), the argument for the scientific relevance of conditioning on selection, and the illustrations via simulations and single-cell RNA sequencing data. We appreciate the recognition that such a synthesis can be useful for applied statisticians working with data-driven workflows.

Circularity Check

0 steps flagged

No significant circularity in this review paper

full rationale

This is a review article summarizing established results on selective inference and the failure of unconditional inference under data-dependent selection. No new theorems, derivations, or empirical predictions are advanced; the three examples (winner inference, regression-tree region mean, cluster-mean difference) are illustrative of known issues rather than original claims that reduce to fitted inputs or self-citations. The central premise that classical methods lose nominal coverage is presented as a standard fact from prior literature, and the preference for conditional guarantees is framed as a modeling choice, not a load-bearing derivation. The paper is self-contained against external benchmarks with no self-definitional steps or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This review relies on standard statistical assumptions from the selective inference literature it summarizes, with no free parameters, invented entities, or ad hoc axioms introduced by the authors themselves.

axioms (1)
  • domain assumption: Conditional inference provides scientifically meaningful guarantees in settings where selection depends on the data.
    The abstract states: 'We argue that conditional guarantees are of scientific interest in such settings.'

pith-pipeline@v0.9.0 · 5432 in / 1189 out tokens · 35909 ms · 2026-05-10T16:43:44.844643+00:00 · methodology

discussion (0)


    The signal is strong enough that the estimated clusters tend to approximate the true clus- ters fairly closely. •Strong:β“4. The true clusters are well separated, meaning that the estimated clusters are equal to the true clusters with probability close to 1. In each signal setting, we report the unconditional coverage, i.e. the proportion of the 2000 conf...