Inference conditional on selection: a review
Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3
The pith
When the target of inference is chosen from the data, classical methods lose their coverage guarantees, but conditioning on the selection event restores validity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that when an inference target such as the mean of a selected winner or the difference between two data-chosen clusters is a function of the data, unconditional classical procedures do not attain nominal coverage or type I error control. Conditioning the inference on the event that produced the selection restores these guarantees, and the review surveys and links the main techniques that achieve this conditional validity.
What carries the argument
Conditioning on the selection event, which adjusts the sampling distribution to account for the data-dependent choice of the inference target.
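For the winner example, the conditioning step can be written out explicitly. The following is a sketch under assumed notation that does not come from the paper: independent observations $X_i \sim N(\mu_i, 1)$ with known unit variance, winner index $\hat\imath = \arg\max_i X_i$, and an interval $C(X)$ for the winner's mean. It conditions on the other observations as well as on the identity of the winner, which is one of several possible conditioning choices.

```latex
% Assumed setup: independent X_i ~ N(mu_i, 1), i = 1, ..., K,
% with winner \hat{\imath} = \arg\max_i X_i.
% Target: coverage that holds conditional on which index was selected.
\[
  \Pr\bigl(\mu_{\hat\imath} \in C(X) \,\big|\, \hat\imath = i\bigr) \;\ge\; 1 - \alpha .
\]
% Given \hat\imath = i and the other observations X_{-i}, with c = \max_{j \ne i} X_j,
% X_i is a normal variable truncated to (c, \infty):
\[
  \Pr_{\mu_i}\bigl(X_i \ge x \,\big|\, X_i > c,\; X_{-i}\bigr)
  \;=\; \frac{1 - \Phi(x - \mu_i)}{1 - \Phi(c - \mu_i)},
  \qquad x > c .
\]
```

Inverting this tail probability in $\mu_i$ at levels $\alpha/2$ and $1 - \alpha/2$ gives an interval whose coverage holds conditional on the selection, rather than only on average over an unselected target.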
If this is right
- Confidence intervals for the mean of a data-selected winner attain nominal coverage.
- Means of regions chosen by a regression tree can be estimated with valid intervals.
- Tests for differences between data-chosen clusters control type I error at the nominal level.
- The same conditional procedures apply to high-dimensional genomic data sets.
Where Pith is reading between the lines
- The same conditioning idea could apply to variable selection in regression or feature selection in prediction models.
- Routine use might reduce overconfidence in exploratory analyses across many fields.
- Software that automates these conditional calculations would make the approach practical for everyday data work.
Load-bearing premise
That scientists want and can use inference statements that are valid only after conditioning on how the target was selected from the data.
What would settle it
Repeatedly simulate data from a known distribution, apply a data-dependent selection rule such as picking the largest sample mean, compute both unconditional and conditional 95 percent intervals for the true parameter of the selected target, and check whether only the conditional intervals achieve approximately 95 percent coverage.
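The check described above can be sketched in a few lines of Python. This is an illustrative simulation under stated assumptions, not the paper's own experiment: independent normal observations with known unit variance, selection by largest observed value, and a conditional procedure that conditions on both the winner's identity and the other observations (so the winner is a normal truncated at the runner-up). The function names `normal_sf` and `simulate_coverage` are hypothetical. Instead of solving for the interval endpoints, the code uses the equivalent test that the true mean lies in the inverted interval exactly when the conditional tail probability at the true mean falls between alpha/2 and 1 - alpha/2.

```python
import math
import random


def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def simulate_coverage(true_means, n_reps=20000, alpha=0.05, seed=0):
    """Estimate coverage of unconditional and conditional intervals
    for the mean of the selected winner (assumed N(mu_i, 1) data)."""
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # 0.975 quantile of the standard normal
    uncond_hits = 0
    cond_hits = 0
    for _ in range(n_reps):
        xs = [rng.gauss(mu, 1.0) for mu in true_means]
        # Data-dependent selection: the index with the largest observation.
        winner = max(range(len(xs)), key=xs.__getitem__)
        x_win = xs[winner]
        mu_win = true_means[winner]
        runner_up = max(x for i, x in enumerate(xs) if i != winner)

        # Classical interval x_win +/- z_crit, ignoring the selection.
        if abs(x_win - mu_win) <= z_crit:
            uncond_hits += 1

        # Conditional tail probability of a normal truncated to
        # (runner_up, infinity), evaluated at the true mean. The inverted
        # interval covers mu_win iff this lies in [alpha/2, 1 - alpha/2].
        tail = normal_sf(x_win - mu_win) / normal_sf(runner_up - mu_win)
        if alpha / 2 <= tail <= 1 - alpha / 2:
            cond_hits += 1
    return uncond_hits / n_reps, cond_hits / n_reps


if __name__ == "__main__":
    uncond, cond = simulate_coverage([0.0] * 10)
    print(f"unconditional coverage: {uncond:.3f}")
    print(f"conditional coverage:   {cond:.3f}")
```

With ten equal true means, the classical interval misses whenever the largest of ten standard normals exceeds 1.96, which happens with probability about 1 - 0.975^10, or roughly 0.22, so its coverage should fall well below 95 percent. The conditional check should land close to the nominal level. Varying `true_means` (for example, one mean well separated from the rest) shows how the gap shrinks when the selection is nearly deterministic.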
In this article, we review selective inference, a set of techniques for inference when the statistical question asked is a function of the data. This setting often arises in contemporary scientific workflows, where hypotheses and parameters may be selected from the data, rather than specified in advance. In this setting, classical inferential techniques do not achieve "classical" guarantees, such as nominal coverage of confidence intervals. We consider three examples for which selective inference solutions are required: inference on a "winner", inference on the mean of a region in a regression tree, and inference on the difference in means between a pair of clusters. We argue that conditional guarantees are of scientific interest in such settings. We then review and draw connections between several approaches that provide such guarantees. Finally, we illustrate these approaches in simulation and through an application to single-cell RNA sequencing data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reviews selective inference, a framework for valid inference when the statistical target (e.g., a parameter or hypothesis) is chosen as a function of the data. It demonstrates that classical unconditional methods lose their nominal guarantees (such as coverage of confidence intervals) in this setting, using three concrete examples: inference on a 'winner,' the mean of a selected region in a regression tree, and the difference in means between a pair of selected clusters. The manuscript argues that conditional guarantees (conditioning on the selection event) are scientifically relevant, surveys and connects several existing approaches that deliver such guarantees, and illustrates the methods through simulation studies and an application to single-cell RNA sequencing data.
Significance. If the connections between approaches are accurately drawn and the examples are representative, the review would be a useful synthesis for a field that has grown rapidly in response to data-driven workflows. The provision of concrete, reproducible illustrations (simulations plus real-data example) and the explicit framing of conditional inference as a modeling choice rather than an unexamined default add practical value for applied statisticians.
minor comments (3)
- [Abstract] The claim that classical methods 'do not achieve classical guarantees' is correct in principle but would be strengthened by a one-sentence indication of the typical magnitude of undercoverage observed in the three examples.
- The manuscript states that it 'draws connections between several approaches'; a short comparative table (or diagram) summarizing the assumptions, computational requirements, and scope of each reviewed method would make these connections more immediately usable.
- In the single-cell RNA-seq application, the precise definition of the selection event and the conditioning set should be stated explicitly so that readers can verify how the conditional guarantee is implemented in that setting.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. The referee's summary correctly identifies the core contributions: the review of selective inference techniques that deliver conditional guarantees, the concrete examples (winner inference, regression trees, clustering), the argument for the scientific relevance of conditioning on selection, and the illustrations via simulations and single-cell RNA sequencing data. We appreciate the recognition that such a synthesis can be useful for applied statisticians working with data-driven workflows.
Circularity Check
No significant circularity in this review paper
This is a review article summarizing established results on selective inference and the failure of unconditional inference under data-dependent selection. No new theorems, derivations, or empirical predictions are advanced; the three examples (winner inference, regression-tree region mean, cluster-mean difference) are illustrative of known issues rather than original claims that reduce to fitted inputs or self-citations. The central premise that classical methods lose nominal coverage is presented as a standard fact from prior literature, and the preference for conditional guarantees is framed as a modeling choice, not a load-bearing derivation. The paper is self-contained against external benchmarks with no self-definitional steps or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: conditional inference provides scientifically meaningful guarantees in settings where selection depends on the data.