Post-selection Inference in Multiverse Analysis (PIMA): an inferential framework based on the sign flipping score test

Anna Vesely; Antonio Calcagnì; Daniël Lakens; Gianmarco Altoè; Livio Finos; Massimiliano Pastore; Paolo Girardi

arxiv: 2210.02794 · v2 · submitted 2022-10-06 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-24 10:56 UTC · model grok-4.3

keywords multiverse analysis · post-selection inference · sign flipping score test · family-wise error rate · generalized linear models · specification curve analysis · replication crisis

The pith

PIMA tests whether a predictor is associated with the outcome by combining evidence from every reasonable analysis model while strongly controlling family-wise error rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers face many justifiable but arbitrary choices when preparing and modeling data, which can lead to selective reporting. The paper introduces PIMA as a way to move multiverse analysis from pure description to formal inference. It pools results across the full set of defensible specifications to test the null that a given predictor has no association with the outcome. If the method works as described, analysts can reject that null for each individual specification that reaches significance, with a guarantee that the overall chance of any erroneous rejection stays at the nominal level. Readers would care because the procedure offers a concrete route to claims about effects without having to defend one chosen model over equally plausible alternatives.

Core claim

The paper claims that the Post-selection Inference in Multiverse Analysis (PIMA) framework, built on a conditional resampling procedure that uses the sign flipping score test, tests the null hypothesis of no association between a predictor and the outcome by merging information from the entire multiverse of reasonable analyses and pre-processing choices. The method applies to any generalized linear model and supplies strong family-wise error rate control, which permits the direct claim that the null hypothesis can be rejected for each specification that shows a significant effect.

What carries the argument

The conditional resampling procedure based on the sign flipping score test, which integrates results across the multiverse of model specifications while adjusting for post-selection.
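As a rough illustration of the kind of test involved, the sketch below implements a basic sign flipping score test for one specification. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a Gaussian linear model (identity link), uses unstandardized score contributions, and the function name `sign_flip_score_test` and its parameters are hypothetical.

```python
import numpy as np

def sign_flip_score_test(y, x, Z, n_flips=999, rng=None):
    """Basic sign-flip score test of 'x has no effect on y' given nuisance Z.

    Assumption: Gaussian linear model, so the null fit is ordinary least squares.
    """
    rng = np.random.default_rng(rng)
    # Fit the null model (nuisance covariates only) and take its residuals.
    beta0, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta0
    # Residualize the tested predictor on the nuisance covariates.
    gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_res = x - Z @ gamma
    # Per-observation score contributions; their sum is the observed score.
    contrib = x_res * resid
    t_obs = contrib.sum()
    # Randomly flip the sign of each contribution and recompute the sum.
    flips = rng.choice([-1.0, 1.0], size=(n_flips, y.size))
    t_flip = flips @ contrib
    # Two-sided p-value; the observed statistic is counted among the flips.
    stats = np.abs(np.concatenate([[t_obs], t_flip]))
    p_value = np.mean(stats >= stats[0])
    return t_obs, p_value

# Usage with simulated data (intercept-only nuisance model).
demo_rng = np.random.default_rng(0)
x_demo = demo_rng.normal(size=200)
y_demo = 2.0 * x_demo + demo_rng.normal(size=200)
t_demo, p_demo = sign_flip_score_test(y_demo, x_demo, np.ones((200, 1)), rng=1)
```

For other generalized linear models the null fit and the residualization would need the appropriate weights; that extension is left out here.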

If this is right

The null hypothesis of no association can be tested by pooling evidence from all justifiable model specifications at once.
Strong family-wise error rate control allows rejection of the null for each specification that individually reaches significance.
The procedure extends to arbitrary pre-processing steps and any generalized linear model.
Researchers can identify which specific analysis paths support an effect rather than only whether at least one path rejects the null.
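One standard way to turn a joint resampling distribution into per-specification decisions is a single-step max-T adjustment: the same sign flips are applied to every specification, and each observed statistic is compared with the flip distribution of the maximum. The sketch below is illustrative only. Whether it matches the combination rule used in the paper is an assumption, and `maxt_adjusted_pvalues` and the `contribs` layout are hypothetical names.

```python
import numpy as np

def maxt_adjusted_pvalues(contribs, n_flips=999, rng=None):
    """Single-step max-T adjusted p-values from shared sign flips.

    contribs: array of shape (n_obs, n_specs) holding per-observation score
    contributions, one column per specification (assumed nonzero columns).
    """
    rng = np.random.default_rng(rng)
    n_obs, n_specs = contribs.shape
    # One flip vector per resample, shared by all specifications so that
    # the dependence between specifications is preserved.
    flips = rng.choice([-1.0, 1.0], size=(n_flips, n_obs))
    flips = np.vstack([np.ones(n_obs), flips])  # row 0 is the observed data
    # Standardize so that specifications are on a comparable scale.
    scale = np.sqrt((contribs ** 2).sum(axis=0))
    stats = np.abs(flips @ contribs) / scale  # shape (n_flips + 1, n_specs)
    # Distribution of the maximum statistic over specifications.
    max_stats = stats.max(axis=1)
    # Adjusted p-value: share of resamples whose maximum reaches the
    # observed statistic of the given specification.
    return (max_stats[:, None] >= stats[0][None, :]).mean(axis=0)
```

A specification is rejected when its adjusted p-value is at most the nominal level, which is what allows statements about individual analysis paths rather than only about the multiverse as a whole.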

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same resampling logic could be examined for models outside the generalized linear family, such as survival or mixed-effects specifications.
Routine use of PIMA might encourage preregistration of the full multiverse rather than post-hoc selection of paths.
Direct comparisons of PIMA's rejection patterns against those from specification curve analysis on the same datasets would clarify relative power and conservatism.

Load-bearing premise

The conditional resampling procedure based on the sign flipping score test achieves strong family-wise error rate control when applied after selecting models from the multiverse.
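For reference, the distinction this premise turns on can be written out. The notation below (K, H_k, T, R) is introduced here for illustration and is not taken from the paper.

```latex
% Assumed notation: K specifications, one null hypothesis H_k per specification.
% \mathcal{T} = index set of true nulls (unknown), \mathcal{R} = index set of rejected nulls.
\[
\text{weak control:}\quad
\Pr\bigl(\mathcal{R} \neq \emptyset\bigr) \le \alpha
\quad\text{when } \mathcal{T} = \{1,\dots,K\},
\]
\[
\text{strong control:}\quad
\Pr\bigl(\mathcal{R} \cap \mathcal{T} \neq \emptyset\bigr) \le \alpha
\quad\text{for every } \mathcal{T} \subseteq \{1,\dots,K\}.
\]
```

Per-specification rejections are licensed only under the second condition, since it must hold even when some specifications carry a real effect and others do not.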

What would settle it

Generate data under the global null of no predictor-outcome association in every specification, apply the full PIMA procedure, and check whether the family-wise error rate exceeds the nominal alpha.
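A check of this kind can be sketched in a few lines. The simulation below is a toy version under stated assumptions, not the full PIMA procedure: the multiverse consists of three hypothetical pre-processing variants of one predictor, the nuisance model is intercept-only, the outcome is Gaussian, and the combination rule is a max-T over shared sign flips. All function names are illustrative.

```python
import numpy as np

def score_contributions(y, x_versions):
    """Per-observation score contributions for an intercept-only null model."""
    resid = y - y.mean()
    centered = x_versions - x_versions.mean(axis=0)
    return centered * resid[:, None]

def min_adjusted_p(contribs, n_flips, rng):
    """Smallest max-T adjusted p-value across specifications."""
    n_obs = contribs.shape[0]
    flips = rng.choice([-1.0, 1.0], size=(n_flips, n_obs))
    flips = np.vstack([np.ones(n_obs), flips])  # row 0 is the observed data
    stats = np.abs(flips @ contribs) / np.sqrt((contribs ** 2).sum(axis=0))
    max_stats = stats.max(axis=1)
    # The smallest adjusted p-value belongs to the largest observed statistic.
    return np.mean(max_stats >= max_stats[0])

def estimate_fwer(n_sims=200, n_obs=50, n_flips=199, alpha=0.05, seed=0):
    """Share of null datasets in which at least one specification is rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n_obs)
        y = rng.normal(size=n_obs)  # global null: y is independent of x
        # Three illustrative pre-processing choices for the predictor.
        x_versions = np.column_stack([
            x,
            np.sign(x) * np.sqrt(np.abs(x)),
            np.clip(x, -1.0, 1.0),
        ])
        contribs = score_contributions(y, x_versions)
        rejections += min_adjusted_p(contribs, n_flips, rng) <= alpha
    return rejections / n_sims

fwer_hat = estimate_fwer()
```

An estimate well above `alpha` (beyond Monte Carlo error) would count against the control claim. Note that this probes only the global null; strong control would also need configurations in which some specifications carry a true effect.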

Original abstract

When analyzing data researchers make some decisions that are either arbitrary, based on subjective beliefs about the data generating process, or for which equally justifiable alternative choices could have been made. This wide range of data-analytic choices can be abused, and has been one of the underlying causes of the replication crisis in several fields. Recently, the introduction of multiverse analysis provides researchers with a method to evaluate the stability of the results across reasonable choices that could be made when analyzing data. Multiverse analysis is confined to a descriptive role, lacking a proper and comprehensive inferential procedure. Recently, specification curve analysis adds an inferential procedure to multiverse analysis, but this approach is limited to simple cases related to the linear model, and only allows researchers to infer whether at least one specification rejects the null hypothesis, but not which specifications should be selected. In this paper we present a Post-selection Inference approach to Multiverse Analysis (PIMA) which is a flexible and general inferential approach that accounts for all possible models, i.e., the multiverse of reasonable analyses. The approach allows for a wide range of data specifications (i.e. pre-processing) and any generalized linear model; it allows testing the null hypothesis of a given predictor not being associated with the outcome, by merging information from all reasonable models of multiverse analysis, and provides strong control of the family-wise error rate such that it allows researchers to claim that the null-hypothesis can be rejected for each specification that shows a significant effect. The inferential proposal is based on a conditional resampling procedure. To be continued...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PIMA extends multiverse inference to GLMs with per-spec claims and FWER control, but the sign-flipping resampling's validity under post-selection is the unverified step.

The main thing to know is that this paper gives a resampling-based way to test a predictor across a whole multiverse of GLM specs while claiming strong FWER control for individual rejections, going beyond the all-or-nothing inference in specification curve analysis. It handles arbitrary pre-processing and any GLM, which is a step up from prior descriptive multiverse work or linear-model limits. That is the actual advance on the table.

The paper does a clean job framing the replication-crisis angle and showing why descriptive multiverse alone is not enough for claims about specific models. It positions the method as letting researchers reject the null for those specs that come out significant, with the control applying across the collection.

The soft spot is exactly the one the stress test flags: the conditional sign-flipping procedure is asserted to deliver the FWER bound once the multiverse is fixed, but the dependence across overlapping specs makes it unclear whether the conditioning preserves the exchangeability or pivotality required for exact control. Without seeing explicit derivations or targeted simulations that check this under realistic overlap, the load-bearing claim stays unverified. The abstract alone does not supply the equations, so the full paper needs to demonstrate that the resampling distribution is correctly adjusted for the joint selection event.

This work is for methodologists who build inference tools for flexible analyses and for applied teams in psychology or biomedicine who already run multiverse checks and want something more than description. A reader who cares about post-selection methods will find the framing useful even if they end up adjusting the procedure. It deserves a serious referee because the problem it targets is concrete and the proposed extension is new, even though the technical justification for the control needs checking. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Post-selection Inference in Multiverse Analysis (PIMA), an inferential framework extending multiverse analysis beyond its descriptive role. It employs a conditional resampling procedure based on the sign flipping score test to test the null that a given predictor is unassociated with the outcome, merging information across all reasonable specifications in the multiverse. The approach is presented for generalized linear models with arbitrary pre-processing and claims to deliver strong family-wise error rate (FWER) control, permitting per-specification rejections while bounding the probability of any false rejection under the global null.

Significance. If the claimed strong FWER control is valid, PIMA would represent a meaningful advance by furnishing an inferential procedure for multiverse analysis that is more general than specification curve analysis and applicable to GLMs. This could help researchers make defensible claims about individual specifications while controlling error rates induced by data-analytic choices, directly addressing a contributor to the replication crisis. The flexibility for arbitrary pre-processing steps is a clear strength of the proposal.

major comments (2)

[Abstract] Abstract: The central claim that the method 'provides strong control of the family-wise error rate' such that 'the null-hypothesis can be rejected for each specification that shows a significant effect' is stated without any supporting derivation, equation, or simulation result demonstrating that the conditional sign-flipping resampling achieves exact strong control once the dependence across overlapping specifications in the multiverse is taken into account.
[Abstract (and inferential proposal section)] The load-bearing premise is that the conditioning set defined by the fixed collection of 'reasonable' specifications preserves the exchangeability or pivotal property of the sign-flipping score test under post-selection. No argument is supplied showing that the resampling distribution remains valid for the joint selection event induced by the multiverse, which is required for the per-specification rejections to be licensed under strong FWER control.

minor comments (1)

[Abstract] The abstract ends with 'To be continued...', indicating the manuscript may be incomplete; this should be addressed before resubmission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the potential of PIMA to advance inferential methods in multiverse analysis. We address each major comment point by point below. We agree that the presentation of the strong FWER control claim can be improved with additional details and supporting evidence, and we will make revisions to the abstract and the inferential proposal section accordingly.

Referee: [Abstract] Abstract: The central claim that the method 'provides strong control of the family-wise error rate' such that 'the null-hypothesis can be rejected for each specification that shows a significant effect' is stated without any supporting derivation, equation, or simulation result demonstrating that the conditional sign-flipping resampling achieves exact strong control once the dependence across overlapping specifications in the multiverse is taken into account.

Authors: The abstract is intended as a high-level overview and therefore omits detailed derivations and simulations. The supporting theory for strong FWER control, which accounts for the dependence structure through conditional resampling, is developed in the main text. To better support the claim in the abstract, we will revise it to briefly reference the conditional sign-flipping procedure and its properties. Additionally, we will include new simulation studies in the revised manuscript that explicitly demonstrate the FWER control in the presence of overlapping specifications.
Revision: yes

Referee: [Abstract (and inferential proposal section)] The load-bearing premise is that the conditioning set defined by the fixed collection of 'reasonable' specifications preserves the exchangeability or pivotal property of the sign-flipping score test under post-selection. No argument is supplied showing that the resampling distribution remains valid for the joint selection event induced by the multiverse, which is required for the per-specification rejections to be licensed under strong FWER control.

Authors: We believe the manuscript does supply the argument that the a priori fixed set of reasonable specifications creates a conditioning event under which the sign-flipping scores remain exchangeable under the global null, enabling valid inference for each specification. However, we agree that this could be articulated more clearly and with greater formality to address concerns about the joint selection event. In the revision, we will expand the relevant section with a more detailed explanation and a theorem statement outlining why the pivotal property holds conditionally on the multiverse selection.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on external resampling properties

The paper presents PIMA as an extension of multiverse analysis via conditional sign-flipping score test resampling to achieve strong FWER control across specifications. No quoted derivation step reduces a claimed prediction or control property to a fitted input by construction, nor does the load-bearing validity of the conditional procedure reduce to a self-citation chain or self-definitional ansatz. The central claim is positioned as following from the statistical properties of the resampling method applied to the multiverse collection, which is treated as an independent extension rather than tautological. This aligns with the default expectation that most papers are non-circular when the key inferential step is not shown to collapse into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the validity of the sign flipping score test under conditional resampling for post-selection inference; no free parameters, invented entities, or additional axioms are identifiable from the provided text.

axioms (1)

Domain assumption: the sign flipping score test yields valid conditional p-values that enable strong FWER control across the multiverse of specifications.
This is the core premise invoked when the abstract states that the inferential proposal is based on a conditional resampling procedure.

pith-pipeline@v0.9.0 · 5849 in / 1372 out tokens · 28172 ms · 2026-05-24T10:56:19.232414+00:00 · methodology
