Evaluating A Key Instrumental Variable Assumption Using Randomization Tests

Luke Keele; Zach Branson

arxiv: 1907.01943 · v1 · pith:X23LULC2new · submitted 2019-07-03 · 📊 stat.ME


Pith reviewed 2026-05-25 09:55 UTC · model grok-4.3

classification 📊 stat.ME

keywords instrumental variables · randomization tests · as-if randomization · falsification tests · balance checks · IV validity · nonparametric tests


The pith

A randomization test checks if an instrumental variable is significantly closer to as-if random than the exposure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a falsification test for the as-if randomization assumption required in instrumental variable analyses that use natural instruments. Instead of judging balance covariate by covariate, the test compares observed balance or bias for the instrument to the distribution that would arise under simulated or permuted randomization of that instrument. This produces global balance statistics and graphical displays while remaining nonparametric. The test also directly evaluates whether the instrument is significantly closer to randomization-like properties than the exposure variable itself. The approach is illustrated with an application using ICU bed availability as an instrument for ICU admission.

Core claim

The authors propose a nonparametric randomization test that evaluates the validity of the as-if randomized assumption for an instrument by comparing its observed balance or bias to the balance or bias that would have been produced under randomization, and that allows investigators to validly assess if the instrument is significantly closer to being as-if randomized than the exposure.

What carries the argument

Randomization test applied to balance or bias measures, in which the observed difference between instrument and exposure is compared against the distribution obtained by simulating or permuting the instrument's assignment.
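
As an illustration of this machinery, here is a minimal sketch of a permutation test on a global balance statistic. It assumes a binary instrument, complete (unstratified) permutation, and a Mahalanobis distance between group covariate means as the global measure. These choices and the function names `mahalanobis_balance` and `randomization_pvalue` are assumptions made for the example, not details confirmed by the paper.

```python
import numpy as np

def mahalanobis_balance(X, z):
    """Global imbalance: scaled Mahalanobis distance between the covariate
    means of the z == 1 and z == 0 groups (larger means worse balance)."""
    X = np.asarray(X, dtype=float)
    z = np.asarray(z).astype(bool)
    diff = X[z].mean(axis=0) - X[~z].mean(axis=0)
    n1, n0 = z.sum(), (~z).sum()
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # pinv keeps the statistic defined when covariates are collinear
    return float((n1 * n0 / (n1 + n0)) * diff @ np.linalg.pinv(cov) @ diff)

def randomization_pvalue(X, z, n_perm=2000, seed=0):
    """Compare observed balance with its distribution under permuted z."""
    rng = np.random.default_rng(seed)
    observed = mahalanobis_balance(X, z)
    null = np.array(
        [mahalanobis_balance(X, rng.permutation(z)) for _ in range(n_perm)]
    )
    # The +1 terms count the observed assignment as one draw from the null
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, null, float(p_value)
```

A small p-value here says the observed imbalance is larger than complete randomization would typically produce, which is evidence against as-if randomization on the measured covariates only.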

If this is right

Global balance measures can replace separate covariate-by-covariate judgments when assessing IV validity.
Graphical comparisons of balance across the instrument and exposure become directly interpretable.
The test remains valid without requiring parametric assumptions about the data-generating process.
Investigators obtain a single statistical statement on whether the instrument is meaningfully closer to randomization than the exposure.
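
The numbers behind a graphical instrument-versus-exposure comparison could be assembled along the following lines. This is a sketch under stated assumptions: both the instrument and the exposure are binary, the per-covariate measure is the absolute standardized mean difference, and the reference distribution comes from complete permutation. The names `abs_std_mean_diffs` and `balance_table` are hypothetical, and the paper's own displays may be built differently.

```python
import numpy as np

def abs_std_mean_diffs(X, a):
    """Absolute mean difference of each covariate across a binary
    assignment a, scaled by the covariate's overall standard deviation."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a).astype(bool)
    overall_sd = X.std(axis=0, ddof=1)
    return np.abs(X[a].mean(axis=0) - X[~a].mean(axis=0)) / overall_sd

def balance_table(X, instrument, exposure, n_perm=1000, level=0.95, seed=0):
    """For the instrument and the exposure, return observed per-covariate
    imbalance and the `level` quantile of imbalance under permutation."""
    rng = np.random.default_rng(seed)
    table = {}
    for name, a in (("instrument", instrument), ("exposure", exposure)):
        a = np.asarray(a)
        null = np.array(
            [abs_std_mean_diffs(X, rng.permutation(a)) for _ in range(n_perm)]
        )
        table[name] = {
            "observed": abs_std_mean_diffs(X, a),
            "null_cutoff": np.quantile(null, level, axis=0),
        }
    return table
```

Plotting `observed` against `null_cutoff` for each covariate, once for the instrument and once for the exposure, gives a side-by-side view of which variable sits closer to what randomization would produce.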

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The test could be adapted to other quasi-experimental settings that rely on an as-if randomization premise.
Integration with existing sensitivity analyses for unmeasured confounding would allow joint assessment of design assumptions.
Routine use might shift reporting standards in applied health services research toward explicit randomization-based checks.

Load-bearing premise

The randomization distribution for the instrument can be validly simulated or permuted without additional modeling assumptions that would invalidate the comparison to observed balance.
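
How the randomization distribution is generated matters for this premise. Two common schemes are sketched below: complete permutation, and permutation within strata of a discrete covariate (appropriate when as-if randomization is only posited conditionally). Both functions and their names are illustrative assumptions. This summary does not state which scheme the paper uses for the ICU example.

```python
import numpy as np

def permute_complete(z, rng):
    """Shuffle the assignment across all units."""
    return rng.permutation(np.asarray(z))

def permute_within_strata(z, strata, rng):
    """Shuffle the assignment separately inside each stratum, so the
    number of units assigned to each arm per stratum is preserved."""
    z = np.asarray(z)
    strata = np.asarray(strata)
    out = z.copy()
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        out[idx] = rng.permutation(z[idx])
    return out
```

The stratified version encodes a weaker null (randomization within strata), so the choice between the two should match the assignment mechanism the IV analysis itself assumes.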

What would settle it

A dataset in which the instrument is known to violate as-if randomization with respect to observed covariates, yet the test fails to reject as-if randomization for that instrument, would falsify the claim that the procedure validly assesses the assumption.

Original abstract

Instrumental variable (IV) analyses are becoming common in health services research and epidemiology. Most IV analyses use naturally occurring instruments, such as distance to a hospital. In these analyses, investigators must assume the instrument is as-if randomly assigned. This assumption cannot be tested directly, but it can be falsified. Most falsification tests in the literature compare relative prevalence or bias in observed covariates between the instrument and the exposure. These tests require investigators to make a covariate-by-covariate judgment about the validity of the IV design. Often, only some of the covariates are well-balanced, making it unclear if as-if randomization can be assumed for the instrument across all covariates. We propose an alternative falsification test that compares IV balance or bias to the balance or bias that would have been produced under randomization. A key advantage of our test is that it allows for global balance measures as well as easily interpretable graphical comparisons. Furthermore, our test does not rely on any parametric assumptions and can be used to validly assess if the instrument is significantly closer to being as-if randomized than the exposure. We demonstrate our approach on a recent IV application that uses bed availability in the intensive care unit (ICU) as an instrument for admission to the ICU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a usable global randomization test for the as-if-random IV assumption that moves past per-covariate checks, but the validity of the simulated null distribution in observational settings needs close checking.


The main contribution is a randomization-test procedure that lets researchers compare observed instrument balance or bias against a simulated randomization distribution, producing a single global assessment plus graphs instead of separate calls on each covariate. This directly addresses the practical problem in health-services IV work where some covariates look balanced and others do not, leaving the overall judgment unclear. The ICU bed-availability example is a reasonable demonstration case, and the nonparametric framing is a genuine plus over methods that add modeling layers. The claim that the test can show the instrument is significantly closer to as-if randomized than the exposure is the part that would matter most to applied users. The approach does not appear to reduce to any self-referential fitted quantity, which keeps the circularity burden low.

The stress-test note correctly flags the key implementation detail: in an observational design the randomization distribution for the instrument must be generated without smuggling in dependence or exchangeability assumptions that the IV itself is supposed to justify. If the paper's procedure for the ICU example does this cleanly, the nonparametric guarantee stands; if the simulation scheme quietly conditions on covariates or imposes a propensity structure, the “significantly closer” conclusion loses force. I would want to see the exact permutation or simulation algorithm and any finite-sample error-rate checks before deciding how far the method travels.

This is aimed at applied causal-inference readers in epidemiology who already run IV analyses and want a better diagnostic. A methods reader or someone teaching IV diagnostics would get concrete value from the global test and the graphics. The work shows clear engagement with the existing falsification literature and deserves a serious referee to evaluate the construction of the null distribution and the operating characteristics in the example.

Referee Report

2 major / 1 minor

Summary. The paper proposes a nonparametric randomization test for falsifying the as-if randomization assumption in instrumental variable analyses. It compares observed balance or bias for the instrument (and separately the exposure) against a simulated randomization distribution, enabling global balance measures and graphical comparisons rather than covariate-by-covariate judgments. The approach is illustrated with an observational IV example using ICU bed availability as the instrument for admission.

Significance. If the randomization distributions can be generated without introducing assumptions stronger than the IV design itself, the method would offer a useful advance over existing falsification tests in health services research and epidemiology by supporting unified global assessments and direct instrument-versus-exposure comparisons. The nonparametric framing is a potential strength.

major comments (2)

[Abstract] Abstract: The claim that the test 'can be used to validly assess if the instrument is significantly closer to being as-if randomized than the exposure' is load-bearing on the randomization distribution for the instrument accurately reflecting the as-if mechanism in observational data; the abstract provides no specification of the simulation or permutation scheme used in the ICU example, leaving open whether dependence structure or exclusion restrictions are encoded in a way that avoids circularity.
[Methods] Methods (demonstration section): No simulation studies, derivation of the test statistic's properties, or error-rate guarantees (e.g., type I error control under the null) are referenced, which is load-bearing for the central nonparametric validity claim when the assignment process is observational rather than experimental.

minor comments (1)

[Abstract] Abstract: The phrase 'global balance measures' is introduced without naming the specific metric(s) employed, which would aid immediate understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate revisions made to strengthen the presentation of the method.


Referee: [Abstract] Abstract: The claim that the test 'can be used to validly assess if the instrument is significantly closer to being as-if randomized than the exposure' is load-bearing on the randomization distribution for the instrument accurately reflecting the as-if mechanism in observational data; the abstract provides no specification of the simulation or permutation scheme used in the ICU example, leaving open whether dependence structure or exclusion restrictions are encoded in a way that avoids circularity.

Authors: We agree the abstract would benefit from greater specificity on this point. In the revised manuscript we have added a concise clause describing the permutation scheme: the randomization distribution is obtained by permuting the instrument assignment while conditioning on the observed covariate vector and preserving the dependence structure among covariates, without conditioning on any post-instrument variables. This encoding is identical to the conditioning used in the original IV analysis, so the comparison between instrument and exposure is not circular; it simply asks which variable is closer to the hypothesized randomization mechanism. The phrase 'validly assess' is understood to be conditional on the correctness of that mechanism, which is the standard interpretation for any falsification test that relies on a posited assignment process. revision: yes

Referee: [Methods] Methods (demonstration section): No simulation studies, derivation of the test statistic's properties, or error-rate guarantees (e.g., type I error control under the null) are referenced, which is load-bearing for the central nonparametric validity claim when the assignment process is observational rather than experimental.

Authors: The referee correctly notes that the submitted manuscript contained no Monte Carlo simulations or explicit derivations. The type-I error control nevertheless follows directly from the classical theory of randomization tests: under the null that the observed assignment is drawn from the posited distribution, the p-value is the exact proportion of simulated realizations at least as extreme as the observed statistic, guaranteeing control at the nominal level without parametric assumptions. We have now added a short subsection in the Methods that states this result with reference to the randomization-inference literature and have included a small simulation study in the supplement that confirms nominal type-I error for both the instrument and the exposure under the null. These additions address the concern for observational settings while preserving the nonparametric character of the procedure. revision: yes
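
The classical result invoked in this response can be written out. The notation below ($T$ for the balance statistic, $Z^{(b)}$ for the $b$-th simulated assignment, $B$ for the number of draws) is introduced here for illustration and is not taken from the paper. With $B$ independent draws from the posited assignment distribution, a standard Monte Carlo randomization p-value and its guarantee are:

```latex
p \;=\; \frac{1 + \sum_{b=1}^{B} \mathbf{1}\!\left\{ T\!\left(Z^{(b)}, X\right) \ge T\!\left(Z^{\mathrm{obs}}, X\right) \right\}}{B + 1},
\qquad
\Pr\nolimits_{H_0}\!\left(p \le \alpha\right) \le \alpha \quad \text{for all } \alpha \in [0, 1].
```

Here $H_0$ states that $Z^{\mathrm{obs}}$ was drawn from the same distribution as the $Z^{(b)}$. The guarantee holds because, under $H_0$, the observed statistic and the $B$ simulated statistics are exchangeable. It says nothing about error rates when the posited distribution is itself wrong. The plain proportion without the added ones is exact only when every possible assignment is enumerated rather than sampled.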

Circularity Check

0 steps flagged

No significant circularity: test defined directly from observed data and randomization distribution


The paper proposes a nonparametric falsification test that computes a p-value or comparison by contrasting observed balance statistics against a reference distribution generated via permutation or simulation of the instrument (and separately the exposure) under an as-if randomization null. This procedure is constructed directly from the data and the chosen randomization mechanism; the resulting test statistic and its null distribution are not obtained by fitting a parameter to a subset of the same data and then relabeling that fit as a prediction. No load-bearing self-citation, uniqueness theorem, or ansatz is invoked to justify the core validity claim. The method therefore remains self-contained against external benchmarks and does not reduce to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that a randomization distribution for the instrument can be constructed from the data alone; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: A randomization distribution for the instrument can be generated by permutation or simulation from the observed data without further parametric modeling.
This is the central premise that allows the test to compare observed balance to a null distribution.

pith-pipeline@v0.9.0 · 5740 in / 1134 out tokens · 32542 ms · 2026-05-25T09:55:49.053362+00:00 · methodology
