Sequential Randomization Tests Using e-values: Applications for trial monitoring

Fernando G Zampieri

arxiv: 2512.04366 · v9 · submitted 2025-12-04 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-17 01:59 UTC · model grok-4.3

keywords: sequential monitoring · e-values · randomization tests · martingales · clinical trials · nonparametric tests · anytime validity · Type I error control

The pith

Sequential randomization tests using e-values deliver anytime-valid Type I error control for clinical trial monitoring derived solely from the randomization mechanism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of nonparametric sequential tests called e-RT that monitor randomized trials for binary, event, and continuous outcomes without parametric assumptions or fixed sample sizes. Each test builds a test martingale by placing wagers on randomized treatment assignments or event labels and updating wealth only after observing the outcome. Under the null of no treatment effect, expected wealth cannot grow, which automatically bounds the probability of ever crossing any error threshold no matter when monitoring stops. This approach is effect-size agnostic by default but can incorporate design-calibrated or growth-rate-optimal wagers when a clinically meaningful alternative is available. A reader would care because it removes the need to pre-specify stopping rules or rely on asymptotic approximations while still guaranteeing valid inference in ongoing trials.
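
The betting mechanism described above can be illustrated with a minimal sketch for a binary outcome under 1:1 randomization. This is not the paper's implementation: the function names, the Laplace-smoothed plug-in bet, and the clipping constant are assumptions made for illustration. The bettor states a probability `q` that the current patient was treated, using only earlier patients plus the current outcome, and wealth is multiplied by `q / 0.5` or `(1 - q) / 0.5` once the assignment is revealed.

```python
from itertools import product


def e_rt_binary(assignments, outcomes, clip=0.05):
    """Wealth path of a label-betting test (hypothetical sketch, 1:1 allocation)."""
    # counts[y][a]: past patients with outcome y in arm a, started at 1 (Laplace prior).
    counts = {0: [1.0, 1.0], 1: [1.0, 1.0]}
    wealth = 1.0
    path = []
    for a, y in zip(assignments, outcomes):
        n_ctrl, n_trt = counts[y]
        # The bet uses past data and the current outcome only, never the current label.
        q = n_trt / (n_ctrl + n_trt)
        q = min(max(q, clip), 1.0 - clip)
        # Payoff relative to the known randomization probability 0.5.
        wealth *= (q if a == 1 else 1.0 - q) / 0.5
        path.append(wealth)
        # Only now does the current label enter the history.
        counts[y][a] += 1.0
    return path


def mean_final_wealth(outcomes):
    """Average final wealth over every equally likely assignment sequence."""
    n = len(outcomes)
    finals = [e_rt_binary(list(a), outcomes)[-1] for a in product([0, 1], repeat=n)]
    return sum(finals) / len(finals)


if __name__ == "__main__":
    # With outcomes held fixed (sharp null), the randomization average of wealth is 1.
    print(mean_final_wealth([1, 0, 1, 1, 0, 0]))
```

Averaging over all assignment sequences with the outcomes held fixed mirrors the sharp null: each multiplicative update has mean 1 under the randomization distribution, so the average final wealth stays at its starting value of 1 up to floating-point error.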

Core claim

By placing sequential wagers on randomized assignments before the current label enters the wealth update, the e-RT procedures make the wealth process a nonnegative supermartingale under the null hypothesis. By Ville's inequality, this guarantees anytime-valid Type I error control regardless of the stopping rule chosen by the investigator.
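
For reference, the step from the supermartingale property to the error bound is Ville's inequality. The statement below is the standard form; the symbols $W_n$ for the wealth process, $W_0 = 1$ for the starting wealth, and $\tau$ for the stopping time are notation assumed here rather than taken from the paper.

```latex
% Ville's inequality for a nonnegative supermartingale (W_n) with W_0 = 1 under H_0:
\[
  \Pr_{H_0}\!\left( \exists\, n \ge 1 : W_n \ge \tfrac{1}{\alpha} \right) \;\le\; \alpha ,
  \qquad 0 < \alpha < 1 .
\]
% Consequently, for any stopping time \tau (possibly data-dependent):
\[
  \Pr_{H_0}\!\left( \tau < \infty,\; W_\tau \ge \tfrac{1}{\alpha} \right) \;\le\; \alpha .
\]
```

With $\alpha = 0.05$ the rejection threshold is $1/\alpha = 20$.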

What carries the argument

The test martingale formed by wagering on randomized treatment assignments or observed event labels before wealth updates, which enforces the supermartingale property under the null.

If this is right

Anytime-valid Type I error control holds for arbitrary stopping times in sequential monitoring of randomized trials.
The same framework applies uniformly to binary, time-to-event, and continuous endpoints without model assumptions.
Default effect-size-agnostic wagers allow monitoring to begin immediately without specifying a target treatment effect.
Optional growth-rate-optimal wagers can be substituted when a fixed design alternative is credible to improve power.
The methods serve as a conservative, assumption-light complement to existing model-based sequential analyses.
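
As a rough illustration of a design-calibrated wager for a binary outcome, the sketch below bets the posterior probability of the treatment label implied by a fixed design alternative. This likelihood-ratio (Kelly) bet maximizes expected log growth against that alternative for a bet on the label given the outcome; whether it coincides with the paper's GROW wager is an assumption. The function names and the arguments `p_ctrl` and `p_trt` (event risks under the design alternative) are hypothetical, and the example values echo the 25% baseline mortality and 5 percentage point ARR mentioned in the Figure 3 caption.

```python
import math


def design_label_probability(y, p_ctrl, p_trt):
    """P(A = 1 | Y = y) under the design alternative with 1:1 allocation."""
    like_trt = p_trt if y == 1 else 1.0 - p_trt
    like_ctrl = p_ctrl if y == 1 else 1.0 - p_ctrl
    return like_trt / (like_trt + like_ctrl)


def grow_factor(a, y, p_ctrl, p_trt):
    """Wealth multiplier for one patient: bet probability over the null probability 0.5."""
    q = design_label_probability(y, p_ctrl, p_trt)
    return (q if a == 1 else 1.0 - q) / 0.5


def expected_log_growth(p_ctrl, p_trt):
    """Expected log wealth gain per patient when the design alternative is true."""
    total = 0.0
    for a, p_event in ((0, p_ctrl), (1, p_trt)):
        for y in (0, 1):
            p_y = p_event if y == 1 else 1.0 - p_event
            # P(A = a) = 0.5 under 1:1 randomization.
            total += 0.5 * p_y * math.log(grow_factor(a, y, p_ctrl, p_trt))
    return total


if __name__ == "__main__":
    print(expected_log_growth(p_ctrl=0.25, p_trt=0.20))
```

Because the bet depends only on the outcome and the pre-specified design values, the multiplier still averages to 1 over the two equally likely labels, so validity under the null does not depend on the design alternative being correct; only the growth rate does.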

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Investigators could use these procedures to justify adaptive or response-adaptive randomization schemes in which stopping decisions depend on accumulating data.
The same randomization-derived martingale construction might transfer to other experimental settings that rely on known random assignment, such as online A/B testing platforms.
Hybrid monitoring rules could combine e-RT validity bounds with parametric likelihood ratios to gain power while preserving the nonparametric safety net.

Load-bearing premise

Treatment assignments are generated by a known random mechanism that is independent of potential outcomes under the null hypothesis.

What would settle it

In repeated simulations under the null with a data-dependent stopping rule that rejects when wealth exceeds a fixed threshold, the observed rejection rate would exceed the nominal alpha level.
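
A check of this kind could be sketched as follows. The simulation settings, the function name, and the Laplace-smoothed bet are assumptions made for illustration and are not the paper's simulation protocol. Each simulated trial stops the first time wealth reaches 1/α, which is the data-dependent rule described above.

```python
import random


def null_rejection_rate(n_sims=1000, n_patients=200, p_event=0.25,
                        alpha=0.05, seed=1):
    """Fraction of null trials in which wealth ever reaches 1/alpha (hypothetical sketch)."""
    rng = random.Random(seed)
    threshold = 1.0 / alpha
    rejections = 0
    for _sim in range(n_sims):
        # counts[y][a]: past patients with outcome y in arm a, Laplace-smoothed.
        counts = {0: [1.0, 1.0], 1: [1.0, 1.0]}
        wealth = 1.0
        for _patient in range(n_patients):
            a = 1 if rng.random() < 0.5 else 0          # 1:1 randomization
            y = 1 if rng.random() < p_event else 0      # same risk in both arms (null)
            q = counts[y][1] / (counts[y][0] + counts[y][1])
            wealth *= (q if a == 1 else 1.0 - q) / 0.5
            counts[y][a] += 1.0
            if wealth >= threshold:                     # stop at the first crossing
                rejections += 1
                break
    return rejections / n_sims


if __name__ == "__main__":
    print(null_rejection_rate())
```

If the guarantee holds, the returned rate should not exceed α beyond Monte Carlo error; a rate clearly above α across seeds and settings would count as evidence against the claim.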

Figures

Figures reproduced from arXiv: 2512.04366 by Fernando G Zampieri.

**Figure 2.** E-process trajectories under the alternative hypothesis (true ARR = 10%). Left: n = …

**Figure 3.** Trajectories of the e-RTd process (25% baseline mortality, 5pp ARR, 500 deaths).

**Figure 4.** Trajectories of the continuous randomization e-process for a trial designed to detect …

**Figure 5.** Trajectories of the e-Survival process for a trial designed to detect a Hazard Ratio of …

**Figure 6.** Trajectories of e-RTms for a trial with N = 1000 patients. Left: under the null hypothesis (equal transition matrices), wealth fluctuates randomly. Right: under the alternative hypothesis, wealth grows as treatment improves recovery transitions. Dashed red line: rejection threshold (1/α = 20).

Original abstract

Sequential monitoring of randomized trials traditionally relies on parametric assumptions or asymptotic approximations. We discuss a family of nonparametric sequential tests - collectively called e-RT - for binary, event-only, and continuous endpoints. All active variants derive validity from the randomization mechanism. Using a betting framework, each test constructs a test martingale by sequentially wagering on randomized assignments or observed event labels before using the current label in the wealth update. Under the null hypothesis of no treatment effect, the expected wealth cannot grow, guaranteeing anytime-valid Type I error control regardless of stopping rule. The default e-RT posture is effect-size agnostic: monitoring can begin without specifying a hypothesized treatment effect. Alternatively, fixed design-calibrated wagers, including growth-rate-optimal (GROW) wagers, may be used as optional efficiency tools when a clinically meaningful design alternative is credible. We present simulation studies demonstrating calibration and power, and discuss the principled asymmetry in betting strategies across outcome types. These methods provide a conservative, assumption-light complement to model-based sequential analyses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper builds sequential e-value tests for randomized trial monitoring that draw validity directly from the randomization mechanism rather than models or asymptotics.

The main point is that the author constructs a family of e-RT procedures for binary, event, and continuous endpoints in which you wager on the next randomized assignment or label before it enters the wealth update, then update a wealth process. Under the sharp null the conditional expectation of each multiplicative increment stays at 1, so the whole thing is a test martingale and Ville’s inequality gives anytime-valid Type I error control for any stopping rule. That is the clean part: no parametric assumptions and no need to pre-specify an effect size for the basic version. The paper also shows how to plug in GROW wagers when a clinically relevant alternative is credible, and it reports simulations that check calibration and power across the endpoint types. The asymmetry it notes in betting strategies for different data types is a reasonable practical touch. The core argument tracks the stress-test note and does not appear to hide any data-dependent fitting that would break the martingale step. The main soft spot is that the manuscript still needs to spell out the exact wager functions and the handling of ties or censoring for event endpoints in enough detail for a reader to reproduce the simulations without guesswork; that is fixable but currently leaves the efficiency claims a bit underspecified. Readers who work on adaptive designs or nonparametric sequential analysis will find the most direct value here. The paper is coherent on its own terms and the central guarantee is reproducible from the randomization alone, so it deserves a serious referee. I would send it out for review rather than desk-reject.

Referee Report

2 major / 3 minor

Summary. The paper proposes a family of nonparametric sequential tests called e-RT for monitoring randomized clinical trials with binary, event-only, and continuous endpoints. These tests construct test martingales via a sequential betting framework that wagers on randomized treatment assignments or event labels before updating wealth. Under the sharp null of no treatment effect, the construction ensures that expected wealth is non-increasing, yielding anytime-valid Type I error control for arbitrary stopping times via Ville's inequality. The default approach is effect-size agnostic, though optional design-calibrated wagers (including GROW) are discussed for efficiency gains. Simulation studies are presented to illustrate calibration and power.

Significance. If the martingale property holds as described, the work supplies a conservative, randomization-based complement to parametric or asymptotic sequential methods. It avoids reliance on fitted models or asymptotic approximations and directly leverages the trial's randomization mechanism for validity. The explicit separation of default agnostic monitoring from optional efficiency tools, together with the simulation evidence, strengthens its potential utility for flexible trial monitoring.

major comments (2)

[§3.2] The continuous-endpoint wager definition: the conditional expectation under the null is stated to equal 1, but the precise form of the wager function (e.g., how the continuous outcome is mapped to a bounded payoff) is not shown explicitly; this step is load-bearing for the martingale property and requires an expanded derivation or a reference to the supporting lemma.
[§5] Simulation protocol: the reported power curves for the GROW wager variant assume a specific clinically meaningful alternative; without a sensitivity analysis over a broader range of alternatives (including those near the boundary of detectability), the efficiency claim relative to the agnostic version remains incompletely supported.

minor comments (3)

[Abstract] The abstract introduces 'e-RT' without an immediate expansion; adding '(e-value Randomization Test)' on first use would improve readability.
[§2] Notation for the natural filtration in the martingale construction could be made more explicit (e.g., by indexing the sigma-algebra explicitly with the sequence of randomizations and observations) to assist readers less familiar with sequential analysis.
[Table 2] Table 2 caption should clarify whether the reported Type I error rates are exact or Monte Carlo estimates and include the number of replications used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our manuscript. The comments help clarify important aspects of the e-RT framework. We address each major comment below and indicate the revisions we will make.

Referee: [§3.2] The continuous-endpoint wager definition: the conditional expectation under the null is stated to equal 1, but the precise form of the wager function (e.g., how the continuous outcome is mapped to a bounded payoff) is not shown explicitly; this step is load-bearing for the martingale property and requires an expanded derivation or a reference to the supporting lemma.

Authors: We agree that additional detail on the continuous-endpoint wager would improve the exposition. In the revised manuscript, we will provide an explicit definition of the wager function for continuous outcomes, specifying how the outcome is mapped to a bounded payoff (for example, through a normalized transformation ensuring the payoff lies in [0,1]). We will then include a detailed derivation demonstrating that the conditional expectation under the sharp null hypothesis equals 1, based on the randomization distribution. This will be presented directly in §3.2 or as a supporting lemma in the appendix.

Revision: yes

Referee: [§5] Simulation protocol: the reported power curves for the GROW wager variant assume a specific clinically meaningful alternative; without a sensitivity analysis over a broader range of alternatives (including those near the boundary of detectability), the efficiency claim relative to the agnostic version remains incompletely supported.

Authors: The simulations in §5 were designed to illustrate performance under a specific, clinically relevant alternative to highlight the benefits of the GROW wager. We recognize that a sensitivity analysis would provide stronger support for the efficiency claims. Accordingly, we will expand the simulation section to include additional scenarios, particularly those near the boundary of detectability, and report the corresponding power comparisons between the agnostic and design-calibrated variants.

Revision: yes
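
To make the first exchange concrete, one possible bounded-payoff mapping for a continuous outcome is sketched below. It is an illustration only and not the wager the paper or the rebuttal commits to: the tanh transformation, the pre-specified `center` and `scale`, the fixed stake `lam`, and the function names are all assumptions. The outcome is squashed to a score in [-1, 1] (equivalently, `(score + 1) / 2` lies in [0, 1]), and the bettor stakes a fraction `lam` of wealth on treated patients having higher scores.

```python
import math


def bounded_score(y, center=0.0, scale=1.0):
    """Map a continuous outcome to [-1, 1]; center and scale are fixed in advance."""
    return math.tanh((y - center) / (2.0 * scale))


def continuous_factor(a, y, lam=0.5, center=0.0, scale=1.0):
    """Wealth multiplier for one patient under 1:1 allocation (hypothetical sketch)."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1) so that wealth stays positive")
    sign = 2 * a - 1  # +1 for treatment, -1 for control
    return 1.0 + lam * sign * bounded_score(y, center, scale)


def null_mean_factor(y, lam=0.5, center=0.0, scale=1.0):
    """Average multiplier over the two equally likely labels, outcome held fixed."""
    return 0.5 * (continuous_factor(1, y, lam, center, scale)
                  + continuous_factor(0, y, lam, center, scale))


if __name__ == "__main__":
    for y in (-3.0, 0.0, 0.7, 10.0):
        print(y, null_mean_factor(y))
```

Because the score depends on the outcome but not on the label, the two possible multipliers are `1 + lam * score` and `1 - lam * score`, which average to 1 for every outcome value. Boundedness of the score is what keeps the multiplier positive.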

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via external randomization

The paper constructs test martingales for sequential randomization tests whose validity under the sharp null follows from the external randomization mechanism: treatment assignments are independent of fixed potential outcomes, so each wealth increment has conditional expectation 1 by construction of the betting update. This property, combined with the standard martingale inequality (Ville), directly yields the anytime-valid Type I error bound for arbitrary stopping times. No equation reduces a derived quantity to a data-fitted parameter by definition, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in; the central guarantee is therefore independent of the paper's own outputs and rests on the stated randomization assumption plus classical probability results.
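
The "conditional expectation 1" step can be written out for a generic label bet. The notation is assumed for illustration and is not taken from the paper: $A_n \in \{0,1\}$ is the assignment of patient $n$, $\pi$ the known randomization probability, $Y_n$ the outcome, $\mathcal{F}_{n-1}$ the history before patient $n$, and $q_n \in [0,1]$ a bet that may depend on $\mathcal{F}_{n-1}$ and $Y_n$ but not on $A_n$.

```latex
% Wealth update with a bet q_n on the label A_n:
\[
  W_n = W_{n-1}\, B_n ,
  \qquad
  B_n = \frac{q_n^{A_n}\,(1 - q_n)^{1 - A_n}}{\pi^{A_n}\,(1 - \pi)^{1 - A_n}} .
\]
% Under the sharp null, A_n is independent of (F_{n-1}, Y_n) with P(A_n = 1) = \pi, so
\[
  \mathbb{E}\!\left[ B_n \,\middle|\, \mathcal{F}_{n-1}, Y_n \right]
  = \pi \cdot \frac{q_n}{\pi} + (1 - \pi) \cdot \frac{1 - q_n}{1 - \pi}
  = 1 ,
\]
% and therefore
\[
  \mathbb{E}\!\left[ W_n \,\middle|\, \mathcal{F}_{n-1} \right] = W_{n-1} .
\]
```

The only ingredient used is the known randomization probability, which is why no fitted parameter enters the validity argument.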

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the randomization mechanism as the source of validity and on the martingale property under the null; no free parameters or new entities are introduced in the abstract description.

axioms (1)

Domain assumption: Treatment assignments are randomized independently of potential outcomes under the null hypothesis of no treatment effect. Invoked to guarantee that expected wealth does not grow under the null.

pith-pipeline@v0.9.0 · 5467 in / 1185 out tokens · 30568 ms · 2026-05-17T01:59:25.417145+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Under the null hypothesis of no treatment effect, the expected wealth cannot grow, guaranteeing anytime-valid Type I error control regardless of stopping rule."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Theorem 1. Under the null hypothesis, the wealth process (W_n) is a nonnegative martingale."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1] Duan, B., Ramdas, A., and Wasserman, L. (2022). Interactive rank testing by betting. In Schölkopf, B., Uhler, C., and Zhang, K., editors, Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pages 201–235. PMLR. Grünwald, P., Ly, A., Perez-Ortiz, M., and Schure, J. T. (2021). The safe ...

[2] Kelly, J. L. (1956). A new interpretation of information rate. Bell System Technical Journal, 35(4):917–926.

[3] Koning, N. W. (2025). Measuring evidence against exchangeability and group invariance with e-values. arXiv preprint arXiv:2310.01153.

[4] Ramdas, A. (2021). Game-theoretic probability and statistics (lecture notes). Accessed: 2025-12-09.

[5] Ramdas, A., Ruf, J., Larsson, M., and Koolen, W. M. (2022). Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning, 141:83–109.

[6] Ramdas, A. and Wang, R. (2025). Hypothesis testing with e-values. Foundations and Trends in Statistics, 1(1-2):1–390.

[7] Shafer, G. (2021). Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A, 184(2):407–431.

[8] Ville, J. (1939). Étude critique de la notion de collectif. PhD thesis, Gauthier-Villars, Paris.

[9] Vovk, V. and Wang, R. (2021). E-values: Calibration, combination and applications. Annals of Statistics, 49(3):1736–1754.

[10] Waudby-Smith, I. and Ramdas, A. (2023). Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27.
