arxiv: 2605.03554 · v1 · submitted 2026-05-05 · 📊 stat.ME

Communicating results in trials with multiple hypotheses or adaptive design features

Elina Asikanius, Marcel Wolbers, Mouna Akacha, Andreas Brandt, Benjamin Hofner, Dieter A. Häring, Kit C.B. Roes, Marc Vandemeulebroecke, David Wright, Joerg Zinserling, Kaspar Rufibach

Pith reviewed 2026-05-07 14:02 UTC · model grok-4.3

classification 📊 stat.ME

keywords: adaptive designs, multiplicity adjustment, clinical trials, estimation, communication of results, Type I error control, multiple endpoints, hypothesis testing

The pith

Complex clinical trials with adaptations or multiple hypotheses lack simple methods for estimating effects and communicating results after controlling false positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines clinical trials that incorporate interim analyses, adaptations, multiple endpoints, and multiplicity schemes under frequentist inference, where Type 1 error must be controlled across multiple looks, endpoints, populations, or treatment comparisons. Advanced tools such as adaptive designs and graphical multiple testing procedures achieve this control by focusing on hypothesis testing decisions. Yet estimation of effect sizes remains essential for benefit-risk assessments and for use by regulators, clinicians, and other stakeholders. Examples illustrate specific difficulties in obtaining appropriate estimates and conveying results transparently. The authors conclude there are no simple solutions to these conceptual and communicational challenges and aim to raise awareness to prompt further discussion.

Core claim

In frequentist trials with adaptive features or multiple hypotheses, sophisticated procedures successfully limit the overall Type 1 error rate. This testing-centric approach, however, leaves the estimation needed for benefit-risk judgments and stakeholder interpretation inadequately supported, creating persistent challenges in transparent communication that have no straightforward resolution.

What carries the argument

Graphical multiple testing procedures, combined with adaptive design elements, enforce Type 1 error control across multiple sources of multiplicity, while the subsequent estimation of effect magnitude remains unaddressed.
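
To make the testing mechanism concrete, here is a minimal sketch of a Bonferroni-based, sequentially rejective graphical procedure of the kind the paper refers to. It is not the authors' code: the function name `graphical_test`, the example graphs, and all p-values are illustrative assumptions, and the update rules follow the standard published description of the graphical approach as we understand it.

```python
from typing import List


def graphical_test(p: List[float], w: List[float],
                   g: List[List[float]], alpha: float) -> List[int]:
    """Return indices of rejected hypotheses (hypothetical helper).

    p: unadjusted p-values; w: initial weights summing to at most 1;
    g: transition matrix, g[j][l] is the share of H_j's weight passed
    to H_l once H_j is rejected; alpha: overall significance level.
    """
    w = list(w)
    g = [row[:] for row in g]
    active = set(range(len(p)))
    rejected = set()
    while True:
        # Look for an active hypothesis significant at its local level.
        hits = [j for j in sorted(active) if p[j] <= w[j] * alpha]
        if not hits:
            return sorted(rejected)
        j = hits[0]
        active.remove(j)
        rejected.add(j)
        # Propagate the weight of the rejected hypothesis along its edges.
        for l in active:
            w[l] += w[j] * g[j][l]
        # Rewire the graph so that edges bypass the removed node j.
        new_g = [row[:] for row in g]
        for l in active:
            for k in active:
                if l == k:
                    new_g[l][k] = 0.0
                    continue
                denom = 1.0 - g[l][j] * g[j][l]
                new_g[l][k] = ((g[l][k] + g[l][j] * g[j][k]) / denom
                               if denom > 0 else 0.0)
        g = new_g
        w[j] = 0.0


# Hierarchical (fixed-sequence) test: H0 -> H1 -> H2, all weight on H0.
chain = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
print(graphical_test([0.01, 0.02, 0.04], [1.0, 0.0, 0.0], chain, 0.025))
print(graphical_test([0.03, 0.001, 0.001], [1.0, 0.0, 0.0], chain, 0.025))
```

In the second call the first hypothesis in the hierarchy is not rejected, so the later hypotheses cannot be rejected either, however small their p-values. The procedure returns only accept/reject decisions and says nothing about effect sizes or intervals, which is the gap described above.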

If this is right

Stakeholders may receive results focused only on binary test outcomes rather than effect sizes needed for decisions.
Benefit-risk assessments could rely on estimates that do not properly account for the multiplicity adjustments or adaptations.
Transparent reporting to regulators and clinicians becomes harder when standard estimation techniques conflict with the testing framework.
Future trial planning must weigh error control against the need for interpretable estimates from the start.
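
One way to see why estimates may need to account for adaptations is a small simulation of a two-stage trial that can stop early for benefit. The sketch below is an illustration under stated assumptions, not an analysis from the paper: the function name `simulate_naive_estimate`, the true effect, the stage sizes, and the stopping threshold are all hypothetical, and the outcome is simplified to a one-sample normal mean with unit variance.

```python
import math
import random


def simulate_naive_estimate(theta: float, n1: int, n2: int,
                            z_crit: float, n_sims: int, seed: int = 1) -> dict:
    """Average the naive estimate over many simulated two-stage trials.

    The trial stops after stage 1 if the stage-1 z-statistic exceeds
    z_crit; otherwise both stages are pooled. The naive estimate is the
    sample mean of whatever data were collected.
    """
    rng = random.Random(seed)
    total = 0.0
    total_stopped = 0.0
    stopped = 0
    for _ in range(n_sims):
        x1 = rng.gauss(theta, 1.0 / math.sqrt(n1))  # stage-1 sample mean
        if x1 * math.sqrt(n1) > z_crit:
            est = x1  # early stop: only stage-1 data are available
            stopped += 1
            total_stopped += est
        else:
            x2 = rng.gauss(theta, 1.0 / math.sqrt(n2))  # stage-2 sample mean
            est = (n1 * x1 + n2 * x2) / (n1 + n2)
        total += est
    return {
        "mean_estimate": total / n_sims,
        "prob_early_stop": stopped / n_sims,
        "mean_estimate_if_stopped": (total_stopped / stopped
                                     if stopped else float("nan")),
    }


# Illustrative values only: true effect 0.3, 50 observations per stage,
# and an assumed early-stopping threshold of 2.797 on the z-scale.
result = simulate_naive_estimate(theta=0.3, n1=50, n2=50,
                                 z_crit=2.797, n_sims=100_000)
print(result)
```

Under these assumptions the naive estimate overstates the true effect on average, and more strongly among trials that stopped early, because stopping selects for large stage-1 results. How such an estimate should be adjusted and reported is the kind of question the paper raises rather than resolves.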

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulatory review processes may need explicit guidance on how to handle estimation when graphical or adaptive multiplicity methods are used.
The same tension between testing and estimation could appear in other domains that apply frequentist multiplicity corrections, such as genomics or quality control studies.
Exploration of hybrid frequentist-Bayesian approaches might provide practical ways to report both error-controlled decisions and calibrated effect estimates.

Load-bearing premise

Estimation of treatment effects stays essential for benefit-risk assessments and stakeholder use, even as current methods prioritize hypothesis testing and Type 1 error control.

What would settle it

A validated general-purpose method that delivers unbiased estimates, valid confidence intervals, and clear communication for adaptive multi-endpoint trials without compromising Type 1 error control would show that simple solutions do exist.

Figures

Figures reproduced from arXiv: 2605.03554 by Andreas Brandt, Benjamin Hofner, David Wright, Dieter A. Häring, Elina Asikanius, Joerg Zinserling, Kaspar Rufibach, Kit C.B. Roes, Marcel Wolbers, Marc Vandemeulebroecke, Mouna Akacha.

**Figure 1.** Hierarchical testing approach applied to the randomized trial in early …

**Figure 2.** Graphical multiple testing scheme for the fictional obesity trial.

Original abstract

Over time, clinical trials have increasingly incorporated complex design and analysis elements such as interim analyses, adaptations, multiple endpoints, and sophisticated multiplicity schemes for multiple endpoints and/or treatment arms following the paradigm of frequentist inference. In frequentist clinical trials multiplicity can come from (at least) four sources: multiple looks at the data, multiple endpoints, multiple populations, or multiple treatment comparisons. Normally, Type 1 error control across the multiple hypotheses is implemented to control chance of false positive decisions. To achieve this advanced techniques such as adaptive designs or graphical multiple testing procedures have been developed and are used in the design of clinical trials. However, these methods focus on hypothesis testing while subsequent estimation remains crucial to allow for a benefit-risk assessment and further use of the results by various stakeholders. Through examples, we illustrate challenges in estimation and transparent communication. In general, there are no simple solutions to this conceptual and communicational challenge. The purpose of this paper is to generate awareness of these issues and initiate a discussion about how to address them moving forward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This discussion paper flags real estimation and communication headaches in complex frequentist trials but offers no new methods or fixes.

The paper's core point is that multiplicity adjustments and adaptive features in clinical trials keep Type I error in check but often leave estimates and intervals that are tough to explain or use for benefit-risk decisions. It walks through the four sources of multiplicity and shows via examples how the focus on testing creates downstream problems for stakeholders. That framing is useful and stays grounded in how these trials actually run today. The authors give credit to existing work on graphical procedures and adaptations without pretending to invent anything new. What stands out is the explicit call to address the estimation-communication gap rather than treating testing control as the end of the story.

The soft spots are predictable for a discussion piece. The examples illustrate the issues but do not measure how common or severe they are in practice, and the claim that no simple solutions exist rests on those illustrations without testing alternatives like adjusted intervals or hybrid approaches. There is no data, code, or derivation to verify.

This is for clinical statisticians and regulators who design or review confirmatory trials and already know the multiplicity literature. A reader looking for technical advances will not find them here, but the paper could prompt useful follow-up on reporting standards. It deserves peer review so the community can weigh in on whether the problems are as widespread as suggested and what practical steps might help.

Referee Report

0 major / 3 minor

Summary. The paper claims that frequentist clinical trials increasingly use complex features such as interim analyses, adaptations, multiple endpoints, and multiplicity adjustments, which require Type I error control via advanced techniques like graphical multiple testing procedures. While these focus on hypothesis testing, the authors argue that subsequent estimation is essential for benefit-risk assessment and stakeholder use, yet faces conceptual and communicational challenges. Through examples, they illustrate difficulties in transparent reporting and conclude that no simple solutions exist, with the purpose of raising awareness and initiating discussion.

Significance. If the assessment holds, the paper usefully highlights a practical gap between sophisticated frequentist testing methods and the downstream need for communicable estimates in clinical trials. This could encourage development of better reporting guidelines or hybrid approaches. The authors deserve credit for grounding the discussion in standard practices and for framing the issue as a call for community dialogue rather than proposing untested new methods.

minor comments (3)

The abstract and introduction could more explicitly reference key literature on post-selection inference or bias in adaptive designs to strengthen the context for the claimed challenges.
The examples section would benefit from clearer separation between the hypothesis testing step and the estimation/communication step in each case, to make the difficulties more immediately apparent to readers.
Consider adding a short concluding section with specific suggestions for future work or open questions to move beyond the general call for discussion.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate summary of our manuscript and for the positive assessment of its significance in highlighting the practical gap between advanced frequentist testing methods and the need for communicable estimates in clinical trials. We appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript is a conceptual discussion paper that illustrates estimation and communication challenges in frequentist trials with multiplicity or adaptations via examples, without advancing any derivations, equations, fitted parameters, predictions, or first-principles results. Its central claim—that no simple solutions exist and further discussion is needed—is observational and rests on domain knowledge rather than any load-bearing technical chain that could reduce to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a manner that creates circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard frequentist assumptions about Type 1 error control in the presence of multiplicity but introduces no new free parameters, axioms, or invented entities.

axioms (1)

Domain assumption: Frequentist inference in clinical trials requires control of Type 1 error rate across multiple hypotheses arising from interim analyses, multiple endpoints, populations, or treatment comparisons.
Explicitly stated in the abstract as the basis for current multiplicity adjustment techniques.
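
A short arithmetic sketch shows why this assumption matters. It is an illustration only: the function name `familywise_error_rate` is hypothetical, and the calculation assumes independent tests of true null hypotheses, which real endpoints rarely satisfy exactly.

```python
def familywise_error_rate(alpha_per_test: float, n_tests: int) -> float:
    """Chance of at least one false rejection among independent true nulls."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests


alpha = 0.025  # illustrative one-sided level
m = 4          # illustrative number of hypotheses

# Each hypothesis tested at the full level alpha, no adjustment.
unadjusted = familywise_error_rate(alpha, m)
# Bonferroni: each hypothesis tested at alpha / m.
bonferroni = familywise_error_rate(alpha / m, m)

print(f"unadjusted: {unadjusted:.4f}, Bonferroni: {bonferroni:.4f}")
```

With four unadjusted tests the chance of at least one false positive is roughly 0.096 instead of 0.025, while splitting the level across the tests keeps it just below 0.025.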

pith-pipeline@v0.9.0 · 5524 in / 1162 out tokens · 70550 ms · 2026-05-07T14:02:06.313881+00:00 · methodology
