
arxiv: 2511.17292 · v3 · submitted 2025-11-21 · 📊 stat.ME

Balancing Evidentiary Value and Sample Size of Adaptive Designs with Application to Animal Experiments

Pith reviewed 2026-05-17 20:20 UTC · model grok-4.3

classification 📊 stat.ME
keywords experimental unit information index, adaptive designs, animal experiments, evidentiary value, group-sequential designs, diagnostic odds ratio, sample size reduction, 3R principles

The pith

The experimental unit information index quantifies the evidentiary value of each experimental unit to balance sample size and statistical reliability in adaptive designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the experimental unit information index to measure how much evidence one unit contributes in a statistical test. This measure combines power, type I error, and sample size into an adjusted diagnostic odds ratio that also has Bayesian interpretations. A sympathetic reader would care because it directly addresses the goal of reducing animal use in research while maintaining the ability to make sound inferences. The index is defined for standard tests and then extended to adaptive designs with early stopping for efficacy or futility. A reanalysis of 2738 animal experiments with simulated post-hoc interim analyses illustrates the possible reductions in the number of animals required.

Core claim

The authors propose the experimental unit information index (EUII) as a novel measure of evidentiary value per experimental unit, obtained by adjusting diagnostic likelihood ratios and the diagnostic odds ratio for sample size. The EUII has interpretations in terms of frequentist error rates and Bayesian posterior odds. Its asymptotic value depends only on the relative effect size under the alternative. The definition is extended to adaptive designs, and application to group-sequential designs demonstrates its use for maximizing evidentiary value per unit. A reanalysis of 2738 animal experiments illustrates possible sample size savings.
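As a rough illustration of the quantities named above, the sketch below computes the diagnostic likelihood ratios and the diagnostic odds ratio of a test from its power and Type I error rate, using their standard diagnostic-testing definitions. The one-sided one-sample z-test, the values alpha = 0.025, delta = 0.5 and n = 50, and all function names are assumptions chosen for illustration. In particular, `per_unit_index` uses an n-th root as a hypothetical sample-size adjustment; the paper's exact EUII formula is not reproduced on this page and may differ.

```python
from statistics import NormalDist

def z_test_power(delta, n, alpha=0.025):
    """Power of a one-sided one-sample z-test (unit variance) at effect size delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_alpha - delta * n ** 0.5)

def diagnostic_measures(power, alpha):
    """Positive and negative diagnostic likelihood ratios and their ratio, the DOR."""
    lr_pos = power / alpha              # P(reject | H1) / P(reject | H0)
    lr_neg = (1 - power) / (1 - alpha)  # P(no reject | H1) / P(no reject | H0)
    return lr_pos, lr_neg, lr_pos / lr_neg

def per_unit_index(dor, n):
    """HYPOTHETICAL sample-size adjustment (n-th root); not the paper's formula."""
    return dor ** (1 / n)

alpha, delta, n = 0.025, 0.5, 50  # illustrative values only
power = z_test_power(delta, n, alpha)
lr_pos, lr_neg, dor = diagnostic_measures(power, alpha)
index = per_unit_index(dor, n)
print(f"power={power:.3f} LR+={lr_pos:.1f} LR-={lr_neg:.3f} DOR={dor:.0f} index={index:.3f}")
```

Keeping the adjustment in its own function makes it easy to swap in the paper's definition once it is at hand, without touching the likelihood-ratio code.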

What carries the argument

The experimental unit information index (EUII), which is the sample-size-adjusted diagnostic odds ratio that quantifies the evidentiary contribution of one experimental unit.

If this is right

  • Group-sequential adaptive designs can be evaluated and optimized using the EUII to achieve smaller sample sizes while controlling error rates.
  • The asymptotic EUII value depends solely on the assumed relative effect size under the alternative hypothesis.
  • The EUII can be interpreted both in terms of frequentist error rates (power and Type I error) and in terms of Bayesian posterior odds.
  • Post-hoc interim analyses on existing animal experiment data can identify opportunities for reduced sample sizes in future studies.
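The first point can be made concrete with a small Monte Carlo sketch of a group-sequential design with early stopping for efficacy only. It assumes a one-sided one-sample z-test with unit variance, five equally spaced looks, a maximum sample size of 50, and a constant Pocock-type critical value of 2.413 (the commonly tabulated constant for five looks at an overall two-sided level of 0.05). These settings and the function name are illustrative assumptions, not the designs evaluated in the paper.

```python
import random

def simulate_group_sequential(delta, n_max, bounds, n_sim=20000, seed=1):
    """Estimate (rejection rate, average sample size) with efficacy-only stopping."""
    rng = random.Random(seed)
    n_per_stage = n_max // len(bounds)  # equally spaced looks
    rejections, total_n = 0, 0
    for _ in range(n_sim):
        cum_sum, n_used = 0.0, 0
        for bound in bounds:
            # Sum of n_per_stage unit-variance observations with mean delta
            cum_sum += rng.gauss(delta * n_per_stage, n_per_stage ** 0.5)
            n_used += n_per_stage
            if cum_sum / n_used ** 0.5 > bound:  # z-statistic crosses the boundary
                rejections += 1
                break
        total_n += n_used
    return rejections / n_sim, total_n / n_sim

pocock_bounds = [2.413] * 5  # assumed constant boundary, five looks
alpha_hat, avg_n_null = simulate_group_sequential(0.0, 50, pocock_bounds)
power_hat, avg_n_alt = simulate_group_sequential(0.5, 50, pocock_bounds, seed=2)
print(f"type I error ~ {alpha_hat:.3f}, power ~ {power_hat:.3f}")
print(f"average n under H0 ~ {avg_n_null:.1f}, under H1 ~ {avg_n_alt:.1f}")
```

The estimated Type I error, power and average sample size are the ingredients that a sample-size-adjusted index would combine for an adaptive design.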

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers in other fields using human subjects or costly experiments could adopt the EUII to similarly optimize resource allocation.
  • The approach might be generalized to other types of sequential designs or more complex statistical models beyond group-sequential tests.
  • Integration into software for trial design could make it easier to plan studies that maximize evidence per unit.

Load-bearing premise

The relative effect size under the alternative hypothesis is known or can be prespecified so that power calculations stay accurate despite sample size reductions from early stopping.

What would settle it

Conduct a simulation where the true effect size is set differently from the value assumed in the EUII calculation, and check whether the observed error rates or evidentiary strength deviate from what the index predicts.
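A minimal sketch of such a check, assuming a one-sided one-sample z-test with unit variance and simulating the test statistic directly. The planning value of 0.5, the true value of 0.3, n = 32 and the function names are illustrative assumptions, and the diagnostic odds ratio stands in for the index because the EUII formula itself is not given on this page.

```python
import random
from statistics import NormalDist

ALPHA = 0.025
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA)

def planned_power(assumed_delta, n):
    """Analytic power at the effect size assumed during planning."""
    return 1 - NormalDist().cdf(Z_ALPHA - assumed_delta * n ** 0.5)

def simulated_rejection_rate(true_delta, n, n_sim=20000, seed=1):
    """Monte Carlo rejection rate when the true effect size is true_delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        z = rng.gauss(true_delta * n ** 0.5, 1.0)  # z-statistic ~ N(delta * sqrt(n), 1)
        hits += z > Z_ALPHA
    return hits / n_sim

def diagnostic_odds_ratio(power, alpha):
    return power * (1 - alpha) / (alpha * (1 - power))

n = 32
power_planned = planned_power(0.5, n)               # assumed effect size
power_realised = simulated_rejection_rate(0.3, n)   # smaller true effect size
alpha_realised = simulated_rejection_rate(0.0, n, seed=2)
dor_planned = diagnostic_odds_ratio(power_planned, ALPHA)
dor_realised = diagnostic_odds_ratio(power_realised, alpha_realised)
print(f"planned power {power_planned:.3f}, realised power {power_realised:.3f}")
print(f"planned DOR {dor_planned:.0f}, realised DOR {dor_realised:.0f}")
```

In this setup the Type I error rate is unaffected by the misspecification, so any gap between planned and realised evidentiary strength comes from the loss of power.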

Figures

Figures reproduced from arXiv: 2511.17292 by Fadoua Balabdaoui, Leonhard Held, Samuel Pawel, Saverio Fontana.

Figure 1: Distribution of sample sizes and standardised mean di…
Figure 2: The experimental unit information index for unequal randomisation of…
Figure 3: The experimental unit information index for the standard one-sample one-sided…
Figure 4: The EUII (top) of different group-sequential methods with four interim analyses in comparison to a fixed design with n = 50. The maximum sample size nMax of O’Brien-Fleming and Pocock is chosen so that their power (top axis) and Type-I error rate (2.5%) match with the fixed design, whereas nMax from Haybittle-Peto is always n = 50, leading to slightly higher power and Type-I error rate. Middle: Improvement…
Figure 5: The experimental unit information index (first-order) of several group-sequential…
Figure 6: The experimental unit information index (second-order) of several group…
Figure 7: Futility bounds on p-values based on predictive power.
Original abstract

Reducing the number of experimental units is one of the three pillars of the 3R principles (Replace, Reduce, Refine) in animal research. At the same time, statistical error rates need to be controlled to enable reliable inferences and decisions. This paper proposes to adopt diagnostic likelihood ratios and the diagnostic odds ratio to statistical hypothesis tests and to adjust it for sample size to obtain a novel measure to quantify for the evidentiary value of one experimental unit. The experimental unit information index (EUII) is based on power, Type-I error and sample size, and has attractive interpretations both in terms of frequentist error rates and Bayesian posterior odds. We introduce the EUII in simple statistical test settings and show that its asymptotic value depends only on the assumed relative effect size under the alternative. We then extend the definition to adaptive designs where early stopping for efficacy or futility may cause reductions in sample size. Application to group-sequential designs show the usefulness of the approach when the goal is to maximize the evidentiary value of one experimental unit. A reanalysis of 2738 animal experiments with simulated results from (post-hoc) interim analyses illustrates the possible savings in sample size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Experimental Unit Information Index (EUII), defined from the power, Type I error rate, and sample size of a statistical test, to measure the evidentiary value per experimental unit. It demonstrates that the asymptotic value of the EUII depends solely on the assumed relative effect size under the alternative hypothesis. The definition is extended to adaptive designs, specifically group-sequential designs with early stopping for efficacy or futility, and applied to a reanalysis of 2738 animal experiments to show potential reductions in sample size while maintaining evidentiary value.

Significance. If the EUII and its extension to adaptive designs are valid, this work could contribute to more efficient animal experimentation by allowing smaller sample sizes without compromising the ability to make reliable inferences, in line with the 3R principles. The dual frequentist and Bayesian interpretations of the EUII are a strength. The large-scale reanalysis provides practical illustration of the method's potential impact on reducing animal use in experiments.

major comments (3)
  1. The claim that the asymptotic EUII depends only on the assumed relative effect size is presented without the explicit derivation or limiting argument. Since the EUII is constructed directly from power, Type-I error, and sample size, and the relative effect size is an input to the power calculation, it is important to show that the limit is indeed independent of other parameters to support the interpretation as a measure of evidentiary value per unit.
  2. In the extension to group-sequential designs, the manuscript does not provide the adjusted formulas for power and Type-I error that account for the stopping boundaries. Without these, it is unclear whether the EUII correctly reflects the evidentiary value when early stopping reduces the realized sample size, which is central to the claim of balancing evidentiary value and sample size.
  3. The reanalysis of 2738 experiments simulates post-hoc interim analyses but lacks sensitivity analysis to the choice of the assumed relative effect size or error bars on the estimated savings. This weakens the illustration of possible sample size reductions.
minor comments (2)
  1. The notation for the EUII formula could be clarified to distinguish between the finite-sample and asymptotic versions.
  2. Some figures in the application section would benefit from clearer labeling of the adaptive vs non-adaptive cases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We are pleased that the referee recognizes the potential contribution to more efficient animal experimentation in line with the 3R principles. Below, we provide point-by-point responses to the major comments and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: The claim that the asymptotic EUII depends only on the assumed relative effect size is presented without the explicit derivation or limiting argument. Since the EUII is constructed directly from power, Type-I error, and sample size, and the relative effect size is an input to the power calculation, it is important to show that the limit is indeed independent of other parameters to support the interpretation as a measure of evidentiary value per unit.

    Authors: We agree that an explicit derivation would strengthen the presentation. In the revised manuscript, we will add a dedicated subsection deriving the asymptotic limit. Under the usual normal approximation for the test statistic, as n tends to infinity the power tends to 1 at a rate governed by the relative effect size δ; after the sample-size normalization built into the EUII definition, the limit simplifies to a closed-form expression depending only on δ (and the fixed α), independent of the particular choice of target power. This limiting argument directly supports the per-unit evidentiary interpretation.
    revision: yes

  2. Referee: In the extension to group-sequential designs, the manuscript does not provide the adjusted formulas for power and Type-I error that account for the stopping boundaries. Without these, it is unclear whether the EUII correctly reflects the evidentiary value when early stopping reduces the realized sample size, which is central to the claim of balancing evidentiary value and sample size.

    Authors: We acknowledge the need for greater explicitness. The revised version will include the standard expressions for the overall Type I error and power under the group-sequential boundaries (using the joint multivariate normal distribution of the sequential test statistics or the corresponding boundary-crossing probabilities). These adjusted quantities will then be inserted directly into the EUII formula, making clear how the index accounts for the random realized sample size induced by early stopping.
    revision: yes

  3. Referee: The reanalysis of 2738 experiments simulates post-hoc interim analyses but lacks sensitivity analysis to the choice of the assumed relative effect size or error bars on the estimated savings. This weakens the illustration of possible sample size reductions.

    Authors: We agree that robustness checks would improve the illustration. In the revision we will add a sensitivity analysis that repeats the reanalysis over a grid of plausible relative effect sizes and will report the resulting range of estimated sample-size savings. We will also attach simulation-based standard errors (or bootstrap intervals) to the aggregate savings figures to quantify uncertainty across the 2738 experiments.
    revision: yes
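The uncertainty quantification promised in the last response could look roughly like the sketch below, a percentile bootstrap interval for the mean relative saving across experiments. The `savings` values are synthetic placeholders drawn at random, not results from the 2738 experiments, and the function name and settings are assumptions.

```python
import random
from statistics import mean

def bootstrap_mean_interval(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]

# SYNTHETIC placeholder data: relative sample-size savings per experiment
data_rng = random.Random(42)
savings = [data_rng.uniform(0.0, 0.4) for _ in range(200)]

low, high = bootstrap_mean_interval(savings)
print(f"mean saving {mean(savings):.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Repeating the same computation for each value on a grid of assumed effect sizes would give the sensitivity range described in the response.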

Circularity Check

0 steps flagged

No significant circularity: EUII is explicitly constructed from power, alpha and n with derived asymptotic property

full rationale

The paper defines the experimental unit information index directly from power, Type-I error rate and sample size, then derives that its asymptotic value depends only on the pre-specified relative effect size delta under the alternative. This dependence is a mathematical consequence of the definition rather than a reduction of an independent claim to its inputs. No load-bearing self-citation, uniqueness theorem, or fitted parameter renamed as prediction is present in the provided derivation chain. The extension to group-sequential adaptive designs applies the same explicit construction while preserving the error-rate interpretations, making the overall argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard hypothesis-testing assumptions plus the choice of relative effect size to anchor the asymptotic EUII. No new physical entities are postulated.

free parameters (1)
  • relative effect size under the alternative
    Determines the asymptotic value of the EUII; must be assumed or specified by the user.
axioms (1)
  • standard math · Power and Type-I error rates are well-defined and can be computed for the chosen test and design.
    Invoked when constructing the EUII from diagnostic likelihood ratios.
invented entities (1)
  • Experimental Unit Information Index (EUII) · no independent evidence
    purpose: Quantify evidentiary value contributed by each experimental unit after adjusting for sample size.
    Newly introduced composite measure; no independent falsifiable prediction supplied in the abstract.


