Sample size calculations for multilevel factorial longitudinal cluster randomised trials

Andrew Copas; Andrew Forbes; Fan Li; James Wason; Jessica Kasza; Rebecca Walwyn; Rhys Bowden

arxiv: 2508.01610 · v2 · submitted 2025-08-03 · 📊 stat.ME

Rhys Bowden, Rebecca Walwyn, Jessica Kasza, Andrew Copas, Fan Li, James Wason, Andrew Forbes

Pith reviewed 2026-05-19 01:44 UTC · model grok-4.3

classification 📊 stat.ME

keywords: sample size calculation, power analysis, factorial design, cluster randomised trial, longitudinal data, multilevel intervention, split-plot design, stepped wedge

The pith

Power to detect individual, cluster, and interaction effects in split-plot factorial longitudinal cluster trials follows from standard formulas for simpler trial designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops sample size methods for trials that simultaneously test an individual-level intervention and a cluster-level intervention using longitudinal cluster randomisation schemes such as stepped-wedge or crossover designs. It demonstrates that the required number of participants or clusters to achieve specified power for the two main effects and their interaction can be obtained by combining existing power results from individually randomised trials and from longitudinal cluster randomised trials. This combined approach is applied to the SharES breast-cancer trial to illustrate planning for continuous outcomes under the multilevel factorial structure. A sympathetic reader would care because many real-world interventions act at both personal and group levels, yet prior methodology did not cover the longitudinal cluster setting.

Core claim

For continuous outcomes in split-plot factorial longitudinal cluster randomised trials, the power to detect a given effect of the individual-level intervention, the cluster-level intervention, and the interaction between them is determined directly from standard power formulas for individually randomised trials and for longitudinal cluster randomised trials.

What carries the argument

The split-plot factorial longitudinal cluster randomised trial, in which clusters are randomised to levels of one factor while individuals within clusters are randomised to levels of the second factor, with outcomes measured repeatedly over time.

If this is right

Trial designers can plan joint assessment of patient-level and clinic-level interventions in stepped-wedge or crossover settings using readily available power calculators.
The approach extends to any longitudinal cluster design whose power formula is already known, without requiring a completely new derivation.
Interaction effects between the two intervention levels become testable within the same study that also estimates the separate main effects.
Sample size can be expressed in terms of the variances and intra-class correlations already familiar from simpler trial types.
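The building blocks referred to in the list above can be sketched with textbook formulas. The block below is an illustration only, not the paper's method: it assumes a two-sided z-test with known variance, a two-arm individually randomised trial with equal allocation, and, for the clustering step, the familiar design effect 1 + (m − 1)ρ of a parallel cluster randomised trial with equal cluster sizes. Longitudinal designs such as the stepped wedge need their own variance expressions, which are not reproduced here. All function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_from_variance(effect, variance, alpha=0.05):
    """Two-sided z-test power for an estimator with known variance."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    ratio = abs(effect) / variance ** 0.5
    return _STD_NORMAL.cdf(ratio - z_crit) + _STD_NORMAL.cdf(-ratio - z_crit)


def n_per_arm_individual(effect, sigma, alpha=0.05, power=0.8):
    """Participants per arm, two-arm individually randomised trial (z approximation)."""
    z_sum = _STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)
    return ceil(2 * sigma ** 2 * z_sum ** 2 / effect ** 2)


def design_effect_parallel_crt(cluster_size, icc):
    """Variance inflation for a parallel cluster trial with equal cluster sizes.

    Assumption: this is the simple parallel-design factor, used here only as
    an example of a 'familiar' quantity; it is not a longitudinal design effect.
    """
    return 1 + (cluster_size - 1) * icc


if __name__ == "__main__":
    n = n_per_arm_individual(effect=0.5, sigma=1.0)
    print("per-arm n, individual randomisation:", n)
    # Variance of a difference in means with n per arm is 2 * sigma^2 / n.
    print("achieved power:", round(power_from_variance(0.5, 2 * 1.0 ** 2 / n), 3))
    inflated = ceil(n * design_effect_parallel_crt(cluster_size=20, icc=0.05))
    print("per-arm n after parallel-CRT inflation:", inflated)
```

Once the variance of a treatment effect estimator is available for a given design, `power_from_variance` is the only step needed, which is why results for simpler designs can be reused whenever the variance can be written in terms of them.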

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method may reduce the barrier to running multilevel intervention studies in fields such as oncology or primary care where both clinician behaviour and patient education matter.
Extension to binary or time-to-event outcomes would follow the same logic once the corresponding component formulas are substituted.
The framework invites direct comparison of efficiency between split-plot factorial longitudinal designs and separate single-factor trials.

Load-bearing premise

That the power for the full multilevel factorial longitudinal design can be assembled by direct combination of existing formulas from separate individual-randomised and longitudinal cluster-randomised trials without extra terms for time-dependent interactions or the specific factorial randomisation structure.

What would settle it

A Monte Carlo simulation of a split-plot factorial longitudinal cluster trial in which the sample size chosen from the combined formulas fails to deliver the nominal power for the interaction effect when the true data-generating process includes realistic cluster-level and temporal correlation.
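A check of this kind could be organised along the following lines. The sketch is a deliberately simplified toy, not the paper's model or estimator: it assumes a stepped-wedge layout, a cluster random intercept plus a cluster-period random effect, equal numbers randomised to each individual-level arm within every cluster-period, and a z-test with known residual standard deviation. The interaction is estimated by contrasting within-cluster-period differences in means between exposed and unexposed cluster-periods, so the analytic standard error used here is derived for this toy estimator only. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist

PHI = NormalDist()


def stepped_wedge(n_clusters, n_periods):
    """design[k][t] = 1 once cluster k has switched to the cluster-level intervention."""
    return [[1 if t > k % (n_periods - 1) else 0 for t in range(n_periods)]
            for k in range(n_clusters)]


def analytic_power(effect, se, alpha=0.05):
    """Two-sided z-test power for a known standard error."""
    z = PHI.inv_cdf(1 - alpha / 2)
    return PHI.cdf(effect / se - z) + PHI.cdf(-effect / se - z)


def simulate_estimate(rng, design, n_per_arm, beta_int, sd_resid,
                      beta_ind=0.3, beta_clu=0.2, sd_cluster=0.5, sd_cp=0.3):
    """Simulate one trial and return the interaction estimate."""
    diffs = {0: [], 1: []}
    for row in design:
        u = rng.gauss(0, sd_cluster)  # cluster random intercept
        for t, x in enumerate(row):
            # Period effect, random effects and cluster-level effect are shared
            # by both individual-level arms within the cluster-period.
            shared = 0.1 * t + u + rng.gauss(0, sd_cp) + beta_clu * x
            y0 = [shared + rng.gauss(0, sd_resid) for _ in range(n_per_arm)]
            y1 = [shared + beta_ind + beta_int * x + rng.gauss(0, sd_resid)
                  for _ in range(n_per_arm)]
            # Within-cell difference in means: everything in `shared` cancels.
            diffs[x].append(sum(y1) / n_per_arm - sum(y0) / n_per_arm)
    return sum(diffs[1]) / len(diffs[1]) - sum(diffs[0]) / len(diffs[0])


def run_check(reps=2000, seed=1, n_clusters=6, n_periods=4, n_per_arm=5,
              beta_int=0.75, sd_resid=1.0, alpha=0.05):
    """Return (analytic power, empirical power) for the toy interaction test."""
    rng = random.Random(seed)
    design = stepped_wedge(n_clusters, n_periods)
    n1 = sum(sum(row) for row in design)   # exposed cluster-periods
    n0 = n_clusters * n_periods - n1       # unexposed cluster-periods
    se = (2 * sd_resid ** 2 / n_per_arm * (1 / n1 + 1 / n0)) ** 0.5
    z = PHI.inv_cdf(1 - alpha / 2)
    rejections = sum(
        abs(simulate_estimate(rng, design, n_per_arm, beta_int, sd_resid)) / se > z
        for _ in range(reps))
    return analytic_power(beta_int, se, alpha), rejections / reps


if __name__ == "__main__":
    analytic, empirical = run_check()
    print("analytic power:", round(analytic, 3), "empirical power:", round(empirical, 3))
```

In this toy setting the two numbers should agree up to Monte Carlo error, because the shared terms cancel exactly. A test of the premise proper would replace the data-generating process with one that includes the extra dependence in question (for example, a treatment effect that varies by cluster or over time) and the analysis with the mixed model the trial would actually use.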

Figures

Figures reproduced from arXiv: 2508.01610 by Andrew Copas, Andrew Forbes, Fan Li, James Wason, Jessica Kasza, Rebecca Walwyn, Rhys Bowden.

**Figure 2.** Power vs number of observations per cluster period, for a standardised effect

**Figure 3.** Comparison of the variance of treatment effect estimators vs within-period

Original abstract

Typically, trials investigate the impact of either an individual-level intervention on participant outcomes, or the impact of a cluster-level intervention on participant outcomes. Factorial designs consider two (or more) treatments for each of two (or more) different factors. In factorial trial designs, trial units (individuals or clusters) are each randomised to a level of each of the treatments; these designs allow assessment of the interactions between different interventions. Recently, there has been growing interest in the design of trials that jointly assess the impact of individual- and cluster-level interventions (i.e. multi-level interventions); requiring the development of methodology that accommodates randomisation at multiple levels. While recent work has developed sample size methodology for variants combining standard cluster randomisation and individual randomisation, that work does not apply to longitudinal cluster randomised trial designs such as the stepped wedge design or cluster randomised crossover design. Here we present dedicated sample size methodology for "split-plot factorial longitudinal cluster randomised trials" with continuous outcomes: allowing for joint assessment of individual-level and cluster-level interventions that allows for the impact of the cluster-level intervention to be assessed using any longitudinal cluster randomised trial design. We show how the power to detect given effects of the individual-level intervention, the cluster-level intervention, and the interaction between the two depends on standard results for individually-randomised trials and longitudinal cluster randomised trials. We apply these results to the SharES trial, which considered the effects of a patient- and clinician-level interventions for patients with breast cancer on patient knowledge about the risks and benefits of treatment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives sample size formulas for split-plot factorial longitudinal cluster trials by combining standard results from individual and longitudinal cluster designs, with an example from the SharES trial.

This paper's main contribution is a set of sample size formulas for trials that randomise at both individual and cluster levels while using longitudinal cluster designs like stepped wedge or crossover. It shows the power for the individual effect, cluster effect, and their interaction by pulling from standard results in simpler trial settings.

What stands out is the extension to these longitudinal factorial setups. Earlier papers handled multilevel interventions in standard cluster trials, but this one adapts it for designs where clusters are followed over time. The authors apply it to the SharES trial, which tested patient and clinician interventions for breast cancer knowledge, giving a concrete example.

The work does a good job of identifying the gap and providing a way to plan such studies without reinventing everything from the ground up. By depending on established formulas for individual randomised trials and longitudinal cluster ones, it keeps things practical for users who already know those basics.

That said, the description suggests the new power expressions come from direct combination rather than fresh derivations that might capture interactions between the levels and time. If the multilevel factorial aspect introduces dependencies not present in the separate cases, this could miss something. The lack of mentioned validation simulations or error bounds in the abstract makes it harder to judge how robust the method is in practice.

Overall, this targets trial statisticians who design studies with mixed intervention levels in cluster settings over time. A reader working on sample size for a stepped wedge factorial trial would get usable guidance here. The paper shows clear engagement with the literature on cluster trial methods and presents a logical extension. It deserves a serious referee to check the technical details and assumptions. I would recommend putting it through peer review rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript presents dedicated sample size methodology for split-plot factorial longitudinal cluster randomised trials with continuous outcomes. It enables joint assessment of individual-level and cluster-level interventions (and their interaction) where the cluster-level intervention is evaluated via any longitudinal cluster randomised trial design such as stepped wedge or cluster crossover. The central claim is that power for the three effects follows directly from combining standard power results for individually randomised trials and longitudinal cluster randomised trials; the approach is illustrated via application to the SharES trial.

Significance. If the combination of existing formulas is shown to be statistically valid for the split-plot factorial structure, the work would supply a practical, accessible tool for sample-size planning in trials that evaluate multilevel interventions within longitudinal cluster designs. This addresses a methodological gap without requiring wholly new derivations, which is efficient. The real-trial application adds immediate relevance for trialists working on patient- and clinician-level interventions.

major comments (1)

[Abstract and Methods] The claim that power expressions for the individual-level, cluster-level, and interaction effects are obtained directly by combining standard results rests on the unstated assumption that the multilevel factorial structure introduces no additional variance components or time-dependent interaction terms beyond those already present in the separate individual-randomised and longitudinal-cluster formulae. Without an explicit combined variance expression or design-matrix adjustment (e.g., for the interaction contrast in a longitudinal mixed model), it is not possible to verify that the combination preserves correct type-I error and power for the interaction test.

minor comments (1)

[Application] The SharES trial example would benefit from a short table listing the assumed effect sizes, intra-cluster correlations, and autocorrelation values used to obtain the reported sample sizes, allowing readers to reproduce the calculations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. The observation regarding the need for explicit verification of the combined variance structure is well taken and has prompted us to strengthen the presentation of the underlying model.

Referee: Abstract and Methods: the claim that power expressions for the individual-level, cluster-level, and interaction effects are obtained directly by combining standard results rests on the unstated assumption that the multilevel factorial structure introduces no additional variance components or time-dependent interaction terms beyond those already present in the separate individual-randomised and longitudinal-cluster formulae. Without an explicit combined variance expression or design-matrix adjustment (e.g., for the interaction contrast in a longitudinal mixed model), it is not possible to verify that the combination preserves correct type-I error and power for the interaction test.

Authors: We agree that an explicit derivation of the variance expressions would improve verifiability. The split-plot factorial structure nests individual-level randomisation within the cluster-level longitudinal design, so that the fixed-effects contrasts for the individual-level main effect, cluster-level main effect, and their interaction remain orthogonal under the assumed linear mixed model. The random-effects structure (cluster, time, and cluster-by-time) is inherited directly from the longitudinal cluster trial component and does not acquire extra time-dependent interaction terms from the factorial aspect. In the revised manuscript we will add a subsection that states the full model, supplies the design-matrix contrast vectors for each effect, and derives the variance of the interaction estimator as a linear combination of the variances already present in the separate individual-randomised and longitudinal-cluster formulae. This will confirm that type-I error and power for the interaction test are preserved under the standard assumptions.

Revision: yes
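The orthogonality appealed to in this response can be illustrated in the simplest balanced 2 × 2 case. The sketch below is not the paper's model: it assumes four cell means that are independent with a common variance, which ignores the clustering and repeated measurement that the trial design introduces, and it only shows that the usual contrast vectors for the two main effects and the interaction have zero pairwise inner products. Cell ordering, names, and the independence assumption are illustrative choices.

```python
from itertools import combinations

# Cell order: (X=0, Z=0), (X=0, Z=1), (X=1, Z=0), (X=1, Z=1),
# where X is the cluster-level factor and Z the individual-level factor.
CONTRASTS = {
    "cluster_main": (-0.5, -0.5, 0.5, 0.5),
    "individual_main": (-0.5, 0.5, -0.5, 0.5),
    "interaction": (1.0, -1.0, -1.0, 1.0),
}


def dot(a, b):
    """Inner product of two contrast vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrast_covariance(c, d, cell_mean_variance):
    """Cov(c'ybar, d'ybar) ASSUMING independent cell means with a common variance."""
    return cell_mean_variance * dot(c, d)


if __name__ == "__main__":
    for (name_a, a), (name_b, b) in combinations(CONTRASTS.items(), 2):
        print(name_a, name_b, contrast_covariance(a, b, cell_mean_variance=1.0))
    for name, c in CONTRASTS.items():
        print(name, "variance multiplier:", dot(c, c))
```

Under correlated cell means the covariance is c'Vd for the full covariance matrix V of the cell means, so whether the estimators stay uncorrelated depends on the structure of V. That dependence is the point the referee asks the authors to make explicit.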

Circularity Check

0 steps flagged

No significant circularity; derivation combines external standard results

The paper's central claim is that power calculations for the split-plot factorial longitudinal cluster design follow from combining existing formulas for individually randomised trials and longitudinal cluster randomised trials. No load-bearing step reduces to a self-citation chain, fitted parameter renamed as prediction, or self-definitional loop within the manuscript. The derivation treats the multilevel factorial structure as separable into known components once the design is specified, with the abstract explicitly stating dependence on 'standard results' from prior literature rather than internal fits or author-specific uniqueness theorems. This is the most common honest non-finding for papers that extend methodology by composition of established results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical assumptions for continuous outcomes in randomised trials and the novel claim that power expressions combine without additional multilevel-specific terms.

axioms (2)

domain assumption: Continuous outcomes follow linear mixed models with random effects capturing clustering and longitudinal correlation. Standard assumption invoked for power calculations in cluster and longitudinal trial designs.
ad hoc to paper: The factorial multilevel structure does not introduce new variance components beyond those in the separate individual and cluster designs. This is the key modelling choice that allows direct combination of existing power results.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We show how the power to detect given effects ... depends on standard results for individually-randomised trials and longitudinal cluster randomised trials."

IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: var(β̂_I) = (1 − ρ_CT) / (m T σ_Z² (1 − π_X)) · σ²/n

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Ronald Christensen. Analysis of variance, design, and regression: Linear modeling for unbalanced data. Chapman and Hall/CRC, 2018.

[2] John J Dziak, Inbal Nahum-Shani, and Linda M Collins. Multilevel factorial experiments for developing behavioral interventions: power, sample size, and resource considerations. Psychological Methods, 17(2):153, 2012.

[3] Sarah T Hawley, Kelley Kidwell, David Zahrieh, Anne McCarthy, Rachel Wills, Aaron Rankin, Timothy Hofer, Selina Chow, Reshma Jagsi, and Heather Neuman. Improving patient-centered communication in breast cancer: a study protocol for a multilevel intervention of a shared treatment deliberation system (shares) within the nci community oncology research pr...

[4] R. Hooper, S. Teerenstra, E. de Hoop, and S. Eldridge. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Stat Med, 35:4718–4728, 2016.

[5] Jessica Kasza, Rhys Bowden, Yongdong Ouyang, Monica Taljaard, and Andrew B Forbes. Does it decay? obtaining decaying correlation parameter values from previously analysed cluster randomised trials. Statistical Methods in Medical Research, 32(11):2123–2134, 2023.

[6] Jacqueline Kerr, Dori Rosenberg, Rachel A Millstein, Khalisa Bolling, Katie Crist, Michelle Takemoto, Suneeta Godbole, Kevin Moran, Loki Natarajan, Cynthia Castro-Sweet, et al. Cluster randomized controlled trial of a multilevel physical activity intervention for older adults. International Journal of Behavioral Nutrition and Physical Activity, 15:1–9, 2018.

[7] Colleen M McBride, Hannah LF Cooper, David R Williams, and Karen M Emmons. Walking the talk on multi-level interventions: the power of parsimony. Social Science & Medicine, 283:114189, 2021.

[8] Noreen D Mdege, Sally Brabyn, Catherine Hewitt, Rachel Richardson, and David J Torgerson. The 2 × 2 cluster randomized controlled factorial trial design is mainly used for efficiency and to explore intervention interactions: a systematic review. Journal of Clinical Epidemiology, 67(10):1083–1092, 2014.

[9] Rahul Mukerjee and CF Jeff Wu. A modern theory of factorial design. Springer Science & Business Media, 2007.

[10] Inbal Nahum-Shani and John J Dziak. Multilevel factorial designs in intervention development. Optimization of behavioral, biobehavioral, and biomedical interventions: Advanced topics, pages 47–87, 2018.

[11] Lila J Finney Rutten, Joan M Griffin, Jennifer L St Sauver, Kathy MacLaughlin, Jessica D Austin, Gregory Jenkins, Jeph Herrin, and Robert M Jacobson. Multilevel implementation strategies for adolescent human papillomavirus vaccine uptake: a cluster randomized clinical trial. JAMA Pediatrics, 178(1):29–36, 2024.

[12] John Sperger, Michael R Kosorok, Laura Linnan, and Shawn M Kneipp. Multi-level intervention stepped wedge designs (mli-swds). Prevention Science, 25(Suppl 3):371–383, 2024.

[13] Zizhong Tian, Denise Esserman, Guangyu Tong, Ondrej Blaha, James Dziura, Peter Peduzzi, and Fan Li. Sample size calculation in hierarchical 2 × 2 factorial trials with unequal cluster sizes. Statistics in Medicine, 41(4):645–664, 2022.

[14] BAM Van der Geest and JV Been. Learning from a factorial stepped-wedge cluster randomised controlled trial in primary care. Neonatal hyperbilirubinaemia, page 127, 2020.
