Which Small-Sample Correction Should Be Used When Analyzing Stepped-Wedge Designs with Time-Varying Treatment Effects?
Pith reviewed 2026-05-10 03:53 UTC · model grok-4.3
The pith
When random effects are misspecified in stepped-wedge trials with time-varying effects, the Mancl-DeRouen estimator restores coverage for continuous outcomes while the Morel-Bokossa-Neerchal estimator does so for binary outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exposure-time indicator models target the time-averaged treatment effect and the long-term effect in stepped-wedge designs when effects vary with exposure duration. Under misspecified random-effects structures, confidence intervals built from model-based standard errors undercover, whereas robust variance estimators improve performance. For continuous outcomes, the Mancl-DeRouen estimator paired with a t-distribution whose degrees of freedom equal the number of clusters minus two yields the most consistent coverage; for binary outcomes, the Morel-Bokossa-Neerchal estimator is the only consistently reliable choice. Both model-based and robust approaches remain unstable when targeting the long-term effect.
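To make the degrees-of-freedom rule concrete, the sketch below builds a Wald-type interval from a t-distribution with (number of clusters minus two) degrees of freedom. It is a minimal standard-library illustration, not the paper's code: the function names, the numerical integration used for the t quantile, and the example estimate and standard error are all assumptions made here.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x: float, df: float, n: int = 2000) -> float:
    """CDF via composite Simpson's rule on [0, |x|]; n must be even."""
    a = abs(x)
    h = a / n
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p: float, df: float) -> float:
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def wald_interval(estimate: float, se: float, n_clusters: int,
                  level: float = 0.95) -> tuple[float, float]:
    """Interval estimate +/- t_{df} * se with df = n_clusters - 2."""
    df = n_clusters - 2
    if df < 1:
        raise ValueError("need at least 3 clusters for df = n_clusters - 2")
    q = t_quantile(1 - (1 - level) / 2, df)
    return estimate - q * se, estimate + q * se

# Hypothetical numbers: a robust SE of 0.12 from a trial with 12 clusters.
lower, upper = wald_interval(0.30, 0.12, n_clusters=12)
```

With 12 clusters the multiplier is about 2.23 rather than the normal-based 1.96, so the interval widens noticeably; the gap shrinks as the number of clusters grows.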
What carries the argument
Exposure-time indicator (ETI) models combined with robust variance estimators (classic sandwich, Kauermann-Carroll, Mancl-DeRouen, Morel-Bokossa-Neerchal) that adjust standard errors for small numbers of clusters and possible random-effects misspecification.
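The estimators named above share a common sandwich form and differ mainly in how they rescale cluster-level residuals. The display below is a sketch in generic GEE-style notation; the symbols are assumptions introduced here, not the paper's notation. With $I$ clusters, $D_i$ the derivative (design) matrix for cluster $i$, $V_i$ its working covariance, and $\hat{r}_i$ its residual vector:

```latex
\begin{align*}
\widehat{V}_{R} &= \Omega \left( \sum_{i=1}^{I} D_i^{\top} V_i^{-1} A_i\, \hat{r}_i \hat{r}_i^{\top} A_i^{\top} V_i^{-1} D_i \right) \Omega,
\qquad
\Omega = \left( \sum_{i=1}^{I} D_i^{\top} V_i^{-1} D_i \right)^{-1}, \\
H_i &= D_i\, \Omega\, D_i^{\top} V_i^{-1}, \\
A_i &=
\begin{cases}
I & \text{classic sandwich}, \\
(I - H_i)^{-1/2} & \text{Kauermann-Carroll (KC)}, \\
(I - H_i)^{-1} & \text{Mancl-DeRouen (MD)}.
\end{cases}
\end{align*}
```

The Morel-Bokossa-Neerchal correction is usually described differently: it rescales the classic middle term by a sample-size factor and adds a multiple of the model-based covariance $\Omega$. Its constants are not reproduced here. Because MD inverts $I - H_i$, a cluster with very high leverage can make that matrix close to singular, which is consistent with the instability reported for one-cluster-per-sequence designs.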
If this is right
- Model-based standard errors produce undercoverage for both the time-averaged and long-term effects when random effects are misspecified.
- For continuous outcomes the Mancl-DeRouen estimator with t-distribution and degrees of freedom equal to clusters minus two gives consistent coverage across scenarios.
- For binary outcomes the Morel-Bokossa-Neerchal estimator is the only small-sample correction that remains reliable.
- The Mancl-DeRouen estimator can become unstable in one-cluster-per-sequence designs because of data sparsity.
- Inference on the long-term effect stays unstable whether model-based or robust standard errors are used.
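One structural feature that may contribute to the instability of long-term-effect inference is how rarely the longest exposure time is observed. The source does not offer this explanation, so treat it as an assumption. The sketch below counts cluster-periods at each exposure time in a standard complete stepped-wedge layout, assumed here to have one more period than sequences, with one sequence crossing over at each period after the first. The function name and layout are hypothetical.

```python
from collections import Counter

def exposure_time_counts(n_sequences: int,
                         clusters_per_sequence: int = 1) -> dict[int, int]:
    """Count cluster-periods observed at each exposure time (>= 1).

    Assumed layout: periods 1..n_sequences + 1, and sequence q is first
    treated in period q + 1.
    """
    n_periods = n_sequences + 1
    counts: Counter = Counter()
    for q in range(1, n_sequences + 1):
        for j in range(1, n_periods + 1):
            exposure = j - q  # periods since crossover; <= 0 means control
            if exposure >= 1:
                counts[exposure] += clusters_per_sequence
    return dict(sorted(counts.items()))

# Four sequences, one cluster each: only a single cluster-period
# contributes to the longest exposure time.
print(exposure_time_counts(4))  # {1: 4, 2: 3, 3: 2, 4: 1}
```

Under this layout the longest exposure time is informed by the first sequence alone, whereas the time-averaged effect pools information across all exposure times.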
Where Pith is reading between the lines
- Trialists who expect effects to change with exposure duration should fit exposure-time indicator models with the recommended robust corrections instead of immediate-treatment models.
- Designs with few clusters or one cluster per sequence may still require additional safeguards beyond these estimators when targeting long-term effects.
- Similar robust-variance recommendations could be tested in other longitudinal cluster designs such as crossover trials that also feature time-dependent exposures.
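Once an exposure-time indicator model has been fitted, both targets are linear contrasts of the exposure-time coefficients, so their standard errors follow from whichever covariance matrix (model-based or robust) the analyst chooses. The sketch below assumes the time-averaged effect is the equal-weight mean of the coefficients and the long-term effect is the coefficient at the longest observed exposure time; these definitions, the function names, and the numbers are illustrative assumptions rather than the paper's implementation.

```python
import math

def contrast(coefs: list[float], cov: list[list[float]],
             weights: list[float]) -> tuple[float, float]:
    """Return (estimate, standard error) of the linear contrast w' * coefs."""
    n = len(weights)
    est = sum(w * d for w, d in zip(weights, coefs))
    var = sum(weights[a] * cov[a][b] * weights[b]
              for a in range(n) for b in range(n))
    return est, math.sqrt(var)

def tate_and_lte(coefs: list[float], cov: list[list[float]]):
    """TATE: equal-weight average; LTE: last exposure-time coefficient."""
    s = len(coefs)
    tate = contrast(coefs, cov, [1.0 / s] * s)
    lte = contrast(coefs, cov, [0.0] * (s - 1) + [1.0])
    return tate, lte

# Hypothetical estimates for four exposure times, with a diagonal
# covariance matrix whose variances grow with exposure time.
coefs = [0.10, 0.20, 0.30, 0.40]
variances = [0.01, 0.02, 0.04, 0.09]
cov = [[variances[a] if a == b else 0.0 for b in range(4)] for a in range(4)]
(tate, tate_se), (lte, lte_se) = tate_and_lte(coefs, cov)
```

In this made-up example the averaged contrast has a standard error of 0.1, while the single last coefficient has a standard error of 0.3, which illustrates why a single-coefficient target can be much noisier than an average.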
Load-bearing premise
The specific simulation scenarios, including the chosen forms of random-effects misspecification and the data-generating processes for time-varying effects, adequately represent conditions in real stepped-wedge cluster randomized trials.
What would settle it
Empirical coverage of nominal 95 percent intervals computed from a real stepped-wedge trial dataset whose true time-varying effects are known independently, checked separately for the Mancl-DeRouen and Morel-Bokossa-Neerchal estimators against the model-based standard errors.
Original abstract
Stepped-wedge cluster randomized trials (SW-CRTs) evaluate interventions rolled out across clusters over time. Standard analyses typically use immediate-treatment (IT) models, which assume effects begin at crossover and remain constant thereafter. When effects vary with exposure duration, IT models may misrepresent target effects. Exposure-time indicator (ETI) models address this by allowing treatment effects to differ by time since exposure and by targeting the time-averaged treatment effect (TATE) and long-term effect (LTE). Like IT models, ETI models require specification of a random-effects structure, which is often misspecified, and the performance of robust variance estimators (RVEs) in this setting is not well understood. We review RVEs for ETI models and evaluate them in simulation studies with continuous and binary outcomes under correctly specified (binary only) and misspecified random-effects structures. We compare the classic sandwich, Kauermann-Carroll (KC), Mancl-DeRouen (MD), and Morel-Bokossa-Neerchal (MBN) estimators for inference on the TATE and LTE. Our simulations show that under misspecified random-effects structures, model-based standard errors (SE) produced undercoverage, whereas RVEs improved performance. For continuous outcomes, MD with a t-distribution and degrees of freedom equal to the number of clusters minus two gave the most consistent coverage probabilities. For binary outcomes, MBN was the only consistently reliable option. MD, however, could be unstable in one-cluster-per-sequence designs because of data sparsity. Across scenarios, both model-based SE and RVE for LTE were unstable, indicating that greater caution is needed when targeting LTE under ETI models.
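The contrast between the two model classes can be written compactly. The display below is a sketch in generic notation (the symbols are assumptions, not taken from the paper), using a random cluster intercept as the simplest random-effects structure and an identity link. Here $i$ indexes clusters, $j$ periods, and $k$ individuals; $X_{ij}$ is the treatment indicator, $E_{ij}$ is the number of periods cluster $i$ has been exposed by period $j$ (zero under control), $S$ is the longest exposure time, and $\beta_1 = 0$ for identifiability.

```latex
\begin{align*}
\text{IT:}\quad      & Y_{ijk} = \mu + \beta_j + \delta\, X_{ij} + \alpha_i + \varepsilon_{ijk}, \\
\text{ETI:}\quad     & Y_{ijk} = \mu + \beta_j + \sum_{s=1}^{S} \delta_s\, \mathbb{1}\{E_{ij} = s\} + \alpha_i + \varepsilon_{ijk}, \\
\text{Targets:}\quad & \mathrm{TATE} = \frac{1}{S} \sum_{s=1}^{S} \delta_s, \qquad \mathrm{LTE} = \delta_S .
\end{align*}
```

The IT model is the special case $\delta_1 = \dots = \delta_S = \delta$. For binary outcomes the same linear predictor would sit inside a link function.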
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates robust variance estimators (classic sandwich, KC, MD, MBN) for exposure-time indicator (ETI) models in stepped-wedge cluster randomized trials (SW-CRTs) that allow time-varying treatment effects. It targets the time-averaged treatment effect (TATE) and long-term effect (LTE) under misspecified random-effects structures via Monte Carlo simulations for continuous and binary outcomes. Key results indicate model-based SEs undercover while RVEs improve coverage; MD with t-distribution (df = clusters-2) performs best for continuous outcomes, MBN is most reliable for binary, but LTE estimates are unstable across scenarios and MD can be unstable in sparse one-cluster-per-sequence designs.
Significance. If the simulation findings hold, the work offers timely practical guidance for choosing small-sample corrections in SW-CRT analyses when random-effects assumptions are violated and time-varying effects are present. The Monte Carlo design directly compares finite-sample performance of multiple RVEs on both TATE and LTE, addressing a gap where theoretical results are limited; this empirical evidence can help analysts avoid undercoverage in real trials.
major comments (2)
- [Methods / Simulation design] Simulation design (Methods section): the data-generating processes specify particular random-effects misspecifications and exposure-time effect patterns, but no sensitivity analyses are reported for stronger cluster-size heterogeneity, higher ICC variability, or non-monotonic time-varying effects outside the simulated envelope. Because the headline recommendations (MD+t best for continuous; MBN only reliable for binary) rest entirely on these grids, limited coverage of realistic SW-CRT conditions weakens generalizability of the performance claims.
- [Results] Results on LTE (Results section): both model-based SE and all RVEs for the long-term effect are reported as unstable across scenarios, yet the paper provides no quantitative thresholds (e.g., minimum number of clusters or periods) or alternative estimators that would make LTE inference reliable. This instability directly affects the central claim that ETI models can target LTE, so explicit guidance or caveats are needed to avoid over-interpretation.
minor comments (2)
- [Introduction] The abstract and introduction use TATE and LTE without an early formal definition or equation; adding a brief display equation in the Introduction would improve readability for readers unfamiliar with ETI models.
- [Results] Table captions (Results) should explicitly state the number of Monte Carlo replications and the exact coverage target (e.g., 95%) so that reported probabilities can be interpreted without returning to the Methods text.
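The reason replication counts matter for reading coverage tables can be shown directly: an empirical coverage probability is itself an estimate with Monte Carlo error. The sketch below, with a hypothetical function name and made-up intervals, computes coverage and its binomial Monte Carlo standard error.

```python
import math

def empirical_coverage(intervals: list[tuple[float, float]],
                       truth: float) -> tuple[float, float]:
    """Return (coverage proportion, Monte Carlo standard error)."""
    n = len(intervals)
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    p = hits / n
    mcse = math.sqrt(p * (1 - p) / n)  # binomial standard error of p
    return p, mcse

# Made-up example: 19 of 20 simulated intervals contain the true value 0.5.
intervals = [(0.0, 1.0)] * 19 + [(2.0, 3.0)]
coverage, mcse = empirical_coverage(intervals, truth=0.5)
```

For a nominal 95% target with 1,000 replications, the same formula gives a Monte Carlo standard error of about 0.007, so differences of a percentage point between estimators may be within simulation noise.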
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript evaluating robust variance estimators for exposure-time indicator models in stepped-wedge cluster randomized trials. The feedback on simulation scope and long-term effect stability is helpful, and we have revised the paper accordingly. We address each major comment point by point below.
Point-by-point responses
- Referee: Simulation design (Methods section): the data-generating processes specify particular random-effects misspecifications and exposure-time effect patterns, but no sensitivity analyses are reported for stronger cluster-size heterogeneity, higher ICC variability, or non-monotonic time-varying effects outside the simulated envelope. Because the headline recommendations (MD+t best for continuous; MBN only reliable for binary) rest entirely on these grids, limited coverage of realistic SW-CRT conditions weakens generalizability of the performance claims.
Authors: We agree that the simulation grid does not cover every conceivable SW-CRT variation, including more extreme cluster-size heterogeneity, wider ICC ranges, or non-monotonic exposure-time patterns. Our design prioritized representative misspecifications and monotonic patterns to generate practical recommendations under conditions where random-effects assumptions are violated. To improve transparency, we will expand the Discussion section with an explicit description of the simulated envelope and clear statements limiting the applicability of the MD+t and MBN recommendations to settings similar to those examined.
Revision: partial
- Referee: Results on LTE (Results section): both model-based SE and all RVEs for the long-term effect are reported as unstable across scenarios, yet the paper provides no quantitative thresholds (e.g., minimum number of clusters or periods) or alternative estimators that would make LTE inference reliable. This instability directly affects the central claim that ETI models can target LTE, so explicit guidance or caveats are needed to avoid over-interpretation.
Authors: We acknowledge that the observed instability of LTE estimates across scenarios requires more explicit guidance to prevent over-interpretation. The manuscript already states that greater caution is warranted when targeting the LTE, but we will add quantitative summaries drawn from the existing simulation results (e.g., scenarios in which coverage for the LTE approached nominal levels only when the number of clusters exceeded a certain threshold) and will recommend considering immediate-treatment models when long-term effects are the primary target and time-varying patterns are plausible. These additions will be placed in the Results and Discussion sections.
Revision: yes
Circularity Check
No circularity: claims rest on independent Monte Carlo simulations
full rationale
The paper evaluates robust variance estimators for exposure-time indicator models in stepped-wedge trials exclusively through simulation studies that generate data under controlled random-effects misspecifications and compare coverage and stability of model-based SEs versus RVEs (KC, MD, MBN). These simulations constitute external benchmarks independent of the target performance metrics; no equations reduce TATE or LTE inference to fitted parameters by construction, no self-definitional loops appear in the estimator definitions, and no load-bearing self-citations or imported uniqueness theorems are invoked to force the reported conclusions. The derivation chain is therefore self-contained as an empirical comparison rather than a tautological renaming or fit.
Axiom & Free-Parameter Ledger
free parameters (1)
- Simulation design parameters (number of clusters, periods, effect sizes, intraclass correlations)
axioms (1)
- Domain assumption: the exposure-time indicator model correctly captures the time-varying treatment effects in the data-generating process.