A comparison of methods for designing hybrid type 2 cluster-randomized trials with continuous effectiveness and implementation endpoints

Donna Spiegelman; Fan Li; Melody Owen; Ruyi Liu

arXiv: 2510.20741 · v2 · pith:ZO7C6ZZZnew · submitted 2025-10-23 · stat.ME · stat.AP

Melody Owen, Fan Li, Ruyi Liu, Donna Spiegelman

Pith reviewed 2026-05-21 20:12 UTC · model grok-4.3

keywords: hybrid type 2 trials · cluster-randomized trials · co-primary endpoints · power analysis · sample size calculation · implementation outcomes · effectiveness outcomes

The pith

P-value adjustments are always less powerful than combined outcomes or single weighted tests for powering hybrid type 2 cluster-randomized trials with two continuous endpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares five methods for calculating power and sample size in cluster-randomized trials that treat both implementation success and health effectiveness as continuous co-primary endpoints. Theoretical comparison of the power equations shows that adjusting p-values to account for testing two outcomes is always less powerful than either combining the outcomes into a single analysis or using a single weighted test that incorporates both. A large numerical study across 45,000 input scenarios then maps out where the remaining methods trade power advantages: the disjunctive two-degree-of-freedom (2-DF) test tends to win when the two treatment effects differ in magnitude, while the single one-degree-of-freedom (1-DF) test tends to win when the effects are equal. The work introduces an R package, crt2power, that lets designers evaluate these methods for their own correlation and cluster-size values. The results give practical guidance for choosing a design approach that avoids unnecessary inflation of the number of clusters required.

Core claim

The paper establishes that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test. It identifies conditions where the disjunctive 2-DF test is less powerful than the single 1-DF test. Across 45,000 input scenarios, the numerical study shows that the disjunctive 2-DF test tends to be most powerful when treatment effects are unequal, while the single 1-DF test tends to dominate when effects are equal.

What carries the argument

Theoretical comparison of power equations for p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test, evaluated numerically via the crt2power R package for cluster-randomized trials with two continuous co-primary endpoints.
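The building block shared by all of these power equations is the per-endpoint treatment-effect variance inflated by the design effect. The following is a minimal sketch of that building block under a normal approximation, assuming a two-arm parallel CRT with equal cluster sizes. It does not reproduce the crt2power package or its API; the function and argument names (`design_effect`, `noncentrality`, `power_two_sided`, `clusters_per_arm`) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for critical values and tail areas


def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def noncentrality(effect: float, sd: float, cluster_size: int,
                  icc: float, clusters_per_arm: int) -> float:
    """Treatment effect divided by its standard error for one endpoint.

    Assumes equal allocation, equal cluster sizes, and a known total SD.
    """
    var_effect = (2.0 * sd ** 2 * design_effect(cluster_size, icc)
                  / (clusters_per_arm * cluster_size))
    return effect / sqrt(var_effect)


def power_two_sided(ncp: float, alpha: float = 0.05) -> float:
    """Two-sided z-test power for a given noncentrality (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(abs(ncp) - z_crit) + _STD_NORMAL.cdf(-abs(ncp) - z_crit)
```

For example, `power_two_sided(noncentrality(0.3, 1.0, 50, 0.05, 15))` gives the approximate power for a standardized effect of 0.3 with 15 clusters of 50 per arm and an ICC of 0.05. A t- or F-based reference distribution would be more conservative when the number of clusters is small.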

If this is right

P-value adjustment methods should not be used when powering these trials because they are dominated by the combined outcomes and single weighted 1-DF approaches.
When treatment effects on the two endpoints are expected to differ, the disjunctive 2-DF test supplies the highest power among the five methods.
When treatment effects on the two endpoints are expected to be similar, the single weighted 1-DF test supplies the highest power.
The crt2power package makes it feasible to select the optimal method for any given correlation, intraclass correlation, and cluster size.
The dominance relations identified by the power-equation comparisons hold globally across broad ranges of parameter values.
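The equal-versus-unequal pattern in the list above can be illustrated with textbook forms of the two tests. This is a sketch under stated assumptions, not the paper's equations: the 1-DF statistic is taken to be the equally weighted sum of the two z-statistics, the 2-DF statistic is a chi-square test with noncentrality d' R^{-1} d, and the correlation `r` between the two test statistics is treated as a known input (in a CRT it would depend on the ICCs and the between-endpoint correlations). All function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_N = NormalDist()


def power_weighted_1df(d1: float, d2: float, r: float, alpha: float = 0.05) -> float:
    """Two-sided power for the equally weighted sum (Z1 + Z2) / sqrt(2 * (1 + r))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    ncp = (d1 + d2) / sqrt(2.0 * (1.0 + r))
    z_crit = _N.inv_cdf(1.0 - alpha / 2.0)
    return _N.cdf(abs(ncp) - z_crit) + _N.cdf(-abs(ncp) - z_crit)


def power_disjunctive_2df(d1: float, d2: float, r: float,
                          alpha: float = 0.05, terms: int = 200) -> float:
    """Power of a 2-DF chi-square test with noncentrality d' R^{-1} d.

    The noncentral chi-square tail is computed as a Poisson(lam / 2) mixture
    of central chi-square tails with 2 + 2j degrees of freedom; `terms`
    truncates the series and is ample for moderate noncentrality.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    lam = (d1 ** 2 - 2.0 * r * d1 * d2 + d2 ** 2) / (1.0 - r ** 2)
    half_crit = -log(alpha)        # chi-square(2) critical value is -2 ln(alpha)
    poisson = exp(-lam / 2.0)      # Poisson weight for j = 0
    tail_term = 1.0                # (crit / 2)^j / j! for j = 0
    tail_sum = 0.0
    power = 0.0
    for j in range(terms):
        tail_sum += tail_term      # sum over i <= j of (crit / 2)^i / i!
        power += poisson * exp(-half_crit) * tail_sum
        poisson *= (lam / 2.0) / (j + 1)
        tail_term *= half_crit / (j + 1)
    return power
```

Calling both functions with equal noncentralities, for example `(2.5, 2.5, 0.3)`, and then with unequal ones, for example `(3.0, 0.0, 0.3)`, shows the ordering flip: in the equal case both tests have the same squared noncentrality but the 2-DF test spends an extra degree of freedom, while in the unequal case the weighted sum dilutes the one large effect.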

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers facing similar dual-endpoint problems but with binary or time-to-event outcomes could derive parallel power formulas and repeat the same dominance checks.
Routine use of the preferred method for a given effect-equality pattern could reduce the total number of clusters needed in resource-constrained implementation studies.
Pre-specifying the analysis method at the design stage, rather than defaulting to p-value adjustment, would directly translate into smaller, more feasible trials.
The patterns observed here may inform sample-size planning for other multi-outcome cluster designs outside the hybrid type 2 setting.

Load-bearing premise

The power equations and simulation results assume that the two continuous endpoints follow models that allow closed-form or accurate numerical power calculations for all five methods under the stated correlation and cluster-size conditions.

What would settle it

A set of power calculations or simulations using the same five methods but with endpoint distributions that violate the linear mixed-model assumptions, for example by adding strong skewness or cluster-level outliers, to check whether the reported power ordering reverses.

Original abstract

Hybrid type 2 studies are gaining popularity for their ability to assess both implementation and health outcomes as co-primary endpoints. Often conducted as cluster-randomized trials (CRTs), five design methods can validly power these studies: p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test. We compared these methods theoretically and numerically. Theoretical comparisons of power equations allowed us to identify when one method had more or less power than another globally. We showed that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single 1-DF test, and identified conditions where the disjunctive 2-DF test is less powerful than the single 1-DF test. To further investigate when power advantages shift, we conducted a large-scale numerical study using our novel crt2power R package, which calculates power or sample size for CRTs with two continuous co-primary endpoints using these methods. Across 45,000 input scenarios, we found specific patterns: when treatment effects are unequal, the disjunctive 2-DF test tends to be most powerful; when treatment effects are equal, the single 1-DF test tends to dominate. Together, these comparisons offer practical guidance for powering hybrid type 2 studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives practical rules for picking power methods in hybrid type 2 CRTs and ships a usable R package, but the global claim that p-value adjustments are always weakest rests on power formulas whose joint-distribution assumptions are not fully stress-tested.

The key point is that this work supplies clear, simulation-backed advice on when to use the disjunctive 2-DF test versus the single 1-DF test for hybrid type 2 cluster-randomized trials that have two continuous co-primary endpoints. It also ships the crt2power package that implements all five methods, which is the most immediately useful output for people actually designing these studies.

Referee Report

2 major / 2 minor

Summary. The manuscript compares five methods for powering hybrid type 2 cluster-randomized trials (CRTs) with two continuous co-primary endpoints (effectiveness and implementation): p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test. Theoretical comparisons of power equations are used to establish global orderings, including that p-value adjustment methods are always less powerful than the combined outcomes approach and the single weighted 1-DF test. A large numerical study across 45,000 scenarios, implemented via the authors' crt2power R package, identifies that the disjunctive 2-DF test tends to be most powerful when treatment effects are unequal while the single 1-DF test tends to dominate when effects are equal.

Significance. If the central claims hold, the paper supplies practical design guidance for hybrid type 2 CRTs, an increasingly common study type in implementation science. The development and use of the crt2power package for power and sample-size calculations under bivariate CRT linear mixed models is a clear strength that supports reproducibility. The combination of analytic power inequalities with an extensive simulation study provides a useful framework for choosing among valid powering methods under varying effect-size and correlation conditions.

major comments (2)

[theoretical comparisons] Abstract and theoretical comparisons: the global claim that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test rests on the power equations for the bivariate normal distribution of test statistics under the CRT variance structure (ICC, cluster size, design effect). The manuscript should present the explicit non-centrality parameter and power formula used for the p-value adjustment method (e.g., Bonferroni) and derive or show the inequality with the other methods to confirm that the joint distribution is modeled identically across approaches.
[numerical study] Numerical study section: the 45,000-scenario results inherit the same power formulas via the crt2power package. To strengthen the claim that patterns (disjunctive 2-DF most powerful when effects unequal; single 1-DF when equal) are robust, the manuscript should state whether the scenario grid (ICC values, cluster sizes, endpoint correlations, effect-size ratios) was fully pre-specified before running the simulations.
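
As an illustration of the kind of explicit expression the first major comment requests, the marginal power of a Bonferroni-adjusted test can be written as follows under a normal approximation. This is a generic form, not taken from the manuscript. The symbols are assumptions introduced here: $\beta_q$ is the treatment effect on endpoint $q$, $\sigma_q^2$ its total variance, $\rho_q$ its ICC, $m$ the common cluster size, and $K$ the number of clusters per arm.

```latex
\delta_q = \frac{\beta_q}{\sqrt{\dfrac{2\,\sigma_q^{2}\,\bigl[1 + (m-1)\rho_q\bigr]}{K m}}},
\qquad q = 1, 2,

\text{power}_q^{\text{Bonf}}
  = \Phi\bigl(|\delta_q| - z_{1-\alpha/4}\bigr)
  + \Phi\bigl(-|\delta_q| - z_{1-\alpha/4}\bigr).
```

The critical value $z_{1-\alpha/4}$ arises because each of the two endpoints is tested two-sided at level $\alpha/2$. How the two marginal powers are combined into a single study-level power is a separate choice that this sketch leaves open.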

minor comments (2)

[abstract] The abstract refers to the 'crt2power R package' as novel; adding a GitHub or CRAN link and a brief description of its core functions would improve accessibility and reproducibility.
Minor notation: ensure consistent use of '1-DF' versus 'one-degree-of-freedom' throughout the text and figures for reader clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our theoretical results and the robustness of the simulation design. We respond to each major comment below.

Referee: Abstract and theoretical comparisons: the global claim that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test rests on the power equations for the bivariate normal distribution of test statistics under the CRT variance structure (ICC, cluster size, design effect). The manuscript should present the explicit non-centrality parameter and power formula used for the p-value adjustment method (e.g., Bonferroni) and derive or show the inequality with the other methods to confirm that the joint distribution is modeled identically across approaches.

Authors: We agree that greater explicitness will strengthen the manuscript. In the revision we will add a subsection presenting the non-centrality parameters and power formulas for every method, including the Bonferroni adjustment (which applies the adjusted alpha to the marginal non-centrality parameters). We will then derive the strict inequality by direct comparison of the power functions under the identical bivariate normal distribution of the test statistics that incorporates the same CRT variance components (ICC, cluster size, and design effect) for all approaches. revision: yes
Referee: Numerical study section: the 45,000-scenario results inherit the same power formulas via the crt2power package. To strengthen the claim that patterns (disjunctive 2-DF most powerful when effects unequal; single 1-DF when equal) are robust, the manuscript should state whether the scenario grid (ICC values, cluster sizes, endpoint correlations, effect-size ratios) was fully pre-specified before running the simulations.

Authors: We will add an explicit statement in the Numerical study section that the 45,000-scenario grid was fully defined before any simulations were executed. The ranges for ICC (0.01–0.20), cluster size (10–100), endpoint correlation (−0.5 to 0.9), and effect-size ratio (0.5–2) were chosen in advance to span values typical of hybrid type 2 CRTs in implementation science, ensuring the reported dominance patterns are not the result of post-hoc selection. revision: yes
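
A pre-specified grid of the kind described in this response can be fixed in code before any power calculation runs. The sketch below is only an illustration: the levels are hypothetical values picked inside the quoted ranges, and the resulting grid is far smaller than, and not a reconstruction of, the 45,000-scenario grid.

```python
from itertools import product

# Hypothetical levels inside the ranges quoted above; the actual grid is not reproduced here.
ICC_LEVELS = (0.01, 0.05, 0.10, 0.20)
CLUSTER_SIZES = (10, 25, 50, 100)
ENDPOINT_CORRELATIONS = (-0.5, 0.0, 0.3, 0.6, 0.9)
EFFECT_RATIOS = (0.5, 1.0, 1.5, 2.0)


def build_grid():
    """Return the full factorial grid as a tuple of scenario dictionaries."""
    keys = ("icc", "cluster_size", "endpoint_corr", "effect_ratio")
    combos = product(ICC_LEVELS, CLUSTER_SIZES, ENDPOINT_CORRELATIONS, EFFECT_RATIOS)
    return tuple(dict(zip(keys, combo)) for combo in combos)


# Built once, up front, so that no scenario is added or dropped after results are seen.
GRID = build_grid()
```

Storing the levels as module-level constants and building the grid once makes the design auditable: the scenario count is the product of the level counts, and any later change to the grid shows up as a diff.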

Circularity Check

0 steps flagged

No circularity: power comparisons derive from independent bivariate CRT formulas

The paper's central claims rest on direct algebraic comparisons of closed-form power equations for five distinct testing procedures (p-value adjustment, combined outcomes, weighted 1-DF, disjunctive 2-DF, conjunctive) under a shared bivariate normal model that incorporates cluster-level random effects, ICC, and design effects for both endpoints. These equations are stated as standard extensions of linear mixed-model power formulas rather than being defined in terms of one another or fitted from the same simulation outputs. The 45,000-scenario numerical study simply evaluates the same pre-derived formulas via the authors' implementation package; it does not feed results back into the theoretical ordering or redefine any non-centrality parameter from the data being compared. No self-citation is invoked as a load-bearing uniqueness theorem, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation chain is therefore self-contained against external statistical benchmarks for multivariate CRT power.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard statistical assumptions for continuous endpoints in cluster-randomized trials and on the validity of the power formulas for each of the five methods; no new entities or fitted constants are introduced beyond those already required by the methods being compared.

axioms (1)

domain assumption: Continuous endpoints follow linear mixed models permitting closed-form or accurate numerical power calculations for the five tests under the assumed intra-cluster correlation and cluster sizes.
Invoked to justify both the theoretical power comparisons and the simulation scenarios.
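
For orientation, one generic way to write such a model for endpoint $q$ is given below. This is a standard random-intercept form stated as an assumption, not the manuscript's exact specification. $X_i$ is the treatment indicator for cluster $i$, $j$ indexes individuals, and the variance symbols are introduced here for illustration.

```latex
Y_{ijq} = \mu_q + \beta_q X_i + b_{iq} + \varepsilon_{ijq},
\qquad b_{iq} \sim N\bigl(0, \sigma_{b,q}^{2}\bigr),
\quad \varepsilon_{ijq} \sim N\bigl(0, \sigma_{e,q}^{2}\bigr),

\rho_q = \frac{\sigma_{b,q}^{2}}{\sigma_{b,q}^{2} + \sigma_{e,q}^{2}},
\qquad q = 1, 2.
```

In a bivariate version the cluster-level effects $(b_{i1}, b_{i2})$ and the individual-level errors $(\varepsilon_{ij1}, \varepsilon_{ij2})$ would each be allowed to correlate across endpoints, which is what induces correlation between the two test statistics.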

pith-pipeline@v0.9.0 · 5777 in / 1375 out tokens · 39911 ms · 2026-05-21T20:12:19.486270+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Theoretical comparisons of the power equations allowed us to identify when one method had more or less power than another globally. We showed that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single 1-DF test
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We compared these methods theoretically and numerically... using our novel crt2power R package

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.