A comparison of methods for designing hybrid type 2 cluster-randomized trials with continuous effectiveness and implementation endpoints
Pith reviewed 2026-05-21 20:12 UTC · model grok-4.3
The pith
P-value adjustments are always less powerful than combined outcomes or single weighted tests for powering hybrid type 2 cluster-randomized trials with two continuous endpoints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test. It identifies conditions where the disjunctive 2-DF test is less powerful than the single 1-DF test. Across 45,000 input scenarios, the numerical study shows that the disjunctive 2-DF test tends to be most powerful when treatment effects are unequal, while the single 1-DF test tends to dominate when effects are equal.
What carries the argument
Theoretical comparison of power equations for p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test, evaluated numerically via the crt2power R package for cluster-randomized trials with two continuous co-primary endpoints.
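For orientation, the way cluster randomization enters a single-endpoint power calculation can be sketched in a few lines. This is a minimal Python illustration, not the crt2power R package, whose interface is not shown in this review. It assumes equal cluster sizes, the same number of clusters in each arm, a known total standard deviation, and a normal approximation to the test statistic. All function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for quantiles and CDF values


def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def noncentrality(effect: float, sd: float, clusters_per_arm: int,
                  cluster_size: int, icc: float) -> float:
    """Mean of the z-statistic for a two-arm CRT with equal allocation.

    Assumption: Var(effect estimate) = 2 * sd^2 * design_effect / (K * m),
    where K is clusters per arm and m is the cluster size.
    """
    variance = (2.0 * sd ** 2 * design_effect(cluster_size, icc)
                / (clusters_per_arm * cluster_size))
    return effect / sqrt(variance)


def two_sided_power(ncp: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test whose statistic has mean `ncp`."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the functions.
    ncp = noncentrality(effect=0.3, sd=1.0, clusters_per_arm=15,
                        cluster_size=30, icc=0.05)
    print(f"non-centrality = {ncp:.3f}, power = {two_sided_power(ncp):.3f}")
```

The five methods compared in the paper differ in how they combine two such statistics; the sketch covers only the shared single-endpoint building block.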
If this is right
- P-value adjustment methods should not be used when powering these trials because they are dominated by the combined outcomes and single weighted 1-DF approaches.
- When treatment effects on the two endpoints are expected to differ, the disjunctive 2-DF test tends to supply the highest power among the five methods.
- When treatment effects on the two endpoints are expected to be similar, the single weighted 1-DF test tends to supply the highest power.
- The crt2power package makes it feasible to select the optimal method for any given correlation, intraclass correlation, and cluster size.
- The dominance relations identified by the power-equation comparisons hold globally across broad ranges of parameter values.
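The equal-versus-unequal pattern in the bullets above can be illustrated with textbook forms of the two tests. The sketch below is written under stated assumptions rather than taken from the paper: the two standardized test statistics are bivariate normal with unit variances, means `ncp1` and `ncp2`, and correlation `rho`; the weighted 1-DF test uses equal weights; the disjunctive 2-DF test is a Wald-type quadratic form referred to a chi-square distribution with 2 degrees of freedom. Function names are hypothetical, and the paper's own formulas may differ in detail.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_weighted_1df(ncp1: float, ncp2: float, rho: float,
                       w1: float = 0.5, w2: float = 0.5,
                       alpha: float = 0.05) -> float:
    """Two-sided power of the weighted sum w1*Z1 + w2*Z2, standardized."""
    combined = (w1 * ncp1 + w2 * ncp2) / sqrt(w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * rho)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(combined - z_crit) + STD_NORMAL.cdf(-combined - z_crit)


def power_disjunctive_2df(ncp1: float, ncp2: float, rho: float,
                          alpha: float = 0.05, terms: int = 400) -> float:
    """Power of the 2-DF quadratic-form test Z' R^{-1} Z.

    The non-central chi-square tail is evaluated as a Poisson mixture of
    central chi-square tails; with even degrees of freedom each central
    tail is a finite Erlang sum, so only the standard library is needed.
    """
    ncp = (ncp1 ** 2 - 2 * rho * ncp1 * ncp2 + ncp2 ** 2) / (1.0 - rho ** 2)
    half_crit = -log(alpha)        # chi-square(2) critical value is -2*ln(alpha)
    half_ncp = ncp / 2.0
    poisson_weight = exp(-half_ncp)    # P(J = 0), J ~ Poisson(ncp / 2)
    erlang_term = exp(-half_crit)      # e^{-x} * x^0 / 0!
    erlang_tail = erlang_term          # tail of central chi-square(2) at crit
    total = 0.0
    for j in range(terms):
        total += poisson_weight * erlang_tail
        poisson_weight *= half_ncp / (j + 1)
        erlang_term *= half_crit / (j + 1)
        erlang_tail += erlang_term     # tail of central chi-square(2*(j+2))
    return total


if __name__ == "__main__":
    # Hypothetical non-centrality values, chosen only to show the contrast.
    for label, (a, b) in {"equal effects": (2.0, 2.0),
                          "unequal effects": (3.0, 0.0)}.items():
        p1 = power_weighted_1df(a, b, rho=0.3)
        p2 = power_disjunctive_2df(a, b, rho=0.3)
        print(f"{label}: weighted 1-DF = {p1:.3f}, disjunctive 2-DF = {p2:.3f}")
```

Under these assumptions, equal effects give both tests the same squared non-centrality, so the 1-DF test wins by spending one degree of freedom instead of two. When one effect is zero, the equal-weight sum dilutes the signal that the quadratic form retains.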
Where Pith is reading between the lines
- Designers facing similar dual-endpoint problems but with binary or time-to-event outcomes could derive parallel power formulas and repeat the same dominance checks.
- Routine use of the preferred method for a given effect-equality pattern could reduce the total number of clusters needed in resource-constrained implementation studies.
- Pre-specifying the analysis method at the design stage, rather than defaulting to p-value adjustment, would directly translate into smaller, more feasible trials.
- The patterns observed here may inform sample-size planning for other multi-outcome cluster designs outside the hybrid type 2 setting.
Load-bearing premise
The power equations and simulation results assume that the two continuous endpoints follow models that allow closed-form or accurate numerical power calculations for all five methods under the stated correlation and cluster-size conditions.
What would settle it
A set of power calculations or simulations using the same five methods but with endpoint distributions that violate the linear mixed-model assumptions, for example by adding strong skewness or cluster-level outliers, to check whether the reported power ordering reverses.
Original abstract
Hybrid type 2 studies are gaining popularity for their ability to assess both implementation and health outcomes as co-primary endpoints. Often conducted as cluster-randomized trials (CRTs), five design methods can validly power these studies: p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test. We compared these methods theoretically and numerically. Theoretical comparisons of power equations allowed us to identify when one method had more or less power than another globally. We showed that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single 1-DF test, and identified conditions where the disjunctive 2-DF test is less powerful than the single 1-DF test. To further investigate when power advantages shift, we conducted a large-scale numerical study using our novel crt2power R package, which calculates power or sample size for CRTs with two continuous co-primary endpoints using these methods. Across 45,000 input scenarios, we found specific patterns: when treatment effects are unequal, the disjunctive 2-DF test tends to be most powerful; when treatment effects are equal, the single 1-DF test tends to dominate. Together, these comparisons offer practical guidance for powering hybrid type 2 studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares five methods for powering hybrid type 2 cluster-randomized trials (CRTs) with two continuous co-primary endpoints (effectiveness and implementation): p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and conjunctive test. Theoretical comparisons of power equations are used to establish global orderings, including that p-value adjustment methods are always less powerful than the combined outcomes approach and the single weighted 1-DF test. A large numerical study across 45,000 scenarios, implemented via the authors' crt2power R package, identifies that the disjunctive 2-DF test tends to be most powerful when treatment effects are unequal while the single 1-DF test tends to dominate when effects are equal.
Significance. If the central claims hold, the paper supplies practical design guidance for hybrid type 2 CRTs, an increasingly common study type in implementation science. The development and use of the crt2power package for power and sample-size calculations under bivariate CRT linear mixed models is a clear strength that supports reproducibility. The combination of analytic power inequalities with an extensive simulation study provides a useful framework for choosing among valid powering methods under varying effect-size and correlation conditions.
major comments (2)
- [theoretical comparisons] Abstract and theoretical comparisons: the global claim that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test rests on the power equations for the bivariate normal distribution of test statistics under the CRT variance structure (ICC, cluster size, design effect). The manuscript should present the explicit non-centrality parameter and power formula used for the p-value adjustment method (e.g., Bonferroni) and derive or show the inequality with the other methods to confirm that the joint distribution is modeled identically across approaches.
- [numerical study] Numerical study section: the 45,000-scenario results inherit the same power formulas via the crt2power package. To strengthen the claim that patterns (disjunctive 2-DF most powerful when effects unequal; single 1-DF when equal) are robust, the manuscript should state whether the scenario grid (ICC values, cluster sizes, endpoint correlations, effect-size ratios) was fully pre-specified before running the simulations.
minor comments (2)
- [abstract] The abstract refers to the 'crt2power R package' as novel; adding a GitHub or CRAN link and a brief description of its core functions would improve accessibility and reproducibility.
- Minor notation: ensure consistent use of '1-DF' versus 'one-degree-of-freedom' throughout the text and figures for reader clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our theoretical results and the robustness of the simulation design. We respond to each major comment below.
- Referee: Abstract and theoretical comparisons: the global claim that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single weighted 1-DF test rests on the power equations for the bivariate normal distribution of test statistics under the CRT variance structure (ICC, cluster size, design effect). The manuscript should present the explicit non-centrality parameter and power formula used for the p-value adjustment method (e.g., Bonferroni) and derive or show the inequality with the other methods to confirm that the joint distribution is modeled identically across approaches.
  Authors: We agree that greater explicitness will strengthen the manuscript. In the revision we will add a subsection presenting the non-centrality parameters and power formulas for every method, including the Bonferroni adjustment (which applies the adjusted alpha to the marginal non-centrality parameters). We will then derive the strict inequality by direct comparison of the power functions under the identical bivariate normal distribution of the test statistics that incorporates the same CRT variance components (ICC, cluster size, and design effect) for all approaches.
  revision: yes
- Referee: Numerical study section: the 45,000-scenario results inherit the same power formulas via the crt2power package. To strengthen the claim that patterns (disjunctive 2-DF most powerful when effects unequal; single 1-DF when equal) are robust, the manuscript should state whether the scenario grid (ICC values, cluster sizes, endpoint correlations, effect-size ratios) was fully pre-specified before running the simulations.
  Authors: We will add an explicit statement in the Numerical study section that the 45,000-scenario grid was fully defined before any simulations were executed. The ranges for ICC (0.01–0.20), cluster size (10–100), endpoint correlation (−0.5 to 0.9), and effect-size ratio (0.5–2) were chosen in advance to span values typical of hybrid type 2 CRTs in implementation science, ensuring the reported dominance patterns are not the result of post-hoc selection.
  revision: yes
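The first response describes the Bonferroni adjustment as applying the adjusted alpha to the marginal non-centrality parameters. A minimal sketch of that single step follows, assuming two-sided z-tests under a normal approximation and an even split of alpha across endpoints. How the paper combines the resulting marginal powers into one design-level power is not stated in this review, so the sketch deliberately stops at the per-endpoint values. Function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def marginal_power(ncp: float, alpha: float) -> float:
    """Two-sided power of one z-test whose statistic has mean `ncp`."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


def bonferroni_marginal_powers(ncps: list[float], alpha: float = 0.05) -> list[float]:
    """Per-endpoint power when each test is run at alpha / (number of endpoints)."""
    adjusted_alpha = alpha / len(ncps)
    return [marginal_power(ncp, adjusted_alpha) for ncp in ncps]


if __name__ == "__main__":
    # Hypothetical non-centrality parameters for the two endpoints.
    ncps = [2.8, 2.2]
    unadjusted = [marginal_power(ncp, 0.05) for ncp in ncps]
    adjusted = bonferroni_marginal_powers(ncps, alpha=0.05)
    for ncp, before, after in zip(ncps, unadjusted, adjusted):
        print(f"ncp = {ncp}: power at alpha = {before:.3f}, at alpha/2 = {after:.3f}")
```

Because the adjustment only raises the critical value and leaves each non-centrality parameter unchanged, every endpoint's marginal power falls relative to the unadjusted test.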
Circularity Check
No circularity: power comparisons derive from independent bivariate CRT formulas
The paper's central claims rest on direct algebraic comparisons of closed-form power equations for five distinct testing procedures (p-value adjustment, combined outcomes, weighted 1-DF, disjunctive 2-DF, conjunctive) under a shared bivariate normal model that incorporates cluster-level random effects, ICC, and design effects for both endpoints. These equations are stated as standard extensions of linear mixed-model power formulas rather than being defined in terms of one another or fitted from the same simulation outputs. The 45,000-scenario numerical study simply evaluates the same pre-derived formulas via the authors' implementation package; it does not feed results back into the theoretical ordering or redefine any non-centrality parameter from the data being compared. No self-citation is invoked as a load-bearing uniqueness theorem, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation chain is therefore self-contained against external statistical benchmarks for multivariate CRT power.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Continuous endpoints follow linear mixed models permitting closed-form or accurate numerical power calculations for the five tests under the assumed intra-cluster correlation and cluster sizes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Theoretical comparisons of the power equations allowed us to identify when one method had more or less power than another globally. We showed that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single 1-DF test"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We compared these methods theoretically and numerically... using our novel crt2power R package"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.