ICCDesign: An R Package for the Design and Analysis of ICC-Based Reliability Studies with Continuous Responses

Chenge Gao; Ruilin Ma; Yundan Zhang; Ziyu Liu

arxiv: 2606.02059 · v1 · pith:IFSDPFKOnew · submitted 2026-06-01 · 📊 stat.ME

Ziyu Liu, Ruilin Ma, Yundan Zhang, Chenge Gao

Pith reviewed 2026-06-28 13:20 UTC · model grok-4.3

classification 📊 stat.ME

keywords intraclass correlation · reliability · R package · sample size · confidence interval · Shiny app · McGraw and Wong

The pith

The ICCDesign R package provides an integrated workflow for estimating, planning, and evaluating intraclass correlations in reliability studies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the ICCDesign package to address the fragmentation of R tools for intraclass correlation coefficient (ICC) analysis in reliability research. It combines point estimation with confidence intervals, sample size planning, reliability evaluation, and a Shiny app in a single package, together with a decision framework for selecting the right ICC form. A sympathetic reader would care because a single workflow removes the need to switch between tools, and with it one source of analytical error when applying ICC methods in medical, psychological, and behavioral studies.

Core claim

ICCDesign integrates four core functionalities for ICC-based reliability studies with continuous responses: point estimation and ANOVA-based confidence intervals for supported ICC forms following the McGraw and Wong framework with a four-step decision guide, sample size planning based on Zou's closed-form formulas, automated reliability evaluation using Koo and Li criteria, and an interactive Shiny web application.

What carries the argument

The ICCDesign package and its built-in four-step decision framework that guides selection of the appropriate ICC form under the McGraw and Wong framework.

Load-bearing premise

The built-in four-step decision framework correctly maps user study designs to the appropriate ICC form under the McGraw and Wong framework, and the package implementations match the cited methods without coding errors.

What would settle it

Compare the package output for ICC point estimate and confidence interval on a standard dataset to results obtained from direct implementation of the McGraw and Wong ANOVA formulas or other established packages.
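One way to set up that comparison is an independent implementation of the single-rating formulas. The sketch below is written in plain Python rather than R so that it shares no code with the package. It computes the two-way ANOVA mean squares and three single-rating coefficients (one-way ICC(1), absolute-agreement ICC(A,1), and consistency ICC(C,1)) for the six-subject, four-judge example in Shrout and Fleiss (1979). The function and variable names are illustrative assumptions and are not part of ICCDesign; confidence intervals are left out.

```python
# Independent check of single-rating ICC point estimates from ANOVA mean squares.
# Data: 6 subjects (rows) x 4 judges (columns), example from Shrout & Fleiss (1979).
# All names here are illustrative; none of this is ICCDesign code.
RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def mean_squares(x):
    """Two-way ANOVA mean squares for a complete subjects-by-raters matrix."""
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_error = ss_total - ss_rows - ss_cols                 # residual
    return {
        "n": n,
        "k": k,
        "MSR": ss_rows / (n - 1),
        "MSC": ss_cols / (k - 1),
        "MSE": ss_error / ((n - 1) * (k - 1)),
        "MSW": (ss_total - ss_rows) / (n * (k - 1)),  # within subjects (one-way)
    }


def single_rating_iccs(x):
    """Single-rating ICC point estimates from the mean squares."""
    ms = mean_squares(x)
    n, k = ms["n"], ms["k"]
    msr, msc, mse, msw = ms["MSR"], ms["MSC"], ms["MSE"], ms["MSW"]
    return {
        "ICC(1)": (msr - msw) / (msr + (k - 1) * msw),
        "ICC(A,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(C,1)": (msr - mse) / (msr + (k - 1) * mse),
    }


ICCS = single_rating_iccs(RATINGS)
for name, value in ICCS.items():
    print(f"{name}: {value:.3f}")
```

The printed values (0.166, 0.290, 0.715) agree with the .17, .29, and .71 that Shrout and Fleiss report for this table, so the same matrix gives a fixed target against which the package's point estimates can be compared.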

Original abstract

The intraclass correlation coefficient (ICC) is among the most widely used statistics in reliability research, playing a central role in medical measurement, psychological assessment, and behavioral science. However, practical application of ICC faces two major obstacles. First, ICC can be organized into multiple forms under the McGraw and Wong (1996) framework -- including six widely reported standard forms and four additional design combinations -- and researchers must select the appropriate form based on their study design, yet existing guidelines are not always operationalized in software interfaces. Second, available R tools are highly fragmented: sample size calculation, ICC estimation with confidence intervals, and reliability evaluation are distributed across separate packages, compelling researchers to switch between tools and increasing the risk of analytical errors. This paper introduces the ICCDesign package, designed specifically to provide an integrated workflow for ICC-based reliability studies with continuous responses, assuming one continuous rating per subject-rater cell. The package integrates four core functionalities: (1) point estimation, ANOVA-based confidence intervals, and implemented hypothesis tests for supported ICC design combinations following the McGraw and Wong (1996) framework, with a built-in four-step decision framework guiding users toward an appropriate ICC form; (2) sample size planning based on Zou's (2012) closed-form formulas, supporting two planning modes and an inverse assurance calculation; (3) automated reliability evaluation based on Koo and Li (2016) criteria, with an uncertainty notification when the confidence interval spans the 0.75 good-reliability threshold; and (4) an interactive Shiny web application covering the main analysis and planning functionalities. ICCDesign is available from GitHub at https://github.com/KlariZhang/ICCDesign.
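The third functionality in the abstract, labelling reliability and flagging a confidence interval that spans 0.75, can be illustrated with a few lines of code. The sketch below uses the bands from Koo and Li (2016): below 0.5 poor, 0.5 to 0.75 moderate, 0.75 to 0.9 good, above 0.9 excellent. How the exact boundary values are assigned, and every function and field name, are assumptions made for illustration; this is not the package's own implementation or interface.

```python
# Illustrative reliability labelling with Koo & Li (2016) bands.
# Boundary handling (e.g. exactly 0.75 counts as "good") is an assumption,
# and the names below are hypothetical, not ICCDesign functions.
GOOD_THRESHOLD = 0.75


def koo_li_label(value):
    """Map an ICC value to a qualitative reliability band."""
    if value < 0.50:
        return "poor"
    if value < GOOD_THRESHOLD:
        return "moderate"
    if value <= 0.90:
        return "good"
    return "excellent"


def evaluate_reliability(estimate, ci_lower, ci_upper):
    """Label the estimate and both CI bounds; flag a CI that spans 0.75."""
    if not ci_lower <= estimate <= ci_upper:
        raise ValueError("expected ci_lower <= estimate <= ci_upper")
    return {
        "label": koo_li_label(estimate),
        "ci_labels": (koo_li_label(ci_lower), koo_li_label(ci_upper)),
        "uncertain": ci_lower < GOOD_THRESHOLD <= ci_upper,
    }


result = evaluate_reliability(0.80, 0.62, 0.91)
print(result)
```

In this example the point estimate falls in the good band, but the interval runs from moderate to excellent and crosses 0.75, which is the situation the abstract's uncertainty notification is meant to catch.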

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ICCDesign bundles existing ICC methods into one R package with a Shiny interface, but supplies zero validation that the code or decision logic is correct.

The paper's contribution is an R package that pulls together point estimation and ANOVA CIs for McGraw-Wong ICC forms, Zou's sample-size formulas, Koo-Li reliability thresholds, and a four-step design chooser, all wrapped in a Shiny app. The integration itself is the only new element; the statistical procedures are taken directly from the three cited references.

That convenience is real for applied users who otherwise bounce between irr, psych, and custom scripts. A single interface with automated threshold warnings and inverse assurance calculations could cut down on workflow mistakes in reliability studies.

The problem is the complete absence of any check on whether the implementations match the sources. The manuscript describes the workflow and cites the papers but shows no numerical examples, no side-by-side output against the originals, no test cases, and no comparison with other validated packages. Without that, an error in the four-step mapping or a transcription slip in the closed-form expressions would go undetected and affect every downstream result.

This is for researchers in psychology or medical measurement who run ICC analyses regularly and want one tool rather than several. Readers already comfortable with the cited methods will not find new statistical insight. I would not bring it to a reading group unless the topic is specifically software for reliability work, and I would not cite the package itself.

It could go to peer review if the authors add explicit verification against known results and the original references, but the current version is too thin on evidence of correctness to justify referee time.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the ICCDesign R package for ICC-based reliability studies with continuous responses. It claims to integrate (1) point estimation, ANOVA-based CIs, and hypothesis tests for McGraw & Wong (1996) ICC forms via a built-in four-step decision framework, (2) sample-size planning using Zou (2012) closed-form formulas in two modes plus inverse assurance, (3) automated reliability evaluation per Koo & Li (2016) criteria with uncertainty notification, and (4) a Shiny web application, addressing fragmentation across existing R tools.

Significance. If the implementations prove correct, the package would usefully consolidate sample-size planning, estimation, and evaluation into one workflow with usability aids, reducing switching errors for researchers in medical measurement and behavioral sciences. The decision framework and Shiny component add practical value. However, the complete absence of any validation, test cases, or numerical checks against the cited sources substantially lowers the assessed significance, as the contribution rests entirely on the unverified claim of faithful integration.

major comments (2)

[Abstract] Abstract and overall manuscript: the central claim that the package correctly implements the McGraw & Wong (1996) forms via a four-step decision framework is unsupported, because the manuscript provides neither the decision logic, pseudocode, nor any worked examples showing how user designs are mapped to specific ICC forms.
[Abstract] Abstract and overall manuscript: no section supplies validation, test cases, side-by-side numerical comparisons against Zou (2012) formulas, McGraw & Wong (1996) CIs, Koo & Li (2016) thresholds, or outputs from other packages; this directly undermines the claim that the integrated functionalities are correctly implemented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive criticism. The comments correctly identify that the manuscript lacks explicit documentation of the decision framework and any form of validation or numerical checks. We address each point below and will revise the manuscript to incorporate the requested details.

Referee: [Abstract] Abstract and overall manuscript: the central claim that the package correctly implements the McGraw & Wong (1996) forms via a four-step decision framework is unsupported, because the manuscript provides neither the decision logic, pseudocode, nor any worked examples showing how user designs are mapped to specific ICC forms.

Authors: We agree that the four-step decision framework is described only at a high level in the current manuscript. In the revised version we will add (i) the explicit decision logic in both text and pseudocode, (ii) a table mapping common study-design features (number of raters, fixed vs. random, etc.) to the six standard McGraw & Wong forms plus the four additional combinations, and (iii) two fully worked examples that trace a user-specified design through the four steps to the resulting ICC form, ANOVA model, and confidence-interval formula. revision: yes

Referee: [Abstract] Abstract and overall manuscript: no section supplies validation, test cases, side-by-side numerical comparisons against Zou (2012) formulas, McGraw & Wong (1996) CIs, Koo & Li (2016) thresholds, or outputs from other packages; this directly undermines the claim that the integrated functionalities are correctly implemented.

Authors: We acknowledge that the manuscript currently contains no validation material. The revised manuscript will include a new “Validation” section containing: (a) unit-test results for the Zou (2012) sample-size formulas against the original closed-form expressions, (b) side-by-side numerical comparisons of ICC point estimates and ANOVA-based CIs with the irr and psych packages for the same data sets, (c) verification that Koo & Li (2016) reliability labels are assigned correctly, including the uncertainty notification when a CI straddles 0.75, and (d) a small set of reproducible R code snippets that readers can run to reproduce the comparisons. revision: yes
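A reproducible check of the kind promised under (a) and (d) could begin from an independent reference calculation. The sketch below is one such calculation for the simplest case, a one-way design with k ratings per subject. It uses the standard large-sample variance approximation for the one-way ICC estimator, Var ≈ 2(1 − ρ)²[1 + (k − 1)ρ]² / [k(k − 1)(N − 1)], and solves for the number of subjects N that gives a Wald-type confidence interval of a chosen total width. It is written from that approximation and is an assumption throughout: it is not transcribed from Zou (2012), it ignores the assurance probability, and the function name is hypothetical rather than part of ICCDesign.

```python
# Reference sample-size calculation for a one-way ICC, Wald-type CI width.
# Assumption: large-sample variance approximation for the one-way estimator.
# Not taken from Zou (2012) or from ICCDesign; names are hypothetical.
import math
from statistics import NormalDist


def subjects_for_ci_width(rho, k, width, conf_level=0.95):
    """Smallest N so that the expected Wald CI for rho has total width <= width."""
    if not 0 <= rho < 1:
        raise ValueError("rho must lie in [0, 1)")
    if k < 2 or width <= 0:
        raise ValueError("need k >= 2 ratings per subject and a positive width")
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Variance of the estimator multiplied by (N - 1).
    var_factor = 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1))
    # width = 2 * z * sqrt(var_factor / (N - 1)), solved for N and rounded up.
    return math.ceil(1 + 4 * z ** 2 * var_factor / width ** 2)


n_three_raters = subjects_for_ci_width(rho=0.8, k=3, width=0.2)
n_two_raters = subjects_for_ci_width(rho=0.8, k=2, width=0.2)
print(n_three_raters, n_two_raters)
```

With a planning value of 0.8 and a target width of 0.2, the calculation asks for 36 subjects with three ratings each and 51 with two. A unit test in the revised manuscript would compare numbers like these with the package's planning functions, after confirming which of Zou's formulas each planning mode is meant to reproduce.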

Circularity Check

0 steps flagged

No circularity: software wrapper around externally published methods

The paper introduces an R package that integrates four functionalities by wrapping previously published methods: McGraw and Wong (1996) ICC forms with a four-step decision framework, Zou (2012) sample-size formulas, and Koo and Li (2016) reliability criteria. No new derivations, predictions, fitted parameters, or first-principles results appear in the manuscript. The central claim is the provision of an integrated workflow and Shiny app; all load-bearing statistical content is imported from external citations whose validity is independent of the present work. No self-citation chains, ansatzes, or renamings reduce any claim to its own inputs by construction. This is the expected outcome for a software-description paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical derivations, free parameters, or postulated entities. It packages previously published methods whose assumptions (standard ANOVA models for ICC, closed-form sample-size formulas, fixed reliability thresholds) are inherited from the cited references.

pith-pipeline@v0.9.1-grok · 5849 in / 1175 out tokens · 32911 ms · 2026-06-28T13:20:43.835834+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 4 canonical work pages

[1] Brueckl, M. (2022). irrNA: Coefficients of Interrater Reliability – Generalized for Randomly Incomplete Datasets. R package version 0.2.2. https://CRAN.R-project.org/package=irrNA

[2] Gamer, M., Lemon, J., & Singh, I. F. P. (2019). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84.1. https://CRAN.R-project.org/package=irr

[3] Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012

[4] Liu, Z., Ma, R., Gao, C., & Zhang, Y. (2026). ICCDesign: An R Package for ICC-Based Reliability Studies. Version 0.1.0. https://github.com/KlariZhang/ICCDesign

[5] McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46. https://doi.org/10.1037/1082-989X.1.1.30

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

[6] Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.4.3. https://CRAN.R-project.org/package=psych

[7] Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420

[8] Wickham, H., Hester, J., Chang, W., & Bryan, J. (2022). devtools: Tools to Make Developing R Packages Easier. R package version 2.4.5. https://CRAN.R-project.org/package=devtools

[9] Wolak, M. E., Fairbairn, D. J., & Paulsen, Y. R. (2012). ICC.Sample.Size: Calculation of Sample Size and Power for ICC. R package version 1.0. https://CRAN.R-project.org/package=ICC.Sample.Size

[10] Zou, G. Y. (2012). Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Statistics in Medicine, 31(29), 3972–3981. https://doi.org/10.1002/sim.5466
