A note on closed-form solutions for estimating sample size when externally validating a binary prediction model based on $C$-statistic precision

Denis A. Shah; Erick D. De Wolf; Laurence V. Madden; Pierce A. Paul

arxiv: 2605.23664 · v1 · pith:IYWLZKA6new · submitted 2026-05-22 · 📊 stat.ME


Pith reviewed 2026-05-25 03:27 UTC · model grok-4.3


keywords: closed-form solutions · sample size estimation · C-statistic · external validation · prediction models · Newcombe's formula · binary outcomes


The pith

Seven closed-form solutions rearrange Newcombe's formula to calculate sample size for precise C-statistic estimation in model validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives seven distinct closed-form expressions for the minimum sample size needed to estimate the C-statistic with a desired precision in external validation studies of binary prediction models. The expressions come from algebraically rearranging Newcombe's formula for the standard error of the C-statistic, SE(C), using computer algebra systems and AI tools. This matters because the existing iterative calculation can be slow, and external validation is essential for checking that a clinical prediction model performs as expected before use. On the paper's illustrative examples the closed forms return the same sample sizes as the iterative method while executing far faster.

Core claim

Seven novel closed-form solutions to the rearrangement of Newcombe's formula for the standard error of the C-statistic allow the sample size required for precise estimation to be computed directly when planning the external validation of a binary prediction model. These solutions, obtained with different computer algebra systems and artificial intelligence models, are presented as mathematically equivalent to the existing iterative method and produce identical sample size estimates in the illustrative examples. In benchmarking, their median execution times are 148,000 to 264,000 times faster than the iterative implementation.

What carries the argument

Algebraic rearrangement of Newcombe's formula for SE(C) into explicit closed-form expressions for the sample size n.
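
For orientation, the rearrangement can be written out by hand. The derivation below is a sketch that assumes Newcombe's formula in the form quoted by Riley et al. (2021), with $n$ the total sample size, $C$ the anticipated C-statistic and $\phi$ the outcome event proportion. The shorthand $a$, $b$ and $d$ is introduced here for illustration only, and the resulting root is not claimed to coincide with any of the paper's seven expressions.

$$
\mathrm{SE}(C)^2 = \frac{C(1-C)\left[1 + \left(\tfrac{n}{2}-1\right)\dfrac{1-C}{2-C} + \left(\tfrac{n}{2}-1\right)\dfrac{C}{1+C}\right]}{n^2\,\phi(1-\phi)}
$$

Writing $a = C(1-C)$, $b = \dfrac{1-C}{2-C} + \dfrac{C}{1+C}$ and $d = \phi(1-\phi)\,\mathrm{SE}(C)^2$, the formula becomes a quadratic in $n$:

$$
d\,n^2 - \frac{ab}{2}\,n - a(1-b) = 0
\quad\Longrightarrow\quad
n = \frac{\dfrac{ab}{2} + \sqrt{\left(\dfrac{ab}{2}\right)^2 + 4\,d\,a\,(1-b)}}{2d}
$$

Because $b \le 2/3$ for $0 \le C \le 1$, the constant term $-a(1-b)$ is negative whenever $0 < C < 1$, so the quadratic has exactly one positive root. The planning sample size is that root rounded up to the next integer.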

Load-bearing premise

The computer algebra systems and AI models performed the symbolic rearrangements without introducing algebraic errors or simplifications that break equivalence to the original formula.

What would settle it

Apply the closed-form expressions and the iterative method to a varied set of input values for the expected C, the target SE(C), and the outcome prevalence, and check whether the computed sample sizes agree exactly.
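
A check of this kind could look like the sketch below. It assumes Newcombe's formula in the form quoted by Riley et al. (2021); `n_closed_form` is a hand-derived positive quadratic root rather than one of the paper's seven expressions, and `n_iterative` is a brute-force stand-in for the published R and Stata solver. All function names and grid values are hypothetical.

```python
import math


def newcombe_se(n, c, phi):
    """SE(C) for total sample size n, C-statistic c, event proportion phi.

    Assumes Newcombe's formula in the form quoted by Riley et al. (2021).
    """
    a = c * (1.0 - c)
    b = (1.0 - c) / (2.0 - c) + c / (1.0 + c)
    return math.sqrt(a * (1.0 + (n / 2.0 - 1.0) * b) / (n * n * phi * (1.0 - phi)))


def n_closed_form(c, se, phi):
    """Real-valued n solving newcombe_se(n, c, phi) == se.

    Hand-derived positive root of the quadratic in n (illustrative only;
    not one of the paper's seven closed forms).
    """
    a = c * (1.0 - c)
    b = (1.0 - c) / (2.0 - c) + c / (1.0 + c)
    d = phi * (1.0 - phi) * se * se
    half = a * b / 2.0
    return (half + math.sqrt(half * half + 4.0 * d * a * (1.0 - b))) / (2.0 * d)


def n_iterative(c, se, phi, max_n=10_000_000):
    """Stand-in iterative solver: smallest integer n with SE(C) <= se."""
    for n in range(2, max_n + 1):
        if newcombe_se(n, c, phi) <= se:
            return n
    raise RuntimeError("no n found below max_n")


def find_mismatches(c_values, se_values, phi_values):
    """Return the (c, se, phi) triples where the two methods disagree."""
    mismatches = []
    for c in c_values:
        for se in se_values:
            for phi in phi_values:
                if math.ceil(n_closed_form(c, se, phi)) != n_iterative(c, se, phi):
                    mismatches.append((c, se, phi))
    return mismatches


if __name__ == "__main__":
    bad = find_mismatches(
        c_values=[0.6, 0.7, 0.8, 0.9],
        se_values=[0.02551, 0.05],
        phi_values=[0.1, 0.2, 0.3, 0.4, 0.5],
    )
    print("mismatches:", bad)
```

An empty mismatch list on a dense grid would support equivalence numerically; it would not replace a symbolic proof, and any disagreement would point to a transcription error or an extraneous root.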

Figures

Figures reproduced from arXiv: 2605.23664 by Denis A. Shah, Erick D. De Wolf, Laurence V. Madden, Pierce A. Paul.

**Figure 1.** Values of the standard error of the C-statistic {SE(C)} and estimated sample size calculated for C = 0.6 and outcome event proportions (ϕ) of 0.1 to 0.5, for seven different closed-form solutions for the sample size required for the external validation of a binary prediction model. The solid vertical line references SE(C) = 0.02551, corresponding to a confidence interval width of 0.1 for the C-statistic. B…

**Figure 2.** Violin plot of measured execution time (in nanoseconds) of functions implementing closed-form solutions for…
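
The reference value SE(C) = 0.02551 in the Figure 1 caption is consistent with a Wald-type interval. As a worked check, assuming a 95% confidence level with $z_{0.975} \approx 1.96$ (the truncated caption does not state the level, so this is an assumption):

$$
\mathrm{SE}(C) = \frac{\text{CI width}}{2\,z_{0.975}} = \frac{0.1}{2 \times 1.96} \approx 0.02551
$$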

Original abstract

External validation of clinical prediction models is crucial for assessing whether they are fit for use. The $C$-statistic is a widely used measure of discriminative performance of such models predicting a binary outcome. A method for obtaining the minimum sample size required for the precise estimation of the $C$-statistic during validation, based on the rearrangement of Newcombe's formula for the standard error of the $C$-statistic {SE($C$)}, was recently proposed and implemented in R and Stata software via an iterative computational approach. We present seven novel closed-form solutions, derived using different computer algebra systems and artificial intelligence models, to the algebraic rearrangement of Newcombe's formula. We present these distinct forms to demonstrate how different computational tools yield structurally distinct but mathematically equivalent solutions, and to evaluate their practical differences in computational performance. Our closed-form solutions yield identical sample size estimates to the iterative method when applied to illustrative examples. In a benchmarking analysis, the closed-form solutions were on average 148,000 to 264,000 times faster in median execution time than the current iterative implementation, while also exhibiting minor efficiency differences among themselves. This work provides a validated, highly efficient computational tool applicable to sample size calculation for external validation studies. R code functions implementing the closed-form solutions are provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript derives seven closed-form algebraic solutions for the minimum sample size n required to achieve a prespecified standard error of the C-statistic in external validation of a binary prediction model. These expressions are obtained by rearranging Newcombe's formula for SE(C) using multiple computer algebra systems and AI models; the authors assert that the closed forms are mathematically equivalent to the existing iterative solver, produce identical numerical results on illustrative examples, and deliver median speedups of 148,000 to 264,000 times, with accompanying R code.

Significance. If the algebraic equivalence and numerical stability hold over the full relevant domain, the work supplies a practical, reproducible computational improvement for sample-size planning in clinical prediction-model validation studies. The provision of R implementations is a clear strength that supports immediate usability.

major comments (2)

[Abstract / results on closed-form solutions] Abstract and results on equivalence: the central claim that each of the seven expressions is an exact algebraic rearrangement of Newcombe's formula (and therefore yields identical n for any valid input) rests solely on agreement with the iterative solver on a small set of illustrative examples. No systematic numerical verification is reported that sweeps the domain of C (0.5–1), target SE(C), and prevalence, leaving open the possibility of transcription errors, extraneous roots, or regions of numerical instability.
[Benchmarking analysis] Benchmarking section: the reported speedups are quantified only in median execution time; without accompanying information on the range of parameter values tested or on cases near the boundaries (e.g., C approaching 0.5 or very small target SE), it is unclear whether the performance advantage persists uniformly or whether any closed form becomes undefined or slow in edge cases.

minor comments (2)

The manuscript would benefit from an explicit statement of the domain restrictions (e.g., prevalence > 0, C > 0.5) under which each closed form is defined.
Intermediate CAS output or simplification steps for at least one of the seven expressions would strengthen the reproducibility claim.
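
To make the domain question concrete, the sketch below checks the conditions under which a hand-derived quadratic rearrangement of Newcombe's formula has a single positive root. The bounds in `validate_inputs` are illustrative assumptions drawn from the structure of the formula, not restrictions stated in the paper, and all function names are hypothetical.

```python
def bracket_term(c):
    """b(C) = (1 - C)/(2 - C) + C/(1 + C), the n-dependent weight in the formula."""
    return (1.0 - c) / (2.0 - c) + c / (1.0 + c)


def validate_inputs(c, se, phi):
    """Raise ValueError outside the assumed domain (illustrative bounds only)."""
    if not (0.5 <= c < 1.0):
        raise ValueError("c must lie in [0.5, 1)")
    if not (0.0 < phi < 1.0):
        raise ValueError("phi must lie in (0, 1)")
    if not (se > 0.0):
        raise ValueError("se must be positive")


def discriminant(c, se, phi):
    """Discriminant of d*n^2 - (a*b/2)*n - a*(1 - b) = 0, with d = phi*(1 - phi)*se^2."""
    validate_inputs(c, se, phi)
    a = c * (1.0 - c)
    b = bracket_term(c)
    d = phi * (1.0 - phi) * se * se
    return (a * b / 2.0) ** 2 + 4.0 * d * a * (1.0 - b)


if __name__ == "__main__":
    # Scan C over [0.5, 1): b(C) should never exceed 2/3, so 1 - b stays positive.
    cs = [0.5 + 0.5 * i / 1000 for i in range(1000)]
    print("max b(C):", max(bracket_term(c) for c in cs))
    print("min discriminant:", min(discriminant(c, 0.02551, 0.1) for c in cs))
```

Under these assumptions the discriminant is strictly positive, so this particular rearrangement cannot yield complex or multiple positive roots. Whether the same holds for each of the seven published expressions depends on their algebraic form.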

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to strengthen the empirical support for our claims. We address each major point below and will incorporate the suggested additions in a revised version.

Point-by-point responses

Referee: [Abstract / results on closed-form solutions] Abstract and results on equivalence: the central claim that each of the seven expressions is an exact algebraic rearrangement of Newcombe's formula (and therefore yields identical n for any valid input) rests solely on agreement with the iterative solver on a small set of illustrative examples. No systematic numerical verification is reported that sweeps the domain of C (0.5–1), target SE(C), and prevalence, leaving open the possibility of transcription errors, extraneous roots, or regions of numerical instability.

Authors: We agree that reliance on illustrative examples alone leaves room for undetected issues such as extraneous roots or domain-specific instability. Although the seven expressions were obtained via symbolic rearrangement in multiple computer algebra systems (which guarantees algebraic equivalence when the derivations are correct), we will add a systematic numerical verification in the revised manuscript. This will consist of a grid evaluation over C ∈ [0.5, 1], a range of target SE(C) values, and prevalence levels, confirming that all closed forms return identical n to the iterative solver (within floating-point tolerance) and remain defined and stable throughout the domain. revision: yes
Referee: [Benchmarking analysis] Benchmarking section: the reported speedups are quantified only in median execution time; without accompanying information on the range of parameter values tested or on cases near the boundaries (e.g., C approaching 0.5 or very small target SE), it is unclear whether the performance advantage persists uniformly or whether any closed form becomes undefined or slow in edge cases.

Authors: We concur that median-only reporting is insufficient to establish uniform performance. In the revision we will expand the benchmarking section to report the full range of execution times, explicitly document the parameter grid used (including the boundary regions C near 0.5 and small target SE(C)), and confirm that all seven closed forms remain defined and retain their speed advantage in those edge cases. Any isolated numerical exceptions will be noted. revision: yes
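
A benchmark that reports the spread as well as the median, and that includes boundary cases, might be structured as in the sketch below. It is not the paper's benchmark: the closed form is a hand-derived quadratic root, the iterative solver is a brute-force stand-in for the published R and Stata implementation, and the parameter cases are hypothetical, so the ratios it prints are not comparable to the reported 148,000 to 264,000.

```python
import math
import statistics
import timeit


def newcombe_se(n, c, phi):
    """SE(C), assuming Newcombe's formula as quoted by Riley et al. (2021)."""
    a = c * (1.0 - c)
    b = (1.0 - c) / (2.0 - c) + c / (1.0 + c)
    return math.sqrt(a * (1.0 + (n / 2.0 - 1.0) * b) / (n * n * phi * (1.0 - phi)))


def n_closed_form(c, se, phi):
    """Hand-derived positive quadratic root, rounded up (illustrative only)."""
    a = c * (1.0 - c)
    b = (1.0 - c) / (2.0 - c) + c / (1.0 + c)
    d = phi * (1.0 - phi) * se * se
    half = a * b / 2.0
    return math.ceil((half + math.sqrt(half * half + 4.0 * d * a * (1.0 - b))) / (2.0 * d))


def n_iterative(c, se, phi):
    """Brute-force stand-in: smallest integer n with SE(C) <= se."""
    n = 2
    while newcombe_se(n, c, phi) > se:
        n += 1
    return n


def benchmark(cases, number=5, repeat=5):
    """Return per-case (min, median, max) seconds per call for both solvers."""
    rows = []
    for c, se, phi in cases:
        row = {"case": (c, se, phi)}
        for name, fn in (("closed", n_closed_form), ("iterative", n_iterative)):
            raw = timeit.repeat(lambda: fn(c, se, phi), number=number, repeat=repeat)
            per_call = [t / number for t in raw]
            row[name] = (min(per_call), statistics.median(per_call), max(per_call))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical cases: a central one, C near 0.5 with a small SE, and C near 1.
    cases = [(0.6, 0.02551, 0.1), (0.51, 0.01, 0.1), (0.99, 0.01, 0.5)]
    for row in benchmark(cases):
        ratio = row["iterative"][1] / row["closed"][1]
        print(row["case"], "median speedup ~", round(ratio))
```

Reporting the minimum and maximum alongside the median makes it visible whether the advantage holds near the boundaries, which is the point raised by the referee.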

Circularity Check

0 steps flagged

No circularity: the closed forms are algebraic rearrangements of the externally published Newcombe formula.

Full rationale

The paper's derivation consists of symbolic rearrangement of Newcombe's SE(C) formula (an external reference) into closed-form expressions for n, performed via independent CAS and AI tools. The resulting expressions are asserted to be algebraically equivalent and are checked for numerical agreement on illustrative examples; no parameters are fitted to data, no self-citations form the load-bearing step, and no ansatz or uniqueness claim is imported from the authors' prior work. The central result is therefore self-contained against the external benchmark formula and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that Newcombe's SE(C) formula is the appropriate starting point and on standard algebraic manipulation; no free parameters are introduced and no new entities are postulated.

axioms (1)

Domain assumption: Newcombe's formula for the standard error of the C-statistic is valid and appropriate for determining sample size in external validation.
The entire derivation begins from this existing formula without re-deriving or testing its assumptions.



Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] G.S. Collins, K.G.M. Moons, P. Dhiman, R.D. Riley, A.L. Beam, B. Van Calster, M. Ghassemi, X. Liu, J.B. Reitsma, M. van Smeden, A.-L. Boulesteix, J.C. Camaradou, L.A. Celi, S. Denaxas, A.K. Denniston, B. Glocker, R.M. Golub, H. Harvey, G. Heinze, M.M. Hoffman, A.P. Kengne, E. Lam, N. Lee, E.W. Loder, L. Maier-Hein, B.A. Mateen, M.D. McCradden, L. Oakden-R... (2024)

[2] O. Efthimiou, M. Seo, K. Chalkou, T. Debray, M. Egger, and G. Salanti. Developing clinical prediction models: a step-by-step guide. BMJ, 386:e078276, 2024

[3] M.E. Shipe, S.A. Deppen, F. Farjah, and E.L. Grogan. Developing prediction models for clinical use using logistic regression: an overview. J Thorac Dis, 11(Suppl 4):S574–S584, March 2019

[4] G.S. Collins, P. Dhiman, J. Ma, M.M. Schlussel, L. Archer, B. Van Calster, F.E. Harrell, G.P. Martin, K.G.M. Moons, M. van Smeden, M. Sperrin, G.S. Bullock, and R.D. Riley. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ, 384:e074819, January 2024

[5] S. A. Tiruneh, T. T. T. Vu, L. J. Moran, E. J. Callander, J. Allotey, S. Thangaratinam, D. L. Rolnik, H. J. Teede, R. Wang, and J. Enticott. Externally validated prediction models for pre-eclampsia: systematic review and meta-analysis. Ultrasound Obstet Gynecol, 63(5):592–604, May 2024

[6] R.D. Riley, T.P.A. Debray, G.S. Collins, L. Archer, J. Ensor, M. van Smeden, and K.I.E. Snell. Minimum sample size for external validation of a clinical prediction model with a binary outcome. Stat Med, 40(19):4230–4251, August 2021

[7] R.D. Riley, K.I.E. Snell, L. Archer, J. Ensor, T.P.A. Debray, B. Van Calster, M. van Smeden, and G.S. Collins. Evaluation of clinical prediction models (part 3): calculating the sample size required for an external validation study. BMJ, 384:e074821, January 2024

[8] R.G. Newcombe. Confidence intervals for an effect size measure based on the Mann–Whitney statistic. Part 2: asymptotic methods and evaluation. Stat Med, 25(4):559–573, February 2006

[9] D.A. Shah and E. DeWolf. Rapid response to: Evaluation of clinical prediction models (part 3): calculating the sample size required for an external validation study. BMJ, August 2024

[10] C. Spreitzer, O. Straser, S. Zehetmeier, and K. Maaß. Mathematical modelling abilities of artificial intelligence tools: the case of ChatGPT. Educ Sci, 14(7):698, 2024

[11] J.C. Valverde. Wolfram|Alpha & ChatGPT in the design and analysis of mathematical models. In INTED2025 Proceedings, pages 7003–7010, 2025. 19th International Technology, Education and Development Conference, Valencia, Spain

[12] T. Tao. Machine assisted proof. Not Am Math Soc, 71(1), January 2025

[13] D.S. Yin and X. Yin. Scaffolding learning: From specific to generic with large language models. PLOS ONE, 19(9):e0310409, September 2024
