
arxiv: 2503.08389 · v2 · submitted 2025-03-11 · 📊 stat.ME

Clustered Flexible Calibration Plots For Binary Outcomes Using Random Effects Modeling

Pith reviewed 2026-05-23 00:25 UTC · model grok-4.3

classification 📊 stat.ME
keywords calibration plots · random effects · clustered data · prediction models · binary outcomes · meta-analysis · mixed models · flexible curves

The pith

Random effects modeling produces flexible calibration plots with confidence and prediction intervals that account for clustering across centers or datasets in binary outcome prediction models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When validating prediction models on data from multiple centers, calibration between predicted risks and observed outcomes can vary by cluster, and standard plots may miss this heterogeneity. The paper presents three random-effects methods—clustered group calibration, two-stage meta-analysis calibration, and mixed model calibration—to generate flexible calibration plots while incorporating clustering. These approaches also supply confidence intervals for the overall curve and prediction intervals that reflect between-cluster variation. Simulations and a case study on ovarian tumor malignancy risk show that two-stage meta-analysis calibration with splines recovers the overall curve well and that mixed model calibration recovers cluster-specific curves well, particularly with limited data per cluster. A reader would care because ignoring clustering can produce misleading summaries of model performance in multi-center validation.

Core claim

The paper establishes that clustered group calibration, two-stage meta-analysis calibration, and mixed model calibration can each produce flexible calibration plots using random effects modeling, with confidence and prediction intervals. Simulations indicate that two-stage meta-analysis calibration with splines estimates the overall curve and 95% prediction interval closest to the truth, while mixed model calibration produces cluster-specific curves closest to the truth. The authors therefore recommend these two approaches, especially when sample size per cluster is limited.

What carries the argument

The three approaches (clustered group calibration, two-stage meta-analysis calibration, and mixed model calibration) that use random effects to model heterogeneity across clusters while fitting flexible calibration curves.
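The pooling step of a two-stage approach can be sketched at a single point on the risk axis. This is a minimal illustration, assuming cluster-specific estimates on the logit scale and their sampling variances are already available. The DerSimonian-Laird estimator and the normal critical value are simplifications chosen to keep the example dependency-free; the paper's own estimator and interval construction may differ, and `pool_random_effects` is a hypothetical name. Prediction intervals in random-effects meta-analysis are commonly built with a t quantile on k-2 degrees of freedom, which gives wider intervals than the normal quantile used here.

```python
import math
from statistics import NormalDist


def pool_random_effects(estimates, variances, level=0.95):
    """Pool cluster-specific estimates with a DerSimonian-Laird random-effects model.

    estimates: per-cluster estimates (for example, logit observed proportions)
    variances: per-cluster sampling variances of those estimates
    Returns the pooled mean, between-cluster variance, CI and PI.
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two clusters and matching lengths")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw

    # Cochran's Q and the method-of-moments between-cluster variance.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights, pooled mean and its standard error.
    w_star = [1.0 / (v + tau2) for v in variances]
    sws = sum(w_star)
    mu = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sws
    se = math.sqrt(1.0 / sws)

    # Normal critical value (an assumption made for simplicity).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_ci = z * se
    half_pi = z * math.sqrt(tau2 + se ** 2)
    return {
        "mu": mu,
        "tau2": tau2,
        "ci": (mu - half_ci, mu + half_ci),
        "pi": (mu - half_pi, mu + half_pi),
    }


def expit(x):
    """Inverse logit, to map pooled values back to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))
```

Repeating the pooling over a grid of predicted risks and back-transforming with `expit` would trace out an overall curve with pointwise confidence and prediction bands. The prediction band is never narrower than the confidence band because it adds the between-cluster variance.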

If this is right

  • Calibration assessment in external validation of prediction models can incorporate between-center variation instead of assuming a single overall curve.
  • Prediction intervals can be reported that reflect uncertainty due to both sampling and clustering.
  • Cluster-specific calibration curves become available even when individual centers have small sample sizes.
  • Ready-to-use code allows direct application of the recommended methods to new datasets.
  • Heterogeneity in calibration can be visualized and quantified across centers or datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The methods could be applied to settings with a moderate number of clusters to decide whether a single overall model suffices or center-specific adjustments are needed.
  • Comparison of these random-effects plots against standard non-clustered calibration plots on the same data would quantify how often clustering changes conclusions about model performance.
  • The approaches might support model updating by identifying which clusters show systematic over- or under-prediction.
  • Extension to non-binary outcomes would require adapting the random-effects structure but could follow the same two-stage or mixed-model logic.

Load-bearing premise

The random-effects distributional assumptions, typically normality of cluster-specific intercepts and slopes, adequately capture the heterogeneity in calibration across centers.
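One way to write this premise down is the linear-in-logit special case of a mixed calibration model, shown below. It is a simplified sketch: the symbols are introduced here for illustration, and the Figure 3 caption indicates that the paper's MIX-C replaces the linear term with restricted cubic splines.

```latex
\operatorname{logit}\Pr(Y_{ij} = 1)
  = (\alpha + a_j) + (\beta + b_j)\,\operatorname{logit}(\hat{p}_{ij}),
\qquad
\begin{pmatrix} a_j \\ b_j \end{pmatrix}
  \sim \mathcal{N}\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
    \begin{pmatrix} \sigma_a^2 & \sigma_{ab} \\ \sigma_{ab} & \sigma_b^2 \end{pmatrix}
  \right)
```

Here $Y_{ij}$ is the outcome of patient $i$ in cluster $j$, $\hat{p}_{ij}$ is the predicted risk, $\alpha$ and $\beta$ are the overall calibration intercept and slope, and $a_j$, $b_j$ are the cluster-specific deviations whose joint normality is the assumption in question.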

What would settle it

A new simulation or real multi-center dataset with known non-normal cluster heterogeneity or extreme between-cluster differences where the recommended methods fail to recover the true overall or cluster-specific calibration curves.
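A data generator for such a check could look like the sketch below. It is an illustration under stated assumptions, not the paper's simulation design: the function names, the default standard deviations, the distribution of predicted risks, and the particular mixture are all hypothetical choices. Each non-normal option is rescaled to have the same variance as the normal one, so only the shape of the heterogeneity changes.

```python
import math
import random


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def draw_effect(rng, dist, sd):
    """Draw one zero-mean random effect with standard deviation `sd`."""
    if dist == "normal":
        return rng.gauss(0.0, sd)
    if dist == "t3":
        # Student t with 3 df: Z / sqrt(chi2_3 / 3). Var(t_3) = 3, so rescale.
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(1.5, 2.0)  # chi-square with 3 df
        return sd * (z / math.sqrt(chi2 / 3.0)) / math.sqrt(3.0)
    if dist == "mixture":
        # Equal-weight two-component normal mixture centred at +/- 0.8*sd with
        # within-component SD 0.6*sd, so the total variance is sd**2.
        centre = 0.8 * sd if rng.random() < 0.5 else -0.8 * sd
        return rng.gauss(centre, 0.6 * sd)
    raise ValueError(f"unknown distribution: {dist!r}")


def simulate_clusters(n_clusters, n_per_cluster, dist="normal",
                      sd_intercept=0.5, sd_slope=0.2, seed=1):
    """Return a list of (cluster, predicted_risk, outcome) records."""
    rng = random.Random(seed)
    records = []
    for j in range(n_clusters):
        a_j = draw_effect(rng, dist, sd_intercept)    # cluster calibration intercept
        b_j = 1.0 + draw_effect(rng, dist, sd_slope)  # cluster calibration slope
        for _ in range(n_per_cluster):
            lp = rng.gauss(-1.0, 1.0)                 # logit of the predicted risk
            true_risk = expit(a_j + b_j * lp)         # miscalibrated truth
            y = 1 if rng.random() < true_risk else 0
            records.append((j, expit(lp), y))
    return records


if __name__ == "__main__":
    for dist in ("normal", "t3", "mixture"):
        data = simulate_clusters(20, 100, dist=dist, seed=7)
        prevalence = sum(y for _, _, y in data) / len(data)
        print(dist, round(prevalence, 3))
```

Because the true cluster-specific curve is known by construction (`expit(a_j + b_j * lp)`), any fitted method can be scored against it under each heterogeneity shape.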

Figures

Figures reproduced from arXiv: 2503.08389 by Bavo D.C. Campo, Ben Van Calster, Lasai Barreñada, Laure Wynants.

Figure 7. MIX-C was the best performing method in all scenarios when the truth was based on a logistic regression, with splines working equally well in 5 scenarios. When the truth was based on a random …
Figure 1. Prevalence and mean predicted ADNEX risk by center across the 14 centers in the dataset.
Figure 2. Traditional flexible calibration curves for the ADNEX model in the motivating example. Observed proportion is estimated with a logistic model with restricted cubic splines to model nonlinear effects and estimated risks are grouped in 10 groups. Confidence intervals are shown for 1000 bootstraps with a shaded area for splines and a + for grouped calibration. Dashed diagonal line indicates perfect calibratio…
Figure 3. Comparison of the standard logistic regression with splines and the 3 introduced methodologies with confidence (bright shaded) and prediction intervals (light shaded). Number of quantiles for CG-C were 10, 2MA-C fitted center specific curves with splines or LOESS and MIX-C used random intercept and slopes with restricted cubic splines and 3 knots. Dashed diagonal line indicates perfect calibration.
Figure 5. Pointwise prediction interval coverage with varying validation sample size. The model validated is the same in each superpopulation and it was trained from a center with average event rate and with adequate sample size. Black dotted line indicates nominal coverage (95%).
Figure 6. Center specific (grey) and average true calibration plots for the synthetic data with 1000000 observations per …
Original abstract

Evaluation of clinical prediction models across multiple clusters, whether centers or datasets, is becoming increasingly common. A comprehensive evaluation includes an assessment of the agreement between the estimated risks and the observed outcomes, also known as calibration. Calibration is of utmost importance for clinical decision making with prediction models and it may vary between clusters. We present three approaches to take clustering into account when evaluating calibration. (1) Clustered group calibration (CG-C), (2) two-stage meta-analysis calibration (2MA-C) and (3) mixed model calibration (MIX-C) can obtain flexible calibration plots with random effects modelling and provide confidence and prediction intervals. As a case example, we externally validate a model to estimate the risk that an ovarian tumor is malignant in multiple centers (N = 2489). We also conduct a simulation study and a synthetic data study generated from a true clustered dataset to evaluate the methods. In the simulation study, MIX-C and 2MA-C (splines) gave estimated curves closest to the true overall curve. In the synthetic data study, MIX-C produced cluster-specific curves closest to the truth. Coverage of the prediction interval across the plot was best for 2MA-C with splines. We recommend using 2MA-C with splines to estimate the overall curve and the 95% PI, and MIX-C for the cluster-specific curves, especially when sample size per cluster is limited. We provide ready-to-use code to construct summary flexible calibration curves with confidence and prediction intervals to assess heterogeneity in calibration across datasets or centers.
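The grouping step that underlies a grouped calibration plot can be sketched per cluster as follows. This illustrates quantile grouping only, assuming records of the form (cluster, predicted risk, outcome); it is not the authors' CG-C implementation, which additionally combines the groups across clusters with random effects. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def grouped_calibration_by_cluster(records, n_groups=10):
    """Compute grouped calibration points separately for each cluster.

    records: iterable of (cluster, predicted_risk, outcome) with outcome in {0, 1}
    Returns {cluster: [(mean_predicted_risk, observed_proportion, group_size), ...]}
    with groups ordered from lowest to highest predicted risk.
    """
    by_cluster = defaultdict(list)
    for cluster, p, y in records:
        by_cluster[cluster].append((p, y))

    result = {}
    for cluster, rows in by_cluster.items():
        rows.sort(key=lambda r: r[0])  # order patients by predicted risk
        n = len(rows)
        groups = []
        for g in range(n_groups):
            # Integer arithmetic splits the sorted rows into near-equal chunks.
            lo = g * n // n_groups
            hi = (g + 1) * n // n_groups
            chunk = rows[lo:hi]
            if not chunk:
                continue  # fewer patients than groups in this cluster
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            groups.append((mean_pred, observed, len(chunk)))
        result[cluster] = groups
    return result
```

With small clusters some groups contain very few patients, so the per-cluster points are noisy, which is one motivation for pooling information across clusters rather than reading each cluster's plot in isolation.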

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces three random-effects-based methods (CG-C, 2MA-C, and MIX-C) for constructing flexible calibration plots for binary outcomes that account for clustering across centers or datasets. It evaluates these via a case study on ovarian tumor malignancy risk (N=2489), a simulation study, and a synthetic-data study generated from a real clustered dataset, concluding that 2MA-C with splines performs best for the overall curve and 95% prediction interval while MIX-C is preferable for cluster-specific curves, especially with small per-cluster samples. Ready-to-use code is provided.

Significance. If the performance claims hold under realistic conditions, the work supplies practical, implementable tools for multi-center calibration assessment that incorporate both confidence and prediction intervals while quantifying heterogeneity. The provision of ready-to-use code is a clear strength that lowers the barrier to adoption.

major comments (2)
  1. [simulation study] Simulation study (as described in the abstract and methods): data are generated exactly under the normal random-effects model for cluster-specific intercepts and slopes. No sensitivity analyses are reported for non-normal heterogeneity (e.g., t-distributed random effects or finite mixtures), which is common in multi-center data. Because the reported superiority of 2MA-C coverage and MIX-C cluster-curve accuracy rests on these correctly-specified simulations, the practical recommendation may not generalize when the distributional assumption fails.
  2. [synthetic data study] Synthetic data study: the data-generating process is described as coming from a true clustered dataset, but the manuscript does not state whether the random-effects distribution used for data generation matches the normality assumption of the fitted models or includes misspecification checks. This directly affects the strength of the claim that MIX-C produces cluster-specific curves closest to the truth.
minor comments (2)
  1. [abstract] The abstract states that MIX-C and 2MA-C (splines) gave curves closest to the true overall curve, but does not quantify the metric (e.g., integrated squared error) or report variability across replications.
  2. [methods] Notation for the three methods (CG-C, 2MA-C, MIX-C) is introduced without an explicit comparison table of their modeling assumptions, computational requirements, and interval types.
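The integrated squared error mentioned in the first minor comment can be computed on a grid of predicted risks as in the sketch below. This is an illustrative definition using the trapezoidal rule; the metric actually used in the paper is not stated here, and the function name is hypothetical.

```python
def integrated_squared_error(grid, estimated, true):
    """Trapezoidal approximation of the integral of (estimated - true)**2.

    grid: strictly increasing predicted-risk values
    estimated, true: curve values evaluated at each grid point
    """
    if not (len(grid) == len(estimated) == len(true)) or len(grid) < 2:
        raise ValueError("need at least two grid points and matching lengths")
    total = 0.0
    for i in range(1, len(grid)):
        width = grid[i] - grid[i - 1]
        if width <= 0:
            raise ValueError("grid must be strictly increasing")
        left = (estimated[i - 1] - true[i - 1]) ** 2
        right = (estimated[i] - true[i]) ** 2
        total += 0.5 * width * (left + right)
    return total
```

Computing this once per simulation replication and summarising the mean and spread across replications would give the kind of quantified comparison the comment asks for.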

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and agree that clarifying the data-generating processes and adding sensitivity analyses will improve the robustness of our evaluations.

Point-by-point responses
  1. Referee: [simulation study] Simulation study (as described in the abstract and methods): data are generated exactly under the normal random-effects model for cluster-specific intercepts and slopes. No sensitivity analyses are reported for non-normal heterogeneity (e.g., t-distributed random effects or finite mixtures), which is common in multi-center data. Because the reported superiority of 2MA-C coverage and MIX-C cluster-curve accuracy rests on these correctly-specified simulations, the practical recommendation may not generalize when the distributional assumption fails.

    Authors: We agree that the primary simulation study generates data under the normal random-effects assumption matching the fitted models, and no sensitivity analyses for non-normal heterogeneity (such as t-distributions or mixtures) were included. This is a valid limitation for generalizability. We will add a new subsection with sensitivity simulations under t-distributed random effects (df=3) and a two-component mixture to assess whether the relative performance of 2MA-C and MIX-C holds under misspecification. These results will be reported alongside the original findings. revision: yes

  2. Referee: [synthetic data study] Synthetic data study: the data-generating process is described as coming from a true clustered dataset, but the manuscript does not state whether the random-effects distribution used for data generation matches the normality assumption of the fitted models or includes misspecification checks. This directly affects the strength of the claim that MIX-C produces cluster-specific curves closest to the truth.

    Authors: We acknowledge that the manuscript does not explicitly describe the random-effects distribution in the synthetic data generation step. The synthetic data were derived by resampling from an empirical real clustered dataset (ovarian tumor data), so the underlying heterogeneity reflects observed (likely non-normal) patterns rather than an imposed normal distribution. We will revise the methods section to state this explicitly, clarify that this serves as a partial check against normality assumptions, and note any limitations in the discussion. revision: yes

Circularity Check

0 steps flagged

No significant circularity; methods evaluated against independent simulation truth and real data

Full rationale

The paper defines CG-C, 2MA-C and MIX-C approaches using standard random-effects modeling for clustered calibration, then evaluates performance via a simulation study (data generated from known models) and a real multi-center ovarian tumor validation dataset (N=2489). No equation or recommendation reduces a claimed prediction to a fitted parameter by construction, nor imports uniqueness via self-citation chains. The derivation chain is self-contained against external benchmarks, consistent with the reader's assessment of score 2.0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard mixed-model assumptions for binary data and on the simulations being representative of real clustered calibration heterogeneity. No new entities are postulated.

free parameters (1)
  • cluster-specific random-effect variances
    Estimated from data in the mixed and meta-analytic models; central to the interval construction.
axioms (1)
  • domain assumption: Random effects (intercepts and slopes) are normally distributed
    Invoked by the mixed-model and meta-analytic formulations for cluster heterogeneity.


