On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance
Pith reviewed 2026-05-24 15:27 UTC · model grok-4.3
The pith
Shrinkage methods for clinical prediction models improve average calibration but often worsen performance in the datasets where it is most needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although shrinkage of regression coefficients improves calibration slopes on average across repeated samples, the between-sample variability of those slopes is frequently larger than under maximum-likelihood estimation, and the correlation between the shrinkage factor estimated from a sample and the factor that would optimally remove overfitting is typically negative.
What carries the argument
Calibration slope (the slope of observed versus predicted log-odds) as the primary performance metric, together with the sample-to-sample variability of estimated shrinkage factors.
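As a minimal sketch of how this metric can be computed (not the paper's own code): the calibration slope is the coefficient b from a logistic regression of the observed outcome on the model's linear predictor. The function name `calibration_slope`, the Newton-Raphson fitting routine, and the toy data below are assumptions made for illustration.

```python
import math
import random


def calibration_slope(y, lp, n_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = a + b * lp by Newton-Raphson and return (a, b).

    b is the calibration slope; a is the accompanying intercept.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, lp):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy validation data (assumed): the outcome truly follows the linear predictor.
random.seed(1)
true_lp = [random.gauss(-1.0, 1.0) for _ in range(2000)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-v)) else 0 for v in true_lp]

_, slope_ok = calibration_slope(y, true_lp)

# Doubling the linear predictor mimics predictions that are too extreme.
extreme_lp = [2.0 * v for v in true_lp]
_, slope_extreme = calibration_slope(y, extreme_lp)

print(round(slope_ok, 2), round(slope_extreme, 2))
```

In this toy setup the slope for the doubled linear predictor is half the slope for the original one, which is the slope < 1 pattern that shrinkage is meant to correct.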
If this is right
- Shrinkage improves average calibration slope relative to maximum likelihood.
- Between-sample variability of calibration slopes often rises when shrinkage is applied.
- Bootstrap-based uniform shrinkage performs well overall among the tested shrinkage methods.
- Firth’s correction applies only modest shrinkage but does so with low variability.
- Negative correlation between estimated and optimal shrinkage means the methods tend to under-shrink precisely when overfitting is worst.
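The last point can be illustrated with a deliberately simplified toy simulation. This is a sketch under stated assumptions, not the paper's design: a single standard-normal predictor, a true coefficient of 1, samples of size 100, and the heuristic factor 1 - 1/chi-square standing in for the "estimated shrinkage". In this one-predictor setting the factor that restores a population calibration slope of 1 is the true coefficient divided by the fitted coefficient. All function names are hypothetical.

```python
import math
import random


def fit_logistic(x, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood fit of logit P(y = 1) = a + b * x (Newton-Raphson)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def log_lik(x, y, a, b):
    """Bernoulli log-likelihood of the logistic model with intercept a, slope b."""
    return sum(yi * (a + b * xi) - math.log1p(math.exp(a + b * xi))
               for xi, yi in zip(x, y))


def ranks(v):
    """Ranks 0..n-1 of the values in v (no tie handling; data are continuous)."""
    r = [0.0] * len(v)
    for rank, i in enumerate(sorted(range(len(v)), key=lambda i: v[i])):
        r[i] = float(rank)
    return r


def spearman(u, v):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    ru, rv = ranks(u), ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((p - mu) * (q - mv) for p, q in zip(ru, rv))
    var_u = sum((p - mu) ** 2 for p in ru)
    var_v = sum((q - mv) ** 2 for q in rv)
    return cov / math.sqrt(var_u * var_v)


def simulate(n_reps=200, n=100, beta_true=1.0, seed=2024):
    rng = random.Random(seed)
    estimated, optimal = [], []
    for _ in range(n_reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-beta_true * xi)) else 0
             for xi in x]
        a, b = fit_logistic(x, y)
        p_bar = sum(y) / n
        ll_null = log_lik(x, y, math.log(p_bar / (1.0 - p_bar)), 0.0)
        chi2 = 2.0 * (log_lik(x, y, a, b) - ll_null)  # likelihood-ratio statistic
        estimated.append(1.0 - 1.0 / chi2)  # heuristic shrinkage, 1 predictor
        optimal.append(beta_true / b)       # factor giving population slope 1
    return estimated, optimal


est, opt = simulate()
rho = spearman(est, opt)
print(round(rho, 2))
```

In this toy setting the negative association is close to mechanical: a sample that overestimates the coefficient yields a large chi-square statistic and therefore little estimated shrinkage, while the factor needed to undo the overestimate is small. Whether the same mechanism explains the paper's multi-predictor results is not something this sketch can establish.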
Where Pith is reading between the lines
- Developers may need to combine shrinkage with other safeguards such as pre-specifying strong predictors or using external validation to detect when a model remains unstable.
- Reporting only average performance across simulations can mask the risk that a single fitted model will be poorly calibrated in the target population.
- Methods that adapt the degree of shrinkage to an estimate of required shrinkage might reduce the negative correlation observed here.
Load-bearing premise
The chosen ranges of sample size, number of predictors, and event rates in the simulations match the conditions under which these shrinkage methods are used in real clinical data.
What would settle it
A new simulation or empirical dataset in which the correlation between estimated shrinkage and optimal shrinkage is positive (or zero) while average calibration still improves would falsify the reported pattern of negative correlation and high variability.
Original abstract
When developing risk prediction models, shrinkage methods are recommended, especially when the sample size is limited. Several earlier studies have shown that the shrinkage of model coefficients can reduce overfitting of the prediction model and subsequently result in better predictive performance on average. In this simulation study, we aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome, with focus on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). We investigated the following shrinkage methods in comparison to standard maximum likelihood estimation: uniform shrinkage (likelihood-based and bootstrap-based), ridge regression, penalized maximum likelihood, LASSO regression, adaptive LASSO, non-negative garrote, and Firth's correction. There were three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. Among the shrinkage methods, the bootstrap-based uniform shrinkage worked well overall. In contrast to other shrinkage approaches, Firth's correction had only a small shrinkage effect but did so with low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative. Hence, although shrinkage improved predictions on average, it often worked poorly in individual datasets, in particular when shrinkage was most needed. The observed variability of shrinkage methods implies that these methods do not solve problems associated with small sample size or low number of events per variable.
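For orientation, the quantities named in the abstract can be written out as follows. This is a sketch using standard definitions from the prediction-modelling literature. The symbols (a, b, LP, s) are notation introduced here, and treating the heuristic factor as the "likelihood-based" uniform shrinkage is an assumption, since the abstract does not state the formula.

```latex
% Calibration slope b: logistic recalibration of the outcome on the
% linear predictor LP of the fitted model, evaluated in new data.
\operatorname{logit} \Pr(Y = 1 \mid \mathrm{LP}) = a + b \cdot \mathrm{LP},
\qquad
\mathrm{LP} = \hat\beta_0 + \sum_{j=1}^{p} \hat\beta_j x_j
% b < 1: predictions too extreme; b > 1: predictions not extreme enough.

% Uniform shrinkage: every slope coefficient is multiplied by one factor,
% after which the intercept is re-estimated.
\hat\beta_j^{\,\mathrm{shrunk}} = \hat s \, \hat\beta_j ,
\qquad j = 1, \dots, p

% Heuristic (likelihood-ratio based) shrinkage factor, where
% \chi^2_{\mathrm{model}} is the model likelihood-ratio statistic
% and p is the number of estimated predictor coefficients.
\hat s_{\mathrm{heur}} = \frac{\chi^2_{\mathrm{model}} - p}{\chi^2_{\mathrm{model}}}
```

Under these definitions, an ideal factor for a given fitted model would equal that model's calibration slope in the target population, which is why the estimated factor and the calibration slope are natural quantities to compare.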
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This simulation study compares the performance and variability of several shrinkage methods (uniform shrinkage via likelihood and bootstrap, ridge regression, penalized maximum likelihood, LASSO, adaptive LASSO, non-negative garrote, and Firth's correction) against standard maximum likelihood estimation in logistic regression for binary clinical outcomes. The focus is on calibration slope across varying sample sizes, numbers of predictors, and event rates. The central claims are that shrinkage improves average calibration but often increases between-sample variability, that the correlation between estimated and optimal shrinkage is typically negative, and that these methods therefore do not reliably address problems of small samples or low events per variable, with bootstrap uniform shrinkage performing best overall and Firth's correction showing low variability.
Significance. If the simulation results generalize, the work provides concrete evidence that average improvements from shrinkage can mask high variability and poor performance in individual datasets, particularly when shrinkage is most needed. This has implications for guidelines on developing clinical prediction models in low-EPV settings. The study is strengthened by its direct comparison of multiple methods on a clinically relevant metric (calibration slope) using a coherent simulation design.
major comments (3)
- [Simulation design] Simulation design section: The chosen ranges for n, p, and event rates (and thus EPV) lack explicit justification or sensitivity analyses against empirical distributions from real clinical datasets, where predictors are typically correlated and models may be misspecified. This is load-bearing for the central claim that shrinkage methods 'do not solve problems associated with small sample size or low number of events per variable,' as the observed variability and negative correlations could be artifacts of the specific regimes simulated.
- [Methods and results (correlation analysis)] Methods and results on correlation: The computation of the 'optimal shrinkage' factor (against which estimated shrinkage is correlated) is not described in sufficient detail to allow verification that it is independent of the fitted models; this underpins the key finding of typically negative correlations and the conclusion that shrinkage 'often worked poorly in individual datasets.'
- [Methods] Methods: The number of Monte Carlo replications per scenario and handling of edge cases (e.g., perfect separation, zero events) are not reported, which directly affects the reliability of the reported between-sample variability measures and standard deviations of calibration slopes.
minor comments (2)
- [Abstract] Abstract and results: The statement that shrinkage 'often worked poorly' would benefit from quantitative support (e.g., proportion of datasets where calibration slope fell below a threshold) rather than relying solely on averages and correlations.
- [Figures and tables] Figure captions and tables: Ensure all simulation parameters (exact grids for n, p, event rates) are listed explicitly so readers can assess coverage of low-EPV regimes.
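To make the request for explicit grids concrete, the sketch below tabulates events per variable (EPV) over a small design grid. The grid values are hypothetical and chosen only for illustration; they are not the paper's scenarios. EPV is taken here as the expected number of events divided by the number of candidate predictors.

```python
from itertools import product


def events_per_variable(n, event_rate, n_predictors):
    """Expected number of events divided by the number of candidate predictors."""
    return n * event_rate / n_predictors


# Hypothetical grid, for illustration only; not the paper's actual design.
sample_sizes = [100, 200, 500]
event_rates = [0.1, 0.5]
predictor_counts = [5, 10]

rows = []
for n, rate, p in product(sample_sizes, event_rates, predictor_counts):
    rows.append((n, rate, p, events_per_variable(n, rate, p)))

for n, rate, p, epv in rows:
    print(f"n={n:4d}  event rate={rate:.1f}  p={p:2d}  EPV={epv:5.1f}")
```

A table of this kind lets a reader see at a glance how many scenarios fall in the low-EPV region that the paper's conclusions target.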
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment below and indicate planned revisions.
-
Referee: [Simulation design] Simulation design section: The chosen ranges for n, p, and event rates (and thus EPV) lack explicit justification or sensitivity analyses against empirical distributions from real clinical datasets, where predictors are typically correlated and models may be misspecified. This is load-bearing for the central claim that shrinkage methods 'do not solve problems associated with small sample size or low number of events per variable,' as the observed variability and negative correlations could be artifacts of the specific regimes simulated.
Authors: The ranges for n, p, and event rates were chosen to span EPV values from approximately 2 to 20, reflecting scenarios commonly encountered in clinical prediction model development. In the revised manuscript we will add explicit justification with supporting references from the EPV literature. We acknowledge that our design used independent predictors and correct specification to isolate shrinkage effects; we will expand the discussion to note that correlations or misspecification could alter results but are unlikely to remove the observed between-sample variability or negative correlations. We therefore do not view the central claim as an artifact of the chosen regimes, though we agree a sensitivity analysis would strengthen the work and will discuss this limitation.
Revision: partial
-
Referee: [Methods and results (correlation analysis)] Methods and results on correlation: The computation of the 'optimal shrinkage' factor (against which estimated shrinkage is correlated) is not described in sufficient detail to allow verification that it is independent of the fitted models; this underpins the key finding of typically negative correlations and the conclusion that shrinkage 'often worked poorly in individual datasets.'
Authors: We will revise the methods section to describe the optimal shrinkage factor in full detail: it is obtained analytically from the known true coefficients of the data-generating model by determining the single shrinkage value that produces a population calibration slope of exactly 1 when applied to predictions evaluated on a large independent validation sample drawn from the same population. This quantity is computed once per scenario and is independent of any model fitted to the simulated training data. The revised text will include the explicit formula and computational steps.
Revision: yes
-
Referee: [Methods] Methods: The number of Monte Carlo replications per scenario and handling of edge cases (e.g., perfect separation, zero events) are not reported, which directly affects the reliability of the reported between-sample variability measures and standard deviations of calibration slopes.
Authors: We will add the missing information to the methods section: 1000 Monte Carlo replications were performed per scenario. Replications resulting in non-convergence under maximum likelihood (including perfect separation or zero events) were excluded from summary statistics for that method; penalized approaches such as Firth's correction were retained as they handle separation by design. These details will be reported to support the reported variability measures.
Revision: yes
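The referee's concern about reliability can be made concrete with a rough calculation. The sketch below uses the large-sample normal-theory approximation for the Monte Carlo standard error of an estimated standard deviation, SD / sqrt(2 (R - 1)), where R is the number of replications kept. The function name and the example numbers are hypothetical, and calibration slopes need not be normally distributed, so this is an order-of-magnitude guide only.

```python
import math


def mc_se_of_sd(sd, n_reps):
    """Approximate Monte Carlo standard error of an estimated standard deviation.

    Uses the normal-theory approximation sd / sqrt(2 * (n_reps - 1)).
    """
    if n_reps < 2:
        raise ValueError("need at least two replications")
    return sd / math.sqrt(2.0 * (n_reps - 1))


# Hypothetical example: an SD of calibration slopes of 0.30, with all 1000
# replications kept or with some excluded (e.g. because of separation).
sd_slope = 0.30
for kept in (1000, 800, 500):
    print(kept, round(mc_se_of_sd(sd_slope, kept), 4))
```

Excluding replications for one method but not another changes both the precision and the set of datasets being summarized, so the number of retained replications per method and scenario is worth reporting alongside the variability measures.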
Circularity Check
No circularity: pure simulation study with direct empirical evaluation
This paper is a Monte Carlo simulation study. Data are generated under explicitly stated mechanisms (binary outcomes, logistic regression, fixed ranges of n, p, event rates), shrinkage methods are applied, and performance metrics (calibration slopes, variability, correlations) are computed directly from the simulated replicates. No equations derive a 'prediction' that reduces to a fitted parameter by construction, no self-citations supply load-bearing uniqueness theorems or ansatzes for the central claims, and no renaming of known results occurs. The design parameters are stated assumptions whose representativeness is a separate external-validity question, not a circularity issue. All reported findings (average improvement, increased between-sample variability, negative correlation between estimated and optimal shrinkage) are statistical summaries of the simulation output and do not collapse to the inputs by definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- simulation design parameters (sample size, predictors, event rate)
axioms (2)
- domain assumption: Logistic regression is the correct model for the simulated binary outcomes
- domain assumption: Calibration slope is an appropriate and sufficient metric for assessing overfitting in risk predictions