
arxiv: 1907.11493 · v1 · pith:4CCKDWUVnew · submitted 2019-07-26 · 📊 stat.ME

On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance

Pith reviewed 2026-05-24 15:27 UTC · model grok-4.3

classification 📊 stat.ME
keywords regression shrinkage · clinical prediction models · calibration slope · simulation study · overfitting · binary outcome · variability · penalized regression

The pith

Shrinkage methods for clinical prediction models improve calibration on average but often work poorly in the individual datasets where shrinkage is most needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses simulations to test how various shrinkage techniques affect the calibration slope of binary-outcome risk models. It finds that while shrinkage reduces overfitting on average, the slope values often fluctuate more across repeated samples than under ordinary maximum-likelihood fitting. The amount of shrinkage actually applied in a given sample tends to be negatively correlated with the amount that would have been optimal, so the methods shrink least when the data most require it. Among the options examined, bootstrap-based uniform shrinkage worked well overall, while Firth’s correction produced only modest shrinkage, but with low variability. These patterns indicate that shrinkage alone does not overcome the instability caused by small samples or a low number of events per variable.

Core claim

Although shrinkage of regression coefficients improves calibration slopes on average across repeated samples, the between-sample variability of those slopes is frequently larger than under maximum-likelihood estimation, and the correlation between the shrinkage factor estimated from a sample and the factor that would optimally remove overfitting is typically negative.
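The link between the "optimal" shrinkage factor and the calibration slope can be written out in a few lines. The sketch below assumes a uniform factor s applied to a fitted linear predictor, a logistic recalibration model that is linear in that predictor, and that "optimal" means the factor giving a calibration slope of 1 in the target population. The symbols a, b, and s are notation introduced here, not taken from the paper.

```latex
% Calibration model in the target population for the unshrunk linear predictor
\operatorname{logit} P\bigl(Y = 1 \mid \widehat{\mathrm{LP}}\bigr) = a + b\,\widehat{\mathrm{LP}}

% The same model rewritten in terms of the shrunken predictor s \widehat{LP}
a + b\,\widehat{\mathrm{LP}} = a + \frac{b}{s}\,\bigl(s\,\widehat{\mathrm{LP}}\bigr)

% The calibration slope after shrinkage is b / s, which equals 1 when
s_{\mathrm{opt}} = b
```

Under these assumptions the optimal uniform factor for a fitted model equals its own calibration slope in the target population. A sample whose model has slope 0.6 would need a factor of 0.6, so a negative correlation means such samples tend to receive estimated factors closer to 1.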

What carries the argument

Calibration slope (the slope from a logistic regression of the observed outcome on the predicted log-odds) as the primary performance metric, together with the sample-to-sample variability of estimated shrinkage factors.
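As a rough illustration of the metric, the calibration slope can be estimated by fitting a two-parameter logistic regression of the outcome on the predicted log-odds. The following is a minimal sketch in plain Python, not the paper's code: the function names, the Newton-Raphson fitting routine, and the toy data are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def calibration_slope(lp, y, max_iter=50, tol=1e-10):
    """Fit logit P(y=1) = a + b * lp by Newton-Raphson and return (a, b).

    b is the calibration slope; a is the intercept of the same model.
    """
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, t in zip(lp, y):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += t - p
            g1 += (t - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step via the closed-form inverse of the 2x2 information matrix
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy check (hypothetical data): outcomes follow the true log-odds, but the
# "model" predicts log-odds that are twice as extreme, mimicking overfitting.
rng = random.Random(2019)
true_lp = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [1 if rng.random() < sigmoid(v) else 0 for v in true_lp]
overfitted_lp = [2.0 * v for v in true_lp]

intercept, slope = calibration_slope(overfitted_lp, y)
print(f"calibration slope: {slope:.2f}")  # should land near 0.5, i.e. below 1
```

A slope below 1 here signals predictions that are too extreme, which matches the interpretation given in the abstract. The routine has no safeguard against separation, so it is only suitable for large, well-behaved samples like the toy one.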

If this is right

  • Shrinkage improves average calibration slope relative to maximum likelihood.
  • Between-sample variability of calibration slopes often rises when shrinkage is applied.
  • Bootstrap-based uniform shrinkage works well overall among the tested methods.
  • Firth’s correction applies only modest shrinkage, but with low variability.
  • Negative correlation between estimated and optimal shrinkage means the methods tend to under-shrink precisely when overfitting is worst.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to combine shrinkage with other safeguards such as pre-specifying strong predictors or using external validation to detect when a model remains unstable.
  • Reporting only average performance across simulations can mask the risk that a single fitted model will be poorly calibrated in the target population.
  • Methods that adapt the degree of shrinkage to an estimate of required shrinkage might reduce the negative correlation observed here.

Load-bearing premise

The chosen ranges of sample size, number of predictors, and event rates in the simulations match the conditions under which these shrinkage methods are used in real clinical data.

What would settle it

A new simulation or empirical dataset in which the correlation between estimated shrinkage and optimal shrinkage is positive (or zero) while average calibration still improves would falsify the reported pattern of negative correlation and high variability.

Figures

Figures reproduced from arXiv: 1907.11493 by Ben Van Calster, Ewout W. Steyerberg, Maarten van Smeden.

Figure 2. For all scenarios, box plots are given in Figure S3, and MAD in Figure S4. The variability of […]

Original abstract

When developing risk prediction models, shrinkage methods are recommended, especially when the sample size is limited. Several earlier studies have shown that the shrinkage of model coefficients can reduce overfitting of the prediction model and subsequently result in better predictive performance on average. In this simulation study, we aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome, with focus on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). We investigated the following shrinkage methods in comparison to standard maximum likelihood estimation: uniform shrinkage (likelihood-based and bootstrap-based), ridge regression, penalized maximum likelihood, LASSO regression, adaptive LASSO, non-negative garrote, and Firth's correction. There were three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. Among the shrinkage methods, the bootstrap-based uniform shrinkage worked well overall. In contrast to other shrinkage approaches, Firth's correction had only a small shrinkage effect but did so with low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative. Hence, although shrinkage improved predictions on average, it often worked poorly in individual datasets, in particular when shrinkage was most needed. The observed variability of shrinkage methods implies that these methods do not solve problems associated with small sample size or low number of events per variable.
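The abstract names likelihood-based uniform shrinkage without giving a formula. One widely used form is the heuristic shrinkage factor of Van Houwelingen and Le Cessie, computed from the model likelihood-ratio chi-square. Whether the paper uses exactly this variant is an assumption here, and the function names and numbers below are hypothetical.

```python
def heuristic_shrinkage_factor(model_chi2, df):
    """Heuristic uniform shrinkage factor: (chi2 - df) / chi2.

    model_chi2: likelihood-ratio chi-square of the fitted model against the
                intercept-only model.
    df:         number of estimated predictor coefficients.
    """
    if model_chi2 <= 0:
        raise ValueError("model chi-square must be positive")
    return (model_chi2 - df) / model_chi2


def apply_uniform_shrinkage(coefs, factor):
    """Multiply every predictor coefficient (not the intercept) by one factor."""
    return [factor * c for c in coefs]


# Hypothetical example: 5 predictors and a model chi-square of 20.
s = heuristic_shrinkage_factor(model_chi2=20.0, df=5)
shrunk = apply_uniform_shrinkage([0.8, -0.4, 0.2, 0.0, 1.0], s)
print(s)       # 0.75
print(shrunk)
```

Because the factor depends on the sample's own chi-square, it is itself a noisy estimate, and it approaches zero or turns negative when the chi-square is close to the degrees of freedom. In practice the intercept is usually re-estimated after shrinking the coefficients so that the mean predicted risk still matches the observed event rate.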

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This simulation study compares the performance and variability of several shrinkage methods (uniform shrinkage via likelihood and bootstrap, ridge regression, penalized maximum likelihood, LASSO, adaptive LASSO, non-negative garrote, and Firth's correction) against standard maximum likelihood estimation in logistic regression for binary clinical outcomes. The focus is on calibration slope across varying sample sizes, numbers of predictors, and event rates. The central claims are that shrinkage improves average calibration but often increases between-sample variability, that the correlation between estimated and optimal shrinkage is typically negative, and that these methods therefore do not reliably address problems of small samples or low events per variable, with bootstrap uniform shrinkage performing best overall and Firth's correction showing low variability.

Significance. If the simulation results generalize, the work provides concrete evidence that average improvements from shrinkage can mask high variability and poor performance in individual datasets, particularly when shrinkage is most needed. This has implications for guidelines on developing clinical prediction models in low-EPV settings. The study is strengthened by its direct comparison of multiple methods on a clinically relevant metric (calibration slope) using a coherent simulation design.

major comments (3)
  1. [Simulation design] Simulation design section: The chosen ranges for n, p, and event rates (and thus EPV) lack explicit justification or sensitivity analyses against empirical distributions from real clinical datasets, where predictors are typically correlated and models may be misspecified. This is load-bearing for the central claim that shrinkage methods 'do not solve problems associated with small sample size or low number of events per variable,' as the observed variability and negative correlations could be artifacts of the specific regimes simulated.
  2. [Methods and results (correlation analysis)] Methods and results on correlation: The computation of the 'optimal shrinkage' factor (against which estimated shrinkage is correlated) is not described in sufficient detail to allow verification that it is independent of the fitted models; this underpins the key finding of typically negative correlations and the conclusion that shrinkage 'often worked poorly in individual datasets.'
  3. [Methods] Methods: The number of Monte Carlo replications per scenario and handling of edge cases (e.g., perfect separation, zero events) are not reported, which directly affects the reliability of the reported between-sample variability measures and standard deviations of calibration slopes.
minor comments (2)
  1. [Abstract] Abstract and results: The statement that shrinkage 'often worked poorly' would benefit from quantitative support (e.g., proportion of datasets where calibration slope fell below a threshold) rather than relying solely on averages and correlations.
  2. [Figures and tables] Figure captions and tables: Ensure all simulation parameters (exact grids for n, p, event rates) are listed explicitly so readers can assess coverage of low-EPV regimes.
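The quantitative support requested in minor comment 1 could be produced with a small summary over the per-replication calibration slopes. This is a sketch only: the threshold of 0.8, the reading of MAD as median absolute deviation, and the example values are assumptions, not quantities taken from the paper.

```python
import statistics


def summarize_slopes(slopes, threshold=0.8):
    """Summarize calibration slopes from repeated simulated datasets.

    Returns the median slope, the median absolute deviation (MAD) around
    that median, and the proportion of slopes below `threshold`.
    """
    slopes = list(slopes)
    med = statistics.median(slopes)
    mad = statistics.median([abs(s - med) for s in slopes])
    prop_below = sum(1 for s in slopes if s < threshold) / len(slopes)
    return {"median": med, "mad": mad, "prop_below": prop_below}


# Hypothetical slopes from five replications of one scenario.
summary = summarize_slopes([0.6, 0.7, 0.9, 1.0, 1.3], threshold=0.8)
print(summary)
```

Reporting the proportion below a threshold alongside the median separates the "good on average" question from the "how often is a single model badly calibrated" question that the paper raises.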

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Simulation design] Simulation design section: The chosen ranges for n, p, and event rates (and thus EPV) lack explicit justification or sensitivity analyses against empirical distributions from real clinical datasets, where predictors are typically correlated and models may be misspecified. This is load-bearing for the central claim that shrinkage methods 'do not solve problems associated with small sample size or low number of events per variable,' as the observed variability and negative correlations could be artifacts of the specific regimes simulated.

    Authors: The ranges for n, p, and event rates were chosen to span EPV values from approximately 2 to 20, reflecting scenarios commonly encountered in clinical prediction model development. In the revised manuscript we will add explicit justification with supporting references from the EPV literature. We acknowledge that our design used independent predictors and correct specification to isolate shrinkage effects; we will expand the discussion to note that correlations or misspecification could alter results but are unlikely to remove the observed between-sample variability or negative correlations. We therefore do not view the central claim as an artifact of the chosen regimes, though we agree a sensitivity analysis would strengthen the work and will discuss this limitation. revision: partial

  2. Referee: [Methods and results (correlation analysis)] Methods and results on correlation: The computation of the 'optimal shrinkage' factor (against which estimated shrinkage is correlated) is not described in sufficient detail to allow verification that it is independent of the fitted models; this underpins the key finding of typically negative correlations and the conclusion that shrinkage 'often worked poorly in individual datasets.'

    Authors: We will revise the methods section to describe the optimal shrinkage factor in full detail: it is obtained analytically from the known true coefficients of the data-generating model by determining the single shrinkage value that produces a population calibration slope of exactly 1 when applied to predictions evaluated on a large independent validation sample drawn from the same population. This quantity is computed once per scenario and is independent of any model fitted to the simulated training data. The revised text will include the explicit formula and computational steps. revision: yes

  3. Referee: [Methods] Methods: The number of Monte Carlo replications per scenario and handling of edge cases (e.g., perfect separation, zero events) are not reported, which directly affects the reliability of the reported between-sample variability measures and standard deviations of calibration slopes.

    Authors: We will add the missing information to the methods section: 1000 Monte Carlo replications were performed per scenario. Replications resulting in non-convergence under maximum likelihood (including perfect separation or zero events) were excluded from summary statistics for that method; penalized approaches such as Firth's correction were retained as they handle separation by design. These details will be reported to support the reported variability measures. revision: yes

Circularity Check

0 steps flagged

No circularity: pure simulation study with direct empirical evaluation

Full rationale

This paper is a Monte Carlo simulation study. Data are generated under explicitly stated mechanisms (binary outcomes, logistic regression, fixed ranges of n, p, event rates), shrinkage methods are applied, and performance metrics (calibration slopes, variability, correlations) are computed directly from the simulated replicates. No equations derive a 'prediction' that reduces to a fitted parameter by construction, no self-citations supply load-bearing uniqueness theorems or ansatzes for the central claims, and no renaming of known results occurs. The design parameters are stated assumptions whose representativeness is a separate external-validity question, not a circularity issue. All reported findings (average improvement, increased between-sample variability, negative correlation between estimated and optimal shrinkage) are statistical summaries of the simulation output and do not collapse to the inputs by definition.
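The data-generating step that such a simulation starts from can be sketched as follows. The intercept, coefficients, sample size, and the use of independent standard-normal predictors are hypothetical choices for illustration, and events per variable (EPV) is computed here with the common convention of dividing the size of the smaller outcome group by the number of candidate predictors.

```python
import math
import random


def simulate_dataset(n, intercept, coefs, rng):
    """Draw n subjects with independent standard-normal predictors and a
    binary outcome from a logistic model (all settings are hypothetical)."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in coefs]
        lp = intercept + sum(c * v for c, v in zip(coefs, row))
        risk = 1.0 / (1.0 + math.exp(-lp))
        X.append(row)
        y.append(1 if rng.random() < risk else 0)
    return X, y


def events_per_variable(y, n_predictors):
    """EPV: size of the smaller outcome group per candidate predictor."""
    events = sum(y)
    return min(events, len(y) - events) / n_predictors


# One hypothetical development dataset with 5 predictors.
rng = random.Random(1)
X, y = simulate_dataset(n=150, intercept=-2.5, coefs=[0.5] * 5, rng=rng)
print("events:", sum(y), "EPV:", events_per_variable(y, n_predictors=5))
```

Repeating this step many times per scenario, fitting each method, and evaluating every fitted model on a large independent sample gives the distribution of calibration slopes that the variability results summarize.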

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The study rests on standard simulation assumptions for logistic data generation and performance metrics without introducing fitted parameters beyond design choices or new entities.

free parameters (1)
  • simulation design parameters (sample size, predictors, event rate)
    Chosen by authors to represent limited-sample clinical scenarios; not data-fitted but selected to explore variability.
axioms (2)
  • domain assumption: Logistic regression is the correct model for the simulated binary outcomes.
    Underlying data-generating process in the simulation setup.
  • domain assumption: Calibration slope is an appropriate and sufficient metric for assessing overfitting in risk predictions.
    Central evaluation criterion used to compare methods.


Overview of the characteristics of the 60 simulation scenarios

    Predictors: 5 true predictors, or 5 true + 5 noise predictors

    Correlation  Event rate  EPV  Events  Sample size  True c statistic  Model intercept
    0            0.1           3      15          150  0.75              -2.57
                               5      25          250
                              10      50          500
                              20     100         1000
                              50     250         2500
                 0.5           3      15           30  0.74              0
                               5      25           50
                              10      50          100
                              20     100          200
                              50     250          500
    0.5          0.1           3      15          150  0.83              -2...