A prior-free blind detection of information leakage from model predictions

Laurence A. Jacobs

arxiv: 2606.11267 · v1 · pith:NZ4TGI6Vnew · submitted 2026-06-09 · 💻 cs.LG · cs.CR

Pith reviewed 2026-06-27 14:02 UTC · model grok-4.3

keywords data leakage detection · prediction-only auditing · blind leakage test · unit-purity head · recalibrated leakage · proper scoring rules · decision curve analysis

The pith

A near-deterministic subgroup produced by near-label leakage creates a sustained unit-purity head in predictions that no honest predictor of a non-deterministic outcome can generate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a decision-theoretic framework that treats leakage diagnostics as functionals of the joint law of predicted risk and observed outcome. It proves that any recalibrated leak whose calibration and discrimination match those of an honest model is indistinguishable from honest performance by every function of the predictions alone. It then isolates the one signature that cannot be hidden: a near-deterministic subgroup yields a unit-purity head whose length and height no legitimate predictor of a stochastic label can sustain. The resulting trichotomy (miscalibrated, broad-calibrated, and deterministic) pairs each class with its own detector and its own undetectable failure mode. Validation on time-windowed comorbidity leakage in UK Biobank measures a cohort- and endpoint-specific detection floor of Δc* ≈ 0.007, below which residual leakage is undetectable from output alone, and the test returns a verdict on a prediction vector in under a second on commodity hardware.

Core claim

Leakage falls into three exhaustive classes. Miscalibrated leakage is visible to any proper scoring rule. Broad-calibrated leakage that matches an honest model’s discrimination is invisible to every functional of the prediction vector unless an external ceiling on achievable discrimination is supplied. Deterministic leakage, however, is revealed by the unit-purity head: a subgroup whose predictions are exactly 1 or exactly 0 and whose outcomes match those predictions with probability 1. No predictor that is calibrated and whose labels remain stochastic can produce such a head; its existence is therefore decisive evidence of a near-label leak.
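The last step of that claim can be illustrated with a short argument. It is only a sketch under assumptions that are not taken from the paper: $\hat p$ is the predicted risk, $Y \in \{0,1\}$ the outcome, $X$ the legitimate covariates, and $\eta(X) = \Pr(Y = 1 \mid X)$ the true risk; "non-deterministic" is read as $0 < \eta(X) < 1$ almost surely, and "honest" as $\hat p$ being a function of $X$ alone. The paper's own theorem, which also covers near-deterministic heads, may be stated and proved differently.

```latex
\begin{align*}
\text{Calibration:}\quad
  & \Pr(Y = 1 \mid \hat p = q) = q
    \quad \text{for every } q \text{ in the support of } \hat p. \\
\text{Suppose}\quad
  & \Pr(\hat p = 1) > 0.
    \ \text{Calibration then gives } \Pr(Y = 1 \mid \hat p = 1) = 1. \\
\text{Honesty:}\quad
  & \Pr(Y = 1 \mid \hat p = 1)
    = \mathbb{E}\bigl[\eta(X) \mid \hat p = 1\bigr] < 1
    \quad \text{because } \eta(X) < 1 \text{ almost surely.} \\
\text{Hence}\quad
  & 1 < 1, \ \text{a contradiction, so } \Pr(\hat p = 1) = 0.
\end{align*}
```

The same argument with $\eta(X) > 0$ rules out positive mass at $\hat p = 0$, so under these assumptions a head of exact 0/1 predictions with matching outcomes points to information beyond $X$.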

What carries the argument

The sustained unit-purity head: the longest prefix of a sorted prediction vector in which every prediction equals 1 (or 0) and every corresponding outcome equals the prediction, whose length cannot be manufactured by any honest predictor of a non-deterministic label.
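The definition above can be turned into a few lines of Python. This is a minimal sketch rather than the author's implementation: the function name `unit_purity_head`, the `side` and `tol` parameters, and the conservative tie-breaking rule (mismatched outcomes are placed first among tied predictions, so the result never overstates the head) are all assumptions introduced here. The threshold at which a head counts as too long to be honest is not given in this summary and is not reproduced.

```python
import numpy as np

def unit_purity_head(pred, outcome, side=1, tol=0.0):
    """Length of the longest sorted prefix with pred == side and outcome == side.

    side=1 scans from the highest predictions, side=0 from the lowest.
    tol is a hypothetical tolerance for "near" 0 or 1; tol=0.0 demands exact values.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    outcome = np.asarray(outcome, dtype=int).ravel()
    if pred.shape != outcome.shape:
        raise ValueError("pred and outcome must have the same length")
    match = (outcome == side).astype(int)
    # Primary key: prediction (descending for side=1, ascending for side=0).
    # Secondary key: mismatches first among ties (assumed, conservative choice).
    primary = -pred if side == 1 else pred
    order = np.lexsort((match, primary))
    p, m = pred[order], match[order]
    pure = (np.abs(p - side) <= tol) & (m == 1)
    # The index of the first impure position is the head length.
    breaks = np.flatnonzero(~pure)
    return int(breaks[0]) if breaks.size else int(pure.size)

preds = [1.0, 1.0, 1.0, 0.8, 0.3, 0.0]
labels = [1, 1, 1, 0, 0, 0]
print(unit_purity_head(preds, labels, side=1))
print(unit_purity_head(preds, labels, side=0))
```

Because the function needs only the two vectors, it matches the output-only setting the summary describes; no training code or covariates enter the computation.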

If this is right

Miscalibrated leakage is detected by any proper scoring rule applied to the prediction-outcome pairs.
Broad-calibrated leakage remains undetectable from output alone once its discrimination equals the best honest discrimination.
Deterministic leakage is detected by the length of the unit-purity head without any external prior or training code.
The numerical detection floor is endpoint- and cohort-specific; the structural impossibility result is not.
The entire procedure runs on a prediction vector alone and returns a verdict in under a second.
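The first item in the list above can be illustrated with the Brier score, one example of a proper scoring rule. The sketch below is an assumption-laden illustration, not the paper's detector: the helper names, the ten-bin reliability term, and the synthetic overconfident predictor (a stand-in for any miscalibrated model, leaky or not) are introduced here, and the data are simulated.

```python
import numpy as np

def brier_score(pred, outcome):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    pred = np.asarray(pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return float(np.mean((pred - outcome) ** 2))

def reliability(pred, outcome, n_bins=10):
    """Binned calibration term: sum over bins of (n_k / n) * (mean pred - mean outcome)^2."""
    pred = np.asarray(pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    # Equal-width bins on [0, 1]; a prediction of exactly 1.0 falls in the last bin.
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            gap = pred[mask].mean() - outcome[mask].mean()
            total += mask.mean() * gap ** 2
    return float(total)

# Synthetic illustration only; no real cohort data is used.
rng = np.random.default_rng(0)
true_risk = rng.uniform(0.05, 0.95, size=20_000)
y = rng.binomial(1, true_risk)

honest = true_risk  # calibrated by construction
logit = np.log(true_risk / (1.0 - true_risk))
overconfident = 1.0 / (1.0 + np.exp(-3.0 * logit))  # pushed toward 0 and 1

print("Brier:", brier_score(honest, y), brier_score(overconfident, y))
print("Reliability:", reliability(honest, y), reliability(overconfident, y))
```

The calibrated predictor should score better on both quantities. A predictor that is recalibrated before release would erase this gap, which is why the second item in the list needs a different kind of evidence.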

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Auditors holding only a deployed model’s output file can now screen for the most severe form of leakage without access to training code or external data.
The trichotomy suggests that any future output-only detector must first test for the unit-purity head before attempting to bound the remaining two classes.
If the unit-purity test is negative, residual leakage smaller than the supplied discrimination ceiling cannot be distinguished from an honestly stronger model on the same endpoint.
The same head statistic may be usable as a diagnostic for label noise or for deterministic sub-populations that were never intended to be modeled probabilistically.

Load-bearing premise

An external upper bound on achievable discrimination must be supplied before the broad recalibrated class can be ruled out.
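A ceiling check of this kind might look like the sketch below. It assumes that discrimination is measured by the area under the ROC curve, which this summary does not confirm, and both the function names and the `ceiling` argument are hypothetical. A practical audit would also need an uncertainty interval around the estimate, which is omitted here.

```python
import numpy as np

def auc(pred, outcome):
    """Area under the ROC curve via the Mann-Whitney rank statistic, with tie handling."""
    pred = np.asarray(pred, dtype=float).ravel()
    y = np.asarray(outcome, dtype=int).ravel()
    n_pos = int(y.sum())
    n_neg = y.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative outcome")
    # Average 1-based ranks: tied predictions share the mean of their rank range.
    _, inverse, counts = np.unique(pred, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

def exceeds_ceiling(pred, outcome, ceiling):
    """True when observed discrimination is above the externally supplied ceiling."""
    return auc(pred, outcome) > ceiling

# The ceiling of 0.80 is a made-up value for illustration.
print(exceeds_ceiling([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1], ceiling=0.80))
```

The check is only as good as the ceiling: the value has to come from outside the prediction vector, which is exactly the dependence this premise names.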

What would settle it

A single legitimate predictor of a non-deterministic binary outcome that nevertheless sustains a unit-purity head of positive length over any subgroup would falsify the claim that only near-label leakage can produce the head.

Figures

Figures reproduced from arXiv: 2606.11267 by Laurence A. Jacobs.

**Figure 2.** Operating characteristic: detection (breadth / purity excess) versus ∆. [Image not reproduced.]

Original abstract

Data leakage -- contamination of a model with information unavailable at baseline -- is the dominant reproducibility failure in machine-learning-based science, yet detection tools require training code, external data, or domain expertise. None operates on the artifact an auditor most often holds: the model's output. We ask what can be decided about leakage from predictions and outcomes alone. We give a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome law, parameterized by a threshold-weighting linked to proper scoring rules and decision-curve analysis. We prove a sharp impossibility: a recalibrated leak matching an honest model's calibration and discrimination is indistinguishable from honest performance by any function of the predictions, so the broad class is detectable only against an externally supplied ceiling on achievable discrimination. We then prove what leakage cannot hide: a near-deterministic subgroup -- the signature of a near-label leak -- produces a sustained unit-purity head that no legitimate predictor of a non-deterministic outcome can manufacture, yielding a prior-free test. These results organize leakage into a trichotomy -- miscalibrated, broad-calibrated, and deterministic -- each with a matched detector and failure mode. We validate on UK Biobank using time-windowed comorbidity leakage with known, graded severity, measuring a detection floor of $\Delta c^{*} \approx 0.007$ on this endpoint, below which residual leakage is undetectable from output and too small to alter conclusions. The numerical floor is cohort- and endpoint-specific; the structural lesson is general: output-only detection fails where residual leakage is indistinguishable from an honestly stronger predictor. The test returns a verdict on a prediction vector in under a second on commodity hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clean trichotomy of leakage types with an impossibility result for broad recalibrated cases and a prior-free unit-purity test for deterministic leaks, but the derivations need checking and the empirical floor is dataset-specific.

The main thing here is a decision-theoretic split of leakage into miscalibrated, broad-calibrated, and deterministic categories, plus a proof that you cannot detect the middle kind from predictions and outcomes alone without an external ceiling on discrimination, and a test that flags the last kind via sustained unit-purity heads.

What is new is the trichotomy itself and the unit-purity detector for near-deterministic subgroups. The paper shows why some leakage is invisible from output alone and gives a fast, implementable check that runs in under a second. The UK Biobank example with time-windowed comorbidity leakage is a reasonable validation step that ties the test to known, graded issues and reports a concrete floor of about 0.007.

The work is clear on its parameterization through proper scoring rules and decision-curve analysis, which keeps the thresholds grounded rather than arbitrary. That is a plus.

The soft spots are limited. The impossibility result is stated sharply but depends on that external discrimination ceiling, which the abstract flags. The unit-purity claim rests on the derivations, and without seeing the full steps it is hard to judge edge cases or tightness. The reported floor is cohort- and endpoint-specific, as the authors note, so it does not generalize directly. The test still requires both predictions and outcomes, which is standard for this setting but narrows the use case.

This is for auditors and reproducibility people who hold only the prediction vector and labels. A reader working on leakage detection or medical ML validation will find the framework and the test useful. It deserves a serious referee because the claims are specific, the test is practical, and the limits are stated plainly.

Referee Report

0 major / 2 minor

Summary. The manuscript develops a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome distribution, parameterized by threshold weightings derived from proper scoring rules and decision-curve analysis. It proves an impossibility result: any recalibrated leak that matches an honest model in both calibration and discrimination is indistinguishable from honest performance by any functional of the predictions alone, so broad-class detection requires an externally supplied ceiling on achievable discrimination. It further proves that a near-deterministic subgroup (signature of near-label leakage) produces a sustained unit-purity head that no honest predictor of a non-deterministic outcome can produce, yielding a prior-free detector. Leakage is organized into a trichotomy (miscalibrated, broad-calibrated, deterministic) with matched detectors; the framework is validated on UK Biobank time-windowed comorbidity leakage, reporting a cohort-specific detection floor of Δc* ≈ 0.007 below which residual leakage is undetectable from output alone. The test runs in under a second on commodity hardware.

Significance. If the stated theorems hold, the work supplies a principled, output-only method for detecting a practically important subclass of leakage (near-label) without training code, external data, or domain expertise—an advance for reproducibility auditing in ML-based science. The impossibility result usefully delineates the limits of output-only detection, while the unit-purity test is prior-free by construction. The empirical measurement of a concrete detection floor on real data and the explicit link to decision-curve analysis are additional strengths. The trichotomy organizes the problem space clearly.

minor comments (2)

The abstract states that the proofs exist and that the detection floor is measured, yet supplies neither the derivations nor the explicit computation of Δc*; the full manuscript must place both in the main text (or a clearly referenced appendix) so that the central claims can be verified.
Notation for the threshold-weighting functional and the unit-purity head should be introduced with a single, self-contained definition early in the methods section rather than distributed across the trichotomy discussion.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful summary of the manuscript, for highlighting the decision-theoretic framing and the trichotomy of leakage types, and for noting the practical value of the prior-free unit-purity detector together with the measured detection floor on UK Biobank data. We appreciate the positive assessment of significance. No major comments were listed in the report, so we have no point-by-point responses to provide. We remain available to address any further questions or clarifications the referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

The framework is parameterized by threshold-weighting linked to proper scoring rules and decision-curve analysis (external standards). The sharp impossibility result for recalibrated leaks and the unit-purity head detector for near-deterministic subgroups are presented as theorems following from the predicted-risk/outcome law; neither reduces to a fitted parameter or self-citation. The trichotomy organizes leakage types with matched detectors and failure modes. Validation uses external cohort data (UK Biobank) with known leakage. No load-bearing step equates a prediction to its input by construction or imports uniqueness via author self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on existing decision-theoretic concepts such as proper scoring rules; no free parameters, new entities, or ad-hoc axioms are described in the abstract.

axioms (1)

standard math: Standard axioms of probability and decision theory underlying proper scoring rules and decision-curve analysis.
The leakage diagnostics are defined as functionals of the predicted-risk/outcome law using these concepts.
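For readers unfamiliar with the decision-curve side of these concepts, the standard net-benefit formula is NB(t) = TP/n − (FP/n) · t/(1 − t) at risk threshold t. The sketch below computes it and a weighted average over a grid of thresholds. The weighted average is only an illustration of what a threshold weighting could look like; the function names and the weights are assumptions, and the paper's actual functional is not reproduced in this summary.

```python
import numpy as np

def net_benefit(pred, outcome, threshold):
    """NB(t) = TP/n - (FP/n) * t / (1 - t), treating pred >= t as a positive call."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(outcome, dtype=int)
    treat = pred >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    n = y.size
    return float(tp / n - (fp / n) * threshold / (1.0 - threshold))

def weighted_net_benefit(pred, outcome, thresholds, weights):
    """Weighted mean of net benefit over a threshold grid (illustrative weighting only)."""
    w = np.asarray(weights, dtype=float)
    nb = np.array([net_benefit(pred, outcome, t) for t in thresholds])
    return float(np.sum(w * nb) / np.sum(w))

preds = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(net_benefit(preds, labels, 0.5))
print(weighted_net_benefit(preds, labels, [0.2, 0.5], [1.0, 1.0]))
```

Both quantities depend only on the prediction and outcome vectors, which is the sense in which such diagnostics are functionals of the predicted-risk/outcome law.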

pith-pipeline@v0.9.1-grok · 5826 in / 1259 out tokens · 21362 ms · 2026-06-27T14:02:12.423707+00:00 · methodology

