Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening
Pith reviewed 2026-05-20 10:29 UTC · model grok-4.3
The pith
Self-supervised pretraining improves selective prediction in diabetic retinopathy screening, but longer pretraining does not consistently enhance reliability after accuracy saturates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that SSL pretraining improves calibrated confidence and confidence-based abstention compared to training from scratch, yet once accuracy saturates, selective performance can still change markedly across checkpoints and longer pretraining does not consistently improve reliability.
What carries the argument
Calibrated confidence-based abstention, evaluated under a fixed fine-tuning protocol applied to multiple SSL checkpoints.
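To make the abstention metrics concrete, here is a minimal sketch of how coverage, selective accuracy, and selective macro-F1 could be computed from calibrated class probabilities. It is not the authors' implementation. The function name `selective_metrics`, the use of the maximum class probability as the confidence score, and the choice to average F1 over the classes that appear among accepted cases are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def selective_metrics(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, float, float]:
    """Return (coverage, selective accuracy, selective macro-F1).

    Assumption: confidence is the maximum calibrated class probability,
    and a case is deferred (abstained on) when confidence < threshold.
    """
    preds: List[int] = [max(range(len(p)), key=lambda k: p[k]) for p in probs]
    confs: List[float] = [max(p) for p in probs]

    # Indices of cases the model is confident enough to answer.
    kept = [i for i, c in enumerate(confs) if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        # Nothing accepted: the selective metrics are undefined.
        return 0.0, float("nan"), float("nan")

    accuracy = sum(preds[i] == labels[i] for i in kept) / len(kept)

    # Assumption: macro-F1 averages over classes seen among accepted cases,
    # either as a true label or as a prediction.
    classes = sorted({labels[i] for i in kept} | {preds[i] for i in kept})
    f1_scores = []
    for c in classes:
        tp = sum(preds[i] == c and labels[i] == c for i in kept)
        fp = sum(preds[i] == c and labels[i] != c for i in kept)
        fn = sum(preds[i] != c and labels[i] == c for i in kept)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return coverage, accuracy, sum(f1_scores) / len(f1_scores)
```

Raising `threshold` lowers coverage and usually raises selective accuracy, which is why the two are reported together rather than in isolation.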
If this is right
- SSL pretraining enhances selective accuracy and selective macro-F1 across datasets compared to from-scratch training.
- Selective performance varies across checkpoints even after accuracy plateaus.
- Extending pretraining duration does not uniformly improve abstention reliability.
- Abstention-aware evaluation is necessary for assessing safety in clinical screening tasks.
Where Pith is reading between the lines
- Developers may need to track selective metrics across many checkpoints instead of stopping at the accuracy peak.
- The same checkpoint-selection approach could be tested on other medical imaging tasks that require safe deferral.
- In deployment, reliability monitoring might shift from single best-accuracy models to families of checkpoints evaluated for abstention quality.
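As a sketch of what checkpoint selection on abstention quality could look like, the code below fixes a target coverage, measures selective accuracy at that coverage, and then picks the checkpoint with the best selective accuracy among those whose plain accuracy is within a tolerance of the best. The helper names, the tolerance rule, and the checkpoint scores in the comments are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Sequence, Tuple


def threshold_for_coverage(confidences: Sequence[float], target_coverage: float) -> float:
    """Confidence threshold that accepts at least target_coverage of the cases."""
    n_keep = max(1, math.ceil(target_coverage * len(confidences)))
    return sorted(confidences, reverse=True)[n_keep - 1]


def selective_accuracy_at_coverage(
    confidences: Sequence[float],
    correct: Sequence[int],
    target_coverage: float,
) -> float:
    """Accuracy on the accepted cases; `correct` holds 1 (right) or 0 (wrong)."""
    t = threshold_for_coverage(confidences, target_coverage)
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
    return sum(kept) / len(kept)


def pick_checkpoint(
    results: Dict[str, Tuple[float, float]],
    accuracy_tolerance: float = 0.01,
) -> str:
    """Pick the checkpoint with the best selective accuracy among those whose
    plain accuracy is within accuracy_tolerance of the best plain accuracy.

    results maps checkpoint name -> (accuracy, selective accuracy), both
    measured on held-out validation data (an assumption of this sketch).
    """
    best_acc = max(acc for acc, _ in results.values())
    eligible = {k: v for k, v in results.items() if v[0] >= best_acc - accuracy_tolerance}
    return max(eligible, key=lambda k: eligible[k][1])
```

The tolerance encodes the idea that, once accuracy has plateaued, checkpoints with near-identical accuracy may still differ in how well their confidence separates right from wrong predictions.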
Load-bearing premise
The fixed fine-tuning protocol combined with calibrated confidence-based abstention metrics sufficiently captures real-world clinical safety, and the chosen datasets represent the variability in actual screening practice.
What would settle it
A new DR dataset where SSL pretraining fails to raise selective accuracy or selective macro-F1 above from-scratch baselines, or where longer pretraining checkpoints show steadily worse abstention after accuracy has plateaued, would falsify the reported pattern.
Original abstract
Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines how the duration of self-supervised learning (SSL) pretraining affects calibrated confidence and confidence-based abstention in diabetic retinopathy (DR) grading models. It evaluates multiple SSL checkpoints under a single fixed fine-tuning protocol, reporting coverage, selective accuracy, and selective macro-F1. The central claims are that SSL pretraining improves selective prediction relative to training from scratch across datasets and data regimes, but that selective performance can still vary substantially once accuracy saturates and longer pretraining does not consistently improve reliability. The work argues for abstention-aware evaluation in safety-critical screening tasks.
Significance. If the reported patterns hold under more varied protocols and datasets, the result would be significant for medical imaging: it shows that accuracy saturation does not imply stable selective performance and that pretraining length is a reliability design choice rather than a mere computational detail. The public code release is a clear strength that aids verification.
major comments (3)
- [Experimental protocol / Results] The central claim that SSL pretraining improves selective metrics rests on a fixed fine-tuning protocol applied uniformly to all checkpoints and the from-scratch baseline. Without evidence that this protocol remains optimal or non-interacting with checkpoint quality (e.g., no ablation of learning-rate schedules or epoch counts per checkpoint), observed fluctuations in selective accuracy and macro-F1 after accuracy saturation could be protocol artifacts rather than intrinsic properties of the pretrained representations.
- [Datasets and evaluation] The abstract and results assert improvements “across datasets and data regimes,” yet the manuscript supplies no quantitative description of dataset diversity (camera models, resolution distributions, population demographics, or label-noise levels). This omission is load-bearing for the safety conclusion, because selective-prediction gains on homogeneous data do not necessarily translate to the variability encountered in clinical DR screening.
- [Results] The observation that “longer pretraining does not consistently improve reliability” is presented without statistical quantification of the variation (e.g., standard errors on selective macro-F1 across checkpoints once accuracy plateaus, or a formal test for trend). A single fixed protocol plus limited checkpoint sampling leaves open the possibility that the non-monotonic behavior is under-powered rather than a robust negative finding.
minor comments (2)
- [Methods] Clarify in the methods whether the confidence calibration (temperature scaling or similar) is performed on a held-out validation set or on the same data used for selective-metric computation; the current description leaves room for optimistic bias.
- [Figures] Figure captions and axis labels should explicitly state the number of checkpoints, the exact pretraining epochs or steps corresponding to each point, and whether error bars represent standard deviation over multiple fine-tuning seeds.
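The calibration concern in the first minor comment can be illustrated with a small sketch: the temperature is fitted only on held-out validation logits and then applied, unchanged, to the data on which selective metrics are computed. The referee names temperature scaling only as one possibility, so its use here is an assumption, as are the function names and the simple grid search (a gradient-based fit would work equally well).

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_batch: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(max(softmax(z, temperature)[y], 1e-12))
        for z, y in zip(logits_batch, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(
    val_logits: Sequence[Sequence[float]],
    val_labels: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Pick the temperature that minimises validation NLL.

    Must be called on held-out validation data only; fitting on the
    evaluation split would give optimistically biased confidences.
    """
    if grid is None:
        grid = [0.25 + 0.05 * k for k in range(96)]  # 0.25 .. 5.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))
```

In use, `fit_temperature` would be called once per fine-tuned model on its validation logits, and the returned value passed to `softmax` when scoring the test split.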
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the referee's focus on experimental rigor and generalizability. Below we respond to each major comment, proposing revisions where appropriate to strengthen the manuscript.
- Referee: The central claim that SSL pretraining improves selective metrics rests on a fixed fine-tuning protocol applied uniformly to all checkpoints and the from-scratch baseline. Without evidence that this protocol remains optimal or non-interacting with checkpoint quality (e.g., no ablation of learning-rate schedules or epoch counts per checkpoint), observed fluctuations in selective accuracy and macro-F1 after accuracy saturation could be protocol artifacts rather than intrinsic properties of the pretrained representations.
  Authors: We intentionally employed a single fixed fine-tuning protocol across all SSL checkpoints and the from-scratch baseline to isolate the impact of pretraining duration on calibrated confidence and selective performance. This approach prevents confounding variables from differing optimization strategies and enables a controlled comparison. While we agree that protocol interactions could exist, the consistent improvements in selective metrics under this protocol support our claims. We will revise the discussion to explicitly acknowledge this design choice as a potential limitation and suggest that future studies explore adaptive fine-tuning per checkpoint.
  Revision: partial
- Referee: The abstract and results assert improvements “across datasets and data regimes,” yet the manuscript supplies no quantitative description of dataset diversity (camera models, resolution distributions, population demographics, or label-noise levels). This omission is load-bearing for the safety conclusion, because selective-prediction gains on homogeneous data do not necessarily translate to the variability encountered in clinical DR screening.
  Authors: We will update the manuscript to provide a more detailed quantitative description of the datasets, including information on camera models, image resolutions, patient demographics, and any available details on label noise. The experiments use well-established public DR datasets that are representative of clinical variability, but we concur that explicit quantification will better support the generalizability of our safety-related conclusions.
  Revision: yes
- Referee: The observation that “longer pretraining does not consistently improve reliability” is presented without statistical quantification of the variation (e.g., standard errors on selective macro-F1 across checkpoints once accuracy plateaus, or a formal test for trend). A single fixed protocol plus limited checkpoint sampling leaves open the possibility that the non-monotonic behavior is under-powered rather than a robust negative finding.
  Authors: We will incorporate statistical quantification by adding standard errors to the reported selective metrics and conducting a formal trend analysis or non-parametric test for the plateau region across checkpoints. This will provide stronger evidence for the observed non-monotonic behavior in reliability metrics.
  Revision: yes
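The statistical quantification discussed in the third response could take roughly the following form: a mean and standard error over fine-tuning seeds for each checkpoint, plus a rank-based trend test between pretraining length and the selective metric. This is a sketch under assumptions. The rebuttal does not name a specific test, so the Spearman correlation with a permutation p-value is one plausible choice, and all function names are illustrative.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def mean_and_se(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error over fine-tuning seeds (needs >= 2 values)."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var) if var > 0 else 0.0


def permutation_p_value(x: Sequence[float], y: Sequence[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation p-value for a monotone trend between x and y."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `x` would be the pretraining epochs of the checkpoints in the plateau region and `y` the per-checkpoint mean of the selective metric. With only a handful of checkpoints the test has little power, which is the referee's point about under-powered negative findings.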
Circularity Check
Empirical evaluation study with no circular derivation or self-referential reduction
This is an experimental comparison paper that evaluates SSL pretraining duration effects on selective prediction metrics (coverage, selective accuracy, selective macro-F1) under a fixed fine-tuning protocol, contrasting against training-from-scratch baselines across datasets. No mathematical derivation chain exists; claims rest on direct empirical measurements rather than first-principles results that reduce to inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methods. The fixed-protocol design and dataset comparisons are presented as external benchmarks, rendering the central findings self-contained without circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Paper passage: "longer pretraining does not consistently improve reliability"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.