arxiv: 2605.19133 · v1 · pith:DUFL6ACMnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

Pith reviewed 2026-05-20 10:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords self-supervised learning, abstention, diabetic retinopathy, selective prediction, confidence calibration, medical image analysis

The pith

Self-supervised pretraining improves selective prediction in diabetic retinopathy screening, but longer pretraining does not consistently enhance reliability after accuracy saturates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the impact of self-supervised learning pretraining duration on a model's ability to abstain from unreliable predictions in diabetic retinopathy grading. The authors evaluate multiple checkpoints from SSL pretraining using a fixed fine-tuning setup and measure performance on calibrated confidence and abstention metrics such as coverage and selective accuracy. They find that SSL pretraining leads to better selective prediction than training models from scratch across various datasets and data regimes. However, selective performance can still vary significantly across different checkpoints even when overall accuracy has stopped improving. This highlights the need to consider pretraining length as a factor in building reliable, safety-aware medical AI systems rather than focusing solely on accuracy.

Core claim

The paper claims that SSL pretraining improves calibrated confidence and confidence-based abstention compared to training from scratch, yet once accuracy saturates, selective performance can still change markedly across checkpoints and longer pretraining does not consistently improve reliability.

What carries the argument

calibrated confidence-based abstention under a fixed fine-tuning protocol applied to multiple SSL checkpoints
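
As a rough illustration of confidence-based abstention, the sketch below accepts a prediction only when its confidence reaches a threshold and then reports coverage (fraction of cases accepted) and selective accuracy (accuracy on the accepted cases). The function name `selective_metrics`, the threshold of 0.75, and the toy inputs are assumptions made for this example; they are not taken from the paper or its code.

```python
from typing import Sequence, Tuple


def selective_metrics(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy) for a confidence threshold.

    A case is accepted when its confidence is >= threshold; all other
    cases are treated as deferred for clinical review.
    """
    accepted = [i for i, c in enumerate(confidences) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    if not accepted:
        # Nothing was accepted, so selective accuracy is undefined.
        return coverage, float("nan")
    correct = sum(predictions[i] == labels[i] for i in accepted)
    return coverage, correct / len(accepted)


# Toy inputs (illustrative only, not data from the paper).
conf = [0.95, 0.60, 0.80, 0.55, 0.90]
pred = [0, 1, 2, 1, 0]
true = [0, 2, 2, 0, 0]

coverage, sel_acc = selective_metrics(conf, pred, true, threshold=0.75)
print(f"coverage={coverage:.2f}, selective accuracy={sel_acc:.2f}")
```

Raising the threshold trades coverage for accuracy on the accepted cases, which is the trade-off the abstention metrics are meant to expose.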

If this is right

  • SSL pretraining enhances selective accuracy and selective macro-F1 across datasets compared to from-scratch training.
  • Selective performance varies across checkpoints even after accuracy plateaus.
  • Extending pretraining duration does not uniformly improve abstention reliability.
  • Abstention-aware evaluation is necessary for assessing safety in clinical screening tasks.
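
Selective macro-F1 can be sketched the same way at a fixed target coverage: keep the most confident fraction of cases and average the per-class F1 over what remains. The helper names, the choice to average over the classes present among the kept cases, and the toy inputs are assumptions for illustration; the paper's exact definition may differ.

```python
from typing import List, Sequence


def macro_f1(preds: Sequence[int], labels: Sequence[int], classes: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores: List[float] = []
    for c in classes:
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def selective_macro_f1(
    confs: Sequence[float],
    preds: Sequence[int],
    labels: Sequence[int],
    target_coverage: float,
) -> float:
    """Macro-F1 on the most confident `target_coverage` fraction of cases."""
    n_keep = max(1, round(target_coverage * len(confs)))
    order = sorted(range(len(confs)), key=lambda i: confs[i], reverse=True)
    kept = order[:n_keep]
    kept_preds = [preds[i] for i in kept]
    kept_labels = [labels[i] for i in kept]
    # Assumption: average over classes that appear among the kept cases.
    classes = sorted(set(kept_preds) | set(kept_labels))
    return macro_f1(kept_preds, kept_labels, classes)


# Toy inputs (illustrative only, not data from the paper).
confs = [0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.50, 0.45, 0.40]
preds = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
labels = [0, 0, 1, 1, 2, 1, 0, 2, 1, 1]

sel_f1 = selective_macro_f1(confs, preds, labels, target_coverage=0.70)
full_f1 = selective_macro_f1(confs, preds, labels, target_coverage=1.0)
print(f"macro-F1 at 70% coverage: {sel_f1:.3f}; at full coverage: {full_f1:.3f}")
```

In this toy case the errors sit among the low-confidence cases, so abstaining on them raises macro-F1; a poorly calibrated model would not show that gain.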

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to track selective metrics across many checkpoints instead of stopping at the accuracy peak.
  • The same checkpoint-selection approach could be tested on other medical imaging tasks that require safe deferral.
  • In deployment, reliability monitoring might shift from single best-accuracy models to families of checkpoints evaluated for abstention quality.
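
One way to act on the first point is to treat all checkpoints within a small tolerance of the best accuracy as an "accuracy plateau" and pick the one with the best selective metric. This is a hypothetical selection rule, not a procedure from the paper; `CheckpointEval`, `pick_checkpoint`, the tolerance, and all numbers below are made-up placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckpointEval:
    epoch: int                 # SSL pretraining epoch of the checkpoint
    accuracy: float            # downstream accuracy after fine-tuning
    selective_accuracy: float  # accuracy on accepted cases at a fixed coverage


def pick_checkpoint(evals: List[CheckpointEval], accuracy_tolerance: float = 0.01) -> CheckpointEval:
    """Among checkpoints on the accuracy plateau, return the one with the
    highest selective accuracy."""
    best_acc = max(e.accuracy for e in evals)
    plateau = [e for e in evals if best_acc - e.accuracy <= accuracy_tolerance]
    return max(plateau, key=lambda e: e.selective_accuracy)


# Placeholder values for illustration; these are not results from the paper.
evals = [
    CheckpointEval(epoch=100, accuracy=0.780, selective_accuracy=0.86),
    CheckpointEval(epoch=200, accuracy=0.820, selective_accuracy=0.93),
    CheckpointEval(epoch=300, accuracy=0.825, selective_accuracy=0.89),
    CheckpointEval(epoch=400, accuracy=0.820, selective_accuracy=0.91),
]

chosen = pick_checkpoint(evals)
print(f"chosen epoch: {chosen.epoch}")
```

Here the highest-accuracy checkpoint (epoch 300) is not the one selected, because an earlier checkpoint on the same plateau abstains more usefully.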

Load-bearing premise

The fixed fine-tuning protocol combined with calibrated confidence-based abstention metrics sufficiently captures real-world clinical safety, and the chosen datasets represent the variability in actual screening practice.

What would settle it

A new DR dataset where SSL pretraining fails to raise selective accuracy or selective macro-F1 above from-scratch baselines, or where longer pretraining checkpoints show steadily worse abstention after accuracy has plateaued, would falsify the reported pattern.

Figures

Figures reproduced from arXiv: 2605.19133 by Jan H. Terheyden, Lorenz Sparrenberg, Muskaan Chopra, Rafet Sifa.

Figure 1: Downstream macro-F1 across SSL pretraining check...
Figure 2: Selective prediction comparison between SiCoVa and...
Figure 3: Grad-CAM visualizations for SiCoVa across three down...
Figure 4: Representation structure and severity-dependent abstention using PaCMAP and class-wise acceptance statistics at 70% coverage.
Original abstract

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines how the duration of self-supervised learning (SSL) pretraining affects calibrated confidence and confidence-based abstention in diabetic retinopathy (DR) grading models. It evaluates multiple SSL checkpoints under a single fixed fine-tuning protocol and reports coverage, selective accuracy, and selective macro-F1. The central claims are that SSL pretraining improves selective prediction relative to training from scratch across datasets and data regimes, that selective performance can still vary substantially once accuracy saturates, and that longer pretraining does not consistently improve reliability. The work argues for abstention-aware evaluation in safety-critical screening tasks.

Significance. If the reported patterns hold under more varied protocols and datasets, the result would be significant for medical imaging: it shows that accuracy saturation does not imply stable selective performance and that pretraining length is a reliability design choice rather than a mere computational detail. The public code release is a clear strength that aids verification.

major comments (3)
  1. [Experimental protocol / Results] The central claim that SSL pretraining improves selective metrics rests on a fixed fine-tuning protocol applied uniformly to all checkpoints and the from-scratch baseline. Without evidence that this protocol remains optimal or non-interacting with checkpoint quality (e.g., no ablation of learning-rate schedules or epoch counts per checkpoint), observed fluctuations in selective accuracy and macro-F1 after accuracy saturation could be protocol artifacts rather than intrinsic properties of the pretrained representations.
  2. [Datasets and evaluation] The abstract and results assert improvements “across datasets and data regimes,” yet the manuscript supplies no quantitative description of dataset diversity (camera models, resolution distributions, population demographics, or label-noise levels). This omission is load-bearing for the safety conclusion, because selective-prediction gains on homogeneous data do not necessarily translate to the variability encountered in clinical DR screening.
  3. [Results] The observation that “longer pretraining does not consistently improve reliability” is presented without statistical quantification of the variation (e.g., standard errors on selective macro-F1 across checkpoints once accuracy plateaus, or a formal test for trend). A single fixed protocol plus limited checkpoint sampling leaves open the possibility that the non-monotonic behavior is under-powered rather than a robust negative finding.
minor comments (2)
  1. [Methods] Clarify in the methods whether the confidence calibration (temperature scaling or similar) is performed on a held-out validation set or on the same data used for selective-metric computation; the current description leaves room for optimistic bias.
  2. [Figures] Figure captions and axis labels should explicitly state the number of checkpoints, the exact pretraining epochs or steps corresponding to each point, and whether error bars represent standard deviation over multiple fine-tuning seeds.
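
The concern in minor comment 1 can be made concrete with a small sketch of temperature scaling in which the temperature is fitted only on held-out validation logits and then applied, unchanged, to test logits. Temperature scaling is used here because the referee names it as one possibility; the paper's actual calibration method is not specified in this summary. The grid search, function names, and toy logits are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_batch: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(l, temperature)[y]) for l, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


def fit_temperature(
    val_logits: Sequence[Sequence[float]],
    val_labels: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Pick the temperature minimising validation NLL (simple grid search)."""
    grid = grid or [0.5 + 0.05 * k for k in range(91)]  # 0.5 to 5.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


# Held-out validation split: confident logits, but one of four is wrong.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 0]
T = fit_temperature(val_logits, val_labels)

# Separate test case: the fitted temperature is reused, never refitted.
test_logits = [4.0, 0.0, 0.0]
raw_conf = max(softmax(test_logits))
calibrated_conf = max(softmax(test_logits, T))
print(f"T={T:.2f}, raw={raw_conf:.3f}, calibrated={calibrated_conf:.3f}")
```

Fitting `T` on the same cases used to compute coverage and selective accuracy would let the calibration step see the evaluation labels, which is the optimistic bias the comment warns about.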

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review. We appreciate the referee's focus on experimental rigor and generalizability. Below we respond to each major comment, proposing revisions where appropriate to strengthen the manuscript.

  1. Referee: The central claim that SSL pretraining improves selective metrics rests on a fixed fine-tuning protocol applied uniformly to all checkpoints and the from-scratch baseline. Without evidence that this protocol remains optimal or non-interacting with checkpoint quality (e.g., no ablation of learning-rate schedules or epoch counts per checkpoint), observed fluctuations in selective accuracy and macro-F1 after accuracy saturation could be protocol artifacts rather than intrinsic properties of the pretrained representations.

    Authors: We intentionally employed a single fixed fine-tuning protocol across all SSL checkpoints and the from-scratch baseline to isolate the impact of pretraining duration on calibrated confidence and selective performance. This approach prevents confounding variables from differing optimization strategies and enables a controlled comparison. While we agree that protocol interactions could exist, the consistent improvements in selective metrics under this protocol support our claims. We will revise the discussion to explicitly acknowledge this design choice as a potential limitation and suggest that future studies explore adaptive fine-tuning per checkpoint. revision: partial

  2. Referee: The abstract and results assert improvements “across datasets and data regimes,” yet the manuscript supplies no quantitative description of dataset diversity (camera models, resolution distributions, population demographics, or label-noise levels). This omission is load-bearing for the safety conclusion, because selective-prediction gains on homogeneous data do not necessarily translate to the variability encountered in clinical DR screening.

    Authors: We will update the manuscript to provide a more detailed quantitative description of the datasets, including information on camera models, image resolutions, patient demographics, and any available details on label noise. The experiments use well-established public DR datasets that are representative of clinical variability, but we concur that explicit quantification will better support the generalizability of our safety-related conclusions. revision: yes

  3. Referee: The observation that “longer pretraining does not consistently improve reliability” is presented without statistical quantification of the variation (e.g., standard errors on selective macro-F1 across checkpoints once accuracy plateaus, or a formal test for trend). A single fixed protocol plus limited checkpoint sampling leaves open the possibility that the non-monotonic behavior is under-powered rather than a robust negative finding.

    Authors: We will incorporate statistical quantification by adding standard errors to the reported selective metrics and conducting a formal trend analysis or non-parametric test for the plateau region across checkpoints. This will provide stronger evidence for the observed non-monotonic behavior in reliability metrics. revision: yes
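
A minimal sketch of what such quantification could look like, assuming several fine-tuning seeds per checkpoint: a standard error across seeds, plus a Spearman rank correlation between pretraining length and the selective metric with an exact permutation p-value. The choice of Spearman correlation, the function names, and every number below are assumptions for illustration; the authors do not say which test they will use.

```python
import itertools
import math
import statistics
from typing import List, Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean across fine-tuning seeds."""
    return statistics.stdev(values) / math.sqrt(len(values))


def ranks(values: Sequence[float]) -> List[int]:
    """1-based ranks; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (tie-free formula)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def permutation_pvalue(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact two-sided p-value over all permutations of y (small n only)."""
    observed = abs(spearman(x, y))
    hits = total = 0
    for perm in itertools.permutations(y):
        total += 1
        if abs(spearman(x, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder numbers for illustration; these are not results from the paper.
seed_scores = [0.70, 0.72, 0.74]          # one checkpoint, three seeds
epochs = [100, 200, 300, 400, 500, 600]   # plateau-region checkpoints
mean_sel_f1 = [0.71, 0.74, 0.69, 0.75, 0.70, 0.73]

se = standard_error(seed_scores)
rho = spearman(epochs, mean_sel_f1)
p_value = permutation_pvalue(epochs, mean_sel_f1)
print(f"SE={se:.4f}, Spearman rho={rho:.3f}, p={p_value:.3f}")
```

With only a handful of checkpoints, a test like this has little power, so a non-significant trend would support "no consistent improvement" only weakly; that is the under-powering concern raised in major comment 3.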

Circularity Check

0 steps flagged

Empirical evaluation study with no circular derivation or self-referential reduction

This is an experimental comparison paper that evaluates SSL pretraining duration effects on selective prediction metrics (coverage, selective accuracy, selective macro-F1) under a fixed fine-tuning protocol, contrasting against training-from-scratch baselines across datasets. No mathematical derivation chain exists; claims rest on direct empirical measurements rather than first-principles results that reduce to inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methods. The fixed-protocol design and dataset comparisons are presented as external benchmarks, rendering the central findings self-contained without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical evaluation; no free parameters, axioms, or invented entities are identifiable from the abstract.

pith-pipeline@v0.9.0 · 5744 in / 1054 out tokens · 40684 ms · 2026-05-20T10:29:31.246506+00:00 · methodology
