CURE-OOD: Benchmarking Out-of-Distribution Detection for Survival Prediction
Pith reviewed 2026-05-09 20:07 UTC · model grok-4.3
The pith
CURE-OOD introduces the first controlled benchmark for testing OOD detection in cancer survival prediction from CT scans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CURE-OOD defines scanner-parameter-based training, in-distribution, and OOD test splits across four survival prediction tasks. Experiments demonstrate that covariate shifts from acquisition variations notably reduce survival prediction performance, and that mainstream classification-oriented OOD detectors can fail in survival prediction. HazardDev is included as a simple survival-aware reference baseline for OOD detection, enabling systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.
What carries the argument
CURE-OOD benchmark, which creates controlled training/ID/OOD splits from scanner parameters to isolate acquisition-induced covariate shifts in survival tasks.
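The review does not spell out how the scanner-parameter splits are built beyond the description above. A minimal sketch of one plausible construction, with entirely hypothetical column names, parameter values, and thresholds:

```python
import pandas as pd

# Hypothetical scan-level metadata; column names and values are
# illustrative only, not taken from CURE-OOD.
scans = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "pixel_spacing_mm": [0.78, 0.82, 0.80, 1.20, 1.25, 0.79],
    "exposure_time_ms": [500, 520, 510, 900, 880, 505],
})

# Pick a threshold on one acquisition parameter: scans inside the
# common range form the training/ID pool, the rest become the OOD split.
ID_MAX_SPACING = 1.0
id_pool = scans[scans["pixel_spacing_mm"] <= ID_MAX_SPACING]
ood_split = scans[scans["pixel_spacing_mm"] > ID_MAX_SPACING]

# Split the ID pool into training and held-out ID test sets.
train = id_pool.sample(frac=0.7, random_state=0)
id_test = id_pool.drop(train.index)

print(len(train), len(id_test), len(ood_split))
```

Thresholding on a single acquisition parameter is what makes the resulting covariate shift "controlled": the ID and OOD partitions differ by construction in exactly that parameter, though, as the load-bearing premise below notes, not necessarily in the ways real hospitals differ.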
If this is right
- Covariate shifts from scanner variations reduce survival prediction performance.
- Mainstream classification-oriented OOD detectors fail to transfer reliably to survival prediction.
- HazardDev supplies a survival-aware baseline that can be used for comparison in future OOD studies.
- The benchmark supports systematic measurement of how acquisition shifts simultaneously affect prediction accuracy and detectability.
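The "classification-oriented" detectors referenced above were designed for classifiers; the canonical example is the maximum-softmax-probability (MSP) baseline of Hendrycks and Gimpel, cited in the reference graph. A self-contained sketch of that score, to illustrate why it is classifier-shaped rather than survival-shaped:

```python
import numpy as np

def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability (Hendrycks & Gimpel baseline).

    Higher score -> the detector treats the sample as in-distribution.
    The score assumes each row is a vector of class logits; survival
    models instead emit hazard/logit sequences over time bins, so this
    score need not be meaningful for them.
    """
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# A confident "classifier-like" output versus a nearly flat one.
scores = msp_score(np.array([[4.0, 0.0, 0.0],
                             [0.1, 0.0, 0.1]]))
```

Here the peaked logit vector scores near 1 and the flat one near 1/3; the paper's finding is that this kind of confidence-based separation does not transfer reliably to survival-model outputs.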
Where Pith is reading between the lines
- Hospitals could use similar scanner-based splits to stress-test any new survival model before deployment across sites.
- Better survival-specific OOD methods might reduce the number of cases where models silently produce unreliable risk scores on unseen equipment.
- Extending the benchmark to other imaging modalities or to time-varying shifts could reveal whether the same detector weaknesses appear in broader clinical settings.
Load-bearing premise
That scanner-parameter splits produce representative and controlled covariate shifts that match the distribution changes seen in real hospitals.
What would settle it
An experiment that applies the same OOD detectors to real multi-scanner patient cohorts and finds detection rates or performance drops that differ sharply from those measured on CURE-OOD splits.
Original abstract
"How long can I live and remain free of cancer?" is often the first question a patient asks after receiving a cancer diagnosis and treatment. Accurate survival prediction helps alleviate psychological distress and supports risk stratification and personalized treatment planning. Recent survival prediction frameworks have shown strong performance using computed tomography (CT) images. However, variations in imaging acquisition introduce out-of-distribution (OOD) samples caused by covariate shifts that undermine model reliability. Despite this challenge, to our knowledge, no existing benchmark systematically studies OOD detection in cancer survival prediction. To address this gap, we introduce the Cancer sURvival bEnchmark for OOD Detection (CURE-OOD), the first benchmark for systematically evaluating OOD detection in survival prediction under controlled acquisition-induced distribution shifts. CURE-OOD defines scanner-parameter-based training, in-distribution (ID), and OOD test splits across four survival prediction tasks. Our experiments show that covariate shifts notably reduce survival prediction performance. It also shows that mainstream classification-oriented OOD detectors can fail in survival prediction. Finally, we include HazardDev as a simple survival-aware reference baseline for OOD detection. CURE-OOD enables systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CURE-OOD as the first benchmark for systematically evaluating out-of-distribution (OOD) detection in cancer survival prediction from CT images. It defines scanner-parameter-based training/ID/OOD splits across four survival tasks, reports that these covariate shifts degrade downstream survival prediction performance, shows that standard classification-oriented OOD detectors fail in this setting, and proposes HazardDev as a simple survival-aware baseline.
Significance. If the induced shifts prove representative of real multi-center clinical variations, the benchmark could meaningfully advance reliable deployment of survival models by exposing limitations of existing OOD methods and providing a survival-specific reference. The introduction of HazardDev is a constructive, task-aware contribution that strengthens the empirical comparison.
major comments (1)
- [Benchmark Definition] Benchmark Definition section: The claim that scanner-parameter splits produce controlled, clinically relevant covariate shifts for survival prediction is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative validation (e.g., statistical tests on differences in hazard functions, patient covariates, or tumor features between ID and OOD sets). Without this, observed performance drops and detector failures may reflect low-level image statistics rather than shifts that affect hazard estimation in deployment.
minor comments (2)
- [Abstract] Abstract: While experiments are summarized, no quantitative results, dataset sizes, or specific performance metrics are reported; adding these would strengthen the abstract.
- [Experiments] Experiments section: Reproducibility would benefit from explicit details on the four survival tasks (endpoints, censoring rates, and exact scanner-parameter criteria for splits).
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful review. We address the single major comment point-by-point below. Where the feedback identifies a gap in the current manuscript, we have revised the text accordingly.
Point-by-point responses
-
Referee: Benchmark Definition section: The claim that scanner-parameter splits produce controlled, clinically relevant covariate shifts for survival prediction is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative validation (e.g., statistical tests on differences in hazard functions, patient covariates, or tumor features between ID and OOD sets). Without this, observed performance drops and detector failures may reflect low-level image statistics rather than shifts that affect hazard estimation in deployment.
Authors: We agree that explicit quantitative validation strengthens the benchmark. In the revised manuscript we have added a new subsection under Benchmark Definition that reports two-sample Kolmogorov-Smirnov tests and effect-size statistics comparing patient demographics (age, sex), tumor features (stage, size, location), and non-parametric hazard estimates (via Kaplan-Meier and Nelson-Aalen) between the ID and OOD partitions for each of the four tasks. These tests confirm statistically significant differences in clinically relevant covariates and survival distributions. We also include a brief discussion clarifying that the observed degradation in concordance index and Brier score on the OOD sets provides direct task-level evidence that the shifts affect hazard estimation, beyond low-level image statistics. The scanner-parameter construction remains the primary definition of the splits, but the added analyses address the concern about clinical relevance.
Revision: yes
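The validation the rebuttal describes is straightforward to sketch. The snippet below illustrates a two-sample Kolmogorov-Smirnov statistic plus a Cohen's d effect size on a single covariate; the data are synthetic stand-ins (the covariate name, means, and sample sizes are assumptions, not values from the paper), and in practice `scipy.stats.ks_2samp` would supply the p-value as well:

```python
import numpy as np

def ks_2sample_stat(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(0)
# Illustrative stand-ins for one patient covariate (e.g. age) in the
# ID and OOD partitions; real values would come from the benchmark splits.
age_id = rng.normal(62.0, 8.0, 200)
age_ood = rng.normal(66.0, 8.0, 200)

d_stat = ks_2sample_stat(age_id, age_ood)

# Standardized effect size (Cohen's d) reported alongside the test,
# since with large cohorts even tiny shifts become "significant".
pooled_sd = np.sqrt((age_id.var(ddof=1) + age_ood.var(ddof=1)) / 2)
cohens_d = (age_ood.mean() - age_id.mean()) / pooled_sd
```

Running the same comparison per covariate and per task, as the rebuttal proposes, would separate clinically meaningful shifts from low-level image-statistics differences.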
Circularity Check
No circularity: benchmark definition and empirical splits stand alone
Full rationale
The paper defines CURE-OOD via scanner-parameter-based train/ID/OOD splits across four survival tasks and reports empirical performance drops plus OOD detector failures. No equations, fitted parameters, or derivations appear. The central contribution is the benchmark construction itself plus comparison to a simple baseline (HazardDev); this does not reduce to any self-citation chain, prior fitted quantity, or self-definitional loop. The claim of being the 'first' benchmark is a novelty statement, not a load-bearing derivation. Per guidelines, a self-contained benchmark paper receives score 0 when no reduction to inputs by construction is exhibited.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Scanner parameters induce controllable and clinically relevant covariate shifts in CT imaging for survival tasks.
invented entities (1)
- HazardDev: no independent evidence