pith · machine review for the scientific record

arxiv: 2605.00350 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

CURE-OOD: Benchmarking Out-of-Distribution Detection for Survival Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords: out-of-distribution detection · survival prediction · cancer prognosis · CT imaging · benchmark · covariate shift · medical imaging

The pith

CURE-OOD creates the first controlled benchmark to test OOD detection inside cancer survival prediction from CT scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the absence of any systematic testbed for spotting when a survival model sees data from a different scanner than it was trained on. It builds CURE-OOD by splitting four survival-prediction tasks into training, in-distribution, and out-of-distribution sets according to scanner parameters. Experiments on this benchmark show that the same shifts that lower survival-model accuracy also make ordinary classification-style OOD detectors unreliable. A simple survival-specific baseline called HazardDev is supplied for comparison. The work matters because survival estimates guide patient counseling and treatment choices, yet models that silently fail on new scanners can produce misleading risk numbers in routine clinical use.

Core claim

CURE-OOD defines scanner-parameter-based training, in-distribution, and OOD test splits across four survival prediction tasks. Experiments demonstrate that covariate shifts from acquisition variations notably reduce survival prediction performance, and that mainstream classification-oriented OOD detectors can fail in survival prediction. HazardDev is included as a simple survival-aware reference baseline for OOD detection, enabling systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.

What carries the argument

CURE-OOD benchmark, which creates controlled training/ID/OOD splits from scanner parameters to isolate acquisition-induced covariate shifts in survival tasks.
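
To make the construction concrete, here is a minimal sketch of how a scanner-parameter split could be assembled, assuming scan-level acquisition metadata (e.g., from DICOM headers) is available. The column name, OOD range, and 20% ID-test fraction are illustrative assumptions, not the paper's actual criteria.

```python
# Hypothetical sketch of a scanner-parameter split in the spirit of CURE-OOD.
# Column names and thresholds are illustrative, not the paper's actual rules.
import pandas as pd

def split_by_scanner_parameter(scans: pd.DataFrame, param: str, ood_range: tuple):
    """Partition scans into train / ID-test / OOD-test by one acquisition parameter."""
    lo, hi = ood_range
    is_ood = scans[param].between(lo, hi)
    ood_test = scans[is_ood]
    id_pool = scans[~is_ood]
    id_test = id_pool.sample(frac=0.2, random_state=0)  # hold out an ID test set
    train = id_pool.drop(id_test.index)                 # train on the remaining ID scans
    return train, id_test, ood_test

# Example: treat slice thickness of 3 mm or more as the OOD domain.
# train, id_test, ood_test = split_by_scanner_parameter(
#     scans, "slice_thickness_mm", (3.0, float("inf")))
```

Isolating one parameter at a time is what makes the shift controlled: each OOD set differs from training along a single, known acquisition axis.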

If this is right

  • Covariate shifts from scanner variations reduce survival prediction performance.
  • Mainstream classification-oriented OOD detectors fail to transfer reliably to survival prediction.
  • HazardDev supplies a survival-aware baseline that can be used for comparison in future OOD studies (a hedged sketch of one possible reading appears after this list).
  • The benchmark supports systematic measurement of how acquisition shifts simultaneously affect prediction accuracy and detectability.
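
The material above does not specify how HazardDev computes its score; the name and the hazard-curve separation in Figures 5 and 10 suggest a statistic built from the deviation of a sample's predicted hazard curve from the training-set mean. The sketch below is that assumed reading, not the authors' actual method; every name in it is hypothetical.

```python
# Hypothetical hazard-deviation OOD score in the spirit of HazardDev.
# Assumption: hazard magnitude separates ID from OOD (per Figures 5 and 10),
# so we score a sample by how far its hazard curve sits from the training mean.
import numpy as np

def hazard_deviation_score(train_hazards: np.ndarray, sample_hazard: np.ndarray) -> float:
    """train_hazards: (n_train, n_time_bins) hazards predicted on training data.
    sample_hazard: (n_time_bins,) hazards predicted for one test sample.
    Larger return values suggest the sample is more OOD-like."""
    mu = train_hazards.mean(axis=0)            # mean hazard per time bin
    sigma = train_hazards.std(axis=0) + 1e-8   # per-bin spread (avoid divide-by-zero)
    z = (sample_hazard - mu) / sigma           # standardized deviation per bin
    return float(np.mean(np.abs(z)))           # average absolute deviation
```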

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could use similar scanner-based splits to stress-test any new survival model before deployment across sites.
  • Better survival-specific OOD methods might reduce the number of cases where models silently produce unreliable risk scores on unseen equipment.
  • Extending the benchmark to other imaging modalities or to time-varying shifts could reveal whether the same detector weaknesses appear in broader clinical settings.

Load-bearing premise

That scanner-parameter splits produce representative and controlled covariate shifts that match the distribution changes seen in real hospitals.

What would settle it

An experiment that applies the same OOD detectors to real multi-scanner patient cohorts and finds detection rates or performance drops that differ sharply from those measured on CURE-OOD splits.
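
A minimal sketch of the comparison such an experiment implies: score both cohorts with the same detector and compare detection AUROC on the benchmark splits against the real multi-scanner cohort. Variable names are illustrative; `roc_auc_score` is the standard scikit-learn call.

```python
# Sketch of the settling experiment: the same OOD scorer evaluated on
# CURE-OOD splits and on a real multi-scanner cohort. Names are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_auroc(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """AUROC for separating OOD (label 1) from ID (label 0) by detector score."""
    labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    scores = np.concatenate([id_scores, ood_scores])
    return roc_auc_score(labels, scores)

# benchmark_auroc = detection_auroc(bench_id_scores, bench_ood_scores)
# clinical_auroc  = detection_auroc(clinic_id_scores, clinic_ood_scores)
# A sharp gap between the two numbers would undercut the benchmark's claim
# to represent real-hospital distribution shifts.
```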

Figures

Figures reproduced from arXiv: 2605.00350 by Jia Li, Jing Wang, Mingrui Liu, Wenjie Zhao, Yunhui Guo.

Figure 1
Figure 1. CT data distribution shifts caused by cross-institution variability and scanner changes over time. [Panel labels: training set, ID test set, OOD test set; deployment outcomes good/bad.]
Figure 3
Figure 3. Examples showing how acquisition parameters affect CT image appearance. Larger pixel spacing covers a wider field of view with decreased resolution, longer exposure time and higher tube current reduce noise, and thicker slice thickness produces smoother but blurrier images.
Figure 5
Figure 5. Mean hazard curves for ID and OOD test sets on OS under two acquisition shifts. Shaded regions indicate ±1 std. OOD shows higher mean hazard than ID, suggesting hazard magnitude is an OOD signal.
Figure 6
Figure 6. Mean logits on the OS task under the exposure time shift. Unlike standard classification logits, ID samples produce systematically smaller, more negative logits across time intervals, while OOD samples drift toward zero. This reversed ordering breaks the confidence interpretation assumed by classification-style OOD scores.
Figure 8
Figure 8. Visualization of MTLR logits distributions on the OS task under two acquisition shifts: (a) pixel spacing and (b) exposure time. For both shifts, ID samples tend to produce logits concentrated at lower values, whereas OOD samples show a clear tendency toward larger logits. This shift in the logits distribution indicates that acquisition differences systematically affect the model's output confidence.
Figure 9
Figure 9. Distributions of key acquisition parameters used to construct ID and OOD domains in the CURE-OOD benchmark. Each parameter exhibits distinct value ranges.
Figure 10
Figure 10. Mean hazard curves of ID and OOD test sets on the OS task across four acquisition shifts: (a) exposure time, (b) pixel spacing, (c) slice thickness, and (d) X-ray tube current. The shaded areas denote one standard deviation, showing the variability of predicted hazard values across samples. Across all shifts, ID samples consistently exhibit higher mean hazard values than OOD samples.
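
The captions of Figures 6 and 8 describe the failure mode in enough detail to illustrate it: classification-style scores such as Energy read larger logits as in-distribution confidence, but here ID samples emit the more negative logits. A toy numeric sketch of that reversal follows; the logit values are made up for illustration, not taken from the paper.

```python
# Toy illustration of the reversed ordering described in Figures 6 and 8:
# the standard (negative) energy score rises with logit magnitude, so when
# ID logits are more negative than OOD logits, OOD samples look "more ID".
import numpy as np

def negative_energy(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Per-sample negative free energy; conventionally, higher means more ID."""
    return temperature * np.log(np.sum(np.exp(logits / temperature), axis=-1))

id_logits = np.array([[-4.0, -3.5, -3.8], [-4.2, -3.9, -3.6]])   # more negative (Fig. 6)
ood_logits = np.array([[-0.5, -0.2, -0.4], [-0.3, -0.6, -0.1]])  # drift toward zero

print(negative_energy(id_logits))   # lower scores: ID is flagged as OOD-like
print(negative_energy(ood_logits))  # higher scores: OOD passes as confident ID
```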
Original abstract

"How long can I live and remain free of cancer?" is often the first question a patient asks after receiving a cancer diagnosis and treatment. Accurate survival prediction helps alleviate psychological distress and supports risk stratification and personalized treatment planning. Recent survival prediction frameworks have shown strong performance using computed tomography (CT) images. However, variations in imaging acquisition introduce out-of-distribution (OOD) samples caused by covariate shifts that undermine model reliability. Despite this challenge, to our knowledge, no existing benchmark systematically studies OOD detection in cancer survival prediction. To address this gap, we introduce the Cancer sURvival bEnchmark for OOD Detection (CURE-OOD), the first benchmark for systematically evaluating OOD detection in survival prediction under controlled acquisition-induced distribution shifts. CURE-OOD defines scanner-parameter-based training, in-distribution (ID), and OOD test splits across four survival prediction tasks. Our experiments show that covariate shifts notably reduce survival prediction performance. They also show that mainstream classification-oriented OOD detectors can fail in survival prediction. Finally, we include HazardDev as a simple survival-aware reference baseline for OOD detection. CURE-OOD enables systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CURE-OOD as the first benchmark for systematically evaluating out-of-distribution (OOD) detection in cancer survival prediction from CT images. It defines scanner-parameter-based training/ID/OOD splits across four survival tasks, reports that these covariate shifts degrade downstream survival prediction performance, shows that standard classification-oriented OOD detectors fail in this setting, and proposes HazardDev as a simple survival-aware baseline.

Significance. If the induced shifts prove representative of real multi-center clinical variations, the benchmark could meaningfully advance reliable deployment of survival models by exposing limitations of existing OOD methods and providing a survival-specific reference. The introduction of HazardDev is a constructive, task-aware contribution that strengthens the empirical comparison.

major comments (1)
  1. [Benchmark Definition] Benchmark Definition section: The claim that scanner-parameter splits produce controlled, clinically relevant covariate shifts for survival prediction is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative validation (e.g., statistical tests on differences in hazard functions, patient covariates, or tumor features between ID and OOD sets). Without this, observed performance drops and detector failures may reflect low-level image statistics rather than shifts that affect hazard estimation in deployment.
minor comments (2)
  1. [Abstract] Abstract: While experiments are summarized, no quantitative results, dataset sizes, or specific performance metrics are reported; adding these would strengthen the abstract.
  2. [Experiments] Experiments section: Reproducibility would benefit from explicit details on the four survival tasks (endpoints, censoring rates, and exact scanner-parameter criteria for splits).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful review. We address the single major comment point-by-point below. Where the feedback identifies a gap in the current manuscript, we have revised the text accordingly.

Point-by-point responses
  1. Referee: Benchmark Definition section: The claim that scanner-parameter splits produce controlled, clinically relevant covariate shifts for survival prediction is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative validation (e.g., statistical tests on differences in hazard functions, patient covariates, or tumor features between ID and OOD sets). Without this, observed performance drops and detector failures may reflect low-level image statistics rather than shifts that affect hazard estimation in deployment.

    Authors: We agree that explicit quantitative validation strengthens the benchmark. In the revised manuscript we have added a new subsection under Benchmark Definition that reports two-sample Kolmogorov-Smirnov tests and effect-size statistics comparing patient demographics (age, sex), tumor features (stage, size, location), and non-parametric hazard estimates (via Kaplan-Meier and Nelson-Aalen) between the ID and OOD partitions for each of the four tasks. These tests confirm statistically significant differences in clinically relevant covariates and survival distributions. We also include a brief discussion clarifying that the observed degradation in concordance index and Brier score on the OOD sets provides direct task-level evidence that the shifts affect hazard estimation, beyond low-level image statistics. The scanner-parameter construction remains the primary definition of the splits, but the added analyses address the concern about clinical relevance.

    revision: yes
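
As a sketch of the validation the response describes, a two-sample Kolmogorov-Smirnov test per covariate between the ID and OOD partitions can be run with `scipy.stats.ks_2samp`; the column names below are placeholders for the demographics and tumor features the rebuttal lists.

```python
# Sketch of the rebuttal's covariate validation: two-sample KS tests between
# ID and OOD partitions. Column names are illustrative placeholders.
import pandas as pd
from scipy.stats import ks_2samp

def compare_partitions(id_df: pd.DataFrame, ood_df: pd.DataFrame, covariates: list) -> pd.DataFrame:
    """KS statistic (a crude effect size) and p-value for each covariate."""
    rows = []
    for col in covariates:
        stat, p = ks_2samp(id_df[col].dropna(), ood_df[col].dropna())
        rows.append({"covariate": col, "ks_stat": stat, "p_value": p})
    return pd.DataFrame(rows)

# results = compare_partitions(id_test, ood_test, ["age", "tumor_size_mm"])
```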

Circularity Check

0 steps flagged

No circularity: benchmark definition and empirical splits stand alone

full rationale

The paper defines CURE-OOD via scanner-parameter-based train/ID/OOD splits across four survival tasks and reports empirical performance drops plus OOD detector failures. No equations, fitted parameters, or derivations appear. The central contribution is the benchmark construction itself plus comparison to a simple baseline (HazardDev); this does not reduce to any self-citation chain, prior fitted quantity, or self-definitional loop. The claim of being the 'first' benchmark is a novelty statement, not a load-bearing derivation. Per guidelines, a self-contained benchmark paper receives score 0 when no reduction to inputs by construction is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution is a new benchmark definition rather than a derivation; it rests on domain assumptions about what constitutes a meaningful acquisition shift and on the utility of the chosen survival tasks.

axioms (1)
  • domain assumption Scanner parameters induce controllable and clinically relevant covariate shifts in CT imaging for survival tasks.
    Used to define the ID/OOD splits; no independent validation is provided in the abstract.
invented entities (1)
  • HazardDev no independent evidence
    purpose: Simple survival-aware reference baseline for OOD detection.
    Introduced as a new method tailored to hazard information; no external evidence of its superiority is given.

pith-pipeline@v0.9.0 · 5525 in / 1191 out tokens · 38711 ms · 2026-05-09T20:07:04.398050+00:00 · methodology

