Investigating Trustworthiness of Nonparametric Deep Survival Models for Alzheimer's Disease Progression Analysis
Pith reviewed 2026-05-10 17:42 UTC · model grok-4.3
The pith
Nonparametric deep survival models for Alzheimer's disease progression exhibit considerable bias with respect to sex, race, and education.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study shows that deep learning powered survival models are robust tools which can aid clinicians in AD care decisions, but they often exhibit considerable bias, as quantified by the new Time-Dependent Concordance Impurity and Kaplan-Meier Fairness metrics with respect to sensitive attributes such as sex, race, and education.
What carries the argument
Two novel fairness metrics, Time-Dependent Concordance Impurity and Kaplan-Meier Fairness, which quantify bias in nonparametric survival models by measuring inconsistencies in predictions across groups defined by sensitive attributes.
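As a rough illustration of what "inconsistencies in predictions across groups" can mean, the sketch below computes Harrell's concordance index separately for each group and reports the largest pairwise gap. This is not the paper's Time-Dependent Concordance Impurity, whose exact definition is not given here; the function names `concordance_index` and `group_concordance_gap` and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk scores rank correctly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's recorded time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event received the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in risk count as half
    return concordant / comparable if comparable else float("nan")


def group_concordance_gap(times, events, risks, groups):
    """Per-group C-index plus the largest absolute gap between any two groups."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [k for k, value in enumerate(groups) if value == g]
        per_group[g] = concordance_index(
            [times[k] for k in idx],
            [events[k] for k in idx],
            [risks[k] for k in idx],
        )
    gap = max(
        (abs(per_group[a] - per_group[b]) for a, b in combinations(per_group, 2)),
        default=0.0,  # a single group has no pairwise gap
    )
    return per_group, gap


if __name__ == "__main__":
    # Toy data (hypothetical): the scores rank group A perfectly, group B poorly.
    times = [2, 4, 6, 8, 1, 3, 5, 7]
    events = [1, 1, 1, 0, 1, 1, 0, 1]
    risks = [0.9, 0.7, 0.5, 0.1, 0.2, 0.8, 0.4, 0.6]
    groups = ["A"] * 4 + ["B"] * 4
    print(group_concordance_gap(times, events, risks, groups))
```

A large gap only says that ranking quality differs between groups on this sample; whether that reflects learned unfairness or subgroup size and censoring differences is a separate question.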
If this is right
- Deep survival models can still support clinical decisions in Alzheimer's care if bias is addressed.
- Feature importance analysis reveals characteristics most critical for reliable predictions.
- Future models should incorporate fairness considerations to avoid unfair predictions toward marginalized groups.
- The proposed metrics provide a way to evaluate bias in other survival analysis tasks.
Where Pith is reading between the lines
- Similar bias issues might appear in survival models for other progressive diseases like cancer or heart disease.
- Clinicians using these models may need additional checks for demographic fairness before relying on them for individual patients.
- Developing bias-mitigation techniques specifically for time-to-event predictions could improve equity in healthcare AI.
- Independent validation on diverse datasets would strengthen the generalizability of these findings.
Load-bearing premise
The two proposed metrics validly and comprehensively measure bias in the survival models without their own methodological artifacts or sensitivity to model hyperparameters.
What would settle it
Finding that the deep survival models show no significant differences in performance or bias metrics across demographic groups on a held-out test set from a different population would challenge the claim of considerable bias.
Original abstract
Alzheimer's Dementia (AD) is a progressive neurodegenerative disease marked by irreversible decline, making reliable modeling of its progression essential for effective patient care. Progression-aware methods such as survival analysis are therefore crucial tools for the early detection and monitoring of AD. Recent advancements in deep learning have demonstrated remarkable performance in survival tasks, but alarmingly fewer studies have been conducted in the domain of AD. Further, the studies that do exist do not consider learned bias within the model itself, which could result in unfair and unreliable predictions toward certain marginalized groups. As such, we conduct a rigorous study of fairness in AD progression analysis along with a thorough feature importance study to determine the characteristics which are most important for reliable AD predictions. Furthermore, we propose two novel fairness metrics, called Time-Dependent Concordance Impurity and Kaplan-Meier Fairness, to quantify bias with respect to sensitive attributes such as sex, race, and education in nonparametric survival models. Our study demonstrates that while deep learning powered survival models are robust tools which can aid clinicians in AD care decisions, they often exhibit considerable bias, representing important avenues for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical fairness audit of nonparametric deep survival models (such as DeepSurv and DeepHit) for Alzheimer's Disease progression modeling. It proposes two new metrics—Time-Dependent Concordance Impurity and Kaplan-Meier Fairness—to quantify bias with respect to sensitive attributes (sex, race, education), performs a feature importance analysis, and concludes that while these models are robust clinical tools they often exhibit considerable bias, calling for future research on fair survival models.
Significance. If the new metrics are shown to validly isolate model bias and the reported bias levels hold under controlled conditions, the work would provide a useful starting point for equity-focused survival analysis in AD, a high-stakes domain where biased progression predictions could affect care decisions. The combination of fairness metrics with feature importance offers a practical template for auditing deep survival models.
major comments (2)
- §3 (Metric Definitions): The Time-Dependent Concordance Impurity and Kaplan-Meier Fairness metrics are introduced without synthetic experiments that inject controlled bias levels (e.g., by modifying survival curves or censoring rates for subgroups) and demonstrate recovery of those levels. This is load-bearing for the central claim because the headline result that models 'often exhibit considerable bias' is measured exclusively via these metrics; without such validation it remains possible that the scores reflect data artifacts (censoring patterns, subgroup size imbalance) rather than learned unfairness.
- §5 (Experimental Results): The reported metric values on real AD cohorts lack hyperparameter ablation or stability checks for the underlying nonparametric models (DeepSurv, DeepHit). If the bias findings change materially under reasonable hyperparameter variation, the claim that bias is a general property of these models would be weakened.
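The kind of controlled check requested in the first major comment can be sketched with simulated data. In the example below, two groups share the same exponential event-time distribution but are censored at different rates, so a censoring-aware estimate such as Kaplan-Meier should show almost no gap; raising one group's hazard then injects a known disparity that the gap should recover. All function names, rates, and sample sizes are illustrative assumptions, and the gap at a single time point is a stand-in, not the paper's Kaplan-Meier Fairness metric.

```python
import random


def kaplan_meier(times, events):
    """Product-limit estimator: list of (t, S(t)) at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather everyone whose recorded time equals t (events and censorings).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def km_at(curve, t):
    """Step-function lookup of S(t) on a Kaplan-Meier curve."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def simulate_group(n, hazard, censor_rate, rng):
    """Exponential event times with independent exponential censoring."""
    times, events = [], []
    for _ in range(n):
        event_t = rng.expovariate(hazard)
        censor_t = rng.expovariate(censor_rate)
        times.append(min(event_t, censor_t))
        events.append(1 if event_t <= censor_t else 0)
    return times, events


def km_gap(n, hazard_a, hazard_b, censor_a, censor_b, t_eval, seed=0):
    """Absolute difference between the two groups' KM estimates at t_eval."""
    rng = random.Random(seed)
    ta, ea = simulate_group(n, hazard_a, censor_a, rng)
    tb, eb = simulate_group(n, hazard_b, censor_b, rng)
    return abs(km_at(kaplan_meier(ta, ea), t_eval) - km_at(kaplan_meier(tb, eb), t_eval))


if __name__ == "__main__":
    # Same hazard, very different censoring: the gap should stay near zero.
    print("censoring only:", km_gap(4000, 0.1, 0.1, 0.02, 0.2, t_eval=5.0))
    # Doubled hazard in group B: the gap should approach exp(-0.5) - exp(-1.0).
    print("injected bias: ", km_gap(4000, 0.1, 0.2, 0.02, 0.2, t_eval=5.0))
```

A candidate fairness metric could be run through the same two scenarios: a metric that reports a large value in the first scenario is reacting to censoring patterns, not to a true difference between the groups.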
minor comments (2)
- Abstract and §2: The claim of a 'rigorous study' and 'thorough feature importance study' would be strengthened by explicitly listing the exact AD datasets (e.g., ADNI version), preprocessing steps, and all baseline models in the main text rather than describing them only at a high level.
- Notation: Ensure the precise mathematical definitions of the two new metrics (including how time-dependence and impurity are aggregated) are given in a single, self-contained location to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our fairness audit of nonparametric deep survival models for Alzheimer's disease progression. The comments highlight valuable opportunities to strengthen the validation of our proposed metrics and the robustness of our experimental claims. We address each major comment point-by-point below, providing our honest assessment and committing to revisions that directly respond to the concerns raised.
- Referee: §3 (Metric Definitions): The Time-Dependent Concordance Impurity and Kaplan-Meier Fairness metrics are introduced without synthetic experiments that inject controlled bias levels (e.g., by modifying survival curves or censoring rates for subgroups) and demonstrate recovery of those levels. This is load-bearing for the central claim because the headline result that models 'often exhibit considerable bias' is measured exclusively via these metrics; without such validation it remains possible that the scores reflect data artifacts (censoring patterns, subgroup size imbalance) rather than learned unfairness.
  Authors: We appreciate this point, as controlled validation would indeed provide stronger evidence that the metrics isolate learned model bias. Our metrics are direct extensions of the time-dependent concordance index and the Kaplan-Meier estimator, both of which have extensive prior validation in survival analysis, so we grounded their definitions in these established properties and applied them to real AD cohorts where subgroup disparities are documented in the clinical literature. Nevertheless, we agree that synthetic experiments injecting known bias (via modified survival curves or differential censoring) would rule out data artifacts more conclusively. In the revised manuscript, we will add a new subsection with such controlled synthetic experiments demonstrating metric recovery of injected bias levels. This addition will directly support the reliability of our real-data bias quantifications.
  Revision: yes
- Referee: §5 (Experimental Results): The reported metric values on real AD cohorts lack hyperparameter ablation or stability checks for the underlying nonparametric models (DeepSurv, DeepHit). If the bias findings change materially under reasonable hyperparameter variation, the claim that bias is a general property of these models would be weakened.
  Authors: We chose hyperparameters following the original DeepSurv and DeepHit papers and optimized them via cross-validation on the AD datasets to maximize concordance. To address the referee's concern about stability, we will incorporate a hyperparameter ablation study in the revised experimental section. This will systematically vary key parameters (e.g., learning rate, network depth, regularization strength) across reasonable ranges and report the resulting fairness metric values, including any variation in observed bias levels. We anticipate the bias patterns will persist. The added results will show whether the findings are sensitive to specific hyperparameter selections and, if they are not, will reinforce that bias is a general characteristic of these model classes on the AD data.
  Revision: yes
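A stability check of this kind amounts to re-running the same fairness evaluation over a grid of settings and reporting how much the score moves. The harness below is a minimal sketch: `train_and_score` is a hypothetical callable that would train a model with the given settings and return one fairness score, and `toy_train_and_score` is a made-up stand-in so the example runs without any survival library.

```python
from itertools import product
from statistics import mean, pstdev


def ablate(train_and_score, grid):
    """Call train_and_score once for every combination of values in `grid`."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, train_and_score(**config)))
    return results


def summarize(results):
    """Summary statistics of the fairness score across all configurations."""
    scores = [score for _, score in results]
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "std": pstdev(scores),
        "spread": max(scores) - min(scores),
    }


def toy_train_and_score(learning_rate, depth, weight_decay):
    # Hypothetical stand-in: a real version would fit a survival model and
    # return a fairness metric. This formula is invented and ignores
    # learning_rate entirely.
    return 0.10 + 0.01 * depth + 0.5 * weight_decay


if __name__ == "__main__":
    grid = {
        "learning_rate": [1e-3, 1e-2],
        "depth": [1, 2, 3],
        "weight_decay": [0.0, 0.01],
    }
    print(summarize(ablate(toy_train_and_score, grid)))
```

Reporting the spread alongside the mean lets a reader judge whether a bias finding holds across the grid or depends on one configuration.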
Circularity Check
No circularity: empirical audit with independently defined metrics
full rationale
The paper is an empirical fairness study that defines two new metrics (Time-Dependent Concordance Impurity and Kaplan-Meier Fairness) from standard survival-analysis primitives and applies them to existing nonparametric deep survival models on AD data. No derivation step reduces a claimed prediction or result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation chains or imported uniqueness theorems. The central demonstration that models exhibit bias is an experimental outcome measured by the explicitly proposed metrics rather than a tautology. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nonparametric deep survival models can be trained and evaluated on Alzheimer's progression data without strong distributional assumptions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "we propose two novel fairness metrics, called Time-Dependent Concordance Impurity and Kaplan-Meier Fairness, to quantify bias with respect to sensitive attributes such as sex, race, and education in nonparametric survival models"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "CI-td = min{CF_gi − CF_gj | i≠j}"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Selected features
Figure 4 provides an overview of the NACC features used in our analysis. We categorize features into six groups: subject visit information, demographics, genetics, functional/behavior predictors, risk ...
Figure 4. Selected features from the NACC dataset