From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

Blanca Gallego; Louisa Jorm; Nicholas I-Hsien Kuo

arxiv: 2606.08945 · v1 · pith:ZVLCF6HQnew · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords survival analysis · large language models · Cox proportional hazards · knowledge distillation · text generation · clinical prediction · latent space

The pith

Survival risk from Cox models can be distilled into large language models via text prompt fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. It proposes a pipeline that converts structured clinical covariates into text prompts and fine-tunes a Qwen-based LLM to generate patient-specific survival risk using Cox model outputs as the training target. The approach yields competitive held-out discrimination and calibration on three clinical datasets even though training occurs only as a text-generation task. A sympathetic reader would care because this suggests LLMs can internalize continuous survival-risk structure and support calibrated predictions without conventional survival losses.

Core claim

By converting clinical covariates to text and fine-tuning an LLM to generate survival risk scores that match those from a fitted Cox model, the model achieves competitive discrimination and calibration on held-out portions of the GBSG2, ACTG320, and WHAS500 datasets. Visualizations of the model's hidden states show smooth risk gradients under t-SNE, indicating that the LLM represents survival risk as a continuous structure in latent space rather than isolated categories. These observations together indicate that large language models can internalize survival-risk structure from Cox targets while still producing calibrated predictions.

What carries the argument

The text-based survival modelling pipeline that uses Cox model predictions as direct training targets for fine-tuning an LLM on text prompts derived from clinical covariates.
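A minimal sketch of what the covariate-to-text step could look like, assuming a simple "name is value" template. The abstract does not give the paper's actual template, rounding rule, or target format, so the function names, the prompt wording, the two-decimal rounding, and the four-decimal target below are all hypothetical.

```python
from typing import Mapping

# Hypothetical template: the paper's exact wording and rounding rules are not
# given in the abstract, so every string and format below is an assumption.


def covariates_to_prompt(covariates: Mapping[str, object]) -> str:
    """Serialise one patient's structured covariates into a text prompt."""
    parts = []
    for name, value in covariates.items():
        if isinstance(value, float):
            value = f"{value:.2f}"  # assumed rounding rule (2 decimals)
        parts.append(f"{name} is {value}")
    return "Patient record: " + "; ".join(parts) + ". Predicted 1-year risk:"


def risk_to_target(cox_risk: float) -> str:
    """Format the teacher (Cox) risk as the text the LLM is trained to emit."""
    return f"{cox_risk:.4f}"


def build_example(covariates: Mapping[str, object], cox_risk: float) -> dict:
    """Pair a prompt with its Cox-derived completion for supervised fine-tuning."""
    return {
        "prompt": covariates_to_prompt(covariates),
        "completion": risk_to_target(cox_risk),
    }


# Toy patient with made-up values, for illustration only.
example = build_example(
    {"age": 61, "tumour_size_mm": 25.0, "hormone_therapy": "yes"}, 0.18342
)
print(example["prompt"])
print(example["completion"])
```

The point of the sketch is that the Cox output enters training only as a string to be generated, so any precision lost in formatting the covariates or the target is invisible to the loss.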

If this is right

LLMs can reach competitive discrimination and calibration in survival tasks when trained only as text generation.
Survival risk is represented internally as smooth continuous gradients in the model's latent space.
This approach supplies a direct route to time-to-event reasoning inside generative language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation process could be tested on other parametric or machine-learning survival models beyond Cox.
LLMs trained this way might combine survival risk with unstructured clinical notes or free-text patient descriptions.
Natural-language interfaces could allow clinicians to query individualized survival predictions conversationally.

Load-bearing premise

Converting structured clinical covariates into text prompts preserves enough information for the LLM to accurately recover the survival risk structure encoded in the Cox model targets.

What would settle it

If the fine-tuned LLM produces substantially lower concordance or markedly poorer calibration than the original Cox model on held-out data from GBSG2, ACTG320, or WHAS500, or if t-SNE plots of its hidden states fail to display smooth risk gradients.
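The concordance half of this test can be illustrated with a small, dependency-free sketch of Harrell's C-index. It is a simplified version (tied event times are ignored, and ties in predicted risk count as half-concordant), and the toy times, event flags, and risk scores are invented for illustration; they are not results from the paper.

```python
def harrell_c_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    A pair (i, j) is comparable when subject i has the strictly shorter time
    and an observed event. The pair is concordant when subject i also has the
    higher predicted risk. Ties in predicted risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up held-out data: follow-up times, event indicators (1 = event observed),
# and risk scores from a hypothetical Cox teacher and a hypothetical LLM student.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
cox_risk = [0.9, 0.6, 0.5, 0.2]
llm_risk = [0.8, 0.4, 0.6, 0.1]

print(harrell_c_index(times, events, cox_risk))  # 1.0
print(harrell_c_index(times, events, llm_risk))  # 0.8
```

Both scores are computed against the observed outcomes, so a gap between them measures how much discrimination the student loses relative to the teacher.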

Figures

Figures reproduced from arXiv: 2606.08945 by Blanca Gallego, Louisa Jorm, Nicholas I-Hsien Kuo.

**Figure 2.** t-SNE projection of Qwen hidden states on the WHAS500 test set

Original abstract

We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows you can fine-tune an LLM on text prompts from clinical covariates to match Cox survival predictions with held-out discrimination and calibration, plus t-SNE evidence of continuous risk structure in latent space.

The main point is that this work converts structured clinical data into text prompts, fine-tunes a Qwen LLM to output survival risk scores that track a fitted Cox model, and reports competitive held-out results on GBSG2, ACTG320, and WHAS500 while the hidden states show smooth risk gradients under t-SNE.

What is actually new is the concrete pipeline that treats Cox outputs as training targets for a standard text-generation loss and then checks the geometry of the resulting representations. Prior distillation work exists, but the specific move to survival risk via text conversion plus the latent-space check on these datasets is not already in the cited literature.

The paper does a couple of things cleanly. It avoids any custom survival loss and still gets discrimination and calibration that hold up on held-out data. The t-SNE plots provide at least qualitative support that the model has learned risk as a continuous quantity rather than isolated categories.

The soft spots are mostly around missing detail. The abstract does not spell out the exact text-conversion template or any fidelity checks, so it remains possible that rounding or summarization loses the numerical precision the Cox linear predictor needs. Without the actual C-index numbers, baselines, or error bars, it is hard to judge how competitive the results really are. The t-SNE analysis is visual only.

This is for readers working on medical AI who want to add time-to-event reasoning to language models without building new loss functions. A reader already familiar with both Cox models and LLM fine-tuning will get the most out of it.

It deserves a serious referee because the claim is testable on public datasets and the approach is straightforward enough to evaluate. I would send it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper proposes distilling Cox proportional-hazards risk scores into a generative LLM (Qwen-based) by converting structured clinical covariates into text prompts and fine-tuning the model to output patient-specific survival risk as a text-generation task. It reports competitive held-out discrimination and calibration on GBSG2, ACTG320, and WHAS500, and presents t-SNE visualizations of hidden states showing smooth risk gradients, arguing that LLMs can internalize continuous survival-risk structure without a conventional survival loss.

Significance. If the quantitative results hold, the work is significant for demonstrating that text-generation objectives can approximate survival-analysis metrics and that LLM latent spaces can encode proportional-hazards structure. Because the Cox targets are fitted externally and evaluation is carried out on held-out data, the setup avoids circularity and provides a falsifiable test of distillation. This opens a route to natural-language interfaces for time-to-event prediction.

major comments (2)

[Abstract] The central claim of 'competitive held-out discrimination and calibration' is stated without any reported C-index, Brier score, calibration slope, baselines, or error bars. This absence is load-bearing because the claim that text-generation training successfully distills Cox risk cannot be evaluated without these metrics.
[Abstract / Pipeline description] The conversion of structured covariates to text prompts is described only as 'converted into text prompts', with no template, rounding/binning rules, or fidelity verification. This is load-bearing for the weakest assumption, namely that numerical precision is preserved sufficiently for the LLM to recover the Cox linear predictor on held-out data.
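One way to make the fidelity concern concrete is to measure how far the Cox linear predictor moves when covariates are rounded before being written into the prompt. The sketch below is an illustration under assumed values: the coefficients, covariate rows, and rounding levels are hypothetical and are not taken from the paper.

```python
import math


def linear_predictor(beta, x):
    """Cox linear predictor eta = sum_k beta_k * x_k."""
    return sum(b * v for b, v in zip(beta, x))


def rounding_gap(beta, rows, decimals):
    """Largest |eta(x) - eta(round(x))| over all rows for a given rounding rule."""
    gap = 0.0
    for x in rows:
        x_rounded = [round(v, decimals) for v in x]
        diff = linear_predictor(beta, x) - linear_predictor(beta, x_rounded)
        gap = max(gap, abs(diff))
    return gap


# Hypothetical coefficients and covariate rows, for illustration only.
beta = [0.03, -0.8, 1.5]
rows = [[61.37, 0.456, 1.234], [47.92, 0.781, 0.067]]

for decimals in (0, 1, 2):
    gap = rounding_gap(beta, rows, decimals)
    # exp(gap) is the worst-case multiplicative distortion of the relative hazard.
    print(decimals, round(gap, 4), round(math.exp(gap), 4))
```

A check of this kind, run on the real coefficients and template, would show whether the text encoding alone already limits how closely the LLM can track the Cox target.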

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract and methods could be strengthened for clarity and evaluability. We address each point below and will revise accordingly.

Referee: [Abstract] The central claim of 'competitive held-out discrimination and calibration' is stated without any reported C-index, Brier score, calibration slope, baselines, or error bars. This absence is load-bearing because the claim that text-generation training successfully distills Cox risk cannot be evaluated without these metrics.

Authors: We agree that the abstract would be stronger and more self-contained if it included the quantitative results. The manuscript reports C-index, Brier scores, calibration slopes, and baseline comparisons in the results section with held-out evaluations on the three datasets. In revision we will add the key metrics, baselines, and any error bars or intervals directly to the abstract to support the claim.
revision: yes

Referee: [Abstract / Pipeline description] The conversion of structured covariates to text prompts is described only as 'converted into text prompts', with no template, rounding/binning rules, or fidelity verification. This is load-bearing for the weakest assumption, namely that numerical precision is preserved sufficiently for the LLM to recover the Cox linear predictor on held-out data.

Authors: We accept that the current description of the covariate-to-text conversion is insufficiently detailed. The methods section will be expanded in revision to include the exact prompt template, rules for handling numerical values (rounding, binning, or direct inclusion), and any verification performed to confirm that the text encoding retains the information needed for the model to approximate the Cox linear predictor.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; Cox targets are external and evaluation is held-out

The paper fits a separate Cox model on training data to generate targets, then fine-tunes the LLM on text prompts to match those targets via text-generation loss. Held-out discrimination and calibration are measured against true event times, not the Cox predictions themselves. No equations reduce the reported performance to a fitted quantity by construction, no self-citation chain supports a uniqueness claim, and the text-conversion step is a preprocessing choice rather than a definitional loop. The derivation chain remains self-contained against external benchmarks.
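The calibration side of that held-out check can be sketched as follows. The paper's appendix describes regressing the mean predicted risk per bin on the observed Kaplan–Meier risk without an intercept and reporting the distance of the slope from 1. The closed-form least-squares slope below is a minimal stand-in for that regression; the function name and the per-bin values are hypothetical.

```python
def calibration_slope(observed, predicted):
    """Least-squares slope beta in predicted_b = beta * observed_b (no intercept)."""
    num = sum(o * r for o, r in zip(observed, predicted))
    den = sum(o * o for o in observed)
    if den == 0:
        raise ValueError("observed risks are all zero")
    return num / den


# Hypothetical per-bin values: observed 1-year risk and mean predicted risk.
observed = [0.05, 0.10, 0.20, 0.40]
predicted = [0.06, 0.11, 0.19, 0.38]

slope = calibration_slope(observed, predicted)
calib_error = abs(slope - 1.0)  # 0 means predictions match observed frequencies
print(round(slope, 4), round(calib_error, 4))  # 0.96 0.04
```

Because the observed risks come from outcomes rather than from the Cox teacher, a slope near 1 is evidence about the student's calibration and not only about its agreement with the teacher.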

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard LLM fine-tuning and the domain assumption that Cox outputs serve as valid targets.

free parameters (1)

LLM fine-tuning hyperparameters
Specific learning rate, epochs, and prompt formatting choices are required for the reported performance but not enumerated.

axioms (1)

domain assumption: the Cox proportional hazards model supplies reliable risk targets for the LLM to learn from
Used as the sole training signal without independent validation of its accuracy on the target populations.

Reference graph

Works this paper leans on

52 extracted references · 4 canonical work pages · 3 internal anchors

[1] John D Kalbfleisch and Ross L Prentice. The statistical analysis of failure time data. John Wiley & Sons, 2002.

[2] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972.

[3] Changhee Lee, William Zame, Jinsung Yoon, and Mihaela Van Der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[5] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ... Qwen2 Technical Report. arXiv, 2024.

[6] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[8] M Schumacher, G Bastert, H Bojar, K Hübner, M Olschewski, W Sauerbrei, C Schmoor, C Beyerle, RL Neumann, and HF Rauschecker. Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology, 12(10):2086–2093, 1994.

[9] Scott M Hammer, Kathleen E Squires, Michael D Hughes, Janet M Grimes, Lisa M Demeter, Judith S Currier, Joseph J Eron Jr, Judith E Feinberg, Henry H Balfour Jr, Lawrence R Deyton, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or le... 1997.

[10] Robert J Goldberg, Joel M Gore, Joseph S Alpert, and James E Dalen. Incidence and case fatality rates of acute myocardial infarction (1975–1984): the worcester heart attack study. American Heart Journal, 115(4):761–767, 1988.

[11] Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests. JAMA, 247(18):2543–2546, 1982.

[12] Ben Van Calster, David J McLernon, Maarten Van Smeden, Laure Wynants, and Ewout W Steyerberg. Calibration: the achilles heel of predictive analytics. BMC medicine, 17(1):230, 2019.

[13] Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC medical research methodology, 18(1):24, 2018.

[14] Shishir Rao, Nouman Ahmed, Gholamreza Salimi-Khorshidi, Christopher Yau, Huimin Su, Nathalie Conrad, Folkert W Asselbergs, Mark Woodward, Rod Jackson, John GF Cleland, et al. A transformer-based survival model for prediction of all-cause mortality in patients with heart failure: a multi-cohort study. npj Digital Medicine, 2026.

[15] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International conference on artificial intelligence and statistics, pages 5549–5581. PMLR, 2023.

[16] Sebastian Pölsterl. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research, 21(212):1–6, 2020.

[17] Cameron Davidson-Pilon. lifelines: Survival Analysis in Python. Journal of Open Source Software, 2019. URL https://github.com/CamDavidsonPilon/lifelines/tree/master

[18] Guido Van Rossum. Python Tutorial. Technical report, Centrum voor Wiskunde en Informatica (CWI), 1995.

[19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[20] Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm, et al. Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare. arXiv preprint arXiv:2410.16872, 2024.

[21] Shannon B McKearnan, Julian Wolfson, David M Vock, Gabriela Vazquez-Benitez, and Patrick J O’Connor. Performance of the net reclassification improvement for nonnested models and a novel percentile-based alternative. American journal of epidemiology, 187(6):1327–1335, 2018.

[22] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.

[23] Tristan Bouckley, Rimma Myton-Katieva, David Peiris, Devaki Nambiar, Samuel Prince, Simon Bishop, Damien Cordery, Flynn Robert Hill, Patricia Correll, Anne-Marie Feyer, et al. An assessment of data quality and sociodemographic variation in health service utilisation of general practice, emergency department and admitted services in a new south wales linke... 2025.

[24] Nicholas I-Hsien Kuo, Sebastiano Barbieri, Clare Arnott, Blanca Gallego, Ziba Gandomkar, Shahana Ferdousi, Kirsty Douglas, Mark Woodward, and Louisa Jorm. Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices. Heart, 2025.

[25] Romana Pylypchuk, Sue Wells, Andrew Kerr, Katrina Poppe, Tania Riddell, Matire Harwood, Dan Exeter, Suneela Mehta, Corina Grey, Billy P Wu, et al. Cardiovascular disease risk prediction equations in 400 000 primary care patients in new zealand: a derivation and validation study. The Lancet, 391(10133):1897–1907, 2018.

[26] Julia Hippisley-Cox, Carol Coupland, and Peter Brindle. Development and validation of qrisk3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. bmj, 357, 2017.

[27] Sadiya S Khan, Kunihiro Matsushita, Yingying Sang, Shoshana H Ballew, Morgan E Grams, Aditya Surapaneni, Michael J Blaha, April P Carson, Alexander R Chang, Elizabeth Ciemins, et al. Development and validation of the american heart association’s prevent equations. Circulation, 149(6):430–449, 2024.

[28] Rolf HH Groenwold. Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and prognostic research, 4(1):8, 2020.

[29] Fanny Tranchellini, Youssef Farag, Catherine Jutzeler, and Lakmal Meegahapola. Evaluating deep learning sepsis prediction models in icus under distribution shift: a multi-centre retrospective cohort study. npj Digital Medicine, 2026.

[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.

[31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

[32] Michael Han Daniel Han and Michael Han. Unsloth team. 2023.

[33] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

[34] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022. URL https://github.com/huggingface/peft

[35] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.

[36] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023.

[37] Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. In Proceedings of the 2021 conference on empirical methods in natural language processing: system demonstrations, pag... 2021.

[38] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.

[39] Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. nature, 585(7825):357–362, 2020.

[40] Wes McKinney et al. Data structures for statistical computing in python. scipy, 445(1):51–56, 2010.

[41] John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90–95, 2007.

[42] David W Hosmer Jr, Stanley Lemeshow, and Susanne May. Applied survival analysis: regression modeling of time-to-event data, volume 618. John Wiley & Sons, 2008.

[43] W Sauerbrei, P Royston, H Bojar, C Schmoor, and M Schumacher. Modelling the effects of standard prognostic factors in node-positive breast cancer. British Journal of Cancer, 79(11):1752–1760, 1999.

[44] Table 6.7 on page 198. David A Karnofsky, Walter H Abelmann, Lloyd F Craver, and Joseph H Burchenal. The use of the nitrogen mustards in the palliative treatment of carcinoma: with particular reference to bronchogenic carcinoma. Cancer, 1948.

Appendix: Additional Details to the Main Text. Purpose of this Appendix. This appendix formalises the Cox-to-Qwen survival distillation fr...
[45] Harrell’s C-index for survival discrimination [11],

[46] calibration error (D21) based on calibration slope estimation [20, 12],

[47] percentile-based NRI [21]. The evaluation pipeline first merges the generated predictions with the held-out survival outcomes, removes invalid predictions, and clips all predicted risks to the interval [0,1] before downstream analysis. Calibration analysis is then performed using quantile-based risk bins and Kaplan–Meier survival estimation at the 1-year ...

[48] the mean predicted risk is computed,

[49] the observed 1-year event probability is estimated using Kaplan–Meier survival estimation. Let $\hat{r}_b$ denote the mean predicted risk in bin $b$, and $o_b$ denote the observed Kaplan–Meier risk in the same bin. Calibration is assessed using a regression without intercept: $\hat{r}_b = \beta o_b$. (note that in this formulation, we are essentially putting the predicted risk on ...

[50] values.reshape(-1, 1),
    calib_df["predicted_risk"].values
    )

    slope = float(reg.coef_[0])
    calib_error = abs(slope - 1.0)

This calibration framework therefore measures whether the generated survival-risk predictions remain numerically consistent with empirically observed event frequencies. C.5.3 Details 5...

[51] to("cuda")
    9
    10  with torch.no_grad():
    11      outputs = model(
    12          **inputs,
    13          output_hidden_states=True,
    14          return_dict=True,
    15      )
    16
    17  hidden = outputs.hidden_states[-1]
    18
    19  attn = inputs["attention_mask"]
    20  last_idx = attn.sum(dim=1) - 1
    21
    22  vec = hidden[0, last_idx...

[52] numpy()
    27
    28  return vec

Lines 11–15: The key operation enabling hidden-state extraction is:

    outputs = model(
        **inputs,
        output_hidden_states=True,
        return_dict=True,
    )

which instructs the transformer to return the full sequence of hidden-state activations from all transformer layers. The final-layer activations ar...
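The hidden-state listing above selects the final-layer activation at the last attended token, using the attention-mask sum minus one as the index. A dependency-free sketch of that indexing step, with plain lists standing in for tensors, is given below; the function name and toy values are hypothetical, and the sketch assumes right padding, which the excerpt does not confirm.

```python
def last_token_vector(hidden, attention_mask):
    """Return the hidden vector at the last non-padding position.

    hidden: per-token vectors for one sequence (seq_len x dim), as nested lists.
    attention_mask: 0/1 flags, 1 for real tokens.

    Assumes right padding: with left padding, sum(mask) - 1 would point at the
    wrong position.
    """
    last_idx = sum(attention_mask) - 1
    if last_idx < 0:
        raise ValueError("sequence has no real tokens")
    return hidden[last_idx]


# Toy sequence of four positions, the last of which is padding.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]

vec = last_token_vector(hidden, mask)
print(vec)  # [0.5, 0.6]
```

One such vector per patient is what a projection method like t-SNE would then take as input.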

1973

[1] [1]

John Wiley & Sons, 2002

John D Kalbfleisch and Ross L Prentice.The statistical analysis of failure time data. John Wiley & Sons, 2002

2002

[2] [2]

Regression models and life-tables.Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972

David R Cox. Regression models and life-tables.Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972

1972

[3] [3]

Deephit: A deep learning approach to survival analysis with competing risks

Changhee Lee, William Zame, Jinsung Yoon, and Mihaela Van Der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018

[4] [4]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[5] [5]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https:// qwenlm.github.io/blog/qwen2.5/

2024

[7] [7]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[8] [8]

Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients

M Schumacher, G Bastert, H Bojar, K Hübner, M Olschewski, W Sauerbrei, C Schmoor, C Beyerle, RL Neumann, and HF Rauschecker. Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology, 12(10):2086–2093, 1994

1994

[9] [9]

Scott M Hammer, Kathleen E Squires, Michael D Hughes, Janet M Grimes, Lisa M Demeter, Judith S Currier, Joseph J Eron Jr, Judith E Feinberg, Henry H Balfour Jr, Lawrence R Deyton, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or le...

1997

[10] [10]

Incidence and case fatality rates of acute myocardial infarction (1975–1984): the worcester heart attack study

Robert J Goldberg, Joel M Gore, Joseph S Alpert, and James E Dalen. Incidence and case fatality rates of acute myocardial infarction (1975–1984): the worcester heart attack study. American Heart Journal, 115(4):761–767, 1988

1988

[11] [11]

Evaluating the yield of medical tests

Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests. JAMA, 247(18):2543–2546, 1982

1982

[12] [12]

Calibration: the achilles heel of predictive analytics

Ben Van Calster, David J McLernon, Maarten Van Smeden, Laure Wynants, and Ewout W Steyerberg. Calibration: the achilles heel of predictive analytics. BMC Medicine, 17(1):230, 2019

2019

[13] [13]

Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network

Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1):24, 2018

2018

[14] [14]

A transformer-based survival model for prediction of all-cause mortality in patients with heart failure: a multi-cohort study

Shishir Rao, Nouman Ahmed, Gholamreza Salimi-Khorshidi, Christopher Yau, Huimin Su, Nathalie Conrad, Folkert W Asselbergs, Mark Woodward, Rod Jackson, John GF Cleland, et al. A transformer-based survival model for prediction of all-cause mortality in patients with heart failure: a multi-cohort study. npj Digital Medicine, 2026

2026

[15] [15]

Tabllm: Few-shot classification of tabular data with large language models

Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International conference on artificial intelligence and statistics, pages 5549–5581. PMLR, 2023

2023

[16] [16]

scikit-survival: A library for time-to-event analysis built on top of scikit-learn

Sebastian Pölsterl. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research, 21(212):1–6, 2020

2020

[17] [17]

lifelines: Survival Analysis in Python

Cameron Davidson-Pilon. lifelines: Survival Analysis in Python. Journal of Open Source Software, 2019. URL https://github.com/CamDavidsonPilon/lifelines/tree/master

2019

[18] [18]

Python Tutorial

Guido Van Rossum. Python Tutorial. Technical report, Centrum voor Wiskunde en Informatica (CWI), 1995

1995

[19] [19]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

2022

[20] [20]

Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm, et al. Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare. arXiv preprint arXiv:2410.16872, 2024

2024

[21] [21]

Performance of the net reclassification improvement for nonnested models and a novel percentile-based alternative

Shannon B McKearnan, Julian Wolfson, David M Vock, Gabriela Vazquez-Benitez, and Patrick J O’Connor. Performance of the net reclassification improvement for nonnested models and a novel percentile-based alternative. American Journal of Epidemiology, 187(6):1327–1335, 2018

2018

[22] [22]

Visualizing data using t-sne

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2008

2008

[23] [23]

Tristan Bouckley, Rimma Myton-Katieva, David Peiris, Devaki Nambiar, Samuel Prince, Simon Bishop, Damien Cordery, Flynn Robert Hill, Patricia Correll, Anne-Marie Feyer, et al. An assessment of data quality and sociodemographic variation in health service utilisation of general practice, emergency department and admitted services in a new south wales linke...

2025

[24] [24]

Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices

Nicholas I-Hsien Kuo, Sebastiano Barbieri, Clare Arnott, Blanca Gallego, Ziba Gandomkar, Shahana Ferdousi, Kirsty Douglas, Mark Woodward, and Louisa Jorm. Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices. Heart, 2025

2025

[25] [25]

Cardiovascular disease risk prediction equations in 400 000 primary care patients in new zealand: a derivation and validation study

Romana Pylypchuk, Sue Wells, Andrew Kerr, Katrina Poppe, Tania Riddell, Matire Harwood, Dan Exeter, Suneela Mehta, Corina Grey, Billy P Wu, et al. Cardiovascular disease risk prediction equations in 400 000 primary care patients in new zealand: a derivation and validation study. The Lancet, 391(10133):1897–1907, 2018

2018

[26] [26]

Development and validation of qrisk3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study

Julia Hippisley-Cox, Carol Coupland, and Peter Brindle. Development and validation of qrisk3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ, 357, 2017

2017

[27] [27]

Development and validation of the american heart association’s prevent equations

Sadiya S Khan, Kunihiro Matsushita, Yingying Sang, Shoshana H Ballew, Morgan E Grams, Aditya Surapaneni, Michael J Blaha, April P Carson, Alexander R Chang, Elizabeth Ciemins, et al. Development and validation of the american heart association’s prevent equations. Circulation, 149(6):430–449, 2024

2024

[28] [28]

Informative missingness in electronic health record systems: the curse of knowing

Rolf HH Groenwold. Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and Prognostic Research, 4(1):8, 2020

2020

[29] [29]

Evaluating deep learning sepsis prediction models in icus under distribution shift: a multi-centre retrospective cohort study

Fanny Tranchellini, Youssef Farag, Catherine Jutzeler, and Lakmal Meegahapola. Evaluating deep learning sepsis prediction models in icus under distribution shift: a multi-centre retrospective cohort study. npj Digital Medicine, 2026

2026

[30] [30]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

2019

[31] [31]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

2019

[32] [32]

Unsloth

Daniel Han, Michael Han, and Unsloth team. Unsloth, 2023

2023

[33] [33]

TRL: Transformers Reinforcement Learning, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

2020

[34] [34]

Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022. URL https://github.com/huggingface/peft

2022

[35] [35]

Accelerate: Training and inference at scale made simple, efficient and adaptable

Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

2022

[36] [36]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

2023

[37] [37]

Datasets: A community library for natural language processing

Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. In Proceedings of the 2021 conference on empirical methods in natural language processing: system demonstrations, pag...

2021

[38] [38]

Scikit-learn: Machine learning in python

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830, 2011

2011

[39] [39]

Array programming with numpy

Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357–362, 2020

2020

[40] [40]

Data structures for statistical computing in python

Wes McKinney et al. Data structures for statistical computing in python. scipy, 445(1):51–56, 2010

2010

[41] [41]

Matplotlib: A 2d graphics environment

John D Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007

2007

[42] [42]

Applied survival analysis: regression modeling of time-to-event data

David W Hosmer Jr, Stanley Lemeshow, and Susanne May. Applied survival analysis: regression modeling of time-to-event data, volume 618. John Wiley & Sons, 2008

2008

[43] [43]

Modelling the effects of standard prognostic factors in node-positive breast cancer

W Sauerbrei, P Royston, H Bojar, C Schmoor, and M Schumacher. Modelling the effects of standard prognostic factors in node-positive breast cancer. British Journal of Cancer, 79(11):1752–1760, 1999

1999

[44] [44]

Table 6.7 on page 198

David A Karnofsky, Walter H Abelmann, Lloyd F Craver, and Joseph H Burchenal. The use of the nitrogen mustards in the palliative treatment of carcinoma: with particular reference to bronchogenic carcinoma. Cancer, 1948.

1948

Appendix: Additional Details to the Main Text

Purpose of this Appendix. This appendix formalises the Cox-to-Qwen survival distillation fr...

Harrell’s C-index for survival discrimination [11],

calibration error (D21) based on calibration slope estimation [20, 12],

percentile-based NRI [21].

The evaluation pipeline first merges the generated predictions with the held-out survival outcomes, removes invalid predictions, and clips all predicted risks to the interval [0,1] before downstream analysis. Calibration analysis is then performed using quantile-based risk bins and Kaplan–Meier survival estimation at the 1-year ...

the mean predicted risk is computed,

the observed 1-year event probability is estimated using Kaplan–Meier survival estimation.

Let $\hat{r}_b$ denote the mean predicted risk in bin $b$, and $o_b$ denote the observed Kaplan–Meier risk in the same bin. Calibration is assessed using a regression without intercept:

$$\hat{r}_b = \beta\, o_b.$$

(Note that in this formulation, we are essentially putting the predicted risk on ...
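The binned calibration check described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `km_event_prob` and `calibration_slope` are hypothetical, follow-up time is assumed to be measured in days (so the 1-year horizon is 365), and the no-intercept regression is solved in closed form rather than with a regression library.

```python
import numpy as np


def km_event_prob(time, event, horizon):
    """Kaplan-Meier estimate of the event probability by `horizon`, i.e. 1 - S(horizon)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Multiply conditional survival over each distinct event time up to the horizon.
    for t in np.unique(time[event & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        events_at_t = np.sum((time == t) & event)
        surv *= 1.0 - events_at_t / at_risk
    return 1.0 - surv


def calibration_slope(pred, time, event, horizon=365.0, n_bins=5):
    """Return (beta, |beta - 1|) for the no-intercept fit r_hat_b = beta * o_b."""
    # Clip predicted risks to [0, 1], as the evaluation pipeline does.
    pred = np.clip(np.asarray(pred, dtype=float), 0.0, 1.0)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Quantile-based bin edges; each patient is assigned to exactly one bin.
    edges = np.quantile(pred, np.linspace(0.0, 1.0, n_bins + 1))
    bin_id = np.clip(np.searchsorted(edges, pred, side="right") - 1, 0, n_bins - 1)

    r_hat, obs = [], []
    for b in range(n_bins):
        members = bin_id == b
        if members.any():
            r_hat.append(pred[members].mean())  # mean predicted risk in bin b
            obs.append(km_event_prob(time[members], event[members], horizon))
    r_hat, obs = np.array(r_hat), np.array(obs)

    # Least squares through the origin: beta = sum(o_b * r_hat_b) / sum(o_b ** 2).
    beta = float(np.dot(obs, r_hat) / np.dot(obs, obs))
    return beta, abs(beta - 1.0)
```

A slope above 1 means the predicted risks are, on average, larger than the observed Kaplan–Meier risks, since the predicted risk sits on the left-hand side of the regression. The sketch assumes at least one bin has a non-zero observed risk; otherwise the denominator is zero.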

    ... values.reshape(-1, 1),
        calib_df["predicted_risk"].values
    )

    slope = float(reg.coef_[0])
    calib_error = abs(slope - 1.0)

This calibration framework therefore measures whether the generated survival-risk predictions remain numerically consistent with empirically observed event frequencies.

C.5.3 Details 5...

       ... to("cuda")
     9
    10 with torch.no_grad():
    11     outputs = model(
    12         **inputs,
    13         output_hidden_states=True,
    14         return_dict=True,
    15     )
    16
    17 hidden = outputs.hidden_states[-1]
    18
    19 attn = inputs["attention_mask"]
    20 last_idx = attn.sum(dim=1) - 1
    21
    22 vec = hidden[0, last_idx...

       ... numpy()
    27
    28 return vec

Lines 11–15: The key operation enabling hidden-state extraction is:

    outputs = model(
        **inputs,
        output_hidden_states=True,
        return_dict=True,
    )

which instructs the transformer to return the full sequence of hidden-state activations from all transformer layers. The final-layer activations ar...
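The last-token selection used in the listing (`attn.sum(dim=1) - 1`) can be illustrated on a toy array without loading a model. This is a sketch under stated assumptions: NumPy arrays stand in for the PyTorch tensors, the batch is assumed to be right-padded, and the function name `last_token_vectors` is hypothetical.

```python
import numpy as np


def last_token_vectors(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token for each sequence.

    hidden:         array of shape (batch, seq_len, dim), e.g. final-layer activations
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens, 0 for padding
                    (assumes right padding, so real tokens come first)
    """
    hidden = np.asarray(hidden)
    mask = np.asarray(attention_mask)
    # Number of real tokens minus one gives the index of the last real token.
    last_idx = mask.sum(axis=1) - 1
    # Advanced indexing selects one position per batch row.
    return hidden[np.arange(hidden.shape[0]), last_idx]


# Toy batch: 2 sequences, 4 positions, 3-dimensional hidden states.
toy_hidden = np.arange(2 * 4 * 3).reshape(2, 4, 3)
toy_mask = np.array([[1, 1, 1, 0],
                     [1, 1, 0, 0]])
toy_vecs = last_token_vectors(toy_hidden, toy_mask)
```

With left-padded batches the same index arithmetic would select the wrong position, so the padding side of the tokenizer needs to be checked before reusing this pattern.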
