
arxiv: 2606.08945 · v1 · pith:ZVLCF6HQnew · submitted 2026-06-08 · 💻 cs.LG

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

Pith reviewed 2026-06-27 17:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords survival analysis · large language models · Cox proportional hazards · knowledge distillation · text generation · clinical prediction · latent space

The pith

Survival risk from Cox models can be distilled into large language models via text prompt fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. It proposes a pipeline that converts structured clinical covariates into text prompts and fine-tunes a Qwen-based LLM to generate patient-specific survival risk using Cox model outputs as the training target. The approach yields competitive held-out discrimination and calibration on three clinical datasets even though training occurs only as a text-generation task. A sympathetic reader would care because this suggests LLMs can internalize continuous survival-risk structure and support calibrated predictions without conventional survival losses.

Core claim

By converting clinical covariates to text and fine-tuning an LLM to generate survival risk scores that match those from a fitted Cox model, the model achieves competitive discrimination and calibration on held-out portions of the GBSG2, ACTG320, and WHAS500 datasets. Visualizations of the model's hidden states show smooth risk gradients under t-SNE, indicating that the LLM represents survival risk as a continuous structure in latent space rather than isolated categories. These observations together indicate that large language models can internalize survival-risk structure from Cox targets while still producing calibrated predictions.
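The hidden-state analysis depends on extracting one vector per patient prompt. The appendix fragment preserved in the reference list takes the final-layer activation at the last non-padded token (`attn.sum(dim=1) - 1`). The following is a minimal NumPy sketch of that indexing step only, assuming right-padded batches; the function name `last_token_vectors` and the toy arrays are illustrative and do not come from the paper.

```python
import numpy as np


def last_token_vectors(hidden, attention_mask):
    """Pick the final-layer vector at each sequence's last real token.

    hidden:         array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 = token, 0 = padding
    Assumes right padding, so the last real token sits at index sum(mask) - 1.
    """
    last_idx = attention_mask.sum(axis=1) - 1
    batch_idx = np.arange(hidden.shape[0])
    return hidden[batch_idx, last_idx]


# Toy stand-in for model activations: 2 prompts, 4 positions, 3 dimensions.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # first prompt has one padding position
                 [1, 1, 1, 1]])  # second prompt fills the whole window
vecs = last_token_vectors(hidden, mask)
print(vecs)
```

The resulting vectors would then be handed to a projection method such as t-SNE; that step is left out here because its settings are not given on this page.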

What carries the argument

The text-based survival modelling pipeline that uses Cox model predictions as direct training targets for fine-tuning an LLM on text prompts derived from clinical covariates.
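As a rough illustration of this pipeline, the sketch below renders one patient's covariates as a prompt and pairs it with a Cox-derived risk as the target text. The template wording, the covariate names, the three-decimal formatting, and the helper names are all assumptions made for illustration; the page does not state the paper's actual template.

```python
def covariates_to_prompt(row: dict, horizon_years: int = 1) -> str:
    """Render structured covariates as a plain-text prompt (hypothetical template)."""
    fields = "; ".join(f"{name}: {value}" for name, value in row.items())
    return (
        f"Patient record. {fields}. "
        f"Estimate the {horizon_years}-year event risk as a number between 0 and 1."
    )


def make_training_example(row: dict, cox_risk: float) -> dict:
    """Pair the prompt with the Cox model's risk, written out as target text."""
    # Three decimals is an assumed formatting choice, not the paper's.
    return {"prompt": covariates_to_prompt(row), "completion": f"{cox_risk:.3f}"}


# Hypothetical patient and Cox output, for illustration only.
example = make_training_example(
    {"age": 56, "tumor_size_mm": 22, "positive_nodes": 3},
    cox_risk=0.1834,
)
print(example["prompt"])
print(example["completion"])
```

Writing the target as a short numeric string keeps the task a pure text-generation problem, which is the setting the summary describes.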

If this is right

  • LLMs can reach competitive discrimination and calibration in survival tasks when trained only as text generation.
  • Survival risk is represented internally as smooth continuous gradients in the model's latent space.
  • This approach supplies a direct route to time-to-event reasoning inside generative language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation process could be tested on other parametric or machine-learning survival models beyond Cox.
  • LLMs trained this way might combine survival risk with unstructured clinical notes or free-text patient descriptions.
  • Natural-language interfaces could allow clinicians to query individualized survival predictions conversationally.

Load-bearing premise

Converting structured clinical covariates into text prompts preserves enough information for the LLM to accurately recover the survival risk structure encoded in the Cox model targets.
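One way to probe this premise is to compare the Cox linear predictor computed from exact covariates with the one computed from the values as they would appear after rounding in a prompt. The sketch below does this for a single hypothetical patient; the coefficients, covariate names, and rounding rule are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical log hazard ratios; a fitted Cox model would supply real ones.
COEFS = {"age": 0.012, "tumor_size_mm": 0.008, "positive_nodes": 0.055}


def linear_predictor(row, coefs=COEFS):
    """Cox linear predictor: sum of coefficient * covariate."""
    return sum(coefs[name] * row[name] for name in coefs)


def round_for_text(row, decimals=0):
    """Mimic the precision loss of writing covariates into a prompt."""
    return {name: round(value, decimals) for name, value in row.items()}


exact = {"age": 56.4, "tumor_size_mm": 22.6, "positive_nodes": 3.0}
rounded = round_for_text(exact)

lp_exact = linear_predictor(exact)
lp_rounded = linear_predictor(rounded)
# Multiplicative change in relative hazard caused by rounding alone.
hazard_ratio_shift = math.exp(lp_rounded - lp_exact)
print(lp_exact, lp_rounded, hazard_ratio_shift)
```

Running such a check across a whole dataset would indicate how much of any gap between the LLM and the Cox model could be attributed to the text encoding rather than to the fine-tuning.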

What would settle it

If the fine-tuned LLM produces substantially lower concordance or markedly poorer calibration than the original Cox model on held-out data from GBSG2, ACTG320, or WHAS500, or if t-SNE plots of its hidden states fail to display smooth risk gradients.
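The concordance comparison could be run with Harrell's C-index, the discrimination metric named in the appendix fragment. Below is a minimal pure-Python sketch of the standard pairwise definition. It ignores tied event times for simplicity, and the toy data are invented; a real evaluation would more likely use a library such as lifelines or scikit-survival.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has higher risk.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. Ties in predicted risk count as half-concordant.
    Tied times are skipped in this simplified sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy held-out set: follow-up time, event indicator, predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c_index(times, events, risks)
print(c_index)
```

Computing this once with the Cox risks and once with the LLM-generated risks on the same held-out patients gives the side-by-side comparison the test calls for.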

Figures

Figures reproduced from arXiv: 2606.08945 by Blanca Gallego, Louisa Jorm, Nicholas I-Hsien Kuo.

Figure 1: Text-based survival risk prediction via Cox-supervised language model training.
Figure 2: t-SNE projection of Qwen hidden states on the WHAS500 test set.
Original abstract

We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes distilling Cox proportional-hazards risk scores into a generative LLM (Qwen-based) by converting structured clinical covariates into text prompts and fine-tuning the model to output patient-specific survival risk as a text-generation task. It reports competitive held-out discrimination and calibration on GBSG2, ACTG320, and WHAS500, and presents t-SNE visualizations of hidden states showing smooth risk gradients, arguing that LLMs can internalize continuous survival-risk structure without a conventional survival loss.

Significance. If the quantitative results hold, the work is significant for demonstrating that text-generation objectives can approximate survival-analysis metrics and that LLM latent spaces can encode proportional-hazards structure. The explicit use of held-out Cox targets for evaluation avoids circularity and provides a falsifiable test of distillation. This opens a route to natural-language interfaces for time-to-event prediction.

major comments (2)
  1. [Abstract] The central claim of 'competitive held-out discrimination and calibration' is stated without any reported C-index, Brier score, calibration slope, baselines, or error bars. This absence is load-bearing because the claim that text-generation training successfully distills Cox risk cannot be evaluated without these metrics.
  2. [Abstract / Pipeline description] In both the abstract and the methods, the conversion of structured covariates to text prompts is described only as 'converted into text prompts', with no template, rounding/binning rules, or fidelity verification. This is load-bearing because the weakest assumption is that numerical precision is preserved sufficiently for the LLM to recover the Cox linear predictor on held-out data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract and methods could be strengthened for clarity and evaluability. We address each point below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'competitive held-out discrimination and calibration' is stated without any reported C-index, Brier score, calibration slope, baselines, or error bars. This absence is load-bearing because the claim that text-generation training successfully distills Cox risk cannot be evaluated without these metrics.

    Authors: We agree that the abstract would be stronger and more self-contained if it included the quantitative results. The manuscript reports C-index, Brier scores, calibration slopes, and baseline comparisons in the results section with held-out evaluations on the three datasets. In revision we will add the key metrics, baselines, and any error bars or intervals directly to the abstract to support the claim.
    Revision: yes

  2. Referee: [Abstract / Pipeline description] In both the abstract and the methods, the conversion of structured covariates to text prompts is described only as 'converted into text prompts', with no template, rounding/binning rules, or fidelity verification. This is load-bearing because the weakest assumption is that numerical precision is preserved sufficiently for the LLM to recover the Cox linear predictor on held-out data.

    Authors: We accept that the current description of the covariate-to-text conversion is insufficiently detailed. The methods section will be expanded in revision to include the exact prompt template, rules for handling numerical values (rounding, binning, or direct inclusion), and any verification performed to confirm that the text encoding retains the information needed for the model to approximate the Cox linear predictor.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; Cox targets are external and evaluation is held-out

Full rationale

The paper fits a separate Cox model on training data to generate targets, then fine-tunes the LLM on text prompts to match those targets via text-generation loss. Held-out discrimination and calibration are measured against true event times, not the Cox predictions themselves. No equations reduce the reported performance to a fitted quantity by construction, no self-citation chain supports a uniqueness claim, and the text-conversion step is a preprocessing choice rather than a definitional loop. The derivation chain remains self-contained against external benchmarks.
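The calibration side of this held-out check can be sketched from the appendix fragment preserved in the reference list: within each risk bin the mean predicted risk is compared with the Kaplan–Meier observed risk at the 1-year horizon, and a slope is fitted without an intercept, with `abs(slope - 1.0)` as the error. The pure-Python version below follows that description. The binning, the toy data, and the helper names are assumptions, and the closed-form slope stands in for whatever regression routine the authors used.

```python
def km_risk_at(times, events, horizon):
    """Kaplan-Meier estimate of the probability of an event by `horizon`."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        died = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - died / at_risk
    return 1.0 - survival


def calibration_slope(pred_by_bin, obs_by_bin):
    """Least-squares beta in pred = beta * obs, fitted with no intercept."""
    num = sum(o * r for o, r in zip(obs_by_bin, pred_by_bin))
    den = sum(o * o for o in obs_by_bin)
    return num / den


# Two hypothetical risk bins of held-out patients (times in years).
bins = [
    {"pred": [0.15, 0.20, 0.25, 0.20], "time": [0.5, 2.0, 3.0, 4.0], "event": [1, 0, 1, 0]},
    {"pred": [0.50, 0.70, 0.60, 0.60], "time": [0.2, 0.6, 1.5, 2.0], "event": [1, 1, 0, 1]},
]
mean_pred = [sum(b["pred"]) / len(b["pred"]) for b in bins]
observed = [km_risk_at(b["time"], b["event"], horizon=1.0) for b in bins]

slope = calibration_slope(mean_pred, observed)
calib_error = abs(slope - 1.0)
print(observed, slope, calib_error)
```

Because the observed risks come from event times rather than from the Cox predictions, a slope near 1 here would support the non-circularity argument made above.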

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard LLM fine-tuning and the domain assumption that Cox outputs serve as valid targets.

free parameters (1)
  • LLM fine-tuning hyperparameters
    Specific learning rate, epochs, and prompt formatting choices are required for the reported performance but not enumerated.
axioms (1)
  • domain assumption: The Cox proportional hazards model supplies reliable risk targets for the LLM to learn from
    Used as the sole training signal without independent validation of its accuracy on the target populations.

pith-pipeline@v0.9.1-grok · 5699 in / 1238 out tokens · 28249 ms · 2026-06-27T17:06:18.830990+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    John Wiley & Sons, 2002

    John D Kalbfleisch and Ross L Prentice. The statistical analysis of failure time data. John Wiley & Sons, 2002

  2. [2]

    Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972

    David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972

  3. [3]

    Deephit: A deep learning approach to survival analysis with competing risks

    Changhee Lee, William Zame, Jinsung Yoon, and Mihaela Van Der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  4. [4]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  5. [5]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  6. [6]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  7. [7]

    Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients

    M Schumacher, G Bastert, H Bojar, K Hübner, M Olschewski, W Sauerbrei, C Schmoor, C Beyerle, RL Neumann, and HF Rauschecker. Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology, 12(10):2086–2093, 1994

  9. [9]

    Scott M Hammer, Kathleen E Squires, Michael D Hughes, Janet M Grimes, Lisa M Demeter, Judith S Currier, Joseph J Eron Jr, Judith E Feinberg, Henry H Balfour Jr, Lawrence R Deyton, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or le...

  10. [10]

    Incidence and case fatality rates of acute myocardial infarction (1975–1984): the worcester heart attack study

    Robert J Goldberg, Joel M Gore, Joseph S Alpert, and James E Dalen. Incidence and case fatality rates of acute myocardial infarction (1975–1984): the worcester heart attack study. American Heart Journal, 115(4):761–767, 1988

  11. [11]

    Evaluating the yield of medical tests. Jama, 247(18):2543–2546, 1982

    Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests. Jama, 247(18):2543–2546, 1982

  12. [12]

    Calibration: the achilles heel of predictive analytics. BMC medicine, 17(1):230, 2019

    Ben Van Calster, David J McLernon, Maarten Van Smeden, Laure Wynants, and Ewout W Steyerberg. Calibration: the achilles heel of predictive analytics. BMC medicine, 17(1):230, 2019

  13. [13]

    Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC medical research methodology, 18(1):24, 2018

    Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC medical research methodology, 18(1):24, 2018

  14. [14]

    A transformer-based survival model for prediction of all-cause mortality in patients with heart failure: a multi-cohort study. npj Digital Medicine, 2026

    Shishir Rao, Nouman Ahmed, Gholamreza Salimi-Khorshidi, Christopher Yau, Huimin Su, Nathalie Conrad, Folkert W Asselbergs, Mark Woodward, Rod Jackson, John GF Cleland, et al. A transformer-based survival model for prediction of all-cause mortality in patients with heart failure: a multi-cohort study. npj Digital Medicine, 2026

  15. [15]

    Tabllm: Few-shot classification of tabular data with large language models

    Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International conference on artificial intelligence and statistics, pages 5549–5581. PMLR, 2023

  16. [16]

    scikit-survival: A library for time-to-event analysis built on top of scikit-learn

    Sebastian Pölsterl. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research, 21(212):1–6, 2020

  17. [17]

    lifelines: Survival Analysis in Python. Journal of Open Source Software, 2019

    Cameron Davidson-Pilon. lifelines: Survival Analysis in Python. Journal of Open Source Software, 2019. URL https://github.com/CamDavidsonPilon/lifelines/tree/master

  18. [18]

    Python Tutorial

    Guido Van Rossum. Python Tutorial. Technical report, Centrum voor Wiskunde en Informatica (CWI), 1995

  19. [19]

    Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

  20. [20]

    Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare. arXiv preprint arXiv:2410.16872, 2024

    Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm, et al. Ck4gen: A knowledge distillation framework for generating high-utility synthetic survival datasets in healthcare. arXiv preprint arXiv:2410.16872, 2024

  21. [21]

    Performance of the net reclassification improvement for nonnested models and a novel percentile-based alternative. American journal of epidemiology, 187(6):1327–1335, 2018

    Shannon B McKearnan, Julian Wolfson, David M Vock, Gabriela Vazquez-Benitez, and Patrick J O’Connor. Performance of the net reclassification improvement for nonnested models and a novel percentile-based alternative. American journal of epidemiology, 187(6):1327–1335, 2018

  22. [22]

    Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008

  23. [23]

    Tristan Bouckley, Rimma Myton-Katieva, David Peiris, Devaki Nambiar, Samuel Prince, Simon Bishop, Damien Cordery, Flynn Robert Hill, Patricia Correll, Anne-Marie Feyer, et al. An assessment of data quality and sociodemographic variation in health service utilisation of general practice, emergency department and admitted services in a new south wales linke...

  24. [24]

    Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices. Heart, 2025

    Nicholas I-Hsien Kuo, Sebastiano Barbieri, Clare Arnott, Blanca Gallego, Ziba Gandomkar, Shahana Ferdousi, Kirsty Douglas, Mark Woodward, and Louisa Jorm. Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from australian general practices. Heart, 2025

  25. [25]

    Cardiovascular disease risk prediction equations in 400 000 primary care patients in new zealand: a derivation and validation study. The Lancet, 391(10133):1897–1907, 2018

    Romana Pylypchuk, Sue Wells, Andrew Kerr, Katrina Poppe, Tania Riddell, Matire Harwood, Dan Exeter, Suneela Mehta, Corina Grey, Billy P Wu, et al. Cardiovascular disease risk prediction equations in 400 000 primary care patients in new zealand: a derivation and validation study. The Lancet, 391(10133):1897–1907, 2018

  26. [26]

    Development and validation of qrisk3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. bmj, 357, 2017

    Julia Hippisley-Cox, Carol Coupland, and Peter Brindle. Development and validation of qrisk3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. bmj, 357, 2017

  27. [27]

    Development and validation of the american heart association’s prevent equations

    Sadiya S Khan, Kunihiro Matsushita, Yingying Sang, Shoshana H Ballew, Morgan E Grams, Aditya Surapaneni, Michael J Blaha, April P Carson, Alexander R Chang, Elizabeth Ciemins, et al. Development and validation of the american heart association’s prevent equations. Circulation, 149(6):430–449, 2024

  28. [28]

    Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and prognostic research, 4(1):8, 2020

    Rolf HH Groenwold. Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and prognostic research, 4(1):8, 2020

  29. [29]

    Evaluating deep learning sepsis prediction models in icus under distribution shift: a multi-centre retrospective cohort study. npj Digital Medicine, 2026

    Fanny Tranchellini, Youssef Farag, Catherine Jutzeler, and Lakmal Meegahapola. Evaluating deep learning sepsis prediction models in icus under distribution shift: a multi-centre retrospective cohort study. npj Digital Medicine, 2026

  30. [30]

    Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  31. [31]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  32. [32]

    Unsloth team

    Michael Han Daniel Han and Michael Han. Unsloth team. 2023

  33. [33]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  34. [34]

    Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods, 2022. URL https://github.com/huggingface/peft

  35. [35]

    Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

  36. [36]

    Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023

  37. [37]

    Datasets: A community library for natural language processing

    Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. In Proceedings of the 2021 conference on empirical methods in natural language processing: system demonstrations, pag...

  38. [38]

    Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011

  39. [39]

    Array programming with numpy. nature, 585(7825):357–362, 2020

    Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. nature, 585(7825):357–362, 2020

  40. [40]

    Data structures for statistical computing in python. scipy, 445(1):51–56, 2010

    Wes McKinney et al. Data structures for statistical computing in python. scipy, 445(1):51–56, 2010

  41. [41]

    Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90–95, 2007

    John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90–95, 2007

  42. [42]

    John Wiley & Sons, 2008

    David W Hosmer Jr, Stanley Lemeshow, and Susanne May. Applied survival analysis: regression modeling of time-to-event data, volume 618. John Wiley & Sons, 2008

  43. [43]

    Modelling the effects of standard prognostic factors in node-positive breast cancer. British Journal of Cancer, 79(11):1752–1760, 1999

    W Sauerbrei, P Royston, H Bojar, C Schmoor, and M Schumacher. Modelling the effects of standard prognostic factors in node-positive breast cancer. British Journal of Cancer, 79(11):1752–1760, 1999

  44. [44]

    Table 6.7 on page 198

    David A Karnofsky, Walter H Abelmann, Lloyd F Craver, and Joseph H Burchenal. The use of the nitrogen mustards in the palliative treatment of carcinoma: with particular reference to bronchogenic carcinoma. Cancer, 1948.

    Appendix: Additional Details to the Main Text

    Purpose of this Appendix. This appendix formalises the Cox-to-Qwen survival distillation fr...

  45. [45]

    Harrell’s C-index for survival discrimination [11],

  46. [46]

    calibration error (D21) based on calibration slope estimation [20, 12],

  47. [47]

    Duration

    percentile-based NRI [21]. The evaluation pipeline first merges the generated predictions with the held-out survival outcomes, removes invalid predictions, and clips all predicted risks to the interval [0,1] before downstream analysis. Calibration analysis is then performed using quantile-based risk bins and Kaplan–Meier survival estimation at the 1-year ...

  48. [48]

    the mean predicted risk is computed,

  49. [49]

    risk_bin

    the observed 1-year event probability is estimated using Kaplan–Meier survival estimation. Let $\hat{r}_b$ denote the mean predicted risk in bin $b$, and $o_b$ denote the observed Kaplan–Meier risk in the same bin. Calibration is assessed using a regression without intercept: $\hat{r}_b = \beta\, o_b$. (Note that in this formulation, we are essentially putting the predicted risk on ...

  50. [50]

    predicted_risk

    values.reshape(-1, 1),
        calib_df["predicted_risk"].values
    )

    slope = float(reg.coef_[0])
    calib_error = abs(slope - 1.0)

    This calibration framework therefore measures whether the generated survival-risk predictions remain numerically consistent with empirically observed event frequencies.

    C.5.3 Details 5...

  51. [51]

    attention_mask

    to("cuda")
     9
    10  with torch.no_grad():
    11      outputs = model(
    12          **inputs,
    13          output_hidden_states=True,
    14          return_dict=True,
    15      )
    16
    17  hidden = outputs.hidden_states[-1]
    18
    19  attn = inputs["attention_mask"]
    20  last_idx = attn.sum(dim=1) - 1
    21
    22  vec = hidden[0, last_idx...

  52. [52]

    attention_mask

    numpy()
    27
    28  return vec

    Lines 11–15: The key operation enabling hidden-state extraction is:

        outputs = model(
            **inputs,
            output_hidden_states=True,
            return_dict=True,
        )

    which instructs the transformer to return the full sequence of hidden-state activations from all transformer layers. The final-layer activations ar...