Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

Ali Cinar; Andrew D. Boyd; Barbara Di Eugenio; Brian T. Layden; Lu Cheng; Mudassir Rashid; Rochana Chaturvedi; Yue Zhou

arxiv: 2511.22038 · v2 · submitted 2025-11-27 · 💻 cs.CL

Rochana Chaturvedi, Yue Zhou, Andrew D. Boyd, Brian T. Layden, Mudassir Rashid, Lu Cheng, Ali Cinar, Barbara Di Eugenio

Pith reviewed 2026-05-17 05:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords clinical notes · temporal graph neural networks · risk prediction · Type 2 Diabetes · electronic health records · language processing · privacy-preserving models · model distillation

The pith

A hierarchical temporal graph neural network predicts Type 2 Diabetes risk from longitudinal clinical notes more accurately than baselines by capturing event timing and medical knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical notes hold detailed temporal and contextual clues about patient health that structured records often miss, which makes them useful for spotting chronic disease risk early. The paper introduces HiTGNN, which builds graphs linking events within notes, across visits, and with medical knowledge so that patient trajectories can be modeled at fine temporal scales. It pairs this with ReVeAL, a lightweight method that transfers reasoning from large language models into smaller verifiers for added sensitivity and explanations. Tests on temporally realistic Type 2 Diabetes cohorts drawn from private and public hospital corpora show gains in accuracy, especially for near-term risk, plus more equitable performance across subgroups and less dependence on large proprietary models. Ablations indicate that the temporal and knowledge components drive the improvements.
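As a rough illustration of what distilling reasoning into a smaller verifier could involve, the sketch below assembles fine-tuning records from a larger model's rationales. It is not ReVeAL itself: the `TeacherOutput` type, the `build_verifier_examples` helper, the prompt wording, and the rule of keeping only rationales whose conclusion matches the recorded outcome are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TeacherOutput:
    """One reasoning trace from a larger model (hypothetical structure)."""
    note_text: str
    rationale: str
    predicted_label: int  # 1 = at risk of T2D, 0 = not at risk


def build_verifier_examples(teacher_outputs, gold_labels):
    """Turn teacher traces into prompt/completion pairs for a small verifier.

    Assumption: traces whose conclusion disagrees with the recorded outcome
    are dropped, so the verifier only learns from consistent reasoning.
    """
    examples = []
    for output, gold in zip(teacher_outputs, gold_labels):
        if output.predicted_label != gold:
            continue
        answer = "yes" if gold == 1 else "no"
        examples.append({
            "prompt": f"Clinical note:\n{output.note_text}\n\nIs this patient at risk of T2D?",
            "completion": f"{output.rationale}\nAnswer: {answer}",
        })
    return examples


traces = [
    TeacherOutput("Elevated fasting glucose, weight gain.", "Glucose is rising over visits.", 1),
    TeacherOutput("Routine visit, no complaints.", "Fatigue suggests risk.", 1),
]
examples = build_verifier_examples(traces, gold_labels=[1, 0])
```

Keeping the rationale in the completion is what would let a small model return an explanation along with its answer, which is the property the summary attributes to ReVeAL.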

Core claim

The authors report that a hierarchical temporal graph neural network (HiTGNN), which integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge, can model patient trajectories from longitudinal clinical notes and predict Type 2 Diabetes onset more accurately than the baselines, particularly for near-term risk, while preserving privacy and limiting use of large proprietary models. A companion distillation framework (ReVeAL) increases sensitivity to true cases and retains explanatory reasoning.

What carries the argument

HiTGNN, the hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to represent patient trajectories at fine granularity.
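To make the three levels concrete, here is a minimal sketch of a patient graph with intra-note, inter-visit, and knowledge edges. It is an illustration under stated assumptions, not the authors' construction: the `PatientGraph` class, the edge type names, the `concept_links` lookup, and the shortcut of treating narrative order inside a note as temporal order are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PatientGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> attributes
    edges: list = field(default_factory=list)  # (source, target, edge_type)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, target, edge_type):
        self.edges.append((source, target, edge_type))


def build_patient_graph(notes, concept_links):
    """notes: list of (visit_date, [event mentions in narrative order]).
    concept_links: mention -> knowledge concept (hypothetical lookup table)."""
    graph = PatientGraph()
    previous_visit = None
    for visit_date, events in sorted(notes, key=lambda note: note[0]):
        visit_id = f"visit:{visit_date.isoformat()}"
        graph.add_node(visit_id, kind="visit", date=visit_date)
        # Inter-visit level: chain consecutive visits and keep the gap in days,
        # so irregular spacing between visits is not lost.
        if previous_visit is not None:
            prev_id, prev_date = previous_visit
            gap_days = (visit_date - prev_date).days
            graph.add_edge(prev_id, visit_id, f"next_visit:{gap_days}d")
        previous_visit = (visit_id, visit_date)

        previous_event = None
        for position, mention in enumerate(events):
            event_id = f"{visit_id}/event:{position}"
            graph.add_node(event_id, kind="event", text=mention)
            graph.add_edge(visit_id, event_id, "contains")
            # Intra-note level: assumed ordering, narrative order as time order.
            if previous_event is not None:
                graph.add_edge(previous_event, event_id, "before")
            previous_event = event_id
            # Knowledge level: attach the event to a shared concept node.
            concept = concept_links.get(mention)
            if concept is not None:
                concept_id = f"concept:{concept}"
                graph.add_node(concept_id, kind="concept")
                graph.add_edge(event_id, concept_id, "is_a")
    return graph


graph = build_patient_graph(
    notes=[(date(2020, 4, 1), ["weight gain"]),
           (date(2020, 1, 1), ["fatigue", "elevated glucose"])],
    concept_links={"elevated glucose": "hyperglycemia"},
)
```

A graph neural network would then pass messages along these typed edges; how HiTGNN defines its nodes, edges, and message passing is not specified on this page.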

If this is right

HiTGNN achieves the highest predictive accuracy for T2D risk, especially near-term forecasts.
ReVeAL increases sensitivity to true T2D cases while retaining explanatory reasoning.
Ablations confirm that temporal structure and knowledge augmentation add value to the predictions.
HiTGNN delivers more equitable performance across demographic subgroups.
The methods reduce reliance on large proprietary models and support privacy-preserving use of notes.
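The sensitivity and subgroup claims above can be checked with a simple per-group recall comparison. The sketch below is one assumed way to do it; this page does not state which fairness metric the paper uses, and the function names and the max-minus-min gap are illustrative choices.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Sensitivity (recall on true cases) computed separately per subgroup."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            hits[group] += int(pred == 1)
    # Groups with no true cases are omitted because recall is undefined there.
    return {group: hits[group] / positives[group] for group in positives}


def recall_gap(per_group):
    """Largest difference in recall between any two subgroups (assumed metric)."""
    values = list(per_group.values())
    return max(values) - min(values)


per_group = recall_by_group(
    y_true=[1, 1, 0, 1, 1, 0],
    y_pred=[1, 0, 0, 1, 1, 1],
    groups=["A", "A", "A", "B", "B", "B"],
)
gap = recall_gap(per_group)  # per_group is {"A": 0.5, "B": 1.0}, so the gap is 0.5
```

A smaller gap means true cases are caught at similar rates across groups, which is one reading of "more equitable performance".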

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same temporal-graph approach could extend to early risk prediction for other chronic conditions using existing EHR notes.
Lightweight distillation like ReVeAL could lower barriers to deploying reasoning models in settings with limited compute or data access.
Fairness gains across subgroups suggest potential to reduce prediction disparities if scaled to broader clinical use.
Prospective deployment studies on live patient streams would test whether the accuracy holds for timely interventions.

Load-bearing premise

The temporal event structures, inter-visit dynamics, and medical knowledge in clinical notes can be captured effectively by the hierarchical temporal graph neural network without major loss of information or introduction of bias.

What would settle it

A head-to-head test on the same temporally realistic T2D cohorts where HiTGNN shows no accuracy improvement or lower performance than simpler non-temporal or non-knowledge-augmented models would falsify the central claim.
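One way to run such a head-to-head test is a paired bootstrap on the AUC difference between two models scored on the same patients. The sketch below assumes that setup; the function names, the 95% percentile interval, and the choice of AUC as the metric are illustrative and not taken from the paper.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(y_true, scores_a, scores_b, n_resamples=1000, seed=0):
    """95% percentile interval for AUC(a) - AUC(b) on the same resampled patients."""
    if len(set(y_true)) < 2:
        raise ValueError("need both classes to compare AUCs")
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # skip resamples that lost one of the classes
        a = auc(labels, [scores_a[i] for i in idx])
        b = auc(labels, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[n_resamples * 25 // 1000]
    upper = diffs[n_resamples * 975 // 1000 - 1]
    return lower, upper
```

If the interval for HiTGNN minus a simpler baseline sits at or below zero on the same cohort, that would count against the central claim in the sense described above.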

Figures

Figures reproduced from arXiv: 2511.22038 by Ali Cinar, Andrew D. Boyd, Barbara Di Eugenio, Brian T. Layden, Lu Cheng, Mudassir Rashid, Rochana Chaturvedi, Yue Zhou.

**Figure 2.** HIT-GNN Architecture: Hierarchical Temporal GNN that models intra- and inter-document temporal dependencies between clinical entities and integrates UMLS knowledge for type 2 diabetes (T2D) risk prediction. Extracted graphs are provided in ...

**Figure 3.** AUC as a function of the prediction horizon, evaluated over consecutive 3-month windows.

**Figure 4.** HIT-GNN performance ablations.

Body text captured alongside this caption: "... negative towards the White minority in the PH corpus, while all models are slightly negative against this group in MIMIC-IV, where it has a majority. HIT-GNN shows a high positive bias for Hispanics in MIMIC-IV, and LLMs show low-to-moderate bias. Overall, HIT-GNN is relatively fairer."

**Figure 5.** Prompt for identifying mention of type 2 dia...

**Figure 6.** Prompts for inference from a reasoning model

**Figure 7.** System prompt used for fine-tuning and infer...

**Figure 8.** Prompts used for fine-tuning the verifier model

**Figure 9.** T2D recall as a function of the prediction horizon, evaluated over consecutive 3-month windows.

**Figure 10.** T2D recall performance variation of HITGNN with different node embedding approaches

**Figure 11.** T2D recall performance variation by restrict...
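Figure 9 plots T2D recall against the prediction horizon in consecutive 3-month windows. A minimal sketch of that kind of bucketing follows, assuming each true T2D case comes with a hypothetical `months_to_onset` value (time from the prediction point to diagnosis) and a binary model flag. The paper's exact evaluation protocol is not described on this page, so the window boundaries and the 12-month cap are assumptions.

```python
def recall_by_horizon(months_to_onset, predicted_positive, window_months=3, max_months=12):
    """Recall among true cases, grouped by how far ahead the onset lies.

    Returns {(start_month, end_month): recall or None}, where None marks a
    window that contains no true cases.
    """
    n_windows = max_months // window_months
    flagged = [0] * n_windows
    total = [0] * n_windows
    for months, pred in zip(months_to_onset, predicted_positive):
        index = int(months // window_months)
        if 0 <= index < n_windows:  # cases beyond max_months are ignored
            total[index] += 1
            flagged[index] += int(pred)
    return {
        (i * window_months, (i + 1) * window_months):
            (flagged[i] / total[i] if total[i] else None)
        for i in range(n_windows)
    }


curve = recall_by_horizon(
    months_to_onset=[1, 2, 4, 5, 7, 11, 14],
    predicted_positive=[1, 1, 1, 0, 0, 1, 1],
)
# curve is {(0, 3): 1.0, (3, 6): 0.5, (6, 9): 0.0, (9, 12): 1.0}
```

Reading the windows from left to right shows whether a model's sensitivity is concentrated in near-term cases, which is the pattern the summary attributes to HiTGNN.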

Original abstract

Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight test-time framework that distills LLMs' reasoning into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HiTGNN and ReVeAL give a concrete way to pull temporal signals from longitudinal EHR notes for T2D risk without leaning on big proprietary models, but the graph construction for irregular visit patterns is the part that needs the most checking.

The paper introduces HiTGNN, a hierarchical temporal graph neural network that builds graphs from intra-note events, inter-visit dynamics, and medical knowledge, plus ReVeAL, a test-time distillation setup that transfers reasoning from larger models to smaller verifiers. They test this on temporally realistic cohorts from private and public hospital data for early Type 2 Diabetes prediction. The ablations show gains from the temporal and knowledge components, and the fairness checks across subgroups are a reasonable addition. The privacy angle and reduced dependence on external LLMs are practical points that matter for real deployment.

Referee Report

2 major / 2 minor

Summary. The paper introduces HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories from longitudinal clinical notes, along with ReVeAL, a lightweight test-time framework that distills LLM reasoning into smaller verifier models. Applied to opportunistic T2D risk prediction on temporally realistic cohorts from private and public hospital corpora, the work claims that HiTGNN delivers the highest predictive accuracy (especially near-term), preserves privacy, limits reliance on large proprietary models, and shows equitable performance; ReVeAL improves sensitivity to true cases while retaining explanatory reasoning. Ablations are said to confirm the value of temporal structure and knowledge augmentation.

Significance. If the empirical claims are substantiated with full metrics and controls, the work could advance privacy-preserving, temporally grounded clinical NLP by demonstrating a practical way to leverage rich event and reasoning information in notes for early chronic disease screening without heavy dependence on large models.

major comments (2)

[Abstract] The central claim that HiTGNN achieves the highest predictive accuracy (especially for near-term risk) is unsupported by any quantitative metrics, cohort sizes, baselines, or error bars, making verification of the accuracy results impossible from the provided description.
[Methods (HiTGNN)] HiTGNN graph construction: the assumption that the specific hierarchical temporal graph and message passing capture irregular event dependencies without material information loss or ordering artifacts is load-bearing for the near-term accuracy claim, yet the manuscript does not compare against alternative graph topologies or ablate node/edge definitions to rule out inflated performance on temporally realistic cohorts.

minor comments (2)

[Methods] Add explicit definitions and pseudocode for node/edge construction and inter-visit linking to support reproducibility.
[Experiments] Report exact cohort sizes, train/test splits, and baseline implementations in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

Referee: [Abstract] The central claim that HiTGNN achieves the highest predictive accuracy (especially for near-term risk) is unsupported by any quantitative metrics, cohort sizes, baselines, or error bars, making verification of the accuracy results impossible from the provided description.

Authors: We agree that the abstract would benefit from explicit quantitative support to allow immediate verification of the central claims. In the revised version, we have updated the abstract to include key metrics (e.g., AUC-ROC and sensitivity for near-term horizons), cohort sizes from both private and public corpora, baseline comparisons, and error bars derived from multiple runs. Revision: yes.

Referee: [Methods (HiTGNN)] HiTGNN graph construction: the assumption that the specific hierarchical temporal graph and message passing capture irregular event dependencies without material information loss or ordering artifacts is load-bearing for the near-term accuracy claim, yet the manuscript does not compare against alternative graph topologies or ablate node/edge definitions to rule out inflated performance on temporally realistic cohorts.

Authors: We appreciate this observation on the load-bearing nature of the graph design. Our original ablations already demonstrate the contribution of temporal structure and knowledge augmentation. To directly address the concern, we have added new experiments in the revised manuscript that compare the hierarchical temporal graph against alternative topologies (including non-hierarchical and flattened variants) and perform targeted ablations on node and edge definitions. These results confirm that the chosen structure better preserves irregular temporal dependencies without introducing ordering artifacts on the temporally realistic cohorts. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical results rest on external cohort evaluation

The paper introduces HiTGNN and ReVeAL as modeling approaches for T2D risk from longitudinal clinical notes and reports performance on temporally realistic cohorts from private and public hospital data. All central claims (highest near-term accuracy, value of temporal structure, fairness across subgroups) are supported by experimental results and ablations rather than any closed-form derivations, parameter fits renamed as predictions, or self-citation chains that reduce the target result to its own inputs. No equations appear in the provided sections that would allow a self-definitional or fitted-input reduction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; ledger remains empty pending full text.
