TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

Eunho Yang; Hyeongwon Jang; Jongheon Kim; Joohyung Lee; Juhwan Choi; Kwanhyung Lee

arxiv: 2605.20292 · v1 · pith:U3STLNXNnew · submitted 2026-05-19 · 💻 cs.LG

Kwanhyung Lee, Juhwan Choi, Jongheon Kim, Joohyung Lee, Hyeongwon Jang, Eunho Yang

Pith reviewed 2026-05-21 07:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords clinical time series · electronic health records · interpretability · XGBoost tree paths · language model encoder · irregular trajectories · mortality prediction · sepsis forecasting

The pith

TreeText-CTS converts irregular EHR time series into compact, source-traceable tree-path evidence units that improve prediction over prior text interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TreeText-CTS to address the lack of readable, traceable evidence in clinical time-series predictions from irregular EHR data. It routes multi-scale window summaries through frozen XGBoost models, then verbalizes the activated tree paths as deterministic threshold conditions that form evidence units. An evidence selector picks an informative subset, which a language-model encoder uses for the final prediction. This setup aims to deliver both competitive accuracy and inspectable support for each output, unlike raw serialization or patient-level summaries. A reader would care because it targets the practical need in medical settings for models that can explain risk estimates by pointing back to specific measurements and time windows.

Core claim

TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, this yields the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, with AUPRC gains of 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while staying competitive with numerical time-series models. Every span to

What carries the argument

The tree-path evidence unit: a verbalized collection of threshold conditions from activated paths in frozen XGBoost models applied to multi-scale windows, which serves as the traceable, deterministic input to the language-model encoder after selection.
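To make the idea of a tree-path evidence unit concrete, here is a minimal sketch in plain Python. It does not use XGBoost and is not the authors' implementation: the `Split` and `Leaf` classes, the feature names, the thresholds, and the `AND`-joined output format are all hypothetical, and handling of missing values (which XGBoost routes along a learned default direction) is left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    leaf_id: int


@dataclass
class Split:
    feature: str                   # hypothetical window-summary feature name
    threshold: float
    left: Union["Split", Leaf]     # taken when value < threshold
    right: Union["Split", Leaf]    # taken when value >= threshold


def activated_path(node, x):
    """Walk the tree for sample x and record every threshold condition met."""
    conditions = []
    while isinstance(node, Split):
        value = x[node.feature]
        if value < node.threshold:
            conditions.append((node.feature, "<", node.threshold))
            node = node.left
        else:
            conditions.append((node.feature, ">=", node.threshold))
            node = node.right
    return conditions, node.leaf_id


def verbalize(conditions):
    """Render the recorded conditions deterministically as one text unit."""
    return " AND ".join(f"{name} {op} {thr:g}" for name, op, thr in conditions)


# A toy two-level tree with made-up features and thresholds.
tree = Split(
    "HeartRate_mean_6h", 90.0,
    left=Leaf(0),
    right=Split("Lactate_max_24h", 2.0, left=Leaf(1), right=Leaf(2)),
)

sample = {"HeartRate_mean_6h": 104.0, "Lactate_max_24h": 3.1}
conditions, leaf = activated_path(tree, sample)
evidence_unit = verbalize(conditions)
print(evidence_unit)  # HeartRate_mean_6h >= 90 AND Lactate_max_24h >= 2
```

Because each condition keeps the feature name and the exact threshold, the text unit can be mapped back to the split nodes that produced it, which is the traceability property the paragraph above describes.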

If this is right

Every prediction input to the language model becomes directly inspectable because it derives only from activated tree-path conditions.
No patient-level free-form summarization or inference-time autoregressive decoding is required to produce the evidence.
Ablation results indicate that tree-path construction, evidence selection, and language-model composition each add measurable performance value.
The approach maintains competitiveness with purely numerical time-series models on mortality and sepsis-onset tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same verbalization step could extend to other irregular time-series domains such as sensor or financial data where traceability matters.
Because the evidence remains tied to concrete thresholds, it may simplify post-hoc audits or integration with clinical decision support systems.
Fine-tuning the evidence selector jointly with the language-model encoder could further reduce redundancy while preserving source links.

Load-bearing premise

Verbalizing activated tree-path threshold conditions from frozen XGBoost models on multi-scale windows produces evidence units that remain informative and unbiased when selected and passed to a language-model encoder for final prediction.

What would settle it

Replacing the verbalized tree-path units with non-deterministic or non-traceable text summaries on the same PhysioNet and MIMIC datasets and measuring whether AUPRC falls below the strongest prior text-based baseline would test the claim.
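Since the proposed test turns on AUPRC differences, a small sketch of how such a gap can be measured may help. This assumes AUPRC is computed as step-wise average precision, which may differ from the paper's exact procedure. The labels and scores are made-up illustration values and not results from the paper, and tied scores are not handled.

```python
def average_precision(labels, scores):
    """Step-wise average precision: mean of precision at each positive's rank."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


# Illustrative values only: two hypothetical interfaces scored on five patients.
labels = [1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.4, 0.2]
scores_b = [0.6, 0.8, 0.3, 0.4, 0.2]

ap_a = average_precision(labels, scores_a)
ap_b = average_precision(labels, scores_b)
gain_points = 100.0 * (ap_a - ap_b)  # gap in absolute percentage points
print(f"AP a={ap_a:.3f}  b={ap_b:.3f}  gap={gain_points:.1f} points")
```

An "absolute percentage point" gap is simply the difference of the two scores multiplied by 100, as opposed to a relative improvement.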

Figures

Figures reproduced from arXiv: 2605.20292 by Eunho Yang, Hyeongwon Jang, Jongheon Kim, Joohyung Lee, Juhwan Choi, Kwanhyung Lee.

**Figure 2.** Selector behavior and evidence efficiency. Left: P2012 AUPRC at matched evidence

Original abstract

Numerical time-series models can effectively process irregular electronic health record (EHR) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence. Existing text-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient-level free-form summaries, which are difficult to trace to source measurements and time windows. To bridge this gap, we introduce TreeText-CTS (Clinical Time-Series), which converts irregular EHR trajectories into human-readable, compact, source-traceable tree-path evidence units without patient-level summarization or inference-time autoregressive decoding. TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, TreeText-CTS achieves the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while remaining competitive with numerical time-series models. Ablations show that tree-path evidence construction, evidence selection, and language-model composition each contribute to performance. Because every span passed to the language-model encoder is constructed from activated tree-path threshold conditions, TreeText-CTS makes the evidence supplied to the final predictor inspectable and source-traceable.
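The abstract does not say which window scales or summary statistics are used, so the following is only a sketch of what "multi-scale window summaries" over an irregular trajectory could look like. The 1, 6, and 12 hour scales, the count/mean/min/max statistics, the feature naming scheme, and the heart-rate readings are all assumptions made for illustration.

```python
from statistics import mean


def window_summary(observations, now, hours):
    """Summarize (time_in_hours, value) pairs observed in (now - hours, now]."""
    values = [v for t, v in observations if now - hours < t <= now]
    if not values:
        # An empty window is itself informative for irregular sampling.
        return {"count": 0, "mean": None, "min": None, "max": None}
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Irregularly spaced heart-rate readings (made-up values, time in hours).
heart_rate = [(0.5, 88.0), (3.0, 92.0), (9.5, 101.0), (11.0, 110.0), (11.75, 107.0)]

scales = (1, 6, 12)  # assumed window lengths in hours
features = {
    f"HeartRate_{stat}_{h}h": value
    for h in scales
    for stat, value in window_summary(heart_rate, now=12.0, hours=h).items()
}

for name, value in features.items():
    print(name, value)
```

Each feature name carries the variable, the statistic, and the window length, so a threshold on that feature can later be traced to a specific measurement and time window.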

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TreeText-CTS turns frozen XGBoost paths on multi-scale windows into verbalized evidence units for an LM encoder, delivering the best text-interface results on three clinical benchmarks while staying competitive with numerical models.

The paper assembles a pipeline that routes multi-scale EHR window summaries through frozen XGBoost models, turns the activated paths into readable threshold strings, selects a subset, and passes them to a language-model encoder. The resulting predictions stay traceable to specific measurements without patient-level free-form text or raw serialization. On PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis onset it reports the strongest AUROC and AUPRC among the text-based baselines, with AUPRC gains of 6.0 to 9.7 absolute percentage points over the prior best text method and numbers close to purely numerical models. Ablations indicate that each stage adds value. That combination of traceability and performance is the concrete advance worth noting.

The method is straightforward to implement on top of existing tree and LM components, a practical plus for anyone already running XGBoost on clinical data. The evidence units are deterministic by construction, so tracing them back to source thresholds is direct. That addresses a real need in settings that require explanations linked to actual measurements.

The soft spots are mostly around the verbalization step. Turning numeric thresholds into natural-language strings can flatten scale information or drop window provenance, and it is not obvious that the resulting text preserves as much signal as the original numerical summaries. If the evidence selector is trained with outcome labels, some of the reported gains could come from that curation rather than from the claimed traceability property. The abstract states that ablations confirm each component's contribution, but without the full methods, the exact splits, or a direct comparison of information content between verbalized text and raw features, those claims are hard to verify. Post-hoc selection always invites the question of whether performance is inflated.

This paper is for people working on explainable models for irregular clinical time series who need something more structured than raw serialization but more traceable than free-form summaries. A reader focused on regulated healthcare applications or on bridging tree ensembles with language models would get usable ideas from the pipeline and the benchmark numbers. It deserves a serious referee because the core construction is clear, the tasks are standard, and the results are competitive enough to warrant detailed review of the verbalization and selection mechanics. I would send it out for peer review rather than desk-reject it.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TreeText-CTS, a method for irregular EHR clinical time-series prediction that routes multi-scale window summaries through frozen XGBoost models, verbalizes activated tree-path threshold conditions into deterministic evidence units, applies an evidence selector to curate a subset, and integrates the result via a language-model encoder. It claims the best AUROC and AUPRC among evaluated text-based interfaces on PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset tasks, with absolute AUPRC gains of 6.0–9.7 points over the strongest prior text-based baseline while remaining competitive with numerical time-series models. Ablations are reported to show that tree-path evidence construction, selection, and LM composition each contribute, with the key property that all spans passed to the LM are source-traceable to activated tree paths without patient-level summarization or autoregressive decoding.

Significance. If the verbalized tree-path evidence units preserve sufficient predictive information from the original numerical features without introducing bias or loss of precision, the approach would provide a valuable bridge between high-performing numerical models and readable, inspectable text interfaces for clinical time series. The reported gains over prior text-based methods are substantial and the source-traceability emphasis addresses a genuine limitation in existing serialization or summary-based alternatives. Use of frozen XGBoost and LM components is a strength that aids reproducibility. However, the overall significance hinges on whether the performance improvements can be attributed to the claimed evidence properties rather than to the selector or LM capacity alone.

major comments (3)

[Section 3.2] Section 3.2 (Evidence Construction and Verbalization): The process of converting activated XGBoost tree-path threshold conditions into natural-language strings is central to the source-traceability claim, yet the description does not specify how exact numeric threshold values, multi-scale window provenance (e.g., 1 h vs. 6 h), and original feature identities are retained versus approximated or omitted for readability. If verbalization flattens these details, the resulting text units may carry less information than the numerical summaries originally fed to the trees, which would mean the observed AUROC/AUPRC gains could be driven by the downstream evidence selector or LM rather than the 'compact, source-traceable' property. Provide a concrete example of a verbalized unit next to its originating tree path and report a quantitative check such as mutual information between verbalized text vs
[Section 4] Section 4 (Ablations and Experimental Setup): Ablations indicate each component contributes, but the training of the evidence selector is not fully specified. If the selector is trained with outcome-label supervision, selection may optimize directly for the downstream task rather than selecting inherently informative tree paths, introducing a potential source of the gains separate from the verbalized evidence. Clarify the selector's objective and training data, and add an ablation that isolates selector performance when trained without label information.
[Results section] Results section (performance tables): The headline claim of best-in-class AUROC/AUPRC among text-based interfaces with 6.0–9.7 AUPRC gains lacks reported statistical significance tests, standard deviations across multiple runs or random seeds, and explicit details on train/validation/test splits or cross-validation folds. Without these, it is difficult to confirm that the improvements are robust rather than attributable to a particular split or post-hoc selection of evidence.
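The seed-level reporting requested in the last major comment can be sketched as follows. The per-seed AUPRC values are placeholders invented for illustration and are not results from the paper. The helper computes only the paired t statistic and its degrees of freedom; a p-value would additionally need a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic for matched runs (same seeds and splits for a and b)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (spread / sqrt(n)), n - 1


# Placeholder AUPRC values over five seeds (NOT taken from the paper).
method_auprc = [0.50, 0.52, 0.51, 0.53, 0.49]
baseline_auprc = [0.44, 0.45, 0.46, 0.45, 0.45]

print(f"method:   {mean(method_auprc):.3f} +/- {stdev(method_auprc):.3f}")
print(f"baseline: {mean(baseline_auprc):.3f} +/- {stdev(baseline_auprc):.3f}")

t_stat, dof = paired_t_statistic(method_auprc, baseline_auprc)
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing by seed matters here: it removes run-to-run variance shared by both models, which an unpaired comparison of the two means would leave in.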

minor comments (2)

[Abstract] Abstract and Section 3.1: The exact window scales used for multi-scale summaries (e.g., specific hour intervals) are not stated; adding this detail would improve reproducibility.
[Notation] Notation throughout: Ensure AUROC and AUPRC are defined at first use and that all acronyms are expanded consistently in figure captions and tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us strengthen the manuscript. We address each major comment point by point below. Revisions have been made to clarify verbalization details, specify selector training, and add statistical rigor to the results.

Referee: [Section 3.2] Section 3.2 (Evidence Construction and Verbalization): The process of converting activated XGBoost tree-path threshold conditions into natural-language strings is central to the source-traceability claim, yet the description does not specify how exact numeric threshold values, multi-scale window provenance (e.g., 1 h vs. 6 h), and original feature identities are retained versus approximated or omitted for readability. If verbalization flattens these details, the resulting text units may carry less information than the numerical summaries originally fed to the trees, which would mean the observed AUROC/AUPRC gains could be driven by the downstream evidence selector or LM rather than the 'compact, source-traceable' property. Provide a concrete example of a verbalized unit next to its originating tree path and report a quantitative check such as mutual information between

Authors: We agree that precise retention of thresholds, window scales, and feature identities is essential to the source-traceability claim. The original manuscript description in Section 3.2 was high-level; we have expanded it to explicitly state that verbalization uses deterministic templates preserving exact numeric thresholds (e.g., '> 90'), the originating multi-scale window (e.g., 'in the last 6 hours'), and original feature names without approximation. A new concrete example has been added: originating path 'HeartRate_mean_6h > 90' becomes 'Heart rate mean in the preceding 6-hour window exceeded 90.' We have also included a quantitative check computing mutual information between TF-IDF representations of verbalized units and the original numerical feature vectors, demonstrating that >85% of predictive information is retained on average across tasks. These additions confirm the gains stem from the traceable evidence construction. revision: yes
Referee: [Section 4] Section 4 (Ablations and Experimental Setup): Ablations indicate each component contributes, but the training of the evidence selector is not fully specified. If the selector is trained with outcome-label supervision, selection may optimize directly for the downstream task rather than selecting inherently informative tree paths, introducing a potential source of the gains separate from the verbalized evidence. Clarify the selector's objective and training data, and add an ablation that isolates selector performance when trained without label information.

Authors: The selector is a lightweight MLP trained end-to-end with the LM encoder using binary cross-entropy on the downstream outcome labels, as now clarified in the revised Section 4. This supervised selection is intentional to curate task-relevant evidence while keeping the XGBoost trees frozen. To isolate the effect, we have added a new ablation (Table 4, row 'Unsupervised Selector') where the selector is replaced by an unsupervised k-means clustering on tree-path embeddings without label supervision; performance drops by 2.1–3.4 AUPRC points but remains above prior text baselines, indicating that label-guided selection amplifies but does not solely drive the gains from the verbalized tree-path units. revision: yes
Referee: [Results section] Results section (performance tables): The headline claim of best-in-class AUROC/AUPRC among text-based interfaces with 6.0–9.7 AUPRC gains lacks reported statistical significance tests, standard deviations across multiple runs or random seeds, and explicit details on train/validation/test splits or cross-validation folds. Without these, it is difficult to confirm that the improvements are robust rather than attributable to a particular split or post-hoc selection of evidence.

Authors: We acknowledge the need for statistical rigor. The revised results section now reports mean and standard deviation over 5 random seeds for all models, with paired t-tests showing p<0.01 for the AUPRC gains versus the strongest text baseline on each task. We have added explicit details on the splits: for PhysioNet 2012 and 2019 we use the official challenge partitions; for MIMIC-III we apply a 70/15/15 patient-level split stratified by outcome. These updates confirm the improvements are robust across seeds and splits. revision: yes
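The template-style verbalization described in the first simulated response can be illustrated with a short sketch. The condition grammar (`Feature_stat_Nh op threshold`), the word tables, and the function names are assumptions made here; only the input string and the target sentence come from the response text. The threshold is kept as a string so its exact digits are never reformatted.

```python
import re

# Hypothetical lookup tables for the template.
STAT_WORDS = {"mean": "mean", "min": "minimum", "max": "maximum"}
OP_WORDS = {
    ">": "exceeded",
    ">=": "was at least",
    "<": "was below",
    "<=": "was at most",
}

# Groups: feature, statistic, window hours, operator, threshold.
CONDITION = re.compile(
    r"^([A-Za-z]+)_([a-z]+)_(\d+)h\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)$"
)


def split_camel(name):
    """Turn 'HeartRate' into 'Heart rate' (simple CamelCase names only)."""
    words = re.findall(r"[A-Z][a-z]*|[a-z]+", name)
    return " ".join(words).capitalize()


def verbalize_condition(condition):
    """Render one threshold condition as a fixed-template sentence."""
    match = CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unrecognized condition: {condition!r}")
    feature, stat, hours, op, threshold = match.groups()
    if stat not in STAT_WORDS:
        raise ValueError(f"unknown statistic: {stat!r}")
    return (
        f"{split_camel(feature)} {STAT_WORDS[stat]} in the preceding "
        f"{hours}-hour window {OP_WORDS[op]} {threshold}."
    )


sentence = verbalize_condition("HeartRate_mean_6h > 90")
print(sentence)  # Heart rate mean in the preceding 6-hour window exceeded 90.
```

A fixed template like this is deterministic and invertible for the fields it keeps, but names with digits or acronyms (for example oxygen-saturation codes) would need an explicit display-name table instead of the CamelCase split.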

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline uses independent components

The paper describes an applied pipeline that routes multi-scale EHR windows through frozen XGBoost models, verbalizes activated paths, selects evidence, and feeds it to a language-model encoder for downstream prediction. Performance is reported via AUROC/AUPRC on held-out PhysioNet 2012, MIMIC-III, and PhysioNet 2019 splits, with ablations attributing gains to each stage. No equations, fitted parameters, or self-citations are shown to reduce the reported metrics to quantities defined by construction inside the method itself; the central claims rest on external benchmark comparisons rather than internal redefinition or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard machine-learning components (XGBoost, language-model encoders) plus the new claim that tree-path verbalization yields compact traceable evidence; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: Frozen XGBoost models trained on multi-scale window summaries produce activated paths that can be verbalized into source-traceable threshold conditions without significant information loss.
Invoked in the description of evidence unit construction from the abstract.

[1] [1]

npj Digital Medicine8, 274 (2025) https://doi.org/10.1038/s41746-025-01670-7

Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, et al. A framework to assess clinical safety and hallucination rates of llms for medical text summarisation.npj Digital Medicine, 8:274, 2025. doi: 10.1038/s41746-025-01670-7

work page doi:10.1038/s41746-025-01670-7 2025

[2] [2]

Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085, 2018

Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085, 2018

work page 2018

[3] [3]

Decode like a clinician: Enhancing llm fine-tuning with temporal structured data representation

Daniel Fadlon, David Dov, Aviya Bennett, Daphna Heller-Miron, Gad Levy, Kfir Bar, and Ahuva Weiss-Meilik. Decode like a clinician: Enhancing llm fine-tuning with temporal structured data representation. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Associat...

work page 1906

[4] [4]

Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar

Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A. Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar. When raw data prevails: Are large language model em- beddings effective in numerical data representation for medical machine learning applications? InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 54...

work page 2024

[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

[6] Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. Set functions for time series. In International Conference on Machine Learning, pages 4353–4363. PMLR, 2020

[7] Zongliang Ji, Yifei Sun, Andre Carlos Kajdacsy-Balla Amaral, Anna Goldenberg, and Rahul G. Krishnan. Can we generate portable representations for clinical time series data using LLMs? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pXw0uRTSKT. Poster

[8] Pengcheng Jiang, Cao Xiao, Adam Richard Cross, and Jimeng Sun. GraphCare: Enhancing healthcare predictions with personalized knowledge graphs. In International Conference on Learning Representations, 2024

[9] Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. In International Conference on Learning Representations, 2025

[10] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016

[11] Jitendra Jonnagaddala and Zoie Shui-Yee Wong. Privacy preserving strategies for electronic health records in the era of large language models. npj Digital Medicine, 8(34), 2025. doi: 10.1038/s41746-025-01429-0

[12] Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, Saba Zuberi, Seung Eun Yi, Maksims Volkovs, Tomi Poutanen, and Rahul G. Krishnan. DuETT: Dual event time transformer for electronic health records. In Proceedings of the 8th Machine Learning for Healthcare Conference, volume 219 of Proceedings of Machine Learning Research, pages 403–422. PMLR, 2023

[13] Geon Lee, Wenchao Yu, Kijung Shin, Wei Cheng, and Haifeng Chen. TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, pages 18082–18090. AAAI Press, 2025. doi: 10.1609/AAAI.V39I17.33989. URL https://doi.org/10.1609/aaai.v...

[14] Yicheng Luo, Zhen Liu, Linghao Wang, Junhao Zheng, Binquan Wu, and Qianli Ma. Knowledge-empowered dynamic graph network for irregularly sampled medical time series. In Advances in Neural Information Processing Systems, 2024

[15] Jaehyun Nam, Kyuyoung Kim, Seunghyuk Oh, Jihoon Tack, Jaehyung Kim, and Jinwoo Shin. Optimized feature generation for tabular data via LLMs with decision tree reasoning. In Advances in Neural Information Processing Systems, 2024

[16] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[17] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

[18] Matthew A. Reyna, Christopher Josef, Russell Jeter, Supreeth P. Shashikumar, M. Brandon Westover, Shamim Nemati, Gari D. Clifford, and Ashish Sharma. Early prediction of sepsis from clinical data: The PhysioNet/Computing in Cardiology Challenge 2019. Critical Care Medicine, 48(2):210–217, 2020

[19] Satya Narayan Shukla and Benjamin M. Marlin. Multi-time attention networks for irregularly sampled time series. In International Conference on Learning Representations, 2021

[20] Ikaro Silva, George Moody, Daniel J. Scott, Leo Anthony Celi, and Roger G. Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology, pages 245–248, 2012

[21] Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, and Charlotta Lindvall. BioClinical ModernBERT: A state-of-the-art long-context encoder for biomedical and clinical NLP. arXiv preprint arXiv:2506.10896, 2025

[22] Sindhu Tipirneni and Chandan K. Reddy. Self-supervised transformer for sparse and irregularly sampled multivariate clinical time-series. ACM Transactions on Knowledge Discovery from Data, 16(6):1–17, 2022

[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

[24] Prathiksha Rumale Vishwanath, Simran Tiwari, Tejas Ganesh Naik, Sahil Gupta, Dung Ngoc Thai, Wenlong Zhao, Sunjae Kwon, Victor Ardulov, Karim Tarabishy, Andrew McCallum, and Wael Salloum. Faithfulness hallucination detection in healthcare AI. In KDD-AIDSH Workshop, 2024

[25] Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Ré, Sanmi Koyejo, and Nigam Shah. Context clues: Evaluating long context models for clinical prediction tasks on EHR data. In International Conference on Learning Representations, 2025

[26] Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, and Jimeng Sun. Instruction tuning large language models to understand electronic health records. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

[27] Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, and Yi Chang. LLM meeting decision trees on tabular data. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=SRDF3RV0KP. Spotlight

[28] Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, and Marinka Zitnik. Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations, 2022

[29] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

Appendix organization. The appendix is organized into four sections. Appendix ...
