Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Jun Wen; Lucas Jing Liu; Matthew Thompson; Meliha Yetisgen; Ruth Etzioni; Sihang Zeng; Sitong Zhou; Yujuan Fu; Zixuan Yu

arxiv: 2510.10454 · v2 · pith:YI2JELOBnew · submitted 2025-10-12 · 💻 cs.AI

Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Pith reviewed 2026-05-21 21:23 UTC · model grok-4.3

classification 💻 cs.AI

keywords patient trajectory modeling · chain-of-agents · electronic health records · lung cancer risk prediction · multi-agent systems · temporal reasoning · zero-shot prediction · LLM healthcare applications

The pith

A chain of worker agents chunks long EHR data and distills events into shared memory to outperform baselines in zero-shot lung cancer risk prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Traj-CoA, a multi-agent system designed to model patient trajectories from lengthy and noisy electronic health records for lung cancer risk prediction. Worker agents handle sequential chunks of the data while distilling critical events into a shared long-term memory module called EHRMem, which reduces noise and maintains timeline continuity. A manager agent then combines the distilled memory with agent summaries to generate the final prediction. This setup is tested in a zero-shot scenario that uses five years of EHR to forecast one-year risk and shows better results than multiple categories of baselines while producing reasoning that aligns with clinical patterns.

Core claim

Traj-CoA uses a chain of worker agents to process EHR data in manageable sequential chunks and to distill critical events into the shared EHRMem module, which preserves a comprehensive timeline. A final manager agent synthesizes the worker summaries and the extracted timeline to predict lung cancer risk. In zero-shot one-year prediction from five-year records, the paper reports stronger performance than four categories of baselines.

What carries the argument

Chain-of-agents architecture in which worker agents sequentially process EHR chunks and distill events into EHRMem long-term memory, enabling the manager agent to perform synthesis and temporal reasoning for the prediction.
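The control flow described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `LLM` callable type, the `EVENT:` output convention, the prompt layout, and the idea of passing the previous worker's summary forward are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class EHRMem:
    """Shared long-term memory: an append-only timeline of distilled events."""
    events: List[str] = field(default_factory=list)

    def add(self, new_events: List[str]) -> None:
        for event in new_events:
            if event and event not in self.events:  # skip blanks and duplicates
                self.events.append(event)


def chunk_records(records: List[str], chunk_size: int) -> List[List[str]]:
    """Split chronologically ordered records into consecutive chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_chain(records: List[str], chunk_size: int, worker: LLM, manager: LLM) -> str:
    """Workers read chunks in order; the manager sees the last summary plus the timeline."""
    memory = EHRMem()
    summary = ""
    for chunk in chunk_records(records, chunk_size):
        prompt = f"PREVIOUS SUMMARY:\n{summary}\n\nCHUNK:\n" + "\n".join(chunk)
        lines = worker(prompt).splitlines()
        # Assumed convention: "EVENT:" lines go to memory, the rest is the running summary.
        memory.add([ln[len("EVENT:"):].strip() for ln in lines if ln.startswith("EVENT:")])
        summary = "\n".join(ln for ln in lines if not ln.startswith("EVENT:")).strip()
    final_prompt = f"SUMMARY:\n{summary}\n\nTIMELINE:\n" + "\n".join(memory.events)
    return manager(final_prompt)
```

Any real model client can be wrapped as an `LLM` callable; a deterministic stub is enough to exercise the control flow without a model.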

Load-bearing premise

Sequential chunk processing by worker agents plus distillation into EHRMem preserves a comprehensive timeline without critical information loss or introduction of hallucinations that would invalidate downstream risk predictions.

What would settle it

A controlled experiment that inserts known critical clinical events into full EHR records, then checks whether those events are omitted from the distilled EHRMem and whether prediction accuracy falls compared with direct full-context baselines.
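One way to score such an experiment is to measure how many planted events survive into the distilled memory. The sketch below is illustrative only: the function names are hypothetical, and case-insensitive substring matching is a crude stand-in for whatever event-matching rule a real study would use (for example clinician review or concept-code matching).

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def planted_event_recall(planted: Iterable[str], memory_entries: Iterable[str]) -> float:
    """Fraction of planted events found as a substring of at least one memory entry."""
    planted_norm = [normalize(p) for p in planted]
    if not planted_norm:
        raise ValueError("need at least one planted event")
    entries = [normalize(m) for m in memory_entries]
    hits = sum(any(p in m for m in entries) for p in planted_norm)
    return hits / len(planted_norm)


def missing_events(planted: Iterable[str], memory_entries: Iterable[str]) -> List[str]:
    """Planted events that no memory entry mentions, for manual inspection."""
    entries = [normalize(m) for m in memory_entries]
    return [p for p in planted if not any(normalize(p) in m for m in entries)]
```

A recall well below 1.0 on planted events, paired with a drop in prediction accuracy relative to a full-context baseline, would be evidence against the load-bearing premise.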

Figures

Figures reproduced from arXiv: 2510.10454 by Jun Wen, Lucas Jing Liu, Matthew Thompson, Meliha Yetisgen, Ruth Etzioni, Sihang Zeng, Sitong Zhou, Yujuan Fu, Zixuan Yu.

**Figure 2.** Sensitivity analysis on (A) chunk size and (B) number of chunks. This reveals a fundamental trade-off. Small chunks force a long chain of iterative summarizations, risking catastrophic forgetting [50] where early, critical details are abstracted away. Conversely, large chunks shorten the chain but are susceptible to the "lost-in-the-middle" issue [11], where each worker agent fails to identify fine-grain…
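The trade-off in the Figure 2 caption follows from simple arithmetic: for a fixed record length, chain length is the record length divided by the chunk size, rounded up. The sketch below uses hypothetical token counts that are not taken from the paper.

```python
import math


def chain_length(total_tokens: int, chunk_size: int) -> int:
    """Number of worker agents needed to cover a record of `total_tokens` tokens."""
    if total_tokens < 0 or chunk_size <= 0:
        raise ValueError("total_tokens must be >= 0 and chunk_size must be > 0")
    return math.ceil(total_tokens / chunk_size)


# Hypothetical record of 160,000 tokens: smaller chunks mean more hand-offs,
# so the earliest chunk is re-summarized more times before the manager sees it.
for size in (8_000, 32_000, 80_000):
    print(size, chain_length(160_000, size))
```

Under these assumed numbers, 8,000-token chunks give a 20-agent chain, while 80,000-token chunks give only 2 agents, each reading a much longer context.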

**Figure 3.** Analysis of Traj-CoA’s behavior. (A) t-SNE plot visualizing the distribution of lung cancer…

Original abstract

Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj-CoA, a multi-agent system involving chain-of-agents for patient trajectory modeling. Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents' summary and the extracted timeline in EHRMem to make predictions. In a zero-shot one-year lung cancer risk prediction task based on five-year EHR data, Traj-CoA outperforms baselines of four categories. Analysis reveals that Traj-CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories. Implementation of Traj-CoA is available on https://github.com/zengsihang/Traj-CoA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Traj-CoA splits long EHR data into chunks for worker agents that distill events into a shared memory, then hands off to a manager agent for zero-shot lung cancer risk prediction. The abstract gives no numbers, so the outperformance claim remains unverified.

This paper's main contribution is Traj-CoA, a chain-of-agents setup where worker agents process chunks of five-year EHR data sequentially and distill key events into a shared EHRMem module. A manager agent then uses that to predict one-year lung cancer risk in a zero-shot manner. It claims to outperform several baseline categories and show clinically aligned reasoning.

The new part is applying this specific multi-agent chain with the dedicated memory distillation step to patient trajectory modeling. It builds on general LLM trajectory work but adds the chunking and memory to deal with length and noise in EHR. It does well by focusing on a practical problem in oncology prediction and making the implementation available on GitHub. That lowers the barrier for others to test or extend it.

The soft spots are around the evidence. The abstract mentions outperformance but gives no numbers, no details on the baselines or dataset, and no error bars. This makes it difficult to assess how meaningful the improvements are. The stress-test point about possible information loss or hallucinations during distillation is a real concern here. Sequential chunking could break cross-chunk dependencies, and LLM summarization might miss subtle trends or invent links. Without fidelity metrics, ablations on what ends up in EHRMem, or human review of the timelines, it's not clear if the method truly preserves the necessary history.

This work is aimed at researchers in AI for healthcare, especially those dealing with longitudinal EHR for risk prediction. Anyone looking at multi-agent LLMs for time-series medical data could find it useful. It deserves a serious referee. The approach is novel enough and the code is out, so the full paper with results should get proper review to check the claims. I would recommend sending it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Traj-CoA, a multi-agent system for patient trajectory modeling from long, noisy EHR data. Worker agents sequentially process five-year EHR records in chunks and distill critical events into a shared long-term memory module (EHRMem); a manager agent then synthesizes the distilled timeline and summaries to perform zero-shot one-year lung cancer risk prediction. The authors claim that Traj-CoA outperforms baselines from four categories and exhibits clinically aligned temporal reasoning, with code released on GitHub.

Significance. If the empirical superiority and fidelity of the distilled timeline are substantiated, the work could meaningfully advance LLM-based temporal reasoning for healthcare by offering a practical multi-agent strategy to manage extended context and noise without fine-tuning. The open-source release supports reproducibility and is a clear strength.

major comments (3)

§4 (Experiments): the central claim of outperformance over four baseline categories is presented without quantitative metrics, error bars, dataset size, cohort details, or statistical tests in the abstract or summary sections; this directly affects assessment of robustness and generalizability.
§3.2 (Method, EHRMem distillation): the description provides no quantitative fidelity metrics, ablation on memory content, or human validation of the summarized timeline, yet the central claim requires that chunked processing plus distillation preserves a comprehensive timeline without critical loss or hallucinations.
§4.3 (Ablation and analysis): absence of ablations isolating the contribution of EHRMem or testing for information loss across chunk boundaries leaves open the possibility that reported gains arise from artifacts rather than genuine trajectory modeling.
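To make the request for error bars concrete, a percentile bootstrap is one common way to attach a confidence interval to a discrimination metric. The sketch below assumes AUROC as the metric, which this page does not confirm for the paper, and uses only the Python standard library; function names and defaults are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC by pairwise comparison; ties count as 0.5. Quadratic, fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC, resampling patients with replacement."""
    auroc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lost one class; draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Reporting intervals computed this way for Traj-CoA and each baseline on the same cohort would show whether the reported gaps exceed resampling noise.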

minor comments (2)

Abstract: include at least one key quantitative result and dataset scale to make the outperformance claim concrete for readers.
§3 (Notation): clarify the exact interface between worker-agent outputs and the manager agent's input from EHRMem to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the current manuscript content and indicating revisions where they strengthen the work without misrepresenting the results.

Referee: §4 (Experiments): the central claim of outperformance over four baseline categories is presented without quantitative metrics, error bars, dataset size, cohort details, or statistical tests in the abstract or summary sections; this directly affects assessment of robustness and generalizability.

Authors: We agree that the abstract and high-level summary do not include specific numerical results. The full quantitative metrics, error bars, dataset size (five-year EHR cohort for lung cancer risk), cohort details, and statistical comparisons are reported in Section 4. To address the concern directly, we will revise the abstract to include the primary performance gains, dataset scale, and a note on statistical testing.

revision: yes

Referee: §3.2 (Method, EHRMem distillation): the description provides no quantitative fidelity metrics, ablation on memory content, or human validation of the summarized timeline, yet the central claim requires that chunked processing plus distillation preserves a comprehensive timeline without critical loss or hallucinations.

Authors: Section 3.2 describes the distillation process into EHRMem, with supporting evidence from overall task performance and the clinically aligned reasoning shown in Section 4. We acknowledge the absence of direct quantitative fidelity metrics or human validation of the distilled timeline. In revision we will add an ablation on memory content and an analysis of information retention; human validation will be added if resources permit within the revision window, otherwise noted as a limitation.

revision: partial

Referee: §4.3 (Ablation and analysis): absence of ablations isolating the contribution of EHRMem or testing for information loss across chunk boundaries leaves open the possibility that reported gains arise from artifacts rather than genuine trajectory modeling.

Authors: Section 4.3 already contains ablations on the multi-agent pipeline and temporal components. We agree that more targeted experiments isolating EHRMem and quantifying information loss at chunk boundaries would further rule out artifacts. We will expand the ablation subsection to include a direct with/without-EHRMem comparison and a chunk-boundary retention analysis using event-overlap metrics.

revision: yes
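A chunk-boundary retention analysis of the kind promised above could compare how often events near a chunk edge survive distillation versus events in the middle of a chunk. The sketch below is an assumption about what such a metric might look like, not the authors' planned analysis; event identifiers, record positions, and the `window` parameter are hypothetical.

```python
from typing import Dict, Optional, Set


def near_boundary(position: int, chunk_size: int, window: int) -> bool:
    """True if a record index lies within `window` records of a chunk edge.

    Note: the first and last records of the whole sequence are also flagged,
    even though they have no neighbouring chunk.
    """
    offset = position % chunk_size
    return offset < window or offset >= chunk_size - window


def retention_by_region(event_positions: Dict[str, int], retained: Set[str],
                        chunk_size: int, window: int) -> Dict[str, Optional[float]]:
    """Share of reference events kept in memory, split into boundary and interior events."""
    counts = {"boundary": [0, 0], "interior": [0, 0]}  # [kept, total]
    for event_id, position in event_positions.items():
        region = "boundary" if near_boundary(position, chunk_size, window) else "interior"
        counts[region][1] += 1
        counts[region][0] += event_id in retained
    return {region: (kept / total if total else None)
            for region, (kept, total) in counts.items()}
```

A markedly lower retention rate for boundary events than for interior events would point to information loss introduced by chunking itself.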

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no derivational reduction

The paper introduces a multi-agent architecture (worker agents processing EHR chunks into EHRMem, followed by manager synthesis) and evaluates it via zero-shot empirical comparison on lung cancer risk prediction against four baseline categories. No equations, fitted parameters, uniqueness theorems, or self-citation chains appear in the abstract or described method. The central claim rests on external benchmark outperformance and qualitative analysis of temporal reasoning rather than any input-to-output reduction by construction, rendering the work self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested premise that agent-based chunking plus memory distillation yields clinically aligned temporal reasoning without systematic loss or fabrication of events; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5745 in / 1069 out tokens · 46150 ms · 2026-05-21T21:23:02.995630+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: performance scales positively with context windows up to 160k tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
cs.AI 2026-05 unverdicted novelty 5.0

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Peter B Jensen, Lars J Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012.
[2] Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam Shah. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. arXiv preprint arXiv:2503.04176, 2025.
[3] Beatriz Ocaña-Tienda, Alba Eroles-Simó, Julián Pérez-Beteta, Estanislao Arana, and Víctor M Pérez-García. Growth dynamics of lung nodules: implications for classification in lung cancer screening. Cancer Imaging, 24(1):113, 2024.
[4] João Figueira Silva and Sérgio Matos. Modelling patient trajectories using multimodal information. Journal of biomedical informatics, 134:104195, 2022.
[5] Manuel Burger, Gunnar Rätsch, and Rita Kuznetsova. Multi-modal graph learning over umls knowledge graphs. In Stefan Hegselmann, Antonio Parziale, Divya Shanmugam, Shengpu Tang, Mercy Nyamewaa Asiedu, Serina Chang, Tom Hartvigsen, and Harvineet Singh, editors, Proceedings of the 3rd Machine Learning for Health Symposium, volume 225 of Proceedings of Machine...
[6] Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, and Gang Luo. Trajsurv: Learning continuous latent trajectories from electronic health records for trustworthy survival prediction, 2025.
[7] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br... Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, and Lin Yang
[8] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025.
[9] Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, and Bowen Zhou. Ultramedical: Building specialized generalists in biomedicine, 2024.
[10] Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, and Yanjun Gao. Zero-shot large language models for long clinical text summarization with temporal reasoning, 2025.
[11] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
[12] Ellen Kim, Samuel M Rubinstein, Kevin T Nead, Andrzej P Wojcieszynski, Peter E Gabriel, and Jeremy L Warner. The evolving use of electronic health records (ehr) for research. In Seminars in radiation oncology, volume 29, pages 354–361. Elsevier, 2019.
[13] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ digital medicine, 1(1):18, 2018.
[14] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Large language models for information retrieval: A survey, 2024.
[15] Yinghao Zhu, Zixiang Wang, Junyi Gao, Yuning Tong, Jingkun An, Weibin Liao, Ewen M Harrison, Liantao Ma, and Chengwei Pan. Prompting large language models for zero-shot clinical prediction with structured longitudinal electronic health record data. arXiv preprint arXiv:2402.01713, 2024.
[16] Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, ... A comprehensive survey on long context language modeling, 2025.
[17] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a... The rise and potential of large language model based agents: A survey, 2023.
[18] Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arik. Chain of agents: Large language models collaborating on long-context tasks, 2024.
[19] Gongbo Zhang, Zihan Xu, Qiao Jin, Fangyi Chen, Yilu Fang, Yi Liu, Justin F Rousseau, Ziyang Xu, Zhiyong Lu, Chunhua Weng, et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digital Medicine, 8(1):239, 2025.
[20] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents, 2025.
[21] Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism, 2017.
[22] Intae Moon, Stefan Groha, and Alexander Gusev. Survlatent ode: A neural ode based time-to-event model with competing risks for longitudinal data improves cancer-associated venous thromboembolism (vte) prediction, 2022.
[23] Asem Alaa, Erik Mayer, and Mauricio Barahona. Ice-node: Integration of clinical embeddings with neural ordinary differential equations. In Machine Learning for Healthcare Conference, pages 537–564. PMLR, 2022.
[24] Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Behrt: transformer for electronic health records. Scientific reports, 10(1):7155, 2020.
[25] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1):86, 2021.
[26] Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, Saba Zuberi, Seung Eun Yi, Maksims Volkovs, Tomi Poutanen, and Rahul G. Krishnan. Duett: Dual event time transformer for electronic health records, 2023.
[27] Davide Placido, Bo Yuan, Jessica X Hjaltelin, Chunlei Zheng, Amalie D Haue, Piotr J Chmura, Chen Yuan, Jihye Kim, Renato Umeton, Gregory Antell, et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nature medicine, 29(5):1113–1122, 2023.
[28] Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, and Arkadiusz Sitek. Foundation model of electronic medical records for adaptive risk estimation, 2025.
[29] Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Re, Sanmi Koyejo, and Nigam H. Shah. Context clues: Evaluating long context models for clinical prediction tasks on ehrs, 2025.
[30] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.
[31] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents, 2024.
[32] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.
[33] Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, and Xuanjing Huang. Longagent: Scaling language models to 128k context through multi-agent collaboration, 2024.
[34] Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. Longhealth: A question answering benchmark with long clinical documents. Journal of Healthcare Informatics Research, pages 1–17, 2025.
[35] Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, and Yixuan Yuan. A survey of llm-based agents in medicine: How far are we from baymax?, 2025.
[36] Biqing Qi, Kaiyan Zhang, Kai Tian, Haoxiang Li, Zhang-Ren Chen, Sihang Zeng, Ermo Hua, Hu Jinfang, and Bowen Zhou. Large language models as biomedical hypothesis generators: A comprehensive evaluation, 2024.
[37] Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ digital medicine, 8(1):159, 2025.
[38] Ling Yue, Sixue Xing, Jintai Chen, and Tianfan Fu. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning, 2024.
[39] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024.
[40] Matthew Lungren, MD MPH. Developing next-generation cancer care management with multi-agent orchestration, May 2025.
[41] Rumeng Li, Xun Wang, Dan Berlowitz, Jesse Mez, Honghuang Lin, and Hong Yu. Care-ad: a multi-agent large language model framework for alzheimer’s disease prediction using longitudinal clinical notes. npj Digital Medicine, 8(1):541, August 2025.
[42] Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Marzyeh Ghassemi, Michael C. Hughes, and Tristan Naumann. Mimic-extract: a data extraction, preprocessing, and representation pipeline for mimic-iii. In Proceedings of the ACM Conference on Health, Inference, and Learning, ACM CHIL ’20, page 222–235. ACM, April 2020.
[43] Jun Wen, Jue Hou, Clara-Lea Bonzel, Yihan Zhao, Victor M. Castro, Vivian S. Gainer, Dana Weisenfeld, Tianrun Cai, Yuk-Lam Ho, Vidul A. Panickan, Lauren Costa, Chuan Hong, J. Michael Gaziano, Katherine P. Liao, Junwei Lu, Kelly Cho, and Tianxi Cai. Latte: Label-efficient incident phenotyping from longitudinal electronic health records, 2023.
[44] Anthropic. Use xml tags to structure your prompts. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags, 2025. Accessed: 2025-08-16.
[45] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications, 2025.
[46] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
[47] Simon A. Lee, Anthony Wu, and Jeffrey N. Chiang. Clinical modernbert: An efficient and long context encoder for biomedical text, 2025.
[48] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
[49] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[50] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352...
[51] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder, 2025
[52] Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. Topicgpt: A prompt-based topic modeling framework, 2024
[53] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
[54] Andrew MD Wolf, Kevin C Oeffinger, Tina Ya-Chen Shih, Louise C Walter, Timothy R Church, Elizabeth TH Fontham, Elena B Elkin, Ruth D Etzioni, Carmen E Guerra, Rebecca B Perkins, et al. Screening for lung cancer: 2023 guideline update from the american cancer society. CA: A Cancer Journal for Clinicians, 74(1):50–81, 2024
[55] Matthew B Schabath and Michele L Cote. Cancer progress and priorities: lung cancer. Cancer epidemiology, biomarkers & prevention, 28(10):1563–1579, 2019
[56] Hye Seon Kang, Ah Young Shin, Chang Dong Yeo, Chan Kwon Park, Ju Sang Kim, Jin Woo Kim, Seung Joon Kim, Sang Haak Lee, and Sung Kyoung Kim. Clinical significance of anemia as a prognostic factor in non-small cell lung cancer carcinoma with activating epidermal growth factor receptor mutations. Journal of Thoracic Disease, 12(5):1895, 2020
[57] Eric A Engels. Inflammation in the development of lung cancer: epidemiological evidence. Expert review of anticancer therapy, 8(4):605–615, 2008
[58] Maria G Prado, Larry G Kessler, Margaret A Au, Hannah A Burkhardt, Monica Zigman Suchsland, Lesleigh Kowalski, Kari A Stephens, Meliha Yetisgen, Fiona M Walter, Richard D Neal, et al. Symptoms and signs of lung cancer prior to diagnosis: case–control study using electronic health records from ambulatory care within a large us-based tertiary care centre. BM...
[59] Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Junze Zhang, Yin Di, et al. Biomni: A general-purpose biomedical ai agent. bioRxiv, pages 2025–05, 2025
[60] Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: An ai agent for therapeutic reasoning across a universe of tools, 2025
[61] Kaiyan Zhang, Runze Liu, Xuekai Zhu, Kai Tian, Sihang Zeng, Guoli Jia, Yuchen Fan, Xingtai Lv, Yuxin Zuo, Che Jiang, Ziyang Liu, Jianyu Wang, Yuru Wang, Ruotong Zhao, Ermo Hua, Yibo Wang, Shijie Wang, Junqi Gao, Xinwei Long, Youbang Sun, Zhiyuan Ma, Ganqu Cui, Lei Bai, Ning Ding, Biqing Qi, and Bowen Zhou. Marti: A framework for multi-agent llm systems re...
[62] Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. Malt: Improving reasoning with multi-agent llm training, 2025
[63] Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen, Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding, Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen, Haibo Ding, Panpan Xu, and Lin Lee Cheong. A systematic survey of automatic prompt optimization techniques, 2025
[64] Biqing Qi, Kaiyan Zhang, Haoxiang Li, Kai Tian, Sihang Zeng, Zhang-Ren Chen, and Bowen Zhou. Large language models are zero shot hypothesis proposers, 2023
[65] Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at auroc and auprc under class imbalance. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 44102–44163. Curran Associates, Inc., 2024
[66] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023
[67] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023
[68] Daniel Han, Michael Han, and Unsloth team. Unsloth, 2023
[69] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020
[70] Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning, 2025
[71] Elvin S Cheng, Marianne Weber, Julia Steinberg, and Xue Qin Yu. Lung cancer risk in never-smokers: An overview of environmental and genetic factors. Chinese Journal of Cancer Research, 33(5):548, 2021
[72] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-ar...
[73] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

Appendix A. Additional Results

A.1 Dataset Description

We...

How likely is this patient to develop lung cancer within one year?

To accommodate the model's size and the long-context requirements of the task, inference was performed on two NVIDIA A100 GPUs using tensor parallelism. The implementation of Traj-CoA will be released on GitHub upon acceptance.

Appendix C. Prompts

We present the prompt templates and the query for RAG in Tables S5 through S12. 1https://...
