
arxiv: 2510.10454 · v2 · pith:YI2JELOBnew · submitted 2025-10-12 · 💻 cs.AI

Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Pith reviewed 2026-05-21 21:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords patient trajectory modeling · chain-of-agents · electronic health records · lung cancer risk prediction · multi-agent systems · temporal reasoning · zero-shot prediction · LLM healthcare applications

The pith

A chain of worker agents chunks long EHR data and distills events into shared memory to outperform baselines in zero-shot lung cancer risk prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Traj-CoA, a multi-agent system designed to model patient trajectories from lengthy and noisy electronic health records for lung cancer risk prediction. Worker agents handle sequential chunks of the data while distilling critical events into a shared long-term memory module called EHRMem, which reduces noise and maintains timeline continuity. A manager agent then combines the distilled memory with agent summaries to generate the final prediction. The system is tested in a zero-shot setting that uses five years of EHR data to forecast one-year risk, where it outperforms four categories of baselines while producing reasoning that aligns with clinical patterns.

Core claim

Traj-CoA employs a chain of worker agents to process EHR data in manageable sequential chunks, distills critical events into the shared EHRMem module to preserve a comprehensive timeline, and relies on a final manager agent to synthesize summaries and the extracted timeline for making lung cancer risk predictions, achieving stronger performance than baselines of four categories in zero-shot one-year prediction from five-year records.

What carries the argument

Chain-of-agents architecture in which worker agents sequentially process EHR chunks and distill events into EHRMem long-term memory, enabling the manager agent to perform synthesis and temporal reasoning for the prediction.
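The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_chain`, `chunk`, `toy_worker`, and `toy_manager` are hypothetical names, the body of `EHRMem` is invented for the example, and a keyword filter stands in for the LLM calls a real worker or manager agent would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: a worker maps (previous summary, chunk) to
# (updated summary, distilled events); a manager maps (summary, timeline)
# to a prediction string.
Worker = Callable[[str, List[str]], Tuple[str, List[str]]]
Manager = Callable[[str, str], str]


@dataclass
class EHRMem:
    """Shared long-term memory holding distilled, date-prefixed events."""
    events: List[str] = field(default_factory=list)

    def add(self, new_events: List[str]) -> None:
        for event in new_events:
            if event not in self.events:  # skip exact duplicates
                self.events.append(event)

    def timeline(self) -> str:
        # ISO date prefixes make lexical order equal chronological order.
        return "\n".join(sorted(self.events))


def chunk(records: List[str], size: int) -> List[List[str]]:
    """Split records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_chain(records: List[str], size: int, worker: Worker, manager: Manager) -> str:
    """Pass a running summary along the chain while filling shared memory."""
    mem, summary = EHRMem(), ""
    for part in chunk(records, size):
        summary, events = worker(summary, part)
        mem.add(events)
    return manager(summary, mem.timeline())


def toy_worker(prev_summary: str, part: List[str]) -> Tuple[str, List[str]]:
    # Stand-in for an LLM call: keep records that mention a keyword.
    keywords = ("nodule", "smok", "cough")
    events = [r for r in part if any(k in r.lower() for k in keywords)]
    summary = f"{prev_summary} | {len(part)} records, {len(events)} notable".strip(" |")
    return summary, events


def toy_manager(summary: str, timeline: str) -> str:
    # Stand-in for the manager agent's synthesis step.
    return "elevated" if "nodule" in timeline else "baseline"


records = [
    "2019-03-01 annual physical",
    "2020-06-12 current smoker, 30 pack-years",
    "2021-01-05 flu vaccine",
    "2022-09-20 CT: 8 mm lung nodule",
]
prediction = run_chain(records, 2, toy_worker, toy_manager)
print(prediction)
```

The point of the sketch is the data flow: the running summary is overwritten at each step of the chain, whereas entries written to the shared memory persist until the manager reads them.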

Load-bearing premise

Sequential chunk processing by worker agents plus distillation into EHRMem preserves a comprehensive timeline without critical information loss or introduction of hallucinations that would invalidate downstream risk predictions.

What would settle it

A controlled experiment that inserts known critical clinical events into full EHR records, then checks whether those events are omitted from the distilled EHRMem and whether prediction accuracy falls compared with direct full-context baselines.
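One way such a probe could be scored is sketched below. It is a hedged illustration only: `insert_sentinels`, `omission_rate`, and `toy_distill` are hypothetical names, the records are invented, and the keyword-based distiller is a placeholder for whatever actually produces the distilled memory.

```python
from typing import Dict, List


def insert_sentinels(records: List[str], sentinels: Dict[int, str]) -> List[str]:
    """Return a copy of `records` with each sentinel inserted before the
    original record at its index (an index equal to len(records) appends)."""
    out = list(records)
    # Insert from the highest index down so lower positions stay valid.
    for pos in sorted(sentinels, reverse=True):
        out.insert(pos, sentinels[pos])
    return out


def omission_rate(memory: List[str], sentinels: List[str]) -> float:
    """Fraction of planted events that are absent from the distilled memory."""
    if not sentinels:
        raise ValueError("need at least one sentinel event")
    missed = [s for s in sentinels if s not in memory]
    return len(missed) / len(sentinels)


def toy_distill(records: List[str]) -> List[str]:
    # Placeholder distiller (assumption): keeps only records mentioning a nodule.
    return [r for r in records if "nodule" in r]


records = [
    "2020-01-10 routine visit",
    "2021-04-02 flu vaccine",
    "2022-08-19 lipid panel",
]
sentinels = {
    1: "2020-11-30 SENTINEL hemoptysis",
    3: "2023-02-14 SENTINEL 12 mm nodule",
}
probed = insert_sentinels(records, sentinels)
memory = toy_distill(probed)
rate = omission_rate(memory, list(sentinels.values()))
print(rate)
```

Exact string matching is the simplest possible check; a distiller that paraphrases events would need a looser match, which this sketch does not attempt.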

Figures

Figures reproduced from arXiv: 2510.10454 by Jun Wen, Lucas Jing Liu, Matthew Thompson, Meliha Yetisgen, Ruth Etzioni, Sihang Zeng, Sitong Zhou, Yujuan Fu, Zixuan Yu.

Figure 1: Traj-CoA architecture consisting of a chain of worker agents, a manager agent, and …
Figure 2: Sensitivity analysis on (A) chunk size and (B) number of chunks. This reveals a fundamental trade-off. Small chunks force a long chain of iterative summarizations, risking catastrophic forgetting [50] where early, critical details are abstracted away. Conversely, large chunks shorten the chain but are susceptible to the "lost-in-the-middle" issue [11], where each worker agent fails to identify fine-grain…
Figure 3: Analysis of Traj-CoA’s behavior. (A) t-SNE plot visualizing the distribution of lung cancer …
Abstract

Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj-CoA, a multi-agent system involving chain-of-agents for patient trajectory modeling. Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents' summary and the extracted timeline in EHRMem to make predictions. In a zero-shot one-year lung cancer risk prediction task based on five-year EHR data, Traj-CoA outperforms baselines of four categories. Analysis reveals that Traj-CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories. Implementation of Traj-CoA is available on https://github.com/zengsihang/Traj-CoA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Traj-CoA, a multi-agent system for patient trajectory modeling from long, noisy EHR data. Worker agents sequentially process five-year EHR records in chunks and distill critical events into a shared long-term memory module (EHRMem); a manager agent then synthesizes the distilled timeline and summaries to perform zero-shot one-year lung cancer risk prediction. The authors claim that Traj-CoA outperforms baselines from four categories and exhibits clinically aligned temporal reasoning, with code released on GitHub.

Significance. If the empirical superiority and fidelity of the distilled timeline are substantiated, the work could meaningfully advance LLM-based temporal reasoning for healthcare by offering a practical multi-agent strategy to manage extended context and noise without fine-tuning. The open-source release supports reproducibility and is a clear strength.

major comments (3)
  1. [§4] §4 (Experiments): the central claim of outperformance over four baseline categories is presented without quantitative metrics, error bars, dataset size, cohort details, or statistical tests in the abstract or summary sections; this directly affects assessment of robustness and generalizability.
  2. [§3.2] §3.2 (Method, EHRMem distillation): the description provides no quantitative fidelity metrics, ablation on memory content, or human validation of the summarized timeline, yet the central claim requires that chunked processing plus distillation preserves a comprehensive timeline without critical loss or hallucinations.
  3. [§4.3] §4.3 (Ablation and analysis): absence of ablations isolating the contribution of EHRMem or testing for information loss across chunk boundaries leaves open the possibility that reported gains arise from artifacts rather than genuine trajectory modeling.
minor comments (2)
  1. [Abstract] Abstract: include at least one key quantitative result and dataset scale to make the outperformance claim concrete for readers.
  2. [§3] Notation: clarify the exact interface between worker-agent outputs and the manager agent's input from EHRMem to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the current manuscript content and indicating revisions where they strengthen the work without misrepresenting the results.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): the central claim of outperformance over four baseline categories is presented without quantitative metrics, error bars, dataset size, cohort details, or statistical tests in the abstract or summary sections; this directly affects assessment of robustness and generalizability.

    Authors: We agree that the abstract and high-level summary do not include specific numerical results. The full quantitative metrics, error bars, dataset size (five-year EHR cohort for lung cancer risk), cohort details, and statistical comparisons are reported in Section 4. To address the concern directly, we will revise the abstract to include the primary performance gains, dataset scale, and a note on statistical testing.

    revision: yes

  2. Referee: [§3.2] §3.2 (Method, EHRMem distillation): the description provides no quantitative fidelity metrics, ablation on memory content, or human validation of the summarized timeline, yet the central claim requires that chunked processing plus distillation preserves a comprehensive timeline without critical loss or hallucinations.

    Authors: Section 3.2 describes the distillation process into EHRMem, with supporting evidence from overall task performance and the clinically aligned reasoning shown in Section 4. We acknowledge the absence of direct quantitative fidelity metrics or human validation of the distilled timeline. In revision we will add an ablation on memory content and an analysis of information retention; human validation will be added if resources permit within the revision window, otherwise noted as a limitation.

    revision: partial

  3. Referee: [§4.3] §4.3 (Ablation and analysis): absence of ablations isolating the contribution of EHRMem or testing for information loss across chunk boundaries leaves open the possibility that reported gains arise from artifacts rather than genuine trajectory modeling.

    Authors: Section 4.3 already contains ablations on the multi-agent pipeline and temporal components. We agree that more targeted experiments isolating EHRMem and quantifying information loss at chunk boundaries would further rule out artifacts. We will expand the ablation subsection to include a direct with/without-EHRMem comparison and a chunk-boundary retention analysis using event-overlap metrics.

    revision: yes
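A chunk-boundary retention analysis of the kind mentioned in the third response could take roughly the following shape. This is an assumption-laden sketch rather than the authors' metric: `jaccard`, `boundary_retention`, and the `margin` parameter are hypothetical, events are compared by exact identity, and positions at the start or end of any chunk count as "near an edge".

```python
import math
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap between two event collections (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def boundary_retention(
    events: List[str], retained: Set[str], chunk_size: int, margin: int = 1
) -> Tuple[float, float]:
    """Retention rate for events within `margin` positions of a chunk edge
    versus events in the chunk interior, given fixed-size chunking."""
    near, interior = [], []
    for i, event in enumerate(events):
        offset = i % chunk_size
        at_edge = offset < margin or offset >= chunk_size - margin
        (near if at_edge else interior).append(event in retained)

    def rate(flags: List[bool]) -> float:
        return sum(flags) / len(flags) if flags else math.nan

    return rate(near), rate(interior)


events = [f"e{i}" for i in range(8)]          # source events in record order
retained = {"e0", "e1", "e2", "e5", "e6"}     # events found in distilled memory
near_rate, interior_rate = boundary_retention(events, retained, chunk_size=4)
overlap = jaccard(set(events), retained)
print(near_rate, interior_rate, overlap)
```

A gap between the two rates would suggest that events close to chunk edges are lost more often than interior ones; equal rates would suggest the chunking itself is not the source of loss.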

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no derivational reduction


The paper introduces a multi-agent architecture (worker agents processing EHR chunks into EHRMem, followed by manager synthesis) and evaluates it via zero-shot empirical comparison on lung cancer risk prediction against four baseline categories. No equations, fitted parameters, uniqueness theorems, or self-citation chains appear in the abstract or described method. The central claim rests on external benchmark outperformance and qualitative analysis of temporal reasoning rather than any input-to-output reduction by construction, rendering the work self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested premise that agent-based chunking plus memory distillation yields clinically aligned temporal reasoning without systematic loss or fabrication of events; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5745 in / 1069 out tokens · 46150 ms · 2026-05-21T21:23:02.995630+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012

    Peter B Jensen, Lars J Jensen, and Søren Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405, 2012

  2. [2]

    Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. arXiv preprint arXiv:2503.04176, 2025

    Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam Shah. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. arXiv preprint arXiv:2503.04176, 2025

  3. [3]

    Growth dynamics of lung nodules: implications for classification in lung cancer screening. Cancer Imaging, 24(1):113, 2024

    Beatriz Ocaña-Tienda, Alba Eroles-Simó, Julián Pérez-Beteta, Estanislao Arana, and Víctor M Pérez-García. Growth dynamics of lung nodules: implications for classification in lung cancer screening. Cancer Imaging, 24(1):113, 2024

  4. [4]

    Modelling patient trajectories using multimodal information. Journal of biomedical informatics, 134:104195, 2022

    João Figueira Silva and Sérgio Matos. Modelling patient trajectories using multimodal information. Journal of biomedical informatics, 134:104195, 2022

  5. [5]

    Multi-modal graph learning over umls knowledge graphs

    Manuel Burger, Gunnar Rätsch, and Rita Kuznetsova. Multi-modal graph learning over umls knowledge graphs. In Stefan Hegselmann, Antonio Parziale, Divya Shanmugam, Shengpu Tang, Mercy Nyamewaa Asiedu, Serina Chang, Tom Hartvigsen, and Harvineet Singh, editors, Proceedings of the 3rd Machine Learning for Health Symposium, volume 225 of Proceedings of Machine...

  6. [6]

    Trajsurv: Learning continuous latent trajectories from electronic health records for trustworthy survival prediction, 2025

    Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, and Gang Luo. Trajsurv: Learning continuous latent trajectories from electronic health records for trustworthy survival prediction, 2025

  7. [7]

    Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, and Lin Yang

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...

  8. [8]

    Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

  9. [9]

    Ultramedical: Building specialized generalists in biomedicine, 2024

    Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, and Bowen Zhou. Ultramedical: Building specialized generalists in biomedicine, 2024

  10. [10]

    Zero-shot large language models for long clinical text summarization with temporal reasoning, 2025

    Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, and Yanjun Gao. Zero-shot large language models for long clinical text summarization with temporal reasoning, 2025

  11. [11]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

  12. [12]

    The evolving use of electronic health records (ehr) for research

    Ellen Kim, Samuel M Rubinstein, Kevin T Nead, Andrzej P Wojcieszynski, Peter E Gabriel, and Jeremy L Warner. The evolving use of electronic health records (ehr) for research. In Seminars in radiation oncology, volume 29, pages 354–361. Elsevier, 2019

  13. [13]

    Scalable and accurate deep learning with electronic health records. NPJ digital medicine, 1(1):18, 2018

    Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ digital medicine, 1(1):18, 2018

  14. [14]

    Large language models for information retrieval: A survey, 2024

    Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Large language models for information retrieval: A survey, 2024

  15. [15]

    Prompting large language models for zero-shot clinical prediction with structured longitudinal electronic health record data. arXiv preprint arXiv:2402.01713, 2024

    Yinghao Zhu, Zixiang Wang, Junyi Gao, Yuning Tong, Jingkun An, Weibin Liao, Ewen M Harrison, Liantao Ma, and Chengwei Pan. Prompting large language models for zero-shot clinical prediction with structured longitudinal electronic health record data. arXiv preprint arXiv:2402.01713, 2024

  16. [16]

    A comprehensive survey on long context language modeling, 2025

    Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, ...

  17. [17]

    The rise and potential of large language model based agents: A survey, 2023

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a...

  18. [18]

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arik. Chain of agents: Large language models collaborating on long-context tasks, 2024

  19. [19]

    Leveraging long context in retrieval augmented language models for medical question answering. npj Digital Medicine, 8(1):239, 2025

    Gongbo Zhang, Zihan Xu, Qiao Jin, Fangyi Chen, Yilu Fang, Yi Liu, Justin F Rousseau, Ziyang Xu, Zhiyong Lu, Chunhua Weng, et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digital Medicine, 8(1):239, 2025

  20. [20]

    Agent hospital: A simulacrum of hospital with evolvable medical agents, 2025

    Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents, 2025

  21. [21]

    Kulas, Andy Schuetz, Walter F

    Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism, 2017

  22. [22]

    Survlatent ode: A neural ode based time-to-event model with competing risks for longitudinal data improves cancer-associated venous thromboembolism (vte) prediction, 2022

    Intae Moon, Stefan Groha, and Alexander Gusev. Survlatent ode: A neural ode based time-to-event model with competing risks for longitudinal data improves cancer-associated venous thromboembolism (vte) prediction, 2022

  23. [23]

    Ice-node: Integration of clinical embeddings with neural ordinary differential equations

    Asem Alaa, Erik Mayer, and Mauricio Barahona. Ice-node: Integration of clinical embeddings with neural ordinary differential equations. In Machine Learning for Healthcare Conference, pages 537–564. PMLR, 2022

  24. [24]

    Behrt: transformer for electronic health records. Scientific reports, 10(1):7155, 2020

    Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Behrt: transformer for electronic health records. Scientific reports, 10(1):7155, 2020

  25. [25]

    Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

    Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1):86, 2021

  26. [26]

    Krishnan

    Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, Saba Zuberi, Seung Eun Yi, Maksims Volkovs, Tomi Poutanen, and Rahul G. Krishnan. Duett: Dual event time transformer for electronic health records, 2023

  27. [27]

    A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nature medicine, 29(5):1113–1122, 2023

    Davide Placido, Bo Yuan, Jessica X Hjaltelin, Chunlei Zheng, Amalie D Haue, Piotr J Chmura, Chen Yuan, Jihye Kim, Renato Umeton, Gregory Antell, et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nature medicine, 29(5):1113–1122, 2023

  28. [28]

    Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B

    Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, and Arkadiusz Sitek. Foundation model of electronic medical records for adaptive risk estimation, 2025

  29. [29]

    Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Re, Sanmi Koyejo, and Nigam H. Shah. Context clues: Evaluating long context models for clinical prediction tasks on ehrs, 2025

  30. [30]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

  31. [31]

    A survey on the memory mechanism of large language model based agents, 2024

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents, 2024

  32. [32]

    Retrieval-augmented generation for large language models: A survey, 2024

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024

  33. [33]

    Longagent: Scaling language models to 128k context through multi-agent collaboration, 2024

    Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, and Xuanjing Huang. Longagent: Scaling language models to 128k context through multi-agent collaboration, 2024

  34. [34]

    Longhealth: A question answering benchmark with long clinical documents. Journal of Healthcare Informatics Research, pages 1–17, 2025

    Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. Longhealth: A question answering benchmark with long clinical documents. Journal of Healthcare Informatics Research, pages 1–17, 2025

  35. [35]

    A survey of llm-based agents in medicine: How far are we from baymax?, 2025

    Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, and Yixuan Yuan. A survey of llm-based agents in medicine: How far are we from baymax?, 2025

  36. [36]

    Large language models as biomedical hypothesis generators: A comprehensive evaluation, 2024

    Biqing Qi, Kaiyan Zhang, Kai Tian, Haoxiang Li, Zhang-Ren Chen, Sihang Zeng, Ermo Hua, Hu Jinfang, and Bowen Zhou. Large language models as biomedical hypothesis generators: A comprehensive evaluation, 2024

  37. [37]

    Enhancing diagnostic capability with multi-agents conversational large language models. NPJ digital medicine, 8(1):159, 2025

    Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ digital medicine, 8(1):159, 2025

  38. [38]

    Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning, 2024

    Ling Yue, Sixue Xing, Jintai Chen, and Tianfan Fu. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning, 2024

  39. [39]

    Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024

    Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024

  40. [40]

    Developing next-generation cancer care management with multi-agent orchestration, May 2025

    MD MPH, Matthew Lungren. Developing next-generation cancer care management with multi-agent orchestration, May 2025

  41. [41]

    Care-ad: a multi-agent large language model framework for alzheimer’s disease prediction using longitudinal clinical notes. npj Digital Medicine, 8(1):541, August 2025

    Rumeng Li, Xun Wang, Dan Berlowitz, Jesse Mez, Honghuang Lin, and Hong Yu. Care-ad: a multi-agent large language model framework for alzheimer’s disease prediction using longitudinal clinical notes. npj Digital Medicine, 8(1):541, August 2025

  42. [42]

    Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Marzyeh Ghassemi, Michael C. Hughes, and Tristan Naumann. Mimic-extract: a data extraction, preprocessing, and representation pipeline for mimic-iii. In Proceedings of the ACM Conference on Health, Inference, and Learning, ACM CHIL ’20, page 222–235. ACM, April 2020

  43. [43]

    Castro, Vivian S

    Jun Wen, Jue Hou, Clara-Lea Bonzel, Yihan Zhao, Victor M. Castro, Vivian S. Gainer, Dana Weisenfeld, Tianrun Cai, Yuk-Lam Ho, Vidul A. Panickan, Lauren Costa, Chuan Hong, J. Michael Gaziano, Katherine P. Liao, Junwei Lu, Kelly Cho, and Tianxi Cai. Latte: Label-efficient incident phenotyping from longitudinal electronic health records, 2023

  44. [44]

    Use xml tags to structure your prompts

    Anthropic. Use xml tags to structure your prompts. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags, 2025. Accessed: 2025-08-16

  45. [45]

    A systematic survey of prompt engineering in large language models: Techniques and applications, 2025

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications, 2025

  46. [46]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016

  47. [47]

    Lee, Anthony Wu, and Jeffrey N

    Simon A. Lee, Anthony Wu, and Jeffrey N. Chiang. Clinical modernbert: An efficient and long context encoder for biomedical text, 2025

  48. [48]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  49. [49]

    Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

  50. [50]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352...

  51. [51]

    Morris, Brandon Duderstadt, and Andriy Mulyar

    Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder, 2025

  52. [52]

    Topicgpt: A prompt-based topic modeling framework, 2024

    Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. Topicgpt: A prompt-based topic modeling framework, 2024

  53. [53]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  54. [54]

    Screening for lung cancer: 2023 guideline update from the american cancer society. CA: A Cancer Journal for Clinicians, 74(1):50–81, 2024

    Andrew MD Wolf, Kevin C Oeffinger, Tina Ya-Chen Shih, Louise C Walter, Timothy R Church, Elizabeth TH Fontham, Elena B Elkin, Ruth D Etzioni, Carmen E Guerra, Rebecca B Perkins, et al. Screening for lung cancer: 2023 guideline update from the american cancer society. CA: A Cancer Journal for Clinicians, 74(1):50–81, 2024

  55. [55]

    Cancer progress and priorities: lung cancer. Cancer epidemiology, biomarkers & prevention, 28(10):1563–1579, 2019

    Matthew B Schabath and Michele L Cote. Cancer progress and priorities: lung cancer. Cancer epidemiology, biomarkers & prevention, 28(10):1563–1579, 2019

  56. [56]

    Hye Seon Kang, Ah Young Shin, Chang Dong Yeo, Chan Kwon Park, Ju Sang Kim, Jin Woo Kim, Seung Joon Kim, Sang Haak Lee, and Sung Kyoung Kim. Clinical significance of anemia as a prognostic factor in non-small cell lung cancer carcinoma with activating epidermal growth factor receptor mutations. Journal of Thoracic Disease, 12(5):1895, 2020

  57. [57]

    Inflammation in the development of lung cancer: epidemiological evidence

    Eric A Engels. Inflammation in the development of lung cancer: epidemiological evidence. Expert review of anticancer therapy, 8(4):605–615, 2008

  58. [58]

    Maria G Prado, Larry G Kessler, Margaret A Au, Hannah A Burkhardt, Monica Zigman Suchsland, Lesleigh Kowalski, Kari A Stephens, Meliha Yetisgen, Fiona M Walter, Richard D Neal, et al. Symptoms and signs of lung cancer prior to diagnosis: case–control study using electronic health records from ambulatory care within a large us-based tertiary care centre. BM...

  59. [59]

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Junze Zhang, Yin Di, et al. Biomni: A general-purpose biomedical ai agent. bioRxiv, pages 2025–05, 2025

  60. [60]

    Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: An ai agent for therapeutic reasoning across a universe of tools, 2025

  61. [61]

    Kaiyan Zhang, Runze Liu, Xuekai Zhu, Kai Tian, Sihang Zeng, Guoli Jia, Yuchen Fan, Xingtai Lv, Yuxin Zuo, Che Jiang, Ziyang Liu, Jianyu Wang, Yuru Wang, Ruotong Zhao, Ermo Hua, Yibo Wang, Shijie Wang, Junqi Gao, Xinwei Long, Youbang Sun, Zhiyuan Ma, Ganqu Cui, Lei Bai, Ning Ding, Biqing Qi, and Bowen Zhou. Marti: A framework for multi-agent llm systems re...

  62. [62]

    Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. Malt: Improving reasoning with multi-agent llm training, 2025

  63. [63]

    Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen, Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding, Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen, Haibo Ding, Panpan Xu, and Lin Lee Cheong. A systematic survey of automatic prompt optimization techniques, 2025

  64. [64]

    Biqing Qi, Kaiyan Zhang, Haoxiang Li, Kai Tian, Sihang Zeng, Zhang-Ren Chen, and Bowen Zhou. Large language models are zero shot hypothesis proposers, 2023

  65. [65]

    Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at auroc and auprc under class imbalance. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 44102–44163. Curran Associates, Inc., 2024

  66. [66]

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023

  67. [67]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36:10088–10115, 2023

  68. [68]

    Daniel Han, Michael Han, and Unsloth team. Unsloth, 2023

  69. [69]

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  70. [70]

    Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning, 2025

  71. [71]

    Elvin S Cheng, Marianne Weber, Julia Steinberg, and Xue Qin Yu. Lung cancer risk in never-smokers: An overview of environmental and genetic factors. Chinese Journal of Cancer Research, 33(5):548, 2021

  72. [72]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-ar...

  73. [73]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

    Appendix A. Additional Results

    A.1 Dataset Description

    We...

  74. [74]

    To accommodate the model’s size and the long-context requirements of the task, inference was performed on two NVIDIA A100 GPUs, leveraging tensor parallelism. Implementation of Traj-CoA will be released on GitHub upon acceptance.

    Appendix C. Prompts

    We present the prompt templates and the query for RAG in Tables S5, S6, S7, S8, S9, S10, S11, and S12.

    1 https://...