arxiv: 2605.20176 · v1 · pith:QYY7KX2Dnew · submitted 2026-05-19 · 💻 cs.CL

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords agentic systems · clinical reasoning · multimodal evidence · raw EHR navigation · medical imaging tools · trajectory distillation · clinical decision support

The pith

ClinSeekAgent lets LLMs actively gather and refine multimodal clinical evidence from raw data sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ClinSeekAgent, a framework that lets AI agents gather evidence for clinical queries dynamically instead of receiving pre-curated inputs. The agent queries medical knowledge bases, explores raw EHRs, and invokes imaging tools, updating its hypotheses as new details arrive. The paper reports gains from this active approach on both text-only and multimodal clinical tasks; for example, Claude Opus 4.6 rises from 47.5 to 62.6 F1 on multimodal tasks. A sympathetic reader would care because real clinical work involves sorting through messy, unorganized sources rather than clean test sets. The distillation step further shows that successful agent behaviors can be transferred to smaller open models.

Core claim

ClinSeekAgent is an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. It serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models.

What carries the argument

ClinSeekAgent, an agentic framework that plans tool calls across heterogeneous sources, invokes queries on raw EHRs and imaging data, and iteratively refines clinical hypotheses.
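
The plan, invoke, refine cycle described above can be pictured with a minimal sketch. This is not the authors' implementation: the names `SeekState`, `seek_evidence`, `toy_plan`, `toy_refine`, and the three tool keys are hypothetical, and the planner and refiner are simple stand-ins for what would be LLM calls in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SeekState:
    """Running state of one evidence-seeking episode (hypothetical structure)."""
    query: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (tool name, finding)
    hypothesis: str = "undetermined"


Tool = Callable[[str], str]  # a tool maps a request string to a finding string
Planner = Callable[[SeekState], Optional[Tuple[str, str]]]
Refiner = Callable[[SeekState], str]


def seek_evidence(query: str, tools: Dict[str, Tool], plan: Planner,
                  refine: Refiner, max_steps: int = 5) -> SeekState:
    """Plan a tool call, invoke it, record the finding, refine the hypothesis."""
    state = SeekState(query=query)
    for _ in range(max_steps):
        step = plan(state)  # (tool name, request), or None when the planner stops
        if step is None:
            break
        tool_name, request = step
        finding = tools[tool_name](request)
        state.evidence.append((tool_name, finding))
        state.hypothesis = refine(state)
    return state


# Toy stand-ins (assumptions for illustration only, not real clinical tools).
def toy_plan(state: SeekState) -> Optional[Tuple[str, str]]:
    used = {name for name, _ in state.evidence}
    for name in ("knowledge_base", "ehr", "imaging"):
        if name not in used:
            return name, state.query
    return None  # every source has been consulted once


def toy_refine(state: SeekState) -> str:
    return f"hypothesis after {len(state.evidence)} evidence item(s)"


toy_tools: Dict[str, Tool] = {
    "knowledge_base": lambda request: "guideline snippet",
    "ehr": lambda request: "lab rows",
    "imaging": lambda request: "CXR finding",
}

result = seek_evidence("Is the patient decompensating?", toy_tools, toy_plan, toy_refine)
print(result.evidence)
print(result.hypothesis)
```

The step budget (`max_steps`) is one simple guard against an agent that never decides it has enough evidence; the paper's actual stopping rule is not described in this summary.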

If this is right

  • Raises Claude Opus 4.6 from 47.5 to 62.6 F1 on multimodal tasks.
  • Improves overall F1 on text-only EHR tasks for Claude Opus 4.6 (60.0 to 63.2) and MiniMax M2.5 (43.1 to 47.3), with positive risk-prediction gains in 7 of 9 evaluated host models.
  • Distilled ClinSeek-35B-A3B reaches 34.0 average F1 on AgentEHR-Bench, +11.9 over its Qwen3.5-35B-A3B baseline.
  • All evaluated models improve across the three CXR-related task groups.
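
As a quick sanity check on the headline numbers, the short sketch below recomputes the absolute F1 gains from the before/after scores quoted above. The helper `gain` and the dictionary labels are illustrative names; the implied Qwen3.5-35B-A3B baseline is derived here from 34.0 and +11.9 and is not a figure quoted directly on this page.

```python
def gain(before: float, after: float) -> float:
    """Absolute improvement in F1 points, rounded to one decimal place."""
    return round(after - before, 1)


# (before, after) F1 scores as quoted in the bullets above.
reported = {
    "Claude Opus 4.6, multimodal": (47.5, 62.6),
    "Claude Opus 4.6, text-only EHR": (60.0, 63.2),
}

gains = {name: gain(before, after) for name, (before, after) in reported.items()}

# Derived, not reported: baseline implied by "34.0 average F1, +11.9 over baseline".
implied_qwen_baseline = round(34.0 - 11.9, 1)

for name, value in gains.items():
    print(f"{name}: +{value}")
print(f"Implied Qwen3.5-35B-A3B baseline: {implied_qwen_baseline}")
```

The multimodal gain is roughly five times the text-only gain for the same host model, which is consistent with the page's emphasis on the CXR-related task groups.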

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same active-seeking loop could be tested in other data-rich domains that currently rely on pre-filtered inputs.
  • Distillation success suggests a route for embedding agentic skills into models that can run inside hospital firewalls.
  • Real deployment would likely need extra checks for privacy and error recovery that go beyond the current benchmarks.

Load-bearing premise

The base LLMs can reliably plan, invoke tools on raw heterogeneous clinical data, and refine hypotheses without introducing critical errors or requiring human correction during the seeking process.

What would settle it

A controlled run on raw patient cases where the agent repeatedly retrieves irrelevant records or produces unsafe hypothesis updates would show that automated seeking does not deliver the claimed gains.

Figures

Figures reproduced from arXiv: 2605.20176 by Cihang Xie, Haoqin Tu, Hardy Chen, Juncheng Wu, Letian Zhang, Yuhan Wang, Yuyin Zhou, Zijun Wang.

Figure 1: ClinSeekAgent Overview. ClinSeekAgent is an automated agentic evidence-seeking pipeline. It interacts with heterogeneous data sources to enable multimodal evidence seeking for clinical decision support. Compared with prior user-curated context settings, ClinSeekAgent is more flexible by acquiring richer information and knowledge from diverse tools.
Figure 2: Performance–model size comparison on AgentEHR-Bench. ClinSeek-35B-A3B achieves strong performance among open-source models while maintaining a favorable parameter-efficiency tradeoff.
Figure 3: Visualization of fine-grained text-based subtasks. We categorize the tasks in EHR-Bench into fine-grained groups and report the performance gains brought by ClinSeekAgent pipelines. Green indicates an advantage over the Curated Input baseline, while red indicates a disadvantage.
Figure 4: Comparison between the ClinSeekAgent pipeline and the Curated Input baseline.
Figure 5: Tool-call distribution before and after SFT training.
Figure 6: Comparison between the ClinSeekAgent pipeline and the Curated Input baseline. Our pipeline fails to locate critical patient information on a decision-making prediction task.
Figure 7: A case of Medmod Decompensation. Pages 1 and 2.
Figure 8: A case of Medmod Phenotyping. Pages 1 and 2.
Figure 9: A case of Length of Stay. Pages 1 and 2.
Original abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ClinSeekAgent, an agentic framework enabling LLMs to actively seek, plan, and synthesize multimodal evidence from raw heterogeneous clinical sources (EHRs, knowledge bases, imaging tools) rather than consuming pre-curated inputs. It constructs ClinSeek-Bench to compare curated-input reasoning against automated evidence-seeking, reports F1 gains across models (e.g., Claude Opus 4.6 improves from 47.5 to 62.6 on multimodal tasks), and demonstrates distillation of agent trajectories into ClinSeek-35B-A3B, which improves by 11.9 F1 points over its Qwen3.5-35B-A3B baseline on AgentEHR-Bench.

Significance. If the central results hold under rigorous verification, the work meaningfully advances agentic clinical AI by addressing the gap between curated-evidence assumptions and real-world data-seeking workflows. The distillation pipeline that transfers high-quality trajectories to compact open-source models is a concrete strength that could improve accessibility and reproducibility.

major comments (3)
  1. [Evaluation section (ClinSeek-Bench results)] The reported F1 improvements (e.g., +15.1 for Claude Opus 4.6 on multimodal tasks and +11.9 after distillation) are presented without statistical significance tests, confidence intervals, or variance across runs. This information is required to determine whether the gains are robust or could be explained by sampling variability in the new benchmark.
  2. [Agent trajectory and error analysis] No quantitative analysis of agent failure modes, tool-call accuracy, or error rates during planning and evidence seeking on raw heterogeneous data is provided. Because the central claim depends on the assumption that base LLMs can autonomously invoke tools and refine hypotheses without critical uncorrected errors, the absence of such diagnostics leaves open the possibility that observed gains arise from host-model compensation rather than genuine evidence acquisition.
  3. [Benchmark construction (ClinSeek-Bench)] ClinSeek-Bench is constructed by the authors; additional details are needed on how the automated-seeking split is isolated from benchmark design choices to rule out circularity or leakage that could inflate the reported gains relative to the curated-input baseline.
minor comments (2)
  1. [Abstract] The abstract states positive risk-prediction gains in 7 out of 9 models but does not list the specific models or tasks; adding this enumeration would improve clarity.
  2. [Introduction] The relationship between ClinSeek-Bench and the existing AgentEHR-Bench could be stated more explicitly in the introduction to help readers distinguish the two evaluation settings.
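
The tool-call diagnostic requested in major comment 2 could be computed along the following lines. This is a sketch under stated assumptions: it presumes each trajectory has been annotated with a reference set of tool calls that were actually needed, which the paper is not known to provide, and the function name and the tool names other than `sql_query` are hypothetical.

```python
from typing import Iterable, Set, Tuple


def tool_call_precision_recall(
    trajectories: Iterable[Tuple[Set[str], Set[str]]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (calls made, calls needed) pairs."""
    tp = fp = fn = 0
    for made, needed in trajectories:
        tp += len(made & needed)   # needed calls the agent actually made
        fp += len(made - needed)   # calls made that were not needed
        fn += len(needed - made)   # needed calls the agent missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy annotated trajectories (illustrative data, not from the paper).
toy = [
    ({"sql_query", "cxr_classifier"}, {"sql_query", "cxr_classifier"}),
    ({"sql_query", "kb_search"}, {"sql_query"}),
    ({"kb_search"}, {"sql_query", "kb_search"}),
]

precision, recall = tool_call_precision_recall(toy)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision would point to wasted or irrelevant retrieval, while low recall would point to premature termination; separating the two is what lets a reader judge whether gains come from evidence acquisition or from host-model compensation.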

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate where the manuscript has been revised to incorporate the suggestions.

Point-by-point responses
  1. Referee: [Evaluation section (ClinSeek-Bench results)] The reported F1 improvements (e.g., +15.1 for Claude Opus 4.6 on multimodal tasks and +11.9 after distillation) are presented without statistical significance tests, confidence intervals, or variance across runs. This information is required to determine whether the gains are robust or could be explained by sampling variability in the new benchmark.

    Authors: We agree that statistical significance testing and variance estimates are necessary to establish robustness. In the revised manuscript we have added bootstrap resampling (1,000 iterations) to compute 95% confidence intervals for all F1 scores and performed McNemar’s tests for paired comparisons between curated-input and automated-seeking conditions. The key gains remain statistically significant (p < 0.05). These results and the corresponding methodology are now reported in the Evaluation section and in updated tables. revision: yes

  2. Referee: [Agent trajectory and error analysis] No quantitative analysis of agent failure modes, tool-call accuracy, or error rates during planning and evidence seeking on raw heterogeneous data is provided. Because the central claim depends on the assumption that base LLMs can autonomously invoke tools and refine hypotheses without critical uncorrected errors, the absence of such diagnostics leaves open the possibility that observed gains arise from host-model compensation rather than genuine evidence acquisition.

    Authors: We acknowledge that a quantitative breakdown of agent behavior strengthens the central claim. We have performed an additional analysis of the collected trajectories, reporting tool-call precision and recall, the fraction of planning steps that led to successful evidence retrieval, and the distribution of failure modes (e.g., incorrect tool selection, premature termination, or unaddressed contradictions). The revised manuscript includes a new subsection that correlates these metrics with performance gains, showing that improvements are predominantly associated with successful evidence acquisition rather than host-model compensation alone. revision: yes

  3. Referee: [Benchmark construction (ClinSeek-Bench)] ClinSeek-Bench is constructed by the authors; additional details are needed on how the automated-seeking split is isolated from benchmark design choices to rule out circularity or leakage that could inflate the reported gains relative to the curated-input baseline.

    Authors: We agree that explicit safeguards against circularity and leakage must be documented. The revised benchmark-construction section now details the independent selection criteria for queries and raw data sources, the temporal and patient-level partitioning used to separate the automated-seeking and curated-input splits, and the fact that no agent-generated outputs or performance signals were used during benchmark curation. We also confirm that the same clinical queries underlie both conditions, with the only difference being the presence or absence of pre-curated evidence. revision: yes
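
The statistics named in the first response (bootstrap confidence intervals for F1 and McNemar's test on paired conditions) can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' analysis code: it treats the task as binary classification, uses a percentile bootstrap, uses the exact binomial form of McNemar's test, and runs on toy labels invented for the example.

```python
import math
import random
from typing import List, Sequence, Tuple


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true: Sequence[int], y_pred: Sequence[int],
                    n_boot: int = 1000, alpha: float = 0.05,
                    seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    low = scores[int((alpha / 2) * n_boot)]
    high = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return low, high


def mcnemar_exact_p(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two conditions never disagree
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired predictions on the same ten queries (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
curated = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
seeking = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]

ci_low, ci_high = bootstrap_f1_ci(y_true, seeking)
p_value = mcnemar_exact_p(
    [t == p for t, p in zip(y_true, curated)],
    [t == p for t, p in zip(y_true, seeking)],
)
print(f"curated F1={f1(y_true, curated):.3f}, seeking F1={f1(y_true, seeking):.3f}")
print(f"seeking 95% CI=({ci_low:.3f}, {ci_high:.3f}), McNemar p={p_value:.3f}")
```

McNemar's test fits this setting because, as the third response confirms, both conditions answer the same clinical queries, so each query yields a paired correct/incorrect outcome. On a sample this small the test is underpowered, which is part of why the referee asks for variance estimates alongside point gains.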

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly constructed benchmark remains independent of method definition

full rationale

The paper presents an empirical agentic framework and reports F1 improvements on ClinSeek-Bench (curated-input vs. automated-seeking splits) and AgentEHR-Bench. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make the reported gains equivalent to the inputs by construction. The benchmark construction and trajectory distillation are described as validation steps rather than tautological re-labelings of the same data. The central claims rest on observable performance deltas across host models, which are falsifiable outside the paper's own design choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests primarily on the domain assumption that frontier LLMs already possess adequate tool-use and planning ability; no free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: Frontier LLMs can perform iterative planning, tool invocation, and hypothesis refinement on raw clinical data sources.
    The entire agentic loop depends on this capability of the underlying models.
invented entities (1)
  • ClinSeekAgent (no independent evidence)
    Purpose: automated framework for dynamic multimodal evidence seeking.
    The proposed system itself; no independent falsifiable evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5920 in / 1220 out tokens · 61652 ms · 2026-05-20T05:14:22.342463+00:00 · methodology
