arxiv: 2606.07549 · v1 · pith:EL24EC3Anew · submitted 2026-05-18 · 💻 cs.AI · cs.MA

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

Pith reviewed 2026-06-30 18:37 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords pathology · multimodal reasoning · agentic workflow · evidence adjudication · VQA hallucinations · tool reliability · Beta-Bernoulli model

The pith

PathoSage separates evidence retrieval, collection, and adjudication into distinct stages to reduce hallucinations and conflicts in pathology multimodal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PathoSage as a three-stage agentic framework that keeps knowledge retrieval, evidence collection, and final judgment apart rather than merging them in one context. Its Structured Evidence Deliberation step reviews tool outputs independently, identifies conflicts, and produces the answer in a fresh context to limit bias from earlier information. A training-free Beta-Bernoulli system tracks each tool's long-term reliability and builds similarity-weighted priors for later decisions. These choices target the hallucination problem in end-to-end pathology MLLMs and the evidence-contamination problem in existing agent workflows. The reported experiments indicate the design yields higher accuracy than strong MLLM and agentic baselines on patch-level tasks.

Core claim

PathoSage is a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. The system adds a training-free Beta-Bernoulli experience model with continuous credit assignment that estimates long-term tool reliability and supplies similarity-weighted priors for future tool selection. Experiments demonstrate that this structure mitigates VQA hallucinations and classifier disagreement.

What carries the argument

Structured Evidence Deliberation, which evaluates tool outputs independently, analyzes conflicts, and issues the judgment in a fresh context, together with the Beta-Bernoulli experience system that models tool reliability via continuous credit assignment.
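
As a rough illustration of the second ingredient, a Beta-Bernoulli reliability tracker with fractional ("continuous") credit might look like the Python sketch below. The class name, the uniform Beta(1, 1) starting prior, the credit range [0, 1], and the similarity-weighting rule are all assumptions made for illustration; the abstract does not give the paper's actual update rule.

```python
from dataclasses import dataclass


@dataclass
class ToolReliability:
    """Beta(alpha, beta) belief over one tool's success probability (hypothetical)."""
    alpha: float = 1.0  # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, credit: float) -> None:
        """Add fractional credit in [0, 1] instead of a hard 0/1 outcome."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean, used here as the reliability estimate."""
        return self.alpha / (self.alpha + self.beta)


def similarity_weighted_prior(history, similarities):
    """Pool past credits, weighting each by its similarity to the new case.

    history: credits in [0, 1] observed on earlier cases
    similarities: non-negative weights, one per earlier case (assumed given)
    """
    prior = ToolReliability()
    for credit, weight in zip(history, similarities):
        prior.alpha += weight * credit
        prior.beta += weight * (1.0 - credit)
    return prior


tool = ToolReliability()
for credit in (1.0, 0.5, 0.0, 1.0):
    tool.update(credit)
print(round(tool.mean, 3))  # 0.583

prior = similarity_weighted_prior([1.0, 0.0], [2.0, 0.5])
print(round(prior.mean, 3))  # 0.667
```

Because the update is a running sum of pseudo-counts, no gradient step or retraining is involved, which is one reading of "training-free" in the summary above.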

If this is right

  • Explicit conflict analysis between heterogeneous tool outputs reduces disagreement among classifiers.
  • Judgment in a fresh context limits propagation of early misleading evidence into the final answer.
  • The Beta-Bernoulli model supplies reliability-weighted priors that improve tool selection on later cases without retraining.
  • The overall workflow outperforms both end-to-end pathology MLLMs and merged-context agent baselines on patch-level VQA tasks.
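
The first two points can be pictured with a small Python sketch of the adjudication step: tool outputs are grouped by the answer they support, and the final prompt is built only from the question and the structured evidence, not from the collection-stage transcript. The `Evidence` fields, the function names, the tool labels, and the prompt layout are hypothetical; the paper's prompts and data structures are not available in this summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str     # tool name (made-up labels in the example below)
    answer: str     # the option this tool supports
    rationale: str  # short justification returned by the tool


def find_conflicts(evidence):
    """Group tools by the answer they support; more than one group means conflict."""
    by_answer = defaultdict(list)
    for ev in evidence:
        by_answer[ev.answer].append(ev.source)
    return dict(by_answer)


def build_fresh_context(question, evidence):
    """Assemble the adjudication prompt from the question and evidence only.

    The collection-stage transcript is deliberately not an argument, so earlier
    intermediate reasoning cannot leak into the final judgment.
    """
    groups = find_conflicts(evidence)
    lines = [f"Question: {question}", "Evidence:"]
    for ev in evidence:
        lines.append(f"- [{ev.source}] supports '{ev.answer}': {ev.rationale}")
    status = "conflict" if len(groups) > 1 else "agreement"
    lines.append(f"Status: {status} across {len(groups)} candidate answer(s)")
    return "\n".join(lines)


evidence = [
    Evidence("classifier_a", "A", "highest class score"),
    Evidence("vqa_model", "B", "free-text description matches option B"),
]
prompt = build_fresh_context("Which option best describes the patch?", evidence)
print(prompt)
```

The resulting string would then be sent to the host model as a new conversation; that call is omitted here because it depends on the model API in use.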

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of adjudication from collection could be tested on other multimodal domains that face conflicting tool outputs, such as radiology report generation.
  • Continuous credit assignment might allow the agent to adapt tool preferences across an entire hospital dataset without additional labeled validation sets.
  • If the experience system generalizes, similar reliability tracking could be added to non-pathology agent workflows that already use multiple vision-language tools.

Load-bearing premise

That performing conflict analysis and final judgment in a fresh context sufficiently reduces anchoring bias and that the Beta-Bernoulli experience system supplies accurate long-term tool reliability estimates without task-specific tuning or validation data.

What would settle it

A controlled experiment in which the same tool outputs are fed once to a shared-context agent and once to PathoSage's fresh-context adjudication, with no difference in hallucination rate or final accuracy, would falsify the claimed benefit of the separated adjudication stage.
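
One way such a paired comparison could be scored is an exact McNemar test on per-case correctness, which looks only at the cases where the two setups disagree. The Python sketch below is an assumption about how the experiment might be analyzed, not a procedure taken from the paper, and the correctness flags are made-up placeholders.

```python
from math import comb


def mcnemar_exact(shared_correct, fresh_correct):
    """Exact two-sided McNemar test on paired per-case correctness flags."""
    if len(shared_correct) != len(fresh_correct):
        raise ValueError("paired lists must have equal length")
    # b: shared-context right, fresh-context wrong; c: the reverse
    b = sum(s and not f for s, f in zip(shared_correct, fresh_correct))
    c = sum(f and not s for s, f in zip(shared_correct, fresh_correct))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5) tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Placeholder outcomes for six cases fed the same tool outputs in both setups
shared = [True, False, False, True, False, False]
fresh = [True, True, True, True, False, True]
print(mcnemar_exact(shared, fresh))  # (0, 3, 0.25)
```

A large p-value on a well-powered sample would count against the claimed benefit; with only a handful of discordant cases, as in this toy input, the test cannot separate the two setups either way.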

Figures

Figures reproduced from arXiv: 2606.07549 by Bob Zhang, Bo Li, Chengyang Zhang, Hong Bu, Jiancheng Lv, Mengran Li, Wenchuan Zhang, Yuhao Yi.

Figure 1: Comparison of the (a) "black box" VLM approach and (b) our proposed PathoSage for ...
Figure 2: Overview of the PathoSage framework, which performs pathology multimodal reasoning ...
Figure 3: The RAG retriever pipeline. Per-candidate queries retrieve textbook pages from a Milvus ...
Figure 4: A representative example where PathoSage correctly identifies the answer, while three ...
Figure 5: Ablation on the host VLM, including Qwen3-VL-32B, GPT-5.4, and Gemini-3-Pro. The ...
Original abstract

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces PathoSage, a three-stage agentic framework for patch-level pathology multimodal reasoning that separates knowledge retrieval, evidence collection, and Structured Evidence Deliberation (which performs independent evidence evaluation, conflict analysis, and final judgment in a fresh context). It adds a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and build similarity-weighted priors. The central claim is that this mitigates VQA hallucinations and classifier disagreement, outperforming pathology MLLMs and agentic baselines.

Significance. If the empirical claims hold after proper validation, the explicit adjudication stage and the experience system could offer a practical, training-free route to more robust multimodal agents in domains with conflicting evidence sources, such as computational pathology.

major comments (3)
  1. Abstract: the assertion that PathoSage 'effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines' supplies no metrics, baselines, datasets, statistical tests, or controls, so the central empirical claim cannot be evaluated from the manuscript.
  2. Abstract (Structured Evidence Deliberation description): the premise that moving conflict analysis and final judgment to a fresh context 'reduce[s] anchoring bias' is stated without any quantitative measure, ablation, or bias metric, leaving the load-bearing mechanism for bias reduction unsupported.
  3. Abstract (Beta-Bernoulli experience system): the claim that the training-free Beta-Bernoulli model with continuous credit assignment yields accurate long-term tool reliability estimates is presented without validation against ground-truth tool accuracy, held-out data, or analysis of how success/failure signals are obtained, raising a direct threat to the reliability-aware component credited for outperformance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for the abstract to better reflect the empirical grounding in the full manuscript. We address each point below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: Abstract: the assertion that PathoSage 'effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines' supplies no metrics, baselines, datasets, statistical tests, or controls, so the central empirical claim cannot be evaluated from the manuscript.

    Authors: The abstract provides a high-level summary of the results. The full manuscript details the evaluation in the Experiments section, including specific datasets (e.g., patch-level VQA benchmarks), baselines (pathology MLLMs and agentic systems), metrics (accuracy, hallucination rates), and statistical comparisons. We will revise the abstract to incorporate a concise reference to the key evaluation setup and main quantitative gains.
    Revision: yes

  2. Referee: Abstract (Structured Evidence Deliberation description): the premise that moving conflict analysis and final judgment to a fresh context 'reduce[s] anchoring bias' is stated without any quantitative measure, ablation, or bias metric, leaving the load-bearing mechanism for bias reduction unsupported.

    Authors: The fresh-context design is motivated by reducing context contamination and anchoring; the manuscript supports this via ablations isolating the Structured Evidence Deliberation stage and showing performance differences. While no dedicated bias metric is introduced, the empirical gains from the stage are quantified. We will revise the abstract to frame the bias reduction as a design rationale validated by the ablations reported in the main text.
    Revision: partial

  3. Referee: Abstract (Beta-Bernoulli experience system): the claim that the training-free Beta-Bernoulli model with continuous credit assignment yields accurate long-term tool reliability estimates is presented without validation against ground-truth tool accuracy, held-out data, or analysis of how success/failure signals are obtained, raising a direct threat to the reliability-aware component credited for outperformance.

    Authors: The experience system is assessed in the manuscript by tracking estimated reliabilities against observed tool success rates across queries, with success/failure derived from downstream task outcomes. Longitudinal analysis and comparisons appear in the Experiments and supplementary sections. We will revise the abstract to note that the reliability estimates are validated through these performance-tracking experiments.
    Revision: yes
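
The kind of check discussed in the third exchange, comparing an estimated reliability with an observed success rate, could be sketched in Python as follows. The function name, the Beta(1, 1) prior, the binary outcome encoding, and the split into earlier and held-out cases are assumptions for illustration; the manuscript's actual validation protocol is not visible from the abstract.

```python
def reliability_gap(earlier_outcomes, heldout_outcomes, alpha0=1.0, beta0=1.0):
    """Compare a Beta posterior mean fitted on earlier cases with held-out accuracy.

    Outcomes are 1 (tool output judged correct) or 0 (judged incorrect).
    Returns (estimate, observed, absolute gap).
    """
    if not heldout_outcomes:
        raise ValueError("need at least one held-out outcome")
    successes = sum(earlier_outcomes)
    estimate = (alpha0 + successes) / (alpha0 + beta0 + len(earlier_outcomes))
    observed = sum(heldout_outcomes) / len(heldout_outcomes)
    return estimate, observed, abs(estimate - observed)


# Placeholder outcomes for one tool, not data from the paper
estimate, observed, gap = reliability_gap([1, 1, 0, 1], [1, 0, 1, 1, 0])
print(round(estimate, 3), round(observed, 3), round(gap, 3))
```

A small gap on held-out cases with independent labels would speak to the referee's concern; a small gap measured on the same downstream outcomes that produced the credit signal would not.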

Circularity Check

0 steps flagged

No circularity in derivation chain

The provided manuscript text contains no equations, derivations, or mathematical steps that reduce a claimed result to its inputs by construction. The framework is described at a high level with a training-free Beta-Bernoulli component, but no self-definitional mappings, fitted parameters renamed as predictions, or load-bearing self-citations are exhibited. Claims rest on experimental outperformance rather than any internal reduction that would trigger the enumerated circularity patterns. This is the expected outcome for a system-description paper without visible formal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities; ledger left empty.

Reference graph

Works this paper leans on

78 extracted references · 31 canonical work pages · 9 internal anchors

  1. [1]

    Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos

    Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024

  2. [2]

    A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

  3. [3]

    Pa-llava: A large language-vision assistant for human pathology image understanding

    Dawei Dai, Yuanhui Zhang, Long Xu, Qianlan Yang, Xiaojing Shen, Shuyin Xia, and Guoyin Wang. Pa-llava: A large language-vision assistant for human pathology image understanding. InProceedings of the International Conference on Bioinformatics and Biomedicine, pages 3138–3143, 2024

  4. [4]

    Pathgen-1.6m: 1.6 million pathol- ogy image-text pairs generation through multi-agent collaboration

    Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Jingxiong Li, Xuan Gong, XINHENG LYU, Tao Lin, and Lin Yang. Pathgen-1.6m: 1.6 million pathol- ogy image-text pairs generation through multi-agent collaboration. InProceedings of the International Conference on Learning Representations, 2025

  5. [5]

    Patho-r1: A multimodal reinforcement learning-based pathol- ogy expert reasoner

    Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu. Patho-r1: A multimodal reinforcement learning-based pathol- ogy expert reasoner. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28418–28426, 2026

  6. [6]

    A versatile pathology co-pilot via reasoning enhanced multimodal large language model.arXiv preprint arXiv:2507.17303, 2025

    Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, et al. A versatile pathology co-pilot via reasoning enhanced multimodal large language model.arXiv preprint arXiv:2507.17303, 2025

  7. [7]

    TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

    Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Heli Qi, Rui Yang, Simeng Han, Tinglin Huang, Fang Wu, et al. Teampath: Building multimodal pathology experts with reasoning ai copilots.arXiv preprint arXiv:2511.17652, 2025. 10

  8. [8]

    Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images

    Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images. InPro- ceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 546–556. Springer, 2024

  9. [9]

    Wsi-vqa: Interpreting whole slide images by generative visual question answering

    Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, and Lin Yang. Wsi-vqa: Interpreting whole slide images by generative visual question answering. InProceedings of the European Conference on Computer Vision, pages 401–417. Springer, 2024

  10. [10]

    Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction

    Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction. InProceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 189–199. Springer, 2024

  11. [11]

    Slidechat: A large vision-language assistant for whole-slide pathology image understanding

    Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5134–5143, 2025

  12. [12]

    Wsi-llava: A multimodal large language model for whole slide image

    Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, et al. Wsi-llava: A multimodal large language model for whole slide image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22718–22727, 2025

  13. [13]

    Pathalign: A vision-language model for whole slide images in histopathology.arXiv preprint arXiv:2406.19578, 2024

    Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S Corrado, et al. Pathalign: A vision-language model for whole slide images in histopathology.arXiv preprint arXiv:2406.19578, 2024

  14. [14]

    Alpaca: Adapting llama for pathology context analysis to enable slide-level question answering.medRxiv, pages 2025–04, 2025

    Zeyu Gao, Kai He, Weiheng Su, Ines P Machado, William McGough, Mercedes Jimenez-Linan, Brian Rous, Chunbao Wang, Chengzu Li, Xiaobo Pang, et al. Alpaca: Adapting llama for pathology context analysis to enable slide-level question answering.medRxiv, pages 2025–04, 2025

  15. [15]

    A multimodal whole-slide foundation model for pathology.Nature Medicine, pages 1–13, 2025

    Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology.Nature Medicine, pages 1–13, 2025

  16. [16]

    Pathreasoner-r1: Instilling structured reasoning into pathology vision-language model via knowledge-guided policy optimization.arXiv preprint arXiv:2601.21617, 2026

    Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, and Yongbing Zhang. Pathreasoner-r1: Instilling structured reasoning into pathology vision-language model via knowledge-guided policy optimization.arXiv preprint arXiv:2601.21617, 2026

  17. [17]

    Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology

    Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, and Lin Yang. Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10360–10371, 2025

  18. [18]

    Polypath: Adapting a large multimodal model for multi-slide pathology report generation.Modern Pathology, page 100886, 2025

    Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S Corrado, Dale R Webster, Shravya Shetty, Shruthi Prabhakara, et al. Polypath: Adapting a large multimodal model for multi-slide pathology report generation.Modern Pathology, page 100886, 2025

  19. [19]

    Generating dermatopathology reports from gigapixel whole slide images with histogpt.Nature Communications, 16(1):4886, 2025

    Manuel Tran, Paul Schmidle, Ruifeng Ray Guo, Sophia J Wagner, Valentin Koch, Valerio Lupperger, Brenna Novotny, Dennis H Murphree, Heather D Hardway, Marina D’Amato, et al. Generating dermatopathology reports from gigapixel whole slide images with histogpt.Nature Communications, 16(1):4886, 2025

  20. [20]

    Prism2: Unlocking multi-modal general pathology ai with clinical dialogue.arXiv preprint arXiv:2506.13063, 2025

    Eugene V orontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H Bernhard, Ran A Godrich, Juan A Retamero, et al. Prism2: Unlocking multi-modal general pathology ai with clinical dialogue.arXiv preprint arXiv:2506.13063, 2025

  21. [21]

    Pathfound: An agentic multimodal model activating evidence-seeking pathological diagnosis.arXiv preprint arXiv:2512.23545, 2025

    Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, and Xiaofan Zhang. Pathfound: An agentic multimodal model activating evidence-seeking pathological diagnosis.arXiv preprint arXiv:2512.23545, 2025. 11

  22. [22]

    Hepato-llava: An expert mllm with sparse topo-pack attention for hepatocellular pathology analysis on whole slide images.arXiv preprint arXiv:2602.19424, 2026

    Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, and Zhanyu Ma. Hepato-llava: An expert mllm with sparse topo-pack attention for hepatocellular pathology analysis on whole slide images.arXiv preprint arXiv:2602.19424, 2026

  23. [23]

    Pathasst: A generative foundation ai assistant towards artificial gen- eral intelligence of pathology

    Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. Pathasst: A generative foundation ai assistant towards artificial gen- eral intelligence of pathology. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5034–5042, 2024

  24. [24]

    Patho-agenticrag: towards multimodal agentic retrieval- augmented generation for pathology vlms via reinforcement learning

    Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu. Patho-agenticrag: towards multimodal agentic retrieval- augmented generation for pathology vlms via reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29921–29929, 2026

  25. [25]

    Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

    Chengkuan Chen, Luca L Weishaupt, Drew FK Williamson, Richard J Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y Lu, et al. Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

  26. [26]

    Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

    Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, and Zhi Huang. Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

  27. [27]

    Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy

    Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G Elmore, Ranjay Krishna, and Linda Shapiro. Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 234...

  28. [28]

    Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

    Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, and Yongbing Zhang. Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

  29. [29]

    Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, and Lin Yang. Cpathagent: An agent-based foundation model for interpretable high-resolution pathology image analysis mimicking pathologists’ diagnostic logic.arXiv preprint arXiv:2505.20510, 2025

  30. [30]

    Survagent: Hierarchical cot-enhanced case banking and dichotomy-based multi-agent system for multimodal survival prediction.arXiv preprint arXiv:2511.16635, 2025

    Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, and Linlin Shen. Survagent: Hierarchical cot-enhanced case banking and dichotomy-based multi-agent system for multimodal survival prediction.arXiv preprint arXiv:2511.16635, 2025

  31. [31]

    Wsi-agents: A collaborative multi-agent system for multi-modal whole slide image analysis.arXiv preprint arXiv:2507.14680, 2025

    Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, and Linlin Shen. Wsi-agents: A collaborative multi-agent system for multi-modal whole slide image analysis.arXiv preprint arXiv:2507.14680, 2025

  32. [32]

    A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025

    Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, et al. A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025

  33. [33]

    Mmnavagent: Multi-magnification wsi navigation agent for clinically consistent whole-slide analysis.arXiv preprint arXiv:2603.02079, 2026

    Zhengyang Xu, Han Li, Jingsong Liu, Linrui Xie, Xun Ma, Xin You, Shihui Zu, Ayako Ito, Xinyu Hao, Hongming Xu, et al. Mmnavagent: Multi-magnification wsi navigation agent for clinically consistent whole-slide analysis.arXiv preprint arXiv:2603.02079, 2026

  34. [34]

    Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology.Nature Cancer, 6(8):1337–1349, 2025

    Dyke Ferber, Omar SM El Nahhas, Georg Wölflein, Isabella C Wiest, Jan Clusmann, Marie- Elisabeth Leßmann, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jäger, et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology.Nature Cancer, 6(8):1337–1349, 2025. 12

  35. [35]

    SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery

    Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Tanvir Hasan, Zenghan Wang, Suman Ghosh, Sandeep Manandhar, Shikhar Shiromani, Twisha Shah, Naoto Tokuyama, et al. Sage: Agentic framework for interpretable and clinically translatable computational pathology biomarker discovery.arXiv preprint arXiv:2602.00953, 2026

  36. [36]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InProceedings of the International Conference on Learning Representations, 2022

  37. [37]

    A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

    Andrew Zhang, Tong Ding, Sophia J Wagner, Caiwei Tian, Ming Y Lu, Rowland Pettit, Joshua E Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, et al. A multimodal and temporal foundation model for virtual patient representations at healthcare system scale.arXiv preprint arXiv:2604.18570, 2026

  38. [38]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539– 68551, 2023

  39. [39]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  40. [40]

    Toolmem: Enhancing multimodal agents with learnable tool capability memory,

    Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, and Zora Zhiruo Wang. Toolmem: Enhancing multimodal agents with learnable tool capability memory.arXiv preprint arXiv:2510.06664, 2025

  41. [41]

    XSkill: Continual Learning from Experience and Skills in Multimodal Agents

    Guanyu Jiang, Zhaochen Su, Xiaoye Qu, et al. Xskill: Continual learning from experience and skills in multimodal agents.arXiv preprint arXiv:2603.12056, 2026

  42. [42]

    MemOS: A Memory OS for AI System

    Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al. Memos: A memory os for ai system.arXiv preprint arXiv:2507.03724, 2025

  43. [43]

    MemVerse: Multimodal Memory for Lifelong Learning Agents

    Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents.arXiv preprint arXiv:2512.03627, 2025

  44. [44]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  45. [45]

    Colpali: Efficient document retrieval with vision language models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, CELINE HUDELOT, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. In Proceedings of the International Conference on Learning Representations, 2025

  46. [46]

    Milvus: A purpose-built vector data management system

    Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. Milvus: A purpose-built vector data management system. InProceedings of the International Conference on Management of Data, pages 2614–2627, 2021

  47. [47]

    Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2018

  48. [48]

    Cognitive bias in decision-making with llms

    Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. Cognitive bias in decision-making with llms. InFindings of the Association for Computational Linguistics, pages 12640–12653, 2024

  49. [49]

    Application of artificial intelligence and digital tools in cancer pathology.The Lancet Digital Health, 7(10), 2025

    Lawrence A Shaktah, Zunamys I Carrero, Katherine Jane Hewitt, Marco Gustav, Matthew Cecchini, Sebastian Foersch, Sabina Berezowska, and Jakob Nikolas Kather. Application of artificial intelligence and digital tools in cancer pathology.The Lancet Digital Health, 7(10), 2025. 13

  50. [50]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In Proceedings of the International Conference on Learning Representations, 2024

  51. [51]

    A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024

  52. [52]

    A visual–language foundation model for pathology image analysis using medical twitter. Nature Medicine, 29(9):2307–2316, 2023

    Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature Medicine, 29(9):2307–2316, 2023

  53. [53]

    Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36:37995–38017, 2023

    Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36:37995–38017, 2023

  54. [54]

    A vision–language foundation model for precision oncology. Nature, 638(8051):769–778, 2025

    Jinxi Xiang, Xiyue Wang, Xiaoming Zhang, Yinghua Xi, Feyisope Eweje, Yijiang Chen, Yuchen Li, Colin Bergstrom, Matthew Gopaulchan, Ted Kim, et al. A vision–language foundation model for precision oncology. Nature, 638(8051):769–778, 2025

  55. [55]

    Interpretable vision-language survival analysis with ordinal inductive bias for computational pathology

    Pei Liu, Luping Ji, Jiaxiang Gou, Bo Fu, and Mao Ye. Interpretable vision-language survival analysis with ordinal inductive bias for computational pathology. In Proceedings of the International Conference on Learning Representations, 2025

  56. [56]

    Foundation models in medical image analysis: A systematic review and meta-analysis. ArXiv, abs/2510.16973, 2025

    Praveenbalaji Rajendran, Mojtaba Safari, Wenfeng He, Mingzhe Hu, Shansong Wang, Jun Zhou, and Xiaofeng Yang. Foundation models in medical image analysis: A systematic review and meta-analysis. ArXiv, abs/2510.16973, 2025

  57. [57]

    Aligning clinical needs and ai capabilities: a survey on llms for medical reasoning. Authorea Preprints, 2025

    Qi Peng, Jiatong Li, Sirui Huang, Yiyang Jiang, Kaisong Gong, Ronger Ding, Shijie Ye, Changmeng Zheng, Xiao-Yong Wei, and Qing Li. Aligning clinical needs and ai capabilities: a survey on llms for medical reasoning. Authorea Preprints, 2025

  58. [58]

    A pathology foundation model for cancer diagnosis and prognosis prediction. Nature, 634(8035):970–978, 2024

    Xiyue Wang, Junhan Zhao, Eliana Marostica, Wei Yuan, Jietian Jin, Jiayu Zhang, Ruijiang Li, Hongping Tang, Kanran Wang, Yu Li, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature, 634(8035):970–978, 2024

  59. [59]

    Homie: Histopathology omni-modal embedding for pathology composed retrieval. arXiv preprint arXiv:2502.07221, 2025

    Qifeng Zhou, Wenliang Zhong, Thao M Dang, Hehuan Ma, Saiyang Na, Yuzhi Guo, and Junzhou Huang. Homie: Histopathology omni-modal embedding for pathology composed retrieval. arXiv preprint arXiv:2502.07221, 2025

  60. [60]

    The landscape of computational pathology agents from static analysis to autonomous diagnostic workflows. Authorea Preprints, 2026

    Jingyun Chen, Fengchun Liu, Songhan Jiang, and Linghan Cai. The landscape of computational pathology agents from static analysis to autonomous diagnostic workflows. Authorea Preprints, 2026

  61. [61]

    Octomed: Data recipes for state-of-the-art multimodal medical reasoning. arXiv preprint arXiv:2511.23269, 2025

    Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, and Hoifung Poon. Octomed: Data recipes for state-of-the-art multimodal medical reasoning. arXiv preprint arXiv:2511.23269, 2025

  62. [62]

    Pulsemind: A multi-modal medical model for real-world clinical diagnosis. arXiv preprint arXiv:2601.07344, 2026

    Jiao Xu, Junwei Liu, Jiangwei Lao, Qi Zhu, Yunpeng Zhao, Congyun Jin, Shinan Liu, Zhihong Lu, Lihe Zhang, Xin Chen, et al. Pulsemind: A multi-modal medical model for real-world clinical diagnosis. arXiv preprint arXiv:2601.07344, 2026

  63. [63]

    Cx-mind: a pioneering multimodal large language model for interleaved reasoning in chest x-ray via curriculum-guided reinforcement learning. Information Fusion, page 104027, 2025

    Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, et al. Cx-mind: a pioneering multimodal large language model for interleaved reasoning in chest x-ray via curriculum-guided reinforcement learning. Information Fusion, page 104027, 2025

  64. [64]

    Bridging the gap in ophthalmic ai: Mm-retinal-reason dataset and ophthareason model toward dynamic multimodal reasoning. arXiv preprint arXiv:2508.16129, 2025

    Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, and Yi Zhou. Bridging the gap in ophthalmic ai: Mm-retinal-reason dataset and ophthareason model toward dynamic multimodal reasoning. arXiv preprint arXiv:2508.16129, 2025

  65. [65]

    Wenchuan Zhang, Shuwan Zhang, Jiadi You, Fengling Li, Xiaoyan Wu, Xunxi Lu, Qingjie Lv, Juan Huang, Yuhao Yi, and Hong Bu. Attention-based multimodal fusion transformer for predicting the efficacy of neoadjuvant therapy in breast cancer: a cross-institutional retrospective study. Breast Cancer Research, 2025

  66. [66]

    Analysis of thompson sampling for the multi-armed bandit problem

    Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012

  67. [67]

    When does rl help medical vlms? disentangling vision, sft, and rl gains. arXiv preprint arXiv:2603.01301, 2026

    Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, and Babak Taati. When does rl help medical vlms? disentangling vision, sft, and rl gains. arXiv preprint arXiv:2603.01301, 2026

  68. [68]

    Nunext: Reframing nucleus detection as next-point detection. arXiv preprint arXiv:2603.07098, 2026

    Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, and Cheng Tan. Nunext: Reframing nucleus detection as next-point detection. arXiv preprint arXiv:2603.07098, 2026

  69. [69]

    Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo, and Zhen Lei. Anatomy-r1: Enhancing anatomy reasoning in multimodal large language models via anatomical similarity curriculum and group diversity augmentation. arXiv preprint arXiv:2512.19512, 2025

  70. [70]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020

  71. [71]

    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025

  72. [72]

    Omnimed-vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimed-vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024

  73. [73]

    Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology

    Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, et al. Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. In Proceedings of the European Conference on Computer Vision, pages 56–73. Springer, 2024

  74. [74]

    Introducing-gpt-5-4

    OpenAI. Introducing-gpt-5-4. 2026

  75. [75]

    Gemini 3 pro - model card

    Google. Gemini 3 pro - model card. 2025

  76. [76]

    Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis, 58:101563, 2019

    Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis, 58:101563, 2019

  77. [77]

    Cellvit++: Energy-efficient and adaptive cell segmentation and classification using foundation models. Computer Methods and Programs in Biomedicine, page 109206, 2026

    Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, and Jens Kleesiek. Cellvit++: Energy-efficient and adaptive cell segmentation and classification using foundation models. Computer Methods and Programs in Biomedicine, page 109206, 2026

    A Additional Experiments and Discussion

    A.1 Experience Accumulation on PathMMU-val

    Table A.1: Quant...

  78. [78]

    [Tool Selection Guide]

    and Quilt-VQA [53]. These tasks require precise identification of specific pathological features (e.g., the presence of necrosis, specific cellular arrangements, or staining characteristics). We collect closed-ended questions from their respective test splits, resulting in 3,362 questions for Path-VQA and 343 questions for Quilt-VQA.

    B.2 Implementation De...