Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents

Peng Huang, Yiming Wang, Yineng Chen, Liangqiao Gui, Hui Guo, Bo Peng, Shu Hu, Xi Wu, Tsao Connie, Hongtu Zhu, Balakrishnan Prabhakaran, Xin Wang

arxiv: 2604.06347 · v1 · submitted 2026-04-07 · 💻 cs.CV

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: echocardiography · visual language models · actor-verifier framework · trustworthy reasoning · medical video analysis · structured intermediate representation · cardiovascular diagnosis

The pith

EchoTrust uses an actor-verifier framework built around a structured evidence representation to make echocardiographic visual language model reasoning more reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EchoTrust to fix problems in visual language models that analyze echocardiographic videos for heart disease screening. Existing approaches map video and question directly to an answer, which makes them prone to template shortcuts and false explanations. EchoTrust instead generates a structured intermediate representation that separate actor and verifier roles then examine. This separation is meant to yield more dependable and understandable outputs for clinical use. A reader would care because echocardiography is central to cardiovascular diagnosis yet remains difficult to automate reliably due to heart motion and varying image views.

Core claim

EchoTrust is an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. It produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.

What carries the argument

The structured intermediate representation generated from echocardiographic video and question, which is then processed separately by an actor role and a verifier role instead of a single direct mapping.
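The two-role flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text here does not specify the representation or the role interfaces, so `Evidence`, `ActorOutput`, `Verdict`, `actor`, `verifier`, and the 0.5 confidence threshold are all hypothetical. The actor's output fields (evidence, preliminary answer, confidence score) follow the Figure 1 caption, and a stub stands in for any VLM call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """One inspectable observation (hypothetical schema)."""
    view: str      # echocardiographic view label, e.g. "A4C"
    finding: str   # free-text observation
    supports: str  # answer option this finding supports


@dataclass
class ActorOutput:
    evidence: List[Evidence]
    answer: str        # preliminary answer
    confidence: float  # actor's self-reported confidence in [0, 1]


@dataclass
class Verdict:
    answer: str
    accepted: bool
    reasons: List[str] = field(default_factory=list)


def actor(video_id: str, question: str) -> ActorOutput:
    """Stub standing in for a VLM call; returns canned output."""
    return ActorOutput(
        evidence=[Evidence("A4C", "reduced wall motion", supports="Yes")],
        answer="Yes",
        confidence=0.8,
    )


def verifier(out: ActorOutput, min_confidence: float = 0.5) -> Verdict:
    """Check the actor's output using only the structured record."""
    reasons = []
    if not out.evidence:
        reasons.append("no evidence supplied")
    if any(e.supports != out.answer for e in out.evidence):
        reasons.append("evidence contradicts answer")
    if out.confidence < min_confidence:
        reasons.append("confidence below threshold")
    return Verdict(answer=out.answer, accepted=not reasons, reasons=reasons)


def answer_question(video_id: str, question: str) -> Verdict:
    """Run the actor, then pass its structured output to the verifier."""
    return verifier(actor(video_id, question))
```

The point of the split is that the verifier sees only the structured record, so an answer with missing or contradictory evidence is rejected with explicit reasons rather than passed through silently.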

If this is right

Decision outputs become more reliable for high-stakes clinical use in cardiovascular screening.
Reasoning steps are more interpretable because the intermediate evidence layer can be inspected.
The system handles complex cardiac dynamics and view heterogeneity with fewer spurious explanations.
Clinical decision support tools built on VLMs gain a built-in mechanism to reduce template shortcuts.
The framework supports separate analysis of evidence before final answer generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same actor-verifier split could be tested on other video-based medical imaging tasks such as ultrasound of other organs.
The intermediate representation might serve as an audit trail that clinicians or regulators could review independently.
Combining the verifier role with human review could create hybrid workflows that flag uncertain cases more clearly.
Performance gains would be measurable by comparing error rates on cases designed to trigger common VLM shortcuts.
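The last point can be made concrete with a small sketch. It assumes a hypothetical set of shortcut-probe cases with known labels, and the predictions below are toy values invented for illustration, not results from the paper. `error_rate` and `compare_on_probes` are illustrative names.

```python
from typing import Dict, List


def error_rate(predictions: List[str], labels: List[str]) -> float:
    """Fraction of predictions that differ from the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


def compare_on_probes(
    labels: List[str], baseline_preds: List[str], candidate_preds: List[str]
) -> Dict[str, float]:
    """Compare two systems on the same shortcut-probe cases."""
    base = error_rate(baseline_preds, labels)
    cand = error_rate(candidate_preds, labels)
    return {
        "baseline_error": base,
        "candidate_error": cand,
        "absolute_reduction": base - cand,
    }


# Toy data only: the baseline always answers "Yes", mimicking a template shortcut.
labels = ["Yes", "No", "No", "Yes"]
baseline = ["Yes", "Yes", "Yes", "Yes"]
candidate = ["Yes", "No", "Yes", "Yes"]
result = compare_on_probes(labels, baseline, candidate)
```

A real evaluation would also need enough probe cases and a significance test before reading anything into the difference.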

Load-bearing premise

Separating actor and verifier roles around an evidence-based structured intermediate representation will overcome template shortcuts and spurious explanations that arise in direct video-to-answer mappings by visual language models.

What would settle it

The central claim would be falsified by a controlled test set of echocardiographic videos with known ground-truth diagnoses on which EchoTrust still produces the same shortcut-based or incorrect answers as a standard direct-mapping VLM.

Figures

Figures reproduced from arXiv: 2604.06347 by Balakrishnan Prabhakaran, Bo Peng, Hongtu Zhu, Hui Guo, Liangqiao Gui, Peng Huang, Shu Hu, Tsao Connie, Xin Wang, Xi Wu, Yiming Wang, Yineng Chen.

**Figure 1.** Overview of the proposed EchoTrust. The Actor generates evidence, a preliminary answer, and a confidence score from the ...

**Figure 2.** Case studies on binary (Yes/No) questions.

**Figure 3.** Case studies on four-choice severity classification questions.

Original abstract

Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLM) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EchoTrust proposes splitting VLM reasoning into actor and verifier roles around a structured intermediate for echocardiography but gives no experiments or implementation details to show it reduces shortcuts.

The main point is that this paper puts forward EchoTrust as an evidence-driven actor-verifier setup for echocardiography VLMs. An actor builds a structured intermediate representation from the video and question, then a verifier role analyzes it to produce the final output. The goal is more reliable and interpretable answers than direct video-to-answer mappings that fall into template shortcuts or spurious explanations.

That separation is the central new element here, applied to a domain where view heterogeneity and cardiac motion make reliable automation hard. The clinical motivation is solid: echocardiography is a key screening tool, and current VLMs can produce outputs that look plausible but do not hold up under scrutiny. Framing the work around distinct roles and an intermediate representation is a reasonable way to push toward interpretability in high-stakes medical imaging. The paper stays focused on that gap without overclaiming prior results.

The main weakness is the complete absence of any validation. There are no datasets, no baseline comparisons, no metrics on accuracy or reliability, and no ablations showing that the actor-verifier split actually cuts down on the problems it targets. The description of the structured representation and the exact division of labor between roles stays high-level, so it is difficult to judge whether the approach would work in practice or simply add complexity. No code, no formal checks, and no reproducible steps are provided.

This kind of paper is mainly for researchers already working on multi-agent or role-based VLM systems in medical imaging who want a concrete starting point for trustworthy reasoning. A reader looking for tested methods or quantitative gains will find little to use directly.

It is worth sending for peer review because the underlying reliability issue in clinical VLMs is real and the proposed structure is specific enough to be worth exploring, even though the current version would need substantial experiments and details added before it could stand on its own.

Referee Report

1 major / 0 minor

Summary. The paper proposes EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. It generates a structured intermediate representation that is analyzed by distinct actor and verifier roles to enable more reliable and interpretable decision-making for high-stakes clinical applications, addressing vulnerabilities to template shortcuts and spurious explanations in direct video-to-answer mappings.

Significance. The framework addresses an important challenge in applying VLMs to medical imaging by introducing role separation and evidence-based reasoning. If the claims hold and are validated through experiments, this could lead to more trustworthy AI tools for echocardiographic analysis, potentially improving clinical decision support in cardiovascular disease screening and diagnosis. The absence of any empirical validation in the current manuscript limits the immediate significance.

major comments (1)

[Abstract] The central claim that the actor-verifier framework overcomes template shortcuts and spurious explanations is not supported by any experimental results, ablation studies, or quantitative metrics. The abstract describes the high-level idea but does not provide the concrete definition of the structured intermediate representation or the precise division of labor between the actor and verifier roles.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] The central claim that the actor-verifier framework overcomes template shortcuts and spurious explanations is not supported by any experimental results, ablation studies, or quantitative metrics. The abstract describes the high-level idea but does not provide the concrete definition of the structured intermediate representation or the precise division of labor between the actor and verifier roles.

Authors: We agree that the abstract is currently high-level and would benefit from added specificity. In the revised version, we will expand the abstract to define the structured intermediate representation as an evidence chain consisting of view-specific visual features, temporal cardiac dynamics, and explicit reasoning steps grounded in the input video. We will also clarify the division of labor: the actor role generates candidate evidence chains and initial inferences from the echocardiographic data, while the verifier role independently validates each step for consistency, rules out template shortcuts, and confirms absence of spurious correlations before producing the final output. Regarding empirical support, the current manuscript presents EchoTrust as a framework proposal motivated by the documented limitations of direct video-to-answer mappings in existing VLMs; the design directly targets these issues through enforced evidence grounding and role separation. We acknowledge the absence of ablation studies or quantitative metrics in this work and will note this limitation explicitly while outlining directions for future empirical validation.

Revision: partial
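The evidence chain described in the simulated rebuttal could be represented along these lines. This is a hedged sketch only: the three components (view-specific visual features, temporal cardiac dynamics, explicit reasoning steps) come from the rebuttal text, while the class names, the observation-id scheme, and the `ungrounded_steps` check are hypothetical. The check covers just one of the verifier duties mentioned, namely that each reasoning step points at some recorded observation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReasoningStep:
    claim: str
    cites: Tuple[str, ...]  # ids of the observations this step relies on


@dataclass
class EvidenceChain:
    view_features: Dict[str, str]      # observation id -> view-specific description
    temporal_dynamics: Dict[str, str]  # observation id -> motion description
    steps: Tuple[ReasoningStep, ...]


def ungrounded_steps(chain: EvidenceChain) -> List[int]:
    """Return indices of steps that cite nothing, or cite an unknown observation id."""
    known = set(chain.view_features) | set(chain.temporal_dynamics)
    return [
        i
        for i, step in enumerate(chain.steps)
        if not step.cites or not set(step.cites) <= known
    ]


# Illustrative chain; the contents are invented for the example.
chain = EvidenceChain(
    view_features={"v1": "A4C: left ventricle visible"},
    temporal_dynamics={"t1": "small wall excursion across the cycle"},
    steps=(
        ReasoningStep("left ventricle is in view", ("v1",)),
        ReasoningStep("systolic function looks reduced", ("t1",)),
        ReasoningStep("severity is moderate", ()),  # cites no observation
    ),
)
flagged = ungrounded_steps(chain)
```

Here the third step is flagged because it cites no observation, which is the kind of unsupported claim a verifier role would be expected to catch.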

Standing simulated objections · not resolved

Empirical validation of the framework via experiments, ablation studies, or quantitative metrics demonstrating reduction in template shortcuts and spurious explanations, as no such results are present in the current manuscript.

Circularity Check

0 steps flagged

No circularity: framework proposal lacks equations, derivations, or self-referential reductions

The paper introduces EchoTrust as an evidence-driven Actor-Verifier framework that generates a structured intermediate representation for analysis by distinct roles, claiming this yields more reliable and interpretable outputs than direct VLM video-to-answer mappings. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. The central claim is an architectural assertion about role separation overcoming shortcuts, not a mathematical result that reduces to its inputs by construction. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The proposal is self-contained as a conceptual design without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no extractable free parameters, axioms, or invented entities beyond the high-level framework name itself.

pith-pipeline@v0.9.0 · 5455 in / 1126 out tokens · 38967 ms · 2026-05-10T18:33:56.132127+00:00 · methodology

[1] [1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Ro- bust fairness vision-language learning for medical image analysis

Sparsh Bansal, Mingyang Wu, Xin Wang, and Shu Hu. Ro- bust fairness vision-language learning for medical image analysis. In2025 IEEE 8th International Conference on Mul- timedia Information Processing and Retrieval (MIPR), pages 463–469. IEEE, 2025. 3

work page 2025

[4] [4]

Bridging vision and text: applications and chal- lenges of vision-language models in urological surgery.Eu- ropean urology focus, 11(1):18–21, 2025

Wouter Bogaert, Nicolas Carl, Karl-Friedrich Kowalewski, Maurice Stephan Michel, Alexandre Mottrie, and Pieter De Backer. Bridging vision and text: applications and chal- lenges of vision-language models in urological surgery.Eu- ropean urology focus, 11(1):18–21, 2025. 2

work page 2025

[5] [5]

Echoatlas: A conversational, multi-view vision-language foundation model for echocardiography interpretation and clinical reasoning.medRxiv, pages 2026–03, 2026

Chieh-Ju Chao, Mohammad Asadi, Lavonda Li, Gokul Ramasamy, Nicolo Pecco, Yu-Chiang Wang, Timothy Poterucha, Reza Arsanjani, Garvan C Kane, Jae K Oh, et al. Echoatlas: A conversational, multi-view vision-language foundation model for echocardiography interpretation and clinical reasoning.medRxiv, pages 2026–03, 2026. 3

work page 2026

[6] [6]

Echollm: extracting echocardiogram entities with light-weight, open-source large language models.JAMIA open, 8(4):ooaf092, 2025

Jonathan Chi, Yazan Rouphail, Ethan Hillis, Ningning Ma, An Nguyen, Jane Wang, Mackenzie Hofford, Aditi Gupta, Patrick G Lyons, Adam Wilcox, et al. Echollm: extracting echocardiogram entities with light-weight, open-source large language models.JAMIA open, 8(4):ooaf092, 2025. 1

work page 2025

[7] [7]

Medical knowledge intervention prompt tuning for medical image classification

Ye Du, Nanxi Yu, and Shujun Wang. Medical knowledge intervention prompt tuning for medical image classification. IEEE Transactions on Medical Imaging, 2025. 2

work page 2025

[8] [8]

Umednerf: Uncertainty-aware single view vol- umetric rendering for medical neural radiance fields

Jing Hu, Qinrui Fan, Shu Hu, Siwei Lyu, Xi Wu, and Xin Wang. Umednerf: Uncertainty-aware single view vol- umetric rendering for medical neural radiance fields. In 2024 IEEE International Symposium on Biomedical Imag- ing (ISBI), pages 1–4. IEEE, 2024. 3

work page 2024

[9] [9]

Rlministyler: Light-weight rl style agent for arbitrary sequential neural style generation

Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, and Xin Wang. Rlministyler: Light-weight rl style agent for arbitrary sequential neural style generation. arXiv preprint arXiv:2505.04424, 2025. 3

work page arXiv 2025

[10] [10]

Improving generalization of medical image registra- tion foundation model.IJCNN, 2025

Jing Hu, Kaiwei Yu, Hongjiang Xian, Shu Hu, and Xin Wang. Improving generalization of medical image registra- tion foundation model.IJCNN, 2025. 1

work page 2025

[11] [11]

Robustly optimized deep feature decoupling network for fatty liver diseases detection

Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, and Xin Wang. Robustly optimized deep feature decoupling network for fatty liver diseases detection. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention, pages 68–78. Springer, 2024. 2

work page 2024

[12] [12]

Robust ai-generated face detection with imbalanced data

Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, and Shu Hu. Robust ai-generated face detection with imbalanced data. In2025 IEEE 8th Inter- national Conference on Multimedia Information Processing and Retrieval (MIPR), pages 470–476. IEEE, 2025. 2

work page 2025

[13] [13]

Echovlm: Measurement-grounded mul- timodal learning for echocardiography.arXiv preprint arXiv:2512.12107, 2025

Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong, Tiziano Passerini, Dorin Comaniciu, and Puneet Sharma. Echovlm: Measurement-grounded mul- timodal learning for echocardiography.arXiv preprint arXiv:2512.12107, 2025. 1

work page arXiv 2025

[14] [14]

Robust covid-19 detection in ct images with clip

Li Lin, Yamini Sri Krubha, Zhenhuan Yang, Cheng Ren, Thuc Duy Le, Irene Amerini, Xin Wang, and Shu Hu. Robust covid-19 detection in ct images with clip. In2024 IEEE 7th 8 International Conference on Multimedia Information Pro- cessing and Retrieval (MIPR), pages 586–592. IEEE, 2024. 1

work page 2024

[15] [15]

Medchat: A multi-agent framework for mul- timodal diagnosis with large language models

Philip R Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ra- mani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, and Shu Hu. Medchat: A multi-agent framework for mul- timodal diagnosis with large language models. In2025 IEEE 8th International Conference on Multimedia Infor- mation Processing and Retrieval (MIPR), pages 456–462. IEEE, 2025. 3

work page 2025

[16] [16]

Emerging utility of multimodal large language models in cardiovascular diagnostics.Journal of Medical Systems, 50(1):33, 2026

Anna G Quinlan, Mitchell H Tsai, and Joshua M Zimmer- man. Emerging utility of multimodal large language models in cardiovascular diagnostics.Journal of Medical Systems, 50(1):33, 2026. 3

work page 2026

[17] [17]

Multimodal generative ai for medical image interpretation

Vishwanatha M Rao, Michael Hla, Michael Moor, Subathra Adithan, Stephen Kwak, Eric J Topol, and Pranav Rajpurkar. Multimodal generative ai for medical image interpretation. Nature, 639(8056):888–896, 2025. 2

work page 2025

[18] [18]

Teacher encoder-student decoder denoising guided segmentation network for anomaly detection

Shixuan Song, Hao Chen, Shu Hu, Xin Wang, Jinrong Hu, and Xi Wu. Teacher encoder-student decoder denoising guided segmentation network for anomaly detection. InIn- ternational Conference on Neural Information Processing, pages 238–253. Springer, 2025. 3

work page 2025

[19] [19]

Gen- erative ai and foundation models in radiology: Applications, opportunities, and potential challenges.Radiology, 317(2): e242961, 2025

Neda Tavakoli, Zahra Shakeri, Vrushab Gowda, Konrad Samsel, Arash Bedayat, Ahmadreza Ghasemiesfe, Ulas Bagci, Albert Hsiao, Tim Leiner, James Carr, et al. Gen- erative ai and foundation models in radiology: Applications, opportunities, and potential challenges.Radiology, 317(2): e242961, 2025. 3

work page 2025

[20] [20]

arXiv preprint arXiv:2504.14391 , year=

Rahul Thapa, Andrew Li, Qingyang Wu, Bryan He, Yuki Sa- hashi, Christina Binder, Angela Zhang, Ben Athiwaratkun, Shuaiwen Leon Song, David Ouyang, et al. How well can general vision-language models learn medicine by watching public educational videos?arXiv preprint arXiv:2504.14391, 2025. 5

[21] Rahul Thapa, Andrew Li, Qingyang Wu, Bryan He, Yuki Sahashi, Christina Binder-Rodriguez, Angela Zhang, David Ouyang, and James Zou. MIMIC-IV-ECHO-Ext-MIMICEchoQA: A Benchmark Dataset for Echocardiogram-Based Visual Question Answering. PhysioNet, 2025. Version 1.0.0.

[22] Ting Yu Tsai, Li Lin, Shu Hu, Ming-Ching Chang, Hongtu Zhu, and Xin Wang. Uu-mamba: uncertainty-aware u-mamba for cardiac image segmentation. In 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), pages 267–273. IEEE, 2024.

[23] Ting Yu Tsai, Li Lin, Shu Hu, Connie W Tsao, Xin Li, Ming-Ching Chang, Hongtu Zhu, and Xin Wang. Uu-mamba: Uncertainty-aware u-mamba for cardiovascular segmentation. arXiv preprint arXiv:2409.14305, 2024.

[24] Andrew S Tseng, Francisco Lopez-Jimenez, and Patricia A Pellikka. Future guidelines for artificial intelligence in echocardiography. Journal of the American Society of Echocardiography, 35(8):878–882, 2022.

[25] Akhil Vaid, Son Q Duong, Joshua Lampert, Patricia Kovatch, Robert Freeman, Edgar Argulian, Lori Croft, Stamatios Lerakis, Martin Goldman, Rohan Khera, et al. Local large language models for privacy-preserving accelerated review of historic echocardiogram reports. Journal of the American Medical Informatics Association, 31(9):2097–2102, 2024.

[26] Xin Wang, Yineng Chen, Shu Hu, Heng Fan, Hongtu Zhu, and Xin Li. Neural radiance fields in medical imaging: A survey. arXiv preprint arXiv:2402.17797, 2024.

[27] Xin Wang, Xiaoyu Liu, Peng Huang, Pu Huang, Shu Hu, and Hongtu Zhu. U-medsam: Uncertainty-aware medsam for medical image segmentation. arXiv preprint arXiv:2408.08881, 2024.

[28] Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, and Xin Wang. Llm-medqa: Enhancing medical question answering through case studies in large language models. In 2025 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE,

[29] Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1662, 2025.

[30] Yang Zheng, Hongjiang Xian, Zhikun Shuai, Jing Hu, Xin Wang, and Shu Hu. Contextual reinforcement learning for unsupervised deformable multimodal medical images registration. In 2024 IEEE International Joint Conference on Biometrics (IJCB), pages 1–9. IEEE, 2024.

[31] Xiaogang Zhu, Tao Liu, Ziqiu Liu, Ouyang Shaobo, Xin Wang, Shu Hu, and Feng Ding. Cgd-net: A hybrid end-to-end network with gating decoding for liver tumor segmentation from ct images. In 2024 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–7. IEEE, 2024.
