EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

Alison Pouch; Emily Mackay; Filippos Bellos; Jason J. Corso; Jessie N Dong; Yannis Avrithis; Yayuan Li; Yutong Li; Zaiyang Guo

arxiv: 2605.24159 · v1 · pith:TBWW4LVAnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-06-30 15:24 UTC · model grok-4.3

keywords: echocardiography · visual question answering · point-of-care ultrasound · multimodal prompts · dataset · cardiac imaging · parameter-efficient learning · acquisition guidance

The pith

EchoVQA supplies the first large-scale visual question answering dataset for echocardiography, including guidance questions for point-of-care image acquisition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates EchoVQA by combining existing public echocardiography collections with new images captured on two handheld probes, yielding 14,299 images and 74,819 question-answer pairs that cover multiple views and both high-quality and suboptimal scans. It adds a distinctive set of acquisition guidance questions that instruct users how to reposition the transducer to reach a usable apical four-chamber view for left-ventricular ejection fraction assessment. The authors also introduce a parameter-efficient training approach that relies on multimodal learnable prompts and report that it matches or exceeds prior methods on EchoVQA and related benchmarks while updating far fewer parameters.

Core claim

EchoVQA is the first large-scale VQA dataset for echocardiography that merges public high-quality sources with point-of-care acquisitions from handheld probes and uniquely includes acquisition guidance questions to help novice operators optimize transducer positioning toward a diagnostic apical four-chamber view. A multimodal learnable prompts method achieves state-of-the-art performance on EchoVQA and most other benchmarks while requiring significantly fewer trainable parameters than existing approaches.

What carries the argument

The EchoVQA dataset of 14,299 images and 74,819 question-answer pairs that incorporates acquisition guidance questions, combined with a multimodal learnable prompts method for parameter-efficient model adaptation.
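
As a quick sanity check on the headline figures, the two counts above imply a little over five question-answer pairs per image on average. The snippet below only does that arithmetic. The constant and function names are illustrative, and no per-source breakdown is attempted because the text here does not give one.

```python
# Headline counts as reported for EchoVQA.
NUM_IMAGES = 14_299
NUM_QA_PAIRS = 74_819


def mean_qa_per_image(num_qa_pairs: int, num_images: int) -> float:
    """Average number of question-answer pairs attached to each image."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return num_qa_pairs / num_images


if __name__ == "__main__":
    density = mean_qa_per_image(NUM_QA_PAIRS, NUM_IMAGES)
    print(f"{density:.2f} QA pairs per image on average")
```

This is only a mean; how the pairs are actually distributed across views, sources, and quality categories may be quite uneven.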

If this is right

Models can supply real-time transducer positioning instructions that help non-experts obtain diagnostic-quality apical four-chamber views during point-of-care scans.
The same models can answer clinical interpretation questions on both high-quality and suboptimal images typical of handheld use.
The low parameter count allows the assistance system to run on the limited hardware available in many point-of-care settings.
Training on mixed high-quality and suboptimal images reduces the performance gap that usually appears when models move from controlled lab data to bedside conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset construction approach could be replicated for other ultrasound applications such as lung or abdominal point-of-care imaging.
Live video rather than static frames would be required to turn the current question-answering system into true conversational guidance during scanning.
Deployment studies measuring whether novice users actually produce better diagnostic images when following the model's advice would test clinical utility beyond benchmark accuracy.

Load-bearing premise

The combination of public high-quality datasets with new point-of-care acquisitions is representative enough for models to generalize to real clinical point-of-care use.

What would settle it

Test the trained model on an independent collection of point-of-care echocardiography images and acquisition guidance questions, gathered with the same handheld probes but outside the training distribution, and measure whether accuracy on guidance and diagnostic questions falls substantially below the reported levels.
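
One way to make this check concrete is to score the model separately on each question type, once on the in-distribution test split and once on the independent collection, and then look at the drop. The sketch below assumes answers are scored by exact match and that each record carries a question-type label. Both are assumptions made for illustration, as are the record layout, the function names, and the 0.10 tolerance. It is not the paper's evaluation protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (question_type, predicted_answer, reference_answer).
Record = Tuple[str, str, str]


def accuracy_by_type(records: Iterable[Record]) -> Dict[str, float]:
    """Exact-match accuracy per question type, ignoring case and outer whitespace."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for qtype, pred, ref in records:
        total[qtype] += 1
        if pred.strip().lower() == ref.strip().lower():
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def accuracy_drop(in_dist: Dict[str, float], held_out: Dict[str, float]) -> Dict[str, float]:
    """In-distribution minus held-out accuracy, for types present in both sets."""
    shared = in_dist.keys() & held_out.keys()
    return {qtype: in_dist[qtype] - held_out[qtype] for qtype in sorted(shared)}


def flag_large_drops(drops: Dict[str, float], tolerance: float = 0.10) -> List[str]:
    """Question types whose accuracy falls by more than the (hypothetical) tolerance."""
    return [qtype for qtype, drop in drops.items() if drop > tolerance]


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    in_dist = accuracy_by_type([
        ("guidance", "tilt the probe", "Tilt the probe"),
        ("guidance", "rotate clockwise", "rotate clockwise"),
        ("diagnostic", "yes", "yes"),
        ("diagnostic", "no", "yes"),
    ])
    held_out = accuracy_by_type([
        ("guidance", "tilt the probe", "slide laterally"),
        ("guidance", "rotate clockwise", "rotate clockwise"),
        ("diagnostic", "yes", "yes"),
        ("diagnostic", "no", "yes"),
    ])
    print(flag_large_drops(accuracy_drop(in_dist, held_out)))
```

Exact match is a blunt metric for free-text guidance answers. A real study would likely need a softer scoring rule or expert review, which this sketch does not attempt.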

Figures

Figures reproduced from arXiv: 2605.24159 by Alison Pouch, Emily Mackay, Filippos Bellos, Jason J. Corso, Jessie N Dong, Yannis Avrithis, Yayuan Li, Yutong Li, Zaiyang Guo.

**Figure 1.** EchoVQA dataset overview. Left: Example QA pairs from each data source demonstrating the different question types. Right: Distribution of images and QA pairs across data sources (inner ring) and quality categories (outer ring). The dataset comprises 14,299 images and 74,819 QA pairs spanning public (EchoNet-Dynamic, CAMUS) and our point-of-care (Lumify, Clarius) acquisitions.

**Figure 2.** EchoVQA construction pipeline. (1) Echocardiography experts categorize images from our point-of-care acquisitions and public datasets into view and quality-based categories, then define coarse QA templates for each category. (2) GPT-4o generates diverse, image-specific QA pairs conditioned on category labels and expert templates, covering cardiac landmark visibility, LVEF estimation feasibility, ultrasoun…

**Figure 3.** Proposed method. (a) Training pipeline with visual adaption prompts injected into top LLM layers. (b) Dataset-level text prompts added to keys and values in the layer’s self-attention. (c) At inference, instance-level prompts are retrieved from a medical vocabulary, optimized, and injected as a summary prompt. D is the embedding dimension and Nv is the number of visual tokens. These features are projected …
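
Panel (b) of Figure 3 describes learnable text prompts that are added to the keys and values of a self-attention layer. The caption does not say how they are combined, so the sketch below shows one common reading, prefix-style conditioning, in which learnable vectors are concatenated in front of the keys and values while the queries are left unchanged. It is a single-head, pure-Python toy for illustration and not the authors' implementation. The function names, shapes, and numbers are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention."""
    dim = len(keys[0])
    outputs: Matrix = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def attention_with_prompts(
    queries: Matrix,
    keys: Matrix,
    values: Matrix,
    prompt_keys: Matrix,
    prompt_values: Matrix,
) -> Matrix:
    """Prepend prompt vectors to keys and values; queries are untouched."""
    return attention(queries, prompt_keys + keys, prompt_values + values)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy token embeddings, D = 2
    prompt_k = [[0.5, 0.5]]             # one hypothetical learnable prompt key
    prompt_v = [[2.0, 2.0]]             # and its matching prompt value
    out = attention_with_prompts(tokens, tokens, tokens, prompt_k, prompt_v)
    trainable = sum(len(row) for row in prompt_k + prompt_v)
    print(len(out), "output rows;", trainable, "trainable prompt scalars")
```

In this reading the output keeps one row per query, so the sequence length seen by later layers does not grow, and the only new trainable quantities are the prompt keys and values. That is the general reason prompt-based adaptation tends to be parameter-efficient; the paper's actual parameter counts are not reproduced here.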

Original abstract

Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and only cover a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation -- a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts achieving state-of-the-art performance on most benchmarks, including EchoVQA, with significantly less trainable parameters than existing state-of-the-art approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EchoVQA is a new dataset for echocardiography VQA that includes acquisition guidance and mixed-quality handheld images, paired with an efficient prompt method, but the abstract provides no performance numbers to back the SOTA claim.

The punchline is that this paper contributes a sizable new VQA dataset for cardiac ultrasound aimed at point-of-care use. It combines existing public sources with fresh acquisitions from two handheld probes, includes both high- and low-quality images, and adds questions specifically about repositioning the transducer to reach a usable apical 4-chamber view.

What is new is the scale and the inclusion of acquisition guidance questions along with realistic suboptimal images from portable devices. That combination is not in prior work. The parameter-efficient multimodal prompt approach is a secondary contribution aimed at reducing trainable parameters while maintaining performance.

The paper does well at laying out the clinical motivation and describing how the data was gathered to address the expertise barrier in point-of-care TTE. The dataset size of 14,299 images and 74,819 question-answer pairs is substantial for this domain.

The soft spots are clear. The abstract gives no quantitative results, baselines, or evaluation details, so the SOTA claim cannot be checked. The assumption that the data distribution supports generalization to real novice operators in clinical settings is stated but not demonstrated. It would also help to see how the QA pairs were created and whether question quality and diversity were controlled.

This paper is for the medical computer vision community, particularly those interested in VQA or ultrasound analysis. Readers who need a benchmark for conversational assistance in echo would get value from the dataset construction. It deserves a serious referee because the dataset could be a useful resource for the field if the results hold up under review.

I recommend sending it to peer review with a request for the full experimental results and any plans for data availability.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces EchoVQA, the first large-scale VQA dataset for echocardiography comprising 14,299 images and 74,819 question-answer pairs. The dataset combines public sources (EchoNet-Dynamic, CAMUS) with new point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. It uniquely incorporates acquisition guidance questions aimed at optimizing transducer positioning for apical 4-chamber views and left ventricular ejection fraction estimation. The authors also propose a parameter-efficient method based on multimodal learnable prompts that is stated to achieve state-of-the-art performance on EchoVQA and most other benchmarks while using significantly fewer trainable parameters than prior approaches.

Significance. If the dataset scale, diversity, and inclusion of acquisition guidance questions are as described and the prompt-based method demonstrably outperforms existing approaches with reduced parameters, the work could provide a valuable benchmark and practical tool for conversational assistance in point-of-care cardiac ultrasound. The emphasis on suboptimal images and real-world acquisition challenges addresses a genuine clinical gap, and the parameter efficiency would be a strength for deployment. However, the significance is currently limited by the absence of supporting quantitative evidence for the performance claims.

major comments (1)

[Abstract] Abstract: The central claim that the proposed method achieves 'state-of-the-art performance on most benchmarks, including EchoVQA, with significantly less trainable parameters than existing state-of-the-art approaches' is asserted without any quantitative results, baselines, error bars, ablation studies, or description of the evaluation protocol. This absence makes it impossible to assess whether the data support the SOTA claim, which is load-bearing for the paper's contribution.

minor comments (1)

[Abstract] Abstract: The phrase 'most benchmarks' is used without specifying which benchmarks are included or the criteria for 'most,' reducing clarity on the scope of the performance claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

Referee: [Abstract] Abstract: The central claim that the proposed method achieves 'state-of-the-art performance on most benchmarks, including EchoVQA, with significantly less trainable parameters than existing state-of-the-art approaches' is asserted without any quantitative results, baselines, error bars, ablation studies, or description of the evaluation protocol. This absence makes it impossible to assess whether the data support the SOTA claim, which is load-bearing for the paper's contribution.

Authors: We agree that the abstract as currently written asserts the performance claim without supporting numbers, which limits immediate verifiability. The full manuscript contains the requested elements: Section 4 reports quantitative results on EchoVQA and external benchmarks (including accuracy, parameter counts, and comparisons against prior VQA and multimodal models), with ablation studies on prompt design and evaluation protocols detailed in Section 3.2 and the supplementary material; error bars are included for repeated runs. To address the concern directly, we will revise the abstract to incorporate key quantitative highlights (e.g., specific accuracy gains and parameter reductions relative to baselines) while preserving its length constraints.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims consist of introducing a new dataset (EchoVQA with explicit size, sources, and question types) and describing a parameter-efficient multimodal prompt method that achieves reported benchmark results. No equations, fitted parameters, predictions, or derivation steps appear in the abstract or described content. The work is self-contained as a data-collection and empirical-method contribution; no self-citation chain, ansatz smuggling, or input-renaming reductions are present. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work relies on standard assumptions of multimodal VQA and prompt tuning without introducing new physical or mathematical entities.

pith-pipeline@v0.9.1-grok · 5778 in / 1230 out tokens · 18417 ms · 2026-06-30T15:24:23.484589+00:00 · methodology
