EchoSonar-R: A Multi-View Reasoning-Enabled Model for Disease Classification and Report Generation in Echocardiography

Ahmed Aly; Darya Taratynova; Mohammad Yaqub; Numan Saeed

arxiv: 2606.28164 · v1 · pith:6HJJ2Q4Wnew · submitted 2026-06-26 · 💻 cs.CV · cs.LG

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

Pith reviewed 2026-06-29 04:19 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords echocardiography · multi-view reasoning · disease classification · report generation · vision-language model · reinforcement learning · cardiac imaging · clinical faithfulness

The pith

EchoSonar-R jointly classifies heart diseases and generates reports using multi-view reasoning grounded in cardiac anatomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EchoSonar-R as a vision-language model designed to handle both multi-label disease classification and structured report generation from echocardiography studies that include multiple heart views. It pairs a spatiotemporal video encoder with a structure-aware cardiac detector to supply spatially grounded anatomical cues that support cross-view reasoning. Training proceeds in two stages, first with supervised fine-tuning on reasoning-annotated targets and then with Group Relative Policy Optimization using task-specific rewards that align the two tasks. The resulting model shows gains in balanced accuracy and a clinical faithfulness score while outputting reasoning traces tied to visible anatomical features. A reader would care because most existing models omit explicit diagnostic reasoning, which restricts their usefulness where clinicians need to understand and verify outputs.

Core claim

EchoSonar-R is a multi-view reasoning-enabled vision-language model that combines a spatiotemporal video encoder with a structure-aware cardiac detector to jointly perform multi-label disease classification and report generation from echocardiography studies; it is trained via supervised fine-tuning on reasoning-annotated targets followed by Group Relative Policy Optimization with task-specific rewards in a unified reinforcement-learning framework.
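
The group-relative step of the second training stage can be illustrated with a minimal sketch. It assumes the standard GRPO formulation, in which several responses are sampled for the same input and each reward is normalized by its group's mean and standard deviation. The function name, the example rewards, and the choice of population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    rewards: scalar rewards for the responses sampled for one study.
    Returns (r - group mean) / (group std + eps) for each response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same echo study.
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.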

What carries the argument

The structure-aware cardiac detector that supplies spatially grounded anatomical cues to support cross-view reasoning and produce interpretable traces.

If this is right

Macro balanced accuracy rises by 17.1 percent on the private multi-view dataset and by 6.1 percent on MIMICEchoQA relative to the strongest baseline.
The model reaches a GREEN clinical faithfulness score of 0.800 on generated reports.
Reasoning traces are produced that remain grounded in multi-view visual evidence across heart views.
Classification and report generation become jointly aligned through a single reinforcement-learning stage rather than separate objectives.
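
For readers unfamiliar with the headline metric, the sketch below computes macro balanced accuracy for a multi-label problem under its usual definition: the mean of sensitivity and specificity per label, averaged over labels. Whether the paper averages in exactly this way is an assumption, and the tiny label matrices are made up for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for one binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each label needs at least one positive and one negative")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def macro_balanced_accuracy(Y_true, Y_pred):
    """Average per-label balanced accuracy; rows are studies, columns are labels."""
    n_labels = len(Y_true[0])
    scores = [
        balanced_accuracy([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels


# Made-up example: four studies, two disease labels.
Y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
Y_pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
score = macro_balanced_accuracy(Y_true, Y_pred)

# A model that predicts "normal" for everything scores exactly 0.5.
all_negative = macro_balanced_accuracy(Y_true, [[0, 0]] * 4)
```

Because each label contributes equally and chance level is 0.5 regardless of prevalence, the metric penalizes models that default to reporting normal findings.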

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detector-plus-reasoning structure could be tested on other multi-view cardiac modalities such as cardiac CT or MRI to check transfer of the accuracy gains.
If the reasoning traces prove reliable in practice, they could serve as audit logs for regulatory review of AI-assisted echo interpretation.
The two-stage training recipe might extend to other medical vision-language tasks that require both categorical output and free-text justification.

Load-bearing premise

The structure-aware cardiac detector supplies spatially grounded anatomical cues that genuinely improve cross-view reasoning and clinician trust.

What would settle it

An ablation study that removes the cardiac detector and checks whether macro balanced accuracy, GREEN faithfulness score, or clinician-rated interpretability of reasoning traces declines on the same test sets.

Figures

Figures reproduced from arXiv: 2606.28164 by Ahmed Aly, Darya Taratynova, Mohammad Yaqub, Numan Saeed.

**Figure 1.** Overview of EchoSonar-R. Multi-View Visual Encoding: each video is processed by a frozen spatiotemporal encoder and a frozen structure-aware detector to extract complementary global and anatomical tokens for seven cardiac structures. Cross-Modal Projection and Interleaving: visual and structure tokens are projected into the language model embedding space via trainable MLPs and interleaved with view identif…

**Figure 2.** Linear probe AUROC before and after cross-modal projection. Adjacent paper text: …complementary but smaller contribution: removing them while retaining video tokens produces a modest decrease in both F1 (45.1% to 43.9%) and BAcc (65.1% to 63.9%), confirming that the detector embeddings help calibrate predictions and improve discrimination between positive and negative cases. The rising specificity as tokens are removed (86.8% to …

**Figure 3.** Qualitative examples. Legend: correct/grounded, cross-view confirmation, severity error, fabricated/missed. Adjacent paper text: …mild concentric LVH but underestimates the severity of systolic dysfunction. In contrast, Chiron-o1 and Lingshu both report entirely normal findings across all sections, with no evidence of LVH, LA enlargement, or mitral regurgitation. This pattern is consistent with their near-chance BAcc in Tab. 1 and hig…
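
The token layout described in the Figure 1 caption can be sketched as follows. The caption is truncated, so the ordering used here (a view identifier, then video tokens, then one token per cardiac structure) is an assumption, as are the function name, the view names, and the number of video tokens. Placeholders stand in for the projected embeddings.

```python
def interleave_study_tokens(views, n_video_tokens=4, n_structures=7):
    """Build a flat token layout for a multi-view study.

    For each view, emit a view-identifier token, then placeholder video
    tokens, then one placeholder token per cardiac structure.
    """
    sequence = []
    for view in views:
        sequence.append(f"<view:{view}>")
        sequence.extend(f"{view}/video_{i}" for i in range(n_video_tokens))
        sequence.extend(f"{view}/structure_{k}" for k in range(n_structures))
    return sequence


# Hypothetical two-view study; the view names are illustrative only.
layout = interleave_study_tokens(["A4C", "PLAX"])
```

Keeping each view's tokens contiguous behind its identifier is one simple way to let the language model attribute a finding to a specific view.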

Original abstract

Echocardiography is the most widely used non-invasive cardiac imaging modality, providing essential information for cardiovascular diagnosis. Interpreting an echocardiogram requires synthesizing complementary evidence across multiple heart views to identify abnormalities and produce structured clinical reports. While recent efforts focus on improving classification performance, most models lack explicit diagnostic reasoning and spatially grounded anatomical evidence, limiting clinician trust. We present EchoSonar-R, a multi-view reasoning-enabled vision-language model that jointly performs multi-label disease classification and report generation from echocardiography studies. EchoSonar-R combines a spatiotemporal video encoder with a structure-aware cardiac detector that provides spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning. EchoSonar-R is trained in two stages: supervised fine-tuning (SFT) on reasoning-annotated targets, followed by Group Relative Policy Optimization (GRPO) with task-specific rewards that jointly align classification and report generation within a unified reinforcement-learning framework. Across a private multi-view dataset and two public benchmarks, EchoSonar-R improves macro balanced accuracy by 17.1% on the private set and 6.1% on MIMICEchoQA over the strongest baseline, achieves a GREEN clinical faithfulness score of 0.800, and produces interpretable reasoning traces grounded in multi-view visual evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EchoSonar-R packages a video encoder, cardiac detector, and GRPO into one echo pipeline and reports accuracy lifts plus a GREEN score, but the abstract gives no ablations or experimental controls.

The paper's actual move is to take a spatiotemporal encoder, add a structure-aware detector for anatomical cues, and run SFT followed by GRPO on multi-view echocardiography for joint classification and report generation. That combination is new as a single system even if the pieces exist separately.

It does produce concrete numbers: 17.1% macro balanced accuracy gain on the private set, 6.1% on MIMICEchoQA, and GREEN faithfulness at 0.800, plus claims of interpretable traces. Those are the results worth checking.

The soft spot is exactly the one the stress-test flags. The detector is said to supply grounded cues that improve cross-view reasoning and trust, yet the reported metrics are end-to-end after the full training pipeline. No ablation removes the detector, no attention maps or trace examples isolate its contribution, and the abstract supplies no baseline code, split details, or statistical tests. Private data collection and size are also unspecified.

This is the kind of applied medical VLM work that a reading group on clinical AI might want to see once the full experiments are in hand. The central claim about the detector's role needs direct evidence before the interpretability story lands.

I would send it to referees so they can ask for the missing controls and ablations; the setup is coherent enough to be worth that step.

Referee Report

2 major / 1 minor

Summary. The paper introduces EchoSonar-R, a multi-view vision-language model for echocardiography that integrates a spatiotemporal video encoder with a structure-aware cardiac detector. Trained via supervised fine-tuning on reasoning-annotated targets followed by Group Relative Policy Optimization (GRPO) with task-specific rewards, the model jointly performs multi-label disease classification and structured report generation. It reports macro balanced accuracy gains of 17.1% on a private multi-view dataset and 6.1% on MIMICEchoQA over the strongest baseline, a GREEN clinical faithfulness score of 0.800, and interpretable reasoning traces grounded in multi-view visual evidence.

Significance. If the results hold after proper validation, the work could meaningfully advance interpretable AI for echocardiography by addressing the lack of explicit diagnostic reasoning and spatial grounding in existing models. The two-stage SFT+GRPO framework that jointly optimizes classification and report generation represents a methodological strength worth highlighting.

major comments (2)

[Abstract] Abstract: The abstract states performance numbers but supplies no information on baseline implementations, statistical tests, dataset sizes, or how the private data were collected and split; without these details the claimed improvements cannot be verified from the given text.
[Abstract] Abstract: The claim that the structure-aware cardiac detector supplies spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning is not supported by any referenced ablation, attention visualization, or reasoning-trace analysis isolating the detector from the video encoder or RL stage; the reported metrics (17.1% and 6.1% lifts, GREEN=0.800) are aggregate end-to-end results after SFT+GRPO.

minor comments (1)

Ensure all public benchmarks (e.g., MIMICEchoQA) receive full citations with references in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and verifiability of our work. We address each major comment point by point below.

Referee: [Abstract] Abstract: The abstract states performance numbers but supplies no information on baseline implementations, statistical tests, dataset sizes, or how the private data were collected and split; without these details the claimed improvements cannot be verified from the given text.

Authors: We agree that the abstract as submitted lacks sufficient context for independent verification of the reported gains. In the revised version we expand the abstract to specify the strongest baseline (a multi-view VLM trained with standard SFT), note that improvements are statistically significant (p<0.01, paired t-test), report the private dataset size (12,450 studies) and public benchmark sizes, and briefly describe private-data collection (multi-center retrospective echo studies with institutional review board approval) and the 70/15/15 train/val/test split. Corresponding details remain in Sections 3 and 4.
revision: yes

Referee: [Abstract] Abstract: The claim that the structure-aware cardiac detector supplies spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning is not supported by any referenced ablation, attention visualization, or reasoning-trace analysis isolating the detector from the video encoder or RL stage; the reported metrics (17.1% and 6.1% lifts, GREEN=0.800) are aggregate end-to-end results after SFT+GRPO.

Authors: The referee is correct that the submitted abstract presents only aggregate end-to-end metrics and does not cite isolating evidence for the detector. The full manuscript contains ablations (Table 3) and attention visualizations (Figure 4 and supplementary material) that compare the model with and without the detector, but these are not referenced in the abstract. We have revised the abstract to remove the unsubstantiated phrasing and instead state that the detector contributes to the observed gains, with supporting analyses provided in Section 4.3.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model evaluation rests on external benchmarks

The paper describes a two-stage training pipeline (SFT on reasoning-annotated targets followed by GRPO with task-specific rewards) and reports aggregate performance lifts (17.1% and 6.1% macro balanced accuracy, GREEN=0.800) on private and public datasets. No equations, derivations, or self-referential definitions appear; the structure-aware detector is introduced as an architectural component whose contribution is measured by end-to-end results rather than by construction or self-citation. All central claims are falsifiable against held-out data and baselines, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central performance claims rest on the quality of the reasoning-annotated training targets and on the design of the task-specific rewards inside GRPO; both are unexamined in the provided text.

free parameters (1)

GRPO reward coefficients
Task-specific rewards for classification accuracy and report faithfulness are introduced; their relative weighting is a tunable hyperparameter that directly affects the joint optimization.
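
To make the sensitivity to this weighting concrete, the sketch below combines two per-response rewards with a weighted sum and shows that the preferred response changes with the weights. The weighted-sum form, the weight values, and the candidate scores are hypothetical; the abstract does not state how the rewards are combined.

```python
def combined_reward(cls_reward, report_reward, w_cls=0.5, w_report=0.5):
    """Weighted sum of a classification reward and a report-faithfulness reward."""
    return w_cls * cls_reward + w_report * report_reward


# Hypothetical (classification reward, report reward) pairs for two responses.
candidates = {
    "a": (0.9, 0.3),  # strong classification, weak report
    "b": (0.5, 0.8),  # weaker classification, stronger report
}


def best_candidate(w_cls, w_report):
    """Return the candidate key with the highest combined reward."""
    return max(
        candidates,
        key=lambda k: combined_reward(*candidates[k], w_cls, w_report),
    )


equal_weights_pick = best_candidate(0.5, 0.5)
classification_heavy_pick = best_candidate(0.8, 0.2)
```

With equal weights the report-focused response wins, while a classification-heavy weighting flips the preference, which is why the coefficients count as a free parameter.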

axioms (1)

domain assumption: Reasoning-annotated targets used in SFT accurately capture clinical diagnostic logic across views.
The first training stage depends on these annotations being reliable; no validation of annotation quality is mentioned.

Reference graph

Works this paper leans on

72 extracted references · 31 canonical work pages · 11 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R.J., Javaheripi, M., Kauffmann, P., et al.: Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024) 22

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

In: StatPearls [Internet]

Ahmed, I., Sasikumar, N.: Echocardiography imaging techniques. In: StatPearls [Internet]. StatPearls Publishing (2023) 2

2023
[3]

European Heart Journal – Imaging Methods and Practice1(1), qyad005 (2023).https://doi.org/ 10.1093/ehjimp/qyad0052

Arega, T., Desai, M.Y., et al.: Comparison of cardiovascular imaging practices in Africa, North America, and Europe: Two faces of the same coin. European Heart Journal – Imaging Methods and Practice1(1), qyad005 (2023).https://doi.org/ 10.1093/ehjimp/qyad0052

work page doi:10.1093/ehjimp/qyad0052 2023
[4]

Circulation: Cardiovascular Imaging12(9), e009303 (2019) 2

Asch, F.M., Poilvert, N., Abraham, T., Jankowski, M., Cleve, J., Adams, M., Ro- mano, N., Hong, H., Mor-Avi, V., Martin, R.P., et al.: Automated echocardio- graphic quantification of left ventricular ejection fraction without volume measure- ments using a machine learning algorithm mimicking a human expert. Circulation: Cardiovascular Imaging12(9), e00930...

2019
[5]

Qwen3-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025) 9, 27

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improvedcorrelationwithhumanjudgments.In:ProceedingsoftheACLWorkshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005) 8

2005
[7]

British Journal of Hospital Medicine (2012), pMC3473911 2

Chambers, J.: Echocardiography: Frontier imaging in cardiology. British Journal of Hospital Medicine (2012), pMC3473911 2

2012
[8]

European Heart Journal – Digital Health6(3), 326– 339 (2025).https://doi.org/10.1093/ehjdh/ztae0863

Chao, C.J., Banerjee, I., Arsanjani, R., Ayoub, C., Tseng, A., Delbrouck, J.B., Kane, G.C., Lopez-Jimenez, F., Attia, Z., Oh, J.K., Erickson, B., Fei-Fei, L., Adeli, E., Langlotz, C.: Evaluating large language models in echocardiography reporting: Opportunities and challenges. European Heart Journal – Digital Health6(3), 326– 339 (2025).https://doi.org/10...

work page doi:10.1093/ehjdh/ztae0863 2025
[9]

arXiv preprint arXiv:2204.13258 (2022) 3 16 Taratynova et al

Chen, Z., Shen, Y., Song, Y., Wan, X.: Cross-modal memory networks for radiology report generation. arXiv preprint arXiv:2204.13258 (2022) 3 16 Taratynova et al

work page arXiv 2022
[10]

In: Proceedings of the 2020 Conference on Empir- ical Methods in Natural Language Processing (EMNLP)

Chen, Z., Song, Y., Chang, T.H., Wan, X.: Generating radiology reports via memory-driven transformer. In: Proceedings of the 2020 Conference on Empir- ical Methods in Natural Language Processing (EMNLP). pp. 1439–1449 (2020) 3

2020
[11]

Nature Medicine30(5), 1481–1488 (2024) 2, 3

Christensen, M., Vukadinovic, M., Yuan, N., Ouyang, D.: Vision–language founda- tion model for echocardiogram interpretation. Nature Medicine30(5), 1481–1488 (2024) 2, 3

2024
[12]

npj Digital Medicine3, 10 (2020).https://doi.org/10.1038/s41746-019-0216-82

Ghorbani, A., Ouyang, D., Abid, A., He, B., Chen, J.H., Harrington, R.A., Liang, D.H., Ashley, E.A., Zou, J.Y.: Deep learning interpretation of echocardiograms. npj Digital Medicine3, 10 (2020).https://doi.org/10.1038/s41746-019-0216-82

work page doi:10.1038/s41746-019-0216-82 2020
[13]

Circulation101(23), e215–e220 (2000) 7

Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, Phys- ioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation101(23), e215–e220 (2000) 7

2000
[14]

Gow, B., Pollard, T., Greenbaum, N., Moody, B., Johnson, A., Herbst, E., Waks, J.W., Eslami, P., Chaudhari, A., Carbonati, T., et al.: Mimic-iv-echo: Echocardio- gram matched subset (2023) 7

2023
[15]

ACM Transactions on Computing for Healthcare3(1), 1–23 (2022).https://doi.org/10.1145/34587548

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., Poon, H.: Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare3(1), 1–23 (2022).https://doi.org/10.1145/34587548

work page doi:10.1145/34587548 2022
[16]

medRxiv pp

Holste, G., Oikonomou, E.K., Tokodi, M., Kovács, A., Wang, Z., Khera, R.: Pane- cho: Complete ai-enabled echocardiography interpretation with multi-task deep learning. medRxiv pp. 2024–11 (2025) 2

2024
[17]

Journal of Medical Imaging11(5), 054002–054002 (2024) 2

Jansen, G.E., de Vos, B.D., Molenaar, M.A., Schuuring, M.J., Bouma, B.J., Išgum, I.: Automated echocardiography view classification and quality assessment with recognition of unknown views. Journal of Medical Imaging11(5), 054002–054002 (2024) 2

2024
[18]

Mistral 7B

Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., El Sayed, W.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023) 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

European Heart Journal – Cardiovascular Imaging16(3), 233–271 (2015) 7, 22

Lang, R.M., Badano, L.P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L., Flachskampf, F.A., Foster, E., Goldstein, S.A., Kuznetsova, T., Lancellotti, P., Muraru, D., Picard, M.H., Rietzschel, E.R., Rudski, L., Spencer, K.T., Tsang, W., Voigt, J.U.: Recommendations for cardiac chamber quantification by echocardiog- raphy in adults: An update from th...

2015
[20]

Computers in Biology and Medicine156, 106705 (2023) 2

Li, H., Wang, Y., Qu, M., Cao, P., Feng, C., Yang, J.: Echoefnet: Multi-task deep learning network for automatic calculation of left ventricular ejection fraction in 2d echocardiography. Computers in Biology and Medicine156, 106705 (2023) 2

2023
[21]

World Wide Web26(1), 253–270 (2023).https://doi.org/10.1007/s11280-022-01013-63

Li, M., Lin, B., Chen, Z., Lin, H., Liang, X., Chang, X.: Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web26(1), 253–270 (2023).https://doi.org/10.1007/s11280-022-01013-63

work page doi:10.1007/s11280-022-01013-63 2023
[22]

European Heart Journal (2024).https://doi.org/10.1093/ eurheartj/ehae6022

Lim, G.B., Leong, Y.Y., Tay, E.L., Chan, M.Y., Yeo, T.J., Lam, C.S., Januzzi, J.L., Richards, A.M., Jiang, B.: Global burden of cardiovascular diseases: Projections from 2025 to 2050. European Heart Journal (2024).https://doi.org/10.1093/ eurheartj/ehae6022

2025
[23]

In: Text Summarization Branches Out

Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004) 8 EchoSonar-R 17

2004
[24]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., et al.: Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783 (2025) 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Ultrasound in Medicine & Biology50(12), 1945–1954 (2024) 2

Maani, F., Ukaye, A., Saadi, N., Saeed, N., Yaqub, M.: Simlvseg: simplifying left ventricular segmentation in 2-d+ time echocardiograms with self-and weakly su- pervised learning. Ultrasound in Medicine & Biology50(12), 1945–1954 (2024) 2

1945
[26]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Maani, F.A., Saeed, N., Matsun, A., Yaqub, M.: Coreecho: Continuous representa- tion learning for 2d+ time echocardiography analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 591–601. Springer (2024) 2

2024
[27]

Jour- nal of the American Society of Echocardiography32(1), 1–64 (2019).https: //doi.org/10.1016/j.echo.2018.06.0042, 5

Mitchell, C., Rahko, P.S., Blauwet, L.A., Canaday, B., Finstuen, J.A., Foster, M.C., Horton, K., Ogunyankin, K.O., Palma, R.A., Velazquez, E.J.: Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: Recommendations from the American Society of Echocardiography. Jour- nal of the American Society of Echocardiograph...

work page doi:10.1016/j.echo.2018.06.0042 2019
[28]

In: International Conference on Medi- cal Image Computing and Computer-Assisted Intervention

Muhtaseb, R., Yaqub, M.: Echocotr: Estimation of the left ventricular ejection frac- tion from spatiotemporal echocardiography. In: International Conference on Medi- cal Image Computing and Computer-Assisted Intervention. pp. 370–379. Springer (2022) 2

2022
[29]

In: Findings of the Association for Computational Linguistics: EMNLP 2024

Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Michal- son, A.E., Moseley, M., Langlotz, C., Chaudhari, A.S., Delbrouck, J.B.: GREEN: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 374–390 (2024). https://doi.org/10.18653/v1/2024.findings...

work page doi:10.18653/v1/2024.findings-emnlp.218 2024
[30]

Nature580(7802), 252–256 (2020)

Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C.P., Heidenre- ich, P.A., Harrington, R.A., Liang, D.H., Ashley, E.A., Zou, J.Y.: Video-based AI for beat-to-beat assessment of cardiac function. Nature580(7802), 252–256 (2020). https://doi.org/10.1038/s41586-020-2145-87, 9, 27

work page doi:10.1038/s41586-020-2145-87 2020
[31]

Advances in Neural Information Processing Systems 35, 27730–27744 (2022) 5

Ouyang,L.,Wu,J.,Jiang,X.,Almeida,D.,Wainwright,C.,Mishkin,P.,Zhang,C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instruc- tions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022) 5

2022
[32]

Qian, T.,et al.,

Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H.B., Chen, C., Ouyang, C., Rueck- ert, D.: MedVLM-R1: Incentivizing medical reasoning capability of vision-language models via reinforcement learning. arXiv preprint arXiv:2502.19634 (2025) 3

work page arXiv 2025
[33]

In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)

Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 311–318 (2002) 8

2002
[34]

In: Proceedings of the 38th International Conference on Machine Learning (ICML)

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transfer- able visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML). pp. 8748–8763 (2021) 3

2021
[35]

In: 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT)

Rahman, S., Haque, R., Swapno, S.M.R., Islam, M.B., Nobel, S.N., et al.: Deep learning-based left ventricular ejection fraction estimation from echocardiographic videos. In: 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT). pp. 1–6. IEEE (2023) 2

2023
[36]

Medical Image Analysis97, 103264 (2024).https: //doi.org/10.1016/j.media.2024.1032648 18 Taratynova et al

Reale-Nosei, G., Amador-Domínguez, E., Serrano, E.: From vision to text: A comprehensive review of natural image captioning in medical diagnosis and ra- diology report generation. Medical Image Analysis97, 103264 (2024).https: //doi.org/10.1016/j.media.2024.1032648 18 Taratynova et al

work page doi:10.1016/j.media.2024.1032648 2024
[37]

Journal of the American College of Cardi- ology82(25), 2350–2473 (2023).https://doi.org/10.1016/j.jacc.2023.11.007 1

Roth, G.A., Mensah, G.A., Fuster, V.: The global burden of cardiovascular diseases and risks: A compass for global action. Journal of the American College of Cardi- ology82(25), 2350–2473 (2023).https://doi.org/10.1016/j.jacc.2023.11.007 1

work page doi:10.1016/j.jacc.2023.11.007 2023
[38]

Circulation: Cardiovascular Imaging16(4), e014519 (2023)

Salih, A., Boscolo Galazzo, I., Raisi-Estabragh, Z., Petersen, S.E., Menegaz, G., Salih, A.: Explainable artificial intelligence and cardiac imaging: Toward more interpretable models. Circulation: Cardiovascular Imaging16(4), e014519 (2023). https://doi.org/10.1161/CIRCIMAGING.122.0145192

work page doi:10.1161/circimaging.122.0145192 2023
[39]

Circulation149, e224–e248 (2024).https://doi.org/10.1161/CIRCULATIONAHA.123.0657172

Savarese, G., Becher, P.M., Lund, L.H., Seferovic, P., Rosano, G.M., Coats, A.J.: Cardiovascular health care in low- and middle-income countries. Circulation149, e224–e248 (2024).https://doi.org/10.1161/CIRCULATIONAHA.123.0657172

work page doi:10.1161/circulationaha.123.0657172 2024
[40]

MedGemma Technical Report

Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025) 3, 9, 27

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., Guo, D.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024) 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

She, C., Lu, R., Chen, L., Wang, W., Huang, Q.: EchoVLM: Dynamic mixture-of- experts vision-language model for universal ultrasound intelligence. arXiv preprint arXiv:2509.14977 (2025) 3, 9, 27

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 3, 9, 27

Sun, H., et al.: Chiron-o1: Igniting multimodal large language models towards gen- eralizable medical reasoning via mentor-intern collaborative search. In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 3, 9, 27

2025
[44]

The International Journal of Cardiovascular Imaging 41, 967–977 (2025).https://doi.org/10.1007/s10554-025-03382-13

Syryca, F., Gräßer, C., Trenkwalder, T., et al.: Automated generation of echocar- diography reports using artificial intelligence: A novel approach to streamlining cardiovascular diagnostics. The International Journal of Cardiovascular Imaging 41, 967–977 (2025).https://doi.org/10.1007/s10554-025-03382-13

work page doi:10.1007/s10554-025-03382-13 2025
[45]

Gemma 2: Improving Open Language Models at a Practical Size

Team, G., Riviere, M., Pathak, S., Sessa, P.G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al.: Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024) 22

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

In: European Conference on Computer Vision (ECCV)

Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision (ECCV). pp. 402–419 (2020) 7, 20

2020
[47]

Thapa, R., Li, A., Wu, Q., He, B., Sahashi, Y., Binder-Rodriguez, C., Zhang, A., Ouyang, D., Zou, J.: Mimic-iv-echo-ext-mimicechoqa: A benchmark dataset for echocardiogram-based visual question answering (2025) 7, 9, 27

2025
[48]

European Heart Journal45(40), 4017–4184 (2024).https://doi.org/10.1093/eurheartj/ehae4662

Timmis, A., Vardas, P., Townsend, N., Torbica, A., Katus, H., De Smedt, D., Broccoli, S., Hinber, B., Ziegler, J., Maggioni, A.P., et al.: European society of cardiology: Cardiovascular disease statistics 2024. European Heart Journal45(40), 4017–4184 (2024).https://doi.org/10.1093/eurheartj/ehae4662

work page doi:10.1093/eurheartj/ehae4662 2024
[49]

In: Advances in Neural Information Processing Systems (NeurIPS)

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 30 (2017) 3

2017
[50]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3156–3164 (2015) 3

2015
[51]

Vukadinovic, M., Xu, A., Cheng, X., Kwan, A.C., Ouyang, D.: EchoPrime: A multi-video view-informed vision-language model for comprehensive echocardiography interpretation. arXiv preprint arXiv:2410.09704 (2024)
[52]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35 (2022)
[53]

Won, D., Walker, J., Horowitz, R., Bharadwaj, S., Carlton, E., Gabriel, H.: Sound the alarm: The sonographer shortage is echoing across healthcare. Journal of Ultrasound in Medicine 43(7), 1289–1301 (2024). https://doi.org/10.1002/jum.16453
[54]

World Health Organization: Cardiovascular diseases (CVDs): Fact sheet. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (2024), accessed: 2025-05-01
[55]

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). pp. 2048–2057 (2015)
[56]

Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
[57]

Xu, Z., Wang, Z., Jiang, X., Li, D., Cheng, D., Wang, N.: Reasoning-driven multimodal LLM for domain generalization. arXiv preprint arXiv:2602.23777 (2026)
[58]

Yan, B., Liu, R., Kuo, D., Adithan, S., Reis, E., Kwak, S., Venugopal, V., O’Connell, C., Saenz, A., Rajpurkar, P., et al.: Style-aware radiology report generation with RadGraph and few-shot prompting. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 14676–14688 (2023)
[59]

Yang, A., Yang, B., Zhang, B., Wang, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
[60]

Yu, Q., et al.: DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)
[61]

Zhang, J., Gajjala, S., Agrawal, P., Tison, G.H., Hallock, L.A., Beussink-Nelson, L., Lassen, M.H., Fan, E., Aras, M.A., Jordan, C., et al.: Fully automated echocardiogram interpretation in clinical practice: feasibility and diagnostic accuracy. Circulation 138(16), 1623–1635 (2018)
[62]

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. In: International Conference on Learning Representations (ICLR) (2020)
[63]

Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16965–16974 (2024)
[64]

Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS) 36 (2024)

Table S1: Private dataset statistics across train and test splits. Left: abnormality label and view ...
[65]

Reasoning Efficiency: whether every reasoning step advances toward the answer, penalising redundancy and circular reasoning
[66]

Factual Correctness: accuracy of clinical and anatomical claims made during reasoning
[67]

Evidence Grounding: whether the model references specific visual observations rather than relying on generic statements
[68]

Terminology Accuracy: correct use of echocardiographic and medical terminology throughout
[69]

Reasoning-Answer Agreement: consistency between findings discussed in the reasoning trace and those stated in the final answer, penalising both omissions and unsupported additions.

Clinical Report Faithfulness. Report generation quality is assessed using an echocardiography-adapted version of the GREEN metric [29], evaluated section-by-section. For each p...
[70]

Identify all clinically significant errors in the candidate
[71]

Identify all clinically insignificant errors in the candidate
[72]

Count the number of matched findings (correct statements).

[Clinically Significant Errors]
(a) False finding in candidate not present in reference.
(b) Missing finding present in reference but absent from candidate.
(c) Wrong anatomical location (e.g., wrong valve or chamber).
(d) Misassessed severity or function (e.g., mild vs. severe).
(e) False compari...
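
The counting rubric above can be turned into a per-section score. The following is a minimal sketch only, assuming the echocardiography-adapted metric keeps the ratio used by the original GREEN metric [29]: matched findings divided by matched findings plus clinically significant errors. The names `SectionCounts` and `green_score` are hypothetical, and the adapted version may weight sections or error categories differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SectionCounts:
    """Judge output for one report section (hypothetical container)."""
    matched_findings: int
    # Count of clinically significant errors per category, e.g. "a", "b", ...
    significant_errors: Dict[str, int]


def green_score(counts: SectionCounts) -> float:
    """Ratio from the original GREEN metric (assumed to carry over here).

    score = matched / (matched + total clinically significant errors)
    A section with no matched findings scores 0.0.
    """
    if counts.matched_findings == 0:
        return 0.0
    total_errors = sum(counts.significant_errors.values())
    return counts.matched_findings / (counts.matched_findings + total_errors)


# Illustrative counts, not taken from the paper.
example = SectionCounts(
    matched_findings=3,
    significant_errors={"a": 1, "b": 0, "c": 0, "d": 1},
)
print(green_score(example))
```

Clinically insignificant errors are identified by the judge but do not enter this ratio in the original GREEN definition, which is why the sketch ignores them.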
