VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
Pith reviewed 2026-05-21 21:10 UTC · model grok-4.3
The pith
A lightweight Vision-Action Adapter injects patient-specific 3D heart structure understanding into pre-trained ultrasound foundation models for echocardiography probe guidance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embedding the VA-Adapter inside the image encoder of an ultrasound foundation model enables the model to infer cardiac anatomy from historical vision-action sequences, thereby supplying the missing patient-specific 3D structure understanding needed for accurate probe guidance without explicit 3D supervision or full-model retraining.
What carries the argument
Vision-Action Adapter (VA-Adapter), a lightweight module inserted into the foundation model's image encoder that processes sequences of 2D images and corresponding probe actions to build patient-specific 3D navigation capability.
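The text above does not describe the adapter's internals, so the following is only a minimal sketch of the general idea, assuming a generic residual bottleneck adapter from parameter-efficient transfer learning rather than the authors' actual design. The class name `BottleneckAdapter`, the mean-pooled action history, and all dimensions are hypothetical choices made for illustration.

```python
import random


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


class BottleneckAdapter:
    """Hypothetical adapter: out = feat + Up(relu(Down([feat ; history])))."""

    def __init__(self, feat_dim, action_dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.action_dim = action_dim
        in_dim = feat_dim + action_dim
        # Small random down-projection into the bottleneck.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                     for _ in range(bottleneck)]
        # Zero-initialised up-projection: the adapter starts as an identity
        # map, so the frozen encoder's features are untouched before training.
        self.up = [[0.0] * bottleneck for _ in range(feat_dim)]

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)

    def __call__(self, feat, past_actions):
        # Assumption: summarise the action history by its mean vector.
        n = len(past_actions)
        if n:
            hist = [sum(a[i] for a in past_actions) / n
                    for i in range(self.action_dim)]
        else:
            hist = [0.0] * self.action_dim
        hidden = relu(matvec(self.down, feat + hist))
        delta = matvec(self.up, hidden)
        return [f + d for f, d in zip(feat, delta)]


adapter = BottleneckAdapter(feat_dim=4, action_dim=6, bottleneck=2)
frozen_feature = [0.1, 0.2, 0.3, 0.4]      # stand-in for one encoder feature
history = [[1.0] * 6, [0.0] * 6]           # stand-in for two past probe actions
adapted = adapter(frozen_feature, history)
```

Because the up-projection starts at zero, the adapted feature equals the frozen feature until the adapter weights are trained, which is one common way to add a module to a pre-trained encoder without disturbing it at initialisation.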
If this is right
- Probe guidance can be achieved by freezing most of a large foundation model and training only a small adapter.
- Adaptation to new patients becomes feasible with far lower compute and data requirements than full retraining.
- The same adapter pattern supplies a route for adding 3D context to other 2D foundation models used in medical imaging.
- Real-time probe suggestions become practical because the adapter runs on top of an already-trained encoder.
Where Pith is reading between the lines
- The same lightweight-adapter pattern could be tried on foundation models for other ultrasound procedures that also need 3D spatial reasoning.
- If the adapter can be made still smaller, the whole guidance system might run on portable or bedside ultrasound hardware.
- The use of historical sequences suggests the method could support continual adaptation within a single patient exam.
Load-bearing premise
The pre-trained ultrasound foundation model already holds sufficiently robust 2D image representations that a small adapter can add patient-specific 3D understanding without any direct 3D training data or full retraining.
What would settle it
A controlled test on a new set of echocardiography scans from patients with atypical heart geometries in which the VA-Adapter version shows no gain in probe-placement accuracy or navigation success rate over the unmodified foundation model.
Original abstract
Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicates image feature understanding, and differences in individual three-dimensional (3D) structures, which poses challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) to online inject the capability of understanding individual 3D structures. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VA-Adapter, a lightweight vision-action adapter embedded into a frozen ultrasound foundation model's image encoder. The adapter enables online inference of patient-specific 3D cardiac anatomy from historical 2D vision-action sequences, addressing individual variability in echocardiography probe guidance. Based on experiments on a dataset of over 1.31 million samples, the authors report that VA-Adapter outperforms strong probe guidance baselines while training approximately 33 times fewer parameters.
Significance. If the performance gains and parameter efficiency hold under rigorous controls, the work would offer a practical route to deploy foundation models for probe guidance without full retraining, potentially lowering barriers to high-quality echocardiography. The design choice to mimic sonographer cognition via sequential vision-action modeling is conceptually appealing and could generalize to other ultrasound navigation tasks. The reported scale of the dataset is a clear strength.
major comments (2)
- [Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.
- [Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.
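One way to run the kind of significance test requested above is an exact Wilcoxon signed-rank test on paired per-patient scores. The sketch below is a stdlib-only illustration, not the authors' analysis; the function name and the example scores are hypothetical, and the exhaustive enumeration is only practical for small numbers of pairs (roughly 20 or fewer).

```python
from itertools import product


def wilcoxon_signed_rank_exact(a, b):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical counts of successful navigations per patient (not paper data).
adapter_scores = [15, 17, 12, 18, 14, 16]
baseline_scores = [14, 15, 9, 14, 9, 10]
w_plus, p_value = wilcoxon_signed_rank_exact(adapter_scores, baseline_scores)
```

Pairing by patient matters here: it compares the two methods on the same hearts, which partly controls for the inter-patient variability the comment raises.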
minor comments (2)
- [Abstract] The abstract states 'approximately 33 times fewer trained parameters' without giving the absolute parameter counts for VA-Adapter versus the strongest baseline; adding these numbers would improve precision.
- [Method] Notation for the vision-action sequence input and how actions are encoded alongside image features could be clarified with a diagram or explicit equations in the method section.
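The first minor comment asks for absolute counts next to the ratio. As a small illustration of how such a statement could be reported, the helper below formats both counts and the resulting reduction factor. The function name is hypothetical and the numbers are placeholders chosen only to make the arithmetic visible; they are not taken from the paper.

```python
def trained_param_report(name_small, trained_small, name_large, trained_large):
    """Format absolute trained-parameter counts with the reduction factor."""
    ratio = trained_large / trained_small
    return (f"{name_small}: {trained_small:,} trained params; "
            f"{name_large}: {trained_large:,} trained params; "
            f"reduction: {ratio:.1f}x")


# Placeholder counts, NOT values reported in the manuscript.
report = trained_param_report("adapter (placeholder)", 3_000_000,
                              "baseline (placeholder)", 99_000_000)
```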
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.
- Referee: [Method and Experiments] The central claim that VA-Adapter specifically solves the 3D structural variability problem rests on the assumption that historical vision-action sequences supply sufficient geometric signal for 3D inference. However, the manuscript provides no 3D reconstruction loss, explicit 3D labels, or direct probe-position error metric that would distinguish true 3D anatomy understanding from improved 2D feature calibration alone.
  Authors: We thank the referee for this precise observation. The training data consists of 2D ultrasound frames paired with probe actions and does not contain explicit 3D labels or reconstruction supervision; therefore no 3D reconstruction loss is used. The VA-Adapter instead learns an implicit representation of patient-specific 3D cardiac anatomy by conditioning the frozen foundation-model encoder on historical vision-action sequences, enabling the model to predict probe movements that account for individual geometry. This design choice mirrors how sonographers acquire 3D understanding through sequential observation and action rather than explicit volumetric reconstruction. Probe-guidance performance (success rate, image-quality scores) serves as the downstream metric that validates the utility of this implicit 3D modeling. We will add a dedicated paragraph in the revised Methods and Discussion sections clarifying the implicit versus explicit distinction and will report probe-position statistics if the dataset contains them.
  Revision: partial
- Referee: [Experiments] Table reporting main results (presumably Table 1 or equivalent in §4): while outperformance and the 33× parameter reduction are stated, the text supplies no details on baseline implementations, statistical significance testing, or controls for inter-patient variability, leaving the empirical superiority only partially verifiable.
  Authors: We agree that these details are essential for rigorous verification. In the revised manuscript we will: (i) expand the experimental setup subsection with complete baseline implementation details and hyper-parameter choices, (ii) add statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing VA-Adapter against each baseline, and (iii) explicitly describe inter-patient controls, including patient-wise data splits that ensure no patient overlap between training and test sets together with patient-stratified performance metrics. These additions will be incorporated into the main results table and the accompanying text.
  Revision: yes
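The patient-wise split mentioned in point (iii) can be sketched as follows. This is a minimal stdlib illustration under stated assumptions, not the authors' pipeline: the function name, the `(patient_id, sample)` pair format, and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict


def patient_wise_split(samples, test_fraction=0.2, seed=0):
    """Split (patient_id, sample) pairs so no patient is in both sets."""
    by_patient = defaultdict(list)
    for patient_id, sample in samples:
        by_patient[patient_id].append(sample)
    # Shuffle patients, not samples, so a whole patient goes to one side.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [(p, s) for p in patients if p not in test_ids
             for s in by_patient[p]]
    test = [(p, s) for p in patients if p in test_ids
            for s in by_patient[p]]
    return train, test


# Hypothetical toy data: 10 patients with 3 frames each.
toy = [(f"p{i}", frame) for i in range(10) for frame in range(3)]
train_set, test_set = patient_wise_split(toy)
```

Splitting at the patient level rather than the frame level avoids leakage: frames from one heart are highly correlated, so a frame-level split would let the model see test patients during training.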
Circularity Check
No circularity: empirical performance claims rest on dataset comparisons
The paper proposes a VA-Adapter module to adapt a frozen ultrasound foundation model for probe guidance by injecting 3D structure understanding from vision-action sequences. All load-bearing claims (outperformance on 1.31M samples, 33× fewer trained parameters) are justified solely by empirical experiments against baselines. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. There is therefore no derivation chain to audit, and the claims stand or fall on comparisons with external baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Ultrasound foundation models learn robust image representations from vast datasets that transfer to probe guidance tasks.
invented entities (1)
- VA-Adapter: no independent evidence