ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Pith reviewed 2026-05-10 14:56 UTC · model grok-4.3
The pith
ReXSonoVQA benchmark shows vision-language models extract some ultrasound procedural details but struggle with causal troubleshooting questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish ReXSonoVQA as a video QA benchmark of 514 clips with paired questions targeting three competencies in ultrasound: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. In zero-shot evaluation, Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro extract some procedural information but remain challenged by troubleshooting questions, and video input yields only minimal gains over text-only baselines. The authors read this as exposing limits in causal reasoning.
What carries the argument
The ReXSonoVQA benchmark, a set of 514 ultrasound video clips paired with 514 questions (249 multiple-choice and 265 free-response) that directly probe the three competencies of Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning.
If this is right
- Vision-language models can extract limited procedural information from dynamic ultrasound videos.
- Troubleshooting questions remain difficult for current models with little added benefit from video over text.
- The benchmark identifies clear gaps in causal reasoning for procedure-centric tasks.
- ReXSonoVQA can support development of perception systems for ultrasound training, guidance, and robotic automation.
Where Pith is reading between the lines
- The emphasis on causal gaps suggests future model designs may benefit from explicit temporal cause-effect modeling rather than general video captioning.
- This type of procedural benchmark could be adapted to other real-time medical imaging domains to test similar action-planning skills.
- High performance on the benchmark might eventually reduce reliance on human operators for basic ultrasound scans.
- Fine-tuning experiments on the dataset would provide a direct test of whether the identified limitations are addressable through targeted training.
Load-bearing premise
The 514 video clips and questions were selected and written without bias and validly measure the three targeted competencies of procedural ultrasound understanding.
What would settle it
A result in which future models achieve markedly higher accuracy on the troubleshooting questions from video input than from text-only input, while showing comparable gains on the other question types, would indicate the benchmark successfully isolates the need for visual causal reasoning.
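The comparison described above can be sketched as a per-competency accuracy gap between video-informed and text-only runs. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the record layout, the `accuracy_gain_by_type` helper, and the toy values are all hypothetical.

```python
from collections import defaultdict


def accuracy_gain_by_type(records):
    """Return {question_type: video_accuracy - text_only_accuracy}.

    Each record is a hypothetical tuple:
    (question_type, correct_with_video, correct_text_only).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n, video_correct, text_correct]
    for qtype, video_ok, text_ok in records:
        counts = totals[qtype]
        counts[0] += 1
        counts[1] += int(video_ok)
        counts[2] += int(text_ok)
    return {q: (v - t) / n for q, (n, v, t) in totals.items()}


# Toy data, invented for illustration only (not results from the paper).
toy = [
    ("ArtifactResolution", True, False),
    ("ArtifactResolution", False, False),
    ("ActionGoal", True, True),
    ("ActionGoal", True, False),
]
gains = accuracy_gain_by_type(toy)
```

A markedly larger gain on the troubleshooting type than the benchmark currently reports would be the signal this section describes.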
Original abstract
Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReXSonoVQA, a video QA benchmark with 514 ultrasound video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluations of VLMs including Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro indicate that models can extract some procedural information but struggle with troubleshooting questions, showing minimal gains over text-only baselines and exposing limitations in causal reasoning. The benchmark aims to support development of perception systems for ultrasound training, guidance, and automation.
Significance. If the benchmark's validity holds once the validation gaps are addressed, this work would be significant for computer vision and medical AI by providing the first dynamic, procedure-centric ultrasound QA dataset, addressing the limitation of existing static-image benchmarks. It offers a concrete resource for evaluating and improving VLMs in real-time medical imaging scenarios, with potential impact on autonomous systems and robotics.
major comments (3)
- [Dataset Construction] The manuscript provides no details on the sourcing, selection criteria, or bias mitigation for the 514 video clips and questions (abstract and dataset section). This is load-bearing for the central claim, as the reported performance gaps and causal-reasoning limitations could arise from dataset artifacts rather than model deficiencies if clips favor common procedures or questions contain linguistic cues.
- [Evaluation Methodology] No expert review process, inter-annotator agreement metrics, or validation steps for question design are described (evaluation and results sections). Without this, it is unclear whether the questions validly isolate the three targeted competencies, undermining the zero-shot results and comparison to text-only baselines.
- [Results and Analysis] The results lack statistical significance tests for the claimed minimal gains over text-only baselines or error analysis breaking down failures on troubleshooting questions (results section). This weakens support for the conclusion about VLM limitations in causal reasoning.
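One way to act on the third major comment, for questions scored right or wrong, is an exact McNemar test on the paired outcomes of the same question answered with and without video. The sketch below is an assumption about a suitable test, not the authors' analysis. The function name and the example counts are hypothetical, and it applies only where each answer has a binary score.

```python
from math import comb


def mcnemar_exact(video_only_correct, text_only_correct):
    """Two-sided exact McNemar p-value from the two discordant counts.

    video_only_correct: questions right with video but wrong text-only.
    text_only_correct:  questions right text-only but wrong with video.
    """
    n = video_only_correct + text_only_correct
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(video_only_correct, text_only_correct)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
p_value = mcnemar_exact(8, 2)
```

Questions that both settings get right, or both get wrong, do not enter the test. Only the discordant pairs carry information about whether video helps.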
minor comments (2)
- [Abstract] Clarify the exact model versions (e.g., 'Gemini 3 Pro') and ensure consistent naming between abstract and main text.
- [Dataset] Add a table summarizing question distribution across the three competencies and video clip characteristics (duration, procedure types).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to improve clarity and rigor.
Point-by-point responses
-
Referee: [Dataset Construction] The manuscript provides no details on the sourcing, selection criteria, or bias mitigation for the 514 video clips and questions (abstract and dataset section). This is load-bearing for the central claim, as the reported performance gaps and causal-reasoning limitations could arise from dataset artifacts rather than model deficiencies if clips favor common procedures or questions contain linguistic cues.
Authors: We agree that the manuscript would benefit from expanded details on dataset construction to allow readers to evaluate potential artifacts. In the revised version, we will add a dedicated subsection describing: the sourcing of the 514 clips from a combination of clinical archives (with IRB approval) and publicly available ultrasound video repositories; explicit selection criteria that prioritize diversity across procedure types, ultrasound systems, and patient demographics; and bias mitigation steps including stratified sampling to avoid over-representation of common procedures and manual review of questions for linguistic cues or answer leakage. These additions will directly address concerns about whether performance gaps reflect model limitations or dataset characteristics.
Revision: yes
-
Referee: [Evaluation Methodology] No expert review process, inter-annotator agreement metrics, or validation steps for question design are described (evaluation and results sections). Without this, it is unclear whether the questions validly isolate the three targeted competencies, undermining the zero-shot results and comparison to text-only baselines.
Authors: We acknowledge that the current text does not sufficiently document the question validation process. We will revise the Evaluation section to include: a description of the multi-stage design workflow involving ultrasound experts (sonographers and radiologists) who reviewed and refined questions for each of the three competencies; inter-annotator agreement metrics (e.g., Fleiss' kappa) computed on a held-out subset of 100 questions; and explicit validation steps confirming that questions test procedural reasoning rather than superficial visual or textual patterns. This documentation will clarify how the benchmark isolates the intended competencies.
Revision: yes
-
Referee: [Results and Analysis] The results lack statistical significance tests for the claimed minimal gains over text-only baselines or error analysis breaking down failures on troubleshooting questions (results section). This weakens support for the conclusion about VLM limitations in causal reasoning.
Authors: We agree that additional statistical rigor and error analysis would strengthen the results. In the revised manuscript, we will add: statistical significance tests (paired t-tests with reported p-values and effect sizes) for all VLM vs. text-only baseline comparisons to confirm the minimal gains; and a detailed error analysis of troubleshooting questions, breaking down failures by category (e.g., artifact misidentification, incorrect causal inference, or planning errors) with quantitative counts and qualitative examples per model. These changes will provide firmer support for our conclusions on causal reasoning limitations.
Revision: yes
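The inter-annotator agreement metric named in the second response, Fleiss' kappa, can be computed from a table of per-question rating counts. The sketch below is a minimal standard-library illustration. The input layout, where each row counts how many raters assigned a question to each competency, and the toy table are assumptions made here, not details from the paper or the rebuttal.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of shape (items, categories).

    ratings[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of raters")
    n_cats = len(ratings[0])
    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 4 questions, 3 raters, 3 competency labels (invented values).
toy_ratings = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [2, 1, 0]]
kappa = fleiss_kappa(toy_ratings)
```

Values near 1 indicate that raters agree on which competency a question targets far more often than chance would predict. Values near 0 indicate agreement no better than chance.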
Circularity Check
New benchmark creation and zero-shot VLM evaluations contain no circular derivation steps
Full rationale
The paper introduces a new dataset (ReXSonoVQA with 514 video clips and questions) and reports direct zero-shot model performance on it. No equations, fitted parameters, or derivation chains are present that reduce predictions to inputs by construction. The three competencies are defined by the authors' question design rather than derived from prior results, and evaluations use external models without self-referential fitting. This is a standard empirical benchmark paper whose central claims rest on new data collection and testing, not on any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
-
[1]
doi: https://doi.org/10.1016/j.media.2023.102878. URL https://www.sciencedirect.com/science/article/pii/S136184152300138X. Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. Anjie Le, Henan Liu, Yue Wang, Zhenyu Li...
-
[2]
URL https://proceedings.mlr.press/v281/zhang25b.html. Appendix A. Prompts: See Fig A1, A2, A3, A4, A5. Appendix B. Case Studies: See Fig A7, A8, A9, A10, A11. Appendix C. Cross-Setting Outcome Tables: Tables A1–A8 report the cross-tabulation of video-informed vs. text-only (blind) outcomes for all four evaluated models, separately for MCQ and free-respons...
-
[3]
Type1_ActionGoalReasoning
  - Tests: action reasoning + goal inference.
  - Ask: what maneuver is being performed AND what imaging goal / target view it serves.
  - Phrase questions naturally and adaptively based on the content.
  - Do NOT describe the specific anatomical structures or visual details in the question
-
[4]
Type2_ArtifactResolutionOptimization
  - Tests: overcoming artifacts or ambiguity + optimization/disambiguation logic.
  - Ask: what (probe maneuver, patient management, or knobology) has changed AND why it resolves an artifact or ambiguity / improves image quality.
  - IMPORTANT: Do not explicitly describe the artifact or ambiguity in the question.
  - Phrase qu...
-
[5]
Type3_ProcedureContextPlanning
  - Tests: overall context understanding + next-step planning.
  - Ask: what phase/step the operator is in AND what the broader workflow objective or next logical step is.
  - Usually use TWO or more ADJACENT EVENTS to create sufficient context.
  - Vary your phrasing - ask about exam phases, workflow transitions, procedural objecti...