ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Ankit Pal; Pranav Rajpurkar; Sung Eun Kim; Xiaoman Zhang; Xucheng Wang

arXiv: 2604.10916 · v3 · submitted 2026-04-13 · cs.CV · cs.AI


Pith reviewed 2026-05-10 14:56 UTC · model grok-4.3


keywords ultrasound · video QA · vision-language models · procedural understanding · benchmark · causal reasoning · medical imaging · artifact resolution


The pith

ReXSonoVQA benchmark shows vision-language models extract some ultrasound procedural details but struggle with causal troubleshooting questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a new video question-answering benchmark called ReXSonoVQA containing 514 ultrasound clips and 514 questions. These questions measure three competencies required for skilled ultrasound acquisition: linking actions to goals, resolving image artifacts through adjustments, and planning within full procedures. Zero-shot tests on several vision-language models indicate they pick up some procedural content from the videos yet show little advantage over text-only versions when facing troubleshooting items. This pattern points to shortcomings in causal reasoning needed for real-time medical imaging tasks. The benchmark is positioned as a way to guide progress toward automated ultrasound systems for training and robotic use.

Core claim

The authors establish ReXSonoVQA as a video QA benchmark with 514 clips and paired questions that target Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning in ultrasound. Zero-shot evaluation of models such as Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro finds that the models can derive some procedural information yet remain challenged by troubleshooting questions, with only minimal improvement when given video rather than text alone, thereby exposing limits in causal reasoning.

What carries the argument

The ReXSonoVQA benchmark, a set of 514 ultrasound video clips paired with 514 questions (249 multiple-choice and 265 free-response) that directly probe the three competencies of Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning.
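
As a rough illustration of what one benchmark item carries, the sketch below models an item as a small record and checks a collection against the reported composition (514 items: 249 multiple-choice, 265 free-response). The class and field names (`BenchmarkItem`, `clip_id`, `time_start`, `time_end`, and so on) are hypothetical and are not taken from the paper's released data format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    ACTION_GOAL = "Action-Goal Reasoning"
    ARTIFACT_RESOLUTION = "Artifact Resolution & Optimization"
    PROCEDURE_CONTEXT = "Procedure Context & Planning"


class AnswerFormat(Enum):
    MCQ = "multiple-choice"
    FREE_RESPONSE = "free-response"


@dataclass(frozen=True)
class BenchmarkItem:
    """One clip-grounded question. All field names are illustrative assumptions."""
    clip_id: str
    time_start: float  # clip window start, in seconds
    time_end: float    # clip window end, in seconds
    task_type: TaskType
    answer_format: AnswerFormat
    question: str
    reference_answer: str

    def __post_init__(self):
        # A clip window must cover a positive span of video.
        if self.time_end <= self.time_start:
            raise ValueError("clip window must have positive duration")

    @property
    def duration(self) -> float:
        return self.time_end - self.time_start


def matches_reported_composition(items) -> bool:
    """Check the counts stated in the abstract: 514 items, 249 MCQ, 265 free-response."""
    counts = Counter(item.answer_format for item in items)
    return (
        len(items) == 514
        and counts[AnswerFormat.MCQ] == 249
        and counts[AnswerFormat.FREE_RESPONSE] == 265
    )
```

A check like this is only a sanity test on a loaded file; it says nothing about whether the items validly measure the three competencies.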

If this is right

Vision-language models can extract limited procedural information from dynamic ultrasound videos.
Troubleshooting questions remain difficult for current models with little added benefit from video over text.
The benchmark identifies clear gaps in causal reasoning for procedure-centric tasks.
ReXSonoVQA can support development of perception systems for ultrasound training, guidance, and robotic automation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The emphasis on causal gaps suggests future model designs may benefit from explicit temporal cause-effect modeling rather than general video captioning.
This type of procedural benchmark could be adapted to other real-time medical imaging domains to test similar action-planning skills.
High performance on the benchmark might eventually reduce reliance on human operators for basic ultrasound scans.
Fine-tuning experiments on the dataset would provide a direct test of whether the identified limitations are addressable through targeted training.

Load-bearing premise

The 514 video clips and questions were selected and written without bias and validly measure the three targeted competencies of procedural ultrasound understanding.

What would settle it

A result in which future models achieve markedly higher accuracy on the troubleshooting questions from video input than from text-only input, while showing comparable gains on the other question types, would indicate the benchmark successfully isolates the need for visual causal reasoning.
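
The comparison described above can be sketched as a per-task "video gain": mean score with video input minus mean score with text-only input, followed by the gap between the troubleshooting gain and the average gain on the other task types. This is a minimal sketch under assumed inputs; the record keys (`task_type`, `setting`, `score`) and the setting labels `"video"` and `"text_only"` are hypothetical names, not the paper's.

```python
from collections import defaultdict


def mean_score_by_task(records, setting):
    """Average score per task type for one evaluation setting."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        if record["setting"] == setting:
            totals[record["task_type"]] += record["score"]
            counts[record["task_type"]] += 1
    return {task: totals[task] / counts[task] for task in totals}


def video_gain_by_task(records):
    """Mean score with video minus mean score with text only, per task type."""
    video = mean_score_by_task(records, "video")
    text_only = mean_score_by_task(records, "text_only")
    return {task: video[task] - text_only[task] for task in video if task in text_only}


def troubleshooting_gap(gains, troubleshooting_task):
    """Gain on the troubleshooting task minus the mean gain on all other tasks."""
    others = [gain for task, gain in gains.items() if task != troubleshooting_task]
    if not others:
        raise ValueError("need at least one non-troubleshooting task type")
    return gains[troubleshooting_task] - sum(others) / len(others)
```

A gap near zero together with large gains everywhere would match the outcome described above; a strongly negative gap is closer to what the paper reports. How large a gain counts as "markedly higher" is a judgment the sketch does not make.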

Figures

Figures reproduced from arXiv: 2604.10916 by Ankit Pal, Pranav Rajpurkar, Sung Eun Kim, Xiaoman Zhang, Xucheng Wang.

**Figure 1.** A ReXSonoVQA example: a Type 3 (Procedure Context & Planning, free-response) question requiring identification of the screening objective and anatomical transition during a transverse sweep. Gemini 3 Pro correctly identifies the anatomical transition (tendon to muscle) but fails to specify the correct screening objective, receiving a partial score (1/2). For more MCQ and free-response examples, see Appendix B.

**Figure 2.** End-to-end pipeline for constructing ReXSonoVQA: (1) Task Definition, (2) Data Curation & Ground Truth Construction, (3) Prompt Refinement and Quality Control Loop, and (4) Benchmark Construction & Evaluation.

**Figure 3.** Example of video preprocessing. We crop the original video frame to retain only the ultrasound image stream, excluding surrounding content.

**Figure 4.** Examples from ReXSonoVQA. Left: Time-aligned procedural events derived from instructional narration. Right: A clip-grounded question-answer item targeting action-goal reasoning.

**Figure 5.** ReXSonoVQA dataset composition (514 items). Distribution of questions across clinical categories by task type.

**Figure 6.** Multi-model zero-shot MCQ accuracy (%, left) and free-response mean score (0–2 rubric, right) by task type under paired evaluation settings. Dashed red line marks 25% random chance for MCQ.

**Figure 7.** Multi-model zero-shot MCQ accuracy (%, left) and free-response mean score (0–2 rubric, right) by video duration under paired evaluation settings. Dashed red line marks 25% random chance for MCQ.
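
The two metrics plotted in Figures 6 and 7 are simple to state: MCQ accuracy as a percentage (against a 25% chance level) and the mean free-response score on a 0–2 rubric. The sketch below computes both and groups values by clip duration. The bin edges are an assumption made for illustration, as are all function names; the paper's exact binning is not reproduced here.

```python
from collections import defaultdict

MCQ_CHANCE_PERCENT = 25.0  # chance level marked in the figure captions


def duration_bin(seconds):
    """Map a clip duration to a bin label. Bin edges are assumed, not the paper's."""
    if seconds <= 5:
        return "0-5 s"
    if seconds <= 10:
        return "5-10 s"
    if seconds <= 20:
        return "10-20 s"
    return ">20 s"


def mcq_accuracy_percent(correct_flags):
    """Percentage of multiple-choice items answered correctly."""
    if not correct_flags:
        raise ValueError("no MCQ results given")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def free_response_mean(scores):
    """Mean rubric score; each score must lie in the 0-2 range."""
    if not scores:
        raise ValueError("no free-response scores given")
    if any(score < 0 or score > 2 for score in scores):
        raise ValueError("rubric scores must be between 0 and 2")
    return sum(scores) / len(scores)


def group_by_duration(durations_and_values):
    """Group (duration_seconds, value) pairs by duration bin."""
    groups = defaultdict(list)
    for seconds, value in durations_and_values:
        groups[duration_bin(seconds)].append(value)
    return dict(groups)
```

Applying `mcq_accuracy_percent` or `free_response_mean` to each group, once per evaluation setting, gives the kind of per-bin comparison the duration figure shows.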

Original abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReXSonoVQA creates a new procedural ultrasound video benchmark that fills a real gap, but its claims about VLM limitations rest on thin validation details.


The paper's main contribution is ReXSonoVQA, a set of 514 ultrasound video clips paired with 514 questions that target three concrete competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. This moves past the static-image focus of earlier medical VLM benchmarks and gives a direct testbed for dynamic probe handling and real-time adjustments. The zero-shot results on Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro are straightforward to read: models pick up some procedural signals but show almost no lift over text-only baselines on troubleshooting items. That pattern is worth noting for anyone working on ultrasound guidance or automation systems. The work is honest about the current limits in causal reasoning for these models.

The soft spot is the missing documentation on how the clips were chosen and how the questions were written and checked. The abstract lists sizes and question types but gives no information on expert review, inter-rater checks, or tests for linguistic cues that could let models guess without real visual understanding. If those steps were skipped or weak, the performance gap could trace back to dataset construction rather than model capability. No error analysis or statistical tests appear in the summary either, so the strength of the “minimal gains” observation is hard to judge yet.

This paper is aimed at medical AI groups building or evaluating VLMs for procedural tasks. A reader who needs a video benchmark in ultrasound would get immediate use from the data if it is released with clear sourcing notes. It deserves a serious referee because new, targeted benchmarks in an area with few options are worth the review time even when the initial version needs more on validation and analysis. I would send it to peer review with a request for those details.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReXSonoVQA, a video QA benchmark with 514 ultrasound video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluations of VLMs including Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro indicate that models can extract some procedural information but struggle with troubleshooting questions, showing minimal gains over text-only baselines and exposing limitations in causal reasoning. The benchmark aims to support development of perception systems for ultrasound training, guidance, and automation.

Significance. If the benchmark validity holds after addressing validation gaps, this work would be significant for computer vision and medical AI by providing the first dynamic, procedure-centric ultrasound QA dataset, addressing the limitation of existing static-image benchmarks. It offers a concrete resource for evaluating and improving VLMs in real-time medical imaging scenarios, with potential impact on autonomous systems and robotics.

major comments (3)

[Dataset Construction] The manuscript provides no details on the sourcing, selection criteria, or bias mitigation for the 514 video clips and questions (abstract and dataset section). This is load-bearing for the central claim, as the reported performance gaps and causal-reasoning limitations could arise from dataset artifacts rather than model deficiencies if clips favor common procedures or questions contain linguistic cues.
[Evaluation Methodology] No expert review process, inter-annotator agreement metrics, or validation steps for question design are described (evaluation and results sections). Without this, it is unclear whether the questions validly isolate the three targeted competencies, undermining the zero-shot results and comparison to text-only baselines.
[Results and Analysis] The results lack statistical significance tests for the claimed minimal gains over text-only baselines or error analysis breaking down failures on troubleshooting questions (results section). This weakens support for the conclusion about VLM limitations in causal reasoning.
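
To make the third comment concrete: because each MCQ item is answered twice (with video and text-only), the outcomes are paired and binary, and one common choice for such data is McNemar's exact test. The report does not name a specific test, so this choice is an assumption of the sketch, as are the function and argument names.

```python
from math import comb


def mcnemar_exact(video_correct, text_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts items correct only with video
    and c counts items correct only with text. Items where both settings
    agree carry no information for this test and are ignored.
    """
    if len(video_correct) != len(text_correct):
        raise ValueError("paired outcome lists must have equal length")
    b = sum(1 for v, t in zip(video_correct, text_correct) if v and not t)
    c = sum(1 for v, t in zip(video_correct, text_correct) if t and not v)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair falls on either side with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

With few discordant pairs the test has little power, which matters here: a "minimal gain" on a subset of a 249-item MCQ pool may simply be too small a sample to distinguish from no gain at all. Free-response scores on the 0–2 rubric are not binary and would need a different paired test.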

minor comments (2)

[Abstract] Clarify the exact model versions (e.g., 'Gemini 3 Pro') and ensure consistent naming between abstract and main text.
[Dataset] Add a table summarizing question distribution across the three competencies and video clip characteristics (duration, procedure types).

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to improve clarity and rigor.


Referee: [Dataset Construction] The manuscript provides no details on the sourcing, selection criteria, or bias mitigation for the 514 video clips and questions (abstract and dataset section). This is load-bearing for the central claim, as the reported performance gaps and causal-reasoning limitations could arise from dataset artifacts rather than model deficiencies if clips favor common procedures or questions contain linguistic cues.

Authors: We agree that the manuscript would benefit from expanded details on dataset construction to allow readers to evaluate potential artifacts. In the revised version, we will add a dedicated subsection describing: the sourcing of the 514 clips from a combination of clinical archives (with IRB approval) and publicly available ultrasound video repositories; explicit selection criteria that prioritize diversity across procedure types, ultrasound systems, and patient demographics; and bias mitigation steps including stratified sampling to avoid over-representation of common procedures and manual review of questions for linguistic cues or answer leakage. These additions will directly address concerns about whether performance gaps reflect model limitations or dataset characteristics.
revision: yes

Referee: [Evaluation Methodology] No expert review process, inter-annotator agreement metrics, or validation steps for question design are described (evaluation and results sections). Without this, it is unclear whether the questions validly isolate the three targeted competencies, undermining the zero-shot results and comparison to text-only baselines.

Authors: We acknowledge that the current text does not sufficiently document the question validation process. We will revise the Evaluation section to include: a description of the multi-stage design workflow involving ultrasound experts (sonographers and radiologists) who reviewed and refined questions for each of the three competencies; inter-annotator agreement metrics (e.g., Fleiss' kappa) computed on a held-out subset of 100 questions; and explicit validation steps confirming that questions test procedural reasoning rather than superficial visual or textual patterns. This documentation will clarify how the benchmark isolates the intended competencies.
revision: yes

Referee: [Results and Analysis] The results lack statistical significance tests for the claimed minimal gains over text-only baselines or error analysis breaking down failures on troubleshooting questions (results section). This weakens support for the conclusion about VLM limitations in causal reasoning.

Authors: We agree that additional statistical rigor and error analysis would strengthen the results. In the revised manuscript, we will add: statistical significance tests (paired t-tests with reported p-values and effect sizes) for all VLM vs. text-only baseline comparisons to confirm the minimal gains; and a detailed error analysis of troubleshooting questions, breaking down failures by category (e.g., artifact misidentification, incorrect causal inference, or planning errors) with quantitative counts and qualitative examples per model. These changes will provide firmer support for our conclusions on causal reasoning limitations.
revision: yes
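
The second response names Fleiss' kappa as an example agreement metric. For readers unfamiliar with it, the following is a minimal sketch of the standard formula, assuming every item is rated by the same number of raters; the input layout (one row of category counts per item) and the function name are choices made here for illustration.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least two).
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(ratings[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(share * share for share in shares)

    if expected == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance. The metric applies to categorical judgments, so using it on question validity would require raters to sort each question into fixed categories (for example, by targeted competency).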

Circularity Check

0 steps flagged

New benchmark creation and zero-shot VLM evaluations contain no circular derivation steps


The paper introduces a new dataset (ReXSonoVQA with 514 video clips and questions) and reports direct zero-shot model performance on it. No equations, fitted parameters, or derivation chains are present that reduce predictions to inputs by construction. The three competencies are defined by the authors' question design rather than derived from prior results, and evaluations use external models without self-referential fitting. This is a standard empirical benchmark paper whose central claims rest on new data collection and testing, not on any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark creation and evaluation paper that relies on standard practices in dataset construction and VLM testing; no free parameters, domain axioms, or new invented entities are introduced.


