arxiv: 2605.21652 · v1 · pith:PI2XRGYYnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Pith reviewed 2026-05-22 09:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords ultrasound visual question answering · active zooming · confidence-aware VQA · vision-language models · lesion localization · Group Relative Policy Optimization · uncertainty-aware reward

The pith

A Zoom-then-Diagnose framework lets ultrasound vision-language models actively focus on lesions before answering questions by using consistency across rollouts as a confidence signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to make vision-language models (VLMs) for ultrasound visual question answering work the way sonographers do: first search for a lesion and zoom in on it, then form a diagnosis. It does this by adding a structured Zoom-then-Diagnose reasoning step and an uncertainty-aware reward inside Group Relative Policy Optimization (GRPO). The reward comes from sampling a group of stochastic rollouts for the same input and measuring how consistent the answers are; high consistency is treated as a sign that the model can trust its output. Experiments on liver, breast, and thyroid ultrasound datasets report a 39.3 percent gain in lesion localization, which the authors take as evidence that the model has learned to look closer before diagnosing.
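
As a rough illustration of the consistency signal described above, the sketch below scores a group of sampled answers for one image-question pair. It is a minimal sketch, not the paper's reward: the function name `rollout_consistency`, the choice of majority agreement and normalized entropy as scores, and the toy answers are all assumptions made for illustration.

```python
import math
from collections import Counter


def rollout_consistency(answers, num_classes):
    """Score how consistent a group of sampled answers is.

    Returns (majority_answer, agreement, normalized_entropy), where
    agreement is the share of rollouts that gave the majority answer and
    normalized_entropy is 0 for full agreement and 1 for a uniform split
    over num_classes answers. Hypothetical helper, not from the paper.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    counts = Counter(answers)
    majority, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    probs = [c / len(answers) for c in counts.values()]
    # Shannon entropy of the empirical answer distribution.
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return majority, agreement, entropy / math.log(num_classes)


# Toy groups of eight rollouts (invented answers, for illustration only).
clear_case = ["benign"] * 8
ambiguous_case = ["benign"] * 4 + ["malignant"] * 4
print(rollout_consistency(clear_case, num_classes=2))
print(rollout_consistency(ambiguous_case, num_classes=2))
```

Under this sketch a clear case yields full agreement and zero entropy, while an evenly split case yields the lowest possible agreement for two classes. Whether such a score reflects diagnostic ambiguity or just sampling noise is exactly the question the referee report raises.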

Core claim

The central claim is that a Zoom-then-Diagnose paradigm together with an uncertainty-aware reward computed from stochastic group-wise rollouts inside the GRPO framework teaches the model to actively zoom into lesion regions before diagnosis and to reinforce correct answers on clear cases while remaining cautious on ambiguous ones.
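
The "zoom" half of that claim can be pictured as a crop around a predicted lesion box. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `zoom_window`, the `(x1, y1, x2, y2)` pixel box format, and the `margin` parameter are hypothetical, and the material here does not say how the model expresses or executes its zoom action.

```python
import math


def zoom_window(box, image_size, margin=0.2):
    """Expand a lesion box by a relative margin and clamp it to the image.

    box: (x1, y1, x2, y2) in pixels, with x1 < x2 and y1 < y2 (assumed format).
    image_size: (width, height) in pixels.
    Returns integer crop bounds (left, top, right, bottom).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    if not (x1 < x2 and y1 < y2):
        raise ValueError("box must have positive width and height")
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    # Round outward so the crop never cuts into the padded region.
    left = max(0, math.floor(x1 - pad_x))
    top = max(0, math.floor(y1 - pad_y))
    right = min(width, math.ceil(x2 + pad_x))
    bottom = min(height, math.ceil(y2 + pad_y))
    return left, top, right, bottom


# A lesion box in the middle of a 100x100 image, with 50% context on each side.
print(zoom_window((40, 40, 60, 60), (100, 100), margin=0.5))
```

The returned bounds could then be passed to any image library's crop routine, and the cropped view fed back to the model for the diagnosis step.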

What carries the argument

Zoom-then-Diagnose paradigm combined with an uncertainty-aware reward derived from stochastic group-wise rollouts inside Group Relative Policy Optimization
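
For readers unfamiliar with GRPO, the group-relative part can be sketched in a few lines: each rollout's reward is standardized against the other rollouts sampled for the same prompt. This is a minimal sketch of that general normalization step only. The reward values are made up, `group_relative_advantages` is a hypothetical helper, the population standard deviation is one possible choice (implementations differ), and how the paper combines its uncertainty-aware term with other rewards is not specified in the material above.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Invented rewards for a group of four rollouts on one question.
mixed_group = [1.0, 0.0, 1.0, 1.0]
uniform_group = [1.0, 1.0, 1.0, 1.0]
print(group_relative_advantages(mixed_group))
print(group_relative_advantages(uniform_group))
```

One consequence of this normalization is visible in the second call: when every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal.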

If this is right

  • Lesion localization improves by 39.3 percent across liver, breast, and thyroid ultrasound datasets.
  • High-consistency predictions on clear cases are reinforced while predictions on ambiguous cases trigger caution.
  • The approach aligns VLM reasoning more closely with the clinical practice of focusing on lesions before reporting.
  • No external labeled validation set is required to estimate when the model should be confident.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-based reward could be tested on other medical imaging modalities that support interactive cropping or magnification.
  • Modeling ambiguity internally might let the system flag uncertain answers for human review rather than forcing a diagnosis.
  • If the reward structure works outside ultrasound, it could be tried on general visual question answering tasks that contain visual ambiguity.

Load-bearing premise

The uncertainty-aware reward from stochastic group-wise rollouts serves as a reliable proxy for model confidence that can separate clear cases from ambiguous ones without any external validation data.

What would settle it

Measure lesion localization accuracy on a fresh set of ultrasound images from the same organs and check whether the reported 39.3 percent gain appears and whether the rollout consistency score actually tracks human judgments of case ambiguity.
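
The second half of that test, whether rollout consistency tracks human judgments of ambiguity, could be checked with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python. It is only one possible analysis: the helper names are hypothetical, the per-case numbers are invented placeholders, and a real check would need actual multi-rater labels, which the material above says the paper does not describe.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-case values: model rollout agreement vs. rater agreement.
model_agreement = [0.95, 0.60, 0.80, 0.55]
rater_agreement = [0.90, 0.50, 0.70, 0.40]
print(spearman(model_agreement, rater_agreement))
```

A correlation near 1 on real data would support the proxy. A value near 0 would suggest the consistency score mostly reflects sampling noise.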

Figures

Figures reproduced from arXiv: 2605.21652 by Erxuan Wu, Hongjoo Lee, Huixiong Xu, Yikang Sun, Yuan Bi, Yue Zhou, Zhongliang Jiang.

Figure 1: Our approach mimics sonographer cognitively. It employs a zoom-in mechanism to simulate sonographers’ localized visualization. Moreover, it reflects their consensus: high consistency in easy cases (where sonographers agree) and appropriate ambiguity in hard cases (where sonographers disagree). These characteristics suggest two key requirements for ultrasound VLMs: lesion-centric structured reasoning and am…

Figure 2: Overview of the framework. Top: The Zoom-then-Diagnose paradigm, realized by constructing a supporting dataset (Sec. 2.1) and performing supervised fine-tuning to instill structured zoom-then-diagnosis reasoning (Sec. 2.2). Bottom: We apply GRPO with a consistency-based uncertainty alignment reward (Sec. 2.3).

Figure 3: Qualitative result. Unlike MedVLM-R1’s uniform predictions, our model reflects clinical ambiguity via explicit reasoning and inconsistent sampling results.
… remains consistently confident on this ambiguous case, whereas our model shows calibrated variability across rollouts, better aligning with the subjective nature of ultrasound interpretation. Third, many baselines exhibit near-zero or negative Entropy G…
Original abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a Zoom-then-Diagnose paradigm for ultrasound VQA that replicates sonographers' interactive lesion-focused workflow, combined with an uncertainty-aware reward inside the Group Relative Policy Optimization (GRPO) framework. The reward is derived from consistency across stochastic group-wise rollouts and is intended to reinforce predictions on clear cases while promoting caution under ambiguity. Experiments on liver, breast, and thyroid datasets are reported to yield a 39.3% improvement in lesion localization.

Significance. If the localization gains prove reproducible and the internal consistency proxy is shown to track actual diagnostic ambiguity, the framework could advance medical VLMs by explicitly incorporating active zooming and confidence calibration, addressing subjectivity in ultrasound interpretation.

major comments (2)
  1. Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.
  2. GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.
minor comments (1)
  1. The manuscript would benefit from explicit definitions of rollout parameters (temperature, group size, number of samples) used to compute the uncertainty reward, as these choices directly affect the proxy's behavior.
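
The minor comment's point about rollout parameters can be made concrete with a small sketch. Assuming the proxy is a majority-agreement score (an assumption, since the exact reward is not given here), the group size fixes which agreement values are even possible, and the sampling temperature changes how peaked the answer distribution is. The helper names and the example logits below are hypothetical.

```python
import math


def possible_agreements(group_size, num_classes):
    """List every majority-agreement value a group of rollouts can take.

    With G rollouts over C answer classes, the top answer occurs k times
    for some k between ceil(G / C) and G, so agreement is k / G.
    """
    if group_size < 1 or num_classes < 1:
        raise ValueError("group_size and num_classes must be positive")
    lowest = math.ceil(group_size / num_classes)
    return [k / group_size for k in range(lowest, group_size + 1)]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# A small group gives a coarse confidence scale; a larger one is finer.
print(possible_agreements(4, num_classes=2))
print(possible_agreements(8, num_classes=2))

# Lower temperature concentrates probability on the top answer, which
# raises expected agreement regardless of how ambiguous the case is.
for t in (0.5, 1.0, 2.0):
    print(t, softmax([2.0, 0.0], temperature=t))
```

This is why the same model can look confident or uncertain on the same case depending on these settings, and why reporting them matters for interpreting the proxy.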

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.

    Authors: We agree that the abstract, constrained by length, omits key experimental details. In the revised manuscript we will expand the abstract to briefly note the baselines (standard VLMs without the Zoom-then-Diagnose paradigm), the dataset sizes for the liver, breast, and thyroid ultrasound VQA collections, the lesion-localization metric (improvement in IoU-based localization accuracy), and a reference to the statistical significance tests already reported in the experimental section.
    Revision: yes

  2. Referee: GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.

    Authors: The uncertainty-aware reward is deliberately constructed from internal consistency across stochastic group-wise rollouts to serve as a training-time proxy for model confidence, allowing the framework to reinforce accurate predictions on clear cases while promoting caution under ambiguity without requiring additional external annotations. This design choice intentionally creates the self-referential aspect noted by the referee. We acknowledge that the absence of multi-rater sonographer labels or inter-observer variability scores leaves open the possibility that the proxy partly captures rollout noise rather than true diagnostic ambiguity. In the revision we will add a dedicated paragraph in the discussion section that explicitly states this limitation, provides further analysis of rollout variance on selected ambiguous examples, and outlines future work that could incorporate external anchors.
    Revision: partial
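
The "IoU-based localization accuracy" named in the first response is not defined further, so the sketch below shows one common reading: a predicted box counts as a hit when its intersection-over-union with the reference box reaches a threshold. The `(x1, y1, x2, y2)` box format, the helper names, and the 0.5 threshold are assumptions, not details taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predicted, reference, threshold=0.5):
    """Share of cases whose predicted box reaches the IoU threshold."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty box lists of equal length")
    hits = sum(box_iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(predicted)


# Two toy cases: one exact match and one complete miss.
preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
refs = [(0, 0, 2, 2), (5, 5, 6, 6)]
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(localization_accuracy(preds, refs))
```

Under a metric like this, a reported percentage gain still needs to state whether it is an absolute or relative change and which threshold was used, which is part of what the referee asks for.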

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external dataset evaluation

Full rationale

The paper's central claim—an empirical 39.3% improvement in lesion localization—is obtained from experiments on liver, breast, and thyroid datasets rather than by algebraic reduction to the reward definition. The uncertainty-aware reward is introduced as a design element inside GRPO using stochastic group-wise rollouts; while this creates an internal consistency signal, the paper does not equate the final performance metric to that signal by construction. No equations are shown that force the localization gain from the reward alone, no self-citation chain carries the uniqueness of the paradigm, and no fitted parameter is relabeled as a prediction. The derivation therefore remains self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that sonographer zooming behavior can be replicated by a VLM and that group rollout consistency is a valid confidence proxy; no free parameters or new physical entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Ultrasound annotations contain inherent subjectivity and ambiguity that standard VLMs fail to model.
    Explicitly stated in the abstract as a limitation of existing models.
invented entities (1)
  • Uncertainty-aware reward from stochastic group-wise rollouts (no independent evidence)
    purpose: Proxy for model confidence to reinforce accurate predictions on clear cases and caution on ambiguous ones.
    Introduced inside the GRPO framework as the key mechanism for confidence awareness.

pith-pipeline@v0.9.0 · 5754 in / 1292 out tokens · 30195 ms · 2026-05-22T09:10:37.931867+00:00 · methodology


