Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
Pith reviewed 2026-05-22 09:10 UTC · model grok-4.3
The pith
A Zoom-then-Diagnose framework lets ultrasound vision-language models actively focus on lesions before answering questions by using consistency across rollouts as a confidence signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that two components, a Zoom-then-Diagnose paradigm and an uncertainty-aware reward computed from stochastic group-wise rollouts inside the Group Relative Policy Optimization (GRPO) framework, teach the model to zoom into lesion regions before diagnosing. The reward is meant to reinforce correct answers on clear cases while keeping the model cautious on ambiguous ones.
What carries the argument
The Zoom-then-Diagnose paradigm, combined with an uncertainty-aware reward derived from stochastic group-wise rollouts inside Group Relative Policy Optimization (GRPO).
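A minimal sketch of how rollout consistency could feed a GRPO-style update. The group-relative standardisation of rewards is the standard GRPO advantage; the reward shape in `uncertainty_aware_rewards` is an assumption made for illustration, because this summary does not give the paper's formula. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def rollout_consistency(answers):
    """Return (majority_answer, fraction of rollouts that agree with it)."""
    majority, votes = Counter(answers).most_common(1)[0]
    return majority, votes / len(answers)


def uncertainty_aware_rewards(answers, label):
    """Hypothetical reward shape (an assumption, not the paper's formula).

    Correct rollouts earn +consistency and incorrect rollouts earn
    -consistency, so a split (ambiguous) group produces smaller rewards
    in either direction than a unanimous group.
    """
    _, consistency = rollout_consistency(answers)
    return [consistency if a == label else -consistency for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardise each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four stochastic rollouts for the same image and question.
answers = ["benign", "benign", "benign", "malignant"]
rewards = uncertainty_aware_rewards(answers, label="benign")
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, the standardised advantages are all zero, so a unanimous group contributes no gradient under this normalisation.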
If this is right
- Lesion localization improves by 39.3 percent across liver, breast, and thyroid ultrasound datasets.
- High-consistency predictions on clear cases are reinforced, while low-consistency predictions on ambiguous cases are treated with caution.
- The approach aligns VLM reasoning more closely with the clinical practice of focusing on lesions before reporting.
- No external labeled validation set is required to estimate when the model should be confident.
Where Pith is reading between the lines
- The same consistency-based reward could be tested on other medical imaging modalities that support interactive cropping or magnification.
- Modeling ambiguity internally might let the system flag uncertain answers for human review rather than forcing a diagnosis.
- If the reward structure works outside ultrasound, it could be tried on general visual question answering tasks that contain visual ambiguity.
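The idea of flagging uncertain answers for human review could be sketched as a simple threshold on rollout agreement at inference time. This is an illustration only: the function name, the `REFER` marker, and the 0.8 threshold are assumptions, and the paper is not described as shipping such a triage step.

```python
from collections import Counter

REFER = "refer to sonographer"  # hypothetical marker for human review


def answer_or_refer(answers, threshold=0.8):
    """Return (decision, consistency) for one group of sampled answers.

    The majority answer is returned only when the share of rollouts that
    agree with it reaches the threshold; otherwise the case is referred.
    The threshold value is an assumption and would need tuning.
    """
    majority, votes = Counter(answers).most_common(1)[0]
    consistency = votes / len(answers)
    if consistency >= threshold:
        return majority, consistency
    return REFER, consistency


clear_case = answer_or_refer(["benign"] * 5)
ambiguous_case = answer_or_refer(
    ["benign", "malignant", "benign", "malignant", "benign"]
)
```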
Load-bearing premise
The uncertainty-aware reward from stochastic group-wise rollouts serves as a reliable proxy for model confidence that can separate clear cases from ambiguous ones without any external validation data.
What would settle it
Measure lesion localization accuracy on a fresh set of ultrasound images from the same organs and check whether the reported 39.3 percent gain appears and whether the rollout consistency score actually tracks human judgments of case ambiguity.
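The second half of that check, whether rollout consistency tracks human judgments of ambiguity, could be run as a rank correlation between per-case consistency and inter-rater agreement. The sketch below implements Spearman's rho in plain Python; the two lists are toy values for illustration, not results from the paper, and using inter-rater agreement as the human ambiguity measure is an assumption.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: per-case rollout consistency and rater agreement.
consistency = [1.0, 0.9, 0.6, 0.5, 0.8]
rater_agreement = [1.0, 0.8, 0.6, 0.4, 0.9]
rho = spearman(consistency, rater_agreement)
```

A rho near 1 on real multi-rater data would support the proxy; a rho near 0 would suggest the consistency score mostly reflects sampling noise.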
Original abstract
Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
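The two-stage control flow the abstract describes can be sketched as follows, assuming the model first proposes a lesion box and then answers from the cropped region. The `locate` and `diagnose` callables, the box convention `(x0, y0, x1, y1)` with exclusive upper bounds, and the relative margin are all hypothetical; the abstract does not specify how the zoom is implemented.

```python
def zoom_crop(image, box, margin=0.1):
    """Crop an image (a list of pixel rows) to a box padded by a relative margin.

    The padded box is clipped to the image borders.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * margin)
    pad_y = int((y1 - y0) * margin)
    left, top = max(0, x0 - pad_x), max(0, y0 - pad_y)
    right, bottom = min(width, x1 + pad_x), min(height, y1 + pad_y)
    return [row[left:right] for row in image[top:bottom]]


def zoom_then_diagnose(image, question, locate, diagnose):
    """Stage 1: propose a lesion box. Stage 2: answer from the zoomed crop."""
    box = locate(image, question)
    crop = zoom_crop(image, box)
    return diagnose(crop, question), box


# Stub callables standing in for model calls (placeholders, not real APIs).
image = [[row * 10 + col for col in range(10)] for row in range(10)]
answer, box = zoom_then_diagnose(
    image,
    "Is the nodule benign?",
    locate=lambda img, q: (2, 3, 6, 7),
    diagnose=lambda crop, q: f"answered from a {len(crop)}x{len(crop[0])} crop",
)
```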
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Zoom-then-Diagnose paradigm for ultrasound VQA that replicates sonographers' interactive lesion-focused workflow, combined with an uncertainty-aware reward inside the Group Relative Policy Optimization (GRPO) framework. The reward is derived from consistency across stochastic group-wise rollouts and is intended to reinforce predictions on clear cases while promoting caution under ambiguity. Experiments on liver, breast, and thyroid datasets are reported to yield a 39.3% improvement in lesion localization.
Significance. If the localization gains prove reproducible and the internal consistency proxy is shown to track actual diagnostic ambiguity, the framework could advance medical VLMs by explicitly incorporating active zooming and confidence calibration, addressing subjectivity in ultrasound interpretation.
major comments (2)
- Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.
- GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.
minor comments (1)
- The manuscript would benefit from explicit definitions of rollout parameters (temperature, group size, number of samples) used to compute the uncertainty reward, as these choices directly affect the proxy's behavior.
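One reason these parameters matter: with a group of G rollouts, the majority-agreement score can only take a few discrete values, so the group size sets the resolution of the confidence proxy. The sketch below makes that explicit. It assumes consistency is measured as the majority-vote share and that every rollout yields one of a fixed set of answer classes; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class RolloutConfig:
    """Settings a paper would need to report for the proxy to be reproducible."""
    temperature: float  # sampling temperature; recorded only, its effect is empirical
    group_size: int     # number of stochastic rollouts per input
    num_classes: int    # number of distinct answers a rollout can give


def attainable_consistency(cfg):
    """All values the majority-vote share can take for this configuration.

    By the pigeonhole principle the majority answer gets at least
    ceil(group_size / num_classes) votes, and at most group_size.
    """
    min_votes = ceil(cfg.group_size / cfg.num_classes)
    return [k / cfg.group_size for k in range(min_votes, cfg.group_size + 1)]


small = attainable_consistency(RolloutConfig(1.0, group_size=4, num_classes=2))
large = attainable_consistency(RolloutConfig(1.0, group_size=8, num_classes=2))
```

With binary answers, a group of 4 yields only three possible scores (0.5, 0.75, 1.0), while a group of 8 yields five, which is why the group size needs to be stated alongside any consistency threshold.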
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
- Referee: Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.
Authors: We agree that the abstract, constrained by length, omits key experimental details. In the revised manuscript we will expand the abstract to briefly note the baselines (standard VLMs without the Zoom-then-Diagnose paradigm), the dataset sizes for the liver, breast, and thyroid ultrasound VQA collections, the lesion-localization metric (improvement in IoU-based localization accuracy), and a reference to the statistical significance tests already reported in the experimental section.
Revision: yes
- Referee: GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.
Authors: The uncertainty-aware reward is deliberately constructed from internal consistency across stochastic group-wise rollouts to serve as a training-time proxy for model confidence, allowing the framework to reinforce accurate predictions on clear cases while promoting caution under ambiguity without requiring additional external annotations. This design choice intentionally creates the self-referential aspect noted by the referee. We acknowledge that the absence of multi-rater sonographer labels or inter-observer variability scores leaves open the possibility that the proxy partly captures rollout noise rather than true diagnostic ambiguity. In the revision we will add a dedicated paragraph in the discussion section that explicitly states this limitation, provides further analysis of rollout variance on selected ambiguous examples, and outlines future work that could incorporate external anchors.
Revision: partial
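The IoU-based localization accuracy mentioned in the first response could be computed along these lines. This is a sketch under assumptions: axis-aligned boxes given as `(x0, y0, x1, y1)`, one predicted box per image, and a hit threshold of 0.5, none of which is confirmed by this summary.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predicted, reference, threshold=0.5):
    """Share of images whose predicted box reaches the IoU threshold.

    The 0.5 threshold is a common convention, assumed here for illustration.
    """
    hits = sum(box_iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(predicted)


predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
reference = [(0, 0, 2, 2), (1, 1, 3, 3)]
accuracy = localization_accuracy(predicted, reference)
```

Whichever metric is used, the revised abstract would also need to say whether 39.3% is a relative gain over the baseline or an absolute gain in percentage points, since the two readings differ substantially.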
Circularity Check
No significant circularity; empirical gains rest on external dataset evaluation
The paper's central claim—an empirical 39.3% improvement in lesion localization—is obtained from experiments on liver, breast, and thyroid datasets rather than by algebraic reduction to the reward definition. The uncertainty-aware reward is introduced as a design element inside GRPO using stochastic group-wise rollouts; while this creates an internal consistency signal, the paper does not equate the final performance metric to that signal by construction. No equations are shown that force the localization gain from the reward alone, no self-citation chain carries the uniqueness of the paradigm, and no fitted parameter is relabeled as a prediction. The derivation therefore remains self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Ultrasound annotations contain inherent subjectivity and ambiguity that standard VLMs fail to model
invented entities (1)
- Uncertainty-aware reward from stochastic group-wise rollouts: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.