Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Erxuan Wu; Hongjoo Lee; Huixiong Xu; Yikang Sun; Yuan Bi; Yue Zhou; Zhongliang Jiang

arxiv: 2605.21652 · v1 · pith:PI2XRGYYnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Zhongliang Jiang

Pith reviewed 2026-05-22 09:10 UTC · model grok-4.3

keywords ultrasound visual question answering · active zooming · confidence-aware VQA · vision-language models · lesion localization · Group Relative Policy Optimization · uncertainty-aware reward

The pith

A Zoom-then-Diagnose framework lets ultrasound vision-language models actively focus on lesions before answering questions by using consistency across rollouts as a confidence signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to make vision-language models for ultrasound visual question answering follow the way sonographers work: they first search for and zoom on a lesion, then form a diagnosis. It does this by adding a structured Zoom-then-Diagnose reasoning step and an uncertainty-aware reward inside Group Relative Policy Optimization. The reward comes from running several stochastic rollouts in a group and measuring how consistent the answers are, treating high consistency as a sign the model should trust its output. Experiments on liver, breast, and thyroid ultrasound datasets show a 39.3 percent gain in lesion localization, indicating the model has learned to look closer when needed and stay cautious when cases are ambiguous.
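As a rough illustration of the consistency signal described above, the sketch below scores one group of sampled answers by majority agreement and normalized entropy. The function name, both scores, and the example answers are assumptions made for illustration; this summary does not give the paper's exact reward formula.

```python
import math
from collections import Counter


def group_consistency(answers):
    """Summarize how much a group of sampled answers agree with each other.

    Returns (majority_answer, agreement, normalized_entropy), where agreement
    is the fraction of rollouts that match the most common answer.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    counts = Counter(answers)
    majority, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    # Shannon entropy of the empirical answer distribution.
    probs = [c / len(answers) for c in counts.values()]
    entropy = sum(-p * math.log(p) for p in probs)
    # Scale by the maximum entropy for the number of distinct answers seen.
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return majority, agreement, entropy / max_entropy


# Hypothetical rollout groups: a clear case and an ambiguous one.
clear = group_consistency(["benign"] * 8)
ambiguous = group_consistency(["benign", "malignant"] * 4)
```

In this toy setup the clear group has full agreement and zero entropy, while the evenly split group has agreement 0.5 and maximal normalized entropy. Either score could serve as a confidence proxy; which one the paper uses is not stated here.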

Core claim

The central claim is that a Zoom-then-Diagnose paradigm together with an uncertainty-aware reward computed from stochastic group-wise rollouts inside the GRPO framework teaches the model to actively zoom into lesion regions before diagnosis and to reinforce correct answers on clear cases while remaining cautious on ambiguous ones.

What carries the argument

Zoom-then-Diagnose paradigm combined with an uncertainty-aware reward derived from stochastic group-wise rollouts inside Group Relative Policy Optimization
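For readers unfamiliar with GRPO, the sketch below shows the group-relative step in its commonly described form: each rollout's reward is standardized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions; how the paper combines its uncertainty-aware term with other rewards is not specified in this summary.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group.

    Assumption: population standard deviation plus a small eps, so a group
    with identical rewards yields all-zero advantages instead of dividing by 0.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for one group of four rollouts.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

A group whose rollouts all receive the same reward produces no learning signal under this scheme, which is one reason the choice of reward inside the group matters.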

If this is right

Lesion localization improves by 39.3 percent across the liver, breast, and thyroid ultrasound datasets.
High-consistency predictions on clear cases are reinforced while predictions on ambiguous cases trigger caution.
The approach aligns VLM reasoning more closely with the clinical practice of focusing on lesions before reporting.
No external labeled validation set is required to estimate when the model should be confident.
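The lesion-focused step in the list above can be pictured as a crop around a predicted lesion box. The sketch below is a minimal illustration, assuming pixel-coordinate boxes in (x0, y0, x1, y1) order and a hypothetical relative margin; the paper's actual zoom mechanism is not described in this summary.

```python
import math


def zoom_box(box, image_size, margin=0.25):
    """Expand a lesion box by a relative margin and clamp it to the image.

    box is (x0, y0, x1, y1) in pixels; image_size is (width, height).
    Returns integer (left, top, right, bottom) crop coordinates.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    if not (x0 < x1 and y0 < y1):
        raise ValueError("box must have positive width and height")
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    # Round outward so the crop never cuts into the padded region.
    left = max(0, math.floor(x0 - pad_x))
    top = max(0, math.floor(y0 - pad_y))
    right = min(width, math.ceil(x1 + pad_x))
    bottom = min(height, math.ceil(y1 + pad_y))
    return left, top, right, bottom


# Hypothetical lesion box inside a 640x480 frame.
crop = zoom_box((100, 100, 200, 200), (640, 480))
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to an image library directly.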

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency-based reward could be tested on other medical imaging modalities that support interactive cropping or magnification.
Modeling ambiguity internally might let the system flag uncertain answers for human review rather than forcing a diagnosis.
If the reward structure works outside ultrasound, it could be tried on general visual question answering tasks that contain visual ambiguity.

Load-bearing premise

The uncertainty-aware reward from stochastic group-wise rollouts serves as a reliable proxy for model confidence that can separate clear cases from ambiguous ones without any external validation data.

What would settle it

Measure lesion localization accuracy on a fresh set of ultrasound images from the same organs and check whether the reported 39.3 percent gain appears and whether the rollout consistency score actually tracks human judgments of case ambiguity.
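One way to run the second half of this check is a rank correlation between the model's rollout consistency and a human ambiguity rating per case. The sketch below implements Spearman's coefficient in plain Python; the scores are invented for illustration and are not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-case scores: model consistency vs. rated ambiguity.
consistency = [0.9, 0.5, 0.7, 0.3]
ambiguity = [0.1, 0.6, 0.3, 0.8]
rho = spearman(consistency, ambiguity)
```

A strongly negative coefficient would support the proxy (high consistency on cases humans find unambiguous); a coefficient near zero would suggest the score mostly tracks rollout noise.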

Figures

Figures reproduced from arXiv: 2605.21652 by Erxuan Wu, Hongjoo Lee, Huixiong Xu, Yikang Sun, Yuan Bi, Yue Zhou, Zhongliang Jiang.

**Figure 1.** Our approach mimics sonographer cognitively. It employs a zoom-in mechanism to simulate sonographers’ localized visualization. Moreover, it reflects their consensus: high consistency in easy cases (where sonographers agree) and appropriate ambiguity in hard cases (where sonographers disagree). These characteristics suggest two key requirements for ultrasound VLMs: lesion-centric structured reasoning and am…

**Figure 2.** Overview of the framework. Top: The Zoom-then-Diagnose paradigm, realized by constructing a supporting dataset (Sec. 2.1) and performing supervised finetuning to instill structured zoom-then-diagnosis reasoning (Sec. 2.2). Bottom: We apply GRPO with a consistency-based uncertainty alignment reward (Sec. 2.3). 2.2 Supervised Finetuning with Zoom-then-Diagnose Reasoning As illustrated in the upper part of …

**Figure 3.** Qualitative result. Unlike MedVLM-R1’s uniform predictions, our model reflects clinical ambiguity via explicit reasoning and inconsistent sampling results. remains consistently confident on this ambiguous case, whereas our model shows calibrated variability across rollouts, better aligning with the subjective nature of ultrasound interpretation. Third, many baselines exhibit near-zero or negative Entropy G…

Original abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper pairs a Zoom-then-Diagnose workflow with GRPO uncertainty rewards from internal rollouts to handle subjectivity in ultrasound VQA, but the 39.3% localization claim rests on minimal reported evidence.

The main point is a practical framing for ultrasound VQA that tries to copy how sonographers zoom on lesions first, then adds a consistency signal from the model's own group rollouts inside GRPO to dial down answers on ambiguous cases. The abstract positions this as fixing two gaps in current VLMs: lack of explicit zooming and treating all annotations as firm ground truth. That combination is the clearest new element here, and it lines up with real clinical workflow in a modality where images are noisy and interpretations vary.

The paper earns credit for naming the subjectivity problem directly instead of ignoring it. The Zoom-then-Diagnose structure gives the model a staged reasoning path that feels more grounded than flat VQA setups.

On the soft spots, the headline result gets almost no backing in the provided text. No baselines, no dataset sizes, no measurement details, and no statistical tests appear for the 39.3% localization lift across liver, breast, and thyroid data. The uncertainty reward is defined only from the model's stochastic outputs on the same inputs, so it risks rewarding internal consistency rather than actual diagnostic reliability. Without an external anchor such as multi-rater sonographer scores or an ambiguity benchmark, it is hard to tell whether the proxy separates clear from unclear cases or just tracks rollout noise. This is a load-bearing assumption for the causal story.

The work is aimed at researchers building medical VLMs who want ideas for clinical mimicry and uncertainty handling. A reader already working on ultrasound or noisy imaging might pick up the paradigm even if the numbers need more checking. It deserves a serious referee because the ideas engage a genuine gap and the method is specific enough to evaluate, though the experiments will need substantial expansion and external validation to hold up.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a Zoom-then-Diagnose paradigm for ultrasound VQA that replicates sonographers' interactive lesion-focused workflow, combined with an uncertainty-aware reward inside the Group Relative Policy Optimization (GRPO) framework. The reward is derived from consistency across stochastic group-wise rollouts and is intended to reinforce predictions on clear cases while promoting caution under ambiguity. Experiments on liver, breast, and thyroid datasets are reported to yield a 39.3% improvement in lesion localization.

Significance. If the localization gains prove reproducible and the internal consistency proxy is shown to track actual diagnostic ambiguity, the framework could advance medical VLMs by explicitly incorporating active zooming and confidence calibration, addressing subjectivity in ultrasound interpretation.

major comments (2)

Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.
GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.

minor comments (1)

The manuscript would benefit from explicit definitions of rollout parameters (temperature, group size, number of samples) used to compute the uncertainty reward, as these choices directly affect the proxy's behavior.
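To see why these parameters matter, the toy simulation below samples rollout groups from a fixed softmax over answer options and reports the average majority fraction. The logits, temperatures, group sizes, and function names are all hypothetical; the point is only that the same underlying preferences can look confident or uncertain depending on sampling settings.

```python
import math
import random
from collections import Counter


def sample_agreement(logits, temperature, group_size, rng):
    """Sample one rollout group from a softmax and return its majority fraction."""
    if temperature <= 0 or group_size < 1:
        raise ValueError("temperature must be > 0 and group_size >= 1")
    peak = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp((z - peak) / temperature) for z in logits]
    picks = rng.choices(range(len(logits)), weights=weights, k=group_size)
    return Counter(picks).most_common(1)[0][1] / group_size


def mean_agreement(logits, temperature, group_size, trials=500, seed=0):
    """Average the majority fraction over many simulated groups."""
    rng = random.Random(seed)
    total = sum(
        sample_agreement(logits, temperature, group_size, rng)
        for _ in range(trials)
    )
    return total / trials


# Hypothetical answer logits for a three-option question.
logits = [2.0, 1.0, 0.0]
sharp = mean_agreement(logits, temperature=0.1, group_size=8)
diffuse = mean_agreement(logits, temperature=5.0, group_size=8)
```

Low temperature pushes agreement toward 1 regardless of how ambiguous the case is, and a group of size 1 always reports full agreement, so the reported consistency is only interpretable alongside these settings.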

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: Abstract: the headline claim of a 39.3% lesion-localization improvement supplies no experimental details on baselines, dataset sizes, statistical tests, or measurement definitions, rendering the central empirical result unverifiable from the provided information.

Authors: We agree that the abstract, constrained by length, omits key experimental details. In the revised manuscript we will expand the abstract to briefly note the baselines (standard VLMs without the Zoom-then-Diagnose paradigm), the dataset sizes for the liver, breast, and thyroid ultrasound VQA collections, the lesion-localization metric (improvement in IoU-based localization accuracy), and a reference to the statistical significance tests already reported in the experimental section.
Revision: yes

Referee: GRPO uncertainty reward (described in the abstract and method): the reward is computed exclusively from the model's own stochastic group-wise rollouts on the same inputs, creating a self-referential loop in which rollout consistency serves as both the optimization signal and the target; no external anchor such as multi-rater sonographer labels or inter-observer variability scores is described to confirm that the proxy distinguishes clear from ambiguous cases rather than rollout noise.

Authors: The uncertainty-aware reward is deliberately constructed from internal consistency across stochastic group-wise rollouts to serve as a training-time proxy for model confidence, allowing the framework to reinforce accurate predictions on clear cases while promoting caution under ambiguity without requiring additional external annotations. This design choice intentionally creates the self-referential aspect noted by the referee. We acknowledge that the absence of multi-rater sonographer labels or inter-observer variability scores leaves open the possibility that the proxy partly captures rollout noise rather than true diagnostic ambiguity. In the revision we will add a dedicated paragraph in the discussion section that explicitly states this limitation, provides further analysis of rollout variance on selected ambiguous examples, and outlines future work that could incorporate external anchors.
Revision: partial
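The first response refers to an IoU-based localization accuracy without defining it. A common reading, sketched below, counts a prediction as a hit when its intersection-over-union with the ground-truth box reaches a threshold. The (x0, y0, x1, y1) box format, the 0.5 threshold, and the function names are assumptions, not details taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the paired truth >= threshold."""
    if not ground_truths or len(predictions) != len(ground_truths):
        raise ValueError("need equal-length, non-empty box lists")
    hits = sum(
        box_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical predictions: one exact match, one half-shifted box.
preds = [(0, 0, 10, 10), (5, 0, 15, 10)]
truths = [(0, 0, 10, 10), (0, 0, 10, 10)]
accuracy = localization_accuracy(preds, truths)
```

Whether the reported 39.3% is an absolute or relative change in such a metric is exactly the kind of measurement definition the referee asks for.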

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external dataset evaluation

The paper's central claim—an empirical 39.3% improvement in lesion localization—is obtained from experiments on liver, breast, and thyroid datasets rather than by algebraic reduction to the reward definition. The uncertainty-aware reward is introduced as a design element inside GRPO using stochastic group-wise rollouts; while this creates an internal consistency signal, the paper does not equate the final performance metric to that signal by construction. No equations are shown that force the localization gain from the reward alone, no self-citation chain carries the uniqueness of the paradigm, and no fitted parameter is relabeled as a prediction. The derivation therefore remains self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that sonographer zooming behavior can be replicated by a VLM and that group rollout consistency is a valid confidence proxy; no free parameters or new physical entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Ultrasound annotations contain inherent subjectivity and ambiguity that standard VLMs fail to model.
Explicitly stated in the abstract as a limitation of existing models.

invented entities (1)

Uncertainty-aware reward from stochastic group-wise rollouts (no independent evidence)
Purpose: proxy for model confidence to reinforce accurate predictions on clear cases and caution on ambiguous ones.
Introduced inside the GRPO framework as the key mechanism for confidence awareness.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 13 internal anchors

[1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

[2] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

[3] Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: GRIT: Teaching MLLMs to think with images. arXiv preprint arXiv:2505.15879 (2025)

[4] Gong, H., Chen, J., Chen, G., Li, H., Li, G., Chen, F.: Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Computers in Biology and Medicine 155, 106389 (2023). https://doi.org/10.1016/j.compbiomed.2022.106389, https://www.sciencedirect.com/science/article/pii/S0010482522010976

[5] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[6] Guo, X., Alsharid, M., Zhao, H., Wang, Y., Lander, J., Papageorghiou, A.T., Noble, J.A.: A visually grounded language model for fetal ultrasound understanding. Nature Biomedical Engineering (2026). https://doi.org/10.1038/s41551-025-01578-3

[7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[8] Huang, Y., Song, J., Wang, Z., Zhao, S., Chen, H., Juefei-Xu, F., Ma, L.: Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236 (2023)

[9] Jiang, Z., Salcudean, S.E., Navab, N.: Robotic ultrasound imaging: State-of-the-art and future perspectives. Medical Image Analysis 89, 102878 (2023)

[10] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al.: Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022)

[11] Le, A., Liu, H., Wang, Y., Liu, Z., Zhu, R., Weng, T., Yu, J., Wang, B., Wu, Y., Yan, K., et al.: U2-Bench: Benchmarking large vision-language models on ultrasound understanding. arXiv preprint arXiv:2505.17779 (2025)

[12] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

[13] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

[14] Li, X., Navab, N., Jiang, Z.: Speckle2Self: Self-supervised ultrasound speckle reduction without clean data. Medical Image Analysis p. 103755 (2025)

[15] Liu, C., Li, D., Shu, Y., Chen, R., Duan, D., Fang, T., Dai, B.: Fleming-R1: Toward expert-level medical reasoning via reinforcement learning. arXiv preprint arXiv:2509.15279 (2025)

[16] Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H.B., Chen, C., Ouyang, C., Rueckert, D.: MedVLM-R1: Incentivizing medical reasoning capability of vision-language models (VLMs) via reinforcement learning. arXiv preprint arXiv:2502.19634 (2025)

[17] Rui, S., Chen, K., Ma, W., Wang, X.: Improving medical reasoning with curriculum-aware reinforcement learning. arXiv preprint arXiv:2505.19213 (2025)

[18] She, C., Lu, R., Chen, L., Wang, W., Huang, Q.: EchoVLM: Dynamic mixture-of-experts vision-language model for universal ultrasound intelligence. arXiv preprint arXiv:2509.14977 (2025)

[19] Stangel, P., Bani-Harouni, D., Pellegrini, C., Özsoy, E., Zaripova, K., Keicher, M., Navab, N.: Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models. arXiv preprint arXiv:2503.02623 (2025)

[20] Wang, C.: Calibration in deep learning: A survey of the state-of-the-art. arXiv preprint arXiv:2308.01222 (2023)

[21] Wang, H., Liu, C., Xi, N., Qiang, Z., Zhao, S., Qin, B., Liu, T.: Huatuo: Tuning LLaMA model with Chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023)

[22] Wang, H., Su, A., Ren, W., Lin, F., Chen, W.: Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966 (2025)

[23] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)

[24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

[25] Weng, T., Hu, K., Liu, H., Liu, S., Liu, X., Liu, Z., Ren, J., Wang, B., Wang, B., Wang, Y., et al.: Dolphin v1.0 technical report. arXiv preprint arXiv:2509.25748 (2025)

[26] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., Hooi, B.: Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063 (2023)

[27] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[28] Yin, Z., Sun, Q., Guo, Q., Wu, J., Qiu, X., Huang, X.: Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153 (2023)

[29] Yu, H., Li, Y., Niu, Z., Zhang, N., Gong, X., Li, H., Zou, Z., Qi, H., Cao, Z., Lan, Z., et al.: A chain-of-thought reasoning breast ultrasound dataset covering all histopathology categories. Scientific Data (2026)

[30] Zelikman, E., Wu, Y., Mu, J., Goodman, N.: STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, 15476–15488 (2022)

[31] Zhang, X., Gao, Z., Zhang, B., Li, P., Zhang, X., Liu, Y., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., et al.: Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv preprint arXiv:2505.15436 (2025)

[32] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)

[33] Zhou, Z., Jin, T., Shi, J., Li, Q.: SteerConf: Steering LLMs for confidence elicitation. arXiv preprint arXiv:2503.02863 (2025)

[1] [1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

GRIT: Teaching MLLMs to Think with Images

Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Com- puters in Biology and Medicine155, 106389 (2023).https://doi.org/https: //doi.org/10.1016/j.compbiomed.2022.106389,https://www.sciencedirect

Gong, H., Chen, J., Chen, G., Li, H., Li, G., Chen, F.: Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Com- puters in Biology and Medicine155, 106389 (2023).https://doi.org/https: //doi.org/10.1016/j.compbiomed.2022.106389,https://www.sciencedirect. com/science/article/pii/S0010482522010976

work page doi:10.1016/j.compbiomed.2022.106389 2023

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Nature Biomedical Engineering (2026).https://doi.org/10.1038/ s41551-025-01578-3

Guo, X., Alsharid, M., Zhao, H., Wang, Y., Lander, J., Papageorghiou, A.T., Noble, J.A.: A visually grounded language model for fetal ultrasound under- standing. Nature Biomedical Engineering (2026).https://doi.org/10.1038/ s41551-025-01578-3

work page 2026

[7] [7]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

work page 2022

[8] [8]

Look before you leap: An exploratory study of uncertainty measurement for large language models.CoRR, abs/2307.10236, 2023

Huang, Y., Song, J., Wang, Z., Zhao, S., Chen, H., Juefei-Xu, F., Ma, L.: Look be- fore you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236 (2023)

work page arXiv 2023

[9] [9]

Medical image analysis89, 102878 (2023)

Jiang, Z., Salcudean, S.E., Navab, N.: Robotic ultrasound imaging: State-of-the-art and future perspectives. Medical image analysis89, 102878 (2023)

work page 2023

[10] [10]

Language Models (Mostly) Know What They Know

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al.: Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

arXiv preprint arXiv:2505.17779 (2025)

Le,A.,Liu,H.,Wang,Y.,Liu,Z.,Zhu,R.,Weng,T.,Yu,J.,Wang,B.,Wu,Y.,Yan, K., et al.: U2-bench: Benchmarking large vision-language models on ultrasound understanding. arXiv preprint arXiv:2505.17779 (2025)

work page arXiv 2025

[12] [12]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Advances in Neural Information Processing Systems36, 28541–28564 (2023) 10 Yue Zhou et al

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023) 10 Yue Zhou et al

work page 2023

[14] [14]

Medical image analysis p

Li, X., Navab, N., Jiang, Z.: Speckle2self: Self-supervised ultrasound speckle re- duction without clean data. Medical image analysis p. 103755 (2025)

work page 2025

[15] [15]

arXiv preprint arXiv:2509.15279 (2025)

Liu, C., Li, D., Shu, Y., Chen, R., Duan, D., Fang, T., Dai, B.: Fleming-r1: To- ward expert-level medical reasoning via reinforcement learning. arXiv preprint arXiv:2509.15279 (2025)

work page arXiv 2025

[16] [16]

Advances in neural in- formation processing systems, 35:27730–27744

Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H.B., Chen, C., Ouyang, C., Rueck- ert, D.: Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. arXiv preprint arXiv:2502.19634 (2025)

work page arXiv 2025

[17] [17]

arXiv preprint arXiv:2505.19213 (2025)

Rui, S., Chen, K., Ma, W., Wang, X.: Improving medical reasoning with curriculum-aware reinforcement learning. arXiv preprint arXiv:2505.19213 (2025)

work page arXiv 2025

[18] [18]

EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

She, C., Lu, R., Chen, L., Wang, W., Huang, Q.: Echovlm: Dynamic mixture-of- experts vision-language model for universal ultrasound intelligence. arXiv preprint arXiv:2509.14977 (2025)

[19]

Stangel, P., Bani-Harouni, D., Pellegrini, C., Özsoy, E., Zaripova, K., Keicher, M., Navab, N.: Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models. arXiv preprint arXiv:2503.02623 (2025)

[20]

Wang, C.: Calibration in deep learning: A survey of the state-of-the-art. arXiv preprint arXiv:2308.01222 (2023)

[21]

Wang, H., Liu, C., Xi, N., Qiang, Z., Zhao, S., Qin, B., Liu, T.: Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023)

[22]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Wang, H., Su, A., Ren, W., Lin, F., Chen, W.: Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966 (2025)

[23]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)

[24]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

[25]

Weng, T., Hu, K., Liu, H., Liu, S., Liu, X., Liu, Z., Ren, J., Wang, B., Wang, B., Wang, Y., et al.: Dolphin v1.0 technical report. arXiv preprint arXiv:2509.25748 (2025)

[26]

Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., Hooi, B.: Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063 (2023)

[27]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[28]

Yin, Z., Sun, Q., Guo, Q., Wu, J., Qiu, X., Huang, X.: Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153 (2023)

[29]

Yu, H., Li, Y., Niu, Z., Zhang, N., Gong, X., Li, H., Zou, Z., Qi, H., Cao, Z., Lan, Z., et al.: A chain-of-thought reasoning breast ultrasound dataset covering all histopathology categories. Scientific Data (2026)

[30]

Zelikman, E., Wu, Y., Mu, J., Goodman, N.: Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, 15476–15488 (2022)

[31]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Zhang, X., Gao, Z., Zhang, B., Li, P., Zhang, X., Liu, Y., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., et al.: Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436 (2025)

[32]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)

[33]

Zhou, Z., Jin, T., Shi, J., Li, Q.: Steerconf: Steering llms for confidence elicitation. arXiv preprint arXiv:2503.02863 (2025)
