pith. machine review for the scientific record.

arxiv: 2604.09757 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI

MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical visual question answering · latent visual reasoning · vision-language models · autoregressive decoding · ROI supervision · policy optimization · visual evidence preservation · medical imaging AI

The pith

MedLVR adds short latent visual reasoning segments to medical VQA models to keep subtle diagnostic image details active during answer generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MedLVR, a framework that interleaves brief segments of latent visual reasoning into the autoregressive decoder of vision-language models used for medical visual question answering. Standard models encode the image once as static context and then shift to text-dominated reasoning, which tends to lose localized visual cues essential for accurate clinical answers. MedLVR reuses decoder hidden states to create continuous latent steps that iteratively preserve and refine query-relevant visual evidence before the final answer is produced. Training proceeds in two stages: region-of-interest supervised fine-tuning followed by Visual-Latent Policy Optimization, which optimizes latent reasoning and answer generation under outcome-level rewards. A sympathetic reader would care because improved preservation of faint visual signals could make AI assistants more trustworthy in settings where missing a small lesion or pattern leads to a wrong diagnosis.

Core claim

MedLVR introduces an explicit visual evidence state into autoregressive decoding by interleaving short latent reasoning segments formed from reused decoder hidden states. These segments enable iterative preservation and refinement of query-relevant visual evidence. The method is trained first with ROI-supervised fine-tuning to align latent states to clinically relevant image regions and then with Visual-Latent Policy Optimization under outcome-level rewards. Experiments show the approach raises average performance from 48.3% to 53.4% over the Qwen2.5-VL-7B backbone on OmniMedVQA and five other medical VQA benchmarks.

What carries the argument

The latent visual reasoning segment: short continuous latent steps created by reusing decoder hidden states that carry and iteratively refine visual evidence across decoding steps.
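
To make the mechanism concrete, here is a minimal sketch of the hidden-state-reuse idea: at designated steps the decoder's final hidden state is projected and fed back as a continuous input instead of a token embedding. The toy architecture, the `latent_proj` layer, and the segment placement are illustrative assumptions; MedLVR's exact design is not specified in the material above, and causal masking is omitted for brevity.

```python
# Hedged sketch of interleaving continuous latent steps into decoding by
# reusing hidden states. Illustrative only; not the paper's implementation.
import torch
import torch.nn as nn

class TinyLatentDecoder(nn.Module):
    def __init__(self, vocab=100, d=64, n_latent=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.latent_proj = nn.Linear(d, d)  # assumed map: hidden state -> next latent input
        self.lm_head = nn.Linear(d, vocab)
        self.n_latent = n_latent            # length of the latent reasoning segment

    def forward(self, token_ids, visual_ctx):
        # visual_ctx: (B, n_img, d) image features kept as prefix context
        x = torch.cat([visual_ctx, self.embed(token_ids)], dim=1)
        # Latent segment: for n_latent steps, append the projected final hidden
        # state as a continuous input instead of sampling and embedding a token.
        for _ in range(self.n_latent):
            h = self.blocks(x)                       # (B, T, d)
            latent = self.latent_proj(h[:, -1:, :])  # reuse the last hidden state
            x = torch.cat([x, latent], dim=1)
        h = self.blocks(x)
        return self.lm_head(h[:, -1, :])             # logits for the next answer token

model = TinyLatentDecoder()
logits = model(torch.randint(0, 100, (2, 5)), torch.randn(2, 3, 64))
print(logits.shape)  # torch.Size([2, 100])
```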

If this is right

  • Medical VQA models can maintain diagnostically relevant visual information throughout text generation instead of discarding it after the initial image encoding.
  • ROI-supervised fine-tuning aligns the reused hidden states with clinically meaningful image regions.
  • Visual-Latent Policy Optimization jointly improves the quality of the latent reasoning and the final generated answers under outcome rewards (both training signals are sketched after this list).
  • The same gains appear consistently across OmniMedVQA and five additional external medical VQA datasets.
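
A hedged sketch of the two training signals named in the list above. The loss forms are not given in the source material, so Stage 1 is rendered here as a cosine alignment between latent states and pooled ROI features, and Stage 2 as plain REINFORCE with a batch-mean baseline standing in for VLPO; both are assumptions, not the paper's equations.

```python
# Illustrative stand-ins for the two-stage training signals; not MedLVR's losses.
import torch
import torch.nn.functional as F

def roi_alignment_loss(latent_states, roi_feats):
    """Stage 1 stand-in: pull each latent step toward pooled ROI features.
    latent_states: (B, n_latent, d); roi_feats: (B, n_roi, d)."""
    target = roi_feats.mean(dim=1, keepdim=True)               # (B, 1, d)
    cos = F.cosine_similarity(latent_states, target, dim=-1)   # (B, n_latent)
    return (1.0 - cos).mean()

def outcome_policy_loss(answer_logprobs, rewards):
    """Stage 2 stand-in: REINFORCE with a batch-mean baseline.
    answer_logprobs: (B,) summed log-probability of each sampled answer;
    rewards: (B,) outcome-level reward, e.g. 1.0 if the answer is correct."""
    advantage = (rewards - rewards.mean()).detach()
    return -(advantage * answer_logprobs).mean()

B, n_latent, d = 4, 4, 64
print(roi_alignment_loss(torch.randn(B, n_latent, d), torch.randn(B, 3, d)))
print(outcome_policy_loss(torch.randn(B), torch.tensor([1.0, 0.0, 1.0, 1.0])))
```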

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reuse of hidden states for latent visual steps could be tested on non-medical vision-language tasks that require tracking fine visual details over long outputs.
  • Varying the number or duration of latent segments might reveal an optimal balance between visual preservation and computational cost.
  • Combining this mechanism with other forms of visual supervision could address additional sources of error in diagnostic image interpretation.

Load-bearing premise

Reusing decoder hidden states as short continuous latent reasoning segments will reliably preserve and refine query-relevant visual evidence rather than adding noise or redundant computation.

What would settle it

An ablation that removes the latent reasoning segments entirely and measures no drop (or even a gain) in accuracy on the same medical VQA benchmarks would show the added visual state is not responsible for the reported gains; conversely, a clear drop under that ablation would support the claimed mechanism.
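
A sketch of how such a settling run could be wired: the same checkpoint evaluated with the latent segment active, replaced by random vectors, and removed. The `evaluate` harness, the `latent_mode` flag, and the toy model are hypothetical stand-ins for whatever interface a released implementation would expose.

```python
# Hypothetical ablation harness; the model and flag names are assumptions.
import random

def evaluate(model_fn, dataset, latent_mode):
    """Accuracy of one decoding condition over a VQA set."""
    correct = sum(model_fn(img, q, latent_mode=latent_mode) == ans
                  for img, q, ans in dataset)
    return correct / len(dataset)

def toy_model(image, question, latent_mode):
    # Stand-in for a trained checkpoint whose decoder can run with the latent
    # segment active, replaced by random vectors, or skipped entirely.
    return random.choice(["A", "B"])

dataset = [(None, f"q{i}", "A") for i in range(200)]
for mode in ("latent", "random", "none"):
    print(f"{mode:>6}: {evaluate(toy_model, dataset, mode):.1%}")
# If 'none' (and 'random') match 'latent', the added visual state is not
# responsible for the gains; a clear drop would support the mechanism.
```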

read the original abstract

Medical vision-language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MedLVR, a latent visual reasoning framework for medical VQA that interleaves short continuous latent reasoning segments (reused decoder hidden states) into autoregressive decoding to iteratively preserve and refine query-relevant visual evidence, rather than relying on static image embeddings and text-centric reasoning. It employs a two-stage training process consisting of ROI-supervised fine-tuning to align latent states with clinically relevant regions followed by Visual-Latent Policy Optimization (VLPO) using outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks report consistent gains, raising average performance from 48.3% to 53.4% over the Qwen2.5-VL-7B backbone.

Significance. If the central mechanism is verified to actively preserve diagnostically relevant visual information rather than arising from supervision or added capacity alone, the approach could meaningfully improve reliability in clinical VQA where subtle localized evidence is critical. The reported gains are modest but consistent across benchmarks; however, the significance hinges on demonstrating that the latent states function as visual evidence carriers, which is not yet established by the provided details.

major comments (3)
  1. [Methods] Methods (latent reasoning segment definition): Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description provides no isolation from the ROI supervision signal or extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.
  2. [Experiments] Experiments (results and ablations): The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than the latent visual reasoning component.
  3. [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.
minor comments (2)
  1. [Abstract / Methods] The abstract and methods would benefit from an explicit equation or diagram defining how the latent reasoning segment is inserted into the decoder hidden-state sequence and how it interacts with the visual encoder output.
  2. [Experiments] Table or figure captions for benchmark results should include the exact number of test samples per dataset and the precise backbone configuration used for the 48.3% baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the evidence for the latent visual reasoning mechanism.

read point-by-point responses
  1. Referee: [Methods] Methods (latent reasoning segment definition): Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description provides no isolation from the ROI supervision signal or extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.

    Authors: We agree that explicit isolation ablations are needed to attribute gains specifically to the visual evidence preservation in latent states. In the revised manuscript, we will add controlled experiments that (1) replace latent reasoning segments with non-visual tokens (e.g., zero or random embeddings) while preserving architecture and training, and (2) disable ROI alignment during the first training stage while matching parameter counts and forward-pass capacity. These will directly test whether the iterative refinement arises from the claimed mechanism rather than supervision or added capacity. revision: yes

  2. Referee: [Experiments] Experiments (results and ablations): The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than the latent visual reasoning component.

    Authors: The referee is correct that additional statistical rigor and hyperparameter ablations would strengthen the claims. We will revise the experiments section to include error bars (standard deviation across multiple runs), paired statistical significance tests on the reported improvements, and targeted ablations varying latent segment length, ROI supervision weight, and VLPO reward coefficients. These additions will help isolate the contribution of the latent visual reasoning component from the two-stage training procedure as a whole (a placeholder sketch of such statistics follows this list). revision: yes

  3. Referee: [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.

    Authors: We acknowledge that direct probing would provide stronger verification that the reused decoder states function as visual evidence carriers. In the revised manuscript, we will add (1) attention-map visualizations contrasting MedLVR with the baseline, (2) feature reconstruction experiments measuring how well latent states recover query-relevant image regions, and (3) quantitative comparisons of latent states versus purely textual hidden states (e.g., via cosine similarity to visual features and query relevance metrics). These analyses will directly address whether the states preserve and refine visual information (a minimal similarity probe is sketched after this list). revision: yes
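
As a placeholder for the statistics promised in response 2: per-seed means with standard deviations and a paired test across benchmarks. Every number below is synthetic, not a result from the paper.

```python
# Synthetic illustration of error bars and a paired significance test.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Placeholder per-benchmark accuracies (3 seeds x 6 benchmarks).
baseline = rng.normal(48.3, 1.0, size=(3, 6))
medlvr = rng.normal(53.4, 1.0, size=(3, 6))

print(f"baseline: {baseline.mean():.1f} ± {baseline.std(ddof=1):.1f}")
print(f"MedLVR:   {medlvr.mean():.1f} ± {medlvr.std(ddof=1):.1f}")

# Paired test across benchmarks on seed-averaged scores.
t, p = ttest_rel(medlvr.mean(axis=0), baseline.mean(axis=0))
print(f"paired t = {t:.2f}, p = {p:.4f}")
```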
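
And for response 3, the latent-versus-textual comparison could start as simply as the probe below. The tensors are random placeholders for states extracted from a trained model, and best-matching patch cosine similarity is one assumed choice of metric.

```python
# Minimal similarity probe: do latent-segment states sit closer to visual
# features than ordinary textual hidden states do? Placeholders throughout.
import torch
import torch.nn.functional as F

def mean_best_visual_similarity(states, visual_feats):
    """states: (n_states, d); visual_feats: (n_patches, d).
    For each state, take its best-matching patch similarity, then average."""
    sims = F.cosine_similarity(states.unsqueeze(1), visual_feats.unsqueeze(0), dim=-1)
    return sims.max(dim=1).values.mean().item()

d = 64
visual_feats = torch.randn(16, d)   # placeholder visual encoder outputs
latent_states = torch.randn(4, d)   # placeholder latent-segment states
text_states = torch.randn(4, d)     # placeholder matched textual hidden states

print("latent vs visual:", mean_best_visual_similarity(latent_states, visual_feats))
print("text   vs visual:", mean_best_visual_similarity(text_states, visual_feats))
# Consistently higher similarity for latent states on a trained model would
# indicate they preferentially encode visual rather than generic information.
```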

Circularity Check

0 steps flagged

No circularity: empirical gains on held-out benchmarks after independent supervision

full rationale

The paper's derivation chain consists of a proposed architecture (reusing decoder hidden states as latent segments) trained in two stages with external ROI labels and outcome rewards, followed by evaluation on held-out benchmarks (OmniMedVQA and five others). No equations, self-citations, or ansatzes are shown that define the claimed preservation of visual evidence in terms of the same fitted quantities or that reduce the reported 48.3%-to-53.4% improvement to a definitional equivalence. The central claim is supported by external performance metrics rather than by construction from the training signals alone, satisfying the criteria for a non-circular argument.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The framework rests on standard autoregressive decoding assumptions plus two new training stages whose effectiveness is demonstrated only empirically; the latent reasoning segment is an invented mechanism whose benefit is measured solely by downstream accuracy.

free parameters (2)
  • latent reasoning segment length
    Short fixed or learned length of the interleaved latent steps; value not stated in abstract but required for the method to function.
  • ROI supervision weight and VLPO reward coefficients
    Hyperparameters that balance the two training stages; chosen to produce the reported gains.
axioms (2)
  • domain assumption: Hidden states from the vision-language decoder can serve as continuous, refinable representations of query-relevant visual evidence.
    Invoked when the paper states that reusing hidden states enables iterative preservation of visual evidence.
  • domain assumption: Outcome-level rewards in VLPO will improve both latent reasoning quality and final answer accuracy.
    Central to the second training stage.
invented entities (1)
  • latent reasoning segment / visual evidence state (no independent evidence)
    purpose: To maintain and refine diagnostically relevant image information inside the decoder before text answer generation.
    New construct introduced to address the static-embedding limitation; no independent falsifiable prediction (e.g., a measurable property of the state) is provided beyond accuracy improvement.

pith-pipeline@v0.9.0 · 5583 in / 1613 out tokens · 46598 ms · 2026-05-10T17:59:26.766421+00:00 · methodology

