Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning

Chunfeng Song; Jiamin Wu; Kaitao Chen; Mianxin Liu; Mu Zhou; Qihao Zheng; Shangquan Sun; Weiqian Zhao; Xiaosong Wang

arxiv: 2606.31599 · v1 · pith:2SZTSMHKnew · submitted 2026-06-30 · 💻 cs.CV · cs.AI

Pith reviewed 2026-07-01 05:34 UTC · model grok-4.3

keywords medical multimodal reasoning · visual token pruning · dual-stream reinforcement learning · vision-language models · cross-feedback optimization · token-sparse reasoning · medical benchmarks

The pith

Dual-stream reinforcement learning prunes visual tokens in medical VLMs to 77% of original length while exceeding baseline performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ViToS, a dual-stream RL framework that jointly trains a shared policy model for visual grounding and token-sparse medical reasoning. One stream performs grounding to identify the relevant image regions, while the other answers the question using only the pruned visual tokens. Cross-feedback sequential optimization is the mechanism the authors use to train both streams on one policy without gradient conflicts. On seven medical benchmarks, the method reduces visual tokens to 77% of the original sequence length and achieves 108.27% relative performance on Lingshu-7B and 104.16% on HuatuoGPT-Vision-7B. This matters because medical images contain sparse visual evidence, so efficient pruning can speed up inference without sacrificing clinical accuracy.

Core claim

ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after visual token pruning. The coupled policy learning problem is solved by introducing cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence. Evaluated on seven medical benchmarks, the method reduces visual tokens to 77% of the original sequence length while achieving a 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B.

What carries the argument

Dual-stream RL framework with cross-feedback sequential optimization on a shared policy model for visual token pruning (VTP) and question answering.
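The pruning step can be pictured with a minimal sketch. This is not the authors' implementation: it assumes a row-major grid of square patches, a grounding box given in pixel coordinates, and a plain mean as a stand-in for the paper's token fusion strategy. The function name `prune_tokens` and all parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def prune_tokens(
    tokens: Sequence[Sequence[float]],
    grid_w: int,
    patch: int,
    box: Box,
) -> Tuple[List[List[float]], List[int]]:
    """Keep patch tokens that overlap `box`; fuse the rest into one mean token.

    Assumptions (not from the paper): tokens are in row-major patch order,
    each patch covers a `patch` x `patch` pixel square, and discarded tokens
    are summarized by their element-wise mean.
    """
    x0, y0, x1, y1 = box
    kept: List[List[float]] = []
    kept_idx: List[int] = []
    dropped: List[Sequence[float]] = []
    for i, tok in enumerate(tokens):
        row, col = divmod(i, grid_w)
        px0, py0 = col * patch, row * patch
        px1, py1 = px0 + patch, py0 + patch
        # Strict inequalities: a patch that only touches the box edge is dropped.
        overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
        if overlaps:
            kept.append(list(tok))
            kept_idx.append(i)
        else:
            dropped.append(tok)
    if dropped:
        dim = len(dropped[0])
        fused = [sum(t[d] for t in dropped) / len(dropped) for d in range(dim)]
        kept.append(fused)  # one extra token carrying background context
    return kept, kept_idx


# Toy 4x4 grid of 1-d tokens with 14-pixel patches and a central box.
toy_tokens = [[float(i)] for i in range(16)]
toy_kept, toy_idx = prune_tokens(toy_tokens, grid_w=4, patch=14, box=(14, 14, 42, 42))
```

In this toy grid the call keeps the four central patch tokens plus one fused token out of sixteen. Real retention ratios depend on the predicted box size, which is why the 77% figure is an average over benchmarks rather than a fixed setting.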

If this is right

Reduces visual tokens to 77% of original sequence length.
Achieves superior performance and inference speedup on medical multimodal reasoning.
Delivers 108.27% relative performance on Lingshu-7B across benchmarks.
Delivers 104.16% relative performance on HuatuoGPT-Vision-7B.
Establishes an efficient paradigm for medical multimodal reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to non-medical VLMs where visual evidence is sparse, such as in document understanding or scientific image analysis.
The token reduction might enable deployment of medical reasoning models on resource-constrained devices.
Further work could explore combining this with other pruning techniques for even greater efficiency.

Load-bearing premise

A single shared policy model can handle both grounding and token-sparse reasoning through cross-feedback sequential optimization without causing gradient conflicts or losing critical clinical information.

What would settle it

Running the shared policy model on the seven medical benchmarks and observing either relative performance below 100% or a failure to converge caused by gradient conflicts would falsify the central claim.
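One common way to probe for gradient conflict is to measure the cosine similarity between the two branches' gradients on the shared parameters; a negative value means the updates pull in opposing directions. The sketch below illustrates that diagnostic only. The abstract does not say the paper reports it, and the inputs (flattened gradient vectors for the localization and reasoning branches) are hypothetical.

```python
import math
from typing import Sequence


def grad_cosine(g_a: Sequence[float], g_b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    if len(g_a) != len(g_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention used here: a zero gradient conflicts with nothing
    return dot / (norm_a * norm_b)


def in_conflict(g_a: Sequence[float], g_b: Sequence[float]) -> bool:
    """True when the two gradients point in opposing directions."""
    return grad_cosine(g_a, g_b) < 0.0
```

Logging this value over training steps, with and without the sequential schedule, would give a direct picture of whether conflict is actually reduced.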

Figures

Figures reproduced from arXiv: 2606.31599 by Chunfeng Song, Jiamin Wu, Kaitao Chen, Mianxin Liu, Mu Zhou, Qihao Zheng, Shangquan Sun, Weiqian Zhao, Xiaosong Wang.

**Figure 1.** Illustration of grounding-aware VTP. By focusing its reasoning on grounded tokens, the model correctly diagnoses scleroderma, consistent with the human expert assessment.

**Figure 2.** Impact of grounding-aware VTP and image cropping on performance across seven medical VQA benchmarks.

**Figure 3.** Overview of the proposed dual-stream RL framework with a unified policy model. The localization branch (top) identifies where to focus, and the token-sparse reasoning branch (bottom) focuses on how to reason over compressed tokens. The two branches are sequentially optimized through reciprocal cross-feedback rewards, where each branch provides reinforcement signals to guide the other.

**Figure 4.** Comparison of accuracy and IoU reward trends under settings without the IoU signal or without token fusion.

**Figure 6.** Ablation study on the training order. Branch-L and Branch-S correspond to training the localization branch or the token-sparse reasoning branch first, respectively.

**Figure 8.** Ablation study on token fusion in grounding-aware visual token pruning.

**Figure 9.** Test-time performance comparison under different ablation settings. We compare our full model with variants without the IoU reward or without token fusion across multiple medical benchmarks.

**Figure 10.** Prompt for localization branch.

**Figure 11.** Prompt for token-sparse reasoning branch. The prompt reads: "Determine the answer to the question. First provide an internal step-by-step reasoning within <think> </think> then provide a option letter in <answer> FINAL ANSWER </answer>"

**Figure 12.** Comparison between full token and sparse grounded token reasoning for mass diagnosis.

**Figure 13.** Comparison between full token and sparse grounded token reasoning for lungs diagnosis.

**Figure 14.** Our method enables accurate identification of shaken-baby syndrome by focusing on key retinal hemorrhages.

**Figure 15.** Our method directs attention to soft tissue calcification and joint deformities, enabling correct scleroderma diagnosis.
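Several captions above refer to an IoU reward for the localization branch. As a reminder of what that quantity measures, here is the standard intersection-over-union for two axis-aligned boxes. How the paper scales, thresholds, or combines this value with other reward terms is not stated on this page, so the function is only the textbook definition and `box_iou` is a hypothetical name.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    px0, py0, px1, py1 = pred
    tx0, ty0, tx1, ty1 = target
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(px1, tx1) - max(px0, tx0))
    inter_h = max(0.0, min(py1, ty1) - max(py0, ty0))
    inter = inter_w * inter_h
    area_p = max(0.0, px1 - px0) * max(0.0, py1 - py0)
    area_t = max(0.0, tx1 - tx0) * max(0.0, ty1 - ty0)
    union = area_p + area_t - inter
    return inter / union if union > 0.0 else 0.0
```

A predicted box that matches the annotation exactly scores 1.0 and a disjoint box scores 0.0, which makes the value usable as a dense reward signal without further normalization.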

Original abstract

Vision-language models (VLMs) combining reinforcement learning (RL) ignite remarkable progress in multimodal reasoning, yet still struggle with medical images, which typically exhibit extremely sparse visual evidence to inform clinical decision-making. We recognize that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a united RL framework for active visual token pruning (VTP) and medical multimodal reasoning remains unestablished. Here, we propose a dual-stream RL framework, ViToS, to fulfill token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing the cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77% of the original sequence length while achieving a 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ViToS proposes a dual-stream RL setup with cross-feedback to prune visual tokens in medical VLMs, but the abstract gives only relative gains without enough detail to judge the results.

The paper's main contribution is a dual-stream reinforcement learning framework called ViToS that trains one policy model on two branches: grounding and token-sparse reasoning after pruning. They use cross-feedback sequential optimization to manage the shared policy and avoid gradient conflicts. This targets the sparsity of useful visual evidence in medical images, which is a practical issue.

The approach is coherent on its own terms. Pruning tokens outside the grounding region to reach 77% of original length while claiming relative performance above 100% on Lingshu-7B and HuatuoGPT-Vision-7B across seven benchmarks is a direct attempt to improve efficiency without separate models for each task.

The soft spots are in the reported evidence. The abstract supplies only relative numbers and token reduction percentages, with no absolute scores, baseline details, error bars, ablations, or statistical tests. It is unclear whether the pruning step preserves clinically critical information or if the gains hold under stronger controls. The assumption that cross-feedback reliably prevents loss of signal in the shared policy needs verification from the methods and results sections.

This is for groups working on efficient medical vision-language models or RL-based token selection. Readers focused on practical deployment in clinical AI would get the most from the architecture if the experiments are solid.

Send it for peer review. The core construction is clear and the problem is real, so referees can check the full data and implementation.

Referee Report

2 major / 1 minor

Summary. The paper proposes ViToS, a dual-stream RL framework for medical multimodal reasoning in VLMs. It trains a single shared policy model with separate branches for visual grounding and token-sparse reasoning after active visual token pruning (VTP), using cross-feedback sequential optimization to manage the coupled tasks. On seven medical benchmarks the method is reported to reduce visual tokens to 77% of the original length while delivering relative performance of 108.27% on Lingshu-7B and 104.16% on HuatuoGPT-Vision-7B, together with inference speedup.

Significance. If the empirical claims are supported by complete, reproducible experimental evidence, the work would demonstrate a practical route to efficient medical VLM reasoning by discarding tokens outside clinically relevant regions, potentially improving both accuracy and speed in evidence-sparse domains.

major comments (2)

[Abstract] Abstract: the central performance claims rest on relative percentages (108.27% and 104.16%) and a 77% token-reduction figure, yet no absolute baseline scores, standard deviations, error bars, or statistical tests are supplied. Without these quantities the data cannot be assessed for support of the superiority claim.
[Method] Method section (cross-feedback sequential optimization): the shared-policy construction is presented as solving gradient conflict between grounding and reasoning branches, but no analysis, loss curves, or ablation isolating the cross-feedback mechanism is referenced to confirm absence of conflict or preservation of clinically critical information.
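The statistical reporting requested in the first major comment could look like the following minimal sketch. It shows a mean and sample standard deviation over repeated runs, plus a paired bootstrap confidence interval on the accuracy difference between a baseline and a pruned model. Both function names, the 0/1 per-question correctness encoding, and the default of 2000 resamples are assumptions made for illustration and are not taken from the paper.

```python
import random
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (needs >= 2 runs)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_ci(
    base_correct: Sequence[int],
    new_correct: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for accuracy(new) - accuracy(base) on the same questions.

    Inputs are per-question 0/1 correctness flags in matching order.
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(base_correct)
    diffs = []
    for _ in range(n_boot):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(new_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

An interval that excludes zero would support the superiority claim on that benchmark; an interval that straddles zero would suggest the relative gain is within noise.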

minor comments (1)

Clarify the precise definition and computation of 'relative performance' (e.g., relative to which base model and on which metric) so that the reported percentages can be reproduced.
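The ambiguity matters because at least two reasonable definitions exist and they give different numbers. The sketch below contrasts a mean of per-benchmark ratios with a ratio of benchmark means. Neither is confirmed as the paper's definition, the function names are hypothetical, and the scores in the example are made up purely to show the divergence.

```python
from typing import Dict


def mean_of_ratios(baseline: Dict[str, float], method: Dict[str, float]) -> float:
    """Average the per-benchmark ratio method/baseline, as a percentage."""
    ratios = [method[name] / baseline[name] for name in baseline]
    return 100.0 * sum(ratios) / len(ratios)


def ratio_of_means(baseline: Dict[str, float], method: Dict[str, float]) -> float:
    """Divide the summed method scores by the summed baseline scores, as a percentage."""
    names = list(baseline)
    return 100.0 * sum(method[n] for n in names) / sum(baseline[n] for n in names)


# Made-up scores on two hypothetical benchmarks, for illustration only.
example_baseline = {"bench_a": 50.0, "bench_b": 80.0}
example_method = {"bench_a": 55.0, "bench_b": 80.0}
example_mor = mean_of_ratios(example_baseline, example_method)
example_rom = ratio_of_means(example_baseline, example_method)
```

With these made-up scores the first definition gives about 105% and the second about 103.8%, so a reported figure such as 108.27% cannot be reproduced without knowing which aggregation and which metric were used.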

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract: the central performance claims rest on relative percentages (108.27% and 104.16%) and a 77% token-reduction figure, yet no absolute baseline scores, standard deviations, error bars, or statistical tests are supplied. Without these quantities the data cannot be assessed for support of the superiority claim.

Authors: We agree that absolute scores and statistical details would aid assessment. In the revised manuscript we will expand the abstract to report the absolute baseline and ViToS scores on all seven benchmarks together with standard deviations from repeated runs. revision: yes
Referee: [Method] Method section (cross-feedback sequential optimization): the shared-policy construction is presented as solving gradient conflict between grounding and reasoning branches, but no analysis, loss curves, or ablation isolating the cross-feedback mechanism is referenced to confirm absence of conflict or preservation of clinically critical information.

Authors: We acknowledge that explicit validation of the cross-feedback mechanism is currently absent. We will add an ablation comparing training with and without cross-feedback, together with loss curves and grounding-accuracy metrics, to demonstrate convergence behavior and retention of clinically relevant tokens. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical method (dual-stream RL with cross-feedback optimization for visual token pruning and reasoning) whose central claims are performance numbers obtained from evaluation on seven external medical benchmarks. No derivation, equation, or uniqueness theorem is shown that reduces by construction to fitted inputs, self-definitions, or a self-citation chain; the token-reduction and relative-performance figures are reported outcomes rather than predictions forced by the method's own parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level claim that pruning outside grounding regions enhances reasoning.

axioms (1)

domain assumption: Pruning visual tokens outside the grounding region greatly enhances medical reasoning
Presented as a recognized premise that motivates the token-sparse approach.

pith-pipeline@v0.9.1-grok · 5764 in / 1153 out tokens · 48576 ms · 2026-07-01T05:34:28.613176+00:00 · methodology

Reference graph

Works this paper leans on

63 extracted references · 34 canonical work pages · 17 internal anchors

[1] Data-Centric Foundation Models in Computational Healthcare: A Survey. arXiv preprint arXiv:2401.02458.
[2] Visual instruction tuning. Advances in Neural Information Processing Systems.
[3] Med-Flamingo: a multimodal medical few-shot learner. Machine Learning for Health.
[4] LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems.
[5] Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, et al. HuatuoGPT-Vision, towards injecting medical visual knowledge into multimodal LLMs at scale. arXiv preprint arXiv:2406.19280.
[6] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv preprint arXiv:2506.07044.
[7] R-LLaVA: Improving Med-VQA understanding through visual region of interest. International Joint Conference on Neural Networks, 2025.
[8] MedTrinity-25M: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900.
[9] Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668.
[10] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning. arXiv preprint arXiv:2505.18503.
[11] Adaptive mixtures of local experts. Neural Computation.
[12] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature.
[13] Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773.
[14] Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939.
[15] MedVLM-R1: Incentivizing medical reasoning capability of vision-language models (VLMs) via reinforcement learning. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2025.
[16] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs. arXiv preprint arXiv:2510.10052.
[17] MedGround-R1: Advancing medical image grounding via spatial-semantic rewarded group relative policy optimization. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2025.
[18] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
[19] DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems.
[20] Token Merging: Your ViT But Faster. arXiv preprint arXiv:2210.09461.
[21] An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. European Conference on Computer Vision.
[22] TopV: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. Proceedings of the Computer Vision and Pattern Recognition Conference.
[23] AdaViT: Adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658.
[24] FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models. arXiv preprint arXiv:2505.19536.
[25] IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning. arXiv preprint arXiv:2602.03060.
[26] Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning. arXiv preprint arXiv:2506.21873.
[27] SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. arXiv preprint arXiv:2410.04417.
[28] LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[29] VisionZip: Longer is better but not necessary in vision language models. Proceedings of the Computer Vision and Pattern Recognition Conference.
[30] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models. arXiv preprint arXiv:2505.22654.
[31] VisionThink: Smart and efficient vision language model via reinforcement learning. arXiv preprint arXiv:2507.13348, 2025.
[32] SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. IEEE International Symposium on Biomedical Imaging, 2021.
[33] PathVQA: 30000+ Questions for Medical Visual Question Answering. arXiv preprint arXiv:2003.10286.
[34] A dataset of clinically generated visual questions and answers about radiology images. Scientific Data.
[35] OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[36] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv preprint arXiv:2305.10415.
[37] MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. arXiv preprint arXiv:2501.18362.
[39] PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247.
[40] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
[41] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
[42] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
[43] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
[44] Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning. arXiv preprint arXiv:2506.07851.
[45] Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning. arXiv preprint arXiv:2505.19213.
[46] Efficient memory management for large language model serving with PagedAttention. Proceedings of Symposium on Operating Systems Principles.
[47] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.
[48] LlamaFactory: Unified efficient fine-tuning of 100+ language models. Proceedings of the Annual Meeting of the Association for Computational Linguistics.
[49] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
[50] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner. arXiv preprint arXiv:2505.11404.
[51] Self-Rewarding Vision-Language Model via Reasoning Decomposition. arXiv preprint arXiv:2508.19652.
[52] DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[53] MedVLThinker: Simple baselines for multimodal medical reasoning. arXiv preprint arXiv:2508.02669.
[54] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv preprint arXiv:2505.15966.
[55] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning. arXiv preprint arXiv:2505.14362.
[56] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000). 2000.
[57] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
[58] M. J. Kearns.
[59] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
[60] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.
[61] Suppressed for Anonymity.
[62] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
[63] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.

[1] [1]

Data-Centric Foundation Models in Computational Healthcare: A Survey

Data-centric foundation models in computational healthcare: A survey , author=. arXiv preprint arXiv:2401.02458 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Advances in Neural Information Processing Systems , volume=

Visual instruction tuning , author=. Advances in Neural Information Processing Systems , volume=

[3] [3]

Machine Learning for Health , pages=

Med-flamingo: a multimodal medical few-shot learner , author=. Machine Learning for Health , pages=

[4] [4]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=

[5] [5]

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, et al

Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale , author=. arXiv preprint arXiv:2406.19280 , year=

work page arXiv

[6] [6]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning , author=. arXiv preprint arXiv:2506.07044 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

International Joint Conference on Neural Networks , pages=

R-llava: Improving med-vqa understanding through visual region of interest , author=. International Joint Conference on Neural Networks , pages=. 2025 , organization=

2025

[8] [8]

arXiv preprint arXiv:2408.02900 , year=

Medtrinity-25m: A large-scale multimodal dataset with multigranular annotations for medicine , author=. arXiv preprint arXiv:2408.02900 , year=

work page arXiv

[9] [9]

InInternational Conference on Medical Image Com- puting and Computer-Assisted Intervention, pages 268–277

Hulu-med: A transparent generalist model towards holistic medical vision-language understanding , author=. arXiv preprint arXiv:2510.08668 , year=

work page arXiv

[10] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning. arXiv preprint arXiv:2505.18503.

[11] Adaptive mixtures of local experts. Neural Computation.

[12] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature.

[13] Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773.

[14] Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939.

[15] MedVLM-R1: Incentivizing medical reasoning capability of vision-language models (VLMs) via reinforcement learning. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2025.

[16] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs. arXiv preprint arXiv:2510.10052.

[17] MedGround-R1: Advancing medical image grounding via spatial-semantic rewarded group relative policy optimization. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2025.

[18] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

[19] DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems.

[20] Token Merging: Your ViT But Faster. arXiv preprint arXiv:2210.09461.

[21] An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. European Conference on Computer Vision.

[22] TopV: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. Proceedings of the Computer Vision and Pattern Recognition Conference.

[23] AdaViT: Adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658.

[24] FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models. arXiv preprint arXiv:2505.19536.

[25] IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning. arXiv preprint arXiv:2602.03060.

[26] Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning. arXiv preprint arXiv:2506.21873.

[27] SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. arXiv preprint arXiv:2410.04417.

[28] LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[29] VisionZip: Longer is better but not necessary in vision language models. Proceedings of the Computer Vision and Pattern Recognition Conference.

[30] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models. arXiv preprint arXiv:2505.22654.

[31] VisionThink: Smart and efficient vision language model via reinforcement learning. arXiv preprint arXiv:2507.13348, 2025.

[32] SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. IEEE International Symposium on Biomedical Imaging, 2021.

[33] PathVQA: 30000+ Questions for Medical Visual Question Answering. arXiv preprint arXiv:2003.10286.

[34] A dataset of clinically generated visual questions and answers about radiology images. Scientific Data.

[35] OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv preprint arXiv:2305.10415.

[37] MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. arXiv preprint arXiv:2501.18362.

[39] PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247.

[40] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

[41] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.

[42] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.

[43] Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.

[44] Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning. arXiv preprint arXiv:2506.07851.

[45] Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning. arXiv preprint arXiv:2505.19213.

[46] Efficient memory management for large language model serving with PagedAttention. Proceedings of Symposium on Operating Systems Principles.

[47] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.

[48] LlamaFactory: Unified efficient fine-tuning of 100+ language models. Proceedings of the Annual Meeting of the Association for Computational Linguistics.

[49] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

[50] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner. arXiv preprint arXiv:2505.11404.

[51] Self-Rewarding Vision-Language Model via Reasoning Decomposition. arXiv preprint arXiv:2508.19652.

[52] DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

[53] MedVLThinker: Simple baselines for multimodal medical reasoning. arXiv preprint arXiv:2508.02669.

[54] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv preprint arXiv:2505.15966.

[55] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning. arXiv preprint arXiv:2505.14362.
