Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Jing Lei; Lu Hou; Ruiying Peng; Xiaohui Li; Xueyu Wu; Yuanzheng Ma

arxiv: 2603.14184 · v2 · pith:6LHSGKTTnew · submitted 2026-03-15 · 💻 cs.CV · cs.AI

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Pith reviewed 2026-05-21 11:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI

keywords multimodal large language models · visual attention · perceptual impairment · reasoning · attention reweighting · visual question answering · training-free method · attention dispersion

The pith

Reasoning in multimodal models scatters visual attention away from key image regions, and reweighting selected attention heads restores focus without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models suffer perceptual impairments during extended reasoning in visual question answering tasks because their visual attention disperses and drifts from the image regions needed to answer the question. Analysis of attention maps reveals that reasoning prompts cut attention to critical areas and that the total attention paid to image tokens correlates with how spatially spread out that attention becomes. To counter this, the authors introduce a training-free Visual Region-Guided Attention framework that chooses visual heads by an entropy-focus criterion and reweights their attention to steer focus back to relevant regions. If correct, the approach would let these models keep better visual grounding while reasoning step by step. Readers would care because it explains a practical failure mode in current multimodal systems and supplies a lightweight fix.

Core claim

The central claim is that attention dispersion during multi-step reasoning is the root cause of perceptual impairment in multimodal large language models. Reasoning prompts measurably reduce attention on question-critical regions, and overall attention volume on image tokens tracks the spatial dispersiveness of that attention. The Visual Region-Guided Attention framework selects visual heads according to an entropy-focus criterion and reweights their attention maps, thereby guiding the model to remain focused on relevant image areas, which in turn improves visual grounding and reasoning accuracy on standard vision-language benchmarks.
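To make the measurement behind this claim concrete, the sketch below computes one plausible version of a relevant-region attention ratio: the share of image-token attention that falls inside an annotated region, taken after averaging the heads within each layer. The paper defines its own Relevant Region Attention Ratio in Sec. 3.1, which this page does not reproduce, so the formula, the function names, and the toy numbers here are assumptions for illustration only.

```python
def relevant_region_attention_ratio(attn_row, region_mask):
    """Share of image-token attention that lands inside the relevant region.

    attn_row: one non-negative weight per image token.
    region_mask: one bool per image token (True = question-relevant).
    Assumed definition, not taken from the paper.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    inside = sum(w for w, m in zip(attn_row, region_mask) if m)
    return inside / total


def layerwise_ratio(attn_by_layer, region_mask):
    """Average heads within each layer, then compute the ratio per layer.

    attn_by_layer: list over layers of list over heads of per-token weights.
    """
    ratios = []
    for heads in attn_by_layer:
        n_heads = len(heads)
        mean_row = [sum(col) / n_heads for col in zip(*heads)]
        ratios.append(relevant_region_attention_ratio(mean_row, region_mask))
    return ratios


# Toy data (invented): one layer, two heads, four image tokens.
# Tokens 0 and 1 form the question-relevant region.
region_mask = [True, True, False, False]
direct_prompt = [[[0.4, 0.4, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2]]]
reason_prompt = [[[0.2, 0.2, 0.3, 0.3], [0.25, 0.25, 0.25, 0.25]]]

direct_ratios = layerwise_ratio(direct_prompt, region_mask)
reason_ratios = layerwise_ratio(reason_prompt, region_mask)
```

In this toy setup the ratio is lower for the reasoning-style attention than for the direct one, which is the kind of gap the claim describes. Real attention tensors would come from the model under study.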

What carries the argument

The Visual Region-Guided Attention (VRGA) framework, which selects visual attention heads by an entropy-focus criterion and reweights their attention to direct focus onto question-relevant image regions during reasoning.
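The two steps named here, head selection and reweighting, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes "entropy-focus" to mean preferring heads whose attention over image tokens has low Shannon entropy, and it reweights by scaling region tokens by a hypothetical factor `alpha` and then rescaling so the head's total attention mass is unchanged. The page does not give the paper's exact criterion or reweighting formula.

```python
import math


def entropy(weights):
    """Shannon entropy (nats) of non-negative weights after normalising them."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_focused_heads(attn, k):
    """Return indices of the k heads with the lowest entropy over image tokens.

    attn: list of per-head attention rows, one weight per image token.
    Low entropy is used as a stand-in for "focused"; this is an assumption.
    """
    order = sorted(range(len(attn)), key=lambda h: entropy(attn[h]))
    return order[:k]


def reweight_toward_region(row, region_mask, alpha=2.0):
    """Multiply weights on region tokens by alpha (hypothetical parameter),
    then rescale the row so its total attention mass is unchanged."""
    boosted = [w * (alpha if in_region else 1.0)
               for w, in_region in zip(row, region_mask)]
    scale = sum(row) / sum(boosted)
    return [w * scale for w in boosted]


# Toy data (invented): 3 heads, 4 image tokens; token 0 is the relevant region.
attn = [
    [0.25, 0.25, 0.25, 0.25],  # fully dispersed head
    [0.70, 0.10, 0.10, 0.10],  # focused head
    [0.40, 0.30, 0.20, 0.10],  # in between
]
region_mask = [True, False, False, False]

heads = select_focused_heads(attn, k=1)
steered = [reweight_toward_region(row, region_mask) if h in heads else row
           for h, row in enumerate(attn)]
```

Only the selected head is modified, and its attention shifts toward the region token while its total mass stays the same. How the relevant region is obtained at inference time is outside this sketch.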

If this is right

Alleviates perceptual degradation during extended reasoning in VQA tasks.
Improves visual grounding by keeping attention on question-relevant image regions.
Raises reasoning accuracy across vision-language benchmarks.
Yields interpretable insights into how multimodal models process visual information.
Operates without any additional training or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The entropy criterion could be extended to select heads dynamically for tasks beyond static images, such as video or diagram reasoning.
Similar lightweight attention adjustments might reduce drift in language-only chain-of-thought settings where models lose track of earlier context.
The observed correlation between total image attention and spatial spread suggests attention volume itself could serve as a simple diagnostic for when perceptual impairment is likely.
Combining the method with existing prompting techniques might produce additive gains on complex multimodal problems.
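The diagnostic idea in the list above (using attention volume as a signal for likely impairment) could be prototyped along these lines. The sketch measures, per sample, the total attention on image patches and a spatial spread score, then correlates the two across samples. The spread measure (attention-weighted RMS distance from the attention centroid) and all names are assumptions chosen for illustration; the paper may quantify dispersiveness differently.

```python
import math


def attention_mass(grid):
    """Total attention placed on image patches (grid is a 2-D list of weights)."""
    return sum(sum(row) for row in grid)


def spatial_spread(grid):
    """Attention-weighted RMS distance of patches from the attention centroid.

    Assumed dispersion measure; requires a grid with non-zero total attention.
    """
    total = attention_mass(grid)
    cells = [(r, c, w / total)
             for r, row in enumerate(grid) for c, w in enumerate(row)]
    centre_r = sum(r * p for r, _, p in cells)
    centre_c = sum(c * p for _, c, p in cells)
    return math.sqrt(sum(p * ((r - centre_r) ** 2 + (c - centre_c) ** 2)
                         for r, c, p in cells))


def pearson(xs, ys):
    """Pearson correlation; both inputs must have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy 2x2 patch grids (invented numbers; they say nothing about the
# direction or strength of the correlation reported in the paper).
samples = [
    [[0.0, 0.0], [0.0, 0.6]],  # all attention on one patch
    [[0.0, 0.2], [0.0, 0.3]],  # split across two patches
    [[0.1, 0.1], [0.1, 0.1]],  # evenly spread
]
masses = [attention_mass(g) for g in samples]
spreads = [spatial_spread(g) for g in samples]
r = pearson(masses, spreads)
```

On real data the grids would be per-sample attention maps over image patches, and the sign and size of `r` would have to be read off the measurements rather than assumed.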

Load-bearing premise

Reweighting attention in heads chosen by the entropy-focus criterion will consistently pull focus back to relevant regions and raise accuracy without creating fresh errors or hurting results on non-reasoning tasks.

What would settle it

Running the VRGA method on a reasoning VQA benchmark and finding that accuracy or grounding scores fall below the unmodified baseline model would falsify the claim that the reweighting reliably mitigates perceptual degradation.
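One way to run such a check is to compare per-question correctness for the baseline and the VRGA-modified model on the same questions and to put an interval around the accuracy gap. The sketch below uses a paired bootstrap, which is a choice made here for illustration and is not described in the paper; the function names and the toy correctness vectors are hypothetical.

```python
import random


def accuracy_gap(baseline_correct, vrga_correct):
    """VRGA accuracy minus baseline accuracy on the same questions.

    Both inputs are lists of 0/1 flags, aligned by question.
    """
    n = len(baseline_correct)
    return (sum(vrga_correct) - sum(baseline_correct)) / n


def paired_bootstrap_ci(baseline_correct, vrga_correct,
                        n_resamples=2000, level=0.95, seed=0):
    """Percentile interval for the accuracy gap, resampling questions
    with replacement and keeping each question's two outcomes paired."""
    rng = random.Random(seed)
    n = len(baseline_correct)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(vrga_correct[i] - baseline_correct[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((1 - level) / 2 * n_resamples)]
    upper = gaps[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Toy outcomes (invented): 1 = answered correctly, 0 = answered incorrectly.
baseline = [1, 0, 1, 0, 1, 1, 0, 0]
with_vrga = [1, 1, 1, 0, 1, 1, 1, 0]

gap = accuracy_gap(baseline, with_vrga)
low, high = paired_bootstrap_ci(baseline, with_vrga)
```

An interval that sits entirely below zero on a reasoning VQA benchmark would be the kind of evidence described above. An interval that straddles zero would leave the question open.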

Figures

Figures reproduced from arXiv: 2603.14184 by Jing Lei, Lu Hou, Ruiying Peng, Xiaohui Li, Xueyu Wu, Yuanzheng Ma.

**Figure 1.** Comparison of Prompting Strategies in VQA Tasks. This figure compares three prompting strategies (Direct, Region-Guided, and Reason) in visual question answering (VQA) tasks. The attention maps show that the Direct mode focuses correctly on question-relevant regions, while the Region-Guided approach further reduces attention to irrelevant areas, enhancing visual grounding. In contrast, the Reason mode disp…

**Figure 2.** Layer-wise relevant region attention ratio across models. The vertical axis denotes the Relevant Region Attention Ratio, which measures the degree of attention allocated to question-relevant regions during VQA (as defined in Sec. 3.1). The horizontal axis represents the Transformer layer index, where the attention maps are averaged across heads. Green and red curves correspond to correct and wrong predict…

**Figure 3.** Impact of prompting strategies on visual grounding and performance. The bar chart (left) compares TextVQA accuracy under three prompting strategies: Reason, Direct, and Region-Guided. The line plots show the layer-wise RRAR (as defined in Sec. 3.1) from the question-end token to relevant visual regions, which is used to measure the focus on question-related areas. Shaded regions indicate the interquartile …

**Figure 4.** Attention mechanism analysis in vision models.

**Figure 5.** Overview of the Visual Region-Guided Attention (VRGA) framework. Our method enhances the visual grounding capability of …

Original abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper spots attention drift in MLLMs during reasoning and offers a training-free head-reweighting fix, but the causal link to perceptual gains stays under-tested.

The core observation is that reasoning prompts in MLLMs scatter visual attention away from the parts of the image that actually matter for the question. The authors back this with attention map analysis and a correlation between overall image-token attention and spatial spread. That part feels like a useful diagnostic step. From there they build VRGA, which picks visual heads by an entropy-focus rule and reweights them at inference time. The method is training-free and the abstract claims gains on grounding and accuracy across benchmarks. If the full experiments hold up with clear ablations, this could be a lightweight patch worth trying in VQA pipelines.

What is actually new is the specific entropy criterion for head selection tied to this dispersion story; prior attention work in MLLMs has looked at heads but not quite in this reasoning-triggered way. The analysis of how reasoning changes attention maps is also a solid contribution on its own.

The soft spot is the causal interpretation. The stress-test note is right: reweighting heads could improve scores through global entropy reduction or noise suppression rather than by genuinely restoring focus on question-relevant regions. Without controls that isolate spatial relevance, such as comparing against random head selection or non-spatial reweighting, the claim that VRGA fixes perceptual impairment by guiding attention stays observational. The paper would benefit from failure-case analysis and checks on non-reasoning tasks to show no hidden costs.

This is aimed at practitioners who need more reliable visual reasoning without retraining. Readers working on MLLM internals or lightweight inference tweaks will get the most out of the attention diagnostics and the method description. It deserves a serious referee because the problem is real and the proposed fix is simple enough to test quickly, even if the current evidence for the mechanism is still thin.

Referee Report

2 major / 2 minor

Summary. The paper identifies attention dispersion as the cause of perceptual impairments in MLLMs during multi-step reasoning on VQA tasks, showing that reasoning prompts reduce attention to question-critical image regions and that overall image-token attention correlates with spatial dispersiveness. It introduces a training-free VRGA method that selects visual attention heads via an entropy-focus criterion and reweights their attention maps to restore focus on relevant regions, with experiments claiming improved grounding and accuracy on vision-language benchmarks.

Significance. If the causal link between VRGA's head selection/reweighting and restored spatial grounding holds, the work offers both a mechanistic explanation for reasoning-induced perceptual degradation and a practical inference-time intervention applicable to existing MLLMs without retraining. The entropy-focus criterion and correlation analysis provide interpretable insights that could inform future attention-based diagnostics in multimodal models.

major comments (2)

§4 (Method) and §5 (Experiments): The central claim that VRGA mitigates perceptual impairment specifically by guiding attention to question-relevant regions rests on observational correlations between reasoning prompts, attention drift, and dispersiveness. However, the intervention results report only downstream accuracy and grounding metrics; no direct pre/post-VRGA attention-map comparisons on annotated relevant regions are provided to rule out alternative mechanisms such as global entropy reduction or noisy-head suppression. This leaves the causal interpretation under-supported.
§5.2 (Ablation studies): The entropy-focus head selection is presented as key, yet the paper does not report an ablation that applies random or uniform head selection with the same reweighting strength. Without this control, it is unclear whether gains derive from the entropy criterion's grounding effect or from any form of attention regularization.
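The control requested in the second major comment could be organised roughly as below: build an entropy-selected head set, a random head set of the same size, and an all-heads arm, then score each with the same evaluation routine. This is a hypothetical harness, not something the paper reports. It reads "uniform" as reweighting every head, assumes low entropy marks a focused head, and treats `evaluate` as a user-supplied callback standing in for a full benchmark run.

```python
import math
import random


def entropy(weights):
    """Shannon entropy (nats) of non-negative weights after normalising them."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


def head_sets(attn, k, seed=0):
    """Build the head set for each ablation arm.

    The entropy and random arms share the same budget k; the all-heads arm
    applies reweighting everywhere (one reading of "uniform" selection).
    """
    n = len(attn)
    by_entropy = sorted(range(n), key=lambda h: entropy(attn[h]))
    return {
        "entropy_focus": sorted(by_entropy[:k]),
        "random": sorted(random.Random(seed).sample(range(n), k)),
        "all_heads": list(range(n)),
    }


def run_ablation(attn, k, evaluate, seed=0):
    """Score every arm with the same evaluate(heads) callback (hypothetical)."""
    return {name: evaluate(heads)
            for name, heads in head_sets(attn, k, seed).items()}


# Toy data (invented): 4 heads, 4 image tokens.
attn = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
arms = head_sets(attn, k=2)
# Stand-in evaluator that just counts heads; a real one would run the benchmark.
scores = run_ablation(attn, k=2, evaluate=len)
```

Repeating the random arm over several seeds and holding the reweighting strength fixed across arms would keep the comparison aligned with what the comment asks for.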

minor comments (2)

The abstract and introduction would benefit from explicit naming of the vision-language benchmarks and the exact VQA datasets used for quantitative evaluation.
Notation for the entropy-focus criterion and reweighting formula should be introduced with a clear equation in §4 to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the causal evidence for VRGA and the specificity of the head-selection criterion. We address each major comment below and outline planned revisions.

Referee: §4 (Method) and §5 (Experiments): The central claim that VRGA mitigates perceptual impairment specifically by guiding attention to question-relevant regions rests on observational correlations between reasoning prompts, attention drift, and dispersiveness. However, the intervention results report only downstream accuracy and grounding metrics; no direct pre/post-VRGA attention-map comparisons on annotated relevant regions are provided to rule out alternative mechanisms such as global entropy reduction or noisy-head suppression. This leaves the causal interpretation under-supported.

Authors: We agree that direct pre/post attention-map comparisons on annotated regions would provide stronger causal support. The manuscript currently relies on (i) the §4 correlation between reasoning-induced dispersion and reduced attention to critical regions and (ii) downstream gains in grounding metrics that quantify improved localization. To address the concern, we will add visualizations of attention maps before and after VRGA, restricted to question-relevant annotated regions, in the revised §5. These will help distinguish the proposed mechanism from global entropy reduction or generic suppression.

revision: yes

Referee: §5.2 (Ablation studies): The entropy-focus head selection is presented as key, yet the paper does not report an ablation that applies random or uniform head selection with the same reweighting strength. Without this control, it is unclear whether gains derive from the entropy criterion's grounding effect or from any form of attention regularization.

Authors: We concur that this control ablation is necessary to isolate the contribution of the entropy-focus criterion. In the revised manuscript we will include an additional ablation in §5.2 that applies random and uniform head selection followed by identical reweighting. Results will be reported alongside the original entropy-based selection to demonstrate that performance improvements are specifically tied to choosing low-entropy (high-focus) heads rather than non-specific attention regularization.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: analysis-driven heuristic remains independent of target metrics

The paper's chain begins with observational analysis of attention maps under reasoning prompts, identifies a correlation between overall image-token attention and spatial dispersiveness, and then defines a training-free VRGA intervention that selects heads via an entropy-focus criterion and reweights attention. This criterion and reweighting rule are constructed from the observed patterns rather than fitted to final VQA accuracy or defined in terms of the very grounding improvement they claim to produce. No equations reduce the claimed alleviation of perceptual impairment to a tautology, self-referential definition, or load-bearing self-citation. The method is evaluated on external vision-language benchmarks, leaving the central claim independently falsifiable and non-circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view limits visibility into parameters or assumptions; the work relies on the domain assumption that attention maps are extractable and meaningful for guiding reweighting.

axioms (1)

domain assumption: Attention maps from MLLMs can be extracted and used to identify question-relevant regions via the entropy-focus criterion.
Invoked when analyzing attention maps and proposing reweighting based on them.

pith-pipeline@v0.9.0 · 5742 in / 1152 out tokens · 44222 ms · 2026-05-21T11:52:18.831347+00:00 · methodology

[1] [1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Ddot: A derivative-directed dual-decoder ordinary differential equation transformer for dynamic system modeling

Yang Chang, Kuang-Da Wang, Ping-Chun Hsieh, Cheng- Kuan Lin, and Wen-Chih Peng. Ddot: A derivative-directed dual-decoder ordinary differential equation transformer for dynamic system modeling. InPacific-Asia Conference on Knowledge Discovery and Data Mining, pages 434–445. Springer, 2025. 3

work page 2025

[4] [4]

Are we on the right way for evaluating large vision-language models?Advances in Neural Informa- tion Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Informa- tion Processing Systems, 37:27056–27087, 2024. 2, 8

work page 2024

[5] [5]

Interleaved-modal chain-of-thought

Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025. 2, 3, 8

work page 2025

[6] [6]

Gemini: Our most capable multi- modal ai model yet.https://deepmind.google/ technologies/gemini/, 2023

Google DeepMind. Gemini: Our most capable multi- modal ai model yet.https://deepmind.google/ technologies/gemini/, 2023. 1

work page 2023

[7] [7]

Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnos- tic suite for entangled language hallucination and visual il- lusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,...

work page 2024

[8] [8]

See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025. 3

work page arXiv 2025

[9] [9]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

work page 2022

[10] [10]

Improved visual-spatial reasoning via r1-zero-like training.arXiv preprint arXiv:2504.00883, 2025

Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Improved visual-spatial reasoning via r1-zero-like training.arXiv preprint arXiv:2504.00883, 2025. 1

work page arXiv 2025

[11] [11]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Repre- sentations, 2023. 1

work page 2023

[12] [12]

Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blun- som. Program induction by rationale generation: Learning to solve and explain algebraic word problems.arXiv preprint arXiv:1705.04146, 2017. 1

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

Advances in Neural Information Processing Systems , year =

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025. 1, 2, 3

work page arXiv 2025

[14] [14]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

[15] [15]

Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, and Yuan Qi. Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain. arXiv preprint arXiv:2501.15587, 2025.

[16] [17]

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

[17] [18]

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025.

[18] [19]

Lingfeng Ming, Yadong Li, Song Chen, Jianhua Xu, Zenan Zhou, and Weipeng Chen. Ocean-r1: An open and generalizable large vision-language model enhanced by reinforcement learning, 2025.

[19] [20]

Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024.

[20] [21]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.

[21] [22]

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.

[22] [23]

Qwen Team. Qwen3 technical report, 2025.

[23] [24]

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024.

[24] [25]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[25] [26]

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025.

[26] [27]

Zhecan Wang, Garrett Bingham, Adams Wei Yu, Quoc V Le, Thang Luong, and Golnaz Ghiasi. Haloquest: A visual hallucination dataset for advancing multimodal reasoning. In European Conference on Computer Vision, pages 288–304. Springer, 2024.

[27] [28]

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

[28] [29]

Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. Gsm8k-v: Can vision language models solve grade school math word problems in visual contexts. arXiv preprint arXiv:2509.25160, 2025.

[29] [30]

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025.

[30] [31]

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing “thinking with images” via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.