LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression

Bowen Yuan; Shijie Wang; Yadan Luo; Zi Huang; Zijian Wang

arxiv: 2607.01707 · v1 · pith:R23N5BQ5new · submitted 2026-07-02 · 💻 cs.CV

Pith reviewed 2026-07-03 16:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual forgetting · large vision-language models · attention preservation · sink suppression · visual grounding reward · post-training · LVLMs · attention trajectory

The pith

LASER uses two rewards to keep large vision-language models focused on visual evidence instead of drifting away during long reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models lose access to visual evidence as decoding proceeds because attention moves away from images early on and piles up on a few unhelpful visual tokens. The paper identifies these two patterns through attention analysis and introduces LASER, a post-training method that adds a Visual Grounding Reward to sustain attention on salient image tokens and a Sink Suppression Reward to reduce focus on irrelevant ones. The approach regulates both the overall attention path and the distribution among visual tokens. Experiments show consistent gains over baselines on eight datasets. A sympathetic reader would see this as a direct way to embed visual grounding into the training signal rather than patching it at inference time.

Core claim

The central claim is that visual forgetting in LVLMs stems mainly from two sources: early-stage attention decay that blocks evidence acquisition, and attention concentrating on task-irrelevant visual sink tokens. Post-training with a Visual Grounding Reward, which maintains attention on salient visual tokens, and a Sink Suppression Reward, which penalizes excessive concentration on sinks, is claimed to correct both issues, preserve the attention trajectory, and raise performance on downstream tasks.

What carries the argument

The pair of Visual Grounding Reward and Sink Suppression Reward that together enforce sustained attention on salient visual tokens and limit collapse onto sink tokens.
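One way to picture what the two rewards measure is the sketch below. It assumes a per-step attention row over all key positions, a known set of visual token positions, and a known set of sink positions. The function names and the reward formulas (mean VAP, and one minus the mean sink ratio) are illustrative assumptions, not the paper's definitions.

```python
from typing import Sequence, Set, Tuple


def visual_attention_proportion(attn_row: Sequence[float], visual_idx: Set[int]) -> float:
    """Share of one decoding step's attention mass that lands on visual tokens."""
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in visual_idx) / total


def sink_attention_ratio(attn_row: Sequence[float], visual_idx: Set[int],
                         sink_idx: Set[int]) -> float:
    """Share of the visual attention mass that lands on sink tokens."""
    visual_mass = sum(attn_row[i] for i in visual_idx)
    if visual_mass == 0:
        return 0.0
    return sum(attn_row[i] for i in (sink_idx & visual_idx)) / visual_mass


def toy_rewards(attn_rows: Sequence[Sequence[float]], visual_idx: Set[int],
                sink_idx: Set[int]) -> Tuple[float, float]:
    """Hypothetical stand-ins for the two rewards (assumed forms).

    r_vis  : mean VAP over decoding steps (higher = more visual grounding).
    r_supp : 1 - mean sink ratio (higher = less mass on sink tokens).
    """
    if not attn_rows:
        raise ValueError("need at least one decoding step")
    steps = len(attn_rows)
    r_vis = sum(visual_attention_proportion(r, visual_idx) for r in attn_rows) / steps
    r_supp = 1.0 - sum(sink_attention_ratio(r, visual_idx, sink_idx) for r in attn_rows) / steps
    return r_vis, r_supp
```

In this toy form, a response whose attention stays on the image scores a higher `r_vis`, while one whose visual attention pools on the sink positions scores a lower `r_supp`, which mirrors the trajectory-versus-distribution split described above.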

If this is right

The model keeps attention on semantically salient visual tokens across the full decoding sequence.
Attention no longer collapses onto a small set of task-irrelevant visual sink tokens.
Performance rises above strong baselines on all eight evaluated benchmark datasets.
Attention-aware post-training becomes a practical remedy for visual forgetting without altering model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward structure could be tested on models that process longer image sequences or video to check whether early decay is equally dominant.
Embedding these rewards might reduce reliance on inference-time attention interventions in other multimodal settings.
The method opens a route to compare attention regularization against data curation as ways to improve grounding.
If the sink tokens prove consistent across models, a shared suppression term could be developed once rather than per model.

Load-bearing premise

That early attention decay and concentration on sink tokens are the primary drivers of performance loss when visual forgetting occurs.

What would settle it

Training an LVLM with the two rewards yet finding that early-stage attention still decays away from visual tokens or that performance gains disappear on the same eight benchmarks.

Figures

Figures reproduced from arXiv: 2607.01707 by Bowen Yuan, Shijie Wang, Yadan Luo, Zi Huang, Zijian Wang.

**Figure 1.** Illustration of visual forgetting during long-horizon reasoning in LVLMs. Left: an input image and the corresponding question. Middle: the Visual Attention Proportion (VAP) over generation steps shows that attention to visual tokens peaks early and progressively decays during reasoning. Right: the difference in attention maps reveals that later decoding stages could attend less to task-relevant visual regions.…

**Figure 2.** Accuracy on MMStar under visual KV cache scaling factor at various reasoning stages. … decays [26, 49, 53]. However, existing characterizations primarily document the overall decay of visual attention, without examining its temporal sensitivity or internal allocation patterns. To uncover these finer-grained dynamics, we re-examine attention trajectories throughout the reasoning process. Through case stu…

**Figure 3.** Visualizations of attention maps over reasoning steps. We identify image regions that receive high attention scores despite lacking semantic relevance to the task. … grounding errors happen at this stage, errors in evidence acquisition propagate autoregressively throughout the remaining reasoning process. At the same time, we observe that the visual attention that persists is not always directed toward …

**Figure 4.** Analysis of visual sink token attention. Left. Redistributing attention away from sink tokens yields consistent accuracy gains over uniform scaling. Right. Visual sink attention ratio across generation steps for GRPO vanilla vs. GRPO + Rvis. 4.1 Preliminaries. We build on Group Relative Policy Optimization (GRPO) [46], which eliminates the value function and estimates advantages by comparing rewards within …

**Figure 5.** Overview of LASER. Visual token attention is extracted during generation to compute two complementary rewards: Rvis sustains overall VAP across reasoning steps, while Rsupp identifies and suppresses sink tokens to rectify visual attention from visual sink tokens. … process consists of two stages. First, we isolate a subset of massive activation dimensions D∗ ⊆ [D], where [D] denotes the full set of hidden st…

**Figure 6.** Analysis of VAP during generation on various datasets. As reported in Tab. 2, both Rvis and Rsupp yield consistent performance gains across all benchmarks. Specifically, integrating Rvis improves the average score from 51.1 to 52.2 by reducing visual forgetting and maintaining attention on task-relevant regions. The subsequent addition of Rsupp further improves performance, particularly in general reasonin…

**Figure 7.** Left. Visual sink token attention across generation steps for the model trained by LASER. Right. Response length during the training process. 5.4 Further Analysis. LASER preserves visual attention throughout generation

**Figure 8.** A qualitative comparison between Qwen2.5-VL-Instruct and LASER on a visual perception task. Premature visual disengagement in the base model leads to an erroneous inference, while LASER maintains visual attention and answers correctly. … example, the base Qwen2.5-VL-Instruct model makes an erroneous summarization in its early-stage reasoning chain, resulting in an incorrect answer after long-horizon reasonin…

**Figure 9.** Qualitative results of LASER.

**Figure 10.** Qualitative results of LASER.

**Figure 11.** Qualitative results of LASER.

**Figure 12.** Qualitative results of LASER.
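The Figure 4 excerpt notes that GRPO drops the value function and estimates advantages by comparing rewards within a group of sampled responses. A minimal sketch of that normalization follows, assuming the common mean and standard-deviation form. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions for illustration; how LASER weights Rvis and Rsupp into each response's reward is not shown.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward against its own sampling group.

    rewards: scalar rewards for the responses sampled from one prompt.
    eps:     small constant (assumed) that avoids division by zero when
             every response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward in the group")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, a response is only credited for being better than its siblings from the same prompt; a group with identical rewards yields zero advantage for every member.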

Original abstract

Large vision-language models (LVLMs) exhibit strong reasoning ability but suffer from visual forgetting during long-horizon decoding, where attention progressively drifts away from visual evidence. Existing methods largely treat this issue as a late-stage attention decay problem or attempt to mitigate it through heuristic reminders or post-hoc attention lifting. Through systematic empirical analysis, we find that performance degradation under visual forgetting is largely driven by two overlooked factors: early-stage attention decay disrupts evidence acquisition, and attention concentration on a subset of task-irrelevant visual sink tokens. Motivated by these insights, we propose LASER, a post-training framework that regulates both the visual attention trajectory and intra-visual token attention distribution during reasoning. Technically, LASER introduces two complementary rewards: a Visual Grounding Reward, which encourages the model to maintain attention on semantically salient visual tokens throughout decoding, and a Sink Suppression Reward, which penalizes excessive attention concentration on visual sink tokens. Together, these rewards preserve early-stage grounding while preventing attention collapse onto uninformative regions. Extensive experiments on eight benchmark datasets demonstrate that LASER consistently outperforms strong baselines, validating attention-aware training as an effective remedy for visual forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LASER adds two simple rewards to steer attention in LVLMs during long decoding, but the abstract gives no numbers or ablations and the causal link from their empirical analysis to the fixes is not yet clear.

The main takeaway is that this paper offers a post-training fix for visual forgetting in large vision-language models by adding a Visual Grounding Reward to keep attention on salient tokens and a Sink Suppression Reward to avoid collapse onto uninformative ones. They motivate it with observations that decay starts early and that certain sink tokens draw too much focus.

What the work does is identify those two factors through what they describe as systematic empirical analysis and then turn them into explicit rewards that operate during decoding. This is more targeted than generic reminders or late-stage lifting, and it avoids full retraining. The approach is practical for people who already have a deployed LVLM and want to improve long-horizon behavior.

The soft spots are straightforward. The abstract reports outperformance on eight datasets but supplies zero numbers, baseline details, or ablation results, so the size of the gains and whether both rewards are necessary remain unknown. More importantly, the stress-test point lands: if the analysis only correlates attention maps with accuracy drops without interventions that hold the model and prompt fixed while changing only the early decay or sink concentration, then the rewards may be treating symptoms rather than the load-bearing causes. The paper would be stronger with at least one controlled test showing that suppressing sinks or preserving early attention produces the claimed improvement when other factors are held constant.

The method itself looks like standard reward shaping rather than a new theoretical derivation, which is fine for an applied paper but means the novelty rests on the specific combination and the empirical motivation.

This is worth a reading group if your group works on multimodal decoding or attention regularization; the ideas are concrete enough to discuss even without the numbers. It deserves peer review so referees can check the experiments and the causal evidence. I would not cite it yet.

Referee Report

2 major / 1 minor

Summary. The paper claims that visual forgetting in LVLMs during long-horizon decoding is largely driven by two factors identified via systematic empirical analysis: early-stage attention decay that disrupts evidence acquisition, and attention concentration on task-irrelevant visual sink tokens. Motivated by this, it introduces LASER, a post-training framework using two rewards—a Visual Grounding Reward to maintain attention on semantically salient visual tokens and a Sink Suppression Reward to penalize excessive concentration on sinks—resulting in consistent outperformance over strong baselines on eight benchmark datasets.

Significance. If the empirical analysis and results hold under scrutiny, the work would demonstrate that attention-aware post-training can effectively mitigate visual forgetting in LVLMs, offering a practical remedy beyond heuristic reminders or post-hoc fixes. The dual-reward design targeting both trajectory and distribution is a targeted contribution if the causal drivers are isolated.

major comments (2)

[Abstract] The central motivation rests on the claim that performance degradation 'is largely driven' by early-stage attention decay and sink concentration, identified through 'systematic empirical analysis.' However, the abstract supplies no description of the analysis (e.g., whether it includes controlled interventions that isolate these factors while holding model, prompt, and decoding fixed, versus purely observational attention-map vs. accuracy correlations). This is load-bearing because the two rewards are directly derived from these factors; if the analysis is only correlational, the rewards may target symptoms rather than primary causes.
[Abstract] The claim of 'consistent outperformance on eight benchmark datasets' is presented without any quantitative results, baseline specifications, ablation studies, dataset sizes, or error analysis. This prevents assessment of effect sizes or whether the gains are robust, directly undermining evaluation of the central claim that the proposed rewards are an 'effective remedy.'

minor comments (1)

[Abstract] 'Visual sink tokens' is used without a definition or citation to prior work on attention sinks; this term should be clarified on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our manuscript. We address each major comment point by point below, focusing on the abstract's presentation of our empirical analysis and results. Where revisions to the abstract can improve clarity without exceeding length constraints, we indicate this explicitly.

Referee: [Abstract] The central motivation rests on the claim that performance degradation 'is largely driven' by early-stage attention decay and sink concentration, identified through 'systematic empirical analysis.' However, the abstract supplies no description of the analysis (e.g., whether it includes controlled interventions that isolate these factors while holding model, prompt, and decoding fixed, versus purely observational attention-map vs. accuracy correlations). This is load-bearing because the two rewards are directly derived from these factors; if the analysis is only correlational, the rewards may target symptoms rather than primary causes.

Authors: The abstract is a high-level summary, but the full manuscript details the systematic empirical analysis in Section 3. There, we conduct controlled interventions that hold the model, prompts, and decoding strategy fixed while isolating the effects of early-stage attention decay (via attention trajectory interventions) and sink token concentration (via targeted masking experiments), directly measuring causal impacts on accuracy. This establishes the factors as primary drivers rather than mere correlations, motivating the dual-reward design. We will revise the abstract to briefly note that the analysis involves such controlled interventions. revision: yes
Referee: [Abstract] The claim of 'consistent outperformance on eight benchmark datasets' is presented without any quantitative results, baseline specifications, ablation studies, dataset sizes, or error analysis. This prevents assessment of effect sizes or whether the gains are robust, directly undermining evaluation of the central claim that the proposed rewards are an 'effective remedy.'

Authors: Abstracts are space-constrained and conventionally omit specific numbers, tables, and detailed ablations, which are instead provided in the main text (Section 4, Table 1, and ablation studies in Section 4.3) along with dataset details and error analysis. These elements demonstrate the consistent gains and robustness. We can partially revise the abstract to include one or two key quantitative highlights (e.g., average improvement margins) if space permits, while preserving its summary nature. revision: partial
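
The sink-side intervention discussed in the first response, and the redistribution reported in the Figure 4 caption, can be pictured on a single attention row. The sketch below is a hypothetical illustration, not the authors' procedure: it assumes the sink positions are already known, and the function name and the `keep` parameter are invented here. It moves attention mass from sink tokens to the remaining visual tokens in proportion to their current weights, so the row keeps the same total.

```python
from typing import List, Sequence, Set


def redistribute_from_sinks(attn_row: Sequence[float], visual_idx: Set[int],
                            sink_idx: Set[int], keep: float = 0.0) -> List[float]:
    """Shift attention mass from visual sink tokens to non-sink visual tokens.

    keep: fraction of each sink's mass left in place (assumed knob, 0 = full removal).
    Text-token weights are untouched and the row total is preserved.
    """
    if not 0.0 <= keep <= 1.0:
        raise ValueError("keep must be in [0, 1]")
    row = list(attn_row)
    sinks = sorted(sink_idx & visual_idx)
    targets = sorted(visual_idx - sink_idx)
    target_mass = sum(attn_row[i] for i in targets)
    if not sinks or not targets or target_mass == 0:
        return row  # nothing to move, or nowhere to move it
    freed = 0.0
    for i in sinks:
        freed += attn_row[i] * (1.0 - keep)
        row[i] = attn_row[i] * keep
    for i in targets:
        # Each non-sink visual token receives a share proportional to its weight.
        row[i] = attn_row[i] + freed * (attn_row[i] / target_mass)
    return row
```

Holding the model, prompt, and decoding fixed while applying only this change to the attention rows is the kind of comparison that would separate a causal effect of sink concentration from a correlation with accuracy.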

Circularity Check

0 steps flagged

No circularity; empirical post-training method with independent motivation

The provided abstract and description contain no equations, fitted parameters, or derivations. The two rewards are explicitly introduced as new constructs motivated by (but not defined in terms of) the empirical observations. No self-citation chains, uniqueness theorems, or ansatzes are referenced. The central claim rests on experimental validation rather than any reduction of outputs to inputs by construction. This is the most common honest non-finding for applied post-training papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5742 in / 1158 out tokens · 19651 ms · 2026-07-03T16:51:33.010940+00:00 · methodology

Reference graph

Works this paper leans on

87 extracted references · 34 canonical work pages · 19 internal anchors

[1]

In: NeurIPS (2022)

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Si- monyan, K.: Flamingo:...

2022
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. CoRR (2025).https://doi.org/10.48550/ ARXIV.2502.13923,https://doi.org/1...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923 2025
[3]

PaliGemma: A versatile 3B VLM for transfer

Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A.A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bosnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bica, I., ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.07726 2024
[4]

arXiv preprint arXiv:2502.00698 (2025)

Cai, H., Yang, Y., Hu, W.: Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698 (2025)

work page arXiv 2025
[5]

InternLM2 Technical Report

Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., et al.: Internlm2 technical report. CoRR (2024),https://doi.org/10.48550/arXiv.2403.17297

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.17297 2024
[6]

In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E., Lin, L.: Geoqa: A geo- metric question answering benchmark towards multimodal numerical reasoning. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 513–523 (2021)

2021
[7]

In: ECCV

Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: ECCV. pp. 19–35. Springer (2024)

2024
[8]

Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., Zhao, F.: Are we on the right way for evaluating large vision-language models? In: NeurIPS (2024)

2024
[9]

arXiv preprint arXiv:2506.04207 (2025)

Chen, S., Guo, Y., Su, Z., Li, Y., Wu, Y., Chen, J., Chen, J., Wang, W., Qu, X., Cheng, Y.: Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207 (2025)

work page arXiv 2025
[10]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025) 16 B. Yuan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.: Openvlthinker: An early exploration to complex vision-language reasoning via iterative self- improvement. CoRR (2025),https://doi.org/10.48550/arXiv.2503.17352

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.17352 2025
[13]

Frontiers of Computer Science (2026)

Du, J., Li, Y., Li, Y., Liao, L., Zhao, Z., Ye, G.: Medfuse: a multi-source data fusion framework for diabetic retinopathy lesion segmentation. Frontiers of Computer Science (2026)

2026
[14]

CoRR (2025),https://doi.org/10.48550/arXiv.2501.01904

Du, Y., Liu, Z., Li, Y., Zhao, W.X., Huo, Y., Wang, B., Chen, W., Liu, Z., Wang, Z., Wen, J.: Virgo: A preliminary exploration on reproducing o1-like MLLM. CoRR (2025),https://doi.org/10.48550/arXiv.2501.01904

work page doi:10.48550/arxiv.2501.01904 2025
[15]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11198–11201 (2024)

2024
[16]

com/verl-project/verl(2025)

Engine, V.: verl: Volcano engine reinforcement learning for llms.https://github. com/verl-project/verl(2025)

2025
[17]

In: CVPR

Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swami- nathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: CVPR. pp. 14303–14312. IEEE (2024)

2024
[18]

In: CVPR (2026)

Fu, Y., Zhang, Z., Zhang, Y., Wang, Z., Huang, Z., Luo, Y.: Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. In: CVPR (2026)

2026
[19]

In: EMNLP

Gong, X., Ming, T., Wang, X., Wei, Z.: DAMRO: dive into the attention mechanism of LVLM to reduce object hallucination. In: EMNLP. pp. 7696–7712. Association for Computational Linguistics (2024)

2024
[20]

In: CVPR

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: CVPR. IEEE (2024)

2024
[21]

In: The Fourteenth International Conference on Learning Representations (ICLR) (2026)

Hao, Z., Li, Z., Li, W., Liu, F., Zhang, M., Li, J.: Echoes as anchors: Probabilistic costs and attention refocusing in llm reasoning. In: The Fourteenth International Conference on Learning Representations (ICLR) (2026)

2026
[22]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al.: Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3828–3850 (2024)

2024
[23]

In: NeurIPS (2024)

Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Krishna, R.: Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In: NeurIPS (2024)

2024
[24]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. CoRR (2025),https://doi.org/10.48550/arXiv.2503.06749

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.06749 2025
[25]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

CoRR (2025),https://doi

Jian, P., Wu, J., Sun, W., Wang, C., Ren, S., Zhang, J.: Look again, think slowly: Enhancing visual reflection in vision-language models. CoRR (2025),https://doi. org/10.48550/arXiv.2509.12132 LASER 17

work page doi:10.48550/arxiv.2509.12132 2025
[27]

In: ICLR (2025)

Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. In: ICLR (2025)

2025
[28]

MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

Leng, S., Wang, J., Li, J., Zhang, H., Hu, Z., Zhang, B., Jiang, Y., Zhang, H., Li, X., Bing, L., et al.: Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268 (2025)

work page arXiv 2025
[29]

In: CVPR

Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: CVPR. pp. 13872–13882. IEEE (2024)

2024
[30]

Advances in Neural Information Processing Systems (2026)

Li, Q., Zhang, Y., Mai, Z., Chen, Y., Huang, H., Zhang, J., Zhang, Z., Wen, Y., Li, W., Fu, H., et al.: Can large multimodal models understand agricultural scenes? benchmarking with agromind. Advances in Neural Information Processing Systems (2026)

2026
[31]

In: ACL (2026)

Li, X., et al.: Llm inductive reasoning through multi-agent enhanced monte carlo tree search. In: ACL (2026)

2026
[32]

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Li, Z., Yu, W., Huang, C., Liu, R., Liang, Z., Liu, F., Che, J., Yu, D., Boyd- Graber, J.L., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition. CoRR (2025),https://doi.org/10.48550/arXiv.2508.19652

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.19652 2025
[33]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Lin, Z., Lin, M., Lin, L., Ji, R.: Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 5334–5342 (2025)

2025
[34]

In: CVPR

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. IEEE (2024)

2024
[35]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) NeurIPS (2023)

2023
[36]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Liu, W., Pan, Q., Zhang, Y., Liu, Z., Wu, J., Zhou, J., Zhou, A., Chen, Q., Jiang, B., He, L.: Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 12585–12591 (2025)

2025
[37]

arXiv preprint arXiv:2505.23091 (2025)

Liu, Z., Liu, Y., Zhu, G., Xie, C., Li, Z., Yuan, J., Wang, X., Li, Q., Cheung, S.C., Zhang, S., et al.: Infi-mmr: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models. arXiv preprint arXiv:2505.23091 (2025)

work page arXiv 2025
[38]

In: ICLR

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: ICLR. OpenReview.net (2024)

2024
[39]

Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-gps: In- terpretable geometry problem solving with formal language and symbolic reasoning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers...

2021
[40]

IconQA: A new benchmark for abstract diagram understanding and visual language reasoning.arXiv preprint arXiv:2110.13214, 2021

Lu, P., Qiu, L., Chen, J., Xia, T., Zhao, Y., Zhang, W., Yu, Z., Liang, X., Zhu, S.C.: Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214 (2021)

work page arXiv 2021
[41]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., et al.: Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

In: AAAI

Mondal, D., Modi, S., Panda, S., Singh, R., Rao, G.S.: Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In: AAAI. pp. 18798–18806. AAAI Press (2024)

2024
[43]

OpenAI GPT-5 System Card

OpenAI: Openai gpt-5 system card (2025),https://arxiv.org/abs/2601.03267 18 B. Yuan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Qiao, R., Tan, Q., Dong, G., Wu, M., Sun, C., Song, X., Wang, J., Gongque, Z., Lei, S., Zhang, Y., Wei, Z., Zhang, M., Qiao, R., Zong, X., Xu, Y., Yang, P., Bao, Z., Diao, M., Li, C., Zhang, H.: We-math: Does your large multimodal model achieve human-like mathematical reasoning? In: ACL. pp. 20023–20070. Association for Computational Linguistics (2025)

2025
[45]

In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J

Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object halluci- nation in image captioning. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) EMNLP. pp. 4035–4045. Association for Computational Linguistics (2018)

2018
[46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR (2024),https://doi.org/10.48550/arXiv.2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024
[47]

CoRR (2025), https://doi.org/10.48550/arXiv.2504

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., Xu, R., Zhao, T.: VLM-R1: A stable and generalizable r1-style large vision-language model. CoRR (2025), https://doi.org/10.48550/arXiv.2504. 07615

work page doi:10.48550/arxiv.2504 2025
[48] Shi, W., Hu, Z., Bin, Y., Liu, J., Yang, Y., Ng, S.K., Bing, L., Lee, R.K.W.: Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 4663–4680 (2024)
[49] Sun, H., Sun, Z., Peng, H., Ye, H.: Mitigating visual forgetting via take-along visual conditioning for multi-modal long cot reasoning. In: ACL. pp. 5158–5171. Association for Computational Linguistics (2025)
[50] Sun, M., et al.: Massive activations in large language models. arXiv (2024)
[51] Team, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR abs/2507.06261 (2025), https://doi.org/10.48550/arXiv.2507.06261
[52] Thawakar, O., Dissanayake, D., More, K.P., Thawkar, R., Heakl, A., Ahsan, N., Li, Y., Zumri, M., Lahoud, J., Anwer, R.M., Cholakkal, H., Laptev, I., Shah, M., Khan, F.S., Khan, S.H.: Llamav-o1: Rethinking step-by-step visual reasoning in llms. In: ACL. pp. 24290–24315. Association for Computational Linguistics (2025)
[53] Tian, X., Zou, S., Yang, Z., He, M., Waschkowski, F., Wesemann, L., Tu, P.H., Zhang, J.: More thought, less accuracy? on the dual nature of reasoning in vision-language models. CoRR (2025), https://doi.org/10.48550/arXiv.2509.25848
[54] Tu, C., Ye, P., Zhou, D., Bai, L., Yu, G., Chen, T., Ouyang, W.: Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms. CoRR (2025), https://doi.org/10.48550/arXiv.2503.08342
[55] Unsal, M., Akkus, A.: Easyarc: Evaluating vision language models on true visual reasoning. arXiv preprint arXiv:2506.11595 (2025)
[56] Wang, D., Qiu, R., Huang, Z.: When to commit? towards variable-size self-contained blocks for discrete diffusion language models. arXiv preprint arXiv:2604.23994 (2026)
[57] Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. CoRR (2025), https://doi.org/10.48550/arXiv.2504.08837
[58] Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with math-vision dataset. In: NeurIPS (2024)
[59] Wang, S., Yu, X., Luo, Y., Wang, Z., Zhang, P., Huang, Z.: Language-driven fine-grained retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2682–2692 (2026)
[60] Wang, Y., Wang, S., Cheng, Q., Fei, Z., Ding, L., Guo, Q., Tao, D., Qiu, X.: Visuothink: Empowering lvlm reasoning with multimodal tree search. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 21707–21719 (2025)
[61] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35, 24824–24837 (2022)
[62] Wu, J., Xiong, Y., Li, X., Xia, Y., Wang, R., Wang, Y., Yu, T., Kim, S., Rossi, R.A., Yao, L., Shang, J., McAuley, J.J.: Mitigating visual knowledge forgetting in MLLM instruction-tuning via modality-decoupled gradient descent. CoRR (2025), https://doi.org/10.48550/arXiv.2502.11740
[63] Xia, J., Zang, Y., Gao, P., Li, Y., Zhou, K.: Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning. CoRR (2025), https://doi.org/10.48550/arXiv.2505.14677
[64] Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)
[65] Xiao, Y., Sun, E., Liu, T., Wang, W.: Logicvista: Multimodal LLM logical reasoning benchmark in visual contexts. CoRR abs/2407.04973 (2024), https://doi.org/10.48550/arXiv.2407.04973
[66] Xu, G., Jin, P., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. CoRR (2024), https://doi.org/10.48550/arXiv.2411.10440
[67] Yang, S., Niu, Y., Liu, Y., Ye, Y., Lin, B., Yuan, L.: Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019 (2025)
[68] Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., Zhang, B., Chen, W.: R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. CoRR (2025), https://doi.org/10.48550/arXiv.2503.10615
[69] Yao, H., Yin, Q., Zhang, J., Yang, M., Wang, Y., Wu, W., Su, F., Shen, L., Qiu, M., Tao, D., et al.: R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint arXiv:2505.16673 (2025)
[70] Yuan, B., Song, S., Fernandez, J., Luo, Y., Baktashmotlagh, M., Wang, Z.: Wiswheat: A three-tiered vision-language dataset for wheat management. In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)
[71] Yue, X., Ni, Y., Zheng, T., Zhang, K., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: CVPR. pp. 9556–9567. IEEE (2024)
[72] Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: Raven: A dataset for relational and analogical visual reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5317–5327 (2019)
[73] Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K., Qiao, Y., Gao, P., Li, H.: MATHVERSE: does your multi-modal LLM truly see the diagrams in visual math problems? In: ECCV. pp. 169–186. Springer (2024)
[74] Zhao, Z., Chen, Z., Huang, Z., Sadiq, S., Chen, T.: Continual text-to-video retrieval with frame fusion and task-aware routing. In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (2025)
[75] Zheng, G., Yang, B., Tang, J., Zhou, H., Yang, S.: Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023)
[76] Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.: Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022)
[77] Zhou, Y., Li, W., Lu, Y., Li, J., Liu, F., Zhang, M., Wang, Y., He, D., LIU, H., Zhang, M.: Reflection on knowledge graph for large language models reasoning. In: Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) (2025)

This supplementary material provides additional descriptions of the main paper. It is o...
The image prominently features buses, including a pink double-decker bus in the foreground, which is a clear indicator of public transportation.

There are other buses visible in the background, suggesting that the image is related to a scene involving buses.

The presence of a car does not overshadow the buses in the image, as the buses are more central and occupy a larger portion of the frame.

The question asks for the main topic, and given the prominence and number of buses in the image, the focus is clearly on buses rather than cars. Considering the options provided: A. Driving cars - This is not the main topic as cars are not the primary focus. B. Driving buses - This could be a strong contender, but the image shows stationary buses rather t...


[41] [42]

In: AAAI

Mondal, D., Modi, S., Panda, S., Singh, R., Rao, G.S.: Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In: AAAI. pp. 18798–18806. AAAI Press (2024)

2024

[42] [43]

OpenAI GPT-5 System Card

OpenAI: Openai gpt-5 system card (2025),https://arxiv.org/abs/2601.03267 18 B. Yuan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [44]

Qiao, R., Tan, Q., Dong, G., Wu, M., Sun, C., Song, X., Wang, J., Gongque, Z., Lei, S., Zhang, Y., Wei, Z., Zhang, M., Qiao, R., Zong, X., Xu, Y., Yang, P., Bao, Z., Diao, M., Li, C., Zhang, H.: We-math: Does your large multimodal model achieve human-like mathematical reasoning? In: ACL. pp. 20023–20070. Association for Computational Linguistics (2025)

2025

[44] [45]

In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J

Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object halluci- nation in image captioning. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) EMNLP. pp. 4035–4045. Association for Computational Linguistics (2018)

2018

[45] [46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR (2024),https://doi.org/10.48550/arXiv.2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024

[46] [47]

CoRR (2025), https://doi.org/10.48550/arXiv.2504

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., Xu, R., Zhao, T.: VLM-R1: A stable and generalizable r1-style large vision-language model. CoRR (2025), https://doi.org/10.48550/arXiv.2504. 07615

work page doi:10.48550/arxiv.2504 2025

[47] [48]

In: Findings of the Association for Computational Linguistics: EMNLP 2024

Shi, W., Hu, Z., Bin, Y., Liu, J., Yang, Y., Ng, S.K., Bing, L., Lee, R.K.W.: Math- llava: Bootstrapping mathematical reasoning for multimodal large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 4663–4680 (2024)

2024

[48] [49]

Sun, H., Sun, Z., Peng, H., Ye, H.: Mitigating visual forgetting via take-along visual conditioning for multi-modal long cot reasoning. In: ACL. pp. 5158–5171. Association for Computational Linguistics (2025)

2025

[49] [50]

arXiv (2024)

Sun, M., et al.: Massive activations in large language models. arXiv (2024)

2024

[50] [51]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Team, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRRabs/2507.06261 (2025),https://doi.org/10.48550/arXiv.2507.06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025

[51] [52]

Thawakar, O., Dissanayake, D., More, K.P., Thawkar, R., Heakl, A., Ahsan, N., Li, Y., Zumri, M., Lahoud, J., Anwer, R.M., Cholakkal, H., Laptev, I., Shah, M., Khan, F.S., Khan, S.H.: Llamav-o1: Rethinking step-by-step visual reasoning in llms. In: ACL. pp. 24290–24315. Association for Computational Linguistics (2025)

2025

[52] [53]

CoRR (2025),https://doi.org/10.48550/arXiv.2509.25848

Tian,X.,Zou,S.,Yang,Z.,He,M.,Waschkowski,F.,Wesemann,L.,Tu,P.H.,Zhang, J.: More thought, less accuracy? on the dual nature of reasoning in vision-language models. CoRR (2025),https://doi.org/10.48550/arXiv.2509.25848

work page doi:10.48550/arxiv.2509.25848 2025

[53] [54]

CoRR (2025),https://doi.org/10.48550/arXiv.2503.08342

Tu, C., Ye, P., Zhou, D., Bai, L., Yu, G., Chen, T., Ouyang, W.: Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms. CoRR (2025),https://doi.org/10.48550/arXiv.2503.08342

work page doi:10.48550/arxiv.2503.08342 2025

[54] [55]

arXiv preprint arXiv:2506.11595 (2025)

Unsal, M., Akkus, A.: Easyarc: Evaluating vision language models on true visual reasoning. arXiv preprint arXiv:2506.11595 (2025)

work page arXiv 2025

[55] [56]

When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models

Wang, D., Qiu, R., Huang, Z.: When to commit? towards variable-size self-contained blocks for discrete diffusion language models. arXiv preprint arXiv:2604.23994 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [57]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. CoRR (2025), https://doi.org/10.48550/arXiv.2504.08837

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08837 2025

[57] [58]

In: NeurIPS (2024)

Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with math-vision dataset. In: NeurIPS (2024)

2024

[58] [59]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, S., Yu, X., Luo, Y., Wang, Z., Zhang, P., Huang, Z.: Language-driven fine-grained retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2682–2692 (2026)

2026

[59] [60]

In: Proceedings of LASER 19 the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Wang, Y., Wang, S., Cheng, Q., Fei, Z., Ding, L., Guo, Q., Tao, D., Qiu, X.: Visuo- think: Empowering lvlm reasoning with multimodal tree search. In: Proceedings of LASER 19 the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 21707–21719 (2025)

2025

[60] [61]

NeurIPS35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS35, 24824–24837 (2022)

2022

[61] [62]

CoRR (2025), https://doi.org/10.48550/arXiv.2502.11740

Wu, J., Xiong, Y., Li, X., Xia, Y., Wang, R., Wang, Y., Yu, T., Kim, S., Rossi, R.A., Yao, L., Shang, J., McAuley, J.J.: Mitigating visual knowledge forgetting in MLLM instruction-tuning via modality-decoupled gradient descent. CoRR (2025), https://doi.org/10.48550/arXiv.2502.11740

work page doi:10.48550/arxiv.2502.11740 2025

[62] [63]

Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

Xia, J., Zang, Y., Gao, P., Li, Y., Zhou, K.: Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning. CoRR (2025),https://doi.org/10. 48550/arXiv.2505.14677

work page arXiv 2025

[63] [64]

Efficient Streaming Language Models with Attention Sinks

Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [65]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Xiao, Y., Sun, E., Liu, T., Wang, W.: Logicvista: Multimodal LLM logical reasoning benchmark in visual contexts. CoRRabs/2407.04973(2024), https://doi.org/ 10.48550/arXiv.2407.04973

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.04973 2024

[65] [66]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Xu, G., Jin, P., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. CoRR (2024), https://doi.org/10.48550/arXiv. 2411.10440

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[66] [67]

arXiv preprint arXiv:2507.03019 (2025)

Yang, S., Niu, Y., Liu, Y., Ye, Y., Lin, B., Yuan, L.: Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019 (2025)

work page arXiv 2025

[67] [68]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., Zhang, B., Chen, W.: R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. CoRR (2025),https://doi.org/10. 48550/arXiv.2503.10615

work page internal anchor Pith review Pith/arXiv arXiv 2025

[68] [69]

arXiv preprint arXiv:2505.16673 (2025)

Yao, H., Yin, Q., Zhang, J., Yang, M., Wang, Y., Wu, W., Su, F., Shen, L., Qiu, M., Tao, D., et al.: R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint arXiv:2505.16673 (2025)

work page arXiv 2025

[69] [70]

In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)

Yuan,B.,Song,S.,Fernandez,J.,Luo,Y.,Baktashmotlagh,M.,Wang,Z.:Wiswheat: A three-tiered vision-language dataset for wheat management. In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)

2025

[70] [71]

In: CVPR

Yue, X., Ni, Y., Zheng, T., Zhang, K., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: CVPR. pp. 9556–9567. IEEE (2024)

2024

[71] [72]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: Raven: A dataset for relational and analogical visual reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5317–5327 (2019)

[72] [73]

Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K., Qiao, Y., Gao, P., Li, H.: MATHVERSE: does your multi-modal LLM truly see the diagrams in visual math problems? In: ECCV. pp. 169–186. Springer (2024)

[73] [74]

Zhao, Z., Chen, Z., Huang, Z., Sadiq, S., Chen, T.: Continual text-to-video retrieval with frame fusion and task-aware routing. In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (2025)

[74] [75]

Zheng, G., Yang, B., Tang, J., Zhou, H., Yang, S.: Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023)

[75] [76]

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.: Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022)

[76] [77]

Zhou, Y., Li, W., Lu, Y., Li, J., Liu, F., Zhang, M., Wang, Y., He, D., Liu, H., Zhang, M.: Reflection on knowledge graph for large language models reasoning. In: Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) (2025)

This supplementary material provides additional descriptions of the main paper. It is o...

The image prominently features buses, including a pink double-decker bus in the foreground, which is a clear indicator of public transportation.

There are other buses visible in the background, suggesting that the image is related to a scene involving buses.

The presence of a car does not overshadow the buses in the image, as the buses are more central and occupy a larger portion of the frame.

The question asks for the main topic, and given the prominence and number of buses in the image, the focus is clearly on buses rather than cars. Considering the options provided: A. Driving cars - This is not the main topic as cars are not the primary focus. B. Driving buses - This could be a strong contender, but the image shows stationary buses rather t...