
arxiv: 2506.11991 · v3 · submitted 2025-06-13 · 💻 cs.CV · cs.AI · cs.CL

VGR: Visual Grounded Reasoning

Pith reviewed 2026-05-19 09:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords visual grounded reasoning · multimodal chain-of-thought · fine-grained visual perception · bounding box replay · image token efficiency · SFT dataset · multimodal large language models

The pith

VGR improves multimodal reasoning by first detecting and replaying relevant image regions rather than relying on language alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VGR, a multimodal large language model designed for visual grounded reasoning. Traditional approaches to chain-of-thought reasoning in these models stay in language space, leading to bias and poor performance on tasks needing deep image details. VGR addresses this by first identifying relevant regions via bounding boxes and then replaying those regions during reasoning. This is supported by a new SFT dataset combining vision grounding with language deduction. The result is better understanding of complex visual tasks while using fewer image tokens.

Core claim

VGR is a novel reasoning MLLM that detects relevant regions to solve problems and provides precise answers based on replayed image regions. It uses a large-scale SFT dataset containing reasoning data with mixed vision grounding and language deduction. The inference pipeline allows the model to choose bounding boxes for visual reference, with a replay stage integrating the corresponding regions into the reasoning process to enhance multimodal comprehension.

What carries the argument

The inference pipeline that selects bounding boxes for visual reference and replays the corresponding image regions to enhance multimodal comprehension.
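
The select-then-replay idea can be illustrated with a toy sketch. This page does not describe how VGR implements replay inside the forward pass, so everything below is an assumption made for illustration: the normalized (x0, y0, x1, y1) box format, the nested-list image, and the hypothetical helper names crop_region and replay_regions. It only shows the general shape of the technique, in which a reasoning step that emits a box is followed by the cropped region in the model's context.

```python
import math
from typing import List, Optional, Tuple

# Assumed box format: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]
# Toy grayscale image: a list of rows of pixel values.
Image = List[List[int]]


def crop_region(image: Image, box: Box) -> Image:
    """Return the sub-image covered by a normalized bounding box."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Convert to pixel indices and clamp to the image bounds.
    left = max(0, math.floor(x0 * width))
    top = max(0, math.floor(y0 * height))
    right = min(width, math.ceil(x1 * width))
    bottom = min(height, math.ceil(y1 * height))
    if right <= left or bottom <= top:
        raise ValueError(f"empty box: {box}")
    return [row[left:right] for row in image[top:bottom]]


def replay_regions(
    steps: List[Tuple[str, Optional[Box]]], image: Image
) -> List[Tuple[str, object]]:
    """Interleave text steps with crops for the steps that carry a box."""
    context: List[Tuple[str, object]] = []
    for text, box in steps:
        context.append(("text", text))
        if box is not None:
            # The "replay": the selected region re-enters the context.
            context.append(("region", crop_region(image, box)))
    return context
```

In a real model the crop would be re-encoded into image tokens rather than kept as raw pixels. Because only the selected regions are replayed at full detail, an approach of this kind could in principle get by with fewer image tokens overall.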

Load-bearing premise

That automatically detected bounding boxes and the subsequent replay stage will reliably supply the precise visual details needed without introducing selection errors or losing critical context.

What would settle it

A test case where critical visual information lies outside the automatically detected bounding boxes on an image, causing the model to produce incorrect answers despite correct language reasoning.
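
One way to make such a test concrete is to measure how much of the region holding the critical evidence is covered by the boxes the model selected. The sketch below is an assumption about how this could be operationalized, not something taken from the paper: boxes are pixel-coordinate (x0, y0, x1, y1) tuples, best_coverage is a hypothetical helper, and coverage is taken against the single best predicted box rather than the union of all boxes.

```python
from typing import Iterable, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area of overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def best_coverage(target: Box, predicted: Iterable[Box]) -> float:
    """Largest fraction of the target box covered by any one predicted box."""
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    if target_area <= 0:
        raise ValueError("target box must have positive area")
    return max(
        (intersection_area(target, p) / target_area for p in predicted),
        default=0.0,
    )
```

Examples where best_coverage is near zero and the answer is wrong, even though the written reasoning is sound, would be the failure cases described above.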

Figures

Figures reproduced from arXiv: 2506.11991 by Bohong Wu, Chao Feng, Haiyong Jiang, Haochen Wang, Jiacong Wang, Jiao Ran, Jiawen Li, Jun Xiao, Xiao Liang, Ya Wang, Zijian Kang.

Figure 1: Overview framework of our method. In the left of the image, we crop the original image …
Figure 2: Overview framework of our data pipeline. The blue arrow line indicates the cold-start data …
Figure 3: Example of training data in VGR-SFT.
Figure 4: The right part of the figure contains an example generated by our annotation model. After …
Figure 5: Example of training data in VGR-SFT in different formulations. Reject Sampling. During the reject sampling, we implement two verification steps with online commercial model, which is Doubao1.5-VL [13] in our implementation. The prompts for remote requests are detailed in …
Figure 6: Example of data from original data, cold-start model, annotator and training set.
Figure 7: Example of VGR response in MMStar and ChartQA benchmarks.

Original abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VGR, a multimodal LLM for visual grounded reasoning that first detects relevant image regions as bounding boxes and then replays those regions during inference to support fine-grained visual perception. It constructs a mixed VGR-SFT dataset of vision-grounding and language-deduction examples, trains on the LLaVA-NeXT-7B baseline, and reports that the resulting model outperforms the baseline on MMStar (+4.1), AI2D (+7.1), and ChartQA (+12.9) while using only 30% of the image tokens.

Significance. If the gains are shown to arise from the bounding-box selection and replay mechanism rather than dataset scale alone, the work would offer a practical route to token-efficient visual reasoning that mitigates language bias in multimodal CoT. The explicit separation of region detection from replayed visual context is a concrete architectural idea worth testing on detail-heavy benchmarks.

major comments (2)
  1. [Experiments] Experiments section: the headline improvements (+4.1 MMStar, +7.1 AI2D, +12.9 ChartQA) and 30% token reduction are reported without a control that fine-tunes the identical LLaVA-NeXT-7B baseline on the same VGR-SFT corpus but omits the bounding-box selection and region-replay pipeline. This ablation is required to isolate the contribution of the proposed mechanism from the effect of additional SFT data.
  2. [Abstract and Experiments] Abstract and §4 (experimental protocol): no description is given of how bounding boxes are obtained (model, threshold, post-processing), how the replay stage is implemented in the forward pass, or what statistical tests or variance estimates accompany the benchmark deltas.
minor comments (2)
  1. [Abstract] Abstract: 'multimodel comprehension' is a typo for 'multimodal comprehension'.
  2. [Abstract] Abstract: 'VGR -SFT' contains an extraneous space; standardize to 'VGR-SFT'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that an explicit ablation isolating the bounding-box and replay mechanism from the effects of additional SFT data, together with fuller implementation details, would strengthen the paper. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline improvements (+4.1 MMStar, +7.1 AI2D, +12.9 ChartQA) and 30% token reduction are reported without a control that fine-tunes the identical LLaVA-NeXT-7B baseline on the same VGR-SFT corpus but omits the bounding-box selection and region-replay pipeline. This ablation is required to isolate the contribution of the proposed mechanism from the effect of additional SFT data.

    Authors: We acknowledge that the reported gains could partly stem from the additional VGR-SFT data rather than the bounding-box selection and replay pipeline alone. To address this, we will add the requested control experiment in the revised manuscript: fine-tuning the identical LLaVA-NeXT-7B baseline on the same VGR-SFT corpus while disabling the region-detection and replay components. The results of this ablation will be included in the Experiments section to better isolate the contribution of the proposed mechanism. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and §4 (experimental protocol): no description is given of how bounding boxes are obtained (model, threshold, post-processing), how the replay stage is implemented in the forward pass, or what statistical tests or variance estimates accompany the benchmark deltas.

    Authors: We agree that these implementation details are currently insufficient. In the revised manuscript we will expand §4 (and update the abstract where appropriate) to specify: (i) the model and procedure used to obtain bounding boxes, including any thresholds and post-processing; (ii) the exact implementation of the replay stage within the forward pass; and (iii) variance estimates or the number of evaluation runs for the reported benchmark deltas. If multiple runs were not performed, we will state this explicitly. revision: yes
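
The variance estimates requested in comment 2 could take several forms. One common option, sketched here as an assumption and not as the authors' protocol, is a paired bootstrap over per-example correctness for the baseline and the fine-tuned model on the same benchmark items. The function name paired_bootstrap_delta, the 0/1 correctness encoding, and the default resample count are all hypothetical choices made for illustration.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_delta(
    baseline: Sequence[int],
    model: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy delta (model - baseline) with a percentile bootstrap interval.

    Inputs are per-example 0/1 correctness on the same items, in the same order.
    """
    if len(baseline) != len(model) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(baseline)
    # Per-example differences keep the pairing between the two systems.
    diffs = [m - b for b, m in zip(baseline, model)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    resampled = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute the mean delta.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high
```

An interval that excludes zero would suggest a benchmark delta is unlikely to be a resampling artifact. It would not by itself separate the effect of the replay mechanism from that of the additional SFT data, which is what the ablation in comment 1 is for.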

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent benchmark results

Full rationale

The paper describes a new inference pipeline (region detection then replay) and a custom VGR-SFT dataset mixing vision-grounding and language data, then reports empirical gains on LLaVA-NeXT-7B. No equations, fitted parameters, or self-citations appear in the provided text that would make the +4.1 MMStar / +7.1 AI2D / +12.9 ChartQA improvements reduce to the training objective or dataset construction by definition. The central claims rest on experimental outcomes rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised fine-tuning assumptions plus the untested premise that region replay will improve visual comprehension without new failure modes.

free parameters (1)
  • Region selection and replay hyperparameters
    The model must learn or be tuned to output useful bounding boxes and to integrate the replayed patches; these are learned parameters whose values are not reported.
axioms (1)
  • domain assumption: Visual region replay reduces language bias in multimodal reasoning
    Invoked in the motivation and method description as the core justification for the new pipeline.

pith-pipeline@v0.9.0 · 5810 in / 1180 out tokens · 39743 ms · 2026-05-19T09:08:06.793463+00:00 · methodology


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.

  2. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  3. Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.

  4. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  5. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  6. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  7. Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    cs.CV 2025-05 unverdicted novelty 6.0

    Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

  8. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  9. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 8 Pith papers · 19 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  2. [2]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024

  5. [5]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023

  6. [6]

    Benchmarking and improving detail image caption

    Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, and Haoyuan Guo. Benchmarking and improving detail image caption. arXiv preprint arXiv:2405.19092, 2024

  7. [7]

    Scalable vision language model training via high quality data curation

    Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025

  8. [8]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  9. [9]

    Grok-1.5 vision preview, 2024

    Grok. Grok-1.5 vision preview, 2024. URL https://x.ai/blog/grok-1.5v

  10. [10]

    Context-guided spatio-temporal video grounding

    Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18330–18339, 2024

  11. [11]

    Knowing your target: Target-aware transformer makes better spatio-temporal video grounding

    Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan, and Libo Zhang. Knowing your target: Target-aware transformer makes better spatio-temporal video grounding. arXiv preprint arXiv:2502.11168, 2025

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  13. [13]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  14. [14]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

  15. [15]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  16. [16]

    Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025

  17. [17]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2018

  18. [18]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016

  19. [19]

    The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

    Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. arXiv preprint arXiv:2504.10462, 2025

  20. [20]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  21. [21]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

  22. [22]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  23. [23]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  24. [24]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  25. [25]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  26. [26]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  27. [27]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

  28. [28]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  29. [29]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022

  30. [30]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  31. [31]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

  32. [32]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019

  33. [33]

    Openai-o1, 2024

    OpenAI. Openai-o1, 2024

  34. [34]

    Cogcom: A visual language model with chain-of-manipulations reasoning

    Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: Train large vision-language models diving into details through chain of manipulations. arXiv preprint arXiv:2402.04236, 2024

  35. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  36. [36]

    Pixellm: Pixel reasoning with large multimodal model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

  37. [37]

    Videoworld: Exploring knowledge learning from unlabeled videos

    Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin. Videoworld: Exploring knowledge learning from unlabeled videos. arXiv preprint arXiv:2501.09781, 2025

  38. [38]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  39. [39]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  40. [40]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025

  41. [41]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  42. [42]

    Reinforcement learning: An introduction, volume 1

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  43. [43]

    Qwen2.5-vl, January 2025

    Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/

  44. [44]

    Reconstructive visual instruction tuning

    Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024

  45. [45]

    Ross3d: Reconstructive visual instruction tuning with 3d-awareness

    Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. arXiv preprint arXiv:2504.01901, 2025

  46. [46]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  47. [47]

    World to code: Multi-modal data generation via self-instructed compositional captioning and filtering

    Jiacong Wang, Bohong Wu, Haiyong Jiang, Zhou Xun, Xin Xiao, Haoyuan Guo, and Jun Xiao. World to code: Multi-modal data generation via self-instructed compositional captioning and filtering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4608–4623, 2024

  48. [48]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  49. [49]

    Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024

  50. [50]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  51. [51]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

  52. [52]

    Seeing the image: Prioritizing visual correlation by contrastive alignment

    Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Haoyuan Guo, et al. Seeing the image: Prioritizing visual correlation by contrastive alignment. Advances in Neural Information Processing Systems, 37:30925–30950, 2024

  53. [53]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

  54. [54]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  55. [55]

    LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024. URL https://arxiv.org/abs/2407.12772

  56. [56]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025