
arXiv: 2507.05920 · v2 · submitted 2025-07-08 · 💻 cs.CV

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Pith reviewed 2026-05-19 06:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords large multimodal models · reinforcement learning · visual grounding · multi-turn conversation · high-resolution images · policy optimization · binary reward · cold start

The pith

Large multimodal models can develop robust visual grounding through reinforcement learning that uses only binary rewards based on final answer correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Multi-turn Grounding-based Policy Optimization, an RL framework that trains LMMs to predict coordinates, crop sub-images, and iteratively focus on relevant regions during multi-turn conversations. It shows these models can acquire grounding skills from standard visual question answering data alone, without any extra annotations for coordinates or regions. A reader would care because high-resolution images flood models with irrelevant tokens, and this method sidesteps the expense of dense supervision while improving performance on both in-distribution and out-of-distribution benchmarks. The design also includes a multi-turn template and loss restriction to help the model start grounding on its own instead of staying in a cold-start state.

Core claim

MGPO is an end-to-end reinforcement learning framework in which LMMs autonomously predict grounding coordinates to crop and process sub-images across multiple dialogue turns. Stable visual grounding abilities emerge solely from a binary reward tied to the correctness of the final answer. A multi-turn conversational template, together with restricting the policy loss to multi-round outputs, overcomes the cold-start problem in which models otherwise fail to trigger grounding during rollout.
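The reward signal in the core claim can be sketched as follows. This is a minimal illustration, assuming answers are compared by normalized exact match and that rewards are standardized within a group of rollouts for the same question, as in GRPO-style methods. The function names, the normalization rule, and the `eps` constant are assumptions made for this sketch, not the paper's implementation.

```python
import statistics
import string


def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation (assumed matching rule)."""
    return answer.strip().strip(string.punctuation).strip().lower()


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 when the final answer matches the reference, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one question whose reference answer is "B".
rollout_answers = ["B", "b.", "C", " B "]
rewards = [binary_reward(a, "B") for a in rollout_answers]
advantages = group_relative_advantages(rewards)
```

Note that no term in this reward looks at the predicted coordinates: a rollout is scored only on its final answer, so any pressure toward better grounding has to come indirectly through answer correctness.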

What carries the argument

Multi-turn Grounding-based Policy Optimization (MGPO), an RL method that lets the model generate grounding coordinates for iterative sub-image cropping inside a multi-turn conversation while limiting policy loss computation to outputs across dialogue rounds.
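The cropping step can be illustrated with a small dependency-free sketch. It assumes the model emits a box `(x1, y1, x2, y2)` in the pixel coordinates of a downscaled view, and that the box is rescaled to the full-resolution image before cropping. The rescaling step, the function names, and the list-of-rows image representation are all assumptions made for illustration; the paper's own pipeline may differ.

```python
def scale_box(box, view_size, full_size):
    """Map (x1, y1, x2, y2) from the view the model saw to full-resolution pixels."""
    vw, vh = view_size
    fw, fh = full_size
    x1, y1, x2, y2 = box
    sx, sy = fw / vw, fh / vh
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))


def clamp_box(box, full_size):
    """Order the corners and clip them to the image; return None for an empty box."""
    fw, fh = full_size
    x1, y1, x2, y2 = box
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    left, right = max(0, left), min(fw, right)
    top, bottom = max(0, top), min(fh, bottom)
    if right <= left or bottom <= top:
        return None
    return (left, top, right, bottom)


def crop(image, box):
    """Crop a row-major image (a list of pixel rows) to a half-open box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x6 "image" whose pixels record their own (x, y) position.
full = [[(x, y) for x in range(8)] for y in range(6)]
box = clamp_box(scale_box((1, 1, 3, 2), view_size=(4, 3), full_size=(8, 6)), (8, 6))
sub_image = crop(full, box)
```

Returning `None` for a degenerate box lets the caller decide how to treat an unusable prediction, for example by ending the rollout without a second turn.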

If this is right

  • When trained on ordinary visual-question-answering data without grounding labels, MGPO produces stronger grounding than standard GRPO.
  • The method yields a 5.4 percent gain on the in-distribution MME-Realworld benchmark and a 5.2 percent gain on the out-of-distribution V* Bench.
  • After post-training Qwen2.5-VL-7B on only 21K samples, MGPO exceeds the performance of OpenAI o1 and GPT-4o on the OOD V* Bench.
  • The multi-turn template and selective policy loss together promote stable optimization and autonomous triggering of visual grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Binary final-answer rewards may scale as a lightweight way to instill spatial reasoning in vision-language models without needing dense coordinate labels.
  • The same pattern of restricting loss to selected turns could stabilize training in other multi-turn conversational reinforcement learning settings.
  • Iterative cropping learned this way might extend naturally to tasks that require repeated visual refinement, such as detailed diagram or chart reasoning.

Load-bearing premise

A multi-turn conversational template combined with restricting policy loss to outputs across multiple dialogue rounds is enough to solve the cold-start problem and produce stable autonomous visual grounding without any explicit supervision.
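One reading of the loss restriction is a token-level mask that keeps only model-generated tokens from every round and drops everything else, including the injected sub-image turn. The sketch below assumes a conversation stored as `(role, tokens)` segments; the role names, placeholder tokens, and helper names are hypothetical and are not taken from the paper's code.

```python
def build_loss_mask(segments):
    """Return (tokens, mask) where mask is 1 for assistant tokens and 0 otherwise."""
    tokens, mask = [], []
    for role, segment_tokens in segments:
        tokens.extend(segment_tokens)
        mask.extend([1 if role == "assistant" else 0] * len(segment_tokens))
    return tokens, mask


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# Two-round conversation: the model grounds first, then answers on the crop.
conversation = [
    ("user", ["<image>", "question"]),
    ("assistant", ["<box>", "coords"]),
    ("user", ["<sub_image>"]),
    ("assistant", ["final", "answer"]),
]
tokens, mask = build_loss_mask(conversation)
loss = masked_mean([0.5, 0.5, 1.0, 3.0, 0.5, 2.0, 2.0], mask)
```

Under this reading, both the grounding turn and the answering turn contribute gradient, while template text and image placeholders do not.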

What would settle it

Train the same base model with MGPO but remove the multi-turn template and loss restriction; if grounding coordinates stop appearing in rollouts and benchmark gains on V* Bench disappear, the central claim is falsified.
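Checking whether grounding coordinates "stop appearing in rollouts" requires a measurable quantity, such as the fraction of rollouts that contain a usable box. The sketch below assumes boxes are written in the text as `[x1, y1, x2, y2]` with integer pixel values; that output format, the validity rule, and the function names are assumptions for illustration only.

```python
import re

BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_box(text):
    """Return the first [x1, y1, x2, y2] box found in text, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(v) for v in match.groups())


def is_valid_box(box, width, height):
    """A box is valid when it has positive area and lies inside the image."""
    if box is None:
        return False
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def valid_grounding_ratio(rollouts, width, height):
    """Fraction of rollout texts that contain a valid box."""
    if not rollouts:
        return 0.0
    valid = sum(is_valid_box(parse_box(t), width, height) for t in rollouts)
    return valid / len(rollouts)


# Hypothetical rollout texts, not outputs from any real model.
rollouts = [
    "The sign is near [120, 40, 360, 200].",
    "The answer is B.",
    "Look at [500, 10, 400, 90].",
    "Region: [0, 0, 64, 64]",
]
ratio = valid_grounding_ratio(rollouts, width=1024, height=768)
```

Logging this ratio per training step for both the full method and the ablated run would make the proposed falsification test concrete.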

Figures

Figures reproduced from arXiv: 2507.05920 by Bo Li, Rui Feng, Weiwei Tian, Xinyu Huang, Yuhao Dong, Ziwei Liu.

Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real …
Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically …
Figure 3: Fixed multi-turn grounding template, which eliminates the cold-start SFT process.
Figure 4: Distribution of image resolutions (width …
Figure 5: Performance comparison of V* Bench between MGPO and GRPO.
[Plot: x-axis Step (0 to 80); y-axis Ratio of Valid Grounding Coordinates (0.4 to 0.8); series MGPO and GRPO.]
Figure 7: Visualization of point predic…
Figure 8: An illustration of cropping a sub-image based on grounding coordinates.
Figure 9: A full conversation example of an MGPO post-trained model on high-resolution image tasks.
Original abstract

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4% improvement on in-distribution MME-Realworld and 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end RL framework for LMMs that enables iterative visual grounding via model-predicted coordinate-based cropping of sub-images within a multi-turn conversation. Training uses only a binary reward from final-answer correctness on standard VQA data (no grounding annotations), with a multi-turn template and policy-loss restriction introduced to solve observed cold-start failures in autonomous grounding. On Qwen2.5-VL-7B trained with 21K samples, MGPO yields 5.4% gains on MME-Realworld and 5.2% on OOD V* Bench, outperforming GRPO and matching or exceeding GPT-4o/o1 on the latter.

Significance. If the central emergence claim holds after controls, the result would meaningfully reduce reliance on costly grounding supervision for high-resolution visual reasoning. The public code release supports reproducibility and is a clear strength. The OOD gains are noteworthy but require verification that they arise from the binary-reward RL process rather than the introduced scaffolding.

major comments (2)
  1. [§3.2] Method, multi-turn template and loss restriction: The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.
  2. [§4.1–4.2] Experiments and ablations: Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.
minor comments (2)
  1. [Abstract] The abstract and §4 claim 'standard visual-question-short answering data' but do not list the exact datasets or preprocessing steps used for the 21K samples.
  2. [§3.1] Notation for the grounding coordinate prediction and cropping operation could be formalized with an equation in §3.1 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments on the necessity of ablations for the multi-turn components and on statistical robustness are well-taken and directly relevant to the emergence claim. We address each point below and will revise the manuscript accordingly.

  1. Referee: [§3.2] Method, multi-turn template and loss restriction: The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.

    Authors: We acknowledge that a full ablation removing both the multi-turn template and the policy-loss restriction (while keeping the binary reward and GRPO optimization) would provide stronger evidence for the emergence claim. In preliminary rollouts we observed that the base Qwen2.5-VL-7B almost never emits grounding coordinates without the template, causing the training to collapse to single-turn behavior. The loss restriction was added to stabilize gradients on the multi-turn trajectories. To address the referee's concern directly, we will run and report the requested ablation in the revised manuscript, comparing performance with and without both components under identical reward and optimizer settings. We will present these results transparently even if they show reduced gains. revision: yes

  2. Referee: [§4.1–4.2] Experiments and ablations: Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.

    Authors: We agree that variance estimates, statistical tests, and an explicit control disabling the scaffolding are needed to evaluate robustness. In the revision we will re-train the main MGPO and GRPO baselines with at least three random seeds, report mean and standard deviation on MME-Realworld and V* Bench, and include p-values from paired t-tests. The scaffolding-ablated control will be folded into the new ablation study described in the response to the first comment, allowing readers to judge whether the gains depend on the introduced structure. revision: yes
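The seed-level reporting promised in the rebuttal can be sketched with the standard library alone. The numbers below are placeholders chosen for arithmetic convenience and are NOT results from the paper; the function names are hypothetical. The sketch computes the mean and sample standard deviation across seeds and the paired t statistic, leaving the p-value to a statistics package.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i]; degrees of freedom = n - 1."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder accuracies for three seeds; NOT results from the paper.
method_a = [0.70, 0.72, 0.74]
method_b = [0.66, 0.69, 0.69]
mean_a, std_a = summarize(method_a)
t_stat = paired_t_statistic(method_a, method_b)
```

With only three seeds the test has two degrees of freedom, so even large t values correspond to modest p-values. Converting the statistic to a p-value requires the Student's t distribution, which `scipy.stats.ttest_rel` provides if SciPy is available.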

Circularity Check

0 steps flagged

Empirical RL method with held-out benchmarks exhibits no derivation circularity


The paper describes an end-to-end RL framework (MGPO) that trains LMMs on standard VQA data using only binary final-answer rewards, with a multi-turn conversational template introduced to mitigate observed cold-start issues. Performance gains are reported on held-out test sets (MME-Realworld, V* Bench). No mathematical derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The template and loss restriction are explicit design choices, not hidden self-definitions or renamings of prior results. The central claim remains an empirical observation about emergence under the stated training setup, evaluated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the domain assumption that grounding abilities will emerge from binary final-answer rewards when the cold-start issue is mitigated by the multi-turn design.

free parameters (1)
  • multi-turn template hyperparameters
    Number of turns, loss masking rules, and cropping thresholds are design choices that affect whether grounding emerges.
axioms (1)
  • domain assumption: Binary reward from final-answer correctness is sufficient to elicit intermediate grounding behavior
    Invoked when claiming that robust grounding emerges without grounding annotations.




Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  2. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  3. Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    cs.CV · 2025-09 · unverdicted · novelty 5.0

    Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
