High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Pith reviewed 2026-05-19 06:05 UTC · model grok-4.3
The pith
Large multimodal models can develop robust visual grounding through reinforcement learning that uses only binary rewards based on final answer correctness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MGPO is an end-to-end reinforcement learning framework in which LMMs autonomously predict grounding coordinates, crop the corresponding sub-images, and process them across multiple dialogue turns. Stable visual grounding emerges solely from a binary reward tied to the correctness of the final answer. A multi-turn conversational template, together with restricting the policy loss to multi-round outputs, overcomes the cold-start problem in which models otherwise fail to trigger grounding during rollout.
What carries the argument
Multi-turn Grounding-based Policy Optimization (MGPO), an RL method that lets the model generate grounding coordinates for iterative sub-image cropping inside a multi-turn conversation, while limiting policy loss computation to model outputs generated across dialogue rounds.
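The grounding-then-cropping loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `[x1, y1, x2, y2]` output format, the toy list-of-lists image, the choice to crop from the original image each turn, and the names `parse_box`, `crop`, `rollout`, and `toy_model` are all assumptions made for the example.

```python
import re
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
Image = List[List[int]]          # toy image: rows of pixel values

# Hypothetical output format: the model writes a box as "[x1, y1, x2, y2]".
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_box(text: str) -> Optional[Box]:
    """Return the first box found in the model output, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(g) for g in match.groups())
    return (x1, y1, x2, y2)


def crop(image: Image, box: Box) -> Image:
    """Crop a sub-image, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, min(x1, x2)), min(width, max(x1, x2))
    y1, y2 = max(0, min(y1, y2)), min(height, max(y1, y2))
    return [row[x1:x2] for row in image[y1:y2]]


def rollout(model: Callable[[Image, str], str], image: Image,
            question: str, max_turns: int = 2) -> List[str]:
    """Run up to max_turns; each turn either emits a box or a final answer."""
    outputs: List[str] = []
    view = image
    for _ in range(max_turns):
        reply = model(view, question)
        outputs.append(reply)
        box = parse_box(reply)
        if box is None:
            break  # no grounding requested: treat the reply as the final answer
        view = crop(image, box)  # assumption: always crop from the original image
    return outputs


def toy_model(view: Image, question: str) -> str:
    """Stand-in policy: ground once on the full image, then answer from the crop."""
    if len(view) == 4:
        return "[1, 1, 3, 3]"
    return f"answer: {view[0][0]}"


full_image = [[row * 4 + col for col in range(4)] for row in range(4)]
turns = rollout(toy_model, full_image, "What is at the top-left of the region?")
```

Here `turns` holds one grounding turn followed by one answer turn; a real policy would be an LMM and the image a high-resolution picture rather than a 4x4 grid.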
If this is right
- When trained on ordinary visual-question-answering data without grounding labels, MGPO produces stronger grounding than standard GRPO.
- Relative to GRPO, the method yields a 5.4% gain on the in-distribution MME-Realworld benchmark and a 5.2% gain on the out-of-distribution (OOD) V* Bench.
- After post-training Qwen2.5-VL-7B on only 21K samples, MGPO exceeds the performance of OpenAI o1 and GPT-4o on the OOD V* Bench.
- The multi-turn template and selective policy loss together promote stable optimization and autonomous triggering of visual grounding.
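To make the training signal behind these claims concrete, the sketch below pairs a binary final-answer reward with the group-relative advantage used in GRPO-style optimization, where each rollout's reward is standardized against the other rollouts for the same prompt. The exact-match normalization, the use of the population standard deviation, the `eps` constant, and the function names are assumptions for illustration; the paper's own reward check may differ.

```python
from statistics import mean, pstdev
from typing import List


def binary_reward(predicted: str, reference: str) -> float:
    """1.0 if the final answer matches the reference after light normalization."""
    # Assumption: case-insensitive exact match on whitespace-stripped strings.
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question whose reference answer is "red".
answers = ["Red", "blue", "green", " red "]
rewards = [binary_reward(a, "red") for a in answers]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a policy that never triggers grounding gets no signal pushing it to start.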
Where Pith is reading between the lines
- Binary final-answer rewards may scale as a lightweight way to instill spatial reasoning in vision-language models without needing dense coordinate labels.
- The same pattern of restricting loss to selected turns could stabilize training in other multi-turn conversational reinforcement learning settings.
- Iterative cropping learned this way might extend naturally to tasks that require repeated visual refinement, such as detailed diagram or chart reasoning.
Load-bearing premise
A multi-turn conversational template combined with restricting policy loss to outputs across multiple dialogue rounds is enough to solve the cold-start problem and produce stable autonomous visual grounding without any explicit supervision.
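One way to read "restricting policy loss to outputs across multiple dialogue rounds" is as a token-level mask: tokens the model generated in any round count toward the loss, while template text and injected crop tokens do not. The sketch below illustrates that reading only; the segment representation, role names, and helper functions are hypothetical and not taken from the paper or its code.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (role, number of tokens); role is "user" or "assistant"


def build_loss_mask(segments: List[Segment]) -> List[int]:
    """Mark assistant-generated tokens with 1 and everything else with 0."""
    mask: List[int] = []
    for role, num_tokens in segments:
        mask.extend([1 if role == "assistant" else 0] * num_tokens)
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over masked-in tokens only."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / count


# Two-round conversation: question, grounding output, injected crop, final answer.
segments = [("user", 3), ("assistant", 2), ("user", 2), ("assistant", 1)]
mask = build_loss_mask(segments)
per_token_loss = [1.0, 1.0, 1.0, 2.0, 4.0, 9.0, 9.0, 3.0]
loss = masked_mean(per_token_loss, mask)
```

In this toy case the large losses on the injected crop tokens (9.0) are ignored, so only the grounding output and the final answer shape the update.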
What would settle it
Train the same base model with MGPO but remove the multi-turn template and loss restriction; if grounding coordinates stop appearing in rollouts and benchmark gains on V* Bench disappear, the central claim is falsified.
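The first half of this test, whether grounding coordinates still appear in rollouts, can be measured with a simple trigger-rate statistic. The sketch below assumes rollouts are stored as lists of per-turn model outputs and that boxes are written as `[x1, y1, x2, y2]`; both the storage format and the box syntax are illustrative assumptions, and the example rollouts are made-up placeholders rather than real model outputs.

```python
import re
from typing import List

# Hypothetical box syntax: four non-negative integers in square brackets.
BOX_PATTERN = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")


def grounding_trigger_rate(rollouts: List[List[str]]) -> float:
    """Fraction of rollouts in which at least one turn contains a box."""
    if not rollouts:
        return 0.0
    triggered = sum(
        1 for turns in rollouts if any(BOX_PATTERN.search(t) for t in turns)
    )
    return triggered / len(rollouts)


# Placeholder rollouts for illustration only.
example_rollouts = [
    ["[10, 20, 200, 240]", "cat"],
    ["dog"],
    ["[0, 0, 50, 50]", "bird"],
    ["fish"],
]
rate = grounding_trigger_rate(example_rollouts)
```

Tracking this rate over training steps for the full method and for the ablated variant would show directly whether grounding collapses once the template and loss restriction are removed.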
Original abstract
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4% improvement on in-distribution MME-Realworld and 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end RL framework for LMMs that enables iterative visual grounding via model-predicted coordinate-based cropping of sub-images within a multi-turn conversation. Training uses only a binary reward from final-answer correctness on standard VQA data (no grounding annotations), with a multi-turn template and policy-loss restriction introduced to solve observed cold-start failures in autonomous grounding. On Qwen2.5-VL-7B trained with 21K samples, MGPO yields 5.4% gains on MME-Realworld and 5.2% on OOD V* Bench, outperforming GRPO and matching or exceeding GPT-4o/o1 on the latter.
Significance. If the central emergence claim holds after controls, the result would meaningfully reduce reliance on costly grounding supervision for high-resolution visual reasoning. The public code release supports reproducibility and is a clear strength. The OOD gains are noteworthy but require verification that they arise from the binary-reward RL process rather than the introduced scaffolding.
major comments (2)
- [§3.2] §3.2 (Method, multi-turn template and loss restriction): The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.
- [§4.1–4.2] §4.1–4.2 (Experiments and ablations): Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.
minor comments (2)
- [Abstract] The abstract and §4 claim 'standard visual-question-short answering data' but do not list the exact datasets or preprocessing steps used for the 21K samples.
- [§3.1] Notation for the grounding coordinate prediction and cropping operation could be formalized with an equation in §3.1 for clarity.
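As an illustration of the kind of formalization this comment asks for, a two-turn version of the procedure could be written as follows. The notation is assumed for this sketch and is not taken from the paper: $I$ is the high-resolution image, $q$ the question, $\pi_\theta$ the policy, $o_t$ the model output at turn $t$, $b$ the predicted box, and $a^\ast$ the reference answer.

```latex
\begin{align}
  o_1 &\sim \pi_\theta(\cdot \mid I, q), \\
  b &= (x_1, y_1, x_2, y_2) = \mathrm{parse}(o_1), \\
  I_b &= \mathrm{Crop}(I, b) = I[\,y_1{:}y_2,\; x_1{:}x_2\,], \\
  o_2 &\sim \pi_\theta(\cdot \mid I, q, o_1, I_b), \\
  r &= \mathbb{1}\big[\mathrm{answer}(o_2) = a^\ast\big].
\end{align}
```

Under this reading the reward $r$ depends only on the final turn, so any credit assigned to the box $b$ must flow through its effect on $o_2$.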
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments on the necessity of ablations for the multi-turn components and on statistical robustness are well-taken and directly relevant to the emergence claim. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§3.2] §3.2 (Method, multi-turn template and loss restriction): The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.
  Authors: We acknowledge that a full ablation removing both the multi-turn template and the policy-loss restriction (while keeping the binary reward and GRPO optimization) would provide stronger evidence for the emergence claim. In preliminary rollouts we observed that the base Qwen2.5-VL-7B almost never emits grounding coordinates without the template, causing the training to collapse to single-turn behavior. The loss restriction was added to stabilize gradients on the multi-turn trajectories. To address the referee's concern directly, we will run and report the requested ablation in the revised manuscript, comparing performance with and without both components under identical reward and optimizer settings. We will present these results transparently even if they show reduced gains.
  Revision: yes
- Referee: [§4.1–4.2] §4.1–4.2 (Experiments and ablations): Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.
  Authors: We agree that variance estimates, statistical tests, and an explicit control disabling the scaffolding are needed to evaluate robustness. In the revision we will re-train the main MGPO and GRPO baselines with at least three random seeds, report mean and standard deviation on MME-Realworld and V* Bench, and include p-values from paired t-tests. The scaffolding-ablated control will be folded into the new ablation study described in the response to the first comment, allowing readers to judge whether the gains depend on the introduced structure.
  Revision: yes
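The seed-level analysis promised here (mean, standard deviation, and a paired t-test) can be sketched with the standard library alone. The scores below are made-up placeholders, not results from the paper, and pairing runs by seed is an assumption about how the comparison would be set up. The sketch computes only the t statistic; in practice `scipy.stats.ttest_rel` returns the same statistic together with a p-value.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic for paired samples a[i], b[i], e.g. two methods on one seed."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Placeholder benchmark scores for three seeds (illustrative values only).
mgpo_scores = [70.0, 72.0, 71.0]
grpo_scores = [66.0, 67.0, 68.0]

mgpo_mean, mgpo_std = mean(mgpo_scores), stdev(mgpo_scores)
grpo_mean, grpo_std = mean(grpo_scores), stdev(grpo_scores)
t_stat = paired_t_statistic(mgpo_scores, grpo_scores)
degrees_of_freedom = len(mgpo_scores) - 1
```

With only three seeds the test has two degrees of freedom, so even a large t statistic yields a fairly wide confidence interval; more seeds would tighten it.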
Circularity Check
Empirical RL method with held-out benchmarks exhibits no derivation circularity
The paper describes an end-to-end RL framework (MGPO) that trains LMMs on standard VQA data using only binary final-answer rewards, with a multi-turn conversational template introduced to mitigate observed cold-start issues. Performance gains are reported on held-out test sets (MME-Realworld, V* Bench). No mathematical derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The template and loss restriction are explicit design choices, not hidden self-definitions or renamings of prior results. The central claim remains an empirical observation about emergence under the stated training setup, evaluated externally.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-turn template hyperparameters
axioms (1)
- domain assumption: Binary reward from final-answer correctness is sufficient to elicit intermediate grounding behavior
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  MGPO ... leveraging only a binary reward function derived from the correctness of the final answer ... multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  LMMs can emerge robust grounding abilities during the RL training process
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
  MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
  Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.