High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Bo Li; Rui Feng; Weiwei Tian; Xinyu Huang; Yuhao Dong; Ziwei Liu

arxiv: 2507.05920 · v2 · submitted 2025-07-08 · 💻 cs.CV

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang , Yuhao Dong , Weiwei Tian , Bo Li , Rui Feng , Ziwei Liu This is my paper

Pith reviewed 2026-05-19 06:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords large multimodal modelsreinforcement learningvisual groundingmulti-turn conversationhigh-resolution imagespolicy optimizationbinary rewardcold start

0 comments

The pith

Large multimodal models can develop robust visual grounding through reinforcement learning that uses only binary rewards based on final answer correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Multi-turn Grounding-based Policy Optimization, an RL framework that trains LMMs to predict coordinates, crop sub-images, and iteratively focus on relevant regions during multi-turn conversations. It shows these models can acquire grounding skills from standard visual question answering data alone, without any extra annotations for coordinates or regions. A reader would care because high-resolution images flood models with irrelevant tokens, and this method sidesteps the expense of dense supervision while improving performance on both in-distribution and out-of-distribution benchmarks. The design also includes a multi-turn template and loss restriction to help the model start grounding on its own instead of staying in a cold-start state.

Core claim

MGPO is an end-to-end reinforcement learning framework in which LMMs autonomously predict grounding coordinates to crop and process sub-images across multiple dialogue turns, emerging stable visual grounding abilities solely from a binary reward tied to the correctness of the final answer; a multi-turn conversational template together with restriction of policy loss to multi-round outputs overcomes the cold-start problem where models otherwise fail to trigger grounding during rollout.

What carries the argument

Multi-turn Grounding-based Policy Optimization (MGPO), an RL method that lets the model generate grounding coordinates for iterative sub-image cropping inside a multi-turn conversation while limiting policy loss computation to outputs across dialogue rounds.

If this is right

When trained on ordinary visual-question-answering data without grounding labels, MGPO produces stronger grounding than standard GRPO.
The method yields a 5.4 percent gain on the in-distribution MME-Realworld benchmark and a 5.2 percent gain on the out-of-distribution V* Bench.
After post-training Qwen2.5-VL-7B on only 21K samples, MGPO exceeds the performance of OpenAI o1 and GPT-4o on the OOD V* Bench.
The multi-turn template and selective policy loss together promote stable optimization and autonomous triggering of visual grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Binary final-answer rewards may scale as a lightweight way to instill spatial reasoning in vision-language models without needing dense coordinate labels.
The same pattern of restricting loss to selected turns could stabilize training in other multi-turn conversational reinforcement learning settings.
Iterative cropping learned this way might extend naturally to tasks that require repeated visual refinement, such as detailed diagram or chart reasoning.

Load-bearing premise

A multi-turn conversational template combined with restricting policy loss to outputs across multiple dialogue rounds is enough to solve the cold-start problem and produce stable autonomous visual grounding without any explicit supervision.

What would settle it

Train the same base model with MGPO but remove the multi-turn template and loss restriction; if grounding coordinates stop appearing in rollouts and benchmark gains on V* Bench disappear, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2507.05920 by Bo Li, Rui Feng, Weiwei Tian, Xinyu Huang, Yuhao Dong, Ziwei Liu.

**Figure 2.** Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Fixed multi-turn grounding template, which eliminate cold start SFT process. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of image resolutions (width [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Performance comparison of V* Bench between MGPO and GRPO. 0 10 20 30 40 50 60 70 80 Step 0.4 0.5 0.6 0.7 0.8 Ratio of Valid Grounding Coodinates MGPO GRPO [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Visualization of point predic [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: A illustration of cropping sub-image based on grounding coordinates. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: A full conversation example of MGPO post-trained model on high-resolution image tasks. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

read the original abstract

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4\% improvement on in-distribution MME-Realworld and 5.2\% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MGPO shows concrete gains on high-res VQA benchmarks via RL with a multi-turn template and loss mask, but the emergence-from-binary-reward claim rests on unablated scaffolding.

read the letter

The main thing to know is that this MGPO setup trains a 7B LMM on plain VQA data with only a binary final-answer reward and produces 5.4% gains on MME-Realworld plus 5.2% on the harder OOD V* Bench, beating GPT-4o and o1 on the latter. They do this by letting the model output grounding coordinates inside a multi-turn conversation, crop the image, and continue until it answers. To stop the model from ignoring grounding entirely, they fix a conversational template and mask the policy loss so it only updates on the multi-round outputs. That combination beats their GRPO baseline and ships code, which is useful for anyone trying to add grounding without extra labels. The results are reported on held-out sets and the OOD improvement is the clearest signal that something is working. The soft spot is exactly the one the stress-test flags. The abstract presents the gains as evidence that robust grounding emerges from binary reward alone, yet the method adds an explicit multi-turn template and loss restriction to solve the cold-start problem. Without an ablation that removes both while keeping the same reward and optimizer, it is hard to tell how much of the lift comes from the scaffolding versus any spontaneous behavior. The write-up also skips run variance and full controls on the template hyperparameters, so the numbers look promising but not yet locked down. This is worth a referee for groups working on RL post-training for LMMs or high-resolution visual tasks. A reader who wants practical ways to improve detail-oriented reasoning without heavy supervision will get something out of the experiments and the released code. I would send it to review rather than desk-reject; the empirical signal is there even if the mechanism story needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end RL framework for LMMs that enables iterative visual grounding via model-predicted coordinate-based cropping of sub-images within a multi-turn conversation. Training uses only a binary reward from final-answer correctness on standard VQA data (no grounding annotations), with a multi-turn template and policy-loss restriction introduced to solve observed cold-start failures in autonomous grounding. On Qwen2.5-VL-7B trained with 21K samples, MGPO yields 5.4% gains on MME-Realworld and 5.2% on OOD V* Bench, outperforming GRPO and matching or exceeding GPT-4o/o1 on the latter.

Significance. If the central emergence claim holds after controls, the result would meaningfully reduce reliance on costly grounding supervision for high-resolution visual reasoning. The public code release supports reproducibility and is a clear strength. The OOD gains are noteworthy but require verification that they arise from the binary-reward RL process rather than the introduced scaffolding.

major comments (2)

[§3.2] §3.2 (Method, multi-turn template and loss restriction): The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.
[§4.1–4.2] §4.1–4.2 (Experiments and ablations): Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.

minor comments (2)

[Abstract] The abstract and §4 claim 'standard visual-question-short answering data' but do not list the exact datasets or preprocessing steps used for the 21K samples.
[§3.1] Notation for the grounding coordinate prediction and cropping operation could be formalized with an equation in §3.1 for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments on the necessity of ablations for the multi-turn components and on statistical robustness are well-taken and directly relevant to the emergence claim. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§3.2] §3.2 (Method, multi-turn template and loss restriction): The paper states that LMMs 'struggle to autonomously trigger visual grounding' and therefore introduces an explicit multi-turn conversational template plus restriction of policy loss to multi-round outputs. No ablation is reported that removes both components while retaining the identical binary final-answer reward and GRPO-style optimization. This directly bears on whether the 5.4% / 5.2% gains demonstrate spontaneous emergence or are attributable to the engineered structure.

Authors: We acknowledge that a full ablation removing both the multi-turn template and the policy-loss restriction (while keeping the binary reward and GRPO optimization) would provide stronger evidence for the emergence claim. In preliminary rollouts we observed that the base Qwen2.5-VL-7B almost never emits grounding coordinates without the template, causing the training to collapse to single-turn behavior. The loss restriction was added to stabilize gradients on the multi-turn trajectories. To address the referee's concern directly, we will run and report the requested ablation in the revised manuscript, comparing performance with and without both components under identical reward and optimizer settings. We will present these results transparently even if they show reduced gains. revision: yes
Referee: [§4.1–4.2] §4.1–4.2 (Experiments and ablations): Results tables report absolute gains over GRPO but provide no variance across seeds, no statistical significance tests, and no control that disables the multi-turn scaffolding. Without these, it is impossible to assess whether the reported improvements are robust or load-bearing for the emergence claim.

Authors: We agree that variance estimates, statistical tests, and an explicit control disabling the scaffolding are needed to evaluate robustness. In the revision we will re-train the main MGPO and GRPO baselines with at least three random seeds, report mean and standard deviation on MME-Realworld and V* Bench, and include p-values from paired t-tests. The scaffolding-ablated control will be folded into the new ablation study described in the response to the first comment, allowing readers to judge whether the gains depend on the introduced structure. revision: yes

Circularity Check

0 steps flagged

Empirical RL method with held-out benchmarks exhibits no derivation circularity

full rationale

The paper describes an end-to-end RL framework (MGPO) that trains LMMs on standard VQA data using only binary final-answer rewards, with a multi-turn conversational template introduced to mitigate observed cold-start issues. Performance gains are reported on held-out test sets (MME-Realworld, V* Bench). No mathematical derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The template and loss restriction are explicit design choices, not hidden self-definitions or renamings of prior results. The central claim remains an empirical observation about emergence under the stated training setup, evaluated externally.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL assumptions plus the domain assumption that grounding abilities will emerge from binary final-answer rewards when the cold-start issue is mitigated by the multi-turn design.

free parameters (1)

multi-turn template hyperparameters
Number of turns, loss masking rules, and cropping thresholds are design choices that affect whether grounding emerges.

axioms (1)

domain assumption Binary reward from final-answer correctness is sufficient to elicit intermediate grounding behavior
Invoked when claiming that robust grounding emerges without grounding annotations.

pith-pipeline@v0.9.0 · 5844 in / 1271 out tokens · 63328 ms · 2026-05-19T06:05:46.071941+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MGPO ... leveraging only a binary reward function derived from the correctness of the final answer ... multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LMMs can emerge robust grounding abilities during the RL training process

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
cs.CV 2026-04 unverdicted novelty 7.0

MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
cs.CV 2025-09 unverdicted novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 3 Pith papers · 22 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024

work page 2024
[4]

A. T. Clark et al. How many megapixels is the human eye?, 2014. Accessed on May 7, 2025

work page 2014
[5]

C. A. Curcio, K. R. Sloan, R. E. Kalina, and A. E. Hendrickson. Human photoreceptor topography. Journal of Comparative Neurology, 292(4):497–523, 1990

work page 1990
[6]

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023

work page 2023
[7]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025

Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432, 2024

work page arXiv 2024
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Llava-uhd: an lmm perceiving any aspect ratio and high- resolution images

Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high- resolution images. In European Conference on Computer Vision, pages 390–406. Springer, 2024

work page 2024
[11]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

The hungarian method for the assignment problem

Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955. 10

work page 1955
[16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023
[17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025

work page arXiv 2025
[19]

Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model

Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in multimodal language model. arXiv preprint arXiv:2408.00754, 2024

work page arXiv 2024
[20]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[21]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024

work page arXiv 2024
[23]

Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024

work page arXiv 2024
[24]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024a

Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment. arXiv preprint arXiv:2502.04328, 2025

work page arXiv 2025
[25]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. https://openai.com/index/ o3-o4-mini-system-card/ , 2024. Accessed: 2025-04-18

work page 2024
[29]

arXiv preprint arXiv:2504.05599

Yi Peng, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, et al. Skywork r1v: pioneering multimodal reasoning with chain-of- thought. arXiv preprint arXiv:2504.05599, 2025

work page arXiv 2025
[30]

Learning to count everything

Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3394–3403, 2021

work page 2021
[31]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 11

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

work page 2024
[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Scaling vision pre-training to 4k resolution

Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolution. arXiv preprint arXiv:2503.19903, 2025

work page arXiv 2025
[36]

J. D. Smith et al. Foveal cone density and visual acuity. Vision Research, 150:45–53, 2018

work page 2018
[37]

Visual agents as fast and slow thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862, 2024

work page arXiv 2024
[38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024
[42]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Octopus: Embodied vision-language programmer from environmental feedback

Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Haoran Tan, Chencheng Jiang, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, et al. Octopus: Embodied vision-language programmer from environmental feedback. In European Conference on Computer Vision, pages 20–38. Springer, 2024

work page 2024
[45]

Egolife: Towards egocentric life assistant,

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. arXiv preprint arXiv:2503.03803, 2025

work page arXiv 2025
[46]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Beyond llava-hd: Diving into high-resolution large multimodal models

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, and Rong Jin. Beyond llava-hd: Diving into high-resolution large multimodal models. arXiv preprint arXiv:2406.08487, 2024

work page arXiv 2024
[49]

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024. 13 A Training Details Model training is conducted on a computat...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024

work page 2024

[4] [4]

A. T. Clark et al. How many megapixels is the human eye?, 2014. Accessed on May 7, 2025

work page 2014

[5] [5]

C. A. Curcio, K. R. Sloan, R. E. Kalina, and A. E. Hendrickson. Human photoreceptor topography. Journal of Comparative Neurology, 292(4):497–523, 1990

work page 1990

[6] [6]

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023

work page 2023

[7] [7]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025

Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432, 2024

work page arXiv 2024

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Llava-uhd: an lmm perceiving any aspect ratio and high- resolution images

Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high- resolution images. In European Conference on Computer Vision, pages 390–406. Springer, 2024

work page 2024

[11] [11]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [14]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [15]

The hungarian method for the assignment problem

Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955. 10

work page 1955

[15] [16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023

[16] [17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [18]

Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025

work page arXiv 2025

[18] [19]

Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model

Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in multimodal language model. arXiv preprint arXiv:2408.00754, 2024

work page arXiv 2024

[19] [20]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024

[20] [21]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [22]

Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024

work page arXiv 2024

[22] [23]

Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024

work page arXiv 2024

[23] [24]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024a

Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment. arXiv preprint arXiv:2502.04328, 2025

work page arXiv 2025

[24] [25]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [27]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [28]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. https://openai.com/index/ o3-o4-mini-system-card/ , 2024. Accessed: 2025-04-18

work page 2024

[27] [29]

arXiv preprint arXiv:2504.05599

Yi Peng, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, et al. Skywork r1v: pioneering multimodal reasoning with chain-of- thought. arXiv preprint arXiv:2504.05599, 2025

work page arXiv 2025

[28] [30]

Learning to count everything

Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3394–3403, 2021

work page 2021

[29] [31]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 11

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [32]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

work page 2024

[31] [33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [34]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [35]

Scaling vision pre-training to 4k resolution

Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolution. arXiv preprint arXiv:2503.19903, 2025

work page arXiv 2025

[34] [36]

J. D. Smith et al. Foveal cone density and visual acuity. Vision Research, 150:45–53, 2018

work page 2018

[35] [37]

Visual agents as fast and slow thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862, 2024

work page arXiv 2024

[36] [38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [39]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [40]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [41]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024

[40] [42]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [43]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [44]

Octopus: Embodied vision-language programmer from environmental feedback

Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Haoran Tan, Chencheng Jiang, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, et al. Octopus: Embodied vision-language programmer from environmental feedback. In European Conference on Computer Vision, pages 20–38. Springer, 2024

work page 2024

[43] [45]

Egolife: Towards egocentric life assistant,

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. arXiv preprint arXiv:2503.03803, 2025

work page arXiv 2025

[44] [46]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [47]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [48]

Beyond llava-hd: Diving into high-resolution large multimodal models

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, and Rong Jin. Beyond llava-hd: Diving into high-resolution large multimodal models. arXiv preprint arXiv:2406.08487, 2024

work page arXiv 2024

[47] [49]

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024. 13 A Training Details Model training is conducted on a computat...

work page internal anchor Pith review Pith/arXiv arXiv 2024