AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang; Dongyang Chen; Kaituo Feng; Manyuan Zhang; Meng Meng; Sicheng Gao; Xiangyu Yue; Xu Zhou; Yuzhang Shang; Zhixun Li; Zhongyu Wang

arxiv: 2512.16918 · v3 · submitted 2025-12-18 · 💻 cs.CV

Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

Pith reviewed 2026-05-16 21:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords: adaptive tool use, multimodal large language models, reinforcement learning, vision tools, tool benefit score, visual reasoning, chain of thought, images and videos

The pith

AdaTooler-V enables multimodal models to invoke vision tools only when they improve results, reaching 89.8% on the V* benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing open-source multimodal models often invoke vision tools even when unnecessary, raising inference costs and sometimes lowering accuracy. AdaTooler-V addresses this by training the model to first decide whether a given image or video problem actually benefits from tool assistance. The central mechanism is AT-GRPO, a reinforcement learning method that scales each sample's reward according to its Tool Benefit Score so the policy learns selective rather than habitual tool use. Two new datasets support the training pipeline: one for supervised cold-start chain-of-thought data and a larger set for reinforcement learning with verifiable outcomes across single images, multiple images, and videos. The resulting 7B model is evaluated on twelve benchmarks and reaches 89.8% accuracy on the high-resolution V* benchmark, where it exceeds GPT-4o and Gemini 1.5 Pro.

Core claim

AdaTooler-V performs adaptive tool-use by determining whether a visual problem truly requires tools. It introduces AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Two datasets are constructed for training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, with the 7B version achieving 89.8% accuracy on V* and surpassing GPT-4o and Gemini 1.5 Pro.

What carries the argument

AT-GRPO, a reinforcement learning algorithm that scales per-sample rewards by the Tool Benefit Score to train selective rather than unconditional vision-tool invocation.
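The idea of benefit-scaled rewards can be illustrated with a minimal sketch. This is not the paper's formulation: the exact reward scaling function is not given in this summary, so the additive term `alpha * benefit`, the function names `shaped_reward` and `group_advantages`, and the value of `alpha` are all assumptions made for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows standard GRPO.

```python
from statistics import mean, pstdev


def shaped_reward(correct: bool, used_tool: bool, benefit: float, alpha: float = 0.5) -> float:
    """Outcome reward plus a tool term scaled by the sample's Tool Benefit Score.

    Hypothetical form (an assumption, not taken from the paper):
    a positive benefit rewards tool calls, a negative benefit penalizes them.
    """
    outcome = 1.0 if correct else 0.0
    tool_term = alpha * benefit if used_tool else 0.0
    return outcome + tool_term


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization of rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: (answer correct?, tool used?). Placeholder data.
rollouts = [(True, True), (True, False), (False, True), (False, False)]

for benefit in (0.4, -0.4):
    rewards = [shaped_reward(c, t, benefit) for c, t in rollouts]
    advantages = group_advantages(rewards)
    print(benefit, [round(a, 2) for a in advantages])
```

Under these assumptions, a correct rollout that used a tool receives the largest advantage when the benefit score is positive, and the correct tool-free rollout receives the largest advantage when the score is negative.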

Load-bearing premise

The Tool Benefit Score correctly identifies samples where calling a vision tool produces a genuine accuracy gain rather than merely reflecting dataset biases or training artifacts.

What would settle it

Measure whether a non-adaptive baseline that always calls tools matches or exceeds AdaTooler-V accuracy on a new test set deliberately constructed so that tool use never improves the answer.
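A comparison of this kind could be scored along the following lines. The sketch assumes each model's outputs have already been graded into per-sample records; the `Result` type, the `compare` function, and the sample values are hypothetical placeholders, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool    # whether the final answer was graded correct
    tool_calls: int  # number of vision-tool invocations in the trajectory


def accuracy(results: list[Result]) -> float:
    return sum(r.correct for r in results) / len(results)


def tool_call_rate(results: list[Result]) -> float:
    """Fraction of samples on which at least one tool was invoked."""
    return sum(r.tool_calls > 0 for r in results) / len(results)


def compare(adaptive: list[Result], always_tool: list[Result]) -> dict[str, float]:
    """Summarize the adaptive model against an always-call-tools baseline."""
    return {
        "accuracy_gap": accuracy(adaptive) - accuracy(always_tool),
        "adaptive_tool_rate": tool_call_rate(adaptive),
        "always_tool_rate": tool_call_rate(always_tool),
    }


# Placeholder records for a test set where tools are not expected to help.
adaptive = [Result(True, 0), Result(True, 0), Result(False, 1), Result(True, 0)]
always_tool = [Result(True, 1), Result(False, 2), Result(False, 1), Result(True, 1)]

print(compare(adaptive, always_tool))
```

A non-positive `accuracy_gap` on such a set would mean the always-tool baseline matches or exceeds the adaptive model, which is the outcome the test above is designed to detect.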

Figures

Figures reproduced from arXiv: 2512.16918 by Chaoyang Wang, Dongyang Chen, Kaituo Feng, Manyuan Zhang, Meng Meng, Sicheng Gao, Xiangyu Yue, Xu Zhou, Yuzhang Shang, Zhixun Li, Zhongyu Wang.

**Figure 2.** Case reasoning trajectory of AdaTooler-V. For single-image and video questions, the model alternates between internal reasoning, vision tool invocations and final answers, enabling zoom-in on fine-grained regions and inspection of informative clips. In contrast, for the multi-image clock example, AdaTooler-V solves the problem purely via text-based CoT, illustrating its ability to adaptively decide when vi…

**Figure 3.** The data distribution of our AdaTooler-V-300k dataset.

**Figure 4.** An illustration of our proposed AT-GRPO.

**Figure 5.** RL training curves.

… reinforcement learning for training; SFT+GRPO, which replaces our proposed AT-GRPO algorithm with the standard GRPO method. As shown in the last two rows of Tab. 3, incorporating the proposed AT-GRPO training strategy leads to a substantial performance improvement. These results confirm that dynamically adjusting tool-use rewards based on the Tool Benefit Score enables the model to inv…

**Figure 6.** An example of AdaTooler-V-7B’s reasoning output on V* Benchmark.

**Figure 7.** An example of AdaTooler-V-7B’s reasoning output on MVBench.

**Figure 8.** Prompt template for training and inference.

Original abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaTooler-V introduces adaptive tool-use via AT-GRPO and new datasets, delivering strong benchmark results, though the Tool Benefit Score requires validation to ensure it's not biased.

The paper's main contribution is a reinforcement learning setup called AT-GRPO that lets an MLLM learn when to invoke vision tools based on a per-sample Tool Benefit Score. They pair this with two new datasets, AdaTooler-V-CoT-100k for initial fine-tuning and AdaTooler-V-300k for the RL stage, covering images and videos. The result is a 7B model that reportedly reaches 89.8% on the V* benchmark, ahead of GPT-4o and Gemini 1.5 Pro.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AdaTooler-V, an MLLM for adaptive tool-use in visual reasoning tasks across images and videos. It introduces the AT-GRPO reinforcement learning algorithm that scales rewards using a Tool Benefit Score to encourage tool invocation only when beneficial. The model is trained on AdaTooler-V-CoT-100k for SFT and AdaTooler-V-300k for RL, and achieves state-of-the-art results including 89.8% accuracy on the V* benchmark, outperforming GPT-4o and Gemini 1.5 Pro.

Significance. If the adaptive tool-use mechanism is robustly validated, this work would be significant for improving efficiency and performance of open-source MLLMs in multimodal reasoning by mitigating blind tool invocation. The release of code, models, and datasets strengthens reproducibility and impact.

major comments (2)

§3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but the manuscript does not sufficiently detail how it is computed, whether it is derived from held-out data or from the AdaTooler-V-300k training distribution, or how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts). This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.
§5 (Experiments): The 89.8% accuracy on V*, which surpasses proprietary models, requires verification of the evaluation setup, including whether tool use is compared fairly, the number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.

minor comments (2)

Abstract: The abstract mentions 'verifiable rewards' but does not specify what makes the rewards verifiable across single-image, multi-image, and video data.
§4: Clarify the exact form of the reward scaling function in AT-GRPO to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will revise the manuscript to enhance clarity and rigor where needed.

Referee: §3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but the manuscript does not sufficiently detail how it is computed, whether it is derived from held-out data or from the AdaTooler-V-300k training distribution, or how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts). This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.

Authors: We thank the referee for this important observation. We acknowledge that §3.2 would benefit from greater detail on the Tool Benefit Score. In the revised manuscript, we will expand the description to specify that the score is computed as the performance delta (with vs. without tool use) on a held-out validation subset drawn independently from the AdaTooler-V-300k training distribution. To guard against dataset biases such as resolution patterns or annotation artifacts, the score incorporates normalization across data modalities (single-image, multi-image, video) and is cross-validated on multiple disjoint subsets. We will also add an ablation demonstrating that removing the Tool Benefit Score leads to increased blind tool invocation, supporting that the RL objective learns genuine benefit rather than spurious correlations.

revision: yes

Referee: §5 (Experiments): The 89.8% accuracy on V*, which surpasses proprietary models, requires verification of the evaluation setup, including whether tool use is compared fairly, the number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.

Authors: We agree that transparent verification of the evaluation protocol is necessary. In the revised §5, we will clarify that V* results follow the benchmark's official protocol, with tool-use comparisons conducted under matched conditions (resolution handling and tool availability) against the publicly reported numbers for GPT-4o and Gemini 1.5 Pro. We will report results averaged over 5 independent runs with different random seeds, including standard deviation. We explicitly confirm that the Tool Benefit Score was computed only on training and internal validation splits and never tuned or evaluated on the V* test set, which remained completely held-out for final benchmarking. We will include these details along with the exact prompting templates used.

revision: yes
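The two quantities named in these simulated responses, a with-tool versus without-tool performance delta and a mean with standard deviation over repeated runs, can be sketched as follows. This follows only the wording of the rebuttal above; the function names, the per-sample rollout framing, and all numeric values are illustrative assumptions rather than the paper's definitions or results.

```python
from statistics import mean, stdev


def tool_benefit_score(correct_with_tool: list[bool], correct_without_tool: list[bool]) -> float:
    """Accuracy with tools minus accuracy without tools, in [-1, 1].

    Assumes both lists hold graded rollouts for the same sample (hypothetical setup).
    """
    acc_with = sum(correct_with_tool) / len(correct_with_tool)
    acc_without = sum(correct_without_tool) / len(correct_without_tool)
    return acc_with - acc_without


def summarize_runs(accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of accuracy over independent seeds."""
    return mean(accuracies), stdev(accuracies)


# Placeholder rollouts: tools help on this sample, so the score is positive.
score = tool_benefit_score([True, True, True, False], [True, False, False, False])
print(score)

# Placeholder accuracies from five seeds (not reported results).
print(summarize_runs([0.88, 0.90, 0.89, 0.91, 0.90]))
```

A score near zero or below would mark a sample where tool calls bring no gain, which is the case adaptive training is meant to discourage.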

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper presents the 89.8% V* accuracy as an empirical held-out benchmark result after training with AT-GRPO on the constructed AdaTooler-V-300k dataset. The Tool Benefit Score is used to scale rewards during RL, but the reported performance metric is not shown to reduce to a fitted parameter or self-referential definition by construction. No equations, self-citation load-bearing steps, or ansatz smuggling are exhibited that would make the adaptive tool-use claim equivalent to its inputs. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that the Tool Benefit Score can be computed reliably from model outputs and that the RL objective with adaptive scaling produces better tool-use policies than standard GRPO. No free parameters are explicitly listed in the abstract, but the reward scaling function itself functions as an implicit fitted component.

free parameters (1)

Tool Benefit Score scaling function
The abstract states that AT-GRPO adaptively adjusts reward scales based on this score; its exact functional form and any hyperparameters are not specified here.

axioms (1)

domain assumption: RL with verifiable rewards produces policies that generalize to unseen visual reasoning tasks
Invoked implicitly when claiming superior performance on twelve benchmarks after training on the new 300k dataset.

invented entities (1)

Tool Benefit Score (no independent evidence)
purpose: Quantifies whether tool use improves the answer for a given sample so that reward scaling can discourage unnecessary calls.
Introduced as the core signal for AT-GRPO; no independent evidence outside the training loop is provided in the abstract.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV 2026-05 unverdicted novelty 7.0
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

Gen-Searcher: Reinforcing Agentic Search for Image Generation
cs.CV 2026-03 unverdicted novelty 7.0
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
cs.CV 2026-05 unverdicted novelty 6.0
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

Gen-Searcher: Reinforcing Agentic Search for Image Generation
cs.CV 2026-03 unverdicted novelty 6.0
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
cs.CV 2026-04 unverdicted novelty 5.0
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 5 Pith papers · 26 internal anchors

[1]

Qwen2.5- vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report, ...

work page 2025
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Univg- r1: Reasoning guided universal visual grounding with re- inforcement learning.arXiv preprint arXiv:2505.14231,

Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Rea- soning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025. 2

work page arXiv 2025
[4]

Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025. 2

work page 2025
[5]

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 2

work page internal anchor Pith review arXiv 2025
[6]

Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025. 7

work page 2025
[7]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. 1, 2, 5, 6

work page 2025
[8]

Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025

Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025. 2

work page 2025
[9]

Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025. 1, 6

work page arXiv 2025
[10]

Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024

Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024. 7

work page 2024
[11]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025. 7

work page 2025
[14]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025. 7

work page 2025
[15]

Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025. 7

work page arXiv 2025
[16]

Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc- 9 hunyuan-video-7b: Structured video comprehension of real- world shorts.arXiv preprint arXiv:2507.20939, 2025. 2

work page arXiv 2025
[17]

Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024

Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024. 6, 7

work page 2024
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025. 7

work page 2025
[21]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. 1

work page 2025
[23]

Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025. 7

work page arXiv 2025
[24]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 7

work page 2023
[25]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 6

work page internal anchor Pith review arXiv 2025
[26]

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025. 2

work page 2025
[27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 3

work page internal anchor Pith review arXiv 2025
[29]

Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025

Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. 2

work page arXiv 2025
[30]

Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025. 2

work page arXiv 2025
[31]

Mvbench: A comprehensive multi-modal video understanding benchmark, 2024

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. 7

work page 2024
[32]

Adaptive tool use in large language models with meta-cognition trigger, 2025

Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger, 2025. 2

work page 2025
[34]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via re- inforcement fine-tuning.arXiv preprint arXiv:2504.06958,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Perception, reason, think, and plan: A survey on large multimodal reasoning models, 2025

Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, and Min Zhang. Perception, reason, think, and plan: A survey on large multimodal ...

work page 2025
[36]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 6

work page 2024
[37]

Mmbench: Is your multi-modal model an all-around player?, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024. 7

work page 2024
[38]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. 7

work page 2019
[41]

Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024. 7

work page 2024
[42]

V Jawahar

Minesh Mathew, Viraj Bagal, Rub`en P´erez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V Jawahar. Infographicvqa,

work page
[43]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 7 10

work page arXiv 2025
[44]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. Accessed 2025-09-19. 6, 7

work page 2024
[45]

Thinking with images, 2025

OpenAI. Thinking with images, 2025. 2

work page 2025
[46]

Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025

Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2

work page arXiv 2025
[47]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work.arXiv preprint arXiv: 2409.19256, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Video- xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video- xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 7

work page 2025
[52]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 2

[53] Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in LLMs, 2025.

[54] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, and Yu Cheng. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning.

[55] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.

[56] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.

[57] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning of vision language models, 2025.

[58] Qwen Team. Qwen2.5-VL, 2025.

[59] Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? On the dual nature of reasoning in vision-language models, 2025.

[60] Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, and Huaxiu Yao. Knowing the answer isn’t enough: Fixing reasoning path failures in LVLMs. arXiv preprint arXiv:2512.06258, 2025.

[61] Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, and Haiyun Jiang. Vision-EKIPL: External knowledge-infused policy learning for visual reasoning. arXiv preprint arXiv:2506.06856, 2025.

[62] Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, and Shichao Kan. TMCIR: Token merge benefits composed image retrieval. arXiv preprint arXiv:2504.10995, 2025.

[63] Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025.

[64] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning, 2025.

[65] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning, 2025.

[66] Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. VGR: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025.

[67] Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video reasoning capability in MLLMs via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025.

[68] Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-Thinker: Sparking "thinking with videos" via reinforcement learning. arXiv preprint arXiv:2510.23473.

[69] Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao. VSTAR: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transitions, 2023.

[70] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.

[71] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965.

[72] Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, and Vicente Ordonez. ProxyThinker: Test-time guidance through small visual reasoners, 2025.

[73] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces, 2025.

[74] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Bench: A benchmark for multi-image spatial intelligence, 2025.

[75] Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[76] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, ... DAPO: An open-source LLM reinforcement learning system at scale.

[77] Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs. arXiv preprint arXiv:2505.21327, 2025.

[78] Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.

[79] Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3D, 2025.

[80] Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-GRPO: Advancing LLM reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025.

[81] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images, 2025.

[82] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025.

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

Qwen2.5- vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report, ...

work page 2025

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Univg- r1: Reasoning guided universal visual grounding with re- inforcement learning.arXiv preprint arXiv:2505.14231,

Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Rea- soning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025. 2

work page arXiv 2025

[4] [4]

Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025. 2

work page 2025

[5] [5]

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 2

work page internal anchor Pith review arXiv 2025

[6] [6]

Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025. 7

work page 2025

[7] [7]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. 1, 2, 5, 6

work page 2025

[8] [8]

Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025

Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025. 2

work page 2025

[9] [9]

Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025. 1, 6

work page arXiv 2025

[10] [10]

Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024

Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024. 7

work page 2024

[11] [11]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025. 7

work page 2025

[14] [14]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025. 7

work page 2025

[15] [15]

Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025. 7

work page arXiv 2025

[16] [16]

Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc- 9 hunyuan-video-7b: Structured video comprehension of real- world shorts.arXiv preprint arXiv:2507.20939, 2025. 2

work page arXiv 2025

[17] [17]

Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024

Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024. 6, 7

work page 2024

[18] [19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [20]

Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025. 7

work page 2025

[20] [21]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [22]

Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. 1

work page 2025

[22] [23]

Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025. 7

work page arXiv 2025

[23] [24]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 7

work page 2023

[24] [25]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 6

work page internal anchor Pith review arXiv 2025

[25] [26]

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025. 2

work page 2025

[26] [27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [28]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 3

work page internal anchor Pith review arXiv 2025

[28] [29]

Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025

Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. 2

work page arXiv 2025

[29] [30]

Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025. 2

work page arXiv 2025

[30] [31]

Mvbench: A comprehensive multi-modal video understanding benchmark, 2024

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. 7

work page 2024

[31] [32]

Adaptive tool use in large language models with meta-cognition trigger, 2025

Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger, 2025. 2

work page 2025

[32] [34]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via re- inforcement fine-tuning.arXiv preprint arXiv:2504.06958,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [35]

Perception, reason, think, and plan: A survey on large multimodal reasoning models, 2025

Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, and Min Zhang. Perception, reason, think, and plan: A survey on large multimodal ...

work page 2025

[34] [36]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 6

work page 2024

[35] [37]

Mmbench: Is your multi-modal model an all-around player?, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024. 7

work page 2024

[36] [38]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [39]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

work page internal anchor Pith review Pith/arXiv arXiv

[38] [40]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. 7

work page 2019

[39] [41]

Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024. 7

work page 2024

[40] [42]

V Jawahar

Minesh Mathew, Viraj Bagal, Rub`en P´erez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V Jawahar. Infographicvqa,

work page

[41] [43]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 7 10

work page arXiv 2025

[42] [44]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. Accessed 2025-09-19. 6, 7

work page 2024

[43] [45]

Thinking with images, 2025

OpenAI. Thinking with images, 2025. 2

work page 2025

[44] [46]

Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025

Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2

work page arXiv 2025

[45] [47]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [48]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [49]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [50]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work.arXiv preprint arXiv: 2409.19256, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [51]

Video- xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video- xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 7

work page 2025

[50] [52]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [53]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025. 2

work page 2025

[52] [54]

Openthinkimg: Learning to think with images via visual tool reinforcement learning,

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Jun- tao Li, Xiaoye Qu, and Yu Cheng. Openthinkimg: Learning to think with images via visual tool reinforcement learning,

work page

[53] [55]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Jun- tao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025. 2

work page internal anchor Pith review arXiv 2025

[54] [56]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [57]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models, 2025. 1

work page 2025

[56] [58]

Qwen2.5-vl, 2025

Qwen Team. Qwen2.5-vl, 2025. 7

work page 2025

[57] [59]

More thought, less accuracy? on the dual nature of reasoning in vision-language models, 2025

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of reasoning in vision-language models, 2025. 2

work page 2025

[58] [60]

Knowing the answer isn’t enough: Fixing reasoning path failures in lvlms.arXiv preprint arXiv:2512.06258, 2025

Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Ji- aqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, and Huaxiu Yao. Knowing the answer isn’t enough: Fixing reasoning path failures in lvlms.arXiv preprint arXiv:2512.06258, 2025. 2

work page arXiv 2025

[59] [61]

Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, and Haiyun Jiang. Vision-ekipl: External knowledge- infused policy learning for visual reasoning.arXiv preprint arXiv:2506.06856, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [62]

Tmcir: Token merge benefits composed image retrieval.arXiv preprint arXiv:2504.10995, 2025

Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, and Shichao Kan. Tmcir: Token merge benefits composed image retrieval.arXiv preprint arXiv:2504.10995, 2025. 2

work page arXiv 2025

[61] [63]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025. 2

work page arXiv 2025

[62] [64]

Vl-rethinker: Incentivizing self- reflection of vision-language models with reinforcement learn- ing, 2025

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self- reflection of vision-language models with reinforcement learn- ing, 2025. 6

work page 2025

[63] [65]

Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning, 2025

Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning, 2025. 2, 3, 6

work page 2025

[64] [66]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [67]

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434,

work page arXiv

[66] [68]

Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,

Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Run- hao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,

work page arXiv

[67] [69]

Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transitions, 2023

Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao. Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transitions, 2023. 7


[68] [70]

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025. 2


[69] [71]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965,


[70] [72]


Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, and Vicente Ordonez. Proxythinker: Test-time guidance through small visual reasoners, 2025. 6


[71] [73]


Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces, 2025. 7


[72] [74]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025. 7


[73] [75]

Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2


[74] [76]

Dapo: An open-source llm reinforcement learning system at scale,

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu,...


[75] [77]

Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms. arXiv preprint arXiv:2505.21327, 2025. 2


[76] [78]

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning,


[77] [79]


Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025. 7


[78] [80]

Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025. 1


[79] [81]


Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images, 2025. 2


[80] [82]

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025. 6
