arxiv: 2509.07969 · v1 · pith:YEM2XMUInew · submitted 2025-09-09 · 💻 cs.CV · cs.AI · cs.CL

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao

Pith reviewed 2026-05-18 01:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords visual search · multi-turn reasoning · reinforcement learning · tool-based interactions · large multimodal models · reasoning patterns · over-turn masking · visual probe dataset

The pith

Mini-o3 trains on six interaction turns yet produces naturally longer reasoning chains that improve accuracy on visual search tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing limits on interaction length and reasoning variety in tool-using multimodal models can be overcome without training directly on long sequences. It does so by building a dataset of hard visual search problems, collecting initial trajectories that display varied patterns such as depth-first search and trial-and-error, and applying an over-turn masking rule during reinforcement learning. A sympathetic reader would care because this suggests open models can tackle exploratory visual problems that currently require many back-and-forth steps. If the approach holds, performance keeps rising as the model is allowed more turns at inference time rather than plateauing at the training limit.

Core claim

Mini-o3 executes deep multi-turn reasoning spanning tens of steps on visual search tasks. It achieves this with a Visual Probe Dataset of challenging problems, an iterative pipeline that yields cold-start trajectories containing diverse patterns including depth-first search, trial-and-error, and goal maintenance, and an over-turn masking strategy in reinforcement learning that avoids penalizing responses reaching the maximum turn count. Despite training under a six-turn upper bound, the resulting model generates longer trajectories at inference time and shows rising accuracy with additional turns.
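The turn budget in this claim can be pictured as a rollout loop with a cap. The following is a minimal Python sketch, not the authors' implementation: `rollout`, `Step`, and `stub_policy` are hypothetical names, and the policy is stubbed by a callable that returns either a tool call or a final answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: either a tool call (answer is None) or a final answer."""
    tool_call: Optional[str] = None
    answer: Optional[str] = None


def rollout(policy_step: Callable[[List[Step]], Step],
            max_turns: int) -> Tuple[List[Step], Optional[str], bool]:
    """Run the policy until it answers or the turn budget is exhausted.

    Returns (history, answer, over_turn). over_turn is True when the budget
    ran out before the policy produced a final answer.
    """
    history: List[Step] = []
    for _ in range(max_turns):
        step = policy_step(history)
        history.append(step)
        if step.answer is not None:
            return history, step.answer, False
    return history, None, True


# Hypothetical stub policy: zooms four times, then answers on its fifth turn.
def stub_policy(history: List[Step]) -> Step:
    if len(history) < 4:
        return Step(tool_call=f"zoom(region={len(history)})")
    return Step(answer="red mug")


_, answer_6, over_6 = rollout(stub_policy, max_turns=6)
_, answer_3, over_3 = rollout(stub_policy, max_turns=3)
```

Under a budget of six turns the stub finishes with an answer; under a budget of three it is cut off and flagged as over-turn, which is the case the masking strategy is concerned with.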

What carries the argument

Over-turn masking strategy during reinforcement learning that prevents penalization of responses hitting the turn limit, allowing test-time trajectories to exceed the six-turn training bound.
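The page describes the masking rule only at a high level, so the following Python sketch fills in details by assumption: a mean-baseline (group-relative) advantage, a baseline computed over finished responses only, and a zero advantage for over-turn responses. `masked_advantages` is a hypothetical name and the reward values are illustrative, not taken from the paper.

```python
from statistics import mean
from typing import List


def masked_advantages(rewards: List[float], over_turn: List[bool]) -> List[float]:
    """Mean-baseline advantages with over-turn responses masked out.

    Assumption: the baseline is the mean reward of the responses that
    finished within the turn budget; over-turn responses get advantage 0.0,
    so they contribute no gradient (neither penalty nor reward).
    """
    finished = [r for r, o in zip(rewards, over_turn) if not o]
    if not finished:
        return [0.0] * len(rewards)
    baseline = mean(finished)
    return [0.0 if o else r - baseline for r, o in zip(rewards, over_turn)]


# Four sampled responses to one problem: correct, wrong, over-turn, over-turn.
rewards = [1.0, 0.0, 0.0, 0.0]
over_turn = [False, False, True, True]

adv_masked = masked_advantages(rewards, over_turn)
adv_unmasked = masked_advantages(rewards, [False] * 4)
```

Without the mask, the two over-turn responses receive the same negative advantage as the wrong answer, which would push the policy toward answering early. With the mask they are neutral, which is the behavior the strategy is described as aiming for.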

If this is right

Accuracy on visual search problems continues to rise as the number of allowed interaction turns increases at inference time.
The model produces varied reasoning patterns such as depth-first search and trial-and-error without explicit training on each pattern.
State-of-the-art results are reached on challenging visual search tasks that require extended exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The masking technique may serve as a general method to encourage longer reasoning horizons in other tool-use settings without having to train on those longer horizons.
The same data-collection loop could be repeated to target even deeper search behaviors on different visual or multimodal problems.
If the scaling holds, training compute can remain modest while inference budgets are adjusted per task difficulty.

Load-bearing premise

The iterative data collection pipeline yields cold-start trajectories whose diverse reasoning patterns transfer to longer chains without systematic bias introduced by the masking rule.

What would settle it

Measure accuracy while steadily increasing the allowed inference turns; the claim is falsified if accuracy stops rising or begins to fall after a modest number of additional turns.
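A minimal Python sketch of this check, assuming accuracy has already been measured at several inference-time turn budgets. `first_plateau` is a hypothetical helper, and the two curves are placeholder numbers for illustration only, not results from the paper.

```python
from typing import Dict, Optional


def first_plateau(curve: Dict[int, float], min_gain: float = 0.0) -> Optional[int]:
    """Return the smallest turn budget at which accuracy stops rising.

    curve maps an inference-time turn budget to measured accuracy. A budget
    counts as a plateau point when its accuracy exceeds the accuracy at the
    previous budget by no more than min_gain. Returns None if accuracy
    rises at every step.
    """
    budgets = sorted(curve)
    for prev, cur in zip(budgets, budgets[1:]):
        if curve[cur] - curve[prev] <= min_gain:
            return cur
    return None


# Placeholder curves (turn budget -> accuracy); not results from the paper.
rising = {6: 0.40, 12: 0.45, 24: 0.48, 32: 0.50}
stalls = {6: 0.40, 12: 0.45, 24: 0.45, 32: 0.44}

plateau_rising = first_plateau(rising)
plateau_stalls = first_plateau(stalls)
```

In practice `min_gain` would likely be set from the run-to-run variance, so that noise is not mistaken for either a gain or a plateau.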

Original abstract

Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The scaling claim is the main draw but rests on unverified assumptions about the masking strategy producing real long-horizon reasoning rather than just nonstop output.

The key takeaway is that this paper offers a practical way to extend interaction length in visual tool-use models beyond the training limit, but the support for genuine scaling of reasoning is still weak without more controls.

They start from the problem that current open models get stuck in short, repetitive loops on hard visual searches. To fix it they build three pieces: a new dataset of tough visual probe problems, an iterative collection process that generates starting trajectories with varied strategies like depth-first search and goal maintenance, and over-turn masking in the RL stage so the model isn't punished for continuing past the six-turn cap used in training. What stands out is how they get the model to produce longer chains at test time where accuracy keeps rising with more steps. If the full experiments back this up with solid numbers, it gives a workable path for people trying to make visual agents that can explore more thoroughly.

The approach builds directly on existing RL for tool use and multi-turn setups, so the novelty is mostly in how they combine the dataset, collection loop, and masking to push the length. Credit for trying to reproduce o3-like depth in an open setting.

On the downside, the abstract gives no concrete metrics, no comparison tables, and no ablation that isolates the masking effect. The concern about the mask just encouraging nonstop output rather than coherent long thinking is worth checking. If the paper doesn't show that trajectories stay on task and diverse even without the mask, or provide stats on termination and pattern variety, then the scaling story stays unconvincing.

This is aimed at labs working on multimodal agents and long-horizon planning. Someone building on tool-augmented vision models could pick up the dataset construction and masking idea and test it themselves.

It deserves a serious referee because the problem is real and the proposed fix is testable, even though the current writeup leaves too many questions on the results side. I'd recommend sending it for peer review, with the expectation that reviewers will ask for those missing ablations and quantitative details to make the claims stick.

Referee Report

3 major / 1 minor

Summary. The paper introduces Mini-o3 for scaling tool-based multi-turn reasoning in visual search tasks with large multimodal models. It constructs the Visual Probe Dataset of challenging problems, uses an iterative pipeline to collect cold-start trajectories exhibiting diverse patterns (depth-first search, trial-and-error, goal maintenance), and applies an over-turn masking strategy during RL training. The central claim is that a model trained with a hard cap of only six interaction turns produces trajectories that naturally extend to tens of turns at inference time, with accuracy continuing to improve as turn count grows, yielding SOTA results on difficult visual search problems.

Significance. If the scaling behavior is shown to arise from transferable reasoning patterns rather than an artifact of the masking strategy, the work would provide a practical open-source recipe for longer-horizon exploratory visual reasoning, addressing current limitations of monotonous patterns and short interaction limits in multimodal agents. The dataset construction and iterative collection pipeline are concrete contributions that could be reused, though the absence of reported quantitative metrics, baselines, and ablations in the abstract limits immediate assessment of impact.

major comments (3)

[Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).
[Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.
[Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.
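One way the diversity request in the third comment could be made concrete is the Shannon entropy of per-trajectory pattern labels. This is a sketch under assumptions, not a metric the paper is known to report: `pattern_entropy` is a hypothetical helper, and the label names and counts below are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def pattern_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the reasoning-pattern label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical per-trajectory labels; the label set and counts are made up.
balanced = ["dfs", "trial_and_error", "dfs", "trial_and_error"]
monotone = ["dfs", "dfs", "dfs", "dfs"]

entropy_balanced = pattern_entropy(balanced)
entropy_monotone = pattern_entropy(monotone)
```

A collection dominated by a single pattern scores zero bits, while an even split over two patterns scores one bit. The labels themselves would still need a defined annotation procedure, which this sketch does not supply.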

minor comments (1)

[Abstract] Abstract: 'state-of-the-art performance' is asserted without naming the specific benchmarks, prior open-source baselines, or exact metrics used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments correctly identify areas where the abstract could more explicitly connect to the quantitative evidence and analyses in the main text. We address each point below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).

Authors: We agree that the abstract should reference the supporting quantitative results. The main text includes Figure 3, which plots accuracy versus number of inference turns (with error bars from multiple seeds) and compares against a hard-stop baseline. In the revised manuscript we have updated the abstract to cite this figure and briefly note the observed trend of continued accuracy gains beyond the six-turn training cap.

Revision: yes

Referee: [Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.

Authors: This is a fair criticism. While Section 3.3 motivates the over-turn masking strategy, the initial submission did not contain a direct ablation. We have added an ablation study to the revised version (new Table 4 and accompanying text in Section 4.3) that trains an otherwise identical model without the mask and compares resulting trajectory lengths and accuracies at inference. The results indicate that masking permits longer productive chains without introducing the artifacts a hard stop would produce.

Revision: yes

Referee: [Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.

Authors: We appreciate the request for quantitative support. Section 3.2 describes the iterative collection pipeline and provides qualitative examples of the patterns. To address the gap, the revised manuscript now includes a table (new Table 2) reporting the distribution of reasoning patterns across collected trajectories, termination statistics, and a short bias analysis of the masking procedure. We have also added a reference to this table in the abstract.

Revision: yes

Circularity Check

0 steps flagged

Empirical training pipeline shows no definitional circularity

The paper presents an empirical recipe consisting of dataset construction, iterative trajectory collection exhibiting patterns such as depth-first search and trial-and-error, and RL training with an over-turn masking heuristic. The central claim that accuracy improves with turn count beyond the training cap of six is reported as an observed inference-time behavior on held-out visual search tasks, not as a quantity algebraically or statistically forced by the training limit or masking rule. No equations, self-definitional normalizations, or load-bearing self-citations reduce the reported scaling or accuracy gains to the fitted inputs by construction; the results remain externally falsifiable against standard benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central scaling claim rests on the unverified quality of the constructed Visual Probe Dataset and the assumption that the masking strategy preserves learning signal for longer trajectories.

free parameters (1)

maximum interaction turns during training
Upper bound of six turns chosen for training efficiency; directly affects what trajectories are collected and masked.

axioms (1)

domain assumption: The Visual Probe Dataset contains problems that elicit diverse exploratory reasoning patterns when solved by the base model.
Invoked to justify the iterative data collection step; no external validation of dataset difficulty or pattern diversity is described.

invented entities (1)

over-turn masking strategy (no independent evidence)
purpose: Prevents penalization of responses that hit the maximum turn limit during RL training.
New training modification introduced to allow test-time scaling; no independent evidence of its effect outside the reported experiments.

pith-pipeline@v0.9.0 · 5791 in / 1277 out tokens · 43576 ms · 2026-05-18T01:13:09.828737+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS 2026-04 unverdicted novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Visual Reasoning through Tool-supervised Reinforcement Learning
cs.CV 2026-04 unverdicted novelty 6.0

ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
AdaTooler-V: Adaptive Tool-Use for Images and Videos
cs.CV 2025-12 conditional novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
Boosting Reasoning in Large Multimodal Models via Activation Replay
cs.CV 2025-11 unverdicted novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
cs.CV 2025-11 unverdicted novelty 6.0

CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.
DeepEyesV2: Toward Agentic Multimodal Model
cs.CV 2025-11 unverdicted novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
cs.CV 2026-04 unverdicted novelty 5.0

Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
cs.CV 2026-04 unverdicted novelty 5.0

HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
cs.CV 2026-04 unverdicted novelty 5.0

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
cs.CV 2026-04 unverdicted novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
cs.CV 2026-03 unverdicted novelty 5.0

A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 16 Pith papers · 25 internal anchors

[1]

End-to-end rl training for emerging agentic capabilities, 2025

Moonshot AI. End-to-end rl training for emerging agentic capabilities, 2025. URLhttps://moonshotai.github. io/Kimi-Researcher/

work page 2025
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advancesin neural information processing systems, 35:23716–23736, 2022

work page 2022
[3]

Claude 3.5 Sonnet

Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet/. Technical Report, 2024

work page 2024
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

work page 2024
[6]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[7]

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontiers of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images

Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. InEuropean Conference on Computer Vision, pages 390–406. Springer, 2024

work page 2024
[10]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizingreasoningcapabilityinmultimodallargelanguagemodels. arXivpreprintarXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, and Ziwei Liu. High-resolution visual reasoning via multi-turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019. URLhttps://api.semanticscholar.org/CorpusID:198489118

work page 2019
[15]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024

work page arXiv 2024
[17]

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding, 2025

Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding, 2025. URLhttps://arxiv.org/abs/2504.14920

work page arXiv 2025
[18]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 12

work page 2023
[19]

WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Remax: A simple, effective, and efficient method for aligning large language models

Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, RUoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient method for aligning large language models. 2023

work page 2023
[21]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

work page 2024
[22]

Visual instruction tuning.Advances in neural information processing systems, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

work page 2024
[23]

Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InConference on Language Modeling (COLM), 2025

work page 2025
[24]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Negar Maleki, Balaji Padmanabhan, and Kaushik Dutta

Xinji Mai, Haotian Xu, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang, et al. Agent rl scaling law: Agent rl with spontaneous code execution for mathematical problem solving.arXiv preprint arXiv:2505.07773, 2025

work page arXiv 2025
[26]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm- eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.https://ai.meta.com/ blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.https://ai.meta.com/ blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Technical Report, 2024

work page 2024
[28]

Introducing o3 and o4-mini, 2025

OpenAI. Introducing o3 and o4-mini, 2025. URLhttps://openai.com/index/introducing-o3-and-o4-mini/

work page 2025
[29]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models.CoRR, 2024

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models.CoRR, 2024

work page 2024
[31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Webshaper: Agentically data synthesizing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

work page arXiv 2025
[34]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[35]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

[36]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

[37]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025

[38]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[39]

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992

[40]

MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670, 2025

[41]

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

[42]

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning, 2025. URL https://arxiv.org/abs/2509.02479

[43]

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. arXiv preprint arXiv:2507.13348, 2025

[44]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[45]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025

[46]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025

[47]

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

[48]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

[49]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

[50]

R1-Zero’s "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-zero’s "aha moment" in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132, 2025

[51]

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo. arXiv preprint arXiv:2505.21457, 2025

Appendix A More illustrations of multi-turn trajectories

Turn 1: The user is asking for the direction of...
