Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Pith reviewed 2026-05-18 01:13 UTC · model grok-4.3
The pith
Mini-o3 is trained with a cap of six interaction turns, yet at inference it produces longer reasoning chains, and its accuracy on visual search tasks improves as those chains lengthen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mini-o3 executes deep multi-turn reasoning spanning tens of steps on visual search tasks. It achieves this with a Visual Probe Dataset of challenging problems, an iterative pipeline that yields cold-start trajectories containing diverse patterns including depth-first search, trial-and-error, and goal maintenance, and an over-turn masking strategy in reinforcement learning that avoids penalizing responses reaching the maximum turn count. Despite training under a six-turn upper bound, the resulting model generates longer trajectories at inference time and shows rising accuracy with additional turns.
What carries the argument
Over-turn masking strategy during reinforcement learning that prevents penalization of responses hitting the turn limit, allowing test-time trajectories to exceed the six-turn training bound.
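A minimal sketch of how such a mask could enter a policy-gradient loss, assuming a group-normalized advantage in the style of GRPO and one summed log-probability per rollout. The function names, the `over_turn` flag, and the choice to average over unmasked rollouts only are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_policy_loss(rewards, log_probs, over_turn, use_mask=True):
    """REINFORCE-style surrogate loss with optional over-turn masking.

    rewards:   one scalar reward per rollout (assumed 0.0 when no answer was given)
    log_probs: summed log-probability of each rollout under the policy
    over_turn: True where the rollout hit the turn cap without answering
    """
    advantages = group_advantages(rewards)
    # With the mask on, over-turn rollouts contribute nothing to the loss,
    # so they are neither rewarded nor penalized.
    keep = [0.0 if (use_mask and hit) else 1.0 for hit in over_turn]
    denom = sum(keep)
    if denom == 0:
        return 0.0
    total = sum(k * a * lp for k, a, lp in zip(keep, advantages, log_probs))
    return -total / denom


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    over_turn = [False, False, True, False]  # third rollout ran out of turns
    log_probs = [-1.0, -2.0, -3.0, -1.5]
    print(masked_policy_loss(rewards, log_probs, over_turn))
    print(masked_policy_loss(rewards, log_probs, over_turn, use_mask=False))
```

Without the mask, the third rollout carries a negative advantage and its log-probability is pushed down, which is the penalty the strategy is described as avoiding. With the mask, the loss does not depend on that rollout at all.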
If this is right
- Accuracy on visual search problems continues to rise as the number of allowed interaction turns increases at inference time.
- The model produces varied reasoning patterns such as depth-first search and trial-and-error without explicit training on each pattern.
- State-of-the-art results are reached on challenging visual search tasks that require extended exploration.
Where Pith is reading between the lines
- The masking technique may serve as a general method to encourage longer reasoning horizons in other tool-use settings without having to train on those longer horizons.
- The same data-collection loop could be repeated to target even deeper search behaviors on different visual or multimodal problems.
- If the scaling holds, training compute can remain modest while inference budgets are adjusted per task difficulty.
Load-bearing premise
The iterative data collection pipeline yields cold-start trajectories whose diverse reasoning patterns transfer to longer chains without systematic bias introduced by the masking rule.
What would settle it
Measure accuracy while steadily increasing the allowed inference turns; the claim is falsified if accuracy stops rising or begins to fall after a modest number of additional turns.
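The check described above can be sketched as a small helper that scans an accuracy-versus-turn-budget table for the first budget at which accuracy stops rising. The function name, the `tolerance` parameter, and the numbers in the usage example are placeholders for illustration, not results from the paper.

```python
def first_plateau(acc_by_budget, tolerance=0.0):
    """Return the first turn budget whose accuracy gain over the previous
    budget is at most `tolerance`, or None if accuracy rises throughout.

    acc_by_budget: dict mapping max inference turns -> measured accuracy
    """
    budgets = sorted(acc_by_budget)
    for prev, cur in zip(budgets, budgets[1:]):
        if acc_by_budget[cur] - acc_by_budget[prev] <= tolerance:
            return cur
    return None


if __name__ == "__main__":
    # Placeholder values only; substitute measured accuracies.
    measured = {4: 0.40, 8: 0.45, 16: 0.48, 32: 0.48}
    print(first_plateau(measured))
```

A non-`None` result at a modest budget would count against the scaling claim. In practice `tolerance` would be set from the spread across seeds, so that noise is not mistaken for a plateau or for a gain.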
Original abstract
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mini-o3 for scaling tool-based multi-turn reasoning in visual search tasks with large multimodal models. It constructs the Visual Probe Dataset of challenging problems, uses an iterative pipeline to collect cold-start trajectories exhibiting diverse patterns (depth-first search, trial-and-error, goal maintenance), and applies an over-turn masking strategy during RL training. The central claim is that a model trained with a hard cap of only six interaction turns produces trajectories that naturally extend to tens of turns at inference time, with accuracy continuing to improve as turn count grows, yielding SOTA results on difficult visual search problems.
Significance. If the scaling behavior is shown to arise from transferable reasoning patterns rather than an artifact of the masking strategy, the work would provide a practical open-source recipe for longer-horizon exploratory visual reasoning, addressing current limitations of monotonous patterns and short interaction limits in multimodal agents. The dataset construction and iterative collection pipeline are concrete contributions that could be reused, though the absence of reported quantitative metrics, baselines, and ablations in the abstract limits immediate assessment of impact.
major comments (3)
- [Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).
- [Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.
- [Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.
minor comments (1)
- [Abstract] Abstract: 'state-of-the-art performance' is asserted without naming the specific benchmarks, prior open-source baselines, or exact metrics used for comparison.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments correctly identify areas where the abstract could more explicitly connect to the quantitative evidence and analyses in the main text. We address each point below and have revised the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: [Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).
  Authors: We agree that the abstract should reference the supporting quantitative results. The main text includes Figure 3, which plots accuracy versus number of inference turns (with error bars from multiple seeds) and compares against a hard-stop baseline. In the revised manuscript we have updated the abstract to cite this figure and briefly note the observed trend of continued accuracy gains beyond the six-turn training cap.
  Revision: yes
- Referee: [Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.
  Authors: This is a fair criticism. While Section 3.3 motivates the over-turn masking strategy, the initial submission did not contain a direct ablation. We have added an ablation study to the revised version (new Table 4 and accompanying text in Section 4.3) that trains an otherwise identical model without the mask and compares resulting trajectory lengths and accuracies at inference. The results indicate that masking permits longer productive chains without introducing the artifacts a hard stop would produce.
  Revision: yes
- Referee: [Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.
  Authors: We appreciate the request for quantitative support. Section 3.2 describes the iterative collection pipeline and provides qualitative examples of the patterns. To address the gap, the revised manuscript now includes a table (new Table 2) reporting the distribution of reasoning patterns across collected trajectories, termination statistics, and a short bias analysis of the masking procedure. We have also added a reference to this table in the abstract.
  Revision: yes
Circularity Check
Empirical training pipeline shows no definitional circularity
Full rationale
The paper presents an empirical recipe consisting of dataset construction, iterative trajectory collection exhibiting patterns such as depth-first search and trial-and-error, and RL training with an over-turn masking heuristic. The central claim that accuracy improves with turn count beyond the training cap of six is reported as an observed inference-time behavior on held-out visual search tasks, not as a quantity algebraically or statistically forced by the training limit or masking rule. No equations, self-definitional normalizations, or load-bearing self-citations reduce the reported scaling or accuracy gains to the fitted inputs by construction; the results remain externally falsifiable against standard benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- maximum interaction turns during training
axioms (1)
- domain assumption: The Visual Probe Dataset contains problems that elicit diverse exploratory reasoning patterns when solved by the base model.
invented entities (1)
- over-turn masking strategy: no independent evidence
Forward citations
Cited by 17 Pith papers
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
  VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
  Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
  HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
- Visual Reasoning through Tool-supervised Reinforcement Learning
  ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
  CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
- AdaTooler-V: Adaptive Tool-Use for Images and Videos
  AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
- Boosting Reasoning in Large Multimodal Models via Activation Replay
  Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
- CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
  CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
  Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
  HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
- Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
  A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.