arxiv: 2509.22746 · v2 · pith:XRTUF2IInew · submitted 2025-09-26 · 💻 cs.AI · cs.CV

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

Pith reviewed 2026-05-18 13:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords visual reasoning · adaptive reasoning · mode selection · mixture of thoughts · reinforcement learning · context-adaptive · multimodal models · general reasoning

The pith

A single model can unify multiple visual reasoning modes and learn to select the right one based on context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Mixture-of-Visual-Thoughts to bring different reasoning modes together inside one model rather than training separate systems for narrow domains. It introduces the AdaVaR framework that first joins the modes through supervised training and then uses reinforcement learning with the AdaGRPO algorithm to develop the ability to pick the suitable mode for each input. A sympathetic reader would care because existing visual reasoning approaches often improve only in specific settings while failing to generalize. If the claim holds, it offers a path to more flexible models that handle varied visual tasks without mode-specific retraining.

Core claim

Mixture-of-Visual-Thoughts unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. This is realized through AdaVaR, a two-stage Adaptive Visual Reasoning learning framework where modes are unified and learned during the supervised cold-start stage and mode selection is induced via reinforcement learning with the AdaGRPO algorithm. Experiments demonstrate that AdaVaR guides the model to learn and differentiate multiple modes while performing context-adaptive selection and delivering consistent gains across scenarios.

What carries the argument

AdaVaR, the two-stage framework that first unifies and jointly trains multiple reasoning modes through supervised learning then applies reinforcement learning with AdaGRPO to induce context-adaptive mode selection.
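The reinforcement-learning stage can be pictured with a minimal sketch of the group-relative advantage used in standard GRPO, paired with a hypothetical way of forcing reasoning modes through prefixes. This page does not reproduce the AdaGRPO equations, so this is not the authors' algorithm: the mode names, the prefix format, the rollout split, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group (standard GRPO idea)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def build_rollout_prefixes(modes, rollouts_per_mode, free_rollouts):
    """Hypothetical rollout plan: some samples are forced into a mode via a prefix,
    the remaining ones start with an empty prefix so the model picks the mode itself."""
    prefixes = []
    for mode in modes:
        prefixes.extend([f"<{mode}>"] * rollouts_per_mode)
    prefixes.extend([""] * free_rollouts)
    return prefixes


# Illustrative group: two hypothetical modes, binary accuracy rewards.
prefixes = build_rollout_prefixes(["mode_a", "mode_b"], rollouts_per_mode=2, free_rollouts=1)
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.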

If this is right

The model learns to differentiate multiple reasoning modes within a shared parameter set.
Context-adaptive selection leads to measurable gains on diverse visual reasoning benchmarks.
A single trained system can handle scenarios that previously required separate specialized models.
The two-stage process of supervised unification followed by reinforcement learning produces stable mode selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unification-plus-reinforcement pattern could be tested on non-visual reasoning tasks such as textual or auditory inputs.
If mode selection generalizes, it reduces the need to maintain multiple domain-specific models for visual problems.
A direct test would measure whether the learned selection rule transfers to entirely novel contexts absent from the reinforcement learning phase.

Load-bearing premise

Different reasoning modes can be successfully unified and jointly learned in one model during the initial supervised stage so that later reinforcement learning reliably produces context-sensitive selection instead of mode collapse or overfitting to training examples.

What would settle it

Training a model with AdaVaR and then observing no performance gain over single-mode baselines on out-of-distribution visual reasoning tasks or no evidence that the model switches between distinct modes for inputs that require different reasoning styles would falsify the central claim.
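One way to check the mode-switching half of this test is sketched below, assuming evaluation logs are available as (category, selected mode) pairs. The record format, the category names, and the mode labels are hypothetical. The entropy measure is an illustrative diagnostic, not one taken from the paper.

```python
import math
from collections import Counter, defaultdict


def mode_proportions(records):
    """Return, per task category, the share of inputs routed to each mode."""
    counts = defaultdict(Counter)
    for category, mode in records:
        counts[category][mode] += 1
    return {
        category: {mode: n / sum(c.values()) for mode, n in c.items()}
        for category, c in counts.items()
    }


def selection_entropy(proportions):
    """Shannon entropy in bits of one category's mode distribution (0 means one mode only)."""
    return -sum(p * math.log2(p) for p in proportions.values() if p > 0)


# Hypothetical log: (category, mode chosen by the model).
records = (
    [("math", "mode_a")] * 3
    + [("math", "mode_b")]
    + [("perception", "mode_b")] * 4
)
props = mode_proportions(records)
entropies = {category: selection_entropy(p) for category, p in props.items()}
```

If the model picks the same dominant mode in every category, including categories that call for different reasoning styles, that points to mode collapse. Dominant modes that differ across categories are consistent with context-adaptive selection, though they do not prove it on their own.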

Figures

Figures reproduced from arXiv: 2509.22746 by Bo Zheng, Jiwen Zhang, Jun Song, Runzhou Zhao, Siyuan Wang, Yang Yao, Yingxiu Zhao, Zejun Li, Zhongyu Wei.

**Figure 2.** Demonstration of GRPO and AdaGRPO. In AdaGRPO, we use mode prefixes to guide ...

**Figure 3.** Math-related training dynamics and evaluation metrics during Stage 2. ADA, TXT, and ...

**Figure 4.** Proportion of each mode selected by different models across categories in MMStar.

**Figure 5.** Proportion of each mode selected by different models across sub-categories in MMStar.

**Figure 6.** Accuracy reward curves for different modes across different tasks during RL.

**Figure 7.** Cases from MathVista. The pointing-finger icon indicates the mode selected by AdaVaR.

**Figure 8.** Cases from other math-oriented benchmarks.

**Figure 9.** Cases from V* and POPE. Please zoom in for a better view of small objects in images.

**Figure 10.** Cases from MMStar.

**Figure 11.** Cases from SpatialScore.

Original abstract

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's two-stage setup for unifying visual reasoning modes then using RL to induce context-adaptive selection is a reasonable idea, but the abstract leaves the central claim under-supported.

The main point is that they want one model to handle multiple visual reasoning styles instead of picking one mode per task. They do this with a supervised cold-start to learn the modes together, then RL via their AdaGRPO algorithm to train the model to choose based on context. That framing is straightforward and addresses a practical gap in current visual reasoning work that tends to lock into narrow strategies. The experiments are described as showing consistent gains across scenarios, which is the kind of result that could interest people building general multimodal systems. What stands out is the explicit two-stage split and the focus on mode differentiation rather than just scaling one approach.

The soft spots are more noticeable. The abstract gives no numbers, no baseline comparisons, and no ablations on whether the RL stage actually changes selection behavior or simply reinforces patterns from the cold-start data. Without those controls it is difficult to rule out mode collapse or overfitting to the training contexts. The reward design in AdaGRPO is called carefully designed, but that detail matters a lot and needs to be shown rather than asserted. The stress-test concern about whether the cold-start already interferes with distinct modes and whether RL produces genuine adaptation rather than memorization looks like it still applies on the basis of what is written.

This is for people working on adaptive or mixture-style methods in vision-language models who want a concrete training recipe. A reader already familiar with RL for reasoning would get the most out of it. It deserves peer review because the problem is real and the proposed structure is clear enough to evaluate, even if the current evidence is preliminary. I would send it out with requests for the missing ablations and mode-selection diagnostics.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Mixture-of-Visual-Thoughts (MoVT), a paradigm that unifies multiple reasoning modes in a single model for context-adaptive selection in general visual reasoning. It introduces the AdaVaR two-stage framework: a supervised cold-start SFT stage to jointly learn and unify distinct modes, followed by an RL stage using the AdaGRPO algorithm to induce context-sensitive mode selection. The abstract claims that this produces mode differentiation and consistent performance gains across scenarios.

Significance. If the empirical results hold, the work could support progress toward general visual reasoning models by showing that mode unification and adaptive routing are feasible without domain-specific specialization. The two-stage SFT+RL structure is a known pattern, but its application here to explicit mode differentiation would be a useful contribution if supported by controls for collapse and context sensitivity.

major comments (2)

[Abstract] The central claim of 'consistent improvement across various scenarios' and effective 'context-adaptive mode selection' is asserted without any quantitative results, baselines, ablation studies, or metrics. This prevents assessment of whether the data actually support the claim that AdaVaR induces true adaptive routing rather than collapse or memorization.
[Abstract, AdaGRPO description] The reward structure is described only as 'carefully designed,' with no equations, loss terms, or controls showing that selection changes with context rather than defaulting to a single mode or overfitting training contexts. This directly bears on the weakest assumption that cold-start unification plus RL will produce reliable context-sensitive selection.

minor comments (1)

Clarify the precise relationship between the MoVT paradigm and the AdaVaR framework name; the abstract introduces both without distinguishing their scopes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address the two major points on the abstract below, providing clarifications from the full manuscript while agreeing to strengthen the abstract for better readability and support of the claims.

Referee: [Abstract] The central claim of 'consistent improvement across various scenarios' and effective 'context-adaptive mode selection' is asserted without any quantitative results, baselines, ablation studies, or metrics. This prevents assessment of whether the data actually support the claim that AdaVaR induces true adaptive routing rather than collapse or memorization.

Authors: The abstract is intended as a high-level summary. The full manuscript provides extensive quantitative results in the Experiments section, including performance tables with baselines, ablations on mode unification and selection, and analyses (e.g., mode usage distributions across contexts) that demonstrate consistent gains and evidence against collapse or simple memorization. We will revise the abstract to incorporate key quantitative highlights, such as average accuracy improvements and references to the mode differentiation metrics, to better substantiate the claims upfront.

Revision: yes

Referee: [Abstract, AdaGRPO description] The reward structure is described only as 'carefully designed,' with no equations, loss terms, or controls showing that selection changes with context rather than defaulting to a single mode or overfitting training contexts. This directly bears on the weakest assumption that cold-start unification plus RL will produce reliable context-sensitive selection.

Authors: Space limitations in the abstract led to the concise phrasing. The Method section details the AdaGRPO reward formulation, including the specific reward components, the GRPO loss terms, and experimental controls (such as context-variation tests and mode-probability tracking) that show selection adapts rather than collapsing to one mode or overfitting. We will revise the abstract to briefly note the reward's context-sensitivity incentives and point readers to the full algorithmic description and supporting analyses in the paper.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no self-referential derivations or fitted predictions

The paper presents an empirical two-stage learning framework (AdaVaR) consisting of supervised cold-start unification of modes followed by RL with AdaGRPO for inducing context-adaptive selection. No equations, derivations, or first-principles results are described that reduce by construction to the inputs. The abstract and provided text contain no self-definitional steps, no renaming of known results as new predictions, and no load-bearing self-citations that close a loop. Claims rest on experimental outcomes across scenarios rather than tautological reductions. The 'carefully designed' qualifier on AdaGRPO describes a methodological choice but does not exhibit a specific reduction (e.g., reward = outcome by construction) without further equations or controls shown. This is a standard empirical proposal and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The 'carefully designed' AdaGRPO reward and the mode-unification process may implicitly contain fitted or hand-chosen components.

free parameters (1)

AdaGRPO reward design parameters
The algorithm is described as carefully designed, suggesting possible hand-tuned or data-fitted reward components that affect mode selection.

pith-pipeline@v0.9.0 · 5705 in / 1160 out tokens · 51608 ms · 2026-05-18T13:29:31.430149+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 26 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 17682--17690, 2024

work page 2024
[4]

Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

work page arXiv 2025
[5]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv:2306.15195, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 0 27056--27087, 2024 a

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 0 27056--27087, 2024 a

work page 2024
[8]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67 0 (12): 0 220101, 2024 b

work page 2024
[9]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Dama...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Insight-v: Exploring long-chain visual reasoning with multimodal large language models

Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 9062--9072, 2025

work page 2025
[11]

Virgo: A preliminary exploration on reproducing o1-like mllm

Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm. arXiv preprint arXiv:2501.01904, 2025

work page arXiv 2025
[12]

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images, 2025. URL https://arxiv.org/abs/2505.15879

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

G-llava: Solving geometric problem with multi-modal large language model, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023

work page 2023
[14]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. URL https://arxiv.org/abs/2503.06749

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Kimi-Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 0 22199--22213, 2022

work page 2022
[18]

Hypertree proof search for neural theorem proving

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022

work page 2022
[19]

Scaffolding coordinates to promote vision-language coordination in large multi-modal models, 2024

Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models, 2024

work page 2024
[20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408.03326

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Numinamath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...

work page 2024
[22]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919

work page arXiv 2025
[24]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, and Junjie Yan. One rl to see them all: Visual triple unified reinforcement learning. arXiv preprint arXiv:2505.18129, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36: 0 46534--46594, 2023

work page 2023
[28]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. URL https://arxiv.org/abs/2503.07365

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Compositional chain-of-thought prompting for large multimodal models

Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 14420--14431, 2024

work page 2024
[30]

Omnicount: Multi-label object counting with semantic-geometric priors

Anindya Mondal, Sauradip Nag, Xiatian Zhu, and Anjan Dutta. Omnicount: Multi-label object counting with semantic-geometric priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 19537--19545, 2025

work page 2025
[31]

Thinking with images

OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

work page 2025
[32]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

QwQ-32B : Embracing the power of reinforcement learning, March 2025

Qwen Team . QwQ-32B : Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[35]

Grounded reinforcement learning for visual reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. arXiv preprint arXiv:2505.23678, 2025

work page arXiv 2025
[36]

Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models. arXiv preprint arXiv:2403.16999, 2024 a

work page arXiv 2024
[37]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. arXiv preprint arXiv:2411.16044, 2024

work page arXiv 2024
[39]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023

work page 2023
[40]

Llamav-o1: Rethinking step-by-step visual reasoning in llms

Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025

work page arXiv 2025
[41]

Toward self-improvement of llms via imagination, searching, and criticizing

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. Advances in Neural Information Processing Systems, 37: 0 52723--52748, 2024

work page 2024
[42]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URL https://arxiv.org/abs/2402.14804

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Self-consistency improves chain of thought reasoning in language models, 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

work page 2023
[44]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

work page 2022
[45]

Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning

Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, et al. Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255, 2025

work page arXiv 2025
[47]

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding, 2025 b . URL https://arxiv.org/abs/2505.17012

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023

work page arXiv 2023
[49]

Grounded chain-of-thought for multimodal large language models

Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799, 2025 c

work page arXiv 2025
[50]

Self-evaluation guided beam search for reasoning

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36: 0 41618--41650, 2023

work page 2023
[51]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Geosense: Evaluating identification and application of geometric principles in multimodal reasoning

Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, et al. Geosense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025

work page arXiv 2025
[53]

Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v, 2023

work page 2023
[54]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URL https://arxiv.org/abs/2503.10615

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024a

work page arXiv 2024
[56]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024b

work page 2024
[57]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024a. URL https://arxiv.org/abs/2403.14624

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Improve vision language model chain-of-thought reasoning

Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. arXiv preprint arXiv:2410.16198, 2024b

work page arXiv 2024
[60]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl, 2025. URL https://arxiv.org/abs/2505.15436

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.14362

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Image-of-thought prompting for visual reasoning refinement in multimodal large language models

Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

work page arXiv 2024
[63]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

work page internal anchor Pith review Pith/arXiv arXiv 2025