pith. machine review for the scientific record. sign in

arxiv: 2509.22746 · v2 · pith:XRTUF2IInew · submitted 2025-09-26 · 💻 cs.AI · cs.CV

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Pith reviewed 2026-05-18 13:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords visual reasoningadaptive reasoningmode selectionmixture of thoughtsreinforcement learningcontext-adaptivemultimodal modelsgeneral reasoning
0
0 comments X

The pith

A single model can unify multiple visual reasoning modes and learn to select the right one based on context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Mixture-of-Visual-Thoughts to bring different reasoning modes together inside one model rather than training separate systems for narrow domains. It introduces the AdaVaR framework that first joins the modes through supervised training and then uses reinforcement learning with the AdaGRPO algorithm to develop the ability to pick the suitable mode for each input. A sympathetic reader would care because existing visual reasoning approaches often improve only in specific settings while failing to generalize. If the claim holds, it offers a path to more flexible models that handle varied visual tasks without mode-specific retraining.

Core claim

Mixture-of-Visual-Thoughts unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. This is realized through AdaVaR, a two-stage Adaptive Visual Reasoning learning framework where modes are unified and learned during the supervised cold-start stage and mode selection is induced via reinforcement learning with the AdaGRPO algorithm. Experiments demonstrate that AdaVaR guides the model to learn and differentiate multiple modes while performing context-adaptive selection and delivering consistent gains across scenarios.

What carries the argument

AdaVaR, the two-stage framework that first unifies and jointly trains multiple reasoning modes through supervised learning then applies reinforcement learning with AdaGRPO to induce context-adaptive mode selection.

If this is right

  • The model learns to differentiate multiple reasoning modes within a shared parameter set.
  • Context-adaptive selection leads to measurable gains on diverse visual reasoning benchmarks.
  • A single trained system can handle scenarios that previously required separate specialized models.
  • The two-stage process of supervised unification followed by reinforcement learning produces stable mode selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unification-plus-reinforcement pattern could be tested on non-visual reasoning tasks such as textual or auditory inputs.
  • If mode selection generalizes, it reduces the need to maintain multiple domain-specific models for visual problems.
  • A direct test would measure whether the learned selection rule transfers to entirely novel contexts absent from the reinforcement learning phase.

Load-bearing premise

Different reasoning modes can be successfully unified and jointly learned in one model during the initial supervised stage so that later reinforcement learning reliably produces context-sensitive selection instead of mode collapse or overfitting to training examples.

What would settle it

Training a model with AdaVaR and then observing no performance gain over single-mode baselines on out-of-distribution visual reasoning tasks or no evidence that the model switches between distinct modes for inputs that require different reasoning styles would falsify the central claim.

Figures

Figures reproduced from arXiv: 2509.22746 by Bo Zheng, Jiwen Zhang, Jun Song, Runzhou Zhao, Siyuan Wang, Yang Yao, Yingxiu Zhao, Zejun Li, Zhongyu Wei.

Figure 1
Figure 1. Figure 1: Comparison between two reasoning modes. (b) Performance gains (positive values) and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Demonstration of GRPO and AdaGRPO. In AdaGRPO, we use mode prefixes to guide [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Math-related training dynamics and evaluation metrics during Stage 2. ADA, TXT, and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Proportion of each mode selected by different models across categories in MMStar. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Proportion of each mode selected by different models across sub-categories in MMStar. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy reward curves for different modes across different tasks during RL. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Cases from MathVista. The pointing-finger icon indicates the mode selected by AdaVaR. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Cases from other math-oriented benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Cases from V* and POPE. Please zoom in for a better view of small objects in images. [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Cases from MMStar. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Cases from SpatialScore. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
read the original abstract

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Mixture-of-Visual-Thoughts (MoVT), a paradigm that unifies multiple reasoning modes in a single model for context-adaptive selection in general visual reasoning. It introduces the AdaVaR two-stage framework: a supervised cold-start SFT stage to jointly learn and unify distinct modes, followed by an RL stage using the AdaGRPO algorithm to induce context-sensitive mode selection. The abstract claims that this produces mode differentiation and consistent performance gains across scenarios.

Significance. If the empirical results hold, the work could support progress toward general visual reasoning models by showing that mode unification and adaptive routing are feasible without domain-specific specialization. The two-stage SFT+RL structure is a known pattern, but its application here to explicit mode differentiation would be a useful contribution if supported by controls for collapse and context sensitivity.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'consistent improvement across various scenarios' and effective 'context-adaptive mode selection' is asserted without any quantitative results, baselines, ablation studies, or metrics. This prevents assessment of whether the data actually support the claim that AdaVaR induces true adaptive routing rather than collapse or memorization.
  2. [Abstract] Abstract (AdaGRPO description): the reward structure is described only as 'carefully designed,' with no equations, loss terms, or controls showing that selection changes with context rather than defaulting to a single mode or overfitting training contexts. This directly bears on the weakest assumption that cold-start unification plus RL will produce reliable context-sensitive selection.
minor comments (1)
  1. Clarify the precise relationship between the MoVT paradigm and the AdaVaR framework name; the abstract introduces both without distinguishing their scopes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address the two major points on the abstract below, providing clarifications from the full manuscript while agreeing to strengthen the abstract for better readability and support of the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'consistent improvement across various scenarios' and effective 'context-adaptive mode selection' is asserted without any quantitative results, baselines, ablation studies, or metrics. This prevents assessment of whether the data actually support the claim that AdaVaR induces true adaptive routing rather than collapse or memorization.

    Authors: The abstract is intended as a high-level summary. The full manuscript provides extensive quantitative results in the Experiments section, including performance tables with baselines, ablations on mode unification and selection, and analyses (e.g., mode usage distributions across contexts) that demonstrate consistent gains and evidence against collapse or simple memorization. We will revise the abstract to incorporate key quantitative highlights, such as average accuracy improvements and references to the mode differentiation metrics, to better substantiate the claims upfront. revision: yes

  2. Referee: [Abstract] Abstract (AdaGRPO description): the reward structure is described only as 'carefully designed,' with no equations, loss terms, or controls showing that selection changes with context rather than defaulting to a single mode or overfitting training contexts. This directly bears on the weakest assumption that cold-start unification plus RL will produce reliable context-sensitive selection.

    Authors: Space limitations in the abstract led to the concise phrasing. The Method section details the AdaGRPO reward formulation, including the specific reward components, the GRPO loss terms, and experimental controls (such as context-variation tests and mode-probability tracking) that show selection adapts rather than collapsing to one mode or overfitting. We will revise the abstract to briefly note the reward's context-sensitivity incentives and point readers to the full algorithmic description and supporting analyses in the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no self-referential derivations or fitted predictions

full rationale

The paper presents an empirical two-stage learning framework (AdaVaR) consisting of supervised cold-start unification of modes followed by RL with AdaGRPO for inducing context-adaptive selection. No equations, derivations, or first-principles results are described that reduce by construction to the inputs. The abstract and provided text contain no self-definitional steps, no renaming of known results as new predictions, and no load-bearing self-citations that close a loop. Claims rest on experimental outcomes across scenarios rather than tautological reductions. The 'carefully designed' qualifier on AdaGRPO describes a methodological choice but does not exhibit a specific reduction (e.g., reward = outcome by construction) without further equations or controls shown. This is a standard empirical proposal and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The 'carefully designed' AdaGRPO reward and the mode-unification process may implicitly contain fitted or hand-chosen components.

free parameters (1)
  • AdaGRPO reward design parameters
    The algorithm is described as carefully designed, suggesting possible hand-tuned or data-fitted reward components that affect mode selection.

pith-pipeline@v0.9.0 · 5705 in / 1160 out tokens · 51608 ms · 2026-05-18T13:29:31.430149+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 26 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

  3. [3]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 17682--17690, 2024

  4. [4]

    Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

    Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

  5. [5]

    SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025

  6. [6]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv:2306.15195, 2023

  7. [7]

    Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 0 27056--27087, 2024 a

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 0 27056--27087, 2024 a

  8. [8]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67 0 (12): 0 220101, 2024 b

  9. [9]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Dama...

  10. [10]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models

    Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 9062--9072, 2025

  11. [11]

    Virgo: A preliminary exploration on reproducing o1-like mllm

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm. arXiv preprint arXiv:2501.01904, 2025

  12. [12]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images, 2025. URL https://arxiv.org/abs/2505.15879

  13. [13]

    G-llava: Solving geometric problem with multi-modal large language model, 2023

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023

  14. [14]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. URL https://arxiv.org/abs/2503.06749

  15. [15]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  16. [16]

    Kimi-Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

  17. [17]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 0 22199--22213, 2022

  18. [18]

    Hypertree proof search for neural theorem proving

    Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022

  19. [19]

    Scaffolding coordinates to promote vision-language coordination in large multi-modal models, 2024

    Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models, 2024

  20. [20]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408.03326

  21. [21]

    Numinamath

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...

  22. [22]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023

  23. [23]

    Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025

    Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. Vocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919

  24. [24]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

  25. [25]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  26. [26]

    One RL to See Them All: Visual Triple Unified Reinforcement Learning

    Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, and Junjie Yan. One rl to see them all: Visual triple unified reinforcement learning. arXiv preprint arXiv:2505.18129, 2025

  27. [27]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36: 0 46534--46594, 2023

  28. [28]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. URL https://arxiv.org/abs/2503.07365

  29. [29]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 14420--14431, 2024

  30. [30]

    Omnicount: Multi-label object counting with semantic-geometric priors

    Anindya Mondal, Sauradip Nag, Xiatian Zhu, and Anjan Dutta. Omnicount: Multi-label object counting with semantic-geometric priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 19537--19545, 2025

  31. [31]

    Thinking with images

    OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

  32. [32]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025

  33. [33]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

  34. [34]

    QwQ-32B : Embracing the power of reinforcement learning, March 2025

    Qwen Team . QwQ-32B : Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  35. [35]

    Grounded reinforcement learning for visual reasoning

    Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. arXiv preprint arXiv:2505.23678, 2025

  36. [36]

    Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models. arXiv preprint arXiv:2403.16999, 2024 a

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024 b

  38. [38]

    Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration

    Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. arXiv preprint arXiv:2411.16044, 2024

  39. [39]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023

  40. [40]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025

  41. [41]

    Toward self-improvement of llms via imagination, searching, and criticizing

    Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. Advances in Neural Information Processing Systems, 37: 0 52723--52748, 2024

  42. [42]

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URL https://arxiv.org/abs/2402.14804

  43. [43]

    Self-consistency improves chain of thought reasoning in language models, 2023

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

  44. [44]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

  45. [45]

    Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning

    Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, et al. Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255, 2025

  46. [47]

    SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

    Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding, 2025 b . URL https://arxiv.org/abs/2505.17012

  47. [48]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023

  48. [49]

    Grounded chain-of-thought for multimodal large language models

    Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799, 2025 c

  49. [50]

    Self-evaluation guided beam search for reasoning

    Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36: 0 41618--41650, 2023

  50. [51]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

  51. [52]

    Geosense: Evaluating identification and application of geometric principles in multimodal reasoning

    Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, et al. Geosense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025

  52. [53]

    Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v, 2023

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v, 2023

  53. [54]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URL https://arxiv.org/abs/2503.10615

  54. [55]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024 a

  55. [56]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024 b

  56. [57]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  57. [58]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024 a . URL https://arxiv.org/abs/2403.14624

  58. [59]

    Improve vision language model chain-of-thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. arXiv preprint arXiv:2410.16198, 2024 b

  59. [60]

    Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl, 2025. URL https://arxiv.org/abs/2505.15436

  60. [61]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.14362

  61. [62]

    Image-of-thought prompting for visual reasoning refinement in multimodal large language models

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

  62. [63]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

  63. [64]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  64. [65]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  65. [66]

    PlayStation

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...