pith. machine review for the scientific record.

arxiv: 2509.23322 · v2 · submitted 2025-09-27 · 💻 cs.CV

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

Pith reviewed 2026-05-18 12:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords large multimodal models · visual grounding · reasoning drift · agentic framework · training-free · multimodal reasoning · MathVision · decoupled reasoning

The pith

A training-free framework decouples reasoning from perception to stop multimodal models from losing visual grounding in long chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that large multimodal models lose visual grounding as their reasoning chains lengthen, increasingly relying on prior text outputs instead of revisiting the image. It proposes the DRP framework to counter this by assigning an LLM as the Reasoner to manage the process and query an LMM Observer for visual details only when necessary. This setup requires no training and can be added to existing models easily. If the approach works, it provides a practical way to improve accuracy in visual reasoning tasks like solving math problems from diagrams. Sympathetic readers would see value in this efficient alternative to retraining entire models for better reliability.

Core claim

The central contribution is a training-free agentic paradigm called DRP that decouples cognitive reasoning from visual perception. A powerful LLM acts as the strategic Reasoner, orchestrating inference by explicitly querying an LMM Observer to retrieve fine-grained visual details on demand. The paper reports that this regulates the visual reasoning trajectory, significantly mitigates reasoning drift, and enforces robust visual grounding.

What carries the argument

The DRP framework, where the LLM Reasoner queries the LMM Observer for specific visual information as the reasoning progresses.
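
The query loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `run_drp`, `reasoner`, and `observer`, the `ASK:`/`ANSWER:` message convention, and the `max_turns` budget are all assumptions made for this example, and the two models are replaced by plain Python callables.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, not taken from the paper or its repository.
# The Reasoner sees only text; the Observer is the only component given the image.
Reasoner = Callable[[str, List[Tuple[str, str]]], str]
Observer = Callable[[bytes, str], str]


def run_drp(question: str, image: bytes, reasoner: Reasoner,
            observer: Observer, max_turns: int = 8) -> str:
    """Alternate text-only reasoning steps with on-demand visual queries."""
    dialogue: List[Tuple[str, str]] = []  # (query, observation) pairs so far
    for _ in range(max_turns):
        step = reasoner(question, dialogue).strip()
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        # Anything else is treated as a visual query (the "ASK:" prefix is optional).
        query = step[len("ASK:"):].strip() if step.startswith("ASK:") else step
        # The query is answered from the image itself, not from earlier text.
        dialogue.append((query, observer(image, query)))
    return "NO_ANSWER"  # turn budget exhausted without a final answer


# Toy stand-ins for the two models, for illustration only.
def toy_reasoner(question: str, dialogue: List[Tuple[str, str]]) -> str:
    if not dialogue:
        return "ASK: How many sides does the polygon have?"
    return "ANSWER: " + dialogue[-1][1]


def toy_observer(image: bytes, query: str) -> str:
    return "3"


result = run_drp("How many sides?", b"", toy_reasoner, toy_observer)
print(result)
```

In this sketch the image is passed only to the Observer, so every visual fact that reaches the Reasoner arrives as the answer to an explicit query. A real setup would replace the two toy functions with calls to an LLM and an LMM.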

If this is right

  • The visual reasoning trajectory is regulated to stay close to the image content.
  • Reasoning drift is significantly mitigated, reducing visually implausible conclusions.
  • Robust visual grounding is enforced throughout the inference process.
  • Performance improves on challenging benchmarks, such as reaching 47.2% accuracy on MathVision with Qwen models, outperforming GPT-4o.
  • The method applies to any LMM in a plug-and-play manner without modifications or training.
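
For scale, the margin implied by the two MathVision figures in the list above works out as follows. The calculation uses only the reported 47.2% and 40.6%; the symbols are labels introduced here, not notation from the paper.

```latex
\begin{align*}
\Delta_{\text{abs}} &= 47.2\% - 40.6\% = 6.6 \text{ percentage points} \\
\Delta_{\text{rel}} &= \frac{47.2 - 40.6}{40.6} \approx 0.163 \quad (\text{about } 16\%)
\end{align*}
```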

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This decoupling strategy might apply to other modalities or sequential tasks where models lose fidelity to initial inputs over time.
  • Combining different sized models in this observer-reasoner setup could lead to more efficient AI systems overall.
  • Explicit querying could be optimized further to minimize the number of visual checks needed for a given task.

Load-bearing premise

The LMM can provide accurate and fine-grained visual details in response to explicit queries from the LLM without requiring any training or model changes.

What would settle it

If the LMM Observer returns incorrect visual information in response to queries on a test set of images with clear but detailed features, the framework would produce the same errors as standard models.
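
One way to run such a check is to score the Observer alone on queries with known answers. The sketch below is hypothetical: `observer_accuracy`, the probe-set format, and the exact-match scoring rule are assumptions made for illustration, and the paper does not describe this procedure.

```python
from typing import Callable, Iterable, Tuple

Observer = Callable[[bytes, str], str]
Probe = Tuple[bytes, str, str]  # (image, query, gold answer); assumed format


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def observer_accuracy(observer: Observer, probes: Iterable[Probe]) -> float:
    """Fraction of probes whose Observer reply exactly matches the gold answer."""
    total = 0
    correct = 0
    for image, query, gold in probes:
        total += 1
        if _normalize(observer(image, query)) == _normalize(gold):
            correct += 1
    if total == 0:
        raise ValueError("probe set is empty")
    return correct / total


# Toy Observer and probe set, for illustration only.
def stub_observer(image: bytes, query: str) -> str:
    return "2" if "circles" in query else "blue"


probes = [
    (b"", "How many circles are there?", "2"),
    (b"", "What color is the triangle?", "Red"),
]
accuracy = observer_accuracy(stub_observer, probes)
print(accuracy)
```

Exact match is a deliberately strict choice. Free-form Observer replies would likely need a softer comparison, which is left out here.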

Figures

Figures reproduced from arXiv: 2509.23322 by Chaoya Jiang, Hongrui Jia, Shikun Zhang, Wei Ye.

Figure 1: An Illustration of Progressive Loss of Visual Grounding in MLLM-based Reasoning.
Figure 2: An illustration of our iterative dialogue-based reasoning framework.
Figure 3: Ablation studies on the impact of component scaling.
Figure 4: A comparative analysis of visual grounding over long reasoning chains.
Figure 5: A Comparative Case Study: The QvQ-32B Model vs. Our Proposed Framework
Figure 6: A Comparative Case Study: The QvQ-32B Model vs. Our Proposed Framework

Original abstract

With the continuous expansion of Large Language Models (LLMs) and advances in reinforcement learning, LLMs have demonstrated exceptional reasoning capabilities, enabling them to address a wide range of complex problems. Inspired by these achievements, researchers have extended related techniques to Large Multimodal Models (LMMs). However, a critical limitation has emerged, reflected in the progressive loss of visual grounding. As the reasoning chain grows longer, LMMs tend to rely increasingly on the textual information generated in earlier steps, while the initially extracted visual information is rarely revisited or incorporated. This phenomenon often causes the reasoning process to drift away from the actual image content, resulting in visually implausible or even erroneous conclusions. To overcome this fundamental limitation, we propose a novel, training-free agentic paradigm that Decouples cognitive Reasoning from visual Perception (DRP). In this framework, a powerful LLM serves as a strategic Reasoner, orchestrating the inference process by explicitly querying an LMM, acting as a dedicated Observer, to retrieve fine-grained visual details on demand. This approach is lightweight, model-agnostic, and plug-and-play, necessitating no additional training or architectural modifications. Extensive experiments demonstrate our framework DRP's efficacy in regulating the visual reasoning trajectory, significantly mitigating reasoning drift, and enforcing robust visual grounding. Notably, on the MathVision benchmark, the integration of Qwen2.5-VL-7B and Qwen3-32B achieves an accuracy of 47.2%, outperforming GPT-4o's 40.6%. These findings underscore the potential of our approach to enhance multimodal reasoning reliability without the need for costly retraining. Our code is publicly available at https://github.com/hongruijia/DRP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DRP, a training-free agentic framework that decouples visual perception (LMM as Observer) from cognitive reasoning (LLM as Reasoner). The Reasoner explicitly queries the Observer for fine-grained visual details on demand to mitigate progressive loss of visual grounding and reasoning drift in long chains. Experiments on MathVision report 47.2% accuracy using Qwen2.5-VL-7B with Qwen3-32B, outperforming GPT-4o at 40.6%, with the method described as lightweight, model-agnostic, and plug-and-play.

Significance. If the central results hold under scrutiny, the work offers a practical, training-free route to improve visual grounding in multimodal reasoning without architectural changes or fine-tuning. The public code release at https://github.com/hongruijia/DRP supports reproducibility and enables further testing of the agentic decoupling idea.

major comments (2)
  1. [Abstract / Framework] Abstract and framework description: the performance gain (47.2% on MathVision) is attributed to enforced visual grounding via on-demand queries, yet no quantitative evaluation or error analysis of the Observer LMM's response accuracy, hallucination rate, or completeness is provided. This leaves the load-bearing assumption—that an unmodified LMM reliably supplies faithful fine-grained details throughout multi-step reasoning—unverified and open to the possibility that gains arise from other factors such as prompt engineering or model combination.
  2. [Experiments] Methods and experiments: the manuscript lacks ablations on query formulation, number of Observer calls, or failure modes when the Observer returns incomplete or erroneous details. Without these, it is difficult to isolate the contribution of the decoupled architecture from baseline LMM capabilities or the specific LLM-LMM pairing.
minor comments (1)
  1. [Implementation Details] Clarify the exact prompting templates used for Observer queries and how the Reasoner integrates the returned text to ensure the process is fully reproducible from the released code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

  1. Referee: [Abstract / Framework] Abstract and framework description: the performance gain (47.2% on MathVision) is attributed to enforced visual grounding via on-demand queries, yet no quantitative evaluation or error analysis of the Observer LMM's response accuracy, hallucination rate, or completeness is provided. This leaves the load-bearing assumption—that an unmodified LMM reliably supplies faithful fine-grained details throughout multi-step reasoning—unverified and open to the possibility that gains arise from other factors such as prompt engineering or model combination.

    Authors: We appreciate this observation. The end-to-end gains, including outperforming GPT-4o, provide supporting evidence that on-demand querying helps maintain visual grounding. However, we agree that direct metrics on Observer accuracy and hallucination would better isolate the mechanism. In the revision we will add a quantitative analysis of Observer response quality (accuracy, hallucination rate, completeness) on a sampled subset of MathVision queries, along with examples of how the Reasoner uses or corrects these outputs. revision: yes

  2. Referee: [Experiments] Methods and experiments: the manuscript lacks ablations on query formulation, number of Observer calls, or failure modes when the Observer returns incomplete or erroneous details. Without these, it is difficult to isolate the contribution of the decoupled architecture from baseline LMM capabilities or the specific LLM-LMM pairing.

    Authors: We concur that targeted ablations would clarify the contribution of the decoupling strategy. The current experiments focus on demonstrating overall efficacy and plug-and-play applicability across model pairs. In the revised manuscript we will include ablations varying query formulation and the number of Observer calls, plus a discussion of failure modes (e.g., incomplete Observer replies) and how the Reasoner mitigates them via iterative querying. These additions will help separate architectural effects from prompt or pairing factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in DRP framework


The paper proposes a training-free decoupled agentic framework (DRP) that uses an LLM as Reasoner to query an LMM as Observer for on-demand visual details. No equations, mathematical derivations, parameter fittings, or predictions are present in the provided text. Claims of mitigating reasoning drift rest on empirical benchmark results (e.g., MathVision accuracy) rather than any self-referential definitions or reductions. No self-citations appear as load-bearing elements for uniqueness theorems or ansatzes. The framework is described as model-agnostic and plug-and-play with external evaluations, making the contribution self-contained against benchmarks without circular reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions about LLM reasoning strength and LMM query responsiveness but introduces no free parameters, new physical entities, or ad-hoc constants; the contribution is the framework structure itself.

axioms (2)
  • domain assumption: Large language models possess strong reasoning capabilities suitable for strategic orchestration of inference steps.
    The framework assigns the LLM the role of Reasoner based on this established capability.
  • domain assumption: Large multimodal models can extract and return accurate fine-grained visual details when explicitly prompted for specific information.
    This underpins the Observer role and the on-demand querying mechanism.

pith-pipeline@v0.9.0 · 5854 in / 1400 out tokens · 54985 ms · 2026-05-18T12:24:17.630066+00:00 · methodology


