arxiv: 2509.23322 · v2 · submitted 2025-09-27 · 💻 cs.CV

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

Pith reviewed 2026-05-18 12:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords large multimodal models, visual grounding, reasoning drift, agentic framework, training-free, multimodal reasoning, MathVision, decoupled reasoning

The pith

A training-free framework decouples reasoning from perception to stop multimodal models from losing visual grounding in long chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that large multimodal models (LMMs) lose visual grounding as their reasoning chains lengthen, relying increasingly on their own prior text output instead of revisiting the image. To counter this, it proposes DRP, a framework that decouples reasoning from perception: an LLM acts as the Reasoner that manages the process and queries an LMM Observer for visual details only when necessary. The setup requires no training and can be layered onto existing models. If the approach works, it offers a practical way to improve accuracy on visual reasoning tasks such as solving math problems from diagrams, and an efficient alternative to retraining entire models for better reliability.

Core claim

The central contribution is a training-free agentic paradigm called DRP that decouples cognitive reasoning from visual perception. A powerful LLM acts as the strategic Reasoner: it orchestrates inference by explicitly querying an LMM, the Observer, to retrieve fine-grained visual details on demand. According to the paper, this regulates the visual reasoning trajectory, significantly mitigates reasoning drift, and enforces robust visual grounding.

What carries the argument

The DRP framework itself: the LLM Reasoner queries the LMM Observer for specific visual information as the reasoning progresses.
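The querying loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ASK:`/`ANSWER:` message convention, the `run_drp` function, the `Transcript` type, and the toy Reasoner and Observer are all hypothetical names introduced here. In a real setup the two callables would wrap an LLM and an LMM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical message convention: the Reasoner emits either
# "ASK: <question>" to query the Observer or "ANSWER: <text>" to stop.
ASK, ANSWER = "ASK:", "ANSWER:"


@dataclass
class Transcript:
    """Running dialogue of (question, observation) pairs."""
    turns: List[Tuple[str, str]] = field(default_factory=list)


def run_drp(
    problem: str,
    image: object,
    reasoner: Callable[[str, Transcript], str],
    observer: Callable[[object, str], str],
    max_queries: int = 8,
) -> Tuple[str, Transcript]:
    """Alternate Reasoner steps with on-demand Observer queries."""
    transcript = Transcript()
    for _ in range(max_queries):
        step = reasoner(problem, transcript).strip()
        if step.startswith(ANSWER):
            return step[len(ANSWER):].strip(), transcript
        if step.startswith(ASK):
            question = step[len(ASK):].strip()
            # In this sketch only the Observer ever receives the image.
            transcript.turns.append((question, observer(image, question)))
        else:
            # Unrecognised output: treat it as the final answer.
            return step, transcript
    # Query budget exhausted: ask the Reasoner to conclude from the transcript.
    final = reasoner(problem + "\nAnswer now.", transcript).strip()
    if final.startswith(ANSWER):
        final = final[len(ANSWER):].strip()
    return final, transcript


# Toy stand-ins for the two models, for illustration only.
def toy_reasoner(problem: str, transcript: Transcript) -> str:
    if not transcript.turns:
        return "ASK: What is the side length of the square?"
    side = float(transcript.turns[-1][1])
    return f"ANSWER: {side * side:g}"


def toy_observer(image: dict, question: str) -> str:
    return str(image["side"])


answer, log = run_drp(
    "Find the area of the square.", {"side": 3}, toy_reasoner, toy_observer
)
print(answer, len(log.turns))  # prints: 9 1
```

The design point the sketch tries to capture is that the Reasoner works on text alone and must ask again whenever it needs a visual fact, so every visual claim in the chain traces back to an explicit Observer reply.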

If this is right

The visual reasoning trajectory is regulated to stay close to the image content.
Reasoning drift is significantly mitigated, reducing visually implausible conclusions.
Robust visual grounding is enforced throughout the inference process.
Performance improves on challenging benchmarks: pairing Qwen2.5-VL-7B with Qwen3-32B reaches 47.2% accuracy on MathVision, against 40.6% for GPT-4o.
The method applies to any LMM in a plug-and-play manner without modifications or training.
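For scale, the headline MathVision comparison works out as follows. This is simple arithmetic on the two accuracies quoted in the abstract; the symbols $\Delta_{\text{abs}}$ and $\Delta_{\text{rel}}$ are introduced here for illustration only.

```latex
\[
\Delta_{\text{abs}} = 47.2\% - 40.6\% = 6.6 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{47.2 - 40.6}{40.6} \approx 16.3\%.
\]
```

The margin alone does not show where the gain comes from; it only fixes the size of the effect that any explanation has to account for.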

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This decoupling strategy might apply to other modalities or sequential tasks where models lose fidelity to initial inputs over time.
Combining differently sized models in this observer-reasoner setup could lead to more efficient AI systems overall.
Explicit querying could be optimized further to minimize the number of visual checks needed for a given task.

Load-bearing premise

The LMM can provide accurate and fine-grained visual details in response to explicit queries from the LLM without requiring any training or model changes.

What would settle it

If the LMM Observer returns incorrect visual information in response to queries on a test set of images with clear but detailed features, the framework would produce the same errors as standard models.
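A toy version of this falsification test, as a hedged sketch: the Reasoner below is a plain Python function that only sees the Observer's text, and the faulty Observer misreads one value. All names (`solve_rectangle_area`, `faithful_observer`, `faulty_observer`) and the dictionary standing in for an image are hypothetical, chosen only to show how a perception error passes straight through an otherwise correct reasoning step.

```python
from typing import Callable


def solve_rectangle_area(image: dict, observer: Callable[[dict, str], str]) -> float:
    """Toy Reasoner: it never reads the image, only the Observer's replies."""
    width = float(observer(image, "What is the width?"))
    height = float(observer(image, "What is the height?"))
    return width * height


def faithful_observer(image: dict, question: str) -> str:
    key = "width" if "width" in question else "height"
    return str(image[key])


def faulty_observer(image: dict, question: str) -> str:
    # Simulated perception error: the width is misread by one unit.
    if "width" in question:
        return str(image["width"] + 1)
    return str(image["height"])


image = {"width": 4, "height": 5}
truth = 20
good = solve_rectangle_area(image, faithful_observer)
bad = solve_rectangle_area(image, faulty_observer)
print(good == truth, bad == truth)  # prints: True False
```

The reasoning step is identical in both runs; only the Observer differs, which is why the test has to be run on the Observer's replies and not on the final answers alone.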

Figures

Figures reproduced from arXiv: 2509.23322 by Chaoya Jiang, Hongrui Jia, Shikun Zhang, Wei Ye.

**Figure 2.** An illustration of our iterative dialogue-based reasoning framework. The framework …

**Figure 3.** Ablation studies on the impact of component scaling.

**Figure 4.** A comparative analysis of visual grounding over long reasoning chains. The baseline LMM …

**Figure 5.** A Comparative Case Study: The QvQ-32B Model vs. Our Proposed Framework

**Figure 6.** A Comparative Case Study: The QvQ-32B Model vs. Our Proposed Framework

Original abstract

With the continuous expansion of Large Language Models (LLMs) and advances in reinforcement learning, LLMs have demonstrated exceptional reasoning capabilities, enabling them to address a wide range of complex problems. Inspired by these achievements, researchers have extended related techniques to Large Multimodal Models (LMMs). However, a critical limitation has emerged, reflected in the progressive loss of visual grounding. As the reasoning chain grows longer, LMMs tend to rely increasingly on the textual information generated in earlier steps, while the initially extracted visual information is rarely revisited or incorporated. This phenomenon often causes the reasoning process to drift away from the actual image content, resulting in visually implausible or even erroneous conclusions. To overcome this fundamental limitation, we propose a novel, training-free agentic paradigm that Decouples cognitive Reasoning from visual Perception (DRP). In this framework, a powerful LLM serves as a strategic Reasoner, orchestrating the inference process by explicitly querying an LMM, acting as a dedicated Observer, to retrieve fine-grained visual details on demand. This approach is lightweight, model-agnostic, and plug-and-play, necessitating no additional training or architectural modifications. Extensive experiments demonstrate our framework DRP's efficacy in regulating the visual reasoning trajectory, significantly mitigating reasoning drift, and enforcing robust visual grounding. Notably, on the MathVision benchmark, the integration of Qwen2.5-VL-7B and Qwen3-32B achieves an accuracy of 47.2%, outperforming GPT-4o's 40.6%. These findings underscore the potential of our approach to enhance multimodal reasoning reliability without the need for costly retraining. Our code is publicly available at https://github.com/hongruijia/DRP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DRP offers a simple training-free split where an LLM reasoner calls an LMM observer on demand to fetch visual details, and the Qwen combo beats GPT-4o on MathVision, but the gain rests on the observer staying accurate without any fixes.

The main takeaway is that this paper describes a lightweight agentic framework called DRP that keeps an LLM in charge of the overall reasoning while letting it query an LMM only when it needs fresh visual information from the image. The goal is to cut down on the drift that happens when long reasoning chains start ignoring the original picture and lean on earlier text instead. The authors report that pairing Qwen2.5-VL-7B as the observer with Qwen3-32B as the reasoner reaches 47.2% on MathVision, above GPT-4o's 40.6%, and they release the code at the GitHub link in the abstract. The setup needs no training or model changes, which makes it easy to try on top of existing LMMs.

The explicit separation of roles and the on-demand querying step are the clearest new pieces compared with the prompting baselines they reference. It is a clean, model-agnostic way to structure the interaction that could be useful for people already running these models on visual math or diagram tasks. The public code is a real help for checking the implementation.

The soft spot is the untested assumption that the observer LMM will return faithful, fine-grained answers to the specific questions the reasoner sends. If the base vision model still misses details or hallucinates on direct queries, the reasoner ends up reasoning over incomplete or wrong text, and the reported improvement cannot be credited to better grounding. The abstract gives no error analysis or ablation on query success rates, so it is hard to tell how often this happens in practice.

This work is aimed at practitioners who want to boost multimodal reasoning reliability without retraining budgets. A reader experimenting with agentic patterns on benchmarks like MathVision would find the description and results worth testing. It deserves a serious referee because the idea is straightforward, the numbers are concrete, and the code is available, even though additional checks on observer reliability would make the claims stronger. I would recommend sending it to review rather than desk rejecting it.

Referee Report

2 major / 1 minor

Summary. The paper introduces DRP, a training-free agentic framework that decouples visual perception (LMM as Observer) from cognitive reasoning (LLM as Reasoner). The Reasoner explicitly queries the Observer for fine-grained visual details on demand to mitigate progressive loss of visual grounding and reasoning drift in long chains. Experiments on MathVision report 47.2% accuracy using Qwen2.5-VL-7B with Qwen3-32B, outperforming GPT-4o at 40.6%, with the method described as lightweight, model-agnostic, and plug-and-play.

Significance. If the central results hold under scrutiny, the work offers a practical, training-free route to improve visual grounding in multimodal reasoning without architectural changes or fine-tuning. The public code release at https://github.com/hongruijia/DRP supports reproducibility and enables further testing of the agentic decoupling idea.

major comments (2)

[Abstract / Framework] Abstract and framework description: the performance gain (47.2% on MathVision) is attributed to enforced visual grounding via on-demand queries, yet no quantitative evaluation or error analysis of the Observer LMM's response accuracy, hallucination rate, or completeness is provided. This leaves the load-bearing assumption—that an unmodified LMM reliably supplies faithful fine-grained details throughout multi-step reasoning—unverified and open to the possibility that gains arise from other factors such as prompt engineering or model combination.
[Experiments] Methods and experiments: the manuscript lacks ablations on query formulation, number of Observer calls, or failure modes when the Observer returns incomplete or erroneous details. Without these, it is difficult to isolate the contribution of the decoupled architecture from baseline LMM capabilities or the specific LLM-LMM pairing.
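The Observer evaluation requested in the first major comment could be computed along the following lines. This is a hedged sketch, not something the manuscript provides: the `ObserverAnnotation` record, its three boolean labels, and the `observer_quality` function are assumptions introduced here, and the sample annotations are made up purely to exercise the code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ObserverAnnotation:
    """Hypothetical human label for one Observer reply."""
    correct: bool       # reply matches the image on the queried detail
    hallucinated: bool  # reply asserts something absent from the image
    complete: bool      # reply covers everything the query asked for


def observer_quality(annotations: Iterable[ObserverAnnotation]) -> Dict[str, float]:
    """Fraction of replies that are correct, hallucinated, and complete."""
    items = list(annotations)
    if not items:
        raise ValueError("need at least one annotated reply")
    n = len(items)
    return {
        "accuracy": sum(a.correct for a in items) / n,
        "hallucination_rate": sum(a.hallucinated for a in items) / n,
        "completeness": sum(a.complete for a in items) / n,
    }


# Illustrative labels only; these are not results from the paper.
sample = [
    ObserverAnnotation(correct=True, hallucinated=False, complete=True),
    ObserverAnnotation(correct=True, hallucinated=False, complete=False),
    ObserverAnnotation(correct=False, hallucinated=True, complete=True),
    ObserverAnnotation(correct=False, hallucinated=False, complete=True),
]
metrics = observer_quality(sample)
print(metrics)
```

Keeping the three labels separate matters for the argument: a reply can be wrong without being a hallucination (a misread number), and correct without being complete, and each case affects the Reasoner differently.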

minor comments (1)

[Implementation Details] Clarify the exact prompting templates used for Observer queries and how the Reasoner integrates the returned text to ensure the process is fully reproducible from the released code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Referee: [Abstract / Framework] Abstract and framework description: the performance gain (47.2% on MathVision) is attributed to enforced visual grounding via on-demand queries, yet no quantitative evaluation or error analysis of the Observer LMM's response accuracy, hallucination rate, or completeness is provided. This leaves the load-bearing assumption—that an unmodified LMM reliably supplies faithful fine-grained details throughout multi-step reasoning—unverified and open to the possibility that gains arise from other factors such as prompt engineering or model combination.

Authors: We appreciate this observation. The end-to-end gains, including outperforming GPT-4o, provide supporting evidence that on-demand querying helps maintain visual grounding. However, we agree that direct metrics on Observer accuracy and hallucination would better isolate the mechanism. In the revision we will add a quantitative analysis of Observer response quality (accuracy, hallucination rate, completeness) on a sampled subset of MathVision queries, along with examples of how the Reasoner uses or corrects these outputs.
revision: yes
Referee: [Experiments] Methods and experiments: the manuscript lacks ablations on query formulation, number of Observer calls, or failure modes when the Observer returns incomplete or erroneous details. Without these, it is difficult to isolate the contribution of the decoupled architecture from baseline LMM capabilities or the specific LLM-LMM pairing.

Authors: We concur that targeted ablations would clarify the contribution of the decoupling strategy. The current experiments focus on demonstrating overall efficacy and plug-and-play applicability across model pairs. In the revised manuscript we will include ablations varying query formulation and the number of Observer calls, plus a discussion of failure modes (e.g., incomplete Observer replies) and how the Reasoner mitigates them via iterative querying. These additions will help separate architectural effects from prompt or pairing factors.
revision: yes
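An ablation over the number of Observer calls, as discussed in the second response, could be organised as a small harness like the one below. This is a sketch under assumptions: `ablate_call_budget`, the `solve(problem, budget)` interface, and the toy solver and dataset are hypothetical and stand in for a real DRP pipeline and benchmark. The numbers it produces are artefacts of the toy data, not experimental results.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[dict, str]  # (problem input, gold answer)


def ablate_call_budget(
    solve: Callable[[dict, int], str],
    dataset: List[Example],
    budgets: Iterable[int],
) -> Dict[int, float]:
    """Exact-match accuracy of `solve` at each Observer-call budget."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    results: Dict[int, float] = {}
    for budget in budgets:
        hits = sum(solve(problem, budget) == gold for problem, gold in dataset)
        results[budget] = hits / len(dataset)
    return results


def toy_solve(problem: dict, budget: int) -> str:
    # Toy pipeline: it succeeds only if the budget covers the number of
    # visual facts the problem needs.
    if budget >= problem["facts_needed"]:
        return problem["answer"]
    return "unknown"


toy_data: List[Example] = [
    ({"facts_needed": 1, "answer": "12"}, "12"),
    ({"facts_needed": 2, "answer": "7"}, "7"),
    ({"facts_needed": 4, "answer": "30"}, "30"),
    ({"facts_needed": 4, "answer": "5"}, "5"),
]
curve = ablate_call_budget(toy_solve, toy_data, budgets=[0, 1, 2, 4])
print(curve)
```

Including a budget of zero gives the Reasoner-only baseline, which is the comparison needed to separate the effect of querying from the effect of the LLM-LMM pairing.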

Circularity Check

0 steps flagged

No significant circularity detected in DRP framework

The paper proposes a training-free decoupled agentic framework (DRP) that uses an LLM as Reasoner to query an LMM as Observer for on-demand visual details. No equations, mathematical derivations, parameter fittings, or predictions are present in the provided text. Claims of mitigating reasoning drift rest on empirical benchmark results (e.g., MathVision accuracy) rather than any self-referential definitions or reductions. No self-citations appear as load-bearing elements for uniqueness theorems or ansatzes. The framework is described as model-agnostic and plug-and-play with external evaluations, making the contribution self-contained against benchmarks without circular reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions about LLM reasoning strength and LMM query responsiveness but introduces no free parameters, new physical entities, or ad-hoc constants; the contribution is the framework structure itself.

axioms (2)

domain assumption: Large language models possess strong reasoning capabilities suitable for strategic orchestration of inference steps.
The framework assigns the LLM the role of Reasoner based on this established capability.
domain assumption: Large multimodal models can extract and return accurate fine-grained visual details when explicitly prompted for specific information.
This underpins the Observer role and the on-demand querying mechanism.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: decouples cognitive Reasoning from visual Perception (DRP)... LLM Reasoner... LMM Observer... iterative dialogue

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
