Latent Visual States for Efficient Multimodal Reasoning

Hang Xu; Hui-Ling Zhen; Jianhua Han; Jingyuan Zou; Kun Xiang; Likui Zhang; Wentao Hu; Xiaodan Liang; Xiuwei Chen; Yongxin Wang

arxiv: 2606.24233 · v1 · pith:KJJGBRWQnew · submitted 2026-06-23 · 💻 cs.CV

Latent Visual States for Efficient Multimodal Reasoning

Xiuwei Chen , Wentao Hu , Yongxin Wang , Zisheng Chen , Likui Zhang , Kun Xiang , Jianhua Han , Hui-Ling Zhen

show 3 more authors

Jingyuan Zou Hang Xu Xiaodan Liang

This is my paper

Pith reviewed 2026-06-26 00:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords latent visual statesmultimodal reasoningLatent_slot tokensD-GSPOchain-of-thoughtefficient inferencevisual representations

0 comments

The pith

EVA generates continuous latent visual states as adaptive tokens during reasoning to improve both accuracy and inference speed in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal models typically rely on discrete outputs such as code or coordinates to invoke external tools, which adds latency and rigid dependencies. EVA instead produces continuous latent visual representations internally as a variable-length sequence of Latent_slot tokens that function as intermediate visual thoughts. These tokens are trained jointly with discrete text tokens in an end-to-end fashion. The joint optimization creates extreme policy deviation in the transition window after the latent tokens, which the method counters with D-GSPO by decoupling the latent and discrete optimization objectives. On a new interleaved text-image CoT dataset and multiple benchmarks, the resulting models show performance gains together with faster inference.

Core claim

EVA natively generates continuous latent visual representations that appear as an adaptive sequence of Latent_slot tokens serving as intermediate visual thoughts. These tokens are trained end-to-end with discrete text tokens; the co-optimization produces extreme policy deviation in the transition window, which D-GSPO resolves by decoupling the latent and discrete components. The approach yields significant performance gains while enhancing inference efficiency across real-world scenes, documents, charts, and OCR tasks.

What carries the argument

Latent_slot tokens (adaptive sequences of continuous latent visual representations) together with D-GSPO (Decouple-GSPO), which decouples optimization of the latent visual and discrete text components to manage policy deviation during end-to-end training.

If this is right

Multimodal reasoning can proceed with internal continuous visual states rather than external discrete tool calls.
Inference latency decreases because the model avoids generating and executing separate tool invocations.
End-to-end training remains feasible once the transition-window deviation is isolated by D-GSPO.
Performance improves on interleaved visual-text tasks such as document understanding and chart reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-state mechanism could be tested on non-visual modalities to check whether continuous internal representations generalize beyond vision.
Models using Latent_slot tokens might eventually reduce reliance on separate vision encoders by folding visual processing into the language-model token stream.
Scaling experiments on larger base models would show whether the efficiency benefit persists when parameter count increases.

Load-bearing premise

Co-optimization of the latent visual tokens and discrete text tokens produces a policy deviation that D-GSPO can successfully decouple without losing the advantages of joint end-to-end training.

What would settle it

An ablation that trains the same architecture without D-GSPO and measures whether the reported accuracy gains and inference-time reductions disappear or training fails to converge.

Figures

Figures reproduced from arXiv: 2606.24233 by Hang Xu, Hui-Ling Zhen, Jianhua Han, Jingyuan Zou, Kun Xiang, Likui Zhang, Wentao Hu, Xiaodan Liang, Xiuwei Chen, Yongxin Wang, Zisheng Chen.

**Figure 1.** Figure 1: Performance and Paradigm Comparison of the Proposed EVA Framework. (Left) Comparative results with alternative methods demonstrate that EVA achieves superior overall performance. (Middle) Contrasting different paradigms, EVA internalizes the explicit external tool invocation into the model’s intrinsic capabilities; this mechanism is validated by the significantly lower latency observed when compared to met… view at source ↗

**Figure 2.** Figure 2: Overall of the EVA framework. Algorithm 1 EVA Algorithm Input: Cold-Start Dataset Dsf t, GSPO Dataset DGSP O, Base Model πθ Output: Improved Model πθ,GSPO 1: ▷ Learning to Generate & Use \triangleright see §3.2 2: πθ,SF T1 ← update πθ with SFT on Dsf t \triangleright Using Eq 3 3: πθ,SF T2 ← update πθ,SF T1 with SFT on Dsf t \triangleright Using Eq 2 4: ▷ Learning to Explore \triangleright see §3.3 5: for … view at source ↗

**Figure 3.** Figure 3: Qualitative Examples of Our Dataset. Each sample illustrates the model’s reasoning process (think), the corresponding localized visual regions, and the final answer. other tasks such as rotation and drawing, we retain the original features of the target image to preserve its global context. Specifically, we compute an affinity map between the visual features and the task-specific semantic embeddings extra… view at source ↗

**Figure 4.** Figure 4: (Left) Inference Latency Comparison across Different Image Resolutions. (Right) Effect of token length per Latent [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative analysis of EVA framework. × represents text tokens, ▲ represents visual tokens, and • represents latent tokens [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6 [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of Tool slot embeddings. × represents text tokens, ▲ represents visual tokens, and • represents latent tokens [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative analysis of EVA framework. C.6. EVA Generated Examples Figures 9, Figures 11, Figures 12, Figures 13 provide visualizations of scenarios involving low-resolution images with blurred details, simple problems where original images are clear, and content-flipped cases, respectively. It can be observed that while the EVA model answers directly for clear images, it adaptively generates correspondi… view at source ↗

**Figure 9.** Figure 9: Visualization of SFT data instances 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative Examples of EVA. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Rotation Example. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Crop Example. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Easy Example 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Inaccurate cropping. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

read the original abstract

The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (etc., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose {EVA} (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent\_slot tokens, serving as intermediate visual thoughts during the reasoning process. These Latent\_slot tokens are then trained end-to-end with the discrete text tokens. This co-optimization, notably, causes extreme policy deviation in the 'transition window' following the Latent\_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EVA claims continuous latent visual states cut latency versus discrete tool calls, but D-GSPO is described only at high level with no numbers or formulation to show it works.

read the letter

The main thing to know is that this paper replaces discrete tool calls with internal continuous latent visual states in the form of adaptive Latent_slot tokens, trained end-to-end with text tokens, and introduces D-GSPO to fix the policy deviation that arises in the transition window. The abstract gives no quantitative results or technical details on how the decoupling actually happens.

What is new is the framing of latent visual representations as intermediate thoughts during reasoning, plus the EVA-230K dataset of text-image interleaved chain-of-thought examples covering scenes, documents, charts, and OCR. The motivation around latency from external tool invocations is straightforward, and releasing a targeted dataset is a usable contribution even if the rest of the work is thin.

The paper does a decent job identifying the rigid dependencies that come from forcing discrete outputs. The high-level idea of keeping visual information continuous and internal is a reasonable direction for efficiency.

The soft spots are the lack of any benchmarks, ablation, or error analysis to support the claimed gains, and especially the absence of any formulation for D-GSPO. The abstract says it decouples latent and discrete optimization to target the root cause, but supplies no modified objective, gradient routing, or KL term. That matches the stress-test note: without seeing how the method isolates the transition window without breaking end-to-end gradient flow, the performance and efficiency claims cannot be attributed to the proposed states. If the full paper contains the math or code, that would change the picture; based on what is here, the central mechanism remains unshown.

This is for researchers working on practical multimodal inference who want lower latency. A reader hunting for new architectural patterns might pick up the latent token idea, but anyone needing to reproduce or extend the result will find the current version insufficient.

I would not send this to peer review yet. It needs the experimental results and the concrete specification of D-GSPO before a referee can evaluate whether the decoupling works as described.

Referee Report

2 major / 1 minor

Summary. The paper proposes EVA (LatEnt Visual StAtes), a framework for multimodal reasoning that natively generates continuous latent visual representations as an adaptive sequence of Latent_slot tokens. These tokens are trained end-to-end with discrete text tokens; the resulting co-optimization is said to produce extreme policy deviation in the transition window, which is addressed by D-GSPO (Decouple-GSPO) that decouples latent and discrete optimization. The authors introduce the EVA-230K text-image interleaved CoT dataset and claim that experiments across benchmarks demonstrate significant performance gains together with improved inference efficiency.

Significance. If the central claims were substantiated, the method could reduce latency in multimodal models by replacing discrete tool invocations with native continuous latent states. However, the manuscript supplies no quantitative results, benchmarks, ablation studies, or error analysis, so the claimed gains in performance and efficiency cannot be evaluated and the significance remains undetermined.

major comments (2)

[Abstract] Abstract: the central claims of 'significant performance gains' and 'enhancing inference efficiency' are asserted without any supporting numbers, tables, baseline comparisons, or statistical analysis, rendering the primary empirical contribution unevaluable.
[Abstract] Abstract: D-GSPO is described only at a high level as decoupling optimization to correct policy deviation in the transition window, with no formulation (modified loss, gradient routing, KL term, or objective) provided; this prevents verification that the method isolates the deviation without severing end-to-end gradient flow or reintroducing the rigid dependencies the framework claims to avoid.

minor comments (1)

[Abstract] Abstract: the expansion 'LatEnt Visual StAtes' produces an awkward acronym; standard capitalization (e.g., EVA) should be clarified for consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our empirical results and methodological details.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'significant performance gains' and 'enhancing inference efficiency' are asserted without any supporting numbers, tables, baseline comparisons, or statistical analysis, rendering the primary empirical contribution unevaluable.

Authors: We agree that the abstract would benefit from concrete quantitative support. In the revised manuscript we will add specific performance metrics (e.g., accuracy gains on the evaluated benchmarks) and efficiency numbers (e.g., latency reduction relative to baselines) drawn from the experiments on EVA-230K and the reported benchmarks, together with pointers to the corresponding tables and figures. revision: yes
Referee: [Abstract] Abstract: D-GSPO is described only at a high level as decoupling optimization to correct policy deviation in the transition window, with no formulation (modified loss, gradient routing, KL term, or objective) provided; this prevents verification that the method isolates the deviation without severing end-to-end gradient flow or reintroducing the rigid dependencies the framework claims to avoid.

Authors: We acknowledge the abstract presents D-GSPO at a summary level. The methods section of the full manuscript contains the decoupling formulation; however, to improve accessibility we will expand the abstract with a concise statement of the modified objective and gradient routing strategy, ensuring it is clear that end-to-end differentiability is preserved. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims remain descriptive without self-referential reductions.

full rationale

The abstract and available text introduce EVA and D-GSPO at a conceptual level, stating that co-optimization of Latent_slot tokens with text tokens causes policy deviation addressed by decoupling, but provide no equations, objective functions, gradient formulations, or derivation steps. Without mathematical content, no self-definitional, fitted-input, or self-citation reductions can be exhibited. The framework is presented as an empirical proposal supported by a constructed dataset and benchmarks rather than a closed logical chain that reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract supplies insufficient technical detail to identify specific free parameters, background axioms, or the independence of any new entities; Latent_slot tokens are introduced as a core mechanism but lack external validation evidence.

invented entities (1)

Latent_slot tokens no independent evidence
purpose: Act as intermediate visual thoughts during multimodal reasoning
New representation introduced in the framework description; no independent evidence or falsifiable handle provided in abstract.

pith-pipeline@v0.9.1-grok · 5757 in / 1073 out tokens · 26836 ms · 2026-06-26T00:36:27.975683+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 27 linked inside Pith

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6

Pith/arXiv arXiv 2025
[2]

Per- ception tokens enhance visual reasoning in multimodal lan- guage models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025. 3

2025
[3]

Transmamba: Fast universal architec- ture adaption from transformers to mamba.arXiv preprint arXiv:2502.15130, 2025

Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, and Xiaodan Liang. Transmamba: Fast universal architec- ture adaption from transformers to mamba.arXiv preprint arXiv:2502.15130, 2025. 15

arXiv 2025
[4]

C2-evo: Co-evolving multimodal data and model for self-improving reasoning

Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jian- hua Han, Hang Xu, and Xiaodan Liang. C2-evo: Co-evolving multimodal data and model for self-improving reasoning. ArXiv, abs/2507.16518, 2025. 15

Pith/arXiv arXiv 2025
[5]

Mint-cot: En- abling interleaved visual tokens in mathematical chain-of- thought reasoning.arXiv preprint arXiv:2506.05331, 2025

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: En- abling interleaved visual tokens in mathematical chain-of- thought reasoning.arXiv preprint arXiv:2506.05331, 2025. 1, 3

arXiv 2025
[7]

Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wan- jun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025. 1

Pith/arXiv arXiv 2025
[8]

Blink: Multimodal large language mod- els can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language mod- els can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024. 6

2024
[9]

Refocus: Visual editing as a chain of thought for struc- tured image understanding.arXiv preprint arXiv:2501.05452,

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for struc- tured image understanding.arXiv preprint arXiv:2501.05452,

arXiv
[10]

G-llava: Solving geomet- ric problem with multi-modal large language model.ArXiv, abs/2312.11370, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geomet- ric problem with multi-modal large language model.ArXiv, abs/2312.11370, 2023. 15

arXiv 2023
[11]

Webwatcher: Breaking new fron- tier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qi- uchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new fron- tier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025. 15

Pith/arXiv arXiv 2025
[12]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 15

Pith/arXiv arXiv 2025
[13]

Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024. 2

Pith/arXiv arXiv 2024
[14]

Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 1

2024
[15]

Zettlemoyer, Noah A

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke S. Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.ArXiv, abs/2406.09403, 2024. 15

arXiv 2024
[17]

Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

Pith/arXiv arXiv
[18]

High-resolution visual reasoning via multi- turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, and Ziwei Liu. High-resolution visual reasoning via multi- turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025. 15

Pith/arXiv arXiv 2025
[19]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1

Pith/arXiv arXiv 2024
[20]

Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 15

Pith/arXiv arXiv 2024
[21]

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 1, 6, 15

Pith/arXiv arXiv 2025
[22]

Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.ArXiv, abs/2509.21268, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Li Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, and Shijian Lu. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.ArXiv, abs/2509.21268, 2025. 15

arXiv 2025
[23]

Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025. 3, 6, 8

Pith/arXiv arXiv 2025
[24]

Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025. 2

arXiv 2025
[25]

Implicit 9 reasoning in large language models: A comprehensive survey

Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying. Implicit 9 reasoning in large language models: A comprehensive survey. arXiv preprint arXiv:2509.02350, 2025. 2

arXiv 2025
[26]

Agent rl scaling law: Agent rl with spontaneous code execution for mathemat- ical problem solving.arXiv preprint arXiv:2505.07773, 2025

Xinji Mai, Haotian Xu, Zhong-Zhi Li, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang, et al. Agent rl scaling law: Agent rl with spontaneous code execution for mathemat- ical problem solving.arXiv preprint arXiv:2505.07773, 2025. 15

arXiv 2025
[27]

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025. 15

Pith/arXiv arXiv 2025
[28]

Introducing o3 and o4-mini, 2025

OpenAI. Introducing o3 and o4-mini, 2025. 1

2025
[29]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 5, 11

2024
[30]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Li Bing, and Roy Ka wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. InConference on Empirical Methods in Natural Language Processing, 2024. 15

2024
[31]

Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 15

Pith/arXiv arXiv 2025
[32]

Webshaper: Agentically data synthesiz- ing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baix- uan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesiz- ing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025. 15

arXiv 2025
[33]

Kimi k2: Open agentic intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 15

Pith/arXiv arXiv 2025
[34]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025. 6

2025
[35]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Lin- jie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025. 5, 11

arXiv 2025
[36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022. 15

Pith/arXiv arXiv 2022
[37]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 6

2024
[38]

Atomthink: A slow thinking framework for multi- modal mathematical reasoning.ArXiv, abs/2411.11930, 2024

Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhu Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, Lanqing Hong, Hang Xu, and Xiaodan Liang. Atomthink: A slow thinking framework for multi- modal mathematical reasoning.ArXiv, abs/2411.11930, 2024. 15

arXiv 2024
[39]

Llava-cot: Let vision language models reason step-by-step.ArXiv, abs/2411.10440, 2024

Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step.ArXiv, abs/2411.10440, 2024. 1, 15

Pith/arXiv arXiv 2024
[40]

Simpletir: End-to-end rein- forcement learning for multi-turn tool-integrated reasoning

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end rein- forcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025. 15

arXiv 2025
[41]

R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.ArXiv, abs/2503.10615, 2025

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Min- feng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.ArXiv, abs/2503.10615, 2025. 1, 15

Pith/arXiv arXiv 2025
[42]

Machine mental imagery: Empower multi- modal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multi- modal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025. 3

Pith/arXiv arXiv 2025
[43]

Mimo-vl technical report.ArXiv, abs/2506.03569,

Xiaomi LLM-Core Team Zihao Yue, Zhenrui Lin, Yi-Hao Song, Weikun Wang, Shu-Qin Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zi-Ang Jiang, Zhixian Zheng, Zhichao Song, Zhen Luo, Yue Yu, Yudong Wang, Yu Tian, Yu Tu, Yihan Yan, Yi H...

arXiv
[44]

Llama-berry: Pair- wise optimization for o1-like olympiad-level mathematical reasoning.ArXiv, abs/2410.02884, 2024

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. Llama-berry: Pair- wise optimization for o1-like olympiad-level mathematical reasoning.ArXiv, abs/2410.02884, 2024. 15

arXiv 2024
[45]

R1-vl: Learning to reason with multimodal large language mod- els via step-wise group relative policy optimization.ArXiv, abs/2503.12937, 2025

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language mod- els via step-wise group relative policy optimization.ArXiv, abs/2503.12937, 2025. 1, 15

Pith/arXiv arXiv 2025
[46]

Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 1, 15

Pith/arXiv arXiv 2025
[47]

Mme-realworld: Could your multi- modal llm challenge high-resolution real-world scenarios that 10 are difficult for humans?arXiv preprint arXiv:2408.13257,

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multi- modal llm challenge high-resolution real-world scenarios that 10 are difficult for humans?arXiv preprint arXiv:2408.13257,

Pith/arXiv arXiv
[48]

Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025. 1, 5, 11, 15

Pith/arXiv arXiv 2025
[49]

Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 6

Pith/arXiv arXiv 2025
[50]

Deepeyes: Incentivizing” thinking with images” via reinforcement learn- ing.arXiv preprint arXiv:2505.14362, 2025

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing” thinking with images” via reinforcement learn- ing.arXiv preprint arXiv:2505.14362, 2025. 1, 5, 7, 11, 15

Pith/arXiv arXiv 2025
[51]

Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 1

Pith/arXiv arXiv 2023
[52]

Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 15 A. Details A.1. Dataset Construction EV A training data consists of two parts.: EV A-SFT-230K and EV A-R...

Pith/arXiv arXiv 2025
[53]

To guarantee accurate alignment during the matching process with the ground truth, we use order constraints

We allocate a corresponding number of Latent slot placeholders based on the quantity of tool images present in each sample. To guarantee accurate alignment during the matching process with the ground truth, we use order constraints. Furthermore, in the Reinforcement Learning (RL) phase, we discard a small number of samples generated during the rollout pro...
[54]

Remove Python Code: Completely remove all <code>...</code> blocks and the Python code within them
[55]

Replace with Image Tag: In the exact place where the `<code>...</code>` block used to be, insert the following tag: `<output_image>`
[56]

I will crop the image to make a closer observation

Modify the Transition Sentence: Locate the sentence immediately preceding the newly inserted `<output_image>` tag. You must rewrite this sentence to explicitly state your intended action for closer observation. The action verb must match what the removed code was doing (e.g., crop, highlight, rotate). Example phrasing: "I will crop the image to make a clo...
[57]

original_image\

Preserve the Rest: Do not alter the logical flow, the `<sandbox_output><image></sandbox_output>` tags, or the final conclusion. Keep the `<answer>...</answer>` block exactly as it is in the original text. What does the graph indicate happens to the abundance of elements with a mass number around 195? A. It shows a significant increase in abundance. B. It ...

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6

Pith/arXiv arXiv 2025

[2] [2]

Per- ception tokens enhance visual reasoning in multimodal lan- guage models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025. 3

2025

[3] [3]

Transmamba: Fast universal architec- ture adaption from transformers to mamba.arXiv preprint arXiv:2502.15130, 2025

Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, and Xiaodan Liang. Transmamba: Fast universal architec- ture adaption from transformers to mamba.arXiv preprint arXiv:2502.15130, 2025. 15

arXiv 2025

[4] [4]

C2-evo: Co-evolving multimodal data and model for self-improving reasoning

Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jian- hua Han, Hang Xu, and Xiaodan Liang. C2-evo: Co-evolving multimodal data and model for self-improving reasoning. ArXiv, abs/2507.16518, 2025. 15

Pith/arXiv arXiv 2025

[5] [5]

Mint-cot: En- abling interleaved visual tokens in mathematical chain-of- thought reasoning.arXiv preprint arXiv:2506.05331, 2025

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: En- abling interleaved visual tokens in mathematical chain-of- thought reasoning.arXiv preprint arXiv:2506.05331, 2025. 1, 3

arXiv 2025

[6] [7]

Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wan- jun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025. 1

Pith/arXiv arXiv 2025

[7] [8]

Blink: Multimodal large language mod- els can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language mod- els can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024. 6

2024

[8] [9]

Refocus: Visual editing as a chain of thought for struc- tured image understanding.arXiv preprint arXiv:2501.05452,

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for struc- tured image understanding.arXiv preprint arXiv:2501.05452,

arXiv

[9] [10]

G-llava: Solving geomet- ric problem with multi-modal large language model.ArXiv, abs/2312.11370, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geomet- ric problem with multi-modal large language model.ArXiv, abs/2312.11370, 2023. 15

arXiv 2023

[10] [11]

Webwatcher: Breaking new fron- tier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qi- uchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new fron- tier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025. 15

Pith/arXiv arXiv 2025

[11] [12]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 15

Pith/arXiv arXiv 2025

[12] [13]

Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024. 2

Pith/arXiv arXiv 2024

[13] [14]

Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 1

2024

[14] [15]

Zettlemoyer, Noah A

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke S. Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.ArXiv, abs/2406.09403, 2024. 15

arXiv 2024

[15] [17]

Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

Pith/arXiv arXiv

[16] [18]

High-resolution visual reasoning via multi- turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, and Ziwei Liu. High-resolution visual reasoning via multi- turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025. 15

Pith/arXiv arXiv 2025

[17] [19]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1

Pith/arXiv arXiv 2024

[18] [20]

Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 15

Pith/arXiv arXiv 2024

[19] [21]

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 1, 6, 15

Pith/arXiv arXiv 2025

[20] [22]

Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.ArXiv, abs/2509.21268, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Li Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, and Shijian Lu. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.ArXiv, abs/2509.21268, 2025. 15

arXiv 2025

[21] [23]

Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025. 3, 6, 8

Pith/arXiv arXiv 2025

[22] [24]

Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025. 2

arXiv 2025

[23] [25]

Implicit 9 reasoning in large language models: A comprehensive survey

Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying. Implicit 9 reasoning in large language models: A comprehensive survey. arXiv preprint arXiv:2509.02350, 2025. 2

arXiv 2025

[24] [26]

Agent rl scaling law: Agent rl with spontaneous code execution for mathemat- ical problem solving.arXiv preprint arXiv:2505.07773, 2025

Xinji Mai, Haotian Xu, Zhong-Zhi Li, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang, et al. Agent rl scaling law: Agent rl with spontaneous code execution for mathemat- ical problem solving.arXiv preprint arXiv:2505.07773, 2025. 15

arXiv 2025

[25] [27]

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025. 15

Pith/arXiv arXiv 2025

[26] [28]

Introducing o3 and o4-mini, 2025

OpenAI. Introducing o3 and o4-mini, 2025. 1

2025

[27] [29]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 5, 11

2024

[28] [30]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Li Bing, and Roy Ka wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. InConference on Empirical Methods in Natural Language Processing, 2024. 15

2024

[29] [31]

Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 15

Pith/arXiv arXiv 2025

[30] [32]

Webshaper: Agentically data synthesiz- ing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baix- uan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesiz- ing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025. 15

arXiv 2025

[31] [33]

Kimi k2: Open agentic intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 15

Pith/arXiv arXiv 2025

[32] [34]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025. 6

2025

[33] [35]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Lin- jie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025. 5, 11

arXiv 2025

[34] [36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903, 2022. 15

Pith/arXiv arXiv 2022

[35] [37]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 6

2024

[36] [38]

Atomthink: A slow thinking framework for multi- modal mathematical reasoning.ArXiv, abs/2411.11930, 2024

Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Runhu Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Jianhua Han, Lanqing Hong, Hang Xu, and Xiaodan Liang. Atomthink: A slow thinking framework for multi- modal mathematical reasoning.ArXiv, abs/2411.11930, 2024. 15

arXiv 2024

[37] [39]

Llava-cot: Let vision language models reason step-by-step.ArXiv, abs/2411.10440, 2024

Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step.ArXiv, abs/2411.10440, 2024. 1, 15

Pith/arXiv arXiv 2024

[38] [40]

Simpletir: End-to-end rein- forcement learning for multi-turn tool-integrated reasoning

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end rein- forcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025. 15

arXiv 2025

[39] [41]

R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.ArXiv, abs/2503.10615, 2025

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Min- feng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.ArXiv, abs/2503.10615, 2025. 1, 15

Pith/arXiv arXiv 2025

[40] [42]

Machine mental imagery: Empower multi- modal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multi- modal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025. 3

Pith/arXiv arXiv 2025

[41] [43]

Mimo-vl technical report.ArXiv, abs/2506.03569,

Xiaomi LLM-Core Team Zihao Yue, Zhenrui Lin, Yi-Hao Song, Weikun Wang, Shu-Qin Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zi-Ang Jiang, Zhixian Zheng, Zhichao Song, Zhen Luo, Yue Yu, Yudong Wang, Yu Tian, Yu Tu, Yihan Yan, Yi H...

arXiv

[42] [44]

Llama-berry: Pair- wise optimization for o1-like olympiad-level mathematical reasoning.ArXiv, abs/2410.02884, 2024

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. Llama-berry: Pair- wise optimization for o1-like olympiad-level mathematical reasoning.ArXiv, abs/2410.02884, 2024. 15

arXiv 2024

[43] [45]

R1-vl: Learning to reason with multimodal large language mod- els via step-wise group relative policy optimization.ArXiv, abs/2503.12937, 2025

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language mod- els via step-wise group relative policy optimization.ArXiv, abs/2503.12937, 2025. 1, 15

Pith/arXiv arXiv 2025

[44] [46]

Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 1, 15

Pith/arXiv arXiv 2025

[45] [47]

Mme-realworld: Could your multi- modal llm challenge high-resolution real-world scenarios that 10 are difficult for humans?arXiv preprint arXiv:2408.13257,

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multi- modal llm challenge high-resolution real-world scenarios that 10 are difficult for humans?arXiv preprint arXiv:2408.13257,

Pith/arXiv arXiv

[46] [48]

Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025. 1, 5, 11, 15

Pith/arXiv arXiv 2025

[47] [49]

Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 6

Pith/arXiv arXiv 2025

[48] [50]

Deepeyes: Incentivizing” thinking with images” via reinforcement learn- ing.arXiv preprint arXiv:2505.14362, 2025

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing” thinking with images” via reinforcement learn- ing.arXiv preprint arXiv:2505.14362, 2025. 1, 5, 7, 11, 15

Pith/arXiv arXiv 2025

[49] [51]

Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 1

Pith/arXiv arXiv 2023

[50] [52]

Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 15 A. Details A.1. Dataset Construction EV A training data consists of two parts.: EV A-SFT-230K and EV A-R...

Pith/arXiv arXiv 2025

[51] [53]

To guarantee accurate alignment during the matching process with the ground truth, we use order constraints

We allocate a corresponding number of Latent slot placeholders based on the quantity of tool images present in each sample. To guarantee accurate alignment during the matching process with the ground truth, we use order constraints. Furthermore, in the Reinforcement Learning (RL) phase, we discard a small number of samples generated during the rollout pro...

[52] [54]

Remove Python Code: Completely remove all <code>...</code> blocks and the Python code within them

[53] [55]

Replace with Image Tag: In the exact place where the `<code>...</code>` block used to be, insert the following tag: `<output_image>`

[54] [56]

I will crop the image to make a closer observation

Modify the Transition Sentence: Locate the sentence immediately preceding the newly inserted `<output_image>` tag. You must rewrite this sentence to explicitly state your intended action for closer observation. The action verb must match what the removed code was doing (e.g., crop, highlight, rotate). Example phrasing: "I will crop the image to make a clo...

[55] [57]

original_image\

Preserve the Rest: Do not alter the logical flow, the `<sandbox_output><image></sandbox_output>` tags, or the final conclusion. Keep the `<answer>...</answer>` block exactly as it is in the original text. What does the graph indicate happens to the abundance of elements with a mass number around 195? A. It shows a significant increase in abundance. B. It ...