From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Cihang Xie; Freda Shi; Hanqing Lu; Haoqin Tu; Hardy Chen; Hui Liu; Juncheng Wu; Xianfeng Tang; Yuyin Zhou

arxiv: 2605.20177 · v1 · pith:VC5XR6NDnew · submitted 2026-05-19 · 💻 cs.CL · cs.CV

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu , Hardy Chen , Haoqin Tu , Xianfeng Tang , Freda Shi , Hui Liu , Hanqing Lu , Cihang Xie

show 1 more author

Yuyin Zhou

This is my paper

Pith reviewed 2026-05-20 05:08 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords vision-language modelspost-trainingvisual perceptionvisual reasoningstaged trainingchain-of-thoughtreinforcement learningcurriculum

0 comments

The pith

Decoupling visual perception from reasoning in post-training allows vision-language models to achieve higher accuracy with shorter reasoning traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often underperform on visual tasks because their perception skills lag behind their reasoning ones. The paper demonstrates that isolating perception training with specialized data and reinforcement learning before moving to reasoning stages builds a stronger foundation. This separation leads to models that not only score higher on reasoning benchmarks but also produce more concise chains of thought. The method introduces capability-based staging as a complement to traditional difficulty curricula, yielding further improvements when combined.

Core claim

By breaking post-training into visual perception, visual reasoning, and textual reasoning stages with specialized data, perception is shown to need targeted optimization and to serve as a scaffold that should precede reasoning refinement. Staged training consistently outperforms merged training, delivering 1.5 percent higher reasoning accuracy alongside 20.8 percent shorter reasoning traces and leading results on visual math and perception tasks among open-weight models.

What carries the argument

A three-stage curriculum that first optimizes visual perception using specialized data and reinforcement learning before addressing visual and textual reasoning.

If this is right

Staged models reach higher reasoning accuracy than merged-training counterparts.
Reasoning traces become 20.8 percent shorter due to stronger initial perception.
Performance gains appear on visual math benchmarks such as WeMath and perception tasks like RealWorldQA.
Combining capability staging with difficulty-based curricula produces additive benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could apply to other multimodal systems where early skill separation prevents interference between capabilities.
Testing the method on additional vision-language benchmarks beyond math could reveal broader applicability.
Shorter reasoning traces might reduce computational costs during model inference in practical applications.

Load-bearing premise

The training data for perception can be cleanly separated from reasoning data in a way that avoids harmful distribution shifts during the overall process.

What would settle it

A controlled experiment showing that joint training on combined perception and reasoning data produces equivalent or superior accuracy and efficiency on the same visual tasks would disprove the advantage of staging.

Figures

Figures reproduced from arXiv: 2605.20177 by Cihang Xie, Freda Shi, Hanqing Lu, Haoqin Tu, Hardy Chen, Hui Liu, Juncheng Wu, Xianfeng Tang, Yuyin Zhou.

**Figure 1.** Figure 1: Longer thinking can not fix incorrect perception. Re-checking the image during the reasoning leads to the same perception error. 1. Introduction Vision-Language Models (VLMs) have achieved remarkable progress in a wide range of multimodal tasks, including visual question answering (Yue et al., 2024; Huang et al., 2025; Wu et al., 2025), diagram understanding (Hou et al., 2024; Hong et al., 2024), and visu… view at source ↗

**Figure 2.** Figure 2: Improving VLM Post-training with Visual Perception Data Synthesis and Staged Training: (a) Generating image-content based QA pairs by feeding captions to an LLM and labeling answers with a strong VLM; (b) Perception difficulty filtering, which removes samples that can be answered by the base VLMs based on caption; (c) Staged training by different capabilities from seeing to thinking. 3.2. Training Strategi… view at source ↗

**Figure 3.** Figure 3: Comparison between the base model, the model trained with reasoning-only, and perception+reasoning data. Incorporating perception data improves visual math while maintaining perception capabilities. We show standard error bars here, and the exact values are provided in Appendix A.2. Benchmarks. We evaluate model performance on a comprehensive suite of vision-language benchmarks, covering both visual math… view at source ↗

**Figure 4.** Figure 4: Case Study between Staged and Merged Training Models. The staged training model generates concise reasoning with correct perception [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Staged Training Reduces the Response Length for Visual Reasoning. For the Qwen3-VL-8B model, we plot the average response length on the validation set over training steps, comparing the staged and merged training strategies. merged training across both InternVL architectures, with overall gains of +0.95% and +3.77%, respectively, confirming that the benefit of decoupling perception and reasoning generaliz… view at source ↗

**Figure 6.** Figure 6: Example of synthesized visual perception data. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for generating visual perception question-answer pairs. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt for assessing visual perception errors in VLM’s reasoning. Prompt for model training messages = [ {"role": ”system", "content": """You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}. i.e. <thinking> reasoning here </thinking> \boxed{final … view at source ↗

**Figure 9.** Figure 9: System prompt used in our experiments. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

read the original abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that VLM performance on visual tasks is limited primarily by perception rather than reasoning, and that post-training can be decomposed into three sequential stages—visual perception, visual reasoning, and textual reasoning—using specialized data. It argues that perception requires targeted RL-based optimization and must be solidified first as a scaffold before visual reasoning, yielding 1.5% higher reasoning accuracy and 20.8% shorter traces versus merged training; staged training is orthogonal to difficulty-based curricula with additive gains, and the resulting models set new results among open-weight VLMs (+5.2% WeMath, +3.7% RealWorldQA).

Significance. If the results hold, the work identifies perception as a foundational, separately optimizable component in VLMs and introduces capability-based staging as a new curriculum axis that combines additively with existing difficulty-based approaches. The multi-VLM experiments and the link between stronger perception and shorter reasoning traces are concrete strengths that could guide more efficient post-training pipelines.

major comments (2)

[§4] §4 (experimental results): the staged-vs-merged comparison reports +5.2% and +3.7% gains but provides no ablation that reverses training order (reasoning stage before perception stage) while holding total data exposure and compute fixed; this directly tests whether the specific perception-first sequence, rather than any phased specialization, is responsible for the scaffold effect claimed in the abstract.
[§4.1] §4.1 and Table 2: the accuracy improvements are given without reported standard deviations, exact train/validation splits, or explicit controls that total training tokens/compute are matched between staged and merged conditions; without these, the central claim that decoupling improves performance over merged training remains vulnerable to selection or optimization artifacts.

minor comments (3)

Add a short related-work paragraph contrasting the proposed capability staging with prior difficulty-based or multi-stage curricula in VLMs.
Figure 3: label the y-axis units for reasoning-trace length explicitly and include error bars on the 20.8% reduction claim.
[§3.3] §3.3: clarify whether the RL reward for the perception stage uses the same format as the later reasoning stages or a distinct formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address the major comments below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [§4] §4 (experimental results): the staged-vs-merged comparison reports +5.2% and +3.7% gains but provides no ablation that reverses training order (reasoning stage before perception stage) while holding total data exposure and compute fixed; this directly tests whether the specific perception-first sequence, rather than any phased specialization, is responsible for the scaffold effect claimed in the abstract.

Authors: The referee raises a valid point regarding the need for a reverse-order ablation to confirm the importance of the perception-first sequence. Our hypothesis is grounded in the idea that perception provides a necessary foundation for effective reasoning, as supported by the observed reduction in reasoning trace length. To directly address this, we will include an additional experiment in the revised manuscript where we reverse the order of stages (textual and visual reasoning before perception) while keeping the total data exposure and compute fixed. We will report the results and compare them to our original staged approach. revision: yes
Referee: [§4.1] §4.1 and Table 2: the accuracy improvements are given without reported standard deviations, exact train/validation splits, or explicit controls that total training tokens/compute are matched between staged and merged conditions; without these, the central claim that decoupling improves performance over merged training remains vulnerable to selection or optimization artifacts.

Authors: We appreciate this feedback on improving the experimental rigor. In the revised version, we will add standard deviations for the reported accuracies based on multiple training runs with different seeds. We will also specify the exact train and validation splits used in our experiments and provide a detailed breakdown of the total training tokens and computational resources for both the staged and merged training setups to confirm they are equivalent. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains on held-out benchmarks with independent ablations

full rationale

The paper reports measured improvements from staged perception-then-reasoning training versus merged training, with gains quantified on separate evaluation benchmarks (WeMath +5.2%, RealWorldQA +3.7%, plus 1.5% reasoning accuracy and 20.8% shorter traces). These outcomes are obtained by direct comparison of training regimes on held-out data rather than by any equation or definition that reduces the reported benefit to a parameter fitted from the evaluation set itself. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core ordering claim; the scaffold premise is presented as an experimental observation supported by the staged-vs-merged ablation. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The paper relies on the assumption that perception and reasoning capabilities can be isolated into distinct training stages with specialized data; no new physical entities or forces are introduced. The main free parameter is the choice of which data counts as 'specialized perception data' versus reasoning data.

free parameters (1)

perception-stage data selection rule
The boundary between perception-only data and reasoning data is chosen to enable the staged curriculum; this choice directly affects the reported gains.

axioms (1)

domain assumption Visual perception can be optimized independently of reasoning using targeted data and RL
Invoked when the authors state that perception requires targeted optimization and is more effectively learned via RL than caption-based SFT.

pith-pipeline@v0.9.0 · 5811 in / 1426 out tokens · 28233 ms · 2026-05-20T05:08:47.288590+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 9 internal anchors

[1]

Nore- geo: Non-reasoning geometry benchmark.arXiv preprint arXiv:2601.10254,

Abdullaeva, I., Vasiliuk, A., Goncharova, E., Rahmatul- laev, T., Ivan, Z., Kurkin, M., and Kuznetsov, A. Nore- geo: Non-reasoning geometry benchmark.arXiv preprint arXiv:2601.10254,

work page arXiv
[2]

URL https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/ Model_Card_Claude_3.pdf. Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., L...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

URL https: //arxiv.org/abs/2311.12793. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y ., Chen, Z., Duan, H., Wang, J., Qiao, Y ., Lin, D., et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 27056–27087, 2024b. Deng, H., Ma, D. Z. X. Z. R., Cao, Y . G. Y ., and Kang, Y . Curr-r...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Video-R1: Reinforcing Video Reasoning in MLLMs

Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y ., Peng, T., Wu, J., Zhang, X., Wang, B., and Yue, X. Video-r1: Reinforcing video reasoning in mllms, 2025a. URL https://arxiv.org/abs/2503.21776. Feng, K., Zhang, M., Li, H., Fan, K., Chen, S., Jiang, Y ., Zheng, D., Sun, P., Zhang, Y ., Sun, H., et al. Onethinker: All-in-one reasoning model for image and vide...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

arXiv preprint arXiv:2312.11370 , year=

Gao, J., Pi, R., Zhang, J., Ye, J., Zhong, W., Wang, Y ., Hong, L., Han, J., Xu, H., Li, Z., et al. G-llava: Solving geo- metric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370,

work page arXiv
[6]

CogVLM2: Visual Language Models for Image and Video Understanding

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z. Hong, W., Wang, W., Ding, M., Yu, W., Lv, Q., Wang, Y ., Cheng, Y ., Huang, S., Ji, J., Xue, Z., et al. Cogvlm2: Vi- sual language models for image and video understanding. arXiv preprint arXiv:2408.16500,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z
[7]

Do vision- language models really understand visual language? arXiv preprint arXiv:2410.00193,

Hou, Y ., Giledereli, B., Tu, Y ., and Sachan, M. Do vision- language models really understand visual language? arXiv preprint arXiv:2410.00193,

work page arXiv
[8]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Hu, J., Zhang, Y ., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y . Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2508.02669 , year=

Huang, X., Wu, J., Liu, H., Tang, X., and Zhou, Y . Medvl- thinker: Simple baselines for multimodal medical reason- ing.arXiv preprint arXiv:2508.02669,

work page arXiv
[10]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

URL https://arxiv.org/abs/ 1902.09506. Jeddi, A., Karaimer, H. C., Nguyen, H., Wang, Z., Zhao, K., Rajabi, J., Zhang, R., Goyal, R., Taati, B., and Grzeszczuk, R. Puzzle curriculum grpo for vision-centric reasoning.arXiv preprint arXiv:2512.14944,

work page internal anchor Pith review Pith/arXiv arXiv 1902
[11]

Kamoi, R., Zhang, Y ., Das, S. S. S., Zhang, R. H., and Zhang, R. Visonlyqa: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947,

work page arXiv
[12]

Mmr1: Enhancing multimodal reasoning with variance- aware sampling and open resources.arXiv preprint arXiv:2509.21268,

Leng, S., Wang, J., Li, J., Zhang, H., Hu, Z., Zhang, B., Jiang, Y ., Zhang, H., Li, X., Bing, L., et al. Mmr1: Enhancing multimodal reasoning with variance- aware sampling and open resources.arXiv preprint arXiv:2509.21268,

work page arXiv
[13]

arXiv preprint arXiv:2403.00231 , year=

Li, L., Wang, Y ., Xu, R., Wang, P., Feng, X., Kong, L., and Liu, Q. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231,

work page arXiv
[14]

org/abs/2511.17731

URL https://arxiv. org/abs/2511.17731. Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision- language models.arXiv preprint arXiv:2305.10355,

work page arXiv
[15]

Lindstr¨om, A. D. and Abraham, S. S. Clevr-math: A dataset for compositional language, visual and mathematical rea- soning.arXiv preprint arXiv:2208.05358,

work page arXiv
[16]

More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X. E., Zhou, Y ., and Liu, S. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523,

work page arXiv
[17]

Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines with visual instruction tuning, 2024a. URL https:// arxiv.org/abs/2310.03744. Liu, Y ., Liu, J., Shi, X., Cheng, Q., Huang, Y ., and Lu, W. Let’s learn step by step: Enhancing in-context learn- ing ability with curriculum learning.arXiv preprint arXiv:2402.10738, 2024b. Lu, P., Bansal, H., Xia, ...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

org/abs/1906.00067

URL https://arxiv. org/abs/1906.00067. 11 From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models Masry, A., Long, D. X., Tan, J. Q., Joty, S., and Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning,

work page arXiv 1906
[19]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

URL https://arxiv.org/abs/2203.10244. Mathew, M., Bagal, V ., Tito, R. P., Karatzas, D., Valveny, E., and Jawahar, C. V . Infographicvqa, 2021a. URL https://arxiv.org/abs/2104.12756. Mathew, M., Karatzas, D., and Jawahar, C. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pp...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Peng, Y ., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., and Yang, X. Lmm-r1: Em- powering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025a. Peng, Y ., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., and Yang, X. Lmm- r1: Empowering 3b...

work page internal anchor Pith review Pith/arXiv arXiv
[21]

A-okvqa: A benchmark for visual question answering using world knowledge, 2022a

Schwenk, D., Khandelwal, A., Clark, C., Marino, K., and Mottaghi, R. A-okvqa: A benchmark for visual question answering using world knowledge, 2022a. URL https: //arxiv.org/abs/2206.01718. Schwenk, D., Khandelwal, A., Clark, C., Marino, K., and Mottaghi, R. A-okvqa: A benchmark for visual ques- tion answering using world knowledge. InEuropean conference o...

work page arXiv
[22]

Descriptive caption enhancement with visual specialists for multimodal perception.arXiv preprint arXiv:2412.14233,

Sun, Y ., Hao, J., Zhu, K., Liu, J.-J., Zhao, Y ., Li, X., Zhang, G., Li, Z., and Wang, J. Descriptive caption enhancement with visual specialists for multimodal perception.arXiv preprint arXiv:2412.14233,

work page arXiv
[23]

org/abs/2501.06186

URL https://arxiv. org/abs/2501.06186. Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and Li, H. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neu- ral Information Processing Systems, 37:95095–95169, 2024a. Wang, W., Chen, Z., Wang, W., Cao, Y ., Liu, Y ., Gao, Z., Zhu, J., Zhu, X., Lu, L., Qiao, Y ., e...

work page arXiv
[24]

URL https://arxiv.org/abs/2411. 10440. 12 From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models Xu, G., Jin, P., Wu, Z., Li, H., Song, Y ., Sun, L., and Yuan, L. Llava-cot: Let vision language models reason step- by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. ...

work page 2087
[25]

Wethink: Toward general-purpose vision- language reasoning via reinforcement learning.arXiv preprint arXiv:2506.07905, 2025a

Yang, J., Ma, F., Wang, Z., Yin, D., Rong, K., Rao, F., and Zhang, R. Wethink: Toward general-purpose vision- language reasoning via reinforcement learning.arXiv preprint arXiv:2506.07905, 2025a. Yang, Y ., He, X., Pan, H., Jiang, X., Deng, Y ., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., Zhang, B., and Chen, W. R1- onevision: Advancing generalized multi...

work page arXiv
[26]

Gthinker: Towards general multimodal reasoning via cue-guided rethinking

Zhan, Y ., Wu, Z., Zhu, Y ., Xue, R., Luo, R., Chen, Z., Zhang, C., Li, Y ., He, Z., Yang, Z., et al. Gthinker: Towards general multimodal reasoning via cue-guided rethinking. arXiv preprint arXiv:2506.01078, 2025a. Zhan, Y ., Zhu, Y ., Zheng, S., Zhao, H., Yang, F., Tang, M., and Wang, J. Vision-r1: Evolving human-free alignment in large vision-language ...

work page arXiv 2025
[27]

Math- verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pp

Zhang, R., Jiang, D., Zhang, Y ., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.-W., Qiao, Y ., et al. Math- verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pp. 169–186. Springer, 2024a. Zhang, R., Zhang, B., Li, Y ., Zhang, H., Sun, Z., Gan, Z., Yang, Y ., Pang, R., and Ya...

work page arXiv
[28]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

Zou, C., Guo, X., Yang, R., Zhang, J., Hu, B., and Zhang, H. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InInternational Conference on Learning Repre- sentations, volume 2025, pp. 48337–48383,

work page 2025
[29]

Table 6.Key hyperparameters used in our Stage-3 training

to ensure a controlled comparison and reproducibility. Table 6.Key hyperparameters used in our Stage-3 training. Hyperparameter Value Max prompt length 2048 Actor dtype bf16 Actor optimizer adamw bf16 Actor micro-bsz (update) 16 Actor micro-bsz (experience) 64 Offload params / optim False / False Rollout GPU mem util. 0.7 Tensor parallel size 1 Reward typ...

work page 2048
[30]

Across both Qwen2.5-VL-7B and Qwen3-VL-8B, incorporating perception data consistently yields larger gains on visual math benchmarks compared to reasoning-only training. For instance, on Qwen3-VL-8B, perception+reasoning improves MVerse (VI) from 42.26% to 43.78% and MVista from 73.80% to 75.90%, while achieving comparable performance on A-OKVQA and POPE. ...

work page arXiv 2025

[1] [1]

Nore- geo: Non-reasoning geometry benchmark.arXiv preprint arXiv:2601.10254,

Abdullaeva, I., Vasiliuk, A., Goncharova, E., Rahmatul- laev, T., Ivan, Z., Kurkin, M., and Kuznetsov, A. Nore- geo: Non-reasoning geometry benchmark.arXiv preprint arXiv:2601.10254,

work page arXiv

[2] [2]

URL https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/ Model_Card_Claude_3.pdf. Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., L...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

URL https: //arxiv.org/abs/2311.12793. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y ., Chen, Z., Duan, H., Wang, J., Qiao, Y ., Lin, D., et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37: 27056–27087, 2024b. Deng, H., Ma, D. Z. X. Z. R., Cao, Y . G. Y ., and Kang, Y . Curr-r...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Video-R1: Reinforcing Video Reasoning in MLLMs

Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y ., Peng, T., Wu, J., Zhang, X., Wang, B., and Yue, X. Video-r1: Reinforcing video reasoning in mllms, 2025a. URL https://arxiv.org/abs/2503.21776. Feng, K., Zhang, M., Li, H., Fan, K., Chen, S., Jiang, Y ., Zheng, D., Sun, P., Zhang, Y ., Sun, H., et al. Onethinker: All-in-one reasoning model for image and vide...

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2312.11370 , year=

Gao, J., Pi, R., Zhang, J., Ye, J., Zhong, W., Wang, Y ., Hong, L., Han, J., Xu, H., Li, Z., et al. G-llava: Solving geo- metric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370,

work page arXiv

[6] [6]

CogVLM2: Visual Language Models for Image and Video Understanding

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z. Hong, W., Wang, W., Ding, M., Yu, W., Lv, Q., Wang, Y ., Cheng, Y ., Huang, S., Ji, J., Xue, Z., et al. Cogvlm2: Vi- sual language models for image and video understanding. arXiv preprint arXiv:2408.16500,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z

[7] [7]

Do vision- language models really understand visual language? arXiv preprint arXiv:2410.00193,

Hou, Y ., Giledereli, B., Tu, Y ., and Sachan, M. Do vision- language models really understand visual language? arXiv preprint arXiv:2410.00193,

work page arXiv

[8] [8]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Hu, J., Zhang, Y ., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y . Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

arXiv preprint arXiv:2508.02669 , year=

Huang, X., Wu, J., Liu, H., Tang, X., and Zhou, Y . Medvl- thinker: Simple baselines for multimodal medical reason- ing.arXiv preprint arXiv:2508.02669,

work page arXiv

[10] [10]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

URL https://arxiv.org/abs/ 1902.09506. Jeddi, A., Karaimer, H. C., Nguyen, H., Wang, Z., Zhao, K., Rajabi, J., Zhang, R., Goyal, R., Taati, B., and Grzeszczuk, R. Puzzle curriculum grpo for vision-centric reasoning.arXiv preprint arXiv:2512.14944,

work page internal anchor Pith review Pith/arXiv arXiv 1902

[11] [11]

Kamoi, R., Zhang, Y ., Das, S. S. S., Zhang, R. H., and Zhang, R. Visonlyqa: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947,

work page arXiv

[12] [12]

Mmr1: Enhancing multimodal reasoning with variance- aware sampling and open resources.arXiv preprint arXiv:2509.21268,

Leng, S., Wang, J., Li, J., Zhang, H., Hu, Z., Zhang, B., Jiang, Y ., Zhang, H., Li, X., Bing, L., et al. Mmr1: Enhancing multimodal reasoning with variance- aware sampling and open resources.arXiv preprint arXiv:2509.21268,

work page arXiv

[13] [13]

arXiv preprint arXiv:2403.00231 , year=

Li, L., Wang, Y ., Xu, R., Wang, P., Feng, X., Kong, L., and Liu, Q. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231,

work page arXiv

[14] [14]

org/abs/2511.17731

URL https://arxiv. org/abs/2511.17731. Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision- language models.arXiv preprint arXiv:2305.10355,

work page arXiv

[15] [15]

Lindstr¨om, A. D. and Abraham, S. S. Clevr-math: A dataset for compositional language, visual and mathematical rea- soning.arXiv preprint arXiv:2208.05358,

work page arXiv

[16] [16]

More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X. E., Zhou, Y ., and Liu, S. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523,

[17]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2024a. URL https://arxiv.org/abs/2310.03744. Liu, Y., Liu, J., Shi, X., Cheng, Q., Huang, Y., and Lu, W. Let’s learn step by step: Enhancing in-context learning ability with curriculum learning. arXiv preprint arXiv:2402.10738, 2024b. Lu, P., Bansal, H., Xia, ...

[18]

URL https://arxiv.org/abs/1906.00067. Masry, A., Long, D. X., Tan, J. Q., Joty, S., and Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning,

[19]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

URL https://arxiv.org/abs/2203.10244. Mathew, M., Bagal, V., Tito, R. P., Karatzas, D., Valveny, E., and Jawahar, C. V. Infographicvqa, 2021a. URL https://arxiv.org/abs/2104.12756. Mathew, M., Karatzas, D., and Jawahar, C. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp...

[20]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., and Yang, X. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025a. Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., and Yang, X. Lmm-r1: Empowering 3b...

[21]

Schwenk, D., Khandelwal, A., Clark, C., Marino, K., and Mottaghi, R. A-okvqa: A benchmark for visual question answering using world knowledge, 2022a. URL https://arxiv.org/abs/2206.01718. Schwenk, D., Khandelwal, A., Clark, C., Marino, K., and Mottaghi, R. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference o...

[22]

Sun, Y., Hao, J., Zhu, K., Liu, J.-J., Zhao, Y., Li, X., Zhang, G., Li, Z., and Wang, J. Descriptive caption enhancement with visual specialists for multimodal perception. arXiv preprint arXiv:2412.14233,

[23]

URL https://arxiv.org/abs/2501.06186. Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and Li, H. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024a. Wang, W., Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Zhu, J., Zhu, X., Lu, L., Qiao, Y., e...

[24]

URL https://arxiv.org/abs/2411.10440. Xu, G., Jin, P., Wu, Z., Li, H., Song, Y., Sun, L., and Yuan, L. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. ...

[25]

Yang, J., Ma, F., Wang, Z., Yin, D., Rong, K., Rao, F., and Zhang, R. Wethink: Toward general-purpose vision-language reasoning via reinforcement learning. arXiv preprint arXiv:2506.07905, 2025a. Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., Zhang, B., and Chen, W. R1-onevision: Advancing generalized multi...

[26]

Zhan, Y., Wu, Z., Zhu, Y., Xue, R., Luo, R., Chen, Z., Zhang, C., Li, Y., He, Z., Yang, Z., et al. Gthinker: Towards general multimodal reasoning via cue-guided rethinking. arXiv preprint arXiv:2506.01078, 2025a. Zhan, Y., Zhu, Y., Zheng, S., Zhao, H., Yang, F., Tang, M., and Wang, J. Vision-r1: Evolving human-free alignment in large vision-language ...

[27]

Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.-W., Qiao, Y., et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pp. 169–186. Springer, 2024a. Zhang, R., Zhang, B., Li, Y., Zhang, H., Sun, Z., Gan, Z., Yang, Y., Pang, R., and Ya...

[28]

Zou, C., Guo, X., Yang, R., Zhang, J., Hu, B., and Zhang, H. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In International Conference on Learning Representations, volume 2025, pp. 48337–48383,

[29]

to ensure a controlled comparison and reproducibility.

Table 6. Key hyperparameters used in our Stage-3 training.

Hyperparameter                 Value
Max prompt length              2048
Actor dtype                    bf16
Actor optimizer                adamw bf16
Actor micro-bsz (update)       16
Actor micro-bsz (experience)   64
Offload params / optim         False / False
Rollout GPU mem util.          0.7
Tensor parallel size           1
Reward typ...

[30]

Across both Qwen2.5-VL-7B and Qwen3-VL-8B, incorporating perception data consistently yields larger gains on visual math benchmarks compared to reasoning-only training. For instance, on Qwen3-VL-8B, perception+reasoning improves MVerse (VI) from 42.26% to 43.78% and MVista from 73.80% to 75.90%, while achieving comparable performance on A-OKVQA and POPE. ...
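As a quick check on the Qwen3-VL-8B figures quoted above, the absolute gains can be computed directly. This is only an illustrative sketch: the dictionary layout and the helper name `absolute_gain` are assumptions introduced here, and only the four accuracy values come from the text.

```python
# Accuracy values (%) for Qwen3-VL-8B as quoted in the text, stored as
# (before, after) pairs. The data layout is illustrative, not the paper's.
reported = {
    "MVerse (VI)": (42.26, 43.78),
    "MVista": (73.80, 75.90),
}


def absolute_gain(before: float, after: float) -> float:
    """Hypothetical helper: absolute gain in percentage points."""
    # Round to two decimals to hide binary floating-point noise.
    return round(after - before, 2)


gains = {name: absolute_gain(b, a) for name, (b, a) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

Both gains are absolute percentage-point differences, not relative improvements.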
