From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
Pith reviewed 2026-05-20 05:08 UTC · model grok-4.3
The pith
Decoupling visual perception from reasoning in post-training allows vision-language models to achieve higher accuracy with shorter reasoning traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By breaking post-training into visual perception, visual reasoning, and textual reasoning stages with specialized data, perception is shown to need targeted optimization and to serve as a scaffold that should precede reasoning refinement. Staged training consistently outperforms merged training, delivering 1.5 percent higher reasoning accuracy alongside 20.8 percent shorter reasoning traces and leading results on visual math and perception tasks among open-weight models.
What carries the argument
A three-stage curriculum that first optimizes visual perception using specialized data and reinforcement learning before addressing visual and textual reasoning.
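The staged-versus-merged distinction can be made concrete with a small scheduling sketch. It is an illustration only, not the paper's training code: the `Stage` class, the dataset labels, and the 100-step budgets are hypothetical, and no SFT or RL update is actually performed. Both schedules consume the same number of steps per dataset and differ only in ordering.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str   # capability being trained
    data: str   # hypothetical dataset label
    steps: int  # hypothetical step budget


# Order follows the summary: perception first, then visual and textual reasoning.
STAGES: List[Stage] = [
    Stage("visual_perception", "perception_data", 100),
    Stage("visual_reasoning", "visual_reasoning_data", 100),
    Stage("textual_reasoning", "text_reasoning_data", 100),
]


def staged_schedule(stages: List[Stage]) -> Iterator[Tuple[int, str]]:
    """Yield (step, dataset) pairs, finishing each stage before starting the next."""
    step = 0
    for stage in stages:
        for _ in range(stage.steps):
            yield step, stage.data
            step += 1


def merged_schedule(stages: List[Stage]) -> Iterator[Tuple[int, str]]:
    """Yield (step, dataset) pairs that interleave all datasets over the same budget."""
    remaining = {s.data: s.steps for s in stages}
    for step in range(sum(remaining.values())):
        # Pick the dataset with the most steps left; ties go to the earliest stage.
        data = max(remaining, key=remaining.get)
        remaining[data] -= 1
        yield step, data


if __name__ == "__main__":
    staged = [d for _, d in staged_schedule(STAGES)]
    merged = [d for _, d in merged_schedule(STAGES)]
    print(staged[:3], merged[:3])
```

Holding the per-dataset budget fixed in both schedules is a deliberate choice here: it isolates ordering as the only difference between the two conditions.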
If this is right
- Staged models reach higher reasoning accuracy than merged-training counterparts.
- Reasoning traces become 20.8 percent shorter due to stronger initial perception.
- Performance gains appear on visual math benchmarks such as WeMath and perception tasks like RealWorldQA.
- Combining capability staging with difficulty-based curricula produces additive benefits.
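As a worked illustration of what a 20.8 percent shorter trace means, the sketch below computes a relative reduction in mean trace length. The token counts are made-up placeholders chosen to reproduce the reported percentage; they are not measurements from the paper, and `relative_reduction` is a hypothetical helper.

```python
from statistics import mean


def relative_reduction(baseline: float, new: float) -> float:
    """Return the percentage by which `new` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline * 100.0


# Placeholder trace lengths in tokens (illustrative only, not from the paper).
merged_traces = [900, 1000, 1100]
staged_traces = [712, 792, 872]

reduction = relative_reduction(mean(merged_traces), mean(staged_traces))
print(f"{reduction:.1f}% shorter")
```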
Where Pith is reading between the lines
- This approach could apply to other multimodal systems where early skill separation prevents interference between capabilities.
- Testing the method on additional vision-language benchmarks beyond math could reveal broader applicability.
- Shorter reasoning traces might reduce computational costs during model inference in practical applications.
Load-bearing premise
The training data for perception can be cleanly separated from reasoning data in a way that avoids harmful distribution shifts during the overall process.
What would settle it
A controlled experiment showing that joint training on combined perception and reasoning data produces equivalent or superior accuracy and efficiency on the same visual tasks would disprove the advantage of staging.
Original abstract
Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLM performance on visual tasks is limited primarily by perception rather than reasoning, and that post-training can be decomposed into three sequential stages—visual perception, visual reasoning, and textual reasoning—using specialized data. It argues that perception requires targeted RL-based optimization and must be solidified first as a scaffold before visual reasoning, yielding 1.5% higher reasoning accuracy and 20.8% shorter traces versus merged training; staged training is orthogonal to difficulty-based curricula with additive gains, and the resulting models set new results among open-weight VLMs (+5.2% WeMath, +3.7% RealWorldQA).
Significance. If the results hold, the work identifies perception as a foundational, separately optimizable component in VLMs and introduces capability-based staging as a new curriculum axis that combines additively with existing difficulty-based approaches. The multi-VLM experiments and the link between stronger perception and shorter reasoning traces are concrete strengths that could guide more efficient post-training pipelines.
major comments (2)
- §4 (experimental results): the staged-vs-merged comparison reports +5.2% and +3.7% gains but provides no ablation that reverses training order (reasoning stage before perception stage) while holding total data exposure and compute fixed; this directly tests whether the specific perception-first sequence, rather than any phased specialization, is responsible for the scaffold effect claimed in the abstract.
- §4.1 and Table 2: the accuracy improvements are given without reported standard deviations, exact train/validation splits, or explicit controls that total training tokens/compute are matched between staged and merged conditions; without these, the central claim that decoupling improves performance over merged training remains vulnerable to selection or optimization artifacts.
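The order ablation requested in the first comment can be enumerated mechanically. The sketch below is one way to list candidate stage orders under a fixed per-stage budget, so that total exposure is identical across conditions. The stage names and step counts are hypothetical, and nothing here runs any training.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical per-stage step budgets; fixed so every ordering sees the same data.
STAGE_BUDGET: Dict[str, int] = {
    "visual_perception": 100,
    "visual_reasoning": 100,
    "textual_reasoning": 100,
}


def ablation_orders(budget: Dict[str, int]) -> List[Tuple[str, ...]]:
    """Return every ordering of the stages; each uses the same total budget."""
    return list(permutations(budget))


def total_steps(order: Tuple[str, ...], budget: Dict[str, int]) -> int:
    """Total training steps consumed by an ordering."""
    return sum(budget[stage] for stage in order)


perception_first = tuple(STAGE_BUDGET)              # the ordering under study
reversed_order = tuple(reversed(perception_first))  # the requested control

for order in ablation_orders(STAGE_BUDGET):
    print(" -> ".join(order), total_steps(order, STAGE_BUDGET))
```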
minor comments (3)
- Add a short related-work paragraph contrasting the proposed capability staging with prior difficulty-based or multi-stage curricula in VLMs.
- Figure 3: label the y-axis units for reasoning-trace length explicitly and include error bars on the 20.8% reduction claim.
- §3.3: clarify whether the RL reward for the perception stage uses the same format as the later reasoning stages or a distinct formulation.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We address the major comments below and outline the revisions we will make to the manuscript.
- Referee: §4 (experimental results): the staged-vs-merged comparison reports +5.2% and +3.7% gains but provides no ablation that reverses training order (reasoning stage before perception stage) while holding total data exposure and compute fixed; this directly tests whether the specific perception-first sequence, rather than any phased specialization, is responsible for the scaffold effect claimed in the abstract.
  Authors: The referee raises a valid point regarding the need for a reverse-order ablation to confirm the importance of the perception-first sequence. Our hypothesis is grounded in the idea that perception provides a necessary foundation for effective reasoning, as supported by the observed reduction in reasoning trace length. To directly address this, we will include an additional experiment in the revised manuscript where we reverse the order of stages (textual and visual reasoning before perception) while keeping the total data exposure and compute fixed. We will report the results and compare them to our original staged approach.
  Revision: yes
- Referee: §4.1 and Table 2: the accuracy improvements are given without reported standard deviations, exact train/validation splits, or explicit controls that total training tokens/compute are matched between staged and merged conditions; without these, the central claim that decoupling improves performance over merged training remains vulnerable to selection or optimization artifacts.
  Authors: We appreciate this feedback on improving the experimental rigor. In the revised version, we will add standard deviations for the reported accuracies based on multiple training runs with different seeds. We will also specify the exact train and validation splits used in our experiments and provide a detailed breakdown of the total training tokens and computational resources for both the staged and merged training setups to confirm they are equivalent.
  Revision: yes
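The reporting promised in this response amounts to two small checks, sketched below under stated assumptions: per-seed accuracies are summarized as a mean and sample standard deviation, and token budgets are compared between conditions. All numbers used with these helpers would be the authors' own; `summarize` and `budgets_match` are hypothetical names, not functions from the paper's code.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(acc_by_seed: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(acc_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(acc_by_seed), stdev(acc_by_seed)


def budgets_match(staged_tokens: int, merged_tokens: int, rel_tol: float = 0.0) -> bool:
    """Check that two training-token budgets agree within a relative tolerance."""
    largest = max(staged_tokens, merged_tokens)
    return abs(staged_tokens - merged_tokens) <= rel_tol * largest


# Placeholder inputs (illustrative only, not results from the paper).
avg, sd = summarize([70.0, 71.0, 72.0])
print(f"accuracy: {avg:.2f} +/- {sd:.2f}")
print("budgets matched:", budgets_match(1_000_000, 1_000_000))
```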
Circularity Check
No circularity; empirical gains on held-out benchmarks with independent ablations
The paper reports measured improvements from staged perception-then-reasoning training versus merged training, with gains quantified on separate evaluation benchmarks (WeMath +5.2%, RealWorldQA +3.7%, plus 1.5% reasoning accuracy and 20.8% shorter traces). These outcomes are obtained by direct comparison of training regimes on held-out data rather than by any equation or definition that reduces the reported benefit to a parameter fitted from the evaluation set itself. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core ordering claim; the scaffold premise is presented as an experimental observation supported by the staged-vs-merged ablation. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- perception-stage data selection rule
axioms (1)
- domain assumption: Visual perception can be optimized independently of reasoning using targeted data and RL
... to ensure a controlled comparison and reproducibility.
Table 6. Key hyperparameters used in our Stage-3 training.
- Max prompt length: 2048
- Actor dtype: bf16
- Actor optimizer: adamw bf16
- Actor micro-bsz (update): 16
- Actor micro-bsz (experience): 64
- Offload params / optim: False / False
- Rollout GPU mem util.: 0.7
- Tensor parallel size: 1
- Reward typ...
Across both Qwen2.5-VL-7B and Qwen3-VL-8B, incorporating perception data consistently yields larger gains on visual math benchmarks compared to reasoning-only training. For instance, on Qwen3-VL-8B, perception+reasoning improves MVerse (VI) from 42.26% to 43.78% and MVista from 73.80% to 75.90%, while achieving comparable performance on A-OKVQA and POPE. ...
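The Qwen3-VL-8B figures quoted above translate into absolute gains as follows. The sketch uses only the four percentages stated in the text; the dictionary layout and the `point_gain` helper are illustrative assumptions, and "before" simply means whichever baseline the excerpt's "from" value refers to.

```python
# (before, after) accuracies in percent, as quoted in the excerpt.
QWEN3_VL_8B = {
    "MVerse (VI)": (42.26, 43.78),
    "MVista": (73.80, 75.90),
}


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


gains = {name: point_gain(*pair) for name, pair in QWEN3_VL_8B.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```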