Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models
Pith reviewed 2026-05-18 00:56 UTC · model grok-4.3
The pith
RLVR training on synthetic mazes lets VLMs solve spatial problems that base models never reach.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying RLVR extends the spatial reasoning boundary of VLMs, achieving success on problems where the base policy VLM consistently attains 0 percent accuracy despite increasing pass@k sampling budgets, indicating that the optimized policy successfully navigates search spaces that were effectively unreachable by the base distribution; the same training also produces measurable gains on out-of-domain real-world tasks.
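The claim turns on how pass@k behaves as the sampling budget grows, so a small sketch of the metric may help. The block below uses the standard unbiased pass@k estimator from the code-generation evaluation literature; the source does not say which estimator the paper uses, so treating it as the paper's metric is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the common combinatorial estimator, not necessarily
    the exact procedure used in the paper under review.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With zero correct samples (`c = 0`) the estimate is 0 for every `k`, which is the pattern the core claim describes for the base policy. A single correct sample among the `n` already makes the estimate positive and increasing in `k`.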
What carries the argument
Ariadne, a controlled synthetic maze navigation framework that sets reasoning difficulty by exact path length and number of turns.
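To make the two difficulty knobs concrete, here is a minimal sketch of how path length and turn count could be computed for a solution path on a grid maze. The grid representation, the `Cell` type, and the function `path_difficulty` are hypothetical; the source does not describe how Ariadne encodes mazes or measures these quantities.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); a hypothetical grid encoding


def path_difficulty(path: List[Cell]) -> Tuple[int, int]:
    """Return (number of steps, number of turns) for a 4-connected grid path."""
    # Direction of each step as a (row delta, col delta) pair.
    steps = [(r2 - r1, c2 - c1) for (r1, c1), (r2, c2) in zip(path, path[1:])]
    for dr, dc in steps:
        if abs(dr) + abs(dc) != 1:
            raise ValueError("consecutive cells must be 4-neighbours")
    # A turn is any change of direction between consecutive steps.
    turns = sum(1 for a, b in zip(steps, steps[1:]) if a != b)
    return len(steps), turns
```

For example, `path_difficulty([(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1)])` returns `(5, 2)`: five steps with two changes of direction.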
Load-bearing premise
The observed gains on real-world navigation tasks after purely synthetic training reflect genuine expansion of spatial reasoning rather than better sampling or memorization of maze patterns.
What would settle it
Test the RLVR model on mazes whose path lengths and turn counts exceed every example seen during training; if accuracy returns to the base model's 0 percent level, the claim of boundary extension does not hold.
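The test described above amounts to selecting evaluation mazes that lie strictly beyond the training envelope on both difficulty axes. A sketch of that selection follows, assuming each maze is summarized by a hypothetical `(path_length, turns)` pair; the function name and encoding are illustrative and not taken from the paper.

```python
from typing import Iterable, List, Tuple

Difficulty = Tuple[int, int]  # (path_length, turns); hypothetical encoding


def extrapolation_set(
    train: Iterable[Difficulty], candidates: Iterable[Difficulty]
) -> List[Difficulty]:
    """Keep candidates whose path length AND turn count exceed all training mazes."""
    train = list(train)
    if not train:
        raise ValueError("training set must not be empty")
    max_len = max(length for length, _ in train)
    max_turns = max(turns for _, turns in train)
    return [
        (length, turns)
        for length, turns in candidates
        if length > max_len and turns > max_turns
    ]
```

Requiring both axes to exceed the training maximum is the strict reading of the proposed test; a looser variant could accept mazes that exceed the maximum on either axis.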
Original abstract
Recent studies posit that Reinforcement Learning with Verifiable Rewards (RLVR) primarily amplifies behaviors inherent to the pre-training distribution rather than inducing new capabilities, but these insights are predominantly limited to language-only domains, leaving the dynamics of visual-centric spatial reasoning under-explored. To examine the impact of RLVR on the capability boundaries of Vision-Language Models (VLMs), we introduce Ariadne, a controlled framework based on synthetic maze navigation where the reasoning difficulty is precisely regulated by path length and the number of turns. We demonstrate that applying RLVR extends the spatial reasoning boundary, achieving success on problems where the base policy VLM consistently attains 0% accuracy despite increasing pass@k sampling budgets, indicating that the optimized policy successfully navigates search spaces that were effectively unreachable by the base distribution. Furthermore, despite being trained exclusively on synthetic mazes, we evaluate the model on two real-world navigation benchmarks (MapBench and ReasonMap) in a zero-shot setting. The observed improvements in these out-of-domain tasks suggest genuine spatial reasoning capability expansion rather than mere sampling efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether RLVR extends the capability boundaries of VLMs in spatial reasoning. It introduces the Ariadne synthetic maze framework, in which difficulty is controlled by path length and number of turns, and shows that RLVR enables non-zero success on mazes where the base VLM remains at 0% accuracy even as pass@k sampling budgets increase. Zero-shot evaluation on the real-world benchmarks MapBench and ReasonMap shows performance gains, which the authors interpret as evidence of genuine spatial reasoning expansion rather than improved sampling efficiency.
Significance. If the central results hold after addressing controls, the work would be significant for clarifying the scope of RLVR in multimodal settings. The controlled synthetic testbed provides a clear demonstration that optimized policies can reach trajectories outside the base distribution, and the out-of-domain transfer experiments offer a useful probe for capability expansion claims in vision-language reasoning.
major comments (2)
- [Evaluation on real-world navigation benchmarks] The interpretation of zero-shot gains on MapBench and ReasonMap as genuine capability expansion (rather than sampling efficiency) is load-bearing for the paper's broader claim. Unlike the synthetic mazes, where base VLM performance is explicitly shown to remain at 0% across increasing pass@k, the manuscript does not report equivalent high-k sampling results for the base policy on these real-world tasks. This omission leaves open whether RLVR is upweighting low-probability but reachable outputs or truly expanding the support of the policy.
- [Synthetic maze experiments (Ariadne framework)] The central 0% claim for the base VLM on synthetic mazes is presented as robust to increased sampling, but the manuscript lacks details on the exact range of k values tested, the number of independent trials per maze, and any statistical significance testing or variance estimates. Without these, it is difficult to rule out that the 0% reflects insufficient sampling budget rather than an unreachable region of the output space.
minor comments (2)
- [Abstract] The abstract states that RLVR achieves success 'despite increasing pass@k sampling budgets' but does not quantify the budgets or report the precise accuracy curves; adding a figure or table with these values would improve clarity.
- [Zero-shot transfer experiments] Potential data leakage or overlap between the synthetic training mazes and the real-world benchmarks should be explicitly discussed or ruled out, even if the authors consider the domains distinct.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of our experimental design and interpretation that we have addressed through revisions to improve the robustness of our claims on RLVR-induced capability expansion in VLMs.
Point-by-point responses
- Referee: [Evaluation on real-world navigation benchmarks] The interpretation of zero-shot gains on MapBench and ReasonMap as genuine capability expansion (rather than sampling efficiency) is load-bearing for the paper's broader claim. Unlike the synthetic mazes, where base VLM performance is explicitly shown to remain at 0% across increasing pass@k, the manuscript does not report equivalent high-k sampling results for the base policy on these real-world tasks. This omission leaves open whether RLVR is upweighting low-probability but reachable outputs or truly expanding the support of the policy.
Authors: We agree that equivalent high-k sampling controls on the real-world benchmarks would further substantiate the distinction between sampling efficiency and capability expansion. In the revised manuscript, we now report pass@k results for the base VLM on both MapBench and ReasonMap (with k up to 64), showing that base performance plateaus well below the RLVR model. These additions support our interpretation while acknowledging the computational constraints of exhaustive sampling on complex real-world images.
Revision: yes
- Referee: [Synthetic maze experiments (Ariadne framework)] The central 0% claim for the base VLM on synthetic mazes is presented as robust to increased sampling, but the manuscript lacks details on the exact range of k values tested, the number of independent trials per maze, and any statistical significance testing or variance estimates. Without these, it is difficult to rule out that the 0% reflects insufficient sampling budget rather than an unreachable region of the output space.
Authors: We thank the referee for pointing out this gap in reporting. The revised manuscript now includes these details: k values were tested from 1 to 128; each maze configuration was evaluated over 20 independent trials; and we report variance estimates with binomial proportion confidence intervals and p-values confirming the base model's 0% success rate is statistically distinguishable from non-zero performance even at the highest k. These additions rule out insufficient sampling as an explanation.
Revision: yes
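The exchange above hinges on what a run of zero successes can establish. As a hedged illustration, the sketch below computes the exact one-sided (Clopper-Pearson) upper confidence bound on the per-sample success probability when 0 of n independent samples succeed. The function name is illustrative, the independence of samples is an assumption, and nothing here reproduces the authors' actual statistical analysis, which the source does not detail.

```python
def zero_success_upper_bound(n: int, alpha: float = 0.05) -> float:
    """One-sided exact upper bound on p when 0 of n independent trials succeed.

    Assumption: trials are independent Bernoulli draws with a common success
    probability p. The bound holds with confidence 1 - alpha.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # With zero successes the bound solves (1 - p) ** n == alpha for p.
    return 1.0 - alpha ** (1.0 / n)
```

The bound shrinks roughly like 3/n at the 95% level (the "rule of three") but never reaches zero, so a finite sampling budget can only cap the base policy's success probability rather than show that a solution is strictly outside its support.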
Circularity Check
No significant circularity; empirical claims rest on independent benchmarks and direct measurements
full rationale
The paper's derivation chain consists of controlled empirical experiments rather than mathematical derivations or parameter fits. It introduces the Ariadne synthetic maze framework to regulate difficulty via path length and turns, directly measures base VLM accuracy at 0% across increasing pass@k budgets on these tasks, applies RLVR training, and then evaluates zero-shot transfer on the external MapBench and ReasonMap benchmarks. These steps rely on observable performance deltas against held-out real-world data and explicit base-policy controls within the synthetic distribution; no equation, fitted parameter, or self-citation is shown to reduce the central claim to its own inputs by construction. The evaluation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the base VLM policy has no capability on certain maze difficulties, as evidenced by consistent 0% accuracy even with pass@k sampling.