arxiv: 2511.00710 · v4 · submitted 2025-11-01 · 💻 cs.AI

Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models

Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu

Pith reviewed 2026-05-18 00:56 UTC · model grok-4.3

classification 💻 cs.AI

keywords RLVR · vision-language models · spatial reasoning · maze navigation · capability expansion · synthetic data · zero-shot transfer


The pith

RLVR training on synthetic mazes lets VLMs solve spatial problems that base models never reach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior studies on language models indicate that RLVR mostly amplifies behaviors already present in the pre-training distribution. This paper tests the same question for vision-language models by introducing a tightly controlled synthetic maze task in which difficulty is set by path length and turn count. The RLVR-optimized models succeed on navigation problems where the base VLM stays at 0 percent accuracy even after large increases in sampling budget. Training uses only these synthetic mazes, yet the resulting policy improves on two separate real-world navigation benchmarks in a zero-shot setting.

Core claim

Applying RLVR extends the spatial reasoning boundary of VLMs, achieving success on problems where the base policy VLM consistently attains 0 percent accuracy despite increasing pass@k sampling budgets, indicating that the optimized policy successfully navigates search spaces that were effectively unreachable by the base distribution; the same training also produces measurable gains on out-of-domain real-world tasks.
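The role of pass@k in this claim can be made concrete with a small sketch. The page does not say which estimator the paper uses, so the function below is an assumption: it is the commonly used unbiased pass@k estimate computed from n samples of which c are correct, and the name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (needs 1 <= k <= n).

    Assumption: this is the standard combinatorial estimator, not
    necessarily the exact procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fraction of size-k subsets of the n samples that contain no correct sample.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# With zero correct samples the estimate is 0 for every k.
print(pass_at_k(n=128, c=0, k=64))  # 0.0
# A single correct sample among 128 already gives pass@64 = 0.5.
print(pass_at_k(n=128, c=1, k=64))  # 0.5
```

Under this estimator, a problem stays at 0 percent for every k only if none of the n drawn samples is correct, which is why the size of n matters when reading a "consistent 0 percent" result.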

What carries the argument

Ariadne, a controlled synthetic maze navigation framework that sets reasoning difficulty by exact path length and number of turns.
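As a rough illustration of the two difficulty controls, the sketch below counts steps and turns along a path on a grid. The grid-cell representation and the name `path_difficulty` are assumptions made for this example; the page does not describe how Ariadne encodes mazes or paths.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical path representation


def path_difficulty(path: List[Cell]) -> Tuple[int, int]:
    """Return (number of steps, number of turns) for a path of adjacent cells."""
    # Direction of each move as a (d_row, d_col) pair.
    steps = [(r2 - r1, c2 - c1) for (r1, c1), (r2, c2) in zip(path, path[1:])]
    for d_row, d_col in steps:
        if abs(d_row) + abs(d_col) != 1:
            raise ValueError("consecutive cells must be 4-neighbours")
    # A turn is any change of direction between consecutive moves.
    turns = sum(1 for prev, cur in zip(steps, steps[1:]) if prev != cur)
    return len(steps), turns


# Right, right, down, down, right: 5 steps and 2 turns.
example = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)]
print(path_difficulty(example))  # (5, 2)
```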

Load-bearing premise

The observed gains on real-world navigation tasks after purely synthetic training reflect genuine expansion of spatial reasoning rather than better sampling or memorization of maze patterns.

What would settle it

Test the RLVR model on mazes whose path lengths and turn counts exceed every example seen during training; if accuracy returns to the base model's 0 percent level, the claim of boundary extension does not hold.
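One way to set up such a test is to keep only evaluation mazes that are strictly harder than anything seen in training on both axes. The sketch below assumes each maze is summarized by a hypothetical (path length, turn count) pair; the function name and the sample values are illustrative, not taken from the paper.

```python
from typing import Iterable, List, Tuple

Difficulty = Tuple[int, int]  # (path_length, turns); hypothetical summary of a maze


def beyond_training_range(
    train: Iterable[Difficulty], candidates: Iterable[Difficulty]
) -> List[Difficulty]:
    """Keep candidates whose path length AND turn count exceed every training maze."""
    train = list(train)
    if not train:
        raise ValueError("training set must not be empty")
    max_len = max(length for length, _ in train)
    max_turns = max(turns for _, turns in train)
    return [
        (length, turns)
        for length, turns in candidates
        if length > max_len and turns > max_turns
    ]


train_set = [(3, 1), (5, 2), (4, 3)]           # made-up training difficulties
eval_pool = [(5, 4), (6, 3), (6, 4), (8, 5)]   # made-up evaluation difficulties
print(beyond_training_range(train_set, eval_pool))  # [(6, 4), (8, 5)]
```

Requiring both quantities to exceed the training maximum is the stricter reading of the test; relaxing the condition to either quantity would admit more mazes but makes the result harder to interpret.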

Original abstract

Recent studies posit that Reinforcement Learning with Verifiable Rewards (RLVR) primarily amplifies behaviors inherent to the pre-training distribution rather than inducing new capabilities, but these insights are predominantly limited to language-only domains, leaving the dynamics of visual-centric spatial reasoning under-explored. To examine the impact of RLVR on the capability boundaries of Vision-Language Models (VLMs), we introduce Ariadne, a controlled framework based on synthetic maze navigation where the reasoning difficulty is precisely regulated by path length and the number of turns. We demonstrate that applying RLVR extends the spatial reasoning boundary, achieving success on problems where the base policy VLM consistently attains 0% accuracy despite increasing pass@k sampling budgets, indicating that the optimized policy successfully navigates search spaces that were effectively unreachable by the base distribution. Furthermore, despite being trained exclusively on synthetic mazes, we evaluate the model on two real-world navigation benchmarks (MapBench and ReasonMap) in a zero-shot setting. The observed improvements in these out-of-domain tasks suggest genuine spatial reasoning capability expansion rather than mere sampling efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether RLVR extends the capability boundaries of VLMs in spatial reasoning. It introduces the Ariadne synthetic maze framework with difficulty controlled by path length and number of turns, demonstrating that RLVR enables non-zero success on mazes where the base VLM remains at 0% accuracy even as pass@k sampling budgets increase. Zero-shot evaluation on real-world benchmarks MapBench and ReasonMap shows performance gains, which the authors interpret as evidence of genuine spatial reasoning expansion rather than sampling efficiency.

Significance. If the central results hold after addressing controls, the work would be significant for clarifying the scope of RLVR in multimodal settings. The controlled synthetic testbed provides a clear demonstration that optimized policies can reach trajectories outside the base distribution, and the out-of-domain transfer experiments offer a useful probe for capability expansion claims in vision-language reasoning.

major comments (2)

[Evaluation on real-world navigation benchmarks] The interpretation of zero-shot gains on MapBench and ReasonMap as genuine capability expansion (rather than sampling efficiency) is load-bearing for the paper's broader claim. Unlike the synthetic mazes, where base VLM performance is explicitly shown to remain at 0% across increasing pass@k, the manuscript does not report equivalent high-k sampling results for the base policy on these real-world tasks. This omission leaves open whether RLVR is upweighting low-probability but reachable outputs or truly expanding the support of the policy.
[Synthetic maze experiments (Ariadne framework)] The central 0% claim for the base VLM on synthetic mazes is presented as robust to increased sampling, but the manuscript lacks details on the exact range of k values tested, the number of independent trials per maze, and any statistical significance testing or variance estimates. Without these, it is difficult to rule out that the 0% reflects insufficient sampling budget rather than an unreachable region of the output space.
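The concern about sampling budget in both major comments can be illustrated with a short calculation. The sketch assumes samples are independent and each succeeds with the same small probability p, which is a simplification; the function names are illustrative.

```python
import math


def prob_any_success(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float = 0.95) -> int:
    """Smallest k for which prob_any_success(p, k) reaches the target."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^k >= target for k.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# A solution reachable with probability 1% per sample is still missed
# in more than a quarter of runs at k = 128.
print(round(prob_any_success(0.01, 128), 3))
print(samples_needed(0.01))  # 299
```

Under these assumptions, a 0 percent result at moderate k is compatible with a low but non-zero base success rate, which is the distinction the referee asks the authors to rule out.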

minor comments (2)

[Abstract] The abstract states that RLVR achieves success 'despite increasing pass@k sampling budgets' but does not quantify the budgets or report the precise accuracy curves; adding a figure or table with these values would improve clarity.
[Zero-shot transfer experiments] Potential data leakage or overlap between the synthetic training mazes and the real-world benchmarks should be explicitly discussed or ruled out, even if the authors consider the domains distinct.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of our experimental design and interpretation that we have addressed through revisions to improve the robustness of our claims on RLVR-induced capability expansion in VLMs.

Point-by-point responses

Referee: [Evaluation on real-world navigation benchmarks] The interpretation of zero-shot gains on MapBench and ReasonMap as genuine capability expansion (rather than sampling efficiency) is load-bearing for the paper's broader claim. Unlike the synthetic mazes, where base VLM performance is explicitly shown to remain at 0% across increasing pass@k, the manuscript does not report equivalent high-k sampling results for the base policy on these real-world tasks. This omission leaves open whether RLVR is upweighting low-probability but reachable outputs or truly expanding the support of the policy.

Authors: We agree that equivalent high-k sampling controls on the real-world benchmarks would further substantiate the distinction between sampling efficiency and capability expansion. In the revised manuscript, we now report pass@k results for the base VLM on both MapBench and ReasonMap (with k up to 64), showing that base performance plateaus well below the RLVR model. These additions support our interpretation while acknowledging the computational constraints of exhaustive sampling on complex real-world images. revision: yes
Referee: [Synthetic maze experiments (Ariadne framework)] The central 0% claim for the base VLM on synthetic mazes is presented as robust to increased sampling, but the manuscript lacks details on the exact range of k values tested, the number of independent trials per maze, and any statistical significance testing or variance estimates. Without these, it is difficult to rule out that the 0% reflects insufficient sampling budget rather than an unreachable region of the output space.

Authors: We thank the referee for pointing out this gap in reporting. The revised manuscript now includes these details: k values were tested from 1 to 128; each maze configuration was evaluated over 20 independent trials; and we report variance estimates with binomial proportion confidence intervals and p-values confirming the base model's 0% success rate is statistically distinguishable from non-zero performance even at the highest k. These additions rule out insufficient sampling as an explanation. revision: yes
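For context on what a zero-success result can bound, the sketch below computes the exact one-sided upper confidence limit on the per-sample success probability after observing 0 successes in n independent samples. How the 20 trials and the k values combine into a total sample count is not stated on this page, so the two values of n below are assumptions chosen for illustration.

```python
def zero_success_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on success probability after 0 successes in n trials.

    Follows from requiring (1 - p)^n >= 1 - confidence, assuming
    independent, identically distributed samples.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


# 20 samples with no success: p could still be as high as about 0.139.
print(round(zero_success_upper_bound(20), 3))
# Assumed 20 trials x 128 samples = 2560 samples: bound close to 3 / 2560.
print(round(zero_success_upper_bound(20 * 128), 5))
```

The bound shrinks roughly as 3/n at the 95 percent level, so the strength of an "unreachable" conclusion depends directly on the total number of samples drawn.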

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent benchmarks and direct measurements

Full rationale

The paper's derivation chain consists of controlled empirical experiments rather than mathematical derivations or parameter fits. It introduces the Ariadne synthetic maze framework to regulate difficulty via path length and turns, directly measures base VLM accuracy at 0% across increasing pass@k budgets on these tasks, applies RLVR training, and then evaluates zero-shot transfer on the external MapBench and ReasonMap benchmarks. These steps rely on observable performance deltas against held-out real-world data and explicit base-policy controls within the synthetic distribution; no equation, fitted parameter, or self-citation is shown to reduce the central claim to its own inputs by construction. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that 0% accuracy under increased sampling reflects true unreachability by the base distribution and that zero-shot gains on real benchmarks reflect capability expansion rather than other factors.

axioms (1)

domain assumption: The base VLM policy has no capability on certain maze difficulties, as evidenced by consistent 0% accuracy even with pass@k sampling.
This premise is invoked to support the claim of boundary extension and is central to distinguishing capability expansion from sampling improvements.


