RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Chenyang Zhao; Hamish Ivison; Hannaneh Hajishirzi; Hao Peng; Jacqueline He; Lifan Yuan; Natasha Jaques; Pang Wei Koh; Runlong Zhou; Shuyue Stella Li

arxiv: 2511.07317 · v2 · pith:MBWDFESSnew · submitted 2025-11-10 · 💻 cs.CL · cs.LG

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

keywords: environments, verifiable, rlve, training, environment, learning, reasoning, rlve-gym

Abstract

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
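
The abstract does not spell out how an environment adapts its difficulty, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical toy environment (`SortingEnv`) whose problems get harder as an integer level grows, a binary verifier, and an invented update rule: unlock the next level once accuracy at the current hardest level reaches a threshold. The threshold, the attempt count, and the stand-in `toy_policy` are all illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class SortingEnv:
    """Hypothetical verifiable environment: sort a list of integers."""
    max_level: int = 0       # hardest difficulty level currently unlocked
    threshold: float = 0.9   # accuracy needed to unlock the next level (assumed)
    min_attempts: int = 8    # attempts collected before checking accuracy (assumed)
    correct: int = 0
    attempts: int = 0

    def generate(self, rng):
        """Procedurally generate a problem; list length grows with the level."""
        level = rng.randint(0, self.max_level)  # inclusive on both ends
        values = [rng.randint(0, 99) for _ in range(level + 2)]
        return level, values

    def verify(self, values, answer):
        """Algorithmic reward: 1.0 if the answer is the sorted list, else 0.0."""
        return 1.0 if answer == sorted(values) else 0.0

    def update(self, level, reward):
        """Track accuracy at the hardest level; unlock a harder one when mastered."""
        if level != self.max_level:
            return
        self.attempts += 1
        self.correct += int(reward == 1.0)
        if self.attempts >= self.min_attempts:
            if self.correct / self.attempts >= self.threshold:
                self.max_level += 1
            self.correct = 0
            self.attempts = 0


def toy_policy(values, level, skill):
    """Stand-in for the LM: solves problems up to `skill`, fails beyond it."""
    return sorted(values) if level <= skill else []


def run(steps=2000, skill=3, seed=0):
    """Simulate rollouts and return the hardest level the environment unlocked."""
    rng = random.Random(seed)
    env = SortingEnv()
    for _ in range(steps):
        level, values = env.generate(rng)
        answer = toy_policy(values, level, skill)
        env.update(level, env.verify(values, answer))
    return env.max_level


if __name__ == "__main__":
    print(run())
```

In this toy setup the unlocked difficulty stops one level above what the stand-in policy can solve, so sampled problems stay near the edge of its ability instead of remaining too easy or jumping far beyond reach.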

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
cs.CV 2026-05 accept novelty 6.0

ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.

ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
cs.AI 2026-01 unverdicted novelty 6.0

SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.

Gym-V: A Unified Vision Environment System for Agentic Vision Research
cs.CV 2026-03 unverdicted novelty 5.0

Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.