pith. machine review for the scientific record.

arxiv: 2602.14872 · v2 · submitted 2026-02-16 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Recognition: unknown

The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords training · curriculum · dynamics · learning · reasoning · rewards · spectrum · analysis
original abstract

Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, the training dynamics enter a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions with prolonged plateaus before progress resumes. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
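To make the described mechanism concrete, here is a minimal toy sketch of an implicit curriculum under outcome-only rewards. It is not the paper's construction (which analyzes transformers and uses Fourier analysis on finite groups): the chain task, the per-sub-skill logit parameterization, and all hyperparameters below are illustrative assumptions. A problem of difficulty d succeeds only if sub-skills 1 through d all succeed, and the verifiable reward checks only that final outcome; under plain REINFORCE with a mixed-difficulty distribution, the sub-skills tend to be learned in order of difficulty without any explicit schedule.

```python
# Toy sketch (assumed setup, not the paper's model): mixed-difficulty REINFORCE
# on a chain task with a verifiable, outcome-only reward.
import numpy as np

rng = np.random.default_rng(0)

D = 5                      # hardest difficulty = number of sub-skills
theta = np.full(D, -2.0)   # one logit per sub-skill, all start weak
lr = 0.5
batch = 256

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(2001):
    grad = np.zeros(D)
    for _ in range(batch):
        d = rng.integers(1, D + 1)          # sample a problem of random difficulty
        p = sigmoid(theta[:d])              # per-sub-skill success probabilities
        outcome = rng.random(d) < p         # sample each sub-skill attempt
        reward = float(outcome.all())       # verifiable reward: final outcome only
        # REINFORCE: reward * grad_theta log-prob of the sampled trajectory,
        # where d/dtheta log Bernoulli(outcome; sigmoid(theta)) = outcome - p.
        grad[:d] += reward * (outcome - p)
    theta += lr * grad / batch
    if step % 500 == 0:
        solve = np.cumprod(sigmoid(theta))  # P(solve difficulty d) = prod_{k<=d} p_k
        print(step, np.round(solve, 2))
```

In this sketch, gradient signal for sub-skill k only arrives on rewarded trajectories, which at first come almost entirely from the easiest problems; as those are mastered, slightly harder problems start succeeding and relay signal to the next sub-skill, loosely mirroring the abstract's "relay regime" at the edge of competence.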

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Perturbations to Extrapolate Your LLM

    stat.ML 2026-05 unverdicted novelty 6.0

    A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.

  2. Perturbation is All You Need for Extrapolating Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

  3. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

  4. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...