LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment

· 2025 · cs.LG · arXiv 2506.11480

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects the learnable and representative training reasoning data for RLVR post-training. To overcome the well-known response-length bias in gradient norms, we introduce the data learnability based on the success rate, which indicates the learning potential of each data point. Experiments across five reasoning benchmarks show that our method significantly reduces training data requirements while achieving minor performance degradation or even improving performance compared to full-data training. Specifically, it reduces data requirements by up to 1,000 data points with better performance (77.5%) than that on the full dataset on the GSM8K benchmark (77.0%). Furthermore, its efficiency is demonstrated on both mathematical and code benchmarks by using much less data from the DAPO-MATH-17K dataset.
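The abstract describes scoring each data point by a success-rate-based learnability but does not give the formula. The following is a minimal sketch of that idea only, under stated assumptions: the score `p * (1 - p)` is an illustrative proxy chosen here (it peaks when a prompt is solved about half the time), not necessarily the paper's definition, and the names `success_rate`, `learnability`, and `select_top_k` are hypothetical. The gradient-alignment (representativeness) part of LearnAlign is not modeled.

```python
from typing import Dict, List, Sequence


def success_rate(rewards: Sequence[int]) -> float:
    """Fraction of sampled responses a verifier marked correct (1) rather than incorrect (0)."""
    if not rewards:
        raise ValueError("need at least one rollout per prompt")
    return sum(rewards) / len(rewards)


def learnability(p: float) -> float:
    """ASSUMPTION: p * (1 - p) is an illustrative proxy, not the paper's stated formula.

    It is 0 for prompts that are always or never solved and largest at p = 0.5.
    """
    return p * (1.0 - p)


def select_top_k(rollout_rewards: Dict[str, Sequence[int]], k: int) -> List[str]:
    """Return the k prompt ids with the highest learnability score (ties broken by id)."""
    scores = {pid: learnability(success_rate(r)) for pid, r in rollout_rewards.items()}
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k]


# Hypothetical verifier outcomes for four prompts, four rollouts each.
rollouts = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "half_solved": [1, 0, 1, 0],
    "mostly_solved": [1, 1, 1, 0],
}
selected = select_top_k(rollouts, k=2)
```

Because the score depends only on the success rate, it does not grow with response length, which is the bias in gradient norms that the abstract says the learnability term is meant to overcome.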

representative citing papers

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

cs.AI · 2026-01-08 · unverdicted · novelty 6.0

SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.

citing papers explorer

Showing 1 of 1 citing paper.

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning · cs.AI · 2026-01-08 · unverdicted · none · ref 19 · internal anchor
