Rethinking reflection in pre-training. arXiv preprint arXiv:2504.04022
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2025 (3)
representative citing papers
DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.
citing papers explorer
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
- Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
  DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.
- Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards