Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
arXiv preprint arXiv:2402.12146 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3representative citing papers
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
citing papers explorer
-
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.