EasyRL trains LLMs data-efficiently: it warms up on easy labeled samples, then uses divide-and-conquer pseudo-labeling and progressive self-training to handle harder unlabeled data, outperforming baselines with only 10% of the labeled data.
For RL training, we adopt the GRPO algorithm with a maximum sequence length of 4096.
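As a rough illustration of the setup quoted above, the sketch below shows the group-relative advantage step that GRPO is generally known for: rewards for several sampled responses to the same prompt are normalized by the group's mean and standard deviation. The function names, the `eps` constant, and the reading of 4096 as a token count applied by simple truncation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

# 4096 comes from the quoted setup; treating it as a token limit is an assumption.
MAX_SEQ_LEN = 4096


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: returns (r - group_mean) / (group_std + eps) per response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def truncate(token_ids, max_len=MAX_SEQ_LEN):
    """Hypothetical helper: cap a token sequence at the maximum length."""
    return token_ids[:max_len]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    print(len(truncate(list(range(5000)))))
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal in this sketch.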
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning