For RL training, we adopt the GRPO algorithm with a maximum sequence length of 4096

All experiments are conducted with the vLLM framework on 2 x NVIDIA H200 GPUs (140GB) · 2025
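
The GRPO setup quoted above can be illustrated with a short sketch. GRPO scores each sampled response relative to the other responses drawn for the same prompt, normalizing its reward by the group mean and standard deviation. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `group_relative_advantages` and `truncate`, the `eps` constant, the use of the population standard deviation, and the reading of 4096 as a token count are all assumptions introduced here.

```python
from statistics import mean, pstdev

# Maximum sequence length quoted above; assumed here to be measured in tokens.
MAX_SEQ_LEN = 4096


def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical helper: normalize each reward against its own sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def truncate(token_ids, max_len=MAX_SEQ_LEN):
    """Hypothetical helper: clip a token sequence to the maximum length."""
    return token_ids[:max_len]


# Four responses sampled for one prompt, with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
clipped = truncate(list(range(5000)))
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.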

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

EasyRL trains LLMs data-efficiently by warming up on easy labeled samples then using divide-and-conquer pseudo-labeling and progressive self-training to handle harder unlabeled data, outperforming baselines with only 10% of the labeled data.
