SGPO extracts strategies from strong-model responses, builds autonomous and guided trajectories, and applies token-level forward-KL distillation with adaptive weighting to outperform SFT and RL baselines by 2.2 points on math benchmarks.
Ceva’s theorem
8 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.