Ceva’s theorem
8 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
  SGPO extracts strategies from strong-model responses, builds autonomous and guided trajectories, and applies token-level forward-KL distillation with adaptive weighting to outperform SFT and RL baselines by 2.2 points on math benchmarks.
- Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
  ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
  Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
- CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
  CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
- Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
  NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
- ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
  ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
- Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment