Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.
Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.
citing papers explorer
-
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.
-
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
-
Reinforcement Learning for LLM Post-Training: A Survey
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.