On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
8 Pith papers cite this work.
Years: 2026 (8)
Verdicts: UNVERDICTED (8)
Citing papers explorer
- Not only where, But when: Temporal Scheduling for RLVR
  Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
- APPO: Agentic Procedural Policy Optimization
  APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving baselines by nearly 4 points on 13 benchmarks.
- EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA
  EAPO uses policy entropy ratio to adaptively weight positive samples in RLVR for open-ended QA, claiming better diversity and stability than fixed-weight baselines on medical datasets.
- Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
  DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.
- Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning
  MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling to selectively unlearn RLVR-induced reasoning, achieving significant forgetting on MATH while preserving GSM8K and retain MATH unlike full-parameter updates.
- Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
  Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
- One-Way Policy Optimization for Self-Evolving LLMs
  OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.