hub Canonical reference

The invisible leash: Why rlvr may or may not escape its origin

The invisible leash: Why rlvr may or may not escape its origin · 2025 · arXiv 2507.14843

Canonical reference. 80% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 22 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 4 baseline 1

representative citing papers

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

cs.LG · 2026-05-27 · conditional · novelty 7.0

Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.

On Advantage Estimates for Max@K Policy Gradients

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

REFT improves Pass@1/8/64 in RLVR by uniform first-token sampling from top-N candidates across 0.5B-7B models and multiple difficulty levels.

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

cs.AI · 2026-05-27 · conditional · novelty 6.0

Coarser compressed CoT needs more SFT data, scales differently with repetition, and RL later breaks apart the compressed steps learned in SFT.

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

cs.LG · 2025-09-17 · unverdicted · novelty 6.0

Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

On the Generalization Gap in Self-Evolving Language Model Reasoning

cs.CL · 2026-05-31 · unverdicted · novelty 5.0

Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

cs.CL · 2026-04-09 · accept · novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

Policy Improvement Reinforcement Learning

cs.LG · 2026-04-01

citing papers explorer

Showing 22 of 22 citing papers.

On the Geometry of On-Policy Distillation cs.LG · 2026-06-05 · unverdicted · none · ref 18
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation cs.LG · 2026-06-04 · unverdicted · none · ref 102
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL cs.LG · 2026-05-27 · conditional · none · ref 44
Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.
CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning cs.LG · 2026-05-23 · unverdicted · none · ref 30
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 35 · 2 links
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR cs.LG · 2026-05-08 · unverdicted · none · ref 14
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 70
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning cs.LG · 2026-06-16 · unverdicted · none · ref 27
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
On Advantage Estimates for Max@K Policy Gradients cs.LG · 2026-06-04 · unverdicted · none · ref 60
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR cs.AI · 2026-05-27 · unverdicted · none · ref 34
REFT improves Pass@1/8/64 in RLVR by uniform first-token sampling from top-N candidates across 0.5B-7B models and multiple difficulty levels.
Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training cs.AI · 2026-05-27 · conditional · none · ref 4
Coarser compressed CoT needs more SFT data, scales differently with repetition, and RL later breaks apart the compressed steps learned in SFT.
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR cs.AI · 2026-05-15 · unverdicted · none · ref 23
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures cs.LG · 2026-05-12 · unverdicted · none · ref 50
MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.
Generalization in LLM Problem Solving: The Case of the Shortest Path cs.AI · 2026-04-16 · unverdicted · none · ref 51
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision cs.LG · 2025-09-17 · unverdicted · none · ref 27
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
On the Generalization Gap in Self-Evolving Language Model Reasoning cs.CL · 2026-05-31 · unverdicted · none · ref 38
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors cs.AI · 2026-05-09 · unverdicted · none · ref 39
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning cs.CL · 2026-04-11 · unverdicted · none · ref 18
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs cs.CL · 2026-04-11 · unverdicted · none · ref 18
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning cs.CL · 2026-04-09 · accept · none · ref 68
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Polychromic Objectives for Reinforcement Learning cs.LG · 2025-09-29 · unverdicted · none · ref 43
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
Policy Improvement Reinforcement Learning cs.LG · 2026-04-01 · unreviewed · ref 47

The invisible leash: Why rlvr may or may not escape its origin

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer