hub Canonical reference

Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu · 2025 · cs.LG · arXiv 2504.14945

Canonical reference. 88% of citing Pith papers cite this work as background.

34 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 34 citing papers arXiv PDF

abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards~(\textit{RLVR}). However, existing \textit{RLVR} approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce \textbf{LUFFY} (\textbf{L}earning to reason \textbf{U}nder o\textbf{FF}-polic\textbf{Y} guidance), a framework that augments \textit{RLVR} with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over \textbf{+6.4} average gain across six math benchmarks and an advantage of over \textbf{+6.2} points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 1

citation-polarity summary

background 7 baseline 1

representative citing papers

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

Harnessing LLM Agents with Skill Programs

cs.AI · 2026-05-18 · conditional · novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.

VSPO: Vector-Steered Policy Optimization for Behavioral Control

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

Teacher-Guided Policy Optimization for LLM Distillation

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

cs.CL · 2026-04-25 · unverdicted · novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 6.0 · 2 refs

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

cs.DC · 2026-04-10 · unverdicted · novelty 6.0

TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

Can LLMs Learn to Reason Robustly under Noisy Supervision?

cs.LG · 2026-04-05 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

cs.LG · 2026-03-11 · unverdicted · novelty 6.0

HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

cs.AI · 2026-01-11 · unverdicted · novelty 6.0

ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

citing papers explorer

Showing 34 of 34 citing papers.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures cs.LG · 2026-05-12 · unverdicted · none · ref 55 · internal anchor
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 68 · internal anchor
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 25 · internal anchor
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration cs.LG · 2026-04-13 · unverdicted · none · ref 37 · internal anchor
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 21 · internal anchor
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 38 · internal anchor
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling cs.LG · 2025-07-02 · unverdicted · none · ref 43 · internal anchor
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 74 · internal anchor
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 38 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
Harnessing LLM Agents with Skill Programs cs.AI · 2026-05-18 · conditional · none · ref 38 · internal anchor
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
VSPO: Vector-Steered Policy Optimization for Behavioral Control cs.LG · 2026-05-15 · unverdicted · none · ref 27 · internal anchor
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 41 · internal anchor
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
Teacher-Guided Policy Optimization for LLM Distillation cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 33 · internal anchor
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance cs.CL · 2026-04-25 · unverdicted · none · ref 26 · internal anchor
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models cs.CL · 2026-04-18 · unverdicted · none · ref 42 · internal anchor
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space cs.LG · 2026-04-15 · unverdicted · none · ref 65 · internal anchor
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 36 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training cs.DC · 2026-04-10 · unverdicted · none · ref 43 · internal anchor
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
Can LLMs Learn to Reason Robustly under Noisy Supervision? cs.LG · 2026-04-05 · conditional · none · ref 28 · internal anchor
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG · 2026-03-11 · unverdicted · none · ref 16 · internal anchor
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning cs.AI · 2026-01-11 · unverdicted · none · ref 21 · internal anchor
ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.
SPHINX: A Synthetic Environment for Visual Perception and Reasoning cs.CV · 2025-11-25 · unverdicted · none · ref 63 · internal anchor
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling cs.CL · 2025-08-12 · unverdicted · none · ref 50 · internal anchor
InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.
EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning cs.LG · 2025-08-11 · unverdicted · none · ref 17 · internal anchor
EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization cs.AI · 2025-07-31 · unverdicted · none · ref 19 · internal anchor
RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.
FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning cs.LG · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
FBOS-RL is a feedback-driven bi-objective RL framework that combines Feedback-Guided Exploration Enhancement with Exploitation-oriented Policy Alignment and Exploration-oriented Capability Cultivation to raise training speed and final performance over GRPO under fixed rollout budgets.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning cs.AI · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning cs.LG · 2025-12-12 · unverdicted · none · ref 43 · internal anchor
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration cs.LG · 2025-08-19 · unverdicted · none · ref 16 · internal anchor
DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.
Sample-efficient LLM Optimization with Reset Replay cs.LG · 2025-08-08 · unverdicted · none · ref 17 · internal anchor
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 40 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
A Survey of On-Policy Distillation for Large Language Models cs.LG · 2026-04-01 · unverdicted · none · ref 7 · 2 links · internal anchor
A survey that formalizes on-policy distillation as f-divergence minimization over student-sampled trajectories and organizes the literature along three design axes while linking it to KL-constrained RL.

Learning to Reason under Off-Policy Guidance

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer