hub Canonical reference

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu · 2025 · cs.AI · arXiv 2504.05118

Canonical reference. 76% of citing Pith papers cite this work as background.

64 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 64 citing papers arXiv PDF

abstract

We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models., a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 method 3 dataset 1

citation-polarity summary

background 13 use method 3 use dataset 1

representative citing papers

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

cs.LG · 2026-04-12 · unverdicted · novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

cs.LG · 2026-02-03 · unverdicted · novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

cs.LG · 2026-01-26 · unverdicted · novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

What are Key Factors for Updates in RL for LLM Reasoning?

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

Theoretical analysis of RLVR update dynamics leads to ACPO, an adaptive clipping method that outperforms DAPO and CISPO on reasoning benchmarks with 3B and 7B models.

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Hidden-Align adds an auxiliary loss to align hidden states of correct reasoning paths at the pre-answer token in RLVR, improving pass@1 by 3.8-6.2 points over DAPO on eight math benchmarks for Qwen3 models of 1.7B-14B scale.

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

citing papers explorer

Showing 50 of 56 citing papers after filters.

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking cs.CL · 2026-07-01 · unverdicted · none · ref 91 · internal anchor
DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs cs.LG · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits cs.LG · 2026-05-09 · unverdicted · none · ref 28 · internal anchor
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent cs.LG · 2026-05-04 · unverdicted · none · ref 52 · internal anchor
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning cs.LG · 2026-04-12 · unverdicted · none · ref 17 · internal anchor
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation cs.IR · 2026-04-04 · unverdicted · none · ref 44 · internal anchor
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing cs.LG · 2026-02-03 · unverdicted · none · ref 27 · internal anchor
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 43 · internal anchor
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models cs.LG · 2026-01-26 · unverdicted · none · ref 23 · internal anchor
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation cs.LG · 2025-11-11 · unverdicted · none · ref 31 · internal anchor
MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 25 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
What are Key Factors for Updates in RL for LLM Reasoning? cs.CL · 2026-06-21 · unverdicted · none · ref 7 · internal anchor
Theoretical analysis of RLVR update dynamics leads to ACPO, an adaptive clipping method that outperforms DAPO and CISPO on reasoning benchmarks with 3B and 7B models.
Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues cs.CL · 2026-06-03 · unverdicted · none · ref 58 · internal anchor
Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.
Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning cs.LG · 2026-06-02 · unverdicted · none · ref 34 · internal anchor
Hidden-Align adds an auxiliary loss to align hidden states of correct reasoning paths at the pre-answer token in RLVR, improving pass@1 by 3.8-6.2 points over DAPO on eight math benchmarks for Qwen3 models of 1.7B-14B scale.
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling cs.CL · 2026-06-02 · unverdicted · none · ref 32 · internal anchor
RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.
Self-Supervised On-Policy Distillation for Reasoning Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 14 · internal anchor
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control cs.LG · 2026-05-12 · unverdicted · none · ref 6 · 2 links · internal anchor
Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · none · ref 72 · internal anchor
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 23 · internal anchor
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
AIPO: Learning to Reason from Active Interaction cs.CL · 2026-05-08 · unverdicted · none · ref 75 · 2 links · internal anchor
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control cs.LG · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 107 · internal anchor
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning cs.AI · 2026-05-02 · unverdicted · none · ref 18 · internal anchor
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling cs.CL · 2026-04-29 · unverdicted · none · ref 29 · internal anchor
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 38 · internal anchor
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning cs.LG · 2026-04-22 · unverdicted · none · ref 22 · internal anchor
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 61 · internal anchor
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 30 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning cs.LG · 2026-02-08 · unverdicted · none · ref 40 · internal anchor
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training cs.LG · 2026-01-12 · unverdicted · none · ref 17 · internal anchor
SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.
From Ranking to Reasoning: Explainable Web API Recommendation via Semantic Reasoning cs.SE · 2025-11-08 · unverdicted · none · ref 3 · internal anchor
WAR-R1 combines special start/stop tokens in an LLM with supervised fine-tuning and GRPO reinforcement learning to deliver adaptive, explainable Web API recommendations that improve accuracy by up to 10.89% on ProgrammableWeb data.
When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL cs.CL · 2025-10-07 · unverdicted · none · ref 15 · internal anchor
The paper identifies that importance sampling ratios in outcome-supervised RL misallocate credit by creating unbalanced token updates, and introduces ASPO to correct the asymmetry for positive-advantage tokens.
Multiplayer Nash Preference Optimization cs.AI · 2025-09-27 · unverdicted · none · ref 37 · internal anchor
MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling cs.CL · 2025-08-12 · unverdicted · none · ref 57 · internal anchor
InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR cs.CL · 2025-07-21 · unverdicted · none · ref 52 · internal anchor
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning cs.CL · 2025-06-06 · unverdicted · none · ref 6 · internal anchor
Writing-RL applies adaptive curriculum RL with pairwise rewards and dynamic scheduling to enhance long-form writing in 7B LLMs over SFT baselines and shows generalization to long-input reasoning tasks.
Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning cs.LG · 2026-06-20 · unverdicted · none · ref 256 · internal anchor
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
VIMPO: Value-Implicit Policy Optimization for LLMs cs.LG · 2026-06-18 · unverdicted · none · ref 14 · internal anchor
VIMPO derives a policy-implied value function from optimality conditions for critic-free RL in LLMs and shows gains over GRPO on math benchmarks.
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff cs.LG · 2026-06-07 · unverdicted · none · ref 92 · internal anchor
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning cs.LG · 2026-05-31 · unverdicted · none · ref 61 · internal anchor
POPO uses recency-based prioritized group replay and decoupled off-policy optimization to avoid zero-variance ineffective samples in RLVR, accelerating LLM reasoning finetuning with fewer rollouts.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 205 · internal anchor
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts cs.LG · 2026-05-30 · unverdicted · none · ref 6 · internal anchor
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs cs.AI · 2026-05-27 · unverdicted · none · ref 48 · internal anchor
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments cs.AI · 2026-05-26 · unverdicted · none · ref 49 · internal anchor
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals cs.LG · 2026-05-21 · unverdicted · none · ref 26 · internal anchor
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR cs.DC · 2026-05-20 · unverdicted · none · ref 44 · internal anchor
PlexRL multiplexes unified LLM services across RLVR jobs at the cluster level to exploit anti-correlated idle times and reduce GPU-hour costs by up to 37.58% with minimal per-job overhead.

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer