hub Mixed citations

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen · 2025 · cs.CL · arXiv 2501.03262

Mixed citation behavior. Most common role is background (57%).

66 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 66 citing papers arXiv PDF

abstract

Reinforcement Learning from Human Feedback~(RLHF) plays a crucial role in aligning Large Language Models~(LLMs). The dominant algorithm, Proximal Policy Optimization~(PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on \textit{prompt-level (local)} advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on \textbf{Global Advantage Normalization}. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound, \textit{effectively unbiased} estimate (whose bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm ($k \ge 1$) for general-domain RLHF, and REINFORCE++ /w baseline, a robust group-sampling variant ($k > 1$) for complex reasoning tasks. Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 3 method 3

citation-polarity summary

background 8 baseline 3 use method 3

representative citing papers

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

cs.LG · 2026-05-14 · conditional · novelty 8.0

DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.

AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

cs.LG · 2026-05-07 · conditional · novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.

Autogenesis: A Self-Evolving Agent Protocol

cs.AI · 2026-04-16 · unverdicted · novelty 7.0 · 2 refs

Autogenesis Protocol defines structured resource management and closed-loop self-evolution for multi-agent LLM systems, with the resulting AGS showing gains over baselines on long-horizon benchmarks.

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

Retrieval Augmented Conversational Recommendation with Reinforcement Learning

cs.IR · 2026-04-06 · unverdicted · novelty 7.0

RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

cs.AI · 2026-01-08 · conditional · novelty 7.0

Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

cs.LG · 2025-08-28 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

Bridging Textual Profiles and Latent User Embeddings for Personalization

cs.IR · 2026-05-07 · unverdicted · novelty 6.0

BLUE aligns LLM-generated textual user profiles with embedding-based recommendation objectives via reinforcement learning and next-item text supervision, yielding better zero-shot performance and cross-domain transfer than baselines.

citing papers explorer

Showing 50 of 66 citing papers.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts cs.LG · 2026-05-14 · conditional · none · ref 7 · internal anchor
DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 12 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG · 2026-05-17 · unverdicted · none · ref 12 · internal anchor
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play cs.AI · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition cs.HC · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 20 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 12 · internal anchor
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning cs.CL · 2026-04-18 · unverdicted · none · ref 8 · internal anchor
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.
Autogenesis: A Self-Evolving Agent Protocol cs.AI · 2026-04-16 · unverdicted · none · ref 7 · 2 links · internal anchor
Autogenesis Protocol defines structured resource management and closed-loop self-evolution for multi-agent LLM systems, with the resulting AGS showing gains over baselines on long-horizon benchmarks.
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration cs.LG · 2026-04-13 · unverdicted · none · ref 12 · internal anchor
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning cs.IR · 2026-04-06 · unverdicted · none · ref 18 · internal anchor
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 16 · internal anchor
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models cs.AI · 2026-01-08 · conditional · none · ref 4 · internal anchor
Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 7 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning cs.LG · 2025-08-28 · unverdicted · none · ref 16 · internal anchor
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example cs.LG · 2025-04-29 · accept · none · ref 15 · internal anchor
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
CLORE: Content-Level Optimization for Reasoning Efficiency cs.AI · 2026-05-21 · unverdicted · none · ref 19 · internal anchor
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning cs.CV · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play cs.AI · 2026-05-16 · unverdicted · none · ref 6 · internal anchor
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning cs.LG · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
AIPO: Learning to Reason from Active Interaction cs.CL · 2026-05-08 · unverdicted · none · ref 25 · 2 links · internal anchor
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
Bridging Textual Profiles and Latent User Embeddings for Personalization cs.IR · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
BLUE aligns LLM-generated textual user profiles with embedding-based recommendation objectives via reinforcement learning and next-item text supervision, yielding better zero-shot performance and cross-domain transfer than baselines.
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR cs.LG · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 9 · 2 links · internal anchor
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space cs.LG · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines cs.IR · 2026-04-15 · unverdicted · none · ref 5 · internal anchor
AuthGR is the first generative retriever to explicitly incorporate document authority alongside relevance using multimodal scoring and progressive training, yielding efficiency gains and real-world engagement improvements.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold cs.AI · 2026-04-15 · unverdicted · none · ref 18 · 2 links · internal anchor
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 22 · 2 links · internal anchor
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning cs.IR · 2026-04-09 · unverdicted · none · ref 21 · internal anchor
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 42 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Target Policy Optimization cs.LG · 2026-04-07 · unverdicted · none · ref 6 · internal anchor
TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning cs.CL · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
AgentGL is an RL-driven LLM agent framework for agentic graph learning that uses graph-native tools and curriculum training to outperform GraphLLM and GraphRAG baselines by up to 17.5% on node classification and 28.4% on link prediction across text-attributed graph benchmarks.
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue cs.CL · 2026-03-06 · unverdicted · none · ref 47 · internal anchor
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning cs.LG · 2026-02-08 · unverdicted · none · ref 11 · internal anchor
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization cs.CL · 2026-01-08 · unverdicted · none · ref 10 · internal anchor
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning cs.AI · 2026-01-08 · unverdicted · none · ref 16 · internal anchor
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling cs.LG · 2025-10-28 · unverdicted · none · ref 7 · internal anchor
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
Structured In-context Environment Scaling for Large Language Model Reasoning cs.CL · 2025-09-27 · conditional · none · ref 10 · internal anchor
SIE framework automatically constructs scalable, verifiable reasoning environments from structured data, improving in-domain performance and enabling generalization to out-of-domain math and logic tasks.
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework cs.CV · 2025-09-27 · unverdicted · none · ref 9 · internal anchor
DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.
SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning cs.LG · 2025-08-07 · unverdicted · none · ref 12 · internal anchor
SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.
Agentic Reinforced Policy Optimization cs.LG · 2025-07-26 · unverdicted · none · ref 1 · internal anchor
ARPO adds entropy-based adaptive rollouts and stepwise advantage attribution to RL for LLM agents, outperforming prior trajectory-level methods on 13 benchmarks with half the tool budget.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent cs.CL · 2025-07-03 · unverdicted · none · ref 57 · internal anchor
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL · 2025-06-18 · unverdicted · none · ref 22 · internal anchor
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks cs.AI · 2025-04-07 · unverdicted · none · ref 8 · internal anchor
VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.
DAPO: An Open-Source LLM Reinforcement Learning System at Scale cs.LG · 2025-03-18 · conditional · none · ref 15 · internal anchor
DAPO introduces decoupled clipping and dynamic sampling for LLM RL, achieving 50 on AIME 2024 with Qwen2.5-32B while fully open-sourcing code, data, and the verl-based training system.
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs cs.CL · 2025-03-03 · unverdicted · none · ref 1 · internal anchor
Language models that naturally exhibit verification, backtracking, subgoal setting, and backward chaining improve substantially during RL on verifiable tasks, and these behaviors can be instilled via priming with reasoning-focused examples or filtered pretraining to enable self-improvement.
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation cs.LG · 2026-05-20 · unverdicted · none · ref 17 · internal anchor
The paper shows that advantage collapse in GRPO causes training stagnation on math reasoning benchmarks and proposes AVSPO, which uses real-time monitoring to inject virtual reward samples and reduces collapse while improving accuracy by 4-6 points.

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer