REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Mixed citation behavior. Most common role is background (53%).

78 Pith papers citing it
Background 53% of classified citations
abstract

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation and a tendency to overfit and, as we show, yields a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound, effectively unbiased estimate (one whose bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm ($k \ge 1$) for general-domain RLHF, and REINFORCE++ w/ baseline, a robust group-sampling variant ($k > 1$) for complex reasoning tasks. Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.
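
The contrast the abstract draws between prompt-level (local) and global advantage normalization can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, the `eps` constant, the reading of `k` as the number of sampled responses per prompt, and the group-mean subtraction used for the baseline variant are all assumptions made for illustration; other components of a full training loop (KL penalties, clipping, token-level credit) are omitted.

```python
import numpy as np


def local_normalized_advantages(rewards, eps=1e-8):
    """Prompt-level (local) normalization, in the style of GRPO.

    rewards: array of shape (num_prompts, k), one row per prompt.
    Each row is standardized with that row's own mean and std.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def global_normalized_advantages(rewards, use_group_baseline=False, eps=1e-8):
    """Global normalization over the whole batch (hypothetical sketch).

    If use_group_baseline is True, the per-prompt mean reward is
    subtracted first (assumed form of the group-sampling variant, k > 1).
    The result is then standardized with statistics of the entire batch.
    """
    adv = np.asarray(rewards, dtype=np.float64)
    if use_group_baseline:
        adv = adv - adv.mean(axis=1, keepdims=True)
    return (adv - adv.mean()) / (adv.std() + eps)


# Two prompts, k = 4 sampled responses each, binary rewards.
example_rewards = np.array([[1.0, 0.0, 0.0, 0.0],
                            [1.0, 1.0, 1.0, 0.0]])
local_adv = local_normalized_advantages(example_rewards)
global_adv = global_normalized_advantages(example_rewards)
baseline_adv = global_normalized_advantages(example_rewards, use_group_baseline=True)
```

In this toy batch, local normalization rescales each prompt by its own small-sample standard deviation, so the single correct answer in the first row receives a larger advantage than any correct answer in the second row. Global normalization uses one mean and one standard deviation for all eight samples, so every reward of 1 maps to approximately +1 and every reward of 0 to approximately -1.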

citation-role summary

background 8 · method 4 · baseline 3

representative citing papers

RAISE: RAG Design as an Architecture Search Problem

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

RAISE is a standardized benchmark for RAG hyperparameter optimization that evaluates 13 search algorithms across seven datasets and finds performance is highly task-dependent.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

cs.AI · 2026-01-08 · conditional · novelty 7.0

Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

citing papers explorer

Showing 50 of 70 citing papers after filters.