{"work":{"id":"557f9e99-cb00-4dd2-92fd-67ddcddbb35d","openalex_id":null,"doi":null,"arxiv_id":"2501.03262","raw_key":null,"title":"REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization","authors":null,"authors_text":"Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen","year":2025,"venue":"cs.CL","abstract":"Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound, effectively unbiased estimate (whose bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm ($k \\ge 1$) for general-domain RLHF, and REINFORCE++ /w baseline, a robust group-sampling variant ($k > 1$) for complex reasoning tasks. 
Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.","external_url":"https://arxiv.org/abs/2501.03262","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-22T23:35:13.450285+00:00","pith_arxiv_id":"2501.03262","created_at":"2026-05-09T06:40:40.711207+00:00","updated_at":"2026-05-22T23:35:13.450285+00:00","title_quality_ok":true,"display_title":"REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization","render_title":"REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization"},"hub":{"state":{"work_id":"557f9e99-cb00-4dd2-92fd-67ddcddbb35d","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":66,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2025-03-03T08:46:22+00:00","last_pith_cited_at":"2026-05-21T09:16:27+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T04:26:13.078470+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":8},{"context_role":"baseline","n":3},{"context_role":"method","n":3}],"polarity_counts":[{"context_polarity":"background","n":8},{"context_polarity":"baseline","n":3},{"context_polarity":"use_method","n":3}],"runs":{},"summary":{},"graph":{},"authors":[]}}