GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang · 2025 · arXiv 2504.02546

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 1 · method: 1

citation-polarity summary

background: 2 · baseline: 1 · use method: 1

representative citing papers

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

cs.AI · 2025-09-29 · unverdicted · novelty 6.0

DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

OneThinker: All-in-one Reasoning Model for Image and Video

cs.CV · 2025-12-02 · unverdicted · novelty 5.0

OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

cs.LG · 2025-10-11 · unverdicted · novelty 5.0

Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.

AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

cs.RO · 2025-09-02 · unverdicted · novelty 5.0

AutoDrive-R² adds four-step CoT reasoning with self-reflection to VLA models via SFT on nuScenesR²-6K and GRPO RL under spatial, dynamic, and smoothness rewards, reporting SOTA results on nuScenes and Waymo.

Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning

cs.CL · 2025-05-20 · unverdicted · novelty 5.0

Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

cs.AI · 2026-05-11 · unverdicted · novelty 4.0

EXPO improves GRPO for LLM mathematical reasoning via accuracy-conditioned KL scaling and Gaussian curriculum sampling, delivering gains such as 13.34 points on AIME 2025 pass@32.

Reinforcement Learning from Human Feedback: A Statistical Perspective

stat.ML · 2026-04-02 · accept · novelty 2.0

A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.

citing papers explorer

Showing 15 of 15 citing papers.

Group-in-Group Policy Optimization for LLM Agent Training · cs.LG · 2025-05-16 · unverdicted · none · ref 72
Holder Policy Optimisation · cs.LG · 2026-05-12 · unverdicted · none · ref 23 · 2 links
FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum · cs.LG · 2026-05-12 · unverdicted · none · ref 2
Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning · cs.LG · 2026-04-30 · unverdicted · none · ref 3 · 2 links
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search · cs.AI · 2025-09-29 · unverdicted · none · ref 4
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR · cs.CL · 2025-07-21 · unverdicted · none · ref 6
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning · cs.CL · 2026-04-11 · unverdicted · none · ref 51
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs · cs.CL · 2026-04-11 · unverdicted · none · ref 51
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks · cs.CV · 2026-04-09 · unverdicted · none · ref 6
OneThinker: All-in-one Reasoning Model for Image and Video · cs.CV · 2025-12-02 · unverdicted · none · ref 19
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective · cs.LG · 2025-10-11 · unverdicted · none · ref 3
AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving · cs.RO · 2025-09-02 · unverdicted · none · ref 9
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning · cs.CL · 2025-05-20 · unverdicted · none · ref 15
EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling · cs.AI · 2026-05-11 · unverdicted · none · ref 2
Reinforcement Learning from Human Feedback: A Statistical Perspective · stat.ML · 2026-04-02 · accept · none · ref 17
