arxiv: 2601.09085 · v2 · pith:ILHP74PGnew · submitted 2026-01-14 · 💻 cs.LG · cs.AI · cs.CL · cs.IR

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

classification 💻 cs.LG · cs.AI · cs.CL · cs.IR
keywords training · mmr-grpo · across · benchmarks · completions · grpo · marginal · mathematical

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
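
The abstract does not spell out the reweighting rule, so the following is only a minimal sketch of how a generic Maximal Marginal Relevance score could be used to discount the rewards of redundant completions before computing group-relative advantages. The function names (`mmr_reweight`, `group_advantages`), the trade-off parameter `lam`, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration and are not taken from the paper or its released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one prompt's completions (hypothetical helper).

    At each step, pick the remaining completion that maximizes
        lam * reward - (1 - lam) * max similarity to already-picked completions
    and record that score as its adjusted reward.
    """
    remaining = set(range(len(rewards)))
    selected = []
    adjusted = [0.0] * len(rewards)
    while remaining:
        best, best_score = None, -math.inf
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,  # nothing selected yet, so no redundancy penalty
            )
            score = lam * rewards[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        adjusted[best] = best_score
        selected.append(best)
        remaining.remove(best)
    return adjusted


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: two correct but identical completions and one incorrect, distinct one.
rewards = [1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # assumed sentence embeddings
adjusted = mmr_reweight(rewards, embeddings, lam=0.5)
advantages = group_advantages(adjusted)
```

In this toy group the second correct completion duplicates the first, so its adjusted reward falls below that of the first one selected, which is the qualitative behaviour the abstract describes: redundant completions carry less learning signal.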

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

  2. Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.