MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

Kangda Wei; Ruihong Huang

arxiv: 2601.09085 · v2 · pith:ILHP74PGnew · submitted 2026-01-14 · cs.LG · cs.AI · cs.CL · cs.IR



keywords: training, mmr-grpo, across, benchmarks, completions, grpo, marginal, mathematical


Abstract

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
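The abstract describes reweighting rewards by completion diversity using Maximal Marginal Relevance, but it does not give the exact formula. The sketch below is therefore only an illustration of the general idea, not the paper's algorithm: it applies the textbook greedy MMR score (reward minus a penalty for similarity to already-selected completions) and then the usual group-relative normalization used in GRPO-style training. The function names `mmr_reweight` and `group_advantages`, the trade-off parameter `lam`, and the use of cosine similarity over precomputed completion embeddings are all assumptions introduced for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Hypothetical MMR-style reweighting (an assumption, not the paper's formula).

    Greedily orders completions by
        lam * reward - (1 - lam) * max similarity to already-selected completions
    and returns that score for each completion, in the original order.
    """
    remaining = set(range(len(rewards)))
    selected = []
    new_rewards = [0.0] * len(rewards)

    def score(i):
        redundancy = max(
            (cosine(embeddings[i], embeddings[j]) for j in selected),
            default=0.0,  # nothing selected yet, so no redundancy penalty
        )
        return lam * rewards[i] - (1 - lam) * redundancy

    while remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        new_rewards[best] = score(best)
        selected.append(best)
        remaining.remove(best)
    return new_rewards


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Three equally rewarded completions; the first two are semantically identical.
rewards = [1.0, 1.0, 1.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reweighted = mmr_reweight(rewards, embeddings, lam=0.5)
advantages = group_advantages(reweighted)
```

In this toy group, the duplicate of an already-selected completion ends up with a lower reweighted reward than the two distinct completions, so it receives a negative group-relative advantage even though its raw reward was the same. Without reweighting, all three rewards would be equal and the group would carry no learning signal at all.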


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0
EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
cs.LG · 2026-05 · unverdicted · novelty 5.0
Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.