MMR-GRPO: Accelerating GRPO-style training through diversity-aware reward reweighting. arXiv preprint arXiv:2601.09085, 2026

Kangda Wei, Ruihong Huang · 2026 · arXiv 2601.09085

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.

citing papers explorer


Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

cs.LG · 2026-05-17 · unverdicted · none · ref 28

EDAS modulates advantage signals in RLVR to penalize repeated errors more and rare errors less, yielding consistent gains on math benchmarks when added to existing methods.

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

cs.LG · 2026-05-20 · unverdicted · none · ref 60

Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.
