Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
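The summary above does not spell out the mechanism, so the following is only a minimal sketch of one well-known property of mean centering that may underlie it: when every reward in a group is identical, the centered advantages are all zero and the group contributes no policy-gradient signal. The function names `group_mean_advantages` and `sign_advantages` are hypothetical, and the standard-deviation normalization often paired with group centering is omitted here as a simplifying assumption.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """Center each reward on its group mean (assumed form: A_i = r_i - mean(r))."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def sign_advantages(rewards: List[float]) -> List[float]:
    """Fixed sign advantage from the summary: A = 2r - 1, so r=1 -> +1 and r=0 -> -1."""
    return [2 * r - 1 for r in rewards]


if __name__ == "__main__":
    # Illustrative binary-reward groups of size 4 (made-up inputs, not data from the paper).
    groups = {
        "all wrong": [0, 0, 0, 0],
        "all correct": [1, 1, 1, 1],
        "mixed": [1, 0, 0, 0],
    }
    for name, rewards in groups.items():
        print(name, group_mean_advantages(rewards), sign_advantages(rewards))
```

In this sketch, the uniform groups yield zero centered advantages, while the sign advantage stays nonzero for every sample regardless of how the rest of the group scored.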
Contrastive pair
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works