arxiv: 2606.29238 · v1 · pith:YEARNEYOnew · submitted 2026-06-28 · 💻 cs.LG

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

Pith reviewed 2026-06-30 08:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords GRPO · policy gradient · credit assignment · gradient sparsity · rank collapse · reinforcement learning · PPO

The pith

Under output-only rewards GRPO assigns identical advantages to every token in a rollout, collapsing the gradient matrix to effective rank approximately 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group Relative Policy Optimization replaces PPO's learned critic with a baseline equal to the mean reward across grouped rollouts. Deriving GRPO from the policy gradient theorem reveals that an end-of-sequence reward forces every token inside one rollout to receive the exact same advantage value. This reduces token-level credit assignment to a single scalar per sequence. The authors prove the construction creates an intrinsic rank-2 structure in the gradient matrix because of the zero-sum constraint on the advantages. SVD analysis on Nemotron-4B trained on GSM8K confirms the effective rank stays near 2 for group sizes 2, 4, and 8 while sparsity grows during training.

Core claim

GRPO's group-mean baseline produces identical advantages for all tokens whenever reward is supplied only at rollout completion. The resulting zero-sum constraint on advantages gives the policy gradient matrix an intrinsic rank-2 structure. The authors also prove that gradient sparsity intensifies over training, and SVD measurements on a 4B model show the effective rank staying near 2 for every tested group size R ∈ {2, 4, 8}.
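
A worked special case may help make the zero-sum constraint concrete. It is a sketch under assumptions not stated on this page: rewards are binary (r_i ∈ {0, 1}), the advantage is the unnormalized group-mean difference, and ψ_i denotes the summed score function of rollout i. It is not a reproduction of the paper's proof.

```latex
% Assumed notation (illustrative, not quoted from the paper):
%   r_i \in \{0,1\}   outcome reward of rollout i, for i = 1, ..., R
%   \psi_i = \sum_t \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})
%   k = number of rollouts with r_i = 1, assumed to satisfy 0 < k < R
%   \bar{\psi}_{+}, \bar{\psi}_{-} = mean of \psi_i over rollouts with r_i = 1, r_i = 0
\begin{align*}
A_i &= r_i - \frac{1}{R}\sum_{j=1}^{R} r_j,
      \qquad \sum_{i=1}^{R} A_i = 0, \\
\hat{g} &= \frac{1}{R}\sum_{i=1}^{R} A_i\,\psi_i
         = \frac{1}{R}\left[\left(1 - \frac{k}{R}\right)\sum_{i:\,r_i=1}\psi_i
           \;-\; \frac{k}{R}\sum_{i:\,r_i=0}\psi_i\right] \\
        &= \frac{k\,(R-k)}{R^{2}}\left(\bar{\psi}_{+} - \bar{\psi}_{-}\right).
\end{align*}
```

Under these assumptions the per-prompt update points along the single direction $\bar{\psi}_{+} - \bar{\psi}_{-}$, the same advantage direction named in the Figure 3 caption, and it vanishes when k = 0 or k = R because every advantage is then zero.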

What carries the argument

The group-mean advantage estimator that subtracts the average reward of R rollouts from each rollout's reward, assigning the identical scalar to every token within that rollout.
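
The estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the example rewards, and the token counts are hypothetical, and the sketch uses only the mean baseline described on this page (some GRPO implementations additionally divide by the group standard deviation).

```python
from statistics import fmean


def group_relative_advantages(rewards):
    """Subtract the group-mean reward from each rollout's reward."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Copy each rollout's scalar advantage onto every one of its tokens."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Hypothetical group of R = 4 rollouts with 0/1 outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # illustrative token counts per rollout

advantages = group_relative_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages, lengths)

print(advantages)           # [0.5, -0.5, -0.5, 0.5]
print(token_advantages[1])  # [-0.5, -0.5, -0.5, -0.5, -0.5]
print(sum(advantages))      # 0.0
```

Every token in a rollout carries the same value, and the advantages in a group sum to zero, which are the two properties the argument relies on.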

If this is right

  • Gradient sparsity intensifies as training proceeds.
  • Effective rank of the gradient matrix stays near 2 for any group size in {2, 4, 8}.
  • The group-mean baseline is optimal only under conditions derived from the zero-sum advantage constraint.
  • Multi-step reasoning performance is limited by the inability to assign differentiated credit to individual tokens.
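
One mechanism consistent with the first bullet can be illustrated numerically. The sketch below is an assumption, not the paper's proof: it supposes 0/1 rewards and independent rollouts that each succeed with probability `p`. A group whose rollouts all receive the same reward has all-zero advantages and so contributes no gradient. The function name is hypothetical.

```python
def uniform_group_probability(p, group_size):
    """Probability that all rollouts in a group share the same 0/1 reward,
    assuming independent rollouts that each succeed with probability p."""
    return p ** group_size + (1.0 - p) ** group_size


# Fraction of groups expected to contribute zero gradient at a few accuracy levels.
for p in (0.5, 0.7, 0.9):
    row = {R: round(uniform_group_probability(p, R), 4) for R in (2, 4, 8)}
    print(p, row)
```

In this toy model the share of zero-gradient groups grows as accuracy moves from 0.5 toward 1, and it shrinks as the group size increases at a fixed accuracy.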

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any RL method that relies solely on end-of-sequence rewards may encounter similar rank collapse unless auxiliary per-token signals are added.
  • Alternative baseline estimators or learned critics could restore higher gradient rank by breaking the identical-advantage property.
  • The rank-2 structure may generalize to other group-based advantage estimators that enforce zero-sum constraints within each group.

Load-bearing premise

Rewards arrive only after each complete rollout finishes and supply no per-token or intermediate signals that could differentiate contributions inside a sequence.

What would settle it

Compute the singular-value spectrum of the GRPO gradient matrix on a task that supplies per-token reward labels and check whether the effective rank rises well above 2.
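
A check of this kind needs a concrete definition of effective rank, which this page does not give. The sketch below shows two common candidates (an entropy-based measure and a cumulative-energy cutoff) together with the top-1 energy fraction that the Figure 2 caption reports. The definitions, the 0.99 threshold, and the example spectrum are assumptions chosen for illustration, not the paper's settings.

```python
import math


def energy_fractions(singular_values):
    """Normalize squared singular values so they sum to 1."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    if total == 0.0:
        raise ValueError("all singular values are zero")
    return [e / total for e in energies]


def top1_energy_fraction(singular_values):
    """sigma_1^2 / sum_i sigma_i^2 for the largest singular value."""
    return max(energy_fractions(singular_values))


def entropy_effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total == 0.0:
        raise ValueError("all singular values are zero")
    probs = [s / total for s in singular_values if s > 0.0]
    return math.exp(-sum(p * math.log(p) for p in probs))


def cumulative_energy_rank(singular_values, threshold=0.99):
    """Smallest k whose top-k squared singular values reach the energy threshold."""
    running = 0.0
    fractions = sorted(energy_fractions(singular_values), reverse=True)
    for k, fraction in enumerate(fractions, start=1):
        running += fraction
        if running >= threshold:
            return k
    return len(fractions)


# Hypothetical spectrum with two dominant directions. In practice the values
# could come from numpy.linalg.svd(G, compute_uv=False) on a gradient matrix G.
spectrum = [10.0, 10.0, 0.1, 0.1]
print(entropy_effective_rank(spectrum))
print(cumulative_energy_rank(spectrum))
print(top1_energy_fraction(spectrum))
```

The two definitions can disagree on the same spectrum, so a reproduction of the rank-2 finding should state which one it uses.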

Figures

Figures reproduced from arXiv: 2606.29238 by Amritansh Mishra, Berkcan Kapusuzoglu, Supriyo Chakraborty.

Figure 1: Effective rank of the GRPO gradient matrix across training steps for R ∈ {2, 4, 8}. Despite the maximum possible rank being higher, effective rank remains ≈ 2 throughout training.

Figure 2: Top-1 singular value fraction ($\sigma_1^2 / \sum_i \sigma_i^2$) across training. R = 4 shows the highest concentration, while R = 8 diffuses energy across more components. This non-monotonic pattern is explained by the interplay between advantage concentration and trajectory score diversity.

Figure 3: Training accuracy on GSM8K for different group sizes. Despite rank-2 gradient structure for all R, larger groups still improve learning speed by providing better estimates of the advantage direction $\bar{\psi}_{+} - \bar{\psi}_{-}$.

Original abstract

Group Relative Policy Optimization (GRPO) eliminates the learned critic in PPO by using the mean reward of grouped rollouts as a baseline. We provide a rigorous derivation of GRPO from first principles of the policy gradient theorem, revealing a fundamental credit assignment failure: under output-only reward, every token in a rollout receives identical advantage, collapsing token-level credit to a single scalar. We prove this induces gradient sparsity that intensifies over training, and demonstrate empirically via SVD analysis of GRPO gradients on Nemotron-4B/GSM8K that the gradient matrix has effective rank $\approx$ 2 regardless of group size $R \in \{2, 4, 8\}$. We formalize this as an intrinsic rank-2 structure arising from the zero-sum constraint on advantages and derive conditions under which GRPO's baseline is optimal. Our results characterize when GRPO's simplicity is theoretically justified and identify the credit assignment bottleneck as the key limitation for multi-step reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript derives GRPO from the policy gradient theorem under the output-only reward regime, showing that identical per-rollout advantages collapse token-level credit assignment to a scalar. It proves this produces gradient sparsity that intensifies during training and an intrinsic rank-2 structure in the gradient matrix arising from the zero-sum constraint on grouped advantages. The claims are supported by SVD analysis of GRPO gradients on Nemotron-4B/GSM8K demonstrating effective rank approximately 2 independent of group size R in {2,4,8}, together with conditions under which the GRPO baseline is optimal.

Significance. If the derivation and empirical signature hold, the work supplies a precise theoretical account of GRPO's credit-assignment bottleneck in multi-step reasoning, clarifying when its critic-free simplicity is justified. The first-principles derivation from the standard policy-gradient theorem, the algebraic rank-2 result, and the concrete SVD experiment constitute clear strengths that advance understanding of relative-baseline methods.

minor comments (2)
  1. The experimental section would benefit from an explicit statement of how effective rank is computed (e.g., the precise singular-value threshold or cumulative-energy cutoff) to allow direct reproduction of the rank-2 finding.
  2. Notation for the advantage estimator and the grouped baseline could be introduced earlier with a short table contrasting it to the standard advantage in PPO.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive recommendation to accept. The summary accurately captures the core contributions regarding the derivation of GRPO, the credit-assignment collapse under output-only rewards, the induced gradient sparsity, and the intrinsic rank-2 structure.

Circularity Check

0 steps flagged

No significant circularity

The derivation begins from the standard policy gradient theorem applied under the explicitly stated output-only reward regime. The credit-assignment collapse and rank-2 gradient structure follow directly from the algebraic fact that grouped advantages sum to zero; neither step reduces to a fitted parameter inside the paper nor to a self-citation chain. The SVD analysis on Nemotron-4B/GSM8K is presented as empirical corroboration of the algebraic prediction rather than a self-referential result. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the policy gradient theorem (standard) and the modeling choice that rewards arrive only at sequence end. No free parameters are fitted inside the derivation; no new physical or mathematical entities are introduced.

axioms (1)
  • Policy gradient theorem (standard math)
    Invoked as the starting point for deriving the GRPO update rule and its consequences.
