Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

Fathinah Asma Izzati; Mohamed El Amine Seddik; Omar El Mansouri; Salem Lahlou

arxiv: 2510.18924 · v3 · pith:3L3U62UNnew · submitted 2025-10-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 20:24 UTC · model grok-4.3

keywords: GRPO · RLHF · noise correction · Bernoulli noise · unbiased gradients · reward model · policy optimization · label noise

The pith

Modeling reward noise as Bernoulli flips and correcting for them after probability estimation produces provably unbiased gradients in group relative policy optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard group-based policy optimization methods already reduce the impact of individual reward errors, but adding an explicit correction step after estimating how often rewards flip makes the gradient estimates unbiased. This correction is applied to the learning signal derived from noisy human or verifiable rewards, which are common when aligning large language models. A sympathetic reader would care because RLHF and RLVR training breaks down under inconsistent rewards, and an unbiased version could produce more reliable models without changing the underlying optimization loop. The approach draws from label-noise techniques in supervised learning and tests the idea on math and code benchmarks, reporting accuracy lifts when reward models are imperfect.

Core claim

By treating reward corruption as a Bernoulli noise process whose flip probabilities can be estimated from data, the noise-corrected GRPO and Dr.GRPO variants remove the bias that would otherwise appear in the policy gradient, yielding unbiased estimates while preserving the robustness that group-relative comparisons already provide against per-sample noise.

What carries the argument

Noise correction applied after estimating reward flip probabilities, which debiases the advantage or reward signal used inside the GRPO update rule.
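As a minimal sketch of what such a correction could look like, the snippet below applies the Natarajan-style inversion quoted later on this page to a binary reward. The function names and the flip-rate convention (rho_plus as the chance a clean 0 is observed as 1, rho_minus as the chance a clean 1 is observed as 0) are assumptions made for illustration, and the rates are treated as known rather than estimated. It is not the authors' implementation.

```python
def correct_reward(noisy_reward, rho_plus, rho_minus):
    """Invert Bernoulli flip noise on a binary reward (hypothetical helper).

    Assumed convention: rho_plus = P(observed 1 | clean 0),
    rho_minus = P(observed 0 | clean 1).
    """
    denom = 1.0 - rho_plus - rho_minus
    if denom <= 0.0:
        raise ValueError("flip rates must satisfy rho_plus + rho_minus < 1")
    return (noisy_reward - rho_plus) / denom


def expected_corrected(clean_reward, rho_plus, rho_minus):
    """Exact expectation of the corrected reward given the clean reward."""
    # Probability that the observed (noisy) reward equals 1.
    p_one = (1.0 - rho_minus) if clean_reward == 1 else rho_plus
    return (p_one * correct_reward(1, rho_plus, rho_minus)
            + (1.0 - p_one) * correct_reward(0, rho_plus, rho_minus))


if __name__ == "__main__":
    for clean in (0, 1):
        print(clean, expected_corrected(clean, rho_plus=0.1, rho_minus=0.2))
```

With known flip rates, the expectation of the corrected reward equals the clean reward in both cases, which is the property the unbiasedness claim relies on.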

If this is right

Group comparisons already dampen individual reward flips, and the explicit correction step strengthens this effect to produce unbiased gradients.
The same correction can be added on top of existing reward models without retraining them, delivering accuracy gains on math and code tasks under realistic noise levels.
Theoretical analysis shows the method works for any group-based relative policy optimization that uses noisy scalar rewards.
The framework directly transfers label-noise correction ideas from supervised learning into the RLHF setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same Bernoulli correction might extend to other policy optimization algorithms that rely on relative advantages, such as variants of PPO that use group sampling.
If flip probabilities can be estimated online rather than from a fixed dataset, the method could adapt during long training runs where noise statistics drift.
One testable extension is to apply the correction only to the most uncertain rewards instead of every sample, potentially reducing variance introduced by the correction itself.

Load-bearing premise

Reward corruption behaves like independent Bernoulli flips whose probabilities can be estimated accurately enough from the observed data to subtract the bias without adding new error.

What would settle it

Run the corrected versus uncorrected GRPO on a controlled task where the true flip probability is known and measure whether the gradient bias (estimated via a held-out clean reward) drops to zero only in the corrected version.
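A simplified proxy for this experiment can be sketched in a few lines. The simulation below checks only the reward signal, not a full policy gradient: it draws clean binary rewards, flips them at known rates, and compares the mean noisy and mean corrected reward against the clean mean. All names, the success rate, and the flip rates are hypothetical values chosen for illustration.

```python
import random


def flip(clean, rho_plus, rho_minus, rng):
    """Corrupt a clean binary reward with asymmetric Bernoulli flips."""
    if clean == 1:
        return 0 if rng.random() < rho_minus else 1
    return 1 if rng.random() < rho_plus else 0


def run_check(n=200_000, success_rate=0.7, rho_plus=0.1, rho_minus=0.2, seed=0):
    """Return (clean mean, noisy mean, corrected mean) over n simulated rewards."""
    rng = random.Random(seed)
    denom = 1.0 - rho_plus - rho_minus
    clean_sum = noisy_sum = corrected_sum = 0.0
    for _ in range(n):
        clean = 1 if rng.random() < success_rate else 0
        noisy = flip(clean, rho_plus, rho_minus, rng)
        clean_sum += clean
        noisy_sum += noisy
        # Correction uses the TRUE flip rates, which are known in this setup.
        corrected_sum += (noisy - rho_plus) / denom
    return clean_sum / n, noisy_sum / n, corrected_sum / n


if __name__ == "__main__":
    clean_mean, noisy_mean, corrected_mean = run_check()
    print(f"clean={clean_mean:.4f} noisy={noisy_mean:.4f} corrected={corrected_mean:.4f}")
```

Under these assumed settings the noisy mean should sit well below the clean mean while the corrected mean tracks it closely. Repeating the run with flip rates estimated from the same noisy rewards, rather than the true ones, would probe the plug-in concern raised in the referee report.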

read the original abstract

Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a noise-robust Group Relative Policy Optimization (GRPO) framework and its variant Dr.GRPO. It models reward corruption as Bernoulli flips, estimates flip probabilities from observed data, applies a correction to the learning signal, and claims this yields provably unbiased gradient estimates. Theoretical analysis asserts that group-based methods inherently mitigate individual noise and that the correction amplifies robustness. Empirically, the method reports gains of up to 6.7 percentage points on math tasks and 1.5 on code tasks under noisy reward-model conditions.

Significance. If the unbiasedness result holds after substituting the estimated flip probability into the correction formula, the work would meaningfully connect label-noise correction techniques from supervised learning to group-relative RLHF/RLVR. It would supply both a practical algorithm for noisy real-world deployment and theoretical insight into why group methods are already somewhat robust, with concrete performance lifts on math and code benchmarks.

major comments (2)

[Theoretical analysis] Theoretical analysis section: the claim of 'provably unbiased gradient estimates' is central, yet no derivation is supplied showing that the Bernoulli-inversion correction remains unbiased once the true flip probability p is replaced by a data-dependent estimator p̂ computed from the same group-relative rewards. The expectation of the corrected gradient may acquire a nonzero bias term from Cov(p̂, reward signal) that the inversion formula does not cancel.
[Method] Method section: the description of 'applying noise correction after estimating reward flip probabilities' does not specify the estimator, its sample size, or any independence assumption between the estimation step and the subsequent policy-gradient computation. Without this, the 'provably unbiased' guarantee cannot be verified and the circularity concern (estimation from the identical noisy signals) remains unaddressed.
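The gap both major comments point at can be made concrete with a short worked derivation. It is a sketch under assumed conventions, not the paper's proof: rewards are binary, the two flip rates follow the Natarajan-style correction quoted elsewhere on this page (the referee writes a single p), and the hatted rates denote any estimator built from the same noisy rewards.

```latex
% Assumed convention (not stated in the source): r \in \{0,1\},
% \rho_+ = P(\tilde r = 1 \mid r = 0), \rho_- = P(\tilde r = 0 \mid r = 1),
% and \rho_+ + \rho_- < 1.
\begin{align*}
\mathbb{E}[\tilde r \mid r]
  &= r\,(1-\rho_-) + (1-r)\,\rho_+
   = \rho_+ + (1-\rho_+-\rho_-)\,r, \\
\hat r = \frac{\tilde r - \rho_+}{1-\rho_+-\rho_-}
  &\quad\Longrightarrow\quad
  \mathbb{E}[\hat r \mid r]
   = \frac{\mathbb{E}[\tilde r \mid r] - \rho_+}{1-\rho_+-\rho_-} = r, \\
\hat r_{\text{plug-in}} = \frac{\tilde r - \hat\rho_+}{1-\hat\rho_+-\hat\rho_-},
  &\qquad
  \mathbb{E}[\hat r_{\text{plug-in}} \mid r] \neq
  \frac{\mathbb{E}[\tilde r \mid r] - \mathbb{E}[\hat\rho_+]}
       {1-\mathbb{E}[\hat\rho_+]-\mathbb{E}[\hat\rho_-]}
  \ \text{in general.}
\end{align*}
```

The first two lines hold exactly when the flip rates are known. The third line is where the guarantee can fail: the expectation of a ratio does not factor when the estimated rates depend on the same noisy rewards.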

minor comments (1)

[Abstract] Abstract: the statement that 'group methods already mitigate noise' is asserted without quantifying how this mitigation interacts with the subsequent plug-in estimation of flip probabilities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the theoretical derivation and methodological clarity while preserving the core contributions.

Point-by-point responses

Referee: [Theoretical analysis] Theoretical analysis section: the claim of 'provably unbiased gradient estimates' is central, yet no derivation is supplied showing that the Bernoulli-inversion correction remains unbiased once the true flip probability p is replaced by a data-dependent estimator p̂ computed from the same group-relative rewards. The expectation of the corrected gradient may acquire a nonzero bias term from Cov(p̂, reward signal) that the inversion formula does not cancel.

Authors: We acknowledge that the current theoretical analysis primarily establishes unbiasedness for the noise-correction operator when the true flip probability p is known. The manuscript does not supply an explicit derivation for the plug-in estimator p̂ that accounts for potential covariance between the estimate and the reward signal. Under the group-relative formulation, the estimator p̂ is formed from aggregated statistics across multiple responses per prompt, which reduces dependence on any single reward observation; however, a rigorous expansion of the expectation that shows the bias term vanishes (or is bounded) under standard concentration assumptions is indeed missing. We will add this derivation in the revised theoretical section, including a lemma that bounds the additional bias term as a function of group size and sample concentration.
revision: yes

Referee: [Method] Method section: the description of 'applying noise correction after estimating reward flip probabilities' does not specify the estimator, its sample size, or any independence assumption between the estimation step and the subsequent policy-gradient computation. Without this, the 'provably unbiased' guarantee cannot be verified and the circularity concern (estimation from the identical noisy signals) remains unaddressed.

Authors: We agree that the method description is insufficiently precise. The flip-probability estimator is the empirical fraction of sign flips between pairwise group-relative rewards within each prompt's response group (group size matching the GRPO baseline, typically 4–8). Estimation and gradient computation share the same noisy rewards, so strict independence does not hold; instead, the group-relative normalization already averages individual noise before correction is applied. We will revise the method section to explicitly define the estimator, state the group size used, and clarify that the unbiasedness claim is asymptotic in group size under the Bernoulli model, with a short discussion of the finite-sample bias introduced by shared data.
revision: yes

Circularity Check

1 step flagged

Unbiased gradient claim depends on plug-in of data-estimated flip probabilities

specific steps

fitted input called prediction [Abstract]
"Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates."

The paper estimates the Bernoulli flip probabilities from the same noisy reward data used in the GRPO objective, then presents the corrected gradient as provably unbiased. The unbiasedness property holds by construction only for the known true p; with the plug-in estimator p̂, the expectation of the implemented gradient depends on the fit, so the 'provable' label rests on the estimation procedure itself rather than on an external guarantee.

full rationale

The paper's core derivation asserts that modeling reward noise as Bernoulli flips, estimating the flip probabilities from observed rewards, and applying a correction yields provably unbiased gradients. However, the theoretical unbiasedness typically holds only when the true flip probability p is known; replacing it with a data-dependent estimator p̂ computed from the same group-relative reward signals introduces potential covariance bias that the inversion formula does not automatically cancel. This reduces the 'provably unbiased' result to a fitted adjustment rather than an independent derivation from first principles, matching the fitted-input-called-prediction pattern. No load-bearing self-citation or ansatz smuggling is evident from the provided text, but the central claim's guarantee is not self-contained against the estimation step.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on modeling reward noise as independent Bernoulli flips whose probabilities can be estimated separately from the policy gradient; this introduces one domain assumption and one fitted quantity with no independent evidence supplied in the abstract.

free parameters (1)

reward flip probability
Estimated from data to perform the noise correction; directly used to debias the gradient signal.

axioms (1)

domain assumption: Reward corruption follows a Bernoulli noise process
Invoked to justify the correction formula that produces unbiased gradients.

invented entities (1)

noise-corrected GRPO / Dr.GRPO (no independent evidence)
purpose: Debias learning signal under noisy rewards
New algorithmic framework introduced in the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We explicitly model reward corruption as Bernoulli noise... apply noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem.

Natarajan correction: $\hat{r}_i \leftarrow (\tilde{r} - \rho_+) / (1 - \rho_+ - \rho_-)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.