Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL

Boyuan Pan; Chuyi Tan; Jiayi Shi; Ji Zhang; Kan Li; Peiwen Yuan; Shaoxiong Feng; Xinglin Wang; Yao Hu; Yiwei Li

arxiv: 2510.08977 · v2 · pith:3JQDU2X5new · submitted 2025-10-10 · 💻 cs.LG · cs.CL

Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

keywords: bias, coupling, learning, over-reward, reinforcement, reward, rewards, self-rewarding

Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via self-rewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (rho_noise), policy-reward coupling (rho_selfbias), and over-/under-reward skew (rho_symbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 6.2% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.
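The abstract does not spell out how RLER combines its reward signals, so the following is only a minimal illustrative sketch of the general idea of blending a self-consistency reward with votes from other models, not the paper's algorithm. The function names, the fixed weight `alpha` (the abstract describes the interpolation as adaptive), the pooled majority vote, and the disagreement score are all assumptions made for illustration.

```python
from collections import Counter


def majority(answers):
    """Return (most common answer, its vote share)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def ensembled_rewards(policy_answers, peer_answers, alpha=0.5):
    """Blend a self-consistency reward with an ensemble-vote reward.

    policy_answers: final answers from the policy's own rollouts.
    peer_answers:   final answers pooled from other (hypothetical) models.
    alpha:          fixed weight on the ensemble signal (an assumption;
                    the abstract describes the interpolation as adaptive).
    """
    # Pseudo-label from the policy alone: the confidence-coupled signal.
    self_label, _ = majority(policy_answers)
    # Pseudo-label from the pooled votes of policy and peers.
    ens_label, agreement = majority(policy_answers + peer_answers)
    rewards = [
        (1 - alpha) * float(a == self_label) + alpha * float(a == ens_label)
        for a in policy_answers
    ]
    # Second value is a disagreement score: 1 minus the top vote share.
    return rewards, 1.0 - agreement


# Toy case: the policy is confident in "42", but the peers mostly say "17".
policy = ["42", "42", "42", "17"]
peers = ["17", "17", "17", "17", "42"]
rewards, disagreement = ensembled_rewards(policy, peers, alpha=0.5)
self_only, _ = ensembled_rewards(policy, peers, alpha=0.0)
```

With `alpha=0.0` the policy rewards its own majority answer outright, which is the self-confirming case the abstract warns about; with `alpha=0.5` the peers' dissent cancels that advantage. In this sketch the disagreement score could be thresholded to decide which rollouts to train on, though the actual selection rule in the paper may differ.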

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
cs.LG · 2026-03 · unverdicted · novelty 7.0

SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.