Rate or Fate? RLVεR: Reinforcement Learning with Verifiable Noisy Rewards
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
fields: cs.LG 2
years: 2026 2
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
citing papers explorer
- An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
  RLVR for LLMs tolerates up to 15% verifier noise with validation accuracy within 2 points of clean baselines across three model families and two task domains.
- Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
  Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.