Rate or Fate? RLVεR: Reinforcement Learning with Verifiable Noisy Rewards

Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad · 2026 · arXiv 2601.04411

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

An Imperfect Verifier is Good Enough: Learning with Noisy Rewards

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

RLVR for LLMs tolerates up to 15% verifier noise with validation accuracy within 2 points of clean baselines across three model families and two task domains.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

