TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
3 Pith papers cite this work.
citing papers explorer
- Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
  Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
- Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
  Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.
- High-Dimensional Statistics: Reflections on Progress and Open Problems