Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
Self-rewarding correction for mathematical reasoning
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.
citing papers explorer
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
-
rePIRL: Learn PRM with Inverse RL for LLM Reasoning
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
-
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
-
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
-
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning