Self-rewarding correction for mathematical reasoning

Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang · 2025 · arXiv 2502.19613

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

Can LLMs Learn to Reason Robustly under Noisy Supervision?

cs.LG · 2026-04-05 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

cs.CL · 2025-10-09 · unverdicted · novelty 6.0

LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

cs.LG · 2025-09-03 · unverdicted · novelty 6.0

PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

cs.SE · 2026-04-17 · unverdicted · novelty 5.0 · 2 refs

ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

cs.LG · 2026-04-19

citing papers explorer

Showing 6 of 6 citing papers.

Can LLMs Learn to Reason Robustly under Noisy Supervision? cs.LG · 2026-04-05 · conditional · none · ref 27
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning cs.LG · 2026-02-08 · unverdicted · none · ref 32
rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning? cs.CL · 2025-10-09 · unverdicted · none · ref 18
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training cs.LG · 2025-09-03 · unverdicted · none · ref 30
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization cs.SE · 2026-04-17 · unverdicted · none · ref 15 · 2 links
ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning cs.LG · 2026-04-19 · unreviewed · ref 24

Self-rewarding correction for mathematical reasoning

fields

years

verdicts

representative citing papers

citing papers explorer