Recognition: no theorem link
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3
The pith
Online Label Refinement corrects noisy labels in RLVR by tracking rollout pass rate slopes and consistency to enable self-correction during reasoning model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OLR progressively corrects potentially noisy labels with majority-voted answers when a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates hold, enabling gradual self-correction as the policy improves and delivering average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations across noise ratios from 0.1 to 0.9.
What carries the argument
Online Label Refinement (OLR), a progressive correction step that replaces labels using majority-voted answers only when rollout pass rate shows a positive slope and historical consistency is stable.
If this is right
- OLR improves robustness under both inactive and active noisy-label settings across all tested noise ratios.
- The method produces consistent gains on six in-distribution mathematical reasoning benchmarks including AIME, AMC, MATH-500, Minerva, and Olympiad.
- Gains extend to three out-of-distribution tasks: ARC-c, GPQA-diamond, and MMLU-pro.
- Early Correctness Coherence allows corrections to begin safely before noisy samples lag in later training stages.
Where Pith is reading between the lines
- The rollout-based correction rule may transfer to other RLVR variants or to non-math reasoning domains where label noise arises from scarce experts.
- Combining OLR with existing noise-robust RL techniques could further reduce the need for perfect supervision in large-scale reasoning training.
- The two-condition check on slope and consistency offers a testable template for label cleaning in any rollout-driven training loop.
Load-bearing premise
A positive slope in the majority answer's rollout pass rate combined with stable historical consistency reliably indicates the correct label and can be used for safe correction without introducing new errors.
What would settle it
An experiment that applies OLR to a controlled dataset where majority-voted answers are known to be wrong and measures whether final model accuracy falls below the no-refinement baseline.
read the original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in RLVR for reasoning LLMs, noisy labels can be either inactive (reducing efficiency) or active (reinforced by the policy), and that an observed Early Correctness Coherence phenomenon motivates Online Label Refinement (OLR). OLR progressively corrects labels to the majority-voted answer when the rollout pass rate of that answer shows a positive slope and historical consistency is stable. Across noise ratios 0.1–0.9, OLR yields average gains of 3.6–3.9% on six in-distribution math benchmarks and 3.3–4.6% on three OOD tasks under both inactive and active noise.
Significance. If the OLR triggers reliably select correct labels rather than consistently generated incorrect ones, the work would be significant: it supplies the first systematic treatment of noisy supervision in RLVR, demonstrates concrete robustness gains on both ID and OOD reasoning benchmarks, and offers a practical, rollout-driven correction mechanism that exploits training dynamics without requiring external clean data.
major comments (2)
- [§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.
- [§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.
minor comments (2)
- [Abstract] Abstract and §4: report the number of independent runs and standard deviations or confidence intervals alongside the average gains; current presentation of “consistent gains” is difficult to interpret without variance information.
- [§3.1] §3.1: define “rollout pass rate” and “historical consistency” with explicit formulas or pseudocode so that the OLR update rule can be re-implemented without ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work analyzing noisy supervision in RLVR and proposing Online Label Refinement. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript's rigor and reproducibility without altering its core claims.
read point-by-point responses
-
Referee: [§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.
Authors: We appreciate the referee highlighting this potential failure mode for high-ratio active noise. The Early Correctness Coherence phenomenon we identify shows that correct labels achieve rising rollout success earlier than noisy ones, so the slope and consistency triggers activate preferentially for the correct majority answer before policy overfitting occurs. To provide the requested explicit verification, we have added a new analysis subsection in §3 (with supporting figures) that reports the fraction of OLR corrections aligning with ground-truth labels across all noise ratios, including 0.7–0.9 active noise; this shows that the large majority of refinements are to correct labels rather than reinforcing errors. revision: yes
-
Referee: [§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.
Authors: We agree that these omissions limit reproducibility assessment. We have revised §3 and the experimental sections to state the exact thresholds (slope threshold of 0.02 and consistency threshold of 0.75 over a 3-update window) and moved their full definition to Appendix B. We now include paired statistical significance tests (bootstrap resampling over 5 seeds) confirming p < 0.05 for the reported average gains on both ID and OOD benchmarks. We have also added an ablation study on threshold sensitivity performed on a held-out validation split (distinct from test sets), showing stable performance within small perturbations and confirming that thresholds were not tuned on test data. revision: yes
Circularity Check
No significant circularity; OLR is an empirically motivated heuristic
full rationale
The paper identifies an Early Correctness Coherence phenomenon from training dynamics and defines OLR correction triggers directly from observable rollout statistics (positive slope in majority-answer pass rate plus historical consistency). These triggers are not fitted to or defined in terms of the final benchmark gains; the claimed robustness improvements are measured on held-out evaluation sets after applying the rule. No equations, self-citations, or uniqueness theorems reduce the reported 3.6–4.6 % gains to the method's own inputs by construction. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- slope and consistency thresholds for label update
axioms (1)
- domain assumption Majority-voted rollout answers become increasingly reliable as policy improves
Reference graph
Works this paper leans on
-
[1]
Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134,
-
[2]
Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning
Hao Chen, Ran Tao, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Bhiksha Raj, and Marios Savvides. Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning. arXiv preprint arXiv:2301.10921,
-
[3]
Mitigating memorization of noisy labels via regularization between representations
Hao Cheng, Zhaowei Zhu, Xing Sun, and Yang Liu. Mitigating memorization of noisy labels via regularization between representations. arXiv preprint arXiv:2110.09022,
-
[4]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Explaining and Harnessing Adversarial Examples
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Towards understanding deep learning from noisy labels with small-loss criterion
Xian-Jin Gui, Wei Wang, and Zhang-Hao Tian. Towards understanding deep learning from noisy labels with small-loss criterion. arXiv preprint arXiv:2106.09291,
-
[8]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
https://arxiv.org/abs/2503.24290. Jinchi Huang, Lie Qu, Rongfei Jia, and Binqiang Zhao. O2u-net: A simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3326–3334,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Junnan Li, Richard Socher, and Steven CH Hoi
Hugging Face repository, 13:9. Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394,
-
[13]
Noisy label processing for classification: A survey
Mengting Li and Chuang Zhu. Noisy label processing for classification: A survey. arXiv preprint arXiv:2404.04159,
-
[14]
Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, and Ivan Oseledets. Confidence is all you need: Few-shot rl fine-tuning of language models. arXiv preprint arXiv:2506.06395, 2025a. Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. Selective-supervised contrastive learning with noisy labels. In Proceedings of the IEEE/CVF conference on c...
-
[15]
Limr: Less is more for rl scaling.arXiv preprint arXiv:2502.11886, 2025
Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling. arXiv preprint arXiv:2502.11886, 2025b. Yifan Li, Hu Han, Shiguang Shan, and Xilin Chen. Disc: Learning from noisy labels via dynamic instance-specific selection and correction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24070–24079,
-
[16]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification.arXiv preprint arXiv:2503.06639,
-
[18]
Self: Learning to filter noisy labels with self-ensembling
Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. Self: Learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842,
-
[19]
Making deep neural networks robust to label noise: A loss correction approach
12 Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1944–1952,
work page 1944
-
[20]
Regularizing Neural Networks by Penalizing Confident Output Distributions
Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548,
-
[21]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
https: //arxiv.org/abs/2402.03300. Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International conference on machine learning, pages 5739–5748. PMLR,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Dropout: a simple way to prevent neural networks from overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958,
work page 1929
-
[23]
Training Convolutional Networks with Noisy Labels
Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080,
work page Pith review arXiv 2080
-
[24]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024a. Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge
Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv preprint arXiv:2407.19594,
-
[27]
Self-rewarding correction for mathematical reasoning
Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning. arXiv preprint arXiv:2502.19613,
-
[28]
arXiv preprint arXiv:2504.14945 , year =
Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025a. Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, et al. Verifybench: Benchmarking reference-based reward systems ...
-
[29]
Deep learning from noisy image labels with quality embedding
Jiangchao Yao, Jiajie Wang, Ivor W Tsang, Ya Zhang, Jun Sun, Chengqi Zhang, and Rui Zhang. Deep learning from noisy image labels with quality embedding. IEEE Transactions on Image Processing, 28(4):1909–1922,
work page 1909
-
[30]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Self-Rewarding Language Models
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 3,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Exgrpo: Learning to reason from experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F Wong, and Yu Cheng. Exgrpo: Learning to reason from experience. arXiv preprint arXiv:2510.02245,
-
[33]
mixup: Beyond Empirical Risk Minimization
Chang-Bin Zhang, Peng-Tao Jiang, Qibin Hou, Yunchao Wei, Qi Han, Zhen Li, and Ming-Ming Cheng. Delving deep into label smoothing. IEEE Transactions on Image Processing, 30:5984–5996, 2021a. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025a. Yikai Zhang, Songzhu Zheng, Pengxiang Wu, Mayank Goswami, and Chao Chen. Learning with feature-dependent label noise: A progressive approach. arXiv pre...
-
[35]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Self-consistency of the internal reward models improves self-rewarding language models
Xin Zhou, Yiwen Guo, Ruotian Ma, Tao Gui, Qi Zhang, and Xuanjing Huang. Self-consistency of the internal reward models improves self-rewarding language models. arXiv preprint arXiv:2502.08922,
-
[37]
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084,
-
[38]
(25) Step 3: Martingale decomposition and high-probability bound
Solving for ρKL c gives γ(1 − ρKL c )Gc − ρKL c Gn − β∆ref = 0 ⇒ ρKL c = γGc − β∆ref Gn + γGc . (25) Step 3: Martingale decomposition and high-probability bound. Define the martingale difference as in Lemma A.3: Mt(xn) = ∆Lt(xn) − E[∆Lt(xn) | F t−1]. Its magnitude is still bounded by ηC p log(1/δ)/K due to Lemma A.2. Thus, the log-ratio evolves as Lt(xn) ...
work page 2025
-
[39]
This clearly demonstrates the versatility of OLR
across both ID and OOD benchmarks. This clearly demonstrates the versatility of OLR. Detailed analysis on Early Correctness Coherence. To verify the Early Correctness Coherence phe- nomenon, we conduct a detailed analysis of the model’s predictions on clean and noisy samples under both noise scenarios (see Figure 5). The results substantiate the two key a...
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.