arxiv: 2604.03993 · v1 · submitted 2026-04-05 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Can LLMs Learn to Reason Robustly under Noisy Supervision?

Shenzhi Yang , Guangcheng Zhu , Bowen Song , Sharon Li , Haobo Wang , Xing Zheng , Yingfan Ma , Zhongqi Chen

show 2 more authors

Weiqiang Wang Gang Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords noisy labelsreinforcement learningRLVRreasoning modelslabel refinementmathematical reasoningrobust training

0 comments

The pith

Online Label Refinement corrects noisy labels in RLVR by tracking rollout pass rate slopes and consistency to enable self-correction during reasoning model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how noisy labels affect reinforcement learning with verifiable rewards when training LLMs to reason. It separates inactive noise that wastes data from active noise that gets reinforced and shifts the model wrong. Experiments reveal an Early Correctness Coherence where clean and noisy samples improve together early on before diverging. Motivated by this, the authors introduce Online Label Refinement that replaces suspect labels with majority answers only when the rollout success rate is rising and the answer has stayed consistent over updates. Across noise levels from 10% to 90%, this produces steady accuracy lifts on math benchmarks and out-of-distribution tasks.

Core claim

OLR progressively corrects potentially noisy labels with majority-voted answers when a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates hold, enabling gradual self-correction as the policy improves and delivering average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations across noise ratios from 0.1 to 0.9.

What carries the argument

Online Label Refinement (OLR), a progressive correction step that replaces labels using majority-voted answers only when rollout pass rate shows a positive slope and historical consistency is stable.

If this is right

OLR improves robustness under both inactive and active noisy-label settings across all tested noise ratios.
The method produces consistent gains on six in-distribution mathematical reasoning benchmarks including AIME, AMC, MATH-500, Minerva, and Olympiad.
Gains extend to three out-of-distribution tasks: ARC-c, GPQA-diamond, and MMLU-pro.
Early Correctness Coherence allows corrections to begin safely before noisy samples lag in later training stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rollout-based correction rule may transfer to other RLVR variants or to non-math reasoning domains where label noise arises from scarce experts.
Combining OLR with existing noise-robust RL techniques could further reduce the need for perfect supervision in large-scale reasoning training.
The two-condition check on slope and consistency offers a testable template for label cleaning in any rollout-driven training loop.

Load-bearing premise

A positive slope in the majority answer's rollout pass rate combined with stable historical consistency reliably indicates the correct label and can be used for safe correction without introducing new errors.

What would settle it

An experiment that applies OLR to a controlled dataset where majority-voted answers are known to be wrong and measures whether final model accuracy falls below the no-refinement baseline.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper splits noisy labels in RLVR into inactive and active types then offers a rollout-based correction rule that delivers modest gains across many benchmarks, but the rule's safety when noise is active rests on an assumption that needs tighter checks.

read the letter

The main new piece is the inactive/active noise split specific to rollout-based RLVR, plus the Early Correctness Coherence observation that both clean and noisy samples improve early on. From that they build Online Label Refinement, which flips a label to the majority answer when the majority's rollout pass rate shows a positive slope and historical consistency stays stable. This is a practical heuristic rather than a derived guarantee, and it is not just a rehash of standard supervised noise fixes. The experiments cover six in-distribution math benchmarks and three out-of-distribution tasks at noise ratios from 0.1 to 0.9, reporting average lifts of 3.6-3.9% in-distribution and 3.3-4.6% out-of-distribution. That breadth makes the practical claim easy to evaluate. The soft spot is the central assumption that the two triggers pick the correct label rather than locking in a consistently generated wrong one. For active noise, once the policy starts producing the noisy answer at high rate, both slope and consistency can look good for the error, so the reported gains at 0.7-0.9 noise could partly reflect reinforcement instead of genuine robustness. The paper gives little detail on exact thresholds, how often corrections are verified as right, or statistical tests, which leaves the robustness claim harder to trust. This work is aimed at groups scaling reasoning models on imperfect real-world data. It deserves peer review because the experiments are broad enough to be informative and the idea is straightforward to test further, even if the mechanism needs more validation on when the corrections actually help versus hurt.

Referee Report

2 major / 2 minor

Summary. The paper claims that in RLVR for reasoning LLMs, noisy labels can be either inactive (reducing efficiency) or active (reinforced by the policy), and that an observed Early Correctness Coherence phenomenon motivates Online Label Refinement (OLR). OLR progressively corrects labels to the majority-voted answer when the rollout pass rate of that answer shows a positive slope and historical consistency is stable. Across noise ratios 0.1–0.9, OLR yields average gains of 3.6–3.9% on six in-distribution math benchmarks and 3.3–4.6% on three OOD tasks under both inactive and active noise.

Significance. If the OLR triggers reliably select correct labels rather than consistently generated incorrect ones, the work would be significant: it supplies the first systematic treatment of noisy supervision in RLVR, demonstrates concrete robustness gains on both ID and OOD reasoning benchmarks, and offers a practical, rollout-driven correction mechanism that exploits training dynamics without requiring external clean data.

major comments (2)

[§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.
[§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.

minor comments (2)

[Abstract] Abstract and §4: report the number of independent runs and standard deviations or confidence intervals alongside the average gains; current presentation of “consistent gains” is difficult to interpret without variance information.
[§3.1] §3.1: define “rollout pass rate” and “historical consistency” with explicit formulas or pseudocode so that the OLR update rule can be re-implemented without ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work analyzing noisy supervision in RLVR and proposing Online Label Refinement. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript's rigor and reproducibility without altering its core claims.

read point-by-point responses

Referee: [§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.

Authors: We appreciate the referee highlighting this potential failure mode for high-ratio active noise. The Early Correctness Coherence phenomenon we identify shows that correct labels achieve rising rollout success earlier than noisy ones, so the slope and consistency triggers activate preferentially for the correct majority answer before policy overfitting occurs. To provide the requested explicit verification, we have added a new analysis subsection in §3 (with supporting figures) that reports the fraction of OLR corrections aligning with ground-truth labels across all noise ratios, including 0.7–0.9 active noise; this shows that the large majority of refinements are to correct labels rather than reinforcing errors. revision: yes
Referee: [§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.

Authors: We agree that these omissions limit reproducibility assessment. We have revised §3 and the experimental sections to state the exact thresholds (slope threshold of 0.02 and consistency threshold of 0.75 over a 3-update window) and moved their full definition to Appendix B. We now include paired statistical significance tests (bootstrap resampling over 5 seeds) confirming p < 0.05 for the reported average gains on both ID and OOD benchmarks. We have also added an ablation study on threshold sensitivity performed on a held-out validation split (distinct from test sets), showing stable performance within small perturbations and confirming that thresholds were not tuned on test data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; OLR is an empirically motivated heuristic

full rationale

The paper identifies an Early Correctness Coherence phenomenon from training dynamics and defines OLR correction triggers directly from observable rollout statistics (positive slope in majority-answer pass rate plus historical consistency). These triggers are not fitted to or defined in terms of the final benchmark gains; the claimed robustness improvements are measured on held-out evaluation sets after applying the rule. No equations, self-citations, or uniqueness theorems reduce the reported 3.6–4.6 % gains to the method's own inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that rollout pass-rate trends can serve as a reliable proxy for label correctness once the policy begins to improve, plus standard RLVR assumptions about verifiable rewards.

free parameters (1)

slope and consistency thresholds for label update
The two conditions that trigger correction are defined with implicit cutoffs that must be chosen or tuned.

axioms (1)

domain assumption Majority-voted rollout answers become increasingly reliable as policy improves
Invoked to justify progressive self-correction in OLR.

pith-pipeline@v0.9.0 · 5661 in / 1288 out tokens · 37522 ms · 2026-05-13T16:56:01.845530+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 15 internal anchors

[1]

The unreasonable effectiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134, 2025

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134,

work page arXiv
[2]

Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning

Hao Chen, Ran Tao, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Bhiksha Raj, and Marios Savvides. Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning. arXiv preprint arXiv:2301.10921,

work page arXiv
[3]

Mitigating memorization of noisy labels via regularization between representations

Hao Cheng, Zhaowei Zhu, Xing Sun, and Yang Liu. Mitigating memorization of noisy labels via regularization between representations. arXiv preprint arXiv:2110.09022,

work page arXiv
[4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Towards understanding deep learning from noisy labels with small-loss criterion

Xian-Jin Gui, Wei Wang, and Zhang-Hao Tian. Towards understanding deep learning from noisy labels with small-loss criterion. arXiv preprint arXiv:2106.09291,

work page arXiv
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

https://arxiv.org/abs/2503.24290. Jinchi Huang, Lie Qu, Rongfei Jia, and Binqiang Zhao. O2u-net: A simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3326–3334,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Junnan Li, Richard Socher, and Steven CH Hoi

Hugging Face repository, 13:9. Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394,

work page arXiv 2002
[13]

Noisy label processing for classification: A survey

Mengting Li and Chuang Zhu. Noisy label processing for classification: A survey. arXiv preprint arXiv:2404.04159,

work page arXiv
[14]

Confidence is all you need: Few-shot rl fine-tuning of language models.arXiv preprint arXiv:2506.06395, 2025

Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, and Ivan Oseledets. Confidence is all you need: Few-shot rl fine-tuning of language models. arXiv preprint arXiv:2506.06395, 2025a. Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. Selective-supervised contrastive learning with noisy labels. In Proceedings of the IEEE/CVF conference on c...

work page arXiv
[15]

Limr: Less is more for rl scaling.arXiv preprint arXiv:2502.11886, 2025

Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling. arXiv preprint arXiv:2502.11886, 2025b. Yifan Li, Hu Han, Shiguang Shan, and Xilin Chen. Disc: Learning from noisy labels via dynamic instance-specific selection and correction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24070–24079,

work page arXiv
[16]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification.arXiv preprint arXiv:2503.06639,

Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification.arXiv preprint arXiv:2503.06639,

work page arXiv
[18]

Self: Learning to filter noisy labels with self-ensembling

Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. Self: Learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842,

work page arXiv 1910
[19]

Making deep neural networks robust to label noise: A loss correction approach

12 Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1944–1952,

work page 1944
[20]

Regularizing Neural Networks by Penalizing Confident Output Distributions

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548,

work page Pith review arXiv
[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

https: //arxiv.org/abs/2402.03300. Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International conference on machine learning, pages 5739–5748. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Dropout: a simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958,

work page 1929
[23]

Training Convolutional Networks with Noisy Labels

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080,

work page Pith review arXiv 2080
[24]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024a. Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, ...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge

Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv preprint arXiv:2407.19594,

work page arXiv
[27]

Self-rewarding correction for mathematical reasoning

Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning. arXiv preprint arXiv:2502.19613,

work page arXiv
[28]

arXiv preprint arXiv:2504.14945 , year =

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025a. Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, et al. Verifybench: Benchmarking reference-based reward systems ...

work page arXiv
[29]

Deep learning from noisy image labels with quality embedding

Jiangchao Yao, Jiajie Wang, Ivor W Tsang, Ya Zhang, Jun Sun, Chengqi Zhang, and Rui Zhang. Deep learning from noisy image labels with quality embedding. IEEE Transactions on Image Processing, 28(4):1909–1922,

work page 1909
[30]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Self-Rewarding Language Models

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 3,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Exgrpo: Learning to reason from experience

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F Wong, and Yu Cheng. Exgrpo: Learning to reason from experience. arXiv preprint arXiv:2510.02245,

work page arXiv
[33]

mixup: Beyond Empirical Risk Minimization

Chang-Bin Zhang, Peng-Tao Jiang, Qibin Hou, Yunchao Wei, Qi Han, Zhen Li, and Ming-Ming Cheng. Delving deep into label smoothing. IEEE Transactions on Image Processing, 30:5984–5996, 2021a. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025a. Yikai Zhang, Songzhu Zheng, Pengxiang Wu, Mayank Goswami, and Chao Chen. Learning with feature-dependent label noise: A progressive approach. arXiv pre...

work page arXiv
[35]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Self-consistency of the internal reward models improves self-rewarding language models

Xin Zhou, Yiwen Guo, Ruotian Ma, Tao Gui, Qi Zhang, and Xuanjing Huang. Self-consistency of the internal reward models improves self-rewarding language models. arXiv preprint arXiv:2502.08922,

work page arXiv
[37]

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084,

work page Pith review arXiv
[38]

(25) Step 3: Martingale decomposition and high-probability bound

Solving for ρKL c gives γ(1 − ρKL c )Gc − ρKL c Gn − β∆ref = 0 ⇒ ρKL c = γGc − β∆ref Gn + γGc . (25) Step 3: Martingale decomposition and high-probability bound. Define the martingale difference as in Lemma A.3: Mt(xn) = ∆Lt(xn) − E[∆Lt(xn) | F t−1]. Its magnitude is still bounded by ηC p log(1/δ)/K due to Lemma A.2. Thus, the log-ratio evolves as Lt(xn) ...

work page 2025
[39]

This clearly demonstrates the versatility of OLR

across both ID and OOD benchmarks. This clearly demonstrates the versatility of OLR. Detailed analysis on Early Correctness Coherence. To verify the Early Correctness Coherence phe- nomenon, we conduct a detailed analysis of the model’s predictions on clean and noisy samples under both noise scenarios (see Figure 5). The results substantiate the two key a...

work page 2015