
arxiv: 2510.00915 · v4 · pith:GFZXOLDPnew · submitted 2025-10-01 · 💻 cs.LG · cs.AI

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Pith reviewed 2026-05-25 07:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · verifiable rewards · noisy rewards · imperfect verifiers · policy gradient · math reasoning · stochastic channel

The pith

Two corrections from a stochastic reward channel model reduce the impact of imperfect verifiers on RLVR for math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models verifier errors as a memoryless stochastic channel with false-positive rate ρ0 and false-negative rate ρ1, then derives two lightweight fixes for binarized rewards in reinforcement learning. A backward correction produces an unbiased surrogate reward that yields an unbiased policy-gradient estimator in expectation. A forward correction reweights score-function terms so the expected update matches the clean gradient direction and needs only the false-negative rate. Both are implemented as hooks in a group relative policy optimization pipeline and improve results on math reasoning under synthetic and real verifier noise, with the forward version remaining more stable at higher noise levels. An appeals mechanism using a lightweight LLM verifier estimates the false-negative rate online and yields further gains.

Core claim

From the abstraction of verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ0 and ρ1, two corrections follow: the backward correction yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, while the forward correction reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the false-negative rate. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. An appeals mechanism with a lightweight LLM verifier estimates the false-negative rate online and further improves performance.

What carries the argument

Stochastic reward channel with false-positive rate ρ0 and false-negative rate ρ1, from which backward unbiased estimation and forward score-function reweighting are derived.
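
As a rough illustration of the channel and the backward correction, the Python sketch below computes the surrogate reward (R̃ − ρ0) / (1 − ρ0 − ρ1) and checks, by exact enumeration over the two noisy outcomes, that its expectation recovers the clean label. The function names and the example rates are illustrative assumptions, not taken from the paper's code, and the sketch assumes ρ0 + ρ1 < 1 so that the denominator is positive.

```python
import random


def noisy_verdict(clean_label: int, rho0: float, rho1: float, rng: random.Random) -> int:
    """Pass a clean 0/1 label through the channel.

    A wrong answer (0) is accepted with probability rho0 (false positive);
    a correct answer (1) is rejected with probability rho1 (false negative).
    """
    if clean_label == 1:
        return 0 if rng.random() < rho1 else 1
    return 1 if rng.random() < rho0 else 0


def backward_correct(noisy_reward: float, rho0: float, rho1: float) -> float:
    """Surrogate reward (R~ - rho0) / (1 - rho0 - rho1)."""
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        raise ValueError("this sketch assumes rho0 + rho1 < 1")
    return (noisy_reward - rho0) / denom


def expected_surrogate(clean_label: int, rho0: float, rho1: float) -> float:
    """Exact expectation of the surrogate given the clean label."""
    p_accept = (1.0 - rho1) if clean_label == 1 else rho0
    return (p_accept * backward_correct(1, rho0, rho1)
            + (1.0 - p_accept) * backward_correct(0, rho0, rho1))


def monte_carlo_surrogate(clean_label: int, rho0: float, rho1: float,
                          n: int = 50_000, seed: int = 0) -> float:
    """Sample-average of the surrogate, as a sanity check on the exact value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += backward_correct(noisy_verdict(clean_label, rho0, rho1, rng), rho0, rho1)
    return total / n
```

Note that individual surrogate values fall outside [0, 1] (an accepted sample is worth more than 1, a rejected one less than 0); only the expectation matches the clean reward, which is one plausible source of the extra variance at high noise rates.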

If this is right

  • Both corrections can be added as lightweight hooks inside existing group relative policy optimization pipelines.
  • Performance on math reasoning tasks improves under both synthetic and real verifier noise.
  • The forward correction maintains stability when noise rates are increased.
  • Online estimation of the false-negative rate via an appeals mechanism with a lightweight verifier yields additional gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same channel model and corrections could be applied to other domains that use automated binary verifiers, such as code generation or theorem proving.
  • If noise rates vary with the policy's outputs, the memoryless assumption would no longer hold and the corrections would need adaptive rate tracking.
  • A combined backward-forward correction might be derived for cases where both rates are known, potentially offering further robustness.

Load-bearing premise

Verifier errors can be captured by a memoryless stochastic channel whose rates are known or can be estimated online without depending on the current policy.

What would settle it

Run the corrected RLVR pipeline on a verifier whose error rates are deliberately made to depend on the policy's current outputs and check whether the reported performance gains over the uncorrected baseline disappear.
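
To make the stakes of this test concrete, the Python sketch below computes the exact expected backward-corrected reward when the true false-negative rate depends on answer length but the correction uses a single pooled rate. The two-level rate function, the length threshold, and all numeric values are hypothetical choices made for illustration; none come from the paper.

```python
def fn_rate(length: int, short_rate: float = 0.1, long_rate: float = 0.5,
            threshold: int = 100) -> float:
    """Hypothetical false-negative rate that depends on answer length."""
    return short_rate if length < threshold else long_rate


def expected_corrected_reward(clean_label: int, length: int,
                              rho0: float, rho1_assumed: float) -> float:
    """Exact E[surrogate | clean label, length].

    The verifier's true FN rate is fn_rate(length), but the backward
    correction is applied with one pooled value rho1_assumed.
    """
    # Probability that the noisy verifier accepts this answer.
    p_accept = (1.0 - fn_rate(length)) if clean_label == 1 else rho0
    # The noisy reward is 0/1, so its expectation equals p_accept.
    return (p_accept - rho0) / (1.0 - rho0 - rho1_assumed)
```

With ρ0 = 0 and a pooled rate of 0.3 (the average of 0.1 and 0.5 under an equal mix of short and long correct answers), a correct short answer has expected surrogate 0.9 / 0.7 ≈ 1.29 while a correct long answer has 0.5 / 0.7 ≈ 0.71. Under these assumptions the surrogate is correct on average over the mix but systematically favors short answers, which is the kind of policy-dependent bias the proposed test is meant to expose.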

Figures

Figures reproduced from arXiv: 2510.00915 by Feng Liu, Gang Niu, Masashi Sugiyama, Tongliang Liu, Wei Wang, Xin-Qiang Cai.

Figure 1: Verifier-noise flow in RLVR. An AI agent produces candidate solutions that are scored …
Figure 2: Synthetic-Noise Results (pass@1) with 16 samples and 5 random seeds on the four backbones. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise BC: training with noise under backward correction; Noise FC: training with noise under forward correction.
Figure 3: Synthetic-Noise Results (pass@8) with 16 samples and 5 random seeds on the four backbones Llama-3.2-3B-Instruct, and Qwen2.5-Math-7B. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise BC: training with noise under backward correction; Noise FC: training with noise under forward correction.
Figure 4: Robustness results. (a) Backward correction (BC) with ρ̂ …
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$ -- the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline, both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper models imperfect verifiers in RLVR as a memoryless stochastic reward channel with fixed false-positive rate ρ₀ and false-negative rate ρ₁. From this model it derives (i) a backward correction producing an unbiased surrogate reward (and thus unbiased policy-gradient estimator) and (ii) a forward correction that reweights the score-function estimator so its expectation aligns with the clean gradient (requiring only ρ₁). Both are implemented as lightweight modifications to a GRPO pipeline; experiments on math-reasoning tasks report that both corrections improve performance under synthetic and real verifier noise, with the forward variant more stable under heavier noise. An appeals mechanism using a lightweight LLM verifier is introduced to estimate ρ₁ online.

Significance. If the derivations and empirical gains hold under the stated noise model, the work supplies practical, low-overhead corrections that can be dropped into existing RLVR pipelines without altering the core optimizer. The online FN-rate estimator via appeals is a concrete engineering contribution that addresses a practical deployment issue. The approach is directly relevant to scaling automated-verifier RL for reasoning tasks.

major comments (2)
  1. [§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.
  2. [Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.
minor comments (2)
  1. [§3] Notation for the two rates is introduced as ρ₀ (FP) and ρ₁ (FN) in the abstract but should be restated with a short table or equation block at the start of §3 for readers who skip the abstract.
  2. [Appeals mechanism] The appeals mechanism is described only at a high level; a short pseudocode block or explicit update rule for the online ρ₁ estimator would improve reproducibility.
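
For orientation only, here is one way such an update rule could look in Python. This is a hypothetical sketch, not the paper's estimator: it assumes every rejected sample is appealed, treats the appeal verifier's verdict as ground truth, and assumes the base verifier has negligible false positives so that accepted answers count as truly correct. The class name, fields, and `prior_rate` fallback are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AppealsFNEstimator:
    """Running estimate of rho1 = P(base verifier rejects | answer is correct).

    Assumptions (hypothetical, see lead-in): all rejections are appealed,
    appeal verdicts are trusted, and base-verifier acceptances are correct.
    """
    accepted: int = 0      # base verifier said 1 (counted as true positives)
    overturned: int = 0    # base verifier said 0, appeal verifier said 1
    prior_rate: float = 0.0  # value returned before any correct answer is seen

    def update(self, base_verdicts: Sequence[int],
               appeal_verdicts: Sequence[Optional[int]]) -> None:
        """Consume one batch; appeal_verdicts[i] is None when no appeal was made."""
        for base, appeal in zip(base_verdicts, appeal_verdicts):
            if base == 1:
                self.accepted += 1
            elif appeal == 1:
                self.overturned += 1

    @property
    def rho1(self) -> float:
        # Estimated correct answers = accepted + overturned rejections.
        correct = self.accepted + self.overturned
        return self.overturned / correct if correct else self.prior_rate
```

A cumulative count is the simplest choice; an exponentially weighted version would track drift in the rate during training, at the cost of a noisier estimate.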

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the central role of the independence assumption in our noise model. We address each major comment below and commit to revisions that strengthen the presentation of the assumptions and provide additional validation.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical derivations): both the backward unbiased-surrogate claim and the forward reweighting claim are derived under the explicit assumption that the noise channel is memoryless and that ρ₀, ρ₁ are constants independent of the current policy π and of the sampled answer a. The skeptic note correctly identifies that if verifier error rates depend on properties of a (length, syntactic complexity, token distribution) that themselves shift under policy updates, then E[noisy reward | a, π] ≠ E[noisy reward | y] and the claimed unbiasedness or directional alignment no longer holds. This independence assumption is load-bearing for the central theoretical contribution; the manuscript should either prove robustness to mild dependence or provide a concrete diagnostic test.

    Authors: We agree that the memoryless channel with policy- and answer-independent rates is a load-bearing assumption required for the exact unbiasedness of the backward correction and the directional alignment of the forward correction. The derivations in §3 are stated under this model. While a general proof of robustness to arbitrary dependence is outside the scope of the present work, we will revise the manuscript to (i) explicitly restate the assumption and discuss its practical relevance for math-reasoning verifiers (where error is driven primarily by semantic mismatch rather than policy-induced distributional shifts) and (ii) introduce a concrete diagnostic that bins answers by length and syntactic features, estimates empirical ρ₁ within each bin across training epochs, and flags statistically significant policy dependence. If dependence is observed, the appeals-based estimator can be extended to condition on these features. revision: yes

  2. Referee: [Experiments] Experiments section (synthetic and real-noise results): the reported improvements are shown only under fixed, policy-independent noise channels (synthetic) or under a single real verifier whose error statistics are treated as constant. No ablation or diagnostic is presented that varies noise rates with answer properties that change during training. Without such a check, it is unclear whether the observed gains survive when the independence assumption is violated, which directly affects the practical significance of the forward/backward corrections.

    Authors: We acknowledge that the current experimental suite uses stationary noise rates. In the revision we will add a new ablation in which the false-negative rate is made explicitly dependent on answer length (a property that evolves during training). We will generate synthetic data under this length-dependent noise model, re-run the GRPO pipeline with both corrections, and report whether performance gains relative to the uncorrected baseline persist. We will also apply the binning diagnostic described above to the existing real-verifier experiments and include the results. revision: yes
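
The binning diagnostic mentioned in these responses could be prototyped along the following lines. This Python sketch is an assumption about what such a diagnostic might look like, not the authors' implementation: it bins by length only, requires samples whose clean label is known, and uses a plain Pearson chi-square statistic for homogeneity of the false-negative rate across bins. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def fn_counts_by_bin(records: Iterable[Tuple[int, int, int]],
                     bin_edges: Sequence[int]) -> Dict[int, Tuple[int, int]]:
    """Count false negatives per length bin.

    records: (answer_length, clean_label, noisy_verdict) triples.
    Returns {bin_index: (false_negatives, truly_correct_total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for length, clean, noisy in records:
        if clean != 1:
            continue  # the FN rate conditions on truly correct answers
        b = sum(length >= edge for edge in bin_edges)
        counts[b][1] += 1
        if noisy == 0:
            counts[b][0] += 1
    return {b: (fn, n) for b, (fn, n) in sorted(counts.items())}


def chi_square_homogeneity(bin_counts: Dict[int, Tuple[int, int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for
    the null hypothesis 'the FN rate is the same in every bin'."""
    total_fn = sum(fn for fn, _ in bin_counts.values())
    total_n = sum(n for _, n in bin_counts.values())
    if total_n == 0:
        raise ValueError("no truly correct answers to estimate an FN rate from")
    pooled = total_fn / total_n
    stat = 0.0
    for fn, n in bin_counts.values():
        for observed, expected in ((fn, n * pooled), (n - fn, n * (1.0 - pooled))):
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, len(bin_counts) - 1
```

The statistic can be turned into a p-value with any chi-square survival function (for example `scipy.stats.chi2.sf(stat, dof)`); repeating the computation per training epoch would show whether the dependence grows as the policy's length distribution shifts.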

Circularity Check

0 steps flagged

No circularity: derivations are direct mathematical consequences of the explicitly stated noise-channel model

full rationale

The paper defines a memoryless stochastic reward channel with fixed rates ρ0 (FP) and ρ1 (FN), then algebraically derives the backward correction (unbiased surrogate reward) and forward correction (reweighted score-function estimator) as expectations conditional on the true label y. These steps follow immediately from the channel definition and do not reduce to any fitted quantity on the evaluation data, any self-citation chain, or any renaming of an empirical pattern. The online FN-rate estimator via appeals is presented as a separate practical mechanism under the same independence assumption; it does not feed back into the derivation of the corrections themselves. The central claims therefore remain independent of the results they produce.
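
The algebraic step for the backward correction can be written out explicitly. The following is a reconstruction from the channel definition and the estimator formula quoted on this page, using R* for the clean label, R̃ for the noisy verdict, and R̂ for the surrogate; it assumes ρ₀ + ρ₁ < 1 so the denominator is nonzero and positive.

```latex
\begin{align*}
\mathbb{E}[\tilde{R} \mid R^*]
  &= R^*(1-\rho_1) + (1-R^*)\,\rho_0
   = \rho_0 + (1-\rho_0-\rho_1)\,R^*, \\
\mathbb{E}[\hat{R} \mid R^*]
  &= \frac{\mathbb{E}[\tilde{R} \mid R^*] - \rho_0}{1-\rho_0-\rho_1}
   = R^*.
\end{align*}
```

Because the identity holds conditionally on R*, it carries over to any expectation in which the noise is independent of everything else given R*, which is exactly where the memoryless assumption enters.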

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating verifier errors as independent draws from a fixed two-parameter channel; the noise rates themselves function as free parameters of the model.

free parameters (2)
  • ρ0 (false-positive rate)
    Parameter of the stochastic reward channel; required for the backward correction.
  • ρ1 (false-negative rate)
    Parameter of the stochastic reward channel; required for both corrections and the online estimator.
axioms (2)
  • domain assumption Verifier errors are memoryless and independent of the policy being trained.
    Invoked when the reward channel is defined and when expectations are taken over the noise.
  • standard math The policy-gradient theorem continues to hold when the observed reward is replaced by the corrected surrogate.
    Background assumption needed to claim that the corrected estimator is unbiased or aligned.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates ρ₀ and ρ₁ … instance-independent class-conditional noise rates (ρ₀, ρ₁) that do not vary with (x, y)

  • IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    the estimator R̂ = (R̃ − ρ₀) / (1 − ρ₀ − ρ₁) is an unbiased estimator … E[Δθ] = c ∇_θ J(θ) with c = (1 − ρ₀ − ρ₁)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

    cs.LG 2026-02 unverdicted novelty 7.0

    Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

  2. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

    cs.AI 2026-05 unverdicted novelty 6.0

    POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

  3. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 6.0

    The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...

  4. Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...

  5. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  6. On Training in Imagination

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.

  7. VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

    cs.LG 2026-02 unverdicted novelty 5.0

    VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.

  8. High-Dimensional Statistics: Reflections on Progress and Open Problems

    math.ST 2026-05 unverdicted novelty 2.0

    A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 7 Pith papers · 4 internal anchors

  1. [1]

    Humans or llms as the judge? a study on judgement bias

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327, 2024

  2. [2]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  3. [3]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

  4. [4]

    Co-teaching: Robust training of deep neural networks with extremely noisy labels

    Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: A...

  5. [5]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedin...

  6. [6]

    Association for Computational Linguistics, 2024

  7. [7]

    Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning

    Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers–a case study on mathematical reasoning. arXiv preprint arXiv:2505.22203, 2025

  8. [8]

    Math-verify: A robust mathematical expression evaluator for llm outputs

    Hugging Face. Math-verify: A robust mathematical expression evaluator for llm outputs. GitHub repository, 2025. URL https://github.com/huggingface/Math-Verify

  9. [9]

    Aime 2024 (dataset card)

    HuggingFaceH4. Aime 2024 (dataset card). Hugging Face, 2024. URL https://huggingface.co/datasets/HuggingFaceH4/aime_2024

  10. [10]

    Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels

    Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Pr...

  11. [11]

    On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference

    Vishesh Karwa and Edoardo M Airoldi. On the admissibility of horvitz-thompson estimator for estimating causal effects under network interference. arXiv preprint arXiv:2312.01234, 2023

  12. [12]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. ...

  13. [13]

    Dividemix: Learning with noisy labels as semi-supervised learning

    Junnan Li, Richard Socher, and Steven C. H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

  14. [14]

    Provably end-to-end label-noise learning without anchor points

    Xuefeng Li, Tongliang Liu, Bo Han, Gang Niu, and Masashi Sugiyama. Provably end-to-end label-noise learning without anchor points. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6403–6413. PMLR, 2021

  15. [15]

    Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains

    Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang. Verifybench: A systematic benchmark for evaluating reasoning verifiers across domains. arXiv preprint arXiv:2507.09884, 2025

  16. [16]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  17. [17]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

  18. [18]

    Amc 2023 (dataset card)

    math-ai. Amc 2023 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/math-ai/amc23

  19. [19]

    Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification

    Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo’s effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025

  20. [20]

    Learning with noisy labels

    Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting h...

  21. [21]

    Aime 2025 (dataset card)

    OpenCompass. Aime 2025 (dataset card). Hugging Face, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025

  22. [22]

    Making deep neural networks robust to label noise: A loss correction approach

    Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2233–2241. IEEE Computer Society, 2017

  23. [23]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  24. [24]

    Optimization-based prompt injection attack to llm-as-a-judge

    Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Optimization-based prompt injection attack to llm-as-a-judge. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.), Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, Octob...

  25. [25]

    Judging the judges: A systematic study of position bias in llm-as-a-judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. arXiv preprint arXiv:2406.07791, 2025

  26. [26]

    Learning from noisy labels with deep neural networks: A survey

    Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022

  27. [27]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller (eds.), Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 10...

  28. [28]

    Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges. arXiv preprint arXiv:2406.12624, 2024

  29. [29]

    Reinforcement learning with perturbed rewards

    Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA...

  30. [30]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  31. [31]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural I...

  32. [32]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025

  33. [33]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992

  34. [34]

    Tinyv: Reducing false negatives in verification improves rl for llm reasoning

    Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, and Radha Poovendran. Tinyv: Reducing false negatives in verification improves rl for llm reasoning. arXiv preprint arXiv:2505.14625, 2025

  35. [35]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...

  36. [36]

    One token to fool llm-as-a-judge

    Yulai Zhao, Haolin Liu, Dian Yu, SY Kung, Haitao Mi, and Dong Yu. One token to fool llm-as-a-judge. arXiv preprint arXiv:2507.08794, 2025

  37. [37]

    Least-to-most prompting enables complex reasoning in large language models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net...

  38. [38]

    The unconditional expectation is zero: $\mathbb{E}[G_t] = 0$ [32, 26]

  39. [39]

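    The clean policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}[R^* G_t]$. From property 1, we have $\mathbb{E}[G_t] = \mathbb{E}[(\mathbf{1}_{\{R^*=1\}} + \mathbf{1}_{\{R^*=0\}})\, G_t] = \mathbb{E}[R^* G_t] + \mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = 0$. This implies that $\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -\mathbb{E}[R^* G_t] = -\nabla_\theta J(\theta)$. Finally, we substitute this back into our expression for the expected update direction: $\mathbb{E}[h_t] = \mathbb{E}[w_{\tilde{R}}\, G_t] = -(1-\rho_0-\rho_1)\cdot\mathbb{E}[\mathbf{1}_{\{R^*=0\}} G_t] = -(1-\rho_0-\rho_1)\cdot(-\nabla_\theta J(\theta))$ = (1 − ρ₀ − ...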
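
The derivation fragment above can be checked numerically on a toy problem. The Python sketch below uses a softmax policy over a handful of actions and computes both the clean gradient and the expected forward-corrected update exactly, with no sampling. The weights w₁ = ρ₁ (noisy verdict 1) and w₀ = ρ₁ − 1 (noisy verdict 0) are an assumption: they are one choice that depends only on the false-negative rate and reproduces the conditional expectations used in the fragment, and the paper's exact weights may differ. All names and the toy setup are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score(action: int, probs: Sequence[float]) -> List[float]:
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def forward_weight(noisy_reward: int, rho1: float) -> float:
    # Assumed weights (see lead-in): only the FN rate rho1 is needed.
    return rho1 if noisy_reward == 1 else rho1 - 1.0


def clean_gradient(logits: Sequence[float], clean_rewards: Sequence[int]) -> List[float]:
    """Exact E[R* G] under the softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, r) in enumerate(zip(probs, clean_rewards)):
        g = score(a, probs)
        for i in range(len(grad)):
            grad[i] += p * r * g[i]
    return grad


def expected_forward_update(logits: Sequence[float], clean_rewards: Sequence[int],
                            rho0: float, rho1: float) -> List[float]:
    """Exact E[w_{R~} G], averaging over both the policy and the noise channel."""
    probs = softmax(logits)
    update = [0.0] * len(logits)
    for a, (p, r) in enumerate(zip(probs, clean_rewards)):
        p_accept = (1.0 - rho1) if r == 1 else rho0
        w_mean = (p_accept * forward_weight(1, rho1)
                  + (1.0 - p_accept) * forward_weight(0, rho1))
        g = score(a, probs)
        for i in range(len(update)):
            update[i] += p * w_mean * g[i]
    return update
```

Under these assumed weights the expected weight is zero for truly correct answers and −(1 − ρ₀ − ρ₁) for incorrect ones, so the expected update should equal (1 − ρ₀ − ρ₁) times the clean gradient: ρ₀ affects only the step scale, not the direction.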