pith. machine review for the scientific record.

arxiv: 2509.21882 · v2 · submitted 2025-09-26 · 💻 cs.LG · cs.AI

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Pith reviewed 2026-05-18 14:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning with verifiable rewards · LLM evaluation · data contamination · benchmark reliability · RLVR · measurement confounds

The pith

Many reported RLVR gains on math and code tasks shrink or vanish once budgets, prompts, and contamination are controlled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that headline improvements from reinforcement learning with verifiable rewards often reflect measurement artifacts rather than genuine reasoning advances. Through budget-matched reproductions and partial-prompt probes, the authors show that performance gaps narrow substantially when evaluation budgets are aligned, abstention behaviors are tracked, and contaminated examples are treated as memorization checks instead of reasoning tests. A sympathetic reader would care because overstated gains can hide reliability problems, encourage over-optimism about model capabilities, and waste effort on flawed benchmarks. The work does not claim RLVR is ineffective but shows that current reporting practices frequently conflate policy changes with three specific confounds.

Core claim

In budget-matched reproductions and partial-prompt contamination probes, several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective; it implies that current measurements often overstate capability gains and obscure reliability costs.

What carries the argument

Budget-matched reproductions, combined with partial-prompt contamination probes, carry the argument: together they isolate policy improvement from three confounds, namely budget mismatch, attempt inflation with calibration drift, and data contamination.
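
A partial-prompt contamination probe can be illustrated with a minimal sketch. It assumes the probe shows a model only the leading fraction of each benchmark question and checks whether the model reproduces the hidden remainder. The `complete` callable, the character-level split, and the prefix-match criterion are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Callable, Iterable


def split_prompt(question: str, keep_frac: float) -> tuple[str, str]:
    """Split a question into a visible prefix and a hidden remainder (by characters)."""
    if not 0.0 < keep_frac < 1.0:
        raise ValueError("keep_frac must be strictly between 0 and 1")
    cut = int(len(question) * keep_frac)
    return question[:cut], question[cut:]


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def partial_prompt_match_rate(
    questions: Iterable[str],
    complete: Callable[[str], str],  # hypothetical model call: prefix -> continuation
    keep_frac: float = 0.8,
) -> float:
    """Fraction of questions whose hidden remainder the model reproduces.

    Assumption: a hit means the normalized completion starts with the
    normalized remainder. Other criteria (answer accuracy, n-gram overlap)
    are equally plausible.
    """
    questions = list(questions)
    if not questions:
        raise ValueError("need at least one question")
    hits = 0
    for question in questions:
        prefix, remainder = split_prompt(question, keep_frac)
        target = _normalize(remainder)
        if target and _normalize(complete(prefix)).startswith(target):
            hits += 1
    return hits / len(questions)
```

Under these assumptions, a high match rate on a benchmark suggests its questions were seen during training, so scores on that set are better read as a memorization check than as evidence of reasoning.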

If this is right

  • RLVR remains effective and deployable in verifiable domains when measured with the proposed controls.
  • Reasoning gains from RLVR should be treated as provisional without budget-matched saturation curves and contamination screens.
  • Current benchmarks obscure reliability costs such as calibration drift and increased confident errors.
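  • A compact minimum standard for RLVR includes budget-matched saturation curves with variance, calibration and abstention tracking, one judge-robustness stress test when LLM judges are used, and an explicit contamination screen.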
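
Abstention tracking, one of the controls listed above, can be sketched as follows. The sketch assumes an abstention is recorded as `None` and that answers are compared by exact string match; the `AbstentionReport` fields are illustrative names, not metrics defined by the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AbstentionReport:
    n: int
    attempt_rate: float             # attempted / all items
    accuracy_overall: float         # correct / all items
    accuracy_when_attempted: float  # correct / attempted items
    wrong_answer_rate: float        # incorrect attempts / all items


def abstention_report(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> AbstentionReport:
    """Score a run while keeping abstentions (None) separate from wrong answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    n = len(gold)
    attempted = sum(p is not None for p in predictions)
    correct = sum(p is not None and p == g for p, g in zip(predictions, gold))
    return AbstentionReport(
        n=n,
        attempt_rate=attempted / n,
        accuracy_overall=correct / n,
        # Reported as 0.0 when nothing was attempted (a convention chosen here).
        accuracy_when_attempted=correct / attempted if attempted else 0.0,
        wrong_answer_rate=(attempted - correct) / n,
    )
```

Comparing the report before and after RLVR separates two effects that overall accuracy alone hides: a model can raise `accuracy_overall` by attempting more items while `wrong_answer_rate` rises with it.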

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar measurement confounds could affect evaluations of other post-training techniques that rely on the same benchmarks.
  • Applying the same probes to non-verifiable reward settings might reveal whether the patterns are specific to RLVR or more general.
  • Widespread adoption of the proposed standards would likely slow the rate of headline claims while raising the reliability of verified advances.

Load-bearing premise

That the budget-matched reproductions and partial-prompt contamination probes are representative of the headline results in the broader RLVR literature and that the three listed confounds are the dominant sources of overstated gains.

What would settle it

A controlled reproduction that matches budgets, prompts, and dataset versions, excludes or flags contaminated items, and still reports large persistent gains on the original headline benchmarks would falsify the claim.

Figures

Figures reproduced from arXiv: 2509.21882 by Aaron Tu, Amin Saberi, Bing Hu, Fang Wu, Ge Liu, Hanqun Cao, Heli Qi, Huaxiu Yao, Jure Leskovec, Li Erran Li, Nan Liu, Naoto Yokoya, Peng Xia, Qingcheng Zeng, Rui Yang, Shayan Talaei, Weihao Xuan, Wenqi Shi, Xiangru Tang, Xu Huang, Yejin Choi, Yijia Xiao, Yuchen Zhuang.

Figure 1: Paper Roadmap: taxes, evaluation pitfalls, contamination, and the unified protocol.
Figure 2: Monthly RLVR activity vs. AIME performance (time span: May 2024–June 2025).

Original abstract

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluation, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) data contamination in benchmarks. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched, and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, one judge robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.
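
One possible reading of "budget-matched saturation curves" is sketched below. It assumes the budget is the number of sampled attempts k per problem and the metric is pass@k, computed with the standard unbiased combinatorial estimator. The function names and the choice of k as the budget unit are assumptions for illustration; the paper may define budget differently (for example, in tokens).

```python
from math import comb
from typing import Dict, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def saturation_curve(
    correct_counts: Sequence[int], n: int, ks: Sequence[int]
) -> Dict[int, float]:
    """Mean pass@k over problems for each budget k, all from the same n samples."""
    if not correct_counts:
        raise ValueError("need at least one problem")
    return {
        k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
        for k in ks
    }


def matched_gap(curve_a: Dict[int, float], curve_b: Dict[int, float]) -> Dict[int, float]:
    """Per-budget gap between two curves; refuses to compare mismatched budgets."""
    if curve_a.keys() != curve_b.keys():
        raise ValueError("curves must be evaluated at identical budgets")
    return {k: curve_a[k] - curve_b[k] for k in sorted(curve_a)}
```

Reporting the whole curve, rather than a single point, shows whether a gap at small k persists as the baseline is given the same sampling budget.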

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a position paper arguing that many headline gains from reinforcement learning with verifiable rewards (RLVR) on math, code, and structured tasks are not yet well validated. It identifies three confounds—budget mismatch between RLVR runs and baseline evaluations, attempt inflation and calibration drift that turn abstentions into answers, and data contamination—and reports that budget-matched reproductions plus partial-prompt contamination probes cause several widely cited gaps to shrink substantially or disappear. The authors conclude that current measurements often overstate capability gains and obscure reliability costs, while proposing a compact minimum standard for RLVR: budget-matched saturation curves with variance, calibration and abstention tracking, one judge-robustness stress test, and an explicit contamination screen.

Significance. If the central empirical observations hold and generalize, the paper would usefully flag systematic measurement problems in a fast-moving area of LLM post-training. It gives credit to RLVR as a practical method while insisting that reasoning claims remain provisional without the listed controls. The constructive proposal for a tax-aware minimum standard is a clear strength that could improve reproducibility and reduce overstated claims.

major comments (2)
  1. The central claim that 'many headline RLVR gains are not yet well validated' and that 'current measurements often overstate capability gains' is load-bearing on the representativeness of the budget-matched reproductions and partial-prompt probes. The manuscript must explicitly state the selection criteria for the reproduced papers/tasks and demonstrate that the three confounds are dominant rather than specific to the chosen subset; without this, the observed shrinkage cannot be taken as diagnostic of the broader literature (see the skeptic note on generalization).
  2. The abstract and the section on reproductions report that gaps 'shrink substantially or disappear' once budgets, prompts, and dataset versions are matched. To support this, the manuscript should include the exact number of runs, sample sizes, error bars, and statistical tests for each reproduced gap; the current description leaves the magnitude and reliability of the shrinkage difficult to assess.
minor comments (2)
  1. Clarify the precise definition of 'budget-matched' (token budget, wall-clock time, or number of generations) and how abstention rates are measured in the calibration-drift analysis.
  2. The proposed minimum standard is compact and useful; consider adding a short table that maps each recommended control to the confound it addresses.
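
For the calibration side of minor comment 1, one widely used summary statistic is expected calibration error (ECE). The sketch below assumes each answer comes with a self-reported confidence in [0, 1] and uses equal-width bins; how confidences are elicited, and whether the paper bins this way, are not specified here and are treated as assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equal in length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += bool(ok)
        count[idx] += 1
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            accuracy = hit_sum[b] / count[b]
            avg_conf = conf_sum[b] / count[b]
            ece += (count[b] / total) * abs(accuracy - avg_conf)
    return ece
```

Calibration drift would then show up as a higher ECE for the RLVR-tuned model than for the baseline on the same items, computed over attempted answers only so that abstentions are counted separately.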

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for recognizing the potential value of highlighting measurement issues in RLVR evaluations. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: The central claim that 'many headline RLVR gains are not yet well validated' and that 'current measurements often overstate capability gains' is load-bearing on the representativeness of the budget-matched reproductions and partial-prompt probes. The manuscript must explicitly state the selection criteria for the reproduced papers/tasks and demonstrate that the three confounds are dominant rather than specific to the chosen subset; without this, the observed shrinkage cannot be taken as diagnostic of the broader literature (see the skeptic note on generalization).

    Authors: We agree that the manuscript should explicitly state the selection criteria. The reproduced papers and tasks were selected as prominent, highly cited examples of RLVR applications on math and code benchmarks that reported substantial gains; we will add a clear subsection describing these criteria, including citation thresholds, task domains, and reported effect sizes. We will also revise the text to emphasize that these cases are illustrative rather than exhaustive, and to discuss the limits of generalization more explicitly. However, a comprehensive demonstration that the confounds dominate the entire literature would require a systematic meta-review beyond the scope of this position paper. revision: partial

  2. Referee: The abstract and the section on reproductions report that gaps 'shrink substantially or disappear' once budgets, prompts, and dataset versions are matched. To support this, the manuscript should include the exact number of runs, sample sizes, error bars, and statistical tests for each reproduced gap; the current description leaves the magnitude and reliability of the shrinkage difficult to assess.

    Authors: We will update the reproductions section (and, space permitting, the abstract) to report the precise experimental details: number of independent runs per condition, evaluation sample sizes, error bars (standard deviation across seeds), and results of statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing the original reported gaps to the budget-matched reproductions. These additions will allow readers to assess the magnitude and reliability of the observed shrinkage directly. revision: yes
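
The kind of reporting promised in this response can be sketched with the standard library alone. The sketch assumes one accuracy score per seed for each condition, with seeds paired across conditions. In place of the paired t-test or Wilcoxon signed-rank test named above, it uses an exact paired sign-flip permutation test, swapped in only so the example needs no external packages; the function names are illustrative.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 seeds)."""
    return mean(scores), stdev(scores)


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign-flip permutation test on paired per-seed differences.

    Under the null hypothesis of no difference, each paired difference is
    equally likely to be positive or negative, so all 2**n sign patterns
    are enumerated. Intended for small n only.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equal in length")
    if len(a) > 20:
        raise ValueError("exact enumeration is impractical for more than 20 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With only a handful of seeds the smallest attainable p-value is 2 / 2**n, which is itself a reason to report the number of runs alongside any test result.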

standing simulated objections not resolved
  • A full demonstration that the three confounds are dominant across the broader RLVR literature (rather than specific to the selected subset) would require an exhaustive meta-analysis that exceeds the scope of this position paper.

Circularity Check

0 steps flagged

No circularity: position paper relies on external critiques and controlled reproductions

full rationale

The paper advances a position on measurement gaps in RLVR by identifying three confounds (budget mismatch, attempt inflation, data contamination) and supporting the claim that gaps shrink under matched conditions via budget-matched reproductions and partial-prompt probes. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; claims rest on described experimental controls and comparisons to external literature rather than reducing to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for the central argument, which remains independently falsifiable through the proposed minimum standards.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests primarily on domain assumptions about proper evaluation practices rather than new mathematical constructs or fitted parameters; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Verifiable rewards in RLVR should be interpreted as evidence of reasoning only after controlling for budget, calibration, and contamination confounds.
    This premise is invoked to distinguish genuine capability gains from measurement artifacts in the central argument.

pith-pipeline@v0.9.0 · 5831 in / 1320 out tokens · 43648 ms · 2026-05-18T14:20:09.771666+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Ackerman and Nina Panickssery

    Christopher M. Ackerman and Nina Panickssery. Mitigating many-shot jailbreaking.arXiv preprint arXiv:2504.09604,

  2. [2]

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effec- tiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134,

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown et al. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,

  4. [4]

    Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang

    Notion Blog. Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning.arXiv preprint arXiv:2505.18116, 2025a. Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for u...

  5. [5]

    Reinforcement learning for reasoning in small llms: What works and what doesn’t.arXiv preprint arXiv:2503.16219,

    Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t.arXiv preprint arXiv:2503.16219,

  6. [6]

    Thinkless: Llm learns when to think.arXiv preprint arXiv:2505.13379,

    Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: Llm learns when to think.arXiv preprint arXiv:2505.13379,

  7. [7]

    Scaling reasoning, losing control: Evaluating instruction following in large reasoning models.arXiv preprint arXiv:2505.14810,

    Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, and Yu Cheng. Scaling reasoning, losing control: Evaluating instruction following in large reasoning models.arXiv preprint arXiv:2505.14810,

  8. [8]

    Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cogni- tive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307,

  9. [9]

    Daya Guo, Dejian Yang, Haowei Zhang, and Junxiao Song

    URLhttps://arxiv.org/abs/ 2506.15674. Daya Guo, Dejian Yang, Haowei Zhang, and Junxiao Song. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.07570,

  10. [10]

    Andre He, Daniel Fried, and Sean Welleck

    URL https://arxiv.org/abs/2501.07570. Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening.arXiv preprint arXiv:2506.02355, 2025a. Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging...

  11. [11]

    Safety tax: Safety alignment makes your large reasoning models less reasonable.arXiv preprint arXiv:2503.00555,

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. Safety tax: Safety alignment makes your large reasoning models less reasonable.arXiv preprint arXiv:2503.00555,

  12. [12]

    Safechain: Safety of language models with long chain-of-thought reasoning capabilities.arXiv preprint arXiv:2502.12025,

    Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities.arXiv preprint arXiv:2502.12025,

  13. [13]

    Polina Kirichenko, Shauli Ravfogel, Roee Aharoni, and Yonatan Belinkov

    Accessed: 2025-02-26. Polina Kirichenko, Shauli Ravfogel, Roee Aharoni, and Yonatan Belinkov. Abstentionbench: Rea- soning LLMs fail on unanswerable questions.arXiv preprint arXiv:2506.09038,

  14. [14]

    Hashimoto

    Yuxin Leng, Yuchen Jiang, Abhijit Sinha, Ivan Evtimov, Oskar Kr ´asn´y, Chongli Zhang, and Tat- sunori B. Hashimoto. Taming overconfidence in llms: Reward calibration in rlhf.arXiv preprint arXiv:2410.09724,

  15. [15]

    The hallucination dilemma: Factuality-aware reinforcement learning for large reasoning models.arXiv preprint arXiv:2505.24630,

    Junyi Li and Hwee Tou Ng. The hallucination dilemma: Factuality-aware reinforcement learning for large reasoning models.arXiv preprint arXiv:2505.24630,

  16. [16]

    When thinking fails: The pitfalls of reasoning for instruction- following in llms.arXiv preprint arXiv:2505.11423,

    Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, and Anurag Beniwal. When thinking fails: The pitfalls of reasoning for instruction- following in llms.arXiv preprint arXiv:2505.11423,

  17. [17]

    Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen

    Accessed 2025- 09-09. Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning?arXiv preprint arXiv:2412.13147, 2024a. Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning e...

  18. [18]

    Reft: Reasoning with reinforced fine-tuning

    Notion Blog. Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Rea- soning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967, 3,

  19. [19]

    Learning what reinforcement learning can’t: In- terleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527,

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, et al. Learning what reinforcement learning can’t: In- terleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527,

  20. [20]

    Reasoning about uncertainty: Do reasoning models know when they don’t know?arXiv preprint arXiv:2506.18183,

    Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. Reasoning about uncertainty: Do reasoning models know when they don’t know?arXiv preprint arXiv:2506.18183,

  21. [21]

    An empirical study on eliciting and improving r1-like reasoning: A third technical report on slow thinking with llms.arXiv preprint arXiv:2503.04548,

    12 Preprint, Under Review Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning: A third technical report on slow thinking with llms.arXiv preprint arXiv:2503.04548,

  22. [22]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models.arXiv preprint arXiv:2410.05229,

  23. [23]

    Dissecting long reasoning models: An empirical study.arXiv preprint arXiv:2506.04913,

    Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, and Jingbo Zhu. Dissecting long reasoning models: An empirical study.arXiv preprint arXiv:2506.04913,

  24. [24]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand `es, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,

  25. [25]

    HKU NLP and collaborators

    URLhttps: //arxiv.org/abs/2506.13923. HKU NLP and collaborators. Polaris-7b-preview — hugging face model card.https:// huggingface.co/HKUNLP/polaris-7b-preview,

  26. [26]

    C Opus and A Lawsen

    Accessed 2025-09-09. C Opus and A Lawsen. The illusion of the illusion of thinking: A comment on shojaee et al.(2025). arXiv preprint arXiv:2506.09250,

  27. [28]

    Generalizing Verifiable Instruction Following

    URLhttps://arxiv.org/abs/2507.02833. Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, and Ningyu Zhang. Knowrl: Exploring knowledgeable reinforcement learning for factuality.arXiv preprint arXiv:2506.19807,

  28. [29]

    Shah et al

    Darsh J. Shah et al. Rethinking reflection in pre-training.arXiv preprint arXiv:2504.04022,

  29. [30]

    Restoring calibration for aligned large language models.arXiv preprint arXiv:2502.13018,

    Xiaoyu Shen, Zenan Liu, Xuhui Cao, Jingjing Cao, Xu Fei, Fang Zheng, Yiqun Weng, and Chen Liang. Restoring calibration for aligned large language models.arXiv preprint arXiv:2502.13018,

  30. [31]

    The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.arXiv preprint arXiv:2506.06941,

  31. [32]

    Linxin Song, Taiwei Shi, and Jieyu Zhao

    URLhttps://arxiv.org/abs/2506.15522. Linxin Song, Taiwei Shi, and Jieyu Zhao. The hallucination tax of reinforcement finetuning.arXiv preprint arXiv:2505.13988,

  32. [33]

    Evaluation is all you need: Strategic overclaiming of llm reasoning capabilities through evaluation design.arXiv preprint arXiv:2506.04734, 2025a

    Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Linglin Zhang, Sai-er Hu, Yuhan Wu, and Xiangzheng Zhang. Evaluation is all you need: Strategic overclaiming of llm reasoning capabilities through evaluation design.arXiv preprint arXiv:2506.04734, 2025a. Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dzir...

  33. [34]

    Judgebench: A benchmark for evaluating LLM-as-a-judge.arXiv preprint arXiv:2407.11969, 2024a

    Di Wang, Shiwei Zhou, Haichao Zhan, Ming Zhu, Chongyang Zhang, Zhiyuan Zhang, Ming Sun, Li Xu, and Yisen Wang. Judgebench: A benchmark for evaluating LLM-as-a-judge.arXiv preprint arXiv:2407.11969, 2024a. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: Hi...

  34. [35]

    Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

    Yixuan Even Xu, Yash Savani, Fei Fang, and Zico Kolter. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning.arXiv preprint arXiv:2504.13818,

  35. [36]

    Ziyi Yang, Yichi Zhang, and Ming et al. Li. Slow thinking with llms 3: Acereason.arXiv preprint arXiv:2502.01820,

  36. [37]

    Are reasoning models more prone to hallucination?arXiv preprint arXiv:2505.23646,

    Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. Are reasoning models more prone to hallucination?arXiv preprint arXiv:2505.23646,

  37. [38]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

  38. [40]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    URLhttps://arxiv.org/abs/2504.13837. Qingcheng Zeng, Weihao Xuan, Leyang Cui, and Rob V oigt. Thinking out loud: Do reasoning models know when they’re right?arXiv preprint arXiv:2504.06564,

  39. [41]

    Self-play with variational problem synthesis improves llm reasoning.arXiv preprint arXiv:2508.14029, 2025a

    Yue Zhang, Longhui Wei, Zihui Wu, Guangtao Zeng, Juncheng Li, Weiguo Gong, Ziying Dai, Guodong Long, Daniel Gu, Moses Charikar, Siyuan Qi, Chi Jin, and Zhao Song. Self-play with variational problem synthesis improves llm reasoning.arXiv preprint arXiv:2508.14029, 2025a. 14 Preprint, Under Review Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Y...

  40. [42]

    The hidden risks of large reasoning models: A safety assessment of r1.arXiv preprint arXiv:2502.12659,

    Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, and Xin Eric Wang. The hidden risks of large reasoning models: A safety assessment of r1.arXiv preprint arXiv:2502.12659,

  41. [43]

    The surprising effectiveness of negative reinforcement in llm reasoning.arXiv preprint arXiv:2506.01347,

    Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning.arXiv preprint arXiv:2506.01347,

  42. [44]

    it’s just data

    15 Preprint, Under Review A APPENDIX Conventions and metrics.pass@k= probability at least one ofksamples is correct;avg@k= meanpass@1acrosskdraws;maj@k= majority vote overk;ECE= expected calibration error. Arrows (↑/↓) indicate whether higher/lower is better. Usage of Large Language Models.We used an LLM assistant as a productivity tool for light editing ...

  43. [45]

    ACC (X%)

    Table 6: Accuracy of QWEN3 checkpoints on five mathematics benchmarks when each model re- ceives only the firstx%of the question (x= 80,60,40) and must greedily complete the remainder. Columns “ACC (X%)” report the average accuracy at that prefix length.Takeaway:QWEN3 vari- ants achieve high ACC@80 on legacy MATH-500/AMC-23 but collapse on AIME-2025, cons...