pith. machine review for the scientific record.

arxiv: 2605.10158 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Unsupervised Process Reward Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords process reward models · unsupervised learning · LLM reasoning · error detection · reinforcement learning · test-time scaling · next-token probability

The pith

Unsupervised process reward models trained only on next-token probabilities identify first reasoning errors without labels or supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training process reward models for language-model reasoning without step-by-step annotations or final-answer verification. It does this by defining a scoring function from an LLM's next-token probabilities that jointly ranks candidate positions for the first error across batches of trajectories. If the method works, fine-grained reward signals become available at the scale of raw model outputs rather than human labeling budgets. A sympathetic reader would care because current supervised PRMs are limited by annotation cost, which restricts their use in improving complex reasoning.

Core claim

A scoring function derived from LLM next-token probabilities can train effective process reward models without any human supervision or ground-truth labels by jointly assessing candidate first-error positions across batches of reasoning trajectories; the resulting uPRM identifies errors more accurately than LLM-as-Judge, performs comparably to supervised PRMs as a verifier, and supplies more robust reward signals during reinforcement learning.

What carries the argument

A batch-wise scoring function based on LLM next-token probabilities that ranks candidate positions for the first erroneous step in reasoning trajectories.
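As a sketch only, the scoring idea can be illustrated with toy per-step probabilities. Everything here is an illustrative assumption rather than the paper's Eq. (3): the function name, the mean-log-probability inputs, the prefix-times-error objective, and the simplification of the method's joint across-batch assessment down to independent per-trajectory scoring.

```python
import math

def first_error_position(step_logprobs_batch):
    """Illustrative first-error scorer (not the paper's Eq. (3)).

    step_logprobs_batch[i][k] is assumed to be the LLM's average next-token
    log-probability for step k of trajectory i. Each candidate position j is
    scored as log P(prefix correct) + log P(step j is an error): steps before
    the first error should be high-probability, the error step itself low.
    The paper's batch-joint assessment is simplified here to per-trajectory
    scoring.
    """
    positions = []
    for steps in step_logprobs_batch:
        probs = [math.exp(lp) for lp in steps]
        best_j, best_score = 0, float("-inf")
        for j, p in enumerate(probs):
            prefix = sum(math.log(q) for q in probs[:j])  # log-prob of correct prefix
            score = prefix + math.log(1.0 - p + 1e-9)     # error step is the unlikely one
            if score > best_score:
                best_j, best_score = j, score
        positions.append(best_j)
    return positions
```

On a trajectory whose third step has a probability dip (0.9, 0.9, 0.2, 0.8), this toy scorer selects index 2; on one that starts with an unlikely step it selects index 0.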

If this is right

  • uPRM identifies first erroneous steps up to 15% more accurately than LLM-as-Judge on ProcessBench.
  • As a verifier for test-time scaling, uPRM matches supervised PRMs and exceeds majority voting by up to 6.9%.
  • When used as an RL reward signal, uPRM produces more stable policy optimization than a supervised PRM that uses ground-truth labels.
  • Reward modeling for complex reasoning becomes feasible at the scale of unlabeled trajectory data.
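The verifier comparison in the second bullet contrasts two selection rules over N sampled solutions. A minimal sketch of both (the min-over-steps aggregation of PRM step scores into one trajectory score is an assumed, common choice, not necessarily the paper's):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among N sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, step_scores):
    """Pick the answer whose solution a PRM-style verifier rates highest,
    using the minimum step score as the trajectory score (an illustrative
    aggregation choice)."""
    trajectory_scores = [min(scores) for scores in step_scores]
    best = max(range(len(answers)), key=trajectory_scores.__getitem__)
    return answers[best]
```

The two rules can disagree: a verifier can select a minority answer whose every step looks sound over a majority answer containing one weak step, which is exactly where a PRM-based verifier can beat voting.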

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same next-token scoring idea could be adapted to supply unsupervised step-level feedback in non-reasoning domains.
  • Combining uPRM scores with simple majority voting might produce stronger verifiers than either alone.
  • Testing whether uPRM remains effective when the base LLM is replaced by a different model would clarify how general the probability signal is.

Load-bearing premise

LLM next-token probabilities contain enough information to locate the first error in a reasoning trajectory without any labeled data or external verification.

What would settle it

A collection of reasoning trajectories in which the first logical error is invisible to next-token probability drops, so that uPRM accuracy falls to random while human judges still succeed.
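Scoring such a collection would use the standard first-error metric: exact match between predicted and annotated first-error step indices. A minimal sketch (ProcessBench's exact protocol may differ; the -1 convention for error-free trajectories is an assumption here):

```python
def first_error_accuracy(predictions, labels):
    """Fraction of trajectories whose predicted first-error step index
    matches the annotation; -1 conventionally marks error-free trajectories."""
    if len(predictions) != len(labels):
        raise ValueError("prediction/label length mismatch")
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)
```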

Figures

Figures reproduced from arXiv: 2605.10158 by Artyom Gadetsky, Hang Guo, Maria Brbic, Maxim Kodryan, Siba Smarak Panigrahi.

Figure 1
Figure 1: Accuracy of LLMs across different scales on MATH-500, MinervaMath, and Olympiad. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]

original abstract

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes unsupervised Process Reward Models (uPRM) that define a scoring function from LLM next-token probabilities to jointly assess candidate first-error positions across batches of reasoning trajectories, enabling training without step annotations or final-answer ground truth. It reports up to 15% absolute accuracy gains over LLM-as-a-Judge on ProcessBench for identifying first erroneous steps, comparable performance to supervised PRMs (outperforming majority voting by up to 6.9%) as a test-time verifier, and more robust policy optimization in RL than supervised PRMs using ground-truth labels.

Significance. If the batch-based scoring function produces reliable pseudo-labels for error localization, this would be a significant advance by removing the costly annotation requirement for process supervision, opening scalable paths for reward modeling in complex reasoning. The multi-scenario empirical evaluation (error detection, test-time scaling, RL) provides broad support, and the absence of free parameters or invented entities in the core construction is a strength.

major comments (3)
  1. [§3.2, Eq. (3)] The joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.
  2. [§4.1, Table 1] (ProcessBench results) The 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.
  3. [§5.3] (RL experiments) The claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.
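The length confound raised in major comment 1 is easy to exhibit: an unnormalized prefix log-probability score is maximized by the empty prefix, so some normalization is needed before the pseudo-labels can be trusted. A hedged sketch of the control (function names are illustrative, not from the manuscript):

```python
def raw_prefix_score(step_logprobs, j):
    """Sum of step log-probabilities before candidate error position j.
    Degenerate on its own: log-probs are negative, so shorter prefixes
    always score at least as high and j = 0 is trivially favored."""
    return sum(step_logprobs[:j])

def normalized_prefix_score(step_logprobs, j):
    """Length-normalized variant: mean step log-probability of the prefix,
    which removes the systematic preference for short prefixes."""
    return raw_prefix_score(step_logprobs, j) / max(j, 1)
```

With uniform step log-probabilities the raw score strictly decreases in j while the normalized score is constant, which is the kind of sensitivity check the comment asks for.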
minor comments (2)
  1. [§3.1] The notation for the scoring function S(b, k) is introduced without an explicit statement of its dependence on batch size, which could be clarified in §3.1 for readability.
  2. [Figure 2] Figure 2 caption does not specify the number of runs or error bars, making it difficult to assess the statistical significance of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3.2, Eq. (3)] the joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.

    Authors: We agree that a formal derivation and sensitivity analysis of the scoring function would improve rigor. The product-based objective in Eq. (3) follows from the single-error assumption and the observation that correct steps receive higher next-token probability mass; the batch joint maximization is intended to identify the position most consistent with this pattern across trajectories. In the revised manuscript we will add both a short derivation clarifying the objective and a sensitivity analysis (including controlled experiments that vary sequence length, token rarity, and batch composition) to quantify any residual correlations. These additions will be placed in §3.2 and the appendix. revision: yes

  2. Referee: [§4.1, Table 1] the 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.

    Authors: We acknowledge the value of explicit ablations on batch construction. The reported ProcessBench numbers were obtained by sampling batches from the benchmark’s native distribution of trajectories (which already contains a realistic mix of error rates and positions). Nevertheless, we agree that additional controls are needed to fully support the unsupervised claim. In the revision we will add an ablation subsection that varies (i) the fraction of erroneous trajectories per batch and (ii) the sampling strategy (uniform vs. error-stratified), reporting first-error detection accuracy under each regime. These results will be included in §4.1 and Table 1 will be expanded accordingly. revision: yes

  3. Referee: [§5.3] the claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.

    Authors: We thank the referee for highlighting the need for clearer explanatory analysis. The robustness observation is currently supported by final performance metrics and qualitative stability notes. To address the gap we will insert training-curve plots (reward and policy performance vs. RL steps) and an error-mode breakdown (categorizing failure types such as over-penalization of correct steps versus under-detection of errors) for both uPRM and the supervised baseline. These additions will appear in §5.3 and the appendix, allowing readers to assess whether the advantage is attributable to the unsupervised scoring function or to other experimental factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation of uPRM

full rationale

The paper defines a scoring function from LLM next-token probabilities to jointly score candidate first-error positions across batches of trajectories, then uses the resulting pseudo-labels to train uPRM without human step annotations or final-answer ground truth. This is a self-contained construction that does not reduce any claimed result (e.g., 15% accuracy gain on ProcessBench or RL improvements) to its inputs by definition. No equations or steps are shown to be equivalent by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. Performance claims rest on empirical evaluation against external benchmarks rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are detailed in the abstract; the method relies on existing LLM probabilities.

pith-pipeline@v0.9.0 · 5541 in / 1224 out tokens · 52957 ms · 2026-05-12T03:14:50.787716+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 10 internal anchors

  1. [1]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, et al. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv preprint arXiv:2503.09567, 2025

  2. [2]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, et al. From System 1 to System 2: A Survey of Reasoning Large Language Models. arXiv preprint arXiv:2502.17419, 2025

  3. [3]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Neural Information Processing Systems, 2022

  4. [4]

    DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature, 2025

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training Verifiers to Solve Math Word Problems. OpenAI Technical Report, 2021

  6. [6]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  7. [7]

    Solving Math Word Problems with Process- and Outcome-based Feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, et al. Solving Math Word Problems with Process- and Outcome-based Feedback. arXiv preprint arXiv:2211.14275, 2022

  8. [8]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, et al. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

  9. [9]

    Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. In International Conference on Learning Representations, 2025

  10. [10]

    Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

    Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, et al. Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning. In Neural Information Processing Systems, 2025

  11. [11]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, et al. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. In Association for Computational Linguistics, 2024

  12. [12]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, et al. Improve Mathematical Reasoning in Language Models by Automated Process Supervision. arXiv preprint arXiv:2406.06592, 2024

  13. [13]

    Free Process Rewards without Process Labels

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, et al. Free Process Rewards without Process Labels. In International Conference on Machine Learning, 2025

  14. [14]

    Process Reinforcement Through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, et al. Process Reinforcement Through Implicit Rewards. arXiv preprint arXiv:2502.01456, 2025

  15. [15]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, et al. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-level Bilingual Multimodal Scientific Problems. In Association for Computational Linguistics, 2024

  16. [16]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, et al. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models. In International Conference on Learning Representations, 2025

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, et al. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv preprint arXiv:2409.12122, 2024

  19. [19]

    AlphaMath Almost Zero: Process Supervision without Process

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath Almost Zero: Process Supervision without Process. Neural Information Processing Systems, 2024

  20. [20]

    URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

    Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, et al. URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. arXiv preprint arXiv:2501.04686, 2025

  21. [21]

    VinePPO: Refining Credit Assignment in RL Training of LLMs

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, et al. VinePPO: Refining Credit Assignment in RL Training of LLMs. In International Conference on Machine Learning, 2025

  22. [22]

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, et al. The Lessons of Developing Process Reward Models in Mathematical Reasoning. In Association for Computational Linguistics, 2025

  23. [23]

    AutoPSV: Automated Process-Supervised Verifier

    Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, et al. AutoPSV: Automated Process-Supervised Verifier. Neural Information Processing Systems, 2024

  24. [24]

    From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

    Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, and Huawei Shen. From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment. In Association for Computational Linguistics, 2025

  25. [25]

    FreePRM: Training Process Reward Models Without Ground Truth Process Labels

    Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, and Ning Wu. FreePRM: Training Process Reward Models Without Ground Truth Process Labels. arXiv preprint arXiv:2506.03570, 2025

  26. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Neural Information Processing Systems, 2023

  27. [27]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. In Conference on Language Modeling, 2024

  28. [28]

    GPTScore: Evaluate as You Desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as You Desire. In Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  29. [29]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Conference on Empirical Methods in Natural Language Processing, 2023

  30. [30]

    Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, et al. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. arXiv preprint arXiv:2502.06703, 2025

  31. [31]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv preprint arXiv:2407.21787, 2024

  32. [32]

    Scaling Test-Time Compute with Open Models

    Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling Test-Time Compute with Open Models, 2024. URL https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  33. [33]

    GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

    Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, et al. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv preprint arXiv:2504.00891, 2025

  34. [34]

    Dynamic and Generalizable Process Reward Modeling

    Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, et al. Dynamic and Generalizable Process Reward Modeling. In Association for Computational Linguistics, 2025

  35. [35]

    PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary

    Jiarui Yao, Ruida Wang, and Tong Zhang. PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary. arXiv preprint arXiv:2601.10201, 2026

  36. [36]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, et al. Understanding R1-Zero-Like Training: A Critical Perspective. In Conference on Language Modeling, 2025

  37. [37]

    Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, et al. Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. In International Conference on Learning Representations, 2025

  38. [38]

    Scaling Laws for Reward Model Overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. In International Conference on Machine Learning, 2023

  39. [39]

    RULER: Relative Universal LLM-Elicited Rewards

    Kyle Corbitt, Saumya Gandhi, Angky William, Andie Jones, Brad Hilton, et al. RULER: Relative Universal LLM-Elicited Rewards. OpenPipe Blog, 2025

  40. [40]

    Batched Self-Consistency Improves LLM Relevance Assessment and Ranking

    Anton Korikov, Pan Du, Scott Sanner, and Navid Rekabsaz. Batched Self-Consistency Improves LLM Relevance Assessment and Ranking. In Conference on Empirical Methods in Natural Language Processing, 2025

  41. [41]

    Large (Vision) Language Models are Unsupervised In-Context Learners

    Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, et al. Large (Vision) Language Models are Unsupervised In-Context Learners. In International Conference on Learning Representations, 2025

  42. [42]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, et al. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022

  43. [43]

    STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning. Neural Information Processing Systems, 2022

  44. [44]

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, et al. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Transactions on Machine Learning Research, 2024

  45. [45]

    Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy

    Brian Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010

  46. [46]

    Actor-Critic Algorithms

    Vijay Konda and John Tsitsiklis. Actor-Critic Algorithms. In Neural Information Processing Systems, 1999

  47. [47]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024

  48. [48]

    ProcessBench: Identifying Process Errors in Mathematical Reasoning

    Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, et al. ProcessBench: Identifying Process Errors in Mathematical Reasoning. In Association for Computational Linguistics, 2025

  49. [49]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, et al. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems, 2021

  50. [50]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, et al. Solving Quantitative Reasoning Problems with Language Models. Neural Information Processing Systems, 2022

  51. [51]

    RLHF Workflow: From Reward Modeling to Online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, et al. RLHF Workflow: From Reward Modeling to Online RLHF. Transactions on Machine Learning Research, 2024

  52. [52]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Riu Yan, Chaojie Wang, et al. Skywork Open Reasoner 1 Technical Report. arXiv preprint arXiv:2505.22312, 2025

  53. [53]

    Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Association for Computational Linguistics, 2024

  54. [54]

    What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, et al. What Makes a Reward Model a Good Teacher? An Optimization Perspective. In Neural Information Processing Systems, 2025

  55. [55]

    FlexAttention: A Programming Model for Generating Fused Attention Variants

    Boyuan Dong, Juechu Feng, Driss Guessous, Yanbo Liang, and Horace He. FlexAttention: A Programming Model for Generating Fused Attention Variants. In Conference on Machine Learning and Systems, 2025

  56. [56]

    Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning

    Ronald J. Williams. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992

  57. [57]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019

  58. [58]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, et al. HybridFlow: A Flexible and Efficient RLHF Framework. In European Conference on Computer Systems, 2025

  59. [59]

    No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-context

    Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Tan Yi-Chern, Marek Rei, et al. No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-context. Conference on Empirical Methods in Natural Language Processing, 2025

  60. [60]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, et al. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. In International Conference on Machine Learning, 2025

  61. [61]

    RL’s Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s Razor: Why Online Reinforcement Learning Forgets Less. arXiv preprint arXiv:2509.04259, 2025
