Recognition: no theorem link
Unsupervised Process Reward Models
Pith reviewed 2026-05-12 03:14 UTC · model grok-4.3
The pith
Unsupervised process reward models trained only on next-token probabilities identify first reasoning errors without labels or supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A scoring function derived from LLM next-token probabilities can train effective process reward models without any human supervision or ground-truth labels by jointly assessing candidate first-error positions across batches of reasoning trajectories; the resulting uPRM identifies errors more accurately than LLM-as-a-Judge, performs comparably to supervised PRMs as a verifier, and supplies more robust reward signals during reinforcement learning.
What carries the argument
A batch-wise scoring function based on LLM next-token probabilities that ranks candidate positions for the first erroneous step in reasoning trajectories.
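To make the batch-wise idea concrete, here is a minimal toy sketch of how such a score could be computed. It is an illustration under stated assumptions, not the paper's Eq. (3): it presumes per-step next-token log-probabilities have already been extracted from the base LLM, and the function name and input format (`step_logprobs`) are hypothetical.

```python
import math
from typing import List

def locate_first_errors(step_logprobs: List[List[float]]) -> List[int]:
    """Toy batch-wise first-error localization (illustrative, not the paper's Eq. (3)).

    step_logprobs[i][k] is the mean next-token log-probability the base LLM
    assigns to the tokens of step k in trajectory i. For each trajectory we
    score every candidate position k for the hypothesis "the first error is at
    step k": steps before k should look typical relative to the batch, while
    step k itself should look unusually unlikely.
    """
    max_len = max(len(t) for t in step_logprobs)

    # Batch-average log-probability at each step index (shorter trajectories skipped).
    batch_mean = []
    for k in range(max_len):
        vals = [t[k] for t in step_logprobs if len(t) > k]
        batch_mean.append(sum(vals) / len(vals))

    predictions = []
    for traj in step_logprobs:
        best_k, best_score = 0, -math.inf
        for k in range(len(traj)):
            # Reward agreement with the batch before position k ...
            prefix_fit = -sum((traj[j] - batch_mean[j]) ** 2 for j in range(k))
            # ... plus a drop in likelihood at position k itself.
            anomaly = batch_mean[k] - traj[k]
            score = prefix_fit + anomaly
            if score > best_score:
                best_k, best_score = k, score
        predictions.append(best_k)
    return predictions
```

For example, `locate_first_errors([[-0.2, -0.3, -2.1], [-0.1, -1.8, -0.4]])` returns `[2, 1]`, flagging the step whose likelihood drops relative to the batch. The actual method trains a PRM on pseudo-labels obtained by jointly scoring such candidate configurations, rather than using raw per-trajectory scores directly.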
If this is right
- uPRM identifies first erroneous steps with up to 15% higher absolute accuracy than LLM-as-a-Judge on ProcessBench.
- As a verifier for test-time scaling, uPRM matches supervised PRMs and exceeds majority voting by up to 6.9%.
- When used as an RL reward signal, uPRM produces more stable policy optimization than a supervised PRM that uses ground-truth labels.
- Reward modeling for complex reasoning becomes feasible at the scale of unlabeled trajectory data.
Where Pith is reading between the lines
- The same next-token scoring idea could be adapted to supply unsupervised step-level feedback in non-reasoning domains.
- Combining uPRM scores with simple majority voting might produce stronger verifiers than either alone.
- Testing whether uPRM remains effective when the base LLM is replaced by a different model would clarify how general the probability signal is.
Load-bearing premise
LLM next-token probabilities contain enough information to locate the first error in a reasoning trajectory without any labeled data or external verification.
What would settle it
A collection of reasoning trajectories in which the first logical error is invisible to next-token probability drops, so that uPRM accuracy falls to random while human judges still succeed.
Original abstract
Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes unsupervised Process Reward Models (uPRM) that define a scoring function from LLM next-token probabilities to jointly assess candidate first-error positions across batches of reasoning trajectories, enabling training without step annotations or final-answer ground truth. It reports up to 15% absolute accuracy gains over LLM-as-a-Judge on ProcessBench for identifying first erroneous steps, comparable performance to supervised PRMs (outperforming majority voting by up to 6.9%) as a test-time verifier, and more robust policy optimization in RL than supervised PRMs using ground-truth labels.
Significance. If the batch-based scoring function produces reliable pseudo-labels for error localization, this would be a significant advance by removing the costly annotation requirement for process supervision, opening scalable paths for reward modeling in complex reasoning. The multi-scenario empirical evaluation (error detection, test-time scaling, RL) provides broad support, and the absence of free parameters or invented entities in the core construction is a strength.
Major comments (3)
- [§3.2, Eq. (3)] The joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.
- [§4.1, Table 1] (ProcessBench results) The 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.
- [§5.3] (RL experiments) The claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.
Minor comments (2)
- [§3.1] The notation for the scoring function S(b, k) is introduced without an explicit statement of its dependence on batch size, which could be clarified in §3.1 for readability.
- [Figure 2] Figure 2 caption does not specify the number of runs or error bars, making it difficult to assess the statistical significance of the reported improvements.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed report. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3.2, Eq. (3)] the joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.
Authors: We agree that a formal derivation and sensitivity analysis of the scoring function would improve rigor. The product-based objective in Eq. (3) follows from the single-error assumption and the observation that correct steps receive higher next-token probability mass; the batch joint maximization is intended to identify the position most consistent with this pattern across trajectories. In the revised manuscript we will add both a short derivation clarifying the objective and a sensitivity analysis (including controlled experiments that vary sequence length, token rarity, and batch composition) to quantify any residual correlations. These additions will be placed in §3.2 and the appendix. revision: yes
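For readers without the manuscript at hand, a product-form objective of the kind this response describes could look as follows; this is an illustrative reconstruction under the single-error assumption, with the notation (j_i, p_theta) assumed here rather than taken from the paper's Eq. (3).

```latex
% Illustrative reconstruction, not the paper's Eq. (3).
% j_i: candidate first-error position in trajectory i of a batch of N;
% p_\theta(s_k^{(i)}): a step-correctness score derived from the base LLM's
% next-token probabilities for step k of trajectory i.
(j_1^{\ast}, \ldots, j_N^{\ast})
  \;=\; \arg\max_{j_1, \ldots, j_N}\;
  \prod_{i=1}^{N}
  \Bigl[\, \prod_{k < j_i} p_\theta\bigl(s_k^{(i)}\bigr) \Bigr]
  \bigl(1 - p_\theta\bigl(s_{j_i}^{(i)}\bigr)\bigr)
```

Written this way, the referee's length and token-rarity concern is visible directly: trajectories with more pre-error steps multiply more factors, so the promised sensitivity analysis would need to show that length alone does not dominate the argmax.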
-
Referee: [§4.1, Table 1] the 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.
Authors: We acknowledge the value of explicit ablations on batch construction. The reported ProcessBench numbers were obtained by sampling batches from the benchmark’s native distribution of trajectories (which already contains a realistic mix of error rates and positions). Nevertheless, we agree that additional controls are needed to fully support the unsupervised claim. In the revision we will add an ablation subsection that varies (i) the fraction of erroneous trajectories per batch and (ii) the sampling strategy (uniform vs. error-stratified), reporting first-error detection accuracy under each regime. These results will be included in §4.1 and Table 1 will be expanded accordingly. revision: yes
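A minimal sketch of the two sampling regimes this response proposes to ablate is given below; the function and its inputs are hypothetical, and the `has_error` flags are used only to control batch composition for the ablation, never as a training signal.

```python
import random
from typing import List, Optional, Tuple

# (reasoning text, has_error flag used only to control the ablation)
Trajectory = Tuple[str, bool]

def sample_batch(pool: List[Trajectory], batch_size: int,
                 error_fraction: Optional[float] = None,
                 seed: int = 0) -> List[Trajectory]:
    """Draw one batch either uniformly (native distribution) or error-stratified.

    error_fraction=None reproduces uniform sampling from the pool's native mix
    of correct and erroneous trajectories; a value in [0, 1] fixes the fraction
    of erroneous trajectories per batch, the controlled regime of the ablation.
    """
    rng = random.Random(seed)
    if error_fraction is None:
        return rng.sample(pool, batch_size)
    erroneous = [t for t in pool if t[1]]
    correct = [t for t in pool if not t[1]]
    n_err = round(batch_size * error_fraction)
    batch = rng.sample(erroneous, n_err) + rng.sample(correct, batch_size - n_err)
    rng.shuffle(batch)
    return batch
```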
-
Referee: [§5.3] the claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.
Authors: We thank the referee for highlighting the need for clearer explanatory analysis. The robustness observation is currently supported by final performance metrics and qualitative stability notes. To address the gap we will insert training-curve plots (reward and policy performance vs. RL steps) and an error-mode breakdown (categorizing failure types such as over-penalization of correct steps versus under-detection of errors) for both uPRM and the supervised baseline. These additions will appear in §5.3 and the appendix, allowing readers to assess whether the advantage is attributable to the unsupervised scoring function or to other experimental factors. revision: yes
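To illustrate why the shape of the reward signal matters for robustness, the sketch below shows two common ways of collapsing per-step PRM scores into a scalar RL reward; the min-style variant follows the spirit of min-form credit assignment (Cheng et al. [10]), but whether the paper's RL runs use this aggregation is an assumption here, not a statement of its setup.

```python
from typing import List

def trajectory_reward(step_scores: List[float], how: str = "min") -> float:
    """Collapse per-step PRM scores into one scalar reward for an RL update.

    "min" penalizes a response as soon as any step scores poorly (harder to
    hack by padding with easy correct steps); "mean" dilutes isolated bad
    steps. The aggregation choice is one candidate explanation for differences
    in optimization robustness between reward models.
    """
    if not step_scores:
        raise ValueError("empty trajectory")
    if how == "min":
        return min(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregation: {how}")
```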
Circularity Check
No significant circularity in the derivation of uPRM
Full rationale
The paper defines a scoring function from LLM next-token probabilities to jointly score candidate first-error positions across batches of trajectories, then uses the resulting pseudo-labels to train uPRM without human step annotations or final-answer ground truth. This is a self-contained construction that does not reduce any claimed result (e.g., 15% accuracy gain on ProcessBench or RL improvements) to its inputs by definition. No equations or steps are shown to be equivalent by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. Performance claims rest on empirical evaluation against external benchmarks rather than tautology.
Reference graph
Works this paper leans on
- [1] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, et al. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv preprint arXiv:2503.09567, 2025.
- [2] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, et al. From System 1 to System 2: A Survey of Reasoning Large Language Models. arXiv preprint arXiv:2502.17419, 2025.
- [3] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022.
- [4] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature, 2025.
- [5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training Verifiers to Solve Math Word Problems. OpenAI Technical Report, 2021.
- [6] Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
- [7] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, et al. Solving Math Word Problems with Process- and Outcome-based Feedback. arXiv preprint arXiv:2211.14275, 2022.
- [8] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, et al. Let's Verify Step by Step. In International Conference on Learning Representations, 2024.
- [9] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. In International Conference on Learning Representations, 2025.
- [10] Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, et al. Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning. In Neural Information Processing Systems, 2025.
- [11] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, et al. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. In Association for Computational Linguistics, 2024.
- [12] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, et al. Improve Mathematical Reasoning in Language Models by Automated Process Supervision. arXiv preprint arXiv:2406.06592, 2024.
- [13] Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, et al. Free Process Rewards without Process Labels. In International Conference on Machine Learning, 2025.
- [14] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, et al. Process Reinforcement Through Implicit Rewards. arXiv preprint arXiv:2502.01456, 2025.
- [15] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, et al. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-level Bilingual Multimodal Scientific Problems. In Association for Computational Linguistics, 2024.
- [16] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, et al. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models. In International Conference on Learning Representations, 2025.
- [17] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
- [18] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, et al. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv preprint arXiv:2409.12122, 2024.
- [19] Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath Almost Zero: Process Supervision without Process. Neural Information Processing Systems, 2024.
- [20] Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, et al. URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. arXiv preprint arXiv:2501.04686, 2025.
- [21] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, et al. VinePPO: Refining Credit Assignment in RL Training of LLMs. In International Conference on Machine Learning, 2025.
- [22] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, et al. The Lessons of Developing Process Reward Models in Mathematical Reasoning. In Association for Computational Linguistics, 2025.
- [23] Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, et al. AutoPSV: Automated Process-Supervised Verifier. Neural Information Processing Systems, 2024.
- [24] Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, and Huawei Shen. From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment. In Association for Computational Linguistics, 2025.
- [25] Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, and Ning Wu. FreePRM: Training Process Reward Models Without Ground Truth Process Labels. arXiv preprint arXiv:2506.03570, 2025.
- [26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Neural Information Processing Systems, 2023.
- [27]
- [28] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as You Desire. In Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
- [29] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Conference on Empirical Methods in Natural Language Processing, 2023.
- [30] Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, et al. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. arXiv preprint arXiv:2502.06703, 2025.
- [31] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv preprint arXiv:2407.21787, 2024.
- [32] Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling Test-Time Compute with Open Models, 2024. URL https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute.
- [33] Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, et al. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv preprint arXiv:2504.00891, 2025.
- [34] Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, et al. Dynamic and Generalizable Process Reward Modeling. In Association for Computational Linguistics, 2025.
- [35] Jiarui Yao, Ruida Wang, and Tong Zhang. PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary. arXiv preprint arXiv:2601.10201, 2026.
- [36] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, et al. Understanding R1-Zero-Like Training: A Critical Perspective. In Conference on Language Modeling, 2025.
- [37] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, et al. Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. In International Conference on Learning Representations, 2025.
- [38] Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. In International Conference on Machine Learning, 2023.
- [39] Kyle Corbitt, Saumya Gandhi, Angky William, Andie Jones, Brad Hilton, et al. RULER: Relative Universal LLM-Elicited Rewards. OpenPipe Blog, 2025.
- [40] Anton Korikov, Pan Du, Scott Sanner, and Navid Rekabsaz. Batched Self-Consistency Improves LLM Relevance Assessment and Ranking. In Conference on Empirical Methods in Natural Language Processing, 2025.
- [41] Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, et al. Large (Vision) Language Models are Unsupervised In-Context Learners. In International Conference on Learning Representations, 2025.
- [42] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, et al. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022.
- [43] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning. Neural Information Processing Systems, 2022.
- [44] Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, et al. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Transactions on Machine Learning Research, 2024.
- [45] Brian Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010.
- [46] Vijay Konda and John Tsitsiklis. Actor-Critic Algorithms. In Neural Information Processing Systems, 1999.
- [47] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024.
- [48] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, et al. ProcessBench: Identifying Process Errors in Mathematical Reasoning. In Association for Computational Linguistics, 2025.
- [49] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, et al. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems, 2021.
- [50] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, et al. Solving Quantitative Reasoning Problems with Language Models. Neural Information Processing Systems, 2022.
- [51] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, et al. RLHF Workflow: From Reward Modeling to Online RLHF. Transactions on Machine Learning Research, 2024.
- [52] Jujie He, Jiacai Liu, Chris Yuhao Liu, Riu Yan, Chaojie Wang, et al. Skywork Open Reasoner 1 Technical Report. arXiv preprint arXiv:2505.22312, 2025.
- [53] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Association for Computational Linguistics, 2024.
- [54] Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, et al. What Makes a Reward Model a Good Teacher? An Optimization Perspective. In Neural Information Processing Systems, 2025.
- [55] Boyuan Dong, Juechu Feng, Driss Guessous, Yanbo Liang, and Horace He. FlexAttention: A Programming Model for Generating Fused Attention Variants. In Conference on Machine Learning and Systems, 2025.
- [56]
- [57] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- [58] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, et al. HybridFlow: A Flexible and Efficient RLHF Framework. In European Conference on Computer Systems, 2025.
- [59] Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Tan Yi-Chern, Marek Rei, et al. No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-context. Conference on Empirical Methods in Natural Language Processing, 2025.
- [60] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, et al. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. In International Conference on Machine Learning, 2025.
- [61] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's Razor: Why Online Reinforcement Learning Forgets Less. arXiv preprint arXiv:2509.04259, 2025.