pith. machine review for the scientific record.

arxiv: 2605.10158 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Unsupervised Process Reward Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords process reward models · unsupervised learning · LLM reasoning · error detection · reinforcement learning · test-time scaling · next-token probability

The pith

Unsupervised process reward models trained only on next-token probabilities identify first reasoning errors without labels or supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training process reward models for language-model reasoning without step-by-step annotations or final-answer verification. It does this by defining a scoring function from an LLM's next-token probabilities that jointly ranks candidate positions for the first error across batches of trajectories. If the method works, fine-grained reward signals become available at the scale of raw model outputs rather than human labeling budgets. A sympathetic reader would care because current supervised PRMs are limited by annotation cost, which restricts their use in improving complex reasoning.

Core claim

A scoring function derived from LLM next-token probabilities can train effective process reward models without any human supervision or ground-truth labels by jointly assessing candidate first-error positions across batches of reasoning trajectories; the resulting uPRM identifies errors more accurately than LLM-as-Judge, performs comparably to supervised PRMs as a verifier, and supplies more robust reward signals during reinforcement learning.

What carries the argument

A batch-wise scoring function based on LLM next-token probabilities that ranks candidate positions for the first erroneous step in reasoning trajectories.
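As a sketch only, the scoring idea can be illustrated with toy per-step probabilities. Everything here is an illustrative assumption rather than the paper's Eq. (3): the function name, the mean-log-probability inputs, the prefix-times-error objective, and the simplification of the method's joint across-batch assessment down to independent per-trajectory scoring.

```python
import math

def first_error_position(step_logprobs_batch):
    """Illustrative first-error scorer (not the paper's Eq. (3)).

    step_logprobs_batch[i][k] is assumed to be the LLM's average next-token
    log-probability for step k of trajectory i. Each candidate position j is
    scored as log P(prefix correct) + log P(step j is an error): steps before
    the first error should be high-probability, the error step itself low.
    The paper's batch-joint assessment is simplified here to per-trajectory
    scoring.
    """
    positions = []
    for steps in step_logprobs_batch:
        probs = [math.exp(lp) for lp in steps]
        best_j, best_score = 0, float("-inf")
        for j, p in enumerate(probs):
            prefix = sum(math.log(q) for q in probs[:j])  # log-prob of correct prefix
            score = prefix + math.log(1.0 - p + 1e-9)     # error step is the unlikely one
            if score > best_score:
                best_j, best_score = j, score
        positions.append(best_j)
    return positions
```

On a trajectory whose third step has a probability dip (0.9, 0.9, 0.2, 0.8), this toy scorer selects index 2; on one that starts with an unlikely step it selects index 0.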

If this is right

  • uPRM identifies first erroneous steps up to 15% more accurately than LLM-as-Judge on ProcessBench.
  • As a verifier for test-time scaling, uPRM matches supervised PRMs and exceeds majority voting by up to 6.9%.
  • When used as an RL reward signal, uPRM produces more stable policy optimization than a supervised PRM that uses ground-truth labels.
  • Reward modeling for complex reasoning becomes feasible at the scale of unlabeled trajectory data.
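The verifier comparison in the second bullet contrasts two selection rules over N sampled solutions. A minimal sketch of both (the min-over-steps aggregation of PRM step scores into one trajectory score is an assumed, common choice, not necessarily the paper's):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among N sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, step_scores):
    """Pick the answer whose solution a PRM-style verifier rates highest,
    using the minimum step score as the trajectory score (an illustrative
    aggregation choice)."""
    trajectory_scores = [min(scores) for scores in step_scores]
    best = max(range(len(answers)), key=trajectory_scores.__getitem__)
    return answers[best]
```

The two rules can disagree: a verifier can select a minority answer whose every step looks sound over a majority answer containing one weak step, which is exactly where a PRM-based verifier can beat voting.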

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same next-token scoring idea could be adapted to supply unsupervised step-level feedback in non-reasoning domains.
  • Combining uPRM scores with simple majority voting might produce stronger verifiers than either alone.
  • Testing whether uPRM remains effective when the base LLM is replaced by a different model would clarify how general the probability signal is.

Load-bearing premise

LLM next-token probabilities contain enough information to locate the first error in a reasoning trajectory without any labeled data or external verification.

What would settle it

A collection of reasoning trajectories in which the first logical error is invisible to next-token probability drops, so that uPRM accuracy falls to random while human judges still succeed.
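Scoring such a collection would use the standard first-error metric: exact match between predicted and annotated first-error step indices. A minimal sketch (ProcessBench's exact protocol may differ; the -1 convention for error-free trajectories is an assumption here):

```python
def first_error_accuracy(predictions, labels):
    """Fraction of trajectories whose predicted first-error step index
    matches the annotation; -1 conventionally marks error-free trajectories."""
    if len(predictions) != len(labels):
        raise ValueError("prediction/label length mismatch")
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)
```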

Figures

Figures reproduced from arXiv: 2605.10158 by Artyom Gadetsky, Hang Guo, Maria Brbic, Maxim Kodryan, Siba Smarak Panigrahi.

Figure 1
Figure 1: Accuracy of LLMs across different scales on MATH-500, MinervaMath, and Olympiad. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]

original abstract

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes unsupervised Process Reward Models (uPRM) that define a scoring function from LLM next-token probabilities to jointly assess candidate first-error positions across batches of reasoning trajectories, enabling training without step annotations or final-answer ground truth. It reports up to 15% absolute accuracy gains over LLM-as-a-Judge on ProcessBench for identifying first erroneous steps, comparable performance to supervised PRMs (outperforming majority voting by up to 6.9%) as a test-time verifier, and more robust policy optimization in RL than supervised PRMs using ground-truth labels.

Significance. If the batch-based scoring function produces reliable pseudo-labels for error localization, this would be a significant advance by removing the costly annotation requirement for process supervision, opening scalable paths for reward modeling in complex reasoning. The multi-scenario empirical evaluation (error detection, test-time scaling, RL) provides broad support, and the absence of free parameters or invented entities in the core construction is a strength.

major comments (3)
  1. [§3.2, Eq. (3)] The joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.
  2. [§4.1, Table 1] (ProcessBench results) The 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.
  3. [§5.3] (RL experiments) The claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.
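The length confound raised in major comment 1 is easy to exhibit: an unnormalized prefix log-probability score is maximized by the empty prefix, so some normalization is needed before the pseudo-labels can be trusted. A hedged sketch of the control (function names are illustrative, not from the manuscript):

```python
def raw_prefix_score(step_logprobs, j):
    """Sum of step log-probabilities before candidate error position j.
    Degenerate on its own: log-probs are negative, so shorter prefixes
    always score at least as high and j = 0 is trivially favored."""
    return sum(step_logprobs[:j])

def normalized_prefix_score(step_logprobs, j):
    """Length-normalized variant: mean step log-probability of the prefix,
    which removes the systematic preference for short prefixes."""
    return raw_prefix_score(step_logprobs, j) / max(j, 1)
```

With uniform step log-probabilities the raw score strictly decreases in j while the normalized score is constant, which is the kind of sensitivity check the comment asks for.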
minor comments (2)
  1. [§3.1] The notation for the scoring function S(b, k) is introduced without an explicit statement of its dependence on batch size, which could be clarified in §3.1 for readability.
  2. [Figure 2] Figure 2 caption does not specify the number of runs or error bars, making it difficult to assess the statistical significance of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3.2, Eq. (3)] the joint scoring function over a batch of trajectories is defined to select the position that maximizes the product of next-token probabilities under an implicit single-error-per-trajectory model; no derivation or sensitivity analysis shows that this avoids spurious correlations with length or token rarity, which is load-bearing for the claim that the resulting pseudo-labels enable truly unsupervised error detection.

    Authors: We agree that a formal derivation and sensitivity analysis of the scoring function would improve rigor. The product-based objective in Eq. (3) follows from the single-error assumption and the observation that correct steps receive higher next-token probability mass; the batch joint maximization is intended to identify the position most consistent with this pattern across trajectories. In the revised manuscript we will add both a short derivation clarifying the objective and a sensitivity analysis (including controlled experiments that vary sequence length, token rarity, and batch composition) to quantify any residual correlations. These additions will be placed in §3.2 and the appendix. revision: yes

  2. Referee: [§4.1, Table 1] the 15% absolute gain is reported for first-error identification, but the manuscript provides no ablation on how batches are sampled or on the error-rate distribution within batches; if the method implicitly relies on controlled error placement, the gains may not hold on naturally distributed trajectories and would not support the unsupervised claim.

    Authors: We acknowledge the value of explicit ablations on batch construction. The reported ProcessBench numbers were obtained by sampling batches from the benchmark’s native distribution of trajectories (which already contains a realistic mix of error rates and positions). Nevertheless, we agree that additional controls are needed to fully support the unsupervised claim. In the revision we will add an ablation subsection that varies (i) the fraction of erroneous trajectories per batch and (ii) the sampling strategy (uniform vs. error-stratified), reporting first-error detection accuracy under each regime. These results will be included in §4.1 and Table 1 will be expanded accordingly. revision: yes

  3. Referee: [§5.3] the claim that uPRM yields more robust policy optimization than a supervised PRM trained on ground-truth labels is surprising and central to the practical value; the section lacks training-curve analysis or error-mode breakdown to explain the robustness advantage, leaving open whether the result stems from the scoring function or from differences in training setup.

    Authors: We thank the referee for highlighting the need for clearer explanatory analysis. The robustness observation is currently supported by final performance metrics and qualitative stability notes. To address the gap we will insert training-curve plots (reward and policy performance vs. RL steps) and an error-mode breakdown (categorizing failure types such as over-penalization of correct steps versus under-detection of errors) for both uPRM and the supervised baseline. These additions will appear in §5.3 and the appendix, allowing readers to assess whether the advantage is attributable to the unsupervised scoring function or to other experimental factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation of uPRM

full rationale

The paper defines a scoring function from LLM next-token probabilities to jointly score candidate first-error positions across batches of trajectories, then uses the resulting pseudo-labels to train uPRM without human step annotations or final-answer ground truth. This is a self-contained construction that does not reduce any claimed result (e.g., 15% accuracy gain on ProcessBench or RL improvements) to its inputs by definition. No equations or steps are shown to be equivalent by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. Performance claims rest on empirical evaluation against external benchmarks rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are detailed in the abstract; the method relies on existing LLM probabilities.

pith-pipeline@v0.9.0 · 5541 in / 1224 out tokens · 52957 ms · 2026-05-12T03:14:50.787716+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 10 internal anchors

  1. [1]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, et al. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv preprint arXiv:2503.09567, 2025

  2. [2]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, et al. From System 1 to System 2: A Survey of Reasoning Large Language Models. arXiv preprint arXiv:2502.17419, 2025

  3. [3]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Neural Information Processing Systems, 2022

  4. [4]

    DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature, 2025

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training Verifiers to Solve Math Word Problems. OpenAI Technical Report, 2021

  6. [6]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  7. [7]

    Solving Math Word Problems with Process- and Outcome-based Feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, et al. Solving Math Word Problems with Process- and Outcome-based Feedback. arXiv preprint arXiv:2211.14275, 2022

  8. [8]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, et al. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

  9. [9]

    Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. In International Conference on Learning Representations, 2025

  10. [10]

    Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

    Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, et al. Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning. In Neural Information Processing Systems, 2025

  11. [11]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, et al. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. In Association for Computational Linguistics, 2024

  12. [12]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, et al. Improve Mathematical Reasoning in Language Models by Automated Process Supervision. arXiv preprint arXiv:2406.06592, 2024

  13. [13]

    Free Process Rewards without Process Labels

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, et al. Free Process Rewards without Process Labels. In International Conference on Machine Learning, 2025

  14. [14]

    Process Reinforcement Through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, et al. Process Reinforcement Through Implicit Rewards. arXiv preprint arXiv:2502.01456, 2025

  15. [15]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, et al. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-level Bilingual Multimodal Scientific Problems. In Association for Computational Linguistics, 2024

  16. [16]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, et al. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models. In International Conference on Learning Representations, 2025

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, et al. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv preprint arXiv:2409.12122, 2024

  19. [19]

    AlphaMath Almost Zero: Process Supervision without Process

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath Almost Zero: Process Supervision without Process. Neural Information Processing Systems, 2024

  20. [20]

    URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

    Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, et al. URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. arXiv preprint arXiv:2501.04686, 2025

  21. [21]

    VinePPO: Refining Credit Assignment in RL Training of LLMs

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, et al. VinePPO: Refining Credit Assignment in RL Training of LLMs. In International Conference on Machine Learning, 2025

  22. [22]

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, et al. The Lessons of Developing Process Reward Models in Mathematical Reasoning. In Association for Computational Linguistics, 2025

  23. [23]

    AutoPSV: Automated Process-Supervised Verifier

    Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, et al. AutoPSV: Automated Process-Supervised Verifier. Neural Information Processing Systems, 2024

  24. [24]

    From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

    Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, and Huawei Shen. From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment. In Association for Computational Linguistics, 2025

  25. [25]

    FreePRM: Training Process Reward Models Without Ground Truth Process Labels

    Lin Sun, Chuang Liu, Xiaofeng Ma, Tao Yang, Weijia Lu, and Ning Wu. FreePRM: Training Process Reward Models Without Ground Truth Process Labels. arXiv preprint arXiv:2506.03570, 2025

  26. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Neural Information Processing Systems, 2023

  27. [27]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. In Conference on Language Modeling, 2024

  28. [28]

    GPTScore: Evaluate as You Desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as You Desire. In Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  29. [29]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Conference on Empirical Methods in Natural Language Processing, 2023

  30. [30]

    Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, et al. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. arXiv preprint arXiv:2502.06703, 2025

  31. [31]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv preprint arXiv:2407.21787, 2024

  32. [32]

    Scaling Test-Time Compute with Open Models

    Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling Test-Time Compute with Open Models, 2024. URL https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  33. [33]

    GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

    Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, et al. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv preprint arXiv:2504.00891, 2025

  34. [34]

    Dynamic and Generalizable Process Reward Modeling

    Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, et al. Dynamic and Generalizable Process Reward Modeling. In Association for Computational Linguistics, 2025

  35. [35]

    PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary

    Jiarui Yao, Ruida Wang, and Tong Zhang. PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary. arXiv preprint arXiv:2601.10201, 2026

  36. [36]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, et al. Understanding R1-Zero-Like Training: A Critical Perspective. In Conference on Language Modeling, 2025

  37. [37]

    Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, et al. Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. In International Conference on Learning Representations, 2025

  38. [38]

    Scaling Laws for Reward Model Overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. In International Conference on Machine Learning, 2023

  39. [39]

    RULER: Relative Universal LLM-Elicited Rewards

    Kyle Corbitt, Saumya Gandhi, Angky William, Andie Jones, Brad Hilton, et al. RULER: Relative Universal LLM-Elicited Rewards. OpenPipe Blog, 2025

  40. [40]

    Batched Self-Consistency Improves LLM Relevance Assessment and Ranking

    Anton Korikov, Pan Du, Scott Sanner, and Navid Rekabsaz. Batched Self-Consistency Improves LLM Relevance Assessment and Ranking. In Conference on Empirical Methods in Natural Language Processing, 2025

  41. [41]

    Large (Vision) Language Models are Unsupervised In-Context Learners

    Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, et al. Large (Vision) Language Models are Unsupervised In-Context Learners. In International Conference on Learning Representations, 2025

  42. [42]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, et al. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022

  43. [43]

    STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning. Neural Information Processing Systems, 2022

  44. [44]

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, et al. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Transactions on Machine Learning Research, 2024

  45. [45]

    Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy

    Brian Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010

  46. [46]

    Actor-Critic Algorithms

    Vijay Konda and John Tsitsiklis. Actor-Critic Algorithms. In Neural Information Processing Systems, 1999

  47. [47]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024

  48. [48]

    ProcessBench: Identifying Process Errors in Mathematical Reasoning

    Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, et al. ProcessBench: Identifying Process Errors in Mathematical Reasoning. In Association for Computational Linguistics, 2025

  49. [49]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, et al. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems, 2021

  50. [50]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, et al. Solving Quantitative Reasoning Problems with Language Models. Neural Information Processing Systems, 2022

  51. [51]

    RLHF Workflow: From Reward Modeling to Online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, et al. RLHF Workflow: From Reward Modeling to Online RLHF. Transactions on Machine Learning Research, 2024

  52. [52]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Riu Yan, Chaojie Wang, et al. Skywork Open Reasoner 1 Technical Report. arXiv preprint arXiv:2505.22312, 2025

  53. [53]

    Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Association for Computational Linguistics, 2024

  54. [54]

    What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, et al. What Makes a Reward Model a Good Teacher? An Optimization Perspective. In Neural Information Processing Systems, 2025

  55. [55]

    FlexAttention: A Programming Model for Generating Fused Attention Variants

    Boyuan Dong, Juechu Feng, Driss Guessous, Yanbo Liang, and Horace He. FlexAttention: A Programming Model for Generating Fused Attention Variants. In Conference on Machine Learning and Systems, 2025

  56. [56]

    Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning

    Ronald J. Williams. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992

  57. [57]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019

  58. [58]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, et al. HybridFlow: A Flexible and Efficient RLHF Framework. In European Conference on Computer Systems, 2025

  59. [59]

    No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-context

    Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Tan Yi-Chern, Marek Rei, et al. No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-context. Conference on Empirical Methods in Natural Language Processing, 2025

  60. [60]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, et al. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. In International Conference on Machine Learning, 2025

  61. [61]

    RL’s Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s Razor: Why Online Reinforcement Learning Forgets Less. arXiv preprint arXiv:2509.04259, 2025
