The Lessons of Developing Process Reward Models in Mathematical Reasoning
Pith reviewed 2026-05-16 13:39 UTC · model grok-4.3
The pith
Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Monte Carlo estimation for process reward model data synthesis yields inferior performance and generalization because completion models produce inaccurate step-level verifications. Conventional Best-of-N evaluations also suffer from three biases: unreliable policy models generate correct answers via flawed processes, PRMs tolerate those responses and produce inflated scores, and optimized PRMs shift toward outcome assessment, as shown by minimum scores concentrating on final steps. Consensus filtering that integrates Monte Carlo estimates with LLM-as-a-judge annotations, paired with a dual response-level and step-level evaluation framework, mitigates these issues and produces superior models.
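For illustration only, a minimal sketch of min-aggregated Best-of-N selection with a PRM, together with the final-step diagnostic the claim alludes to. The interface `prm_step_scores` is hypothetical, and the min-over-steps aggregation is a common convention rather than the paper's exact scoring rule.

```python
# Minimal sketch of Best-of-N (BoN) selection with a process reward model (PRM).
# Hypothetical interface: `prm_step_scores(problem, steps)` returns one score per
# reasoning step; the paper's own scoring function is not reproduced here.
from typing import Callable, List, Sequence, Tuple

def bon_select(
    problem: str,
    candidates: Sequence[List[str]],                 # N candidate responses, each a list of steps
    prm_step_scores: Callable[[str, List[str]], List[float]],
) -> Tuple[int, List[List[float]]]:
    """Pick the candidate whose weakest step is strongest (min-aggregation BoN)."""
    all_scores = [prm_step_scores(problem, steps) for steps in candidates]
    response_scores = [min(scores) for scores in all_scores]  # response score = worst step
    best = max(range(len(candidates)), key=lambda i: response_scores[i])
    return best, all_scores

def final_step_min_fraction(all_scores: List[List[float]]) -> float:
    """Diagnostic behind the bias claim: how often the minimum score falls on the
    final step. A high value suggests the PRM is effectively judging outcomes
    rather than intermediate steps."""
    hits = sum(
        1 for s in all_scores
        if s and min(range(len(s)), key=s.__getitem__) == len(s) - 1
    )
    return hits / max(len(all_scores), 1)
```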
What carries the argument
The consensus filtering mechanism that combines Monte Carlo estimation signals with LLM-as-a-judge annotations to select reliable training examples for process reward models
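A minimal sketch of what such a consensus filter could look like, assuming a per-step Monte Carlo correctness estimate (fraction of completions reaching the correct answer) and a binary LLM-as-a-judge verdict. The threshold and the keep-only-on-agreement rule are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of consensus filtering between Monte Carlo (MC) estimation and
# LLM-as-a-judge step labels. The 0.0 threshold and the agreement rule are
# illustrative assumptions, not the paper's exact settings.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StepAnnotation:
    mc_score: float        # fraction of MC completions from this step reaching the correct answer
    judge_correct: bool    # LLM-as-a-judge verdict for this step

def consensus_label(step: StepAnnotation, mc_threshold: float = 0.0) -> Optional[bool]:
    """Return a step label only when both annotators agree; None means 'filtered out'."""
    mc_correct = step.mc_score > mc_threshold
    return mc_correct if mc_correct == step.judge_correct else None

def filter_example(steps: List[StepAnnotation]) -> Optional[List[bool]]:
    """Keep a training example only if every step receives a consensus label."""
    labels = [consensus_label(s) for s in steps]
    if any(label is None for label in labels):
        return None
    return [bool(label) for label in labels]
```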
If this is right
- Process reward models reach higher Best-of-N scores while correctly identifying process errors rather than only final answers.
- Step-wise error identification accuracy increases under the combined evaluation framework.
- Effective process reward models can be trained with substantially fewer annotations due to improved data efficiency.
- The released open-source model provides a stronger baseline that outperforms earlier public process reward models.
- Future development of process supervision should adopt dual-level metrics that track both responses and individual steps; a minimal sketch of such a dual-level check follows this list.
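The sketch below pairs a response-level check (BoN accuracy on final answers) with a step-level check (first-error localization in the style of ProcessBench). Field names and the 0.5 decision threshold are illustrative assumptions, not the paper's exact evaluation setup.

```python
# Minimal sketch of a dual-level evaluation: response-level BoN accuracy plus
# step-level first-error localization. Names and thresholds are illustrative.
from typing import List, Optional

def response_level_accuracy(selected_answers: List[str], gold_answers: List[str]) -> float:
    """Fraction of problems where the BoN-selected response has the correct final answer."""
    correct = sum(a == g for a, g in zip(selected_answers, gold_answers))
    return correct / max(len(gold_answers), 1)

def predicted_first_error(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step the PRM scores below threshold, or None if all steps pass."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def step_level_accuracy(all_step_scores: List[List[float]],
                        gold_first_errors: List[Optional[int]]) -> float:
    """Fraction of responses where the predicted first error matches the labeled one
    (None means the labeled response is fully correct)."""
    hits = sum(predicted_first_error(s) == g
               for s, g in zip(all_step_scores, gold_first_errors))
    return hits / max(len(gold_first_errors), 1)
```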
Where Pith is reading between the lines
- Consensus-style filtering could reduce evaluation biases when applying process supervision to reasoning domains other than mathematics.
- Lower reliance on any single annotation method may make process reward models more practical to scale as model sizes grow.
- The results indicate that stronger policy models capable of generating diverse reasoning paths would make Best-of-N testing more reliable for future process reward work.
- The released model supplies a concrete starting point for checking whether the same biases appear when newer large language models serve as the underlying policy.
Load-bearing premise
The identified biases in Best-of-N evaluation and the performance gains from consensus filtering generalize beyond the specific models, datasets, and tasks used in the experiments.
What would settle it
A controlled experiment in which a process reward model trained only on Monte Carlo data achieves equal or higher accuracy than the consensus-filtered model when both are tested on a fresh set of human-verified step-by-step error labels.
read the original abstract
Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Monte Carlo (MC) estimation for synthesizing training data for Process Reward Models (PRMs) in mathematical reasoning is inferior to LLM-as-a-judge and human annotation methods. It identifies three biases in conventional Best-of-N (BoN) evaluation—misalignment from flawed-process/correct-answer responses, inflated scores due to PRM tolerance, and score concentration on final steps—and proposes a consensus filtering mechanism that integrates MC with LLM-as-a-judge. This yields improved performance and data efficiency on BoN and step-wise error identification tasks, culminating in the release of a new state-of-the-art open-source PRM.
Significance. If the empirical findings hold, the work supplies concrete practical guidelines for PRM data annotation and evaluation, demonstrates a reproducible way to mitigate common biases in process supervision, and delivers a stronger open-source PRM that can be directly adopted by the community for mathematical reasoning pipelines.
major comments (3)
- [Abstract and Experiments] The manuscript provides no dataset sizes, number of training/evaluation examples, statistical significance tests, or error bars for any of the reported BoN or step-wise performance gains (abstract and experimental sections). Without these, the central claim that consensus filtering 'significantly improve[s] both model performance and data efficiency' cannot be verified.
- [Method / Consensus Filtering] The exact operational definition of the consensus filtering mechanism—voting rules, decision thresholds, weighting between MC completion scores and LLM-as-judge verdicts, and handling of disagreements—is not specified with sufficient precision to allow reproduction or to confirm that the reported superiority is not an artifact of the particular implementation choices.
- [Evaluation and Discussion] No cross-model ablations or out-of-distribution evaluations are presented to test whether the identified BoN biases (flawed-process/correct-answer misalignment and final-step score concentration) and the advantage of consensus filtering generalize beyond the specific policy models, math datasets, and evaluation splits used.
minor comments (1)
- [Abstract] The abstract states that the new PRM 'outperforms existing open-source alternatives' but does not name the baselines or report the exact margins; adding these quantitative details would strengthen the summary.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us identify important areas for improvement in clarity, reproducibility, and experimental rigor. We address each major comment point-by-point below and will revise the manuscript to incorporate the necessary changes.
read point-by-point responses
-
Referee: [Abstract and Experiments] The manuscript provides no dataset sizes, number of training/evaluation examples, statistical significance tests, or error bars for any of the reported BoN or step-wise performance gains (abstract and experimental sections). Without these, the central claim that consensus filtering 'significantly improve[s] both model performance and data efficiency' cannot be verified.
Authors: We agree that these details are essential for verifying the claims and apologize for their omission in the initial submission. In the revised manuscript, we will add explicit reporting of all dataset sizes (including exact numbers of training examples for MC-based, LLM-as-a-judge, human-annotated, and consensus-filtered PRMs, as well as evaluation set sizes for BoN and step-wise error identification tasks). We will also include error bars from multiple runs with different random seeds and conduct and report statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) on the performance and data-efficiency gains to substantiate the improvements. revision: yes
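A minimal sketch of the kind of seed-paired significance check the response names, assuming matched per-seed BoN accuracies for two PRMs. The arrays are placeholder values, not results from the paper.

```python
# Minimal sketch of a seed-paired significance check with SciPy.
# The per-seed accuracies below are placeholders, not reported results.
import numpy as np
from scipy import stats

consensus_prm = np.array([0.742, 0.751, 0.748, 0.745, 0.750])  # placeholder per-seed BoN accuracy
mc_only_prm   = np.array([0.721, 0.733, 0.728, 0.725, 0.730])  # placeholder per-seed BoN accuracy

t_stat, t_p = stats.ttest_rel(consensus_prm, mc_only_prm)   # paired t-test across seeds
w_stat, w_p = stats.wilcoxon(consensus_prm, mc_only_prm)    # Wilcoxon signed-rank test

print(f"paired t-test: t={t_stat:.2f}, p={t_p:.4f}")
print(f"wilcoxon:      W={w_stat:.2f}, p={w_p:.4f}")
```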
-
Referee: [Method / Consensus Filtering] The exact operational definition of the consensus filtering mechanism—voting rules, decision thresholds, weighting between MC completion scores and LLM-as-judge verdicts, and handling of disagreements—is not specified with sufficient precision to allow reproduction or to confirm that the reported superiority is not an artifact of the particular implementation choices.
Authors: We concur that the current description lacks the precision needed for full reproducibility. The revised version will include a detailed, formal specification of the consensus filtering procedure: explicit voting rules (e.g., requiring agreement or a combined score threshold), the precise numerical decision thresholds used, the weighting formula between MC completion scores and LLM-as-a-judge verdicts, and the exact protocol for handling disagreements (such as defaulting to the LLM verdict, discarding cases, or applying a tie-breaker). This will allow independent replication and confirm that the gains are not implementation-specific. revision: yes
-
Referee: [Evaluation and Discussion] No cross-model ablations or out-of-distribution evaluations are presented to test whether the identified BoN biases (flawed-process/correct-answer misalignment and final-step score concentration) and the advantage of consensus filtering generalize beyond the specific policy models, math datasets, and evaluation splits used.
Authors: We recognize the importance of testing generalization. Our experiments were deliberately scoped to the policy models and datasets from prior PRM literature to enable direct, apples-to-apples comparisons with existing methods. In the revision, we will add at least one cross-model ablation using an alternative policy model on the same datasets and expand the discussion to address the scope limitations. Comprehensive out-of-distribution evaluations across additional math domains would require substantial extra compute and are flagged as valuable future work; however, the mechanistic analysis of the biases (misalignment, score inflation, and final-step concentration) is grounded in observable patterns that we expect to hold more broadly. revision: partial
Circularity Check
No circularity: purely empirical study with no derivations or self-referential reductions
full rationale
The paper is an empirical investigation of Process Reward Models. It reports experimental comparisons between Monte Carlo estimation, LLM-as-a-judge, and human annotation for data synthesis; identifies biases in Best-of-N evaluation through observation; and proposes a consensus filtering mechanism based on those findings. No equations, theoretical derivations, or parameter-fitting steps are described that reduce to inputs by construction. Claims rest on distinct experimental results across methods and tasks rather than self-definition, fitted predictions, or load-bearing self-citations. The work is self-contained against external benchmarks.
Forward citations
Cited by 19 Pith papers
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.
-
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
-
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.
-
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
PARM: Pipeline-Adapted Reward Model
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
-
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
-
Improving Medical VQA through Trajectory-Aware Process Supervision
A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
-
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
-
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
-
Learning to Draw ASCII Improves Spatial Reasoning in Language Models
Training LLMs on text-to-ASCII spatial layout construction improves text-only spatial reasoning and transfers to external benchmarks.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models