pith. machine review for the scientific record.

arxiv: 2501.07301 · v2 · submitted 2025-01-13 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 1 theorem link

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Authors on Pith no claims yet

Pith reviewed 2026-05-16 13:39 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords process reward models · mathematical reasoning · Best-of-N evaluation · consensus filtering · LLM-as-a-judge · process supervision · data synthesis · step-wise error identification
0 comments

The pith

Consensus filtering across annotation methods yields stronger process reward models for mathematical reasoning by correcting biases in standard evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Monte Carlo estimation for creating training data for process reward models produces weaker results than LLM-as-a-judge or human labeling because completion models frequently misjudge whether an individual reasoning step is correct. It further documents systematic biases in conventional Best-of-N testing, where models receive high scores for reaching the right final answer even when the intermediate steps contain errors, which misaligns the evaluation with the actual goal of process verification. To fix these problems the authors introduce a consensus filtering procedure that merges signals from multiple annotation sources and recommend evaluating both full responses and individual steps. This combination delivers measurable gains in model performance and requires less labeled data. They also release an improved open-source process reward model that exceeds prior public alternatives.
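
The Monte Carlo annotation scheme the paper critiques can be sketched in a few lines: a step is labeled by rolling out completions from the partial solution and checking how often they reach the gold answer. This is a hedged illustration, not the authors' implementation; `complete_from`, `toy_completer`, and the threshold are illustrative stand-ins.

```python
import random

def mc_step_label(prefix_steps, gold_answer, complete_from, n_rollouts=8, threshold=0.0):
    """Label the last step of `prefix_steps` by Monte Carlo estimation.

    A step is marked correct if the fraction of rollouts from this
    prefix that reach the gold answer exceeds `threshold` (hard labels
    at threshold 0.0 mark a step wrong only when no rollout succeeds).
    """
    hits = sum(complete_from(prefix_steps) == gold_answer for _ in range(n_rollouts))
    success_rate = hits / n_rollouts
    return {"success_rate": success_rate, "label": success_rate > threshold}

def toy_completer(prefix):
    # Toy completion model: never recovers from an erroneous step,
    # succeeds 70% of the time from a good prefix.
    if "bad step" in prefix:
        return "wrong"
    return "42" if random.random() < 0.7 else "wrong"

random.seed(0)
good = mc_step_label(["x = 6*7", "so x = 42"], "42", toy_completer)
bad = mc_step_label(["x = 6*7", "bad step"], "42", toy_completer)
```

The paper's point is that real completion models are far noisier than `toy_completer`: a strong completer can rescue a flawed prefix and a weak one can fail after a sound step, so these labels disagree with step-level correctness.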

Core claim

Monte Carlo estimation for process reward model data synthesis yields inferior performance and generalization because completion models produce inaccurate step-level verifications. Conventional Best-of-N evaluations suffer from three biases: unreliable policy models generate correct answers via flawed processes; PRMs tolerate those responses and produce inflated scores; and optimized PRMs shift toward outcome assessment, as shown by minimum scores concentrating on final steps. Consensus filtering that integrates Monte Carlo estimates with LLM-as-a-judge annotations, paired with a dual response-level and step-level evaluation framework, mitigates these issues and produces superior models.
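
The final-step bias can be made concrete: score each candidate response by the minimum of its step scores (a common PRM aggregation choice, assumed here for illustration) and measure how often that minimum lands on the last step. The `toy_prm` scorer below is hypothetical, not the paper's model.

```python
def best_of_n(responses, step_scorer):
    """Pick the response whose worst step, as judged by the PRM, is best.

    `responses` is a list of step lists; `step_scorer` maps a step list
    to one score per step. Min-aggregation is one common PRM choice.
    """
    scored = [(min(step_scorer(r)), r) for r in responses]
    return max(scored, key=lambda t: t[0])[1]

def final_step_min_rate(responses, step_scorer):
    """Fraction of responses whose minimum step score falls on the final
    step -- the concentration pattern the paper reads as a drift from
    process-based toward outcome-based assessment."""
    hits = 0
    for r in responses:
        scores = step_scorer(r)
        if scores.index(min(scores)) == len(scores) - 1:
            hits += 1
    return hits / len(responses)

def toy_prm(steps):
    # Hypothetical PRM: penalizes any step flagged with "?", else scores high.
    return [0.2 if "?" in s else 0.9 for s in steps]

candidates = [["a", "b?", "c"], ["a", "b", "c?"], ["a", "b", "c"]]
best = best_of_n(candidates, toy_prm)         # the fully clean response wins
rate = final_step_min_rate(candidates, toy_prm)
```

A high `final_step_min_rate` on a trained PRM would signal that its scoring is effectively judging outcomes rather than intermediate reasoning.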

What carries the argument

The consensus filtering mechanism that combines Monte Carlo estimation signals with LLM-as-a-judge annotations to select reliable training examples for process reward models
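
A minimal sketch of such a filter, assuming a simple agreement rule between the two annotation sources; the review does not specify the paper's exact thresholds or voting scheme, so `mc_score`, `judge_ok`, and the 0.5 cutoff are illustrative assumptions.

```python
def consensus_filter(examples, mc_threshold=0.5):
    """Keep a step-labeled training example only when the Monte Carlo
    signal and the LLM-as-a-judge verdict agree.

    Each example carries `mc_score` (fraction of successful rollouts
    from the step) and `judge_ok` (the judge's boolean verdict). Simple
    agreement with a 0.5 MC threshold is assumed for illustration.
    """
    kept = []
    for ex in examples:
        mc_ok = ex["mc_score"] >= mc_threshold
        if mc_ok == ex["judge_ok"]:          # the two annotators agree
            kept.append({**ex, "label": ex["judge_ok"]})
    return kept

examples = [
    {"step": "2+2=4", "mc_score": 0.9, "judge_ok": True},   # agree: keep
    {"step": "2+2=5", "mc_score": 0.1, "judge_ok": False},  # agree: keep
    {"step": "lucky", "mc_score": 0.8, "judge_ok": False},  # disagree: drop
]
training_set = consensus_filter(examples)    # 2 of 3 examples survive
```

Discarding disagreements trades raw data volume for label reliability, which is consistent with the paper's reported gain in data efficiency.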

If this is right

  • Process reward models reach higher Best-of-N scores while correctly identifying process errors rather than only final answers.
  • Step-wise error identification accuracy increases under the combined evaluation framework.
  • Effective process reward models can be trained with substantially fewer annotations due to improved data efficiency.
  • The released open-source model provides a stronger baseline that outperforms earlier public process reward models.
  • Future development of process supervision should adopt dual-level metrics that track both responses and individual steps.
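
The dual-level metrics in the last bullet can be sketched as two accuracies computed from the same step annotations; the function and data below are illustrative, not the paper's exact metric definitions.

```python
def dual_level_eval(predictions, gold):
    """Compute response-level and step-level accuracy together.

    `predictions` and `gold` are lists of per-response step-label lists
    (True = step judged correct). A response counts as correct at the
    response level only when every one of its step labels matches.
    """
    resp_hits, step_hits, step_total = 0, 0, 0
    for pred, ref in zip(predictions, gold):
        matches = [p == g for p, g in zip(pred, ref)]
        resp_hits += all(matches)
        step_hits += sum(matches)
        step_total += len(matches)
    return {
        "response_acc": resp_hits / len(gold),
        "step_acc": step_hits / step_total,
    }

gold = [[True, True, False], [True, True, True]]
pred = [[True, False, False], [True, True, True]]
metrics = dual_level_eval(pred, gold)
```

Tracking both numbers exposes the failure mode the paper highlights: a model can look strong on whole responses while misjudging individual steps, or vice versa.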

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Consensus-style filtering could reduce evaluation biases when applying process supervision to reasoning domains other than mathematics.
  • Lower reliance on any single annotation method may make process reward models more practical to scale as model sizes grow.
  • The results indicate that stronger policy models capable of generating diverse reasoning paths would make Best-of-N testing more reliable for future process reward work.
  • The released model supplies a concrete starting point for checking whether the same biases appear when newer large language models serve as the underlying policy.

Load-bearing premise

The identified biases in Best-of-N evaluation and the performance gains from consensus filtering generalize beyond the specific models, datasets, and tasks used in the experiments.

What would settle it

A controlled experiment in which a process reward model trained only on Monte Carlo data achieves equal or higher accuracy than the consensus-filtered model when both are tested on a fresh set of human-verified step-by-step error labels.

Original abstract

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that Monte Carlo (MC) estimation for synthesizing training data for Process Reward Models (PRMs) in mathematical reasoning is inferior to LLM-as-a-judge and human annotation methods. It identifies three biases in conventional Best-of-N (BoN) evaluation—misalignment from flawed-process/correct-answer responses, inflated scores due to PRM tolerance, and score concentration on final steps—and proposes a consensus filtering mechanism that integrates MC with LLM-as-a-judge. This yields improved performance and data efficiency on BoN and step-wise error identification tasks, culminating in the release of a new state-of-the-art open-source PRM.

Significance. If the empirical findings hold, the work supplies concrete practical guidelines for PRM data annotation and evaluation, demonstrates a reproducible way to mitigate common biases in process supervision, and delivers a stronger open-source PRM that can be directly adopted by the community for mathematical reasoning pipelines.

major comments (3)
  1. [Abstract and Experiments] The manuscript provides no dataset sizes, number of training/evaluation examples, statistical significance tests, or error bars for any of the reported BoN or step-wise performance gains (abstract and experimental sections). Without these, the central claim that consensus filtering 'significantly improve[s] both model performance and data efficiency' cannot be verified.
  2. [Method / Consensus Filtering] The exact operational definition of the consensus filtering mechanism—voting rules, decision thresholds, weighting between MC completion scores and LLM-as-judge verdicts, and handling of disagreements—is not specified with sufficient precision to allow reproduction or to confirm that the reported superiority is not an artifact of the particular implementation choices.
  3. [Evaluation and Discussion] No cross-model ablations or out-of-distribution evaluations are presented to test whether the identified BoN biases (flawed-process/correct-answer misalignment and final-step score concentration) and the advantage of consensus filtering generalize beyond the specific policy models, math datasets, and evaluation splits used.
minor comments (1)
  1. [Abstract] The abstract states that the new PRM 'outperforms existing open-source alternatives' but does not name the baselines or report the exact margins; adding these quantitative details would strengthen the summary.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify important areas for improvement in clarity, reproducibility, and experimental rigor. We address each major comment point-by-point below and will revise the manuscript to incorporate the necessary changes.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The manuscript provides no dataset sizes, number of training/evaluation examples, statistical significance tests, or error bars for any of the reported BoN or step-wise performance gains (abstract and experimental sections). Without these, the central claim that consensus filtering 'significantly improve[s] both model performance and data efficiency' cannot be verified.

    Authors: We agree that these details are essential for verifying the claims and apologize for their omission in the initial submission. In the revised manuscript, we will add explicit reporting of all dataset sizes (including exact numbers of training examples for MC-based, LLM-as-a-judge, human-annotated, and consensus-filtered PRMs, as well as evaluation set sizes for BoN and step-wise error identification tasks). We will also include error bars from multiple runs with different random seeds and conduct/report statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) on the performance and data-efficiency gains to substantiate the improvements. revision: yes

  2. Referee: [Method / Consensus Filtering] The exact operational definition of the consensus filtering mechanism—voting rules, decision thresholds, weighting between MC completion scores and LLM-as-judge verdicts, and handling of disagreements—is not specified with sufficient precision to allow reproduction or to confirm that the reported superiority is not an artifact of the particular implementation choices.

    Authors: We concur that the current description lacks the precision needed for full reproducibility. The revised version will include a detailed, formal specification of the consensus filtering procedure: explicit voting rules (e.g., requiring agreement or a combined score threshold), the precise numerical decision thresholds used, the weighting formula between MC completion scores and LLM-as-a-judge verdicts, and the exact protocol for handling disagreements (such as defaulting to the LLM verdict, discarding cases, or applying a tie-breaker). This will allow independent replication and confirm that the gains are not implementation-specific. revision: yes

  3. Referee: [Evaluation and Discussion] No cross-model ablations or out-of-distribution evaluations are presented to test whether the identified BoN biases (flawed-process/correct-answer misalignment and final-step score concentration) and the advantage of consensus filtering generalize beyond the specific policy models, math datasets, and evaluation splits used.

    Authors: We recognize the importance of testing generalization. Our experiments were deliberately scoped to the policy models and datasets from prior PRM literature to enable direct, apples-to-apples comparisons with existing methods. In the revision, we will add at least one cross-model ablation using an alternative policy model on the same datasets and expand the discussion to address the scope limitations. Comprehensive out-of-distribution evaluations across additional math domains would require substantial extra compute and are flagged as valuable future work; however, the mechanistic analysis of the biases (misalignment, score inflation, and final-step concentration) is grounded in observable patterns that we expect to hold more broadly. revision: partial
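
The significance testing proposed in the first response can be approximated without external libraries via an exact paired sign-flip permutation test. This is a stand-in for the paired t-test or Wilcoxon test the authors mention, and the per-seed BoN scores below are hypothetical.

```python
from itertools import product

def paired_sign_flip_test(baseline, treatment):
    """Exact paired permutation (sign-flip) test on score differences.

    Enumerates all 2^n sign assignments of the paired differences and
    returns the two-sided p-value for the observed total difference.
    Feasible for the small seed counts typical of these experiments.
    """
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(sum(diffs))
    n = len(diffs)
    extreme = sum(
        abs(sum(s * d for s, d in zip(signs, diffs))) >= observed
        for signs in product((-1, 1), repeat=n)
    )
    return extreme / 2 ** n

# Hypothetical BoN scores from 6 seeds for two PRM training recipes.
mc_only   = [68.1, 67.5, 68.4, 67.9, 68.0, 67.7]
consensus = [69.3, 68.8, 69.6, 69.0, 69.2, 68.9]
p_value = paired_sign_flip_test(mc_only, consensus)
```

With six seeds all favoring the consensus recipe, only the two all-same-sign assignments are as extreme as the observed difference, giving p = 2/64 ≈ 0.031.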

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential reductions

Full rationale

The paper is an empirical investigation of Process Reward Models. It reports experimental comparisons between Monte Carlo estimation, LLM-as-a-judge, and human annotation for data synthesis; identifies biases in Best-of-N evaluation through observation; and proposes a consensus filtering mechanism based on those findings. No equations, theoretical derivations, or parameter-fitting steps are described that reduce to inputs by construction. Claims rest on distinct experimental results across methods and tasks rather than self-definition, fitted predictions, or load-bearing self-citations, and they are validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning paper. No mathematical axioms, free parameters, or invented entities are introduced; all claims rest on experimental comparisons between annotation and evaluation methods.

pith-pipeline@v0.9.0 · 5625 in / 1110 out tokens · 68586 ms · 2026-05-16T13:39:01.668243+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  2. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  3. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  4. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  5. ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

    cs.CL 2026-01 unverdicted novelty 7.0

    ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

  6. Scalable Token-Level Hallucination Detection in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

  7. GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

  8. Process Supervision of Confidence Margin for Calibrated LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

  9. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  10. PARM: Pipeline-Adapted Reward Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

  11. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  12. Improving Medical VQA through Trajectory-Aware Process Supervision

    cs.LG 2026-04 conditional novelty 6.0

    A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.

  13. Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

    cs.AI 2026-03 unverdicted novelty 6.0

    An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.

  14. VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    cs.RO 2025-05 conditional novelty 6.0

    VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

  15. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  16. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  17. Learning to Draw ASCII Improves Spatial Reasoning in Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Training LLMs on text-to-ASCII spatial layout construction improves text-only spatial reasoning and transfers to external benchmarks.

  18. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  19. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. arXiv preprint arXiv:2110.14168.

  2. [2]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. arXiv preprint arXiv:2407.21783.

  3. [3]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. arXiv preprint arXiv:2402.14008.

  4. [4]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021.

  5. [5]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  6. [6]

    MARIO: math reasoning with code interpreter output - A reproducible pipeline

    Minpeng Liao, Chengxi Li, Wei Luo, Jing Wu, and Kai Fan. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 905–924.

  7. [7]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. arXiv preprint arXiv:2305.20050.

  8. [8]

    GPT-4 Technical Report

    OpenAI. arXiv preprint arXiv:2303.08774.

  9. [9]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. arXiv preprint arXiv:2402.03300.

  10. [10]

    MathScale: Scaling Instruction Tuning for Mathematical Reasoning

    Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024.

  11. [11]

    Solving Math Word Problems with Process- and Outcome-Based Feedback

    Jonathan Uesato et al. arXiv preprint arXiv:2211.14275. Also extracted here: Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. Q*: Improving multi-step reasoning for LLMs with deliberative planning. arXiv preprint arXiv:2406.14283, 2024. Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.

  12. [12]

    An Implementation of Generative PRM

    Wei Xiong, Hanning Zhang, Nan Jiang, and Tong Zhang. https://github.com/RLHFlow/RLHF-Reward-Modeling.

  13. [13]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint.

  14. [14]

    Generative Verifiers: Reward Modeling as Next-Token Prediction

    arXiv preprint arXiv:2408.15240. Also extracted here: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. ProcessBench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559.

  15. [15]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. arXiv preprint arXiv:2406.11931.

  16. [16]

    Internal anchor: Appendix A, "A PRM Guided Search"

    PRM integrated with greedy search: generate N candidate steps at each step, score them with the PRM, and expand the highest-scoring candidate. The policy model is Qwen2.5-7B-Instruct, chosen for greater generation diversity, sampling 8 candidates per step.

  17. [17]

    Internal anchor: BoN evaluation discussion

    Qwen2.5-Math-7B-PRM outperforms other PRMs of equivalent model scale but remains below maj@8, suggesting challenges in using a 7B PRM to supervise responses generated by a 72B policy model. Qwen2.5-Math-PRM-72B surpasses maj@8 in prm@8 and is comparable to Qwen2.5-Math-RM-72B in orm@8.

  18. [18]

    Internal anchor: appendix results table

    A flattened table of pass@8, maj@8, and PRM-scored BoN results across GSM8K, MATH, Minerva Math, GaoKao 2023 En, Olympiad Bench, College Math, and MMLU STEM for 1.5B and 7B+ PRMs (Skywork-PRM-1.5B, Math-Shepherd-PRM-7B, RLHFlow-PRM-Mistral-8B, and others); full figures are in the paper.

  19. [19]

    Internal anchor: Table 10, BoN results at N=64

    Results on MATH500, AIME24, AMC23, Minerva Math, GaoKao 2023 En, and Olympiad Bench (He et al., 2024): the released PRMs maintain superior performance compared to other PRMs, especially on MATH500.