arxiv: 2410.08146 · v1 · pith:EWU7G3DXnew · submitted 2024-10-10 · 💻 cs.LG · cs.CL

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Pith reviewed 2026-05-21 01:37 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords process reward models · LLM reasoning · reinforcement learning · test-time search · process advantage verifiers · outcome reward models

The pith

Process rewards that measure progress under a distinct prover policy outperform outcome rewards for improving LLM reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that good process rewards should quantify how much a reasoning step changes the likelihood of eventually reaching a correct answer. This change, or progress, is measured under a prover policy that is deliberately kept separate from the base policy being improved. Training process advantage verifiers on these automatically generated progress signals produces dense rewards that support better exploration. The resulting verifiers deliver higher accuracy and lower compute cost than outcome reward models when used for test-time search, and they also improve sample efficiency and final accuracy when used as rewards in online reinforcement learning. The characterization further shows that even weaker prover policies can still strengthen a stronger base policy.

Core claim

The process reward for a step should measure progress as the change in the likelihood of producing a correct response before and after the step, with this progress evaluated under a prover policy distinct from the base policy. Optimizing process rewards from such provers improves exploration during test-time search and online RL. Training process advantage verifiers to predict progress under these provers yields more than 8 percent higher accuracy and 1.5-5 times greater compute efficiency versus outcome reward models in search, plus 5-6 times better sample efficiency and over 6 percent accuracy gain in online RL.

What carries the argument

Progress defined as the change in future success likelihood before and after a step, measured under a prover policy separate from the base policy and used to train process advantage verifiers.
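The progress signal described above can be illustrated with a small Monte Carlo sketch. This is not the paper's implementation: the `ProverRollout` interface, the `toy_prover` stand-in, and the sample count are all hypothetical, chosen only to show how a before/after difference in success likelihood could be estimated by sampling completions from a prover.

```python
import random
from typing import Callable, Tuple

Prefix = Tuple[str, ...]
# Hypothetical interface: run the prover from a prefix and report whether the
# completed solution is correct. A real prover would be an LLM plus a checker.
ProverRollout = Callable[[Prefix, random.Random], bool]


def success_rate(prefix: Prefix, rollout: ProverRollout, n: int,
                 rng: random.Random) -> float:
    """Monte Carlo estimate of P(correct final answer | prefix) under the prover."""
    return sum(rollout(prefix, rng) for _ in range(n)) / n


def progress(prefix: Prefix, step: str, rollout: ProverRollout,
             n: int = 2000, seed: int = 0) -> float:
    """Estimated change in the prover's success likelihood caused by `step`."""
    rng = random.Random(seed)
    before = success_rate(prefix, rollout, n, rng)
    after = success_rate(prefix + (step,), rollout, n, rng)
    return after - before


def toy_prover(prefix: Prefix, rng: random.Random) -> bool:
    """Toy stand-in (assumption): each "good" step raises the success chance."""
    p = min(1.0, 0.2 + 0.3 * sum(s == "good" for s in prefix))
    return rng.random() < p


if __name__ == "__main__":
    print("good step:", progress((), "good", toy_prover))
    print("bad step: ", progress((), "bad", toy_prover))
```

In this toy setting a "good" step yields a clearly positive estimate and a "bad" step yields an estimate near zero, which is the dense signal a process advantage verifier would be trained to predict.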

If this is right

  • Test-time search guided by the trained verifiers is more than 8 percent more accurate and 1.5-5 times more compute-efficient than search guided by outcome reward models.
  • Online reinforcement learning with dense rewards from the verifiers achieves 5-6 times better sample efficiency and more than 6 percent higher accuracy than reinforcement learning with outcome rewards.
  • Even weak prover policies can substantially improve stronger base policies when progress is measured under the distinct-prover regime.
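For readers unfamiliar with verifier-guided search, the first bullet can be pictured with a generic step-level beam search. This is a minimal sketch under stated assumptions, not the paper's procedure: `propose` stands in for sampling candidate next steps from the base policy, `score` stands in for a trained verifier, and both toy functions are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Prefix = Tuple[str, ...]


def beam_search(propose: Callable[[Prefix], Sequence[str]],
                score: Callable[[Prefix], float],
                beam_size: int, depth: int) -> Prefix:
    """Keep the `beam_size` highest-scoring prefixes after each expansion."""
    beams: List[Prefix] = [()]
    for _ in range(depth):
        # Expand every surviving prefix with each proposed next step.
        candidates = [b + (s,) for b in beams for s in propose(b)]
        if not candidates:
            break
        # Rank by the verifier score; Python's sort is stable for ties.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


def toy_propose(prefix: Prefix) -> Sequence[str]:
    """Hypothetical base policy: always offers the same two candidate steps."""
    return ["bad", "good"]


def toy_score(prefix: Prefix) -> float:
    """Hypothetical verifier: counts the "good" steps in a prefix."""
    return float(sum(s == "good" for s in prefix))


if __name__ == "__main__":
    print(beam_search(toy_propose, toy_score, beam_size=2, depth=3))
```

The compute-efficiency comparison in the bullet concerns how many candidates a search like this must score to reach a given accuracy; the sketch only fixes the control flow, not those numbers.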

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach reduces the need for dense human step-by-step labels by relying on automated progress signals from separate policies.
  • Similar progress definitions could be tested in sequential domains beyond language models, such as planning or program synthesis.
  • Combining signals from several provers of varying strength might further stabilize the training targets.

Load-bearing premise

The progress signal measured under a prover policy distinct from the base policy remains a reliable training target even when the prover is weaker than the base policy and likelihood estimates come from similar models.

What would settle it

Training process advantage verifiers using progress signals computed from the base policy itself rather than a distinct prover, then checking whether the reported accuracy and efficiency gains over outcome reward models disappear in both search and RL settings.

Original abstract

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes measuring process rewards as the change in likelihood of eventual correct completion (progress) under a prover policy distinct from the base policy. It claims this yields process advantage verifiers (PAVs) that improve test-time search and online RL over outcome reward models (ORMs), with reported gains of >8% accuracy, 1.5-5x compute efficiency in search, and 5-6x sample efficiency plus >6% accuracy in RL; it further claims that even weak provers can improve stronger base policies.

Significance. If the central claims hold, the work provides a scalable route to dense, automated process rewards without human step-level labels and shows that separating the prover policy enables useful signals even when the prover is weaker than the base. The efficiency gains and the weak-prover result would be notable contributions to reward design for LLM reasoning.

major comments (2)
  1. [Abstract] The empirical claims (>8% accuracy, 1.5-5x efficiency, 5-6x sample efficiency) are presented without error bars, standard deviations across runs, or statistical tests, so the magnitude and reliability of the reported gains over ORMs cannot be assessed.
  2. [Theoretical characterization and experimental sections] The manuscript provides no ablation that isolates the effect of using a prover policy distinct from the base policy (or of using a weaker prover), which is load-bearing for the claim that such separation yields a reliable, non-circular progress signal when both policies belong to the same model family.
minor comments (1)
  1. [§3] The precise definition of the progress signal (difference of likelihoods) and the procedure for estimating those likelihoods should be stated with an explicit equation and a description of the data used for fitting.
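
As a point of reference for this comment, the definition can be sketched in standard RL notation. The notation is an assumption based on the abstract's description of progress as a step-level advantage, not a quotation from the paper: $\mu$ is the prover policy, $s_h$ is the prompt plus the steps taken so far, $a_h$ is the candidate step, and $y$ is a completed response sampled from the prover.

```latex
\begin{align}
Q^{\mu}(s_h, a_h) &= \Pr_{y \sim \mu(\cdot \mid s_h, a_h)}\left[\, y \text{ is correct} \,\right], \\
V^{\mu}(s_h) &= \mathbb{E}_{a \sim \mu(\cdot \mid s_h)}\left[ Q^{\mu}(s_h, a) \right], \\
A^{\mu}(s_h, a_h) &= Q^{\mu}(s_h, a_h) - V^{\mu}(s_h).
\end{align}
```

Here $V^{\mu}(s_h)$ is the likelihood of success before the step and $Q^{\mu}(s_h, a_h)$ the likelihood after it, so $A^{\mu}$ is the before/after change; how these quantities are estimated in practice is exactly what the comment asks the authors to specify.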

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below, indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The empirical claims (>8% accuracy, 1.5-5x efficiency, 5-6x sample efficiency) are presented without error bars, standard deviations across runs, or statistical tests, so the magnitude and reliability of the reported gains over ORMs cannot be assessed.

    Authors: We agree that the abstract would benefit from greater statistical detail to allow readers to assess the reliability of the reported gains. The underlying experiments were run with multiple random seeds, but error bars and tests were not included in the abstract for space reasons. In the revised manuscript we will expand the abstract and results sections to report standard deviations across runs and include statistical significance tests comparing PAVs against ORMs. revision: yes

  2. Referee: [Theoretical characterization and experimental sections] The manuscript provides no ablation that isolates the effect of using a prover policy distinct from the base policy (or of using a weaker prover), which is load-bearing for the claim that such separation yields a reliable, non-circular progress signal when both policies belong to the same model family.

    Authors: The theoretical characterization derives the precise conditions on the prover policy that guarantee a non-circular progress signal, including the case of a weaker prover from the same model family. The empirical results are consistent with this analysis. We acknowledge, however, that an explicit ablation directly comparing same-policy versus distinct-policy progress signals is absent. We will add this ablation to the experimental section in the revision to isolate the contribution of policy separation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with distinct prover and empirical validation

Full rationale

The paper's core derivation defines process reward as change in likelihood of correct completion under a prover policy explicitly distinct from the base policy, then theoretically characterizes good provers and trains PAVs to predict that signal. This does not reduce to its inputs by construction: the prover is required to be distinct, the characterization is presented as independent theoretical work, and all reported gains (accuracy, efficiency) are measured against ORMs on held-out tasks rather than being forced by fitting the same likelihoods. No self-citation chain, fitted-input-as-prediction, or self-definitional reduction is exhibited in the provided abstract or claims. The setup is externally falsifiable via the reported search and RL experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that likelihood differences under a separate prover constitute a useful dense reward signal without introducing new fitted parameters beyond standard model training.

axioms (1)
  • domain assumption: A prover policy distinct from the base policy can be used to compute reliable step-level progress signals.
    Invoked in the key insight paragraph of the abstract.

pith-pipeline@v0.9.0 · 5875 in / 1198 out tokens · 19268 ms · 2026-05-21T01:37:04.343010+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence law_of_existence · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....

  2. CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

    cs.LG 2026-05 conditional novelty 7.0

    CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.

  3. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  4. Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.

  5. SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

    cs.AI 2026-01 unverdicted novelty 7.0

    SCRIBE introduces skill-conditioned rewards with intermediate behavioral evaluation to reduce noise in training tool-augmented agents, raising AIME25 accuracy from 43.3% to 63.3% on a Qwen3-4B model.

  6. ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

    cs.AI 2025-10 unverdicted novelty 7.0

    ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

  7. The Art of Scaling Reinforcement Learning Compute for LLMs

    cs.LG 2025-10 unverdicted novelty 7.0

    A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...

  8. Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

    cs.CL 2026-05 unverdicted novelty 6.0

    SCA framework applies Information Bottleneck to assign step-level confidence in black-box LLM reasoning traces, flagging errors and boosting self-correction success by up to 13.5% on math and QA tasks.

  9. STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

  10. When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

    stat.ML 2026-05 unverdicted novelty 6.0

    A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.

  11. Controllable and Verifiable Process Data Synthesis for Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.

  12. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  13. GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

  14. Efficient Process Reward Modeling via Contrastive Mutual Information

    cs.CL 2026-04 unverdicted novelty 6.0

    CPMI labels reasoning-step rewards by measuring how much each step boosts mutual information with the correct answer relative to hard negatives, cutting labeling cost by 84% and tokens by 98% while improving accuracy.

  15. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  16. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

  17. The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    cs.LG 2025-05 unverdicted novelty 6.0

    Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

  18. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  19. A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

    cs.CV 2026-05 unverdicted novelty 5.0

    A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.

  20. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  2. [2]

    J. D. Chang, K. Brantley, R. Ramamurthy, D. Misra, and W. Sun. Learning to generate better than your llm. arXiv preprint arXiv:2306.11816,

  3. [3]

    K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066. PMLR,

  4. [4]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021a. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solv...

  5. [5]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,

  6. [6]

    A. Havrilla, Y. Du, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, S. Sukhbaatar, and R. Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642,

  7. [7]

    G. Hinton. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  8. [8]

    A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457,

  9. [9]

    H. Hwang, D. Kim, S. Kim, S. Ye, and M. Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards. arXiv preprint arXiv:2404.10346,

  10. [10]

    Advances in neural information processing systems, 2001a. S. M. Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001b. A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. L. Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint a...

  11. [11]

    Let's Verify Step by Step

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050,

  12. [12]

    L. Luo, Y. Liu, R. Liu, S. Phatale, H. Lara, Y. Li, L. Shu, Y. Zhu, L. Meng, J. Sun, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592,

  13. [13]

    Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and H. Yang. Let’s reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080,

  14. [14]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

  15. [15]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290,

  16. [16]

    Reinforcement and Imitation Learning via Interactive No-Regret Learning

    S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979,

  17. [17]

    A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295,

  18. [18]

    A. Setlur, S. Garg, X. Geng, N. Garg, V. Smith, and A. Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint arXiv:2406.14532,

  19. [19]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  20. [20]

    A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023a. I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generat...

  21. [21]

    Solving math word problems with process- and outcome-based feedback

    J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275,

  22. [22]

    Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang. An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724,

  23. [23]

    F. Yu, A. Gao, and B. Wang. Outcome-supervised verifiers for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724,

  24. [24]

    Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825,

  25. [25]

    Generative verifiers: Reward modeling as next-token prediction

    L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240,

  26. [26]

    First, we look at works that train verifiers to provide outcome level feedback (Cobbe et al., 2021b; Hosseini et al., 2024; Singh et al., 2023b; Zelikman et al.,

  27. [27]

    Here, the trained ORMs are mainly used for test-time search (best-of-𝑁)

    on the correctness of the full response (ORM). Here, the trained ORMs are mainly used for test-time search (best-of-𝑁). Next, we look at works that alleviate issues with sparse feedback in ORMs, and instead train process reward models (PRMs), that can perform credit assignment. PRMs are trained either through human annotations (Lightman et al., 2023; Uesa...

  28. [28]

    commonly used to improve the test-time performance using best-of-𝑁, where we generate multiple candidate solutions from the base policy (LLM), rank them using the ORM, and pick the best one. ORMs are trained to assess correctness of a solution either using binary classification (Cobbe et al., 2021a; Yu et al., 2023), preference optimization using DPO (Hos...

  29. [29]

    or automated LLM-generated data to estimate value functions $Q^\pi$ (Luo et al., 2024; Wang et al., 2024). Our work also focuses on automated data collection for PRMs but empirically argues for using the advantage function $A^\mu$ as step-level rewards along with $Q^\pi$, with a conceptual explanation in Section 3.1. Several prior works have explored step-level search algo...

  30. [30]

    policy in our didactic setup. rewards. In all three works, the gains observed by using PRMs that predict step-level correctness (similar to Lightman et al. (2023)) are quite small, compared to simply using trained ORMs, or the ground-truth outcome supervision Rex. In fact, Havrilla et al. (2024) states that the only algorithm that does well is a form of exp...

  31. [31]

    We train for 10,000 iterations in both cases, with a batch size of 64, and a constant learning rate of 1e-3 for the Adam optimizer. The RL runs are initialized with a supervised finetuned policy. For this we take a randomly initialized network, based on the MADE architecture (Germain et al., 2015), with 3 layers, and 128 hidden units in each. Then we tra...

  32. [32]

    dataset. The finetuning is done for 5000 iterations, with a batch size of 32, and a maximum learning rate of 5e-6 for the 2B and 9B models and 5e-7 for the 27B models. We trained the policies using the Adam optimizer, with a linear warm up and cosine decay learning rate schedule. The linear warm up is done for the first 500 iterations. For the base policies, we choos...

  33. [33]

    We use a linear warm up (till 2000 steps), followed by a cosine decay learning rate schedule to train the models. Since a pretrained LLM would output a matrix of logits (vocabulary size × sequence length) we fix a token as the “scoring token” to be the end of the sequence / prefix that needs to be scored. The logits of this scoring token are then used to d...

  34. [34]

    [Figure 11 plot, titled "Scaling Laws: First Pit": accuracy against training data size (64k to 1024k) for the "First pit" and "Random" strategies.] Figure 11 | First pit strategy from Luo et al. (2024); Setlur et al. (2024): We compare the beam search performance (with beam size

  35. [35]

    Additional: Experiments on RL Training with PAVs Training details

    E. Additional: Experiments on RL Training with PAVs Training details. As discussed in Section 5, the initialization for RL training is the RFT (rejection finetuned) checkpoint for the corresponding base policies. More specifically, we consider two base policies Gemma 2B SFT, and Gemma 9B SFT, where the RL training is initialized with the policy obtained b...

  36. [36]

    F.1. Natural Policy Gradient. The natural policy gradient (NPG) algorithm (Kakade, 2001a) defines a Fisher information matrix (induced by the policy), and performs gradient updates in the geometry induced by the following matrix: $F_\rho(\pi) = \mathbb{E}_{\mathbf{s} \sim d^{\pi}_{\rho}} \, \mathbb{E}_{a \sim \pi(\cdot \mid \mathbf{s})} \left[ \nabla_\pi \log \pi(a \mid \mathbf{s}) \, \nabla_\pi \log \pi(a \mid \mathbf{s})^{\top} \right]$ (11). Typically, the NPG update does gradient updates on the obje...