The fine-tuning is done for 5000 iterations, with a batch size of 32 and a maximum learning rate of 5e-6 for the 2B and 9B models and 5e-7 for the 27B models.
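The hyperparameters quoted above can be written down as a small configuration sketch. This is only an illustration: the names `FinetuneConfig`, `MAX_LR_BY_SIZE`, and `make_config` are hypothetical and do not come from the cited work, and the excerpt does not state the optimizer, warmup, or decay schedule, so none is modeled here.

```python
from dataclasses import dataclass

# Maximum learning rate per model size, as quoted in the excerpt.
# The dictionary name and keys are illustrative assumptions.
MAX_LR_BY_SIZE = {"2B": 5e-6, "9B": 5e-6, "27B": 5e-7}


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the quoted fine-tuning settings."""
    model_size: str
    max_learning_rate: float
    num_iterations: int = 5000  # quoted: 5000 iterations
    batch_size: int = 32        # quoted: batch size of 32


def make_config(model_size: str) -> FinetuneConfig:
    """Build the config for one of the three quoted model sizes."""
    if model_size not in MAX_LR_BY_SIZE:
        raise ValueError(f"unknown model size: {model_size!r}")
    return FinetuneConfig(
        model_size=model_size,
        max_learning_rate=MAX_LR_BY_SIZE[model_size],
    )


if __name__ == "__main__":
    for size in MAX_LR_BY_SIZE:
        print(make_config(size))
```

Keeping the per-size learning rates in one lookup table makes the single difference between the runs (a tenfold smaller maximum learning rate for 27B) easy to see, while the iteration count and batch size stay shared defaults.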

dataset · 2024

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

cs.LG · 2024-10-10 · unverdicted · novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.

citing papers explorer


Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning cs.LG · 2024-10-10 · unverdicted · none · ref 32
