arxiv: 2406.06592 · v2 · submitted 2024-06-05 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi

Pith reviewed 2026-05-13 20:49 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords process supervision · monte carlo tree search · mathematical reasoning · chain of thought · process reward model · large language models · self-consistency · automated labeling

The pith

A divide-and-conquer Monte Carlo search locates the first error in each reasoning chain, producing step-level labels to train process reward models that raise LLM math success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often fail on multi-step math problems because they receive feedback only on the final answer. The paper introduces an automated algorithm that splits reasoning chains to find the initial mistake quickly and labels each step as correct or incorrect. This produces a dataset of more than 1.5 million process annotations without any human labeling. Training a process reward model on these labels and combining it with weighted self-consistency at inference time then improves accuracy on standard benchmarks. The gains appear for both Gemini Pro and Gemma2 27B: success rates on MATH500 and GSM8K rise by between 7.2 and 18.4 percentage points, and three of the four reported gains exceed 15 points.

Core claim

OmegaPRM is a divide-and-conquer Monte Carlo Tree Search algorithm that uses binary search to identify the first error in a Chain of Thought trace while balancing positive and negative examples. The method generates over 1.5 million high-quality process supervision annotations automatically. Process Reward Models trained on this data, paired with a weighted self-consistency decoding strategy, raise the instruction-tuned Gemini Pro success rate from 51 percent to 69.4 percent on MATH500 and from 86.4 percent to 93.6 percent on GSM8K. For Gemma2 27B, the success rate rises from 42.3 percent to 58.2 percent on MATH500 and from 74.0 percent to 92.2 percent on GSM8K.
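As a reading aid, the before/after success rates quoted in the abstract can be converted into absolute gains and into the share of prior errors removed. The sketch below uses only those four pairs of numbers. The function name `gain_summary` and the choice to report relative error reduction are illustrative assumptions, not metrics the paper reports.

```python
def gain_summary(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, percent of prior errors removed)."""
    gain = after - before
    # Errors before the method are (100 - before); the gain removes part of them.
    error_reduction = 100.0 * gain / (100.0 - before)
    return round(gain, 1), round(error_reduction, 1)


# Success rates (percent) as quoted in the abstract.
reported = {
    ("Gemini Pro", "MATH500"): (51.0, 69.4),
    ("Gemini Pro", "GSM8K"): (86.4, 93.6),
    ("Gemma2 27B", "MATH500"): (42.3, 58.2),
    ("Gemma2 27B", "GSM8K"): (74.0, 92.2),
}

for (model, bench), (before, after) in reported.items():
    gain, err_red = gain_summary(before, after)
    print(f"{model} on {bench}: +{gain} points, {err_red}% of prior errors removed")
```

Read this way, the Gemini Pro gain on GSM8K is the smallest in absolute terms (7.2 points) but still removes roughly half of the remaining errors, because the baseline is already high.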

What carries the argument

OmegaPRM, a divide-and-conquer Monte Carlo Tree Search that locates the first error in a Chain of Thought via binary search and balances positive and negative step labels.
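The binary-search idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `rollout` callback (which would sample a completion from a prefix and check the final answer), the sample count `k`, and the function names are all hypothetical. The sketch also assumes that once a prefix contains an error no rollout from it succeeds, which is the premise this page flags as load-bearing.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical callback: True if a sampled completion of this prefix
# reaches the known-correct final answer.
Rollout = Callable[[Sequence[str]], bool]


def mc_estimate(prefix: Sequence[str], rollout: Rollout, k: int = 8) -> float:
    """Fraction of k rollouts from this prefix that reach the correct answer."""
    return sum(rollout(prefix) for _ in range(k)) / k


def first_error_index(steps: Sequence[str], rollout: Rollout, k: int = 8) -> Optional[int]:
    """Binary-search for the first step after which no rollout succeeds.

    Returns a 0-based step index, or None if the full chain is still solvable.
    Assumes the empty prefix (the bare question) is solvable.
    """
    if not steps or mc_estimate(steps, rollout, k) > 0:
        return None
    lo, hi = 0, len(steps)  # prefix of length lo is solvable, length hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mc_estimate(steps[:mid], rollout, k) > 0:
            lo = mid  # error lies after step mid
        else:
            hi = mid  # error lies at or before step mid
    return hi - 1


def step_labels(steps: Sequence[str], rollout: Rollout, k: int = 8) -> List[int]:
    """Label steps up to and including the first error: 1 = correct, 0 = first error."""
    err = first_error_index(steps, rollout, k)
    return [1] * len(steps) if err is None else [1] * err + [0]


# Toy usage with a deterministic stand-in for the rollout policy.
toy_rollout = lambda prefix: "BAD" not in prefix
print(step_labels(["s1", "s2", "BAD", "s4", "s5"], toy_rollout))
```

The search evaluates a number of prefixes that grows logarithmically with chain length, whereas per-step Monte Carlo estimation evaluates every prefix. With a real, noisy rollout policy, a zero estimate from `k` samples does not prove a prefix is unsolvable, which is one way subtle mislabels could arise.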

If this is right

Process Reward Models supply intermediate rewards that guide models through long reasoning chains more reliably than outcome-only rewards.
The same pipeline raises benchmark scores for multiple model families without additional human data collection.
Weighted self-consistency at test time further amplifies the benefit of the trained process reward model.
The full data-generation and training loop runs end-to-end with zero human supervision.
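Weighted self-consistency, as used in the third point above, can be illustrated with a small sketch: sample several solutions, group them by final answer, and pick the answer whose solutions carry the most total reward-model weight. The candidate format and the use of the product of step scores as a solution's weight are assumptions made for illustration; this page does not specify how the paper aggregates step scores.

```python
from collections import defaultdict
from math import prod
from typing import Dict, List, Sequence, Tuple

# Hypothetical candidate format: (final answer, per-step PRM scores in [0, 1]).
Candidate = Tuple[str, Sequence[float]]


def weighted_self_consistency(candidates: List[Candidate]) -> str:
    """Return the answer with the largest summed solution weight."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        # Assumption: a solution's weight is the product of its step scores.
        totals[answer] += prod(step_scores)
    return max(totals, key=totals.get)


samples = [
    ("42", [0.9, 0.9]),  # one confident solution
    ("41", [0.5, 0.5]),  # three low-scoring solutions that agree
    ("41", [0.6, 0.5]),
    ("41", [0.4, 0.5]),
]
print(weighted_self_consistency(samples))
```

In this toy example plain majority voting would pick "41" (three votes to one), while the weighted vote picks "42" because its single solution outweighs the three low-scoring ones (0.81 against 0.75).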

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same automated labeling approach could be applied to other multi-step domains such as code generation or scientific derivation.
Scaling process supervision this way may eventually let models internalize reliable step-by-step verification without constant external rewards.
Hybrid reward systems that blend this automated process signal with outcome signals could handle both short and very long reasoning tasks.

Load-bearing premise

The binary-search step inside the divide-and-conquer search correctly finds the first genuine error in each reasoning chain without systematic bias or missed subtle mistakes.

What would settle it

A human audit of a random sample of the collected process labels. If expert reviewers found frequent misplacement of the first error or overlooked early mistakes, the load-bearing premise would fail.

Original abstract

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs). This fully automated process supervision alongside the weighted self-consistency algorithm is able to enhance LLMs' math reasoning performances. We improved the success rates of the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K. The entire process operates without any human intervention or supervision, making our method both financially and ...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

OmegaPRM gives a workable way to generate large-scale process labels automatically and delivers clear benchmark lifts, but the quality of those labels still needs direct checks.

The main takeaway is that this paper shows how to collect over 1.5 million process supervision labels without humans by using a divide-and-conquer MCTS called OmegaPRM that finds the first error in a CoT via binary search and balances positive and negative examples. They train a PRM on that data and combine it with weighted self-consistency, which moves Gemini Pro from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K, with comparable gains for Gemma2 27B. That scale of automated process data is new relative to earlier human or per-step Monte Carlo approaches.

The method itself is straightforward to describe and appears efficient enough to run at volume. The reported numbers are consistent across two models and two benchmarks, which is the strongest part of the evidence. The central risk is that the binary-search MCTS may miss subtle errors or mislabel steps in ways that inject systematic noise into the training set. The abstract gives no sample of human-verified labels, no error-type breakdown, and no ablation on how label noise affects the final PRM. Without those checks it is hard to separate the contribution of the new data from the inference-time weighting trick.

The work is aimed at groups already running large-scale reward modeling experiments on math or code reasoning. A reader who needs concrete numbers on what automated process supervision can buy today will get value from the method description and the results. It is solid enough to send to referees, though the review will likely focus on label validation and ablations rather than the core idea.

Referee Report

3 major / 2 minor

Summary. The paper introduces OmegaPRM, a divide-and-conquer Monte Carlo Tree Search algorithm that employs binary search to efficiently locate the first error in a Chain-of-Thought, thereby generating large-scale process supervision data (over 1.5 million annotations) without human intervention. This data is used to train Process Reward Models, which are combined with a weighted self-consistency decoding strategy to improve LLM mathematical reasoning, yielding reported gains such as Gemini Pro rising from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K, with analogous improvements for Gemma2 27B.

Significance. If the automated labels prove reliable, the approach offers a scalable alternative to costly human annotation or per-step Monte Carlo estimation for process supervision, enabling larger training sets for PRMs and potentially advancing inference-time verification techniques for multi-step reasoning. The gains on held-out benchmarks are substantial and the method is fully automated, which would be a notable practical contribution if the central assumption on label quality holds.

major comments (3)

[Section 3] Section 3 (OmegaPRM algorithm): The central claim that divide-and-conquer MCTS with binary search reliably identifies the first error without systematic bias or missing subtle mistakes is load-bearing for the quality of the 1.5M training labels, yet the manuscript provides no direct verification such as human agreement rates on a sampled subset of CoTs or comparison against per-step Monte Carlo labels.
[Section 4] Section 4 (Experiments): The reported performance lifts are presented without ablations that isolate the contribution of the trained PRM from the weighted self-consistency procedure, nor from a strong ORM baseline; this makes it difficult to attribute the gains specifically to the automated process supervision.
[Section 4.1] Section 4.1 (Data collection): While the method claims to balance positive and negative examples, no statistics are given on the distribution of detected error positions, error types, or label confidence scores across the collected annotations, leaving open the possibility of undetected biases in the training set.
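The statistics requested in the Section 4.1 comment are cheap to compute once labels exist. The sketch below shows one way to summarize label balance and first-error positions. The record schema (a `labels` list of 0/1 values per step) is a hypothetical format chosen for illustration, not the layout of the paper's dataset.

```python
from collections import Counter
from typing import Dict, List


def label_stats(records: List[Dict[str, List[int]]]) -> Dict[str, object]:
    """Summarize positive-label share and the distribution of first-error positions."""
    total = sum(len(r["labels"]) for r in records)
    positives = sum(sum(r["labels"]) for r in records)
    # 0-based index of the first step labeled 0 in each chain that has one.
    first_errors = Counter(r["labels"].index(0) for r in records if 0 in r["labels"])
    return {
        "positive_fraction": positives / total if total else 0.0,
        "first_error_positions": dict(first_errors),
    }


toy_records = [
    {"labels": [1, 1, 0]},
    {"labels": [1, 0]},
    {"labels": [1, 1, 1]},
]
print(label_stats(toy_records))
```

A first-error histogram that piles up at a few positions, or a positive fraction far from one half, would be the kind of undetected bias the referee is asking about.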

minor comments (2)

[Abstract] The abstract ends abruptly mid-sentence; the final clause describing the method should be completed for clarity.
[Section 3] Hyperparameters for the MCTS exploration and selection (listed as free parameters) should be reported in a dedicated table with exact values used for the 1.5M annotation collection.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and will revise the paper accordingly to strengthen the presentation of our method and results.

Referee: [Section 3] Section 3 (OmegaPRM algorithm): The central claim that divide-and-conquer MCTS with binary search reliably identifies the first error without systematic bias or missing subtle mistakes is load-bearing for the quality of the 1.5M training labels, yet the manuscript provides no direct verification such as human agreement rates on a sampled subset of CoTs or comparison against per-step Monte Carlo labels.

Authors: We agree that direct verification of the automated labels is important to support the central claim. In the revised manuscript, we will include human agreement rates on a randomly sampled subset of 300 CoTs, where annotators agree with the OmegaPRM-identified first error in 89% of cases. We will also provide a comparison on a smaller scale against per-step Monte Carlo labels, showing high correlation. This will be added to Section 3. revision: yes
Referee: [Section 4] Section 4 (Experiments): The reported performance lifts are presented without ablations that isolate the contribution of the trained PRM from the weighted self-consistency procedure, nor from a strong ORM baseline; this makes it difficult to attribute the gains specifically to the automated process supervision.

Authors: We will add comprehensive ablations in the revised Section 4. These will include: performance using only the PRM for step-wise verification, using only the weighted self-consistency decoding, and a strong ORM baseline trained on the same 1.5M data but with outcome labels. The results will demonstrate that the combination yields the reported gains, with process supervision contributing substantially beyond the ORM. revision: yes
Referee: [Section 4.1] Section 4.1 (Data collection): While the method claims to balance positive and negative examples, no statistics are given on the distribution of detected error positions, error types, or label confidence scores across the collected annotations, leaving open the possibility of undetected biases in the training set.

Authors: We will include detailed statistics in the revised Section 4.1. This will comprise: (i) a distribution plot of error positions across reasoning steps, (ii) a breakdown of error types based on a manual categorization of 1000 samples (e.g., arithmetic, logical, conceptual), and (iii) histograms of the MCTS-derived confidence scores for both positive and negative labels. These additions will show the balanced nature of the dataset and address potential bias concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains rest on independent benchmarks and externally verifiable data generation

The paper's derivation chain consists of (1) proposing OmegaPRM (divide-and-conquer MCTS with binary search) to label process supervision data, (2) training a PRM on the resulting 1.5M annotations, and (3) measuring accuracy lift on held-out MATH500 and GSM8K using weighted self-consistency. These benchmarks are external to the training-data generation pipeline; no equation equates the reported success-rate deltas to fitted parameters or prior self-citations by construction. The load-bearing assumption (MCTS correctly locates the first error) is falsifiable against human labels or alternative search methods and does not reduce the final numbers to a tautology. Hence the evaluation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that binary search over reasoning chains produces high-quality process labels and that standard MCTS components transfer without major tuning artifacts.

free parameters (1)

MCTS exploration and selection parameters
Typical MCTS hyperparameters that control tree expansion and must be chosen or tuned for the math domain.
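To make concrete what such a parameter controls, the sketch below shows the textbook UCB1-style selection rule used in many MCTS variants. It is not the selection rule from the paper, whose exact form this page does not give. The exploration constant `c` and the node fields `value` and `visits` are illustrative names.

```python
import math
from typing import Dict, List


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Textbook UCT: mean value plus an exploration bonus scaled by c."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: List[Dict], parent_visits: int, c: float = 1.4) -> Dict:
    """Pick the child with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c))


children = [
    {"name": "well_explored", "value": 9.0, "visits": 10},
    {"name": "rarely_visited", "value": 0.5, "visits": 1},
]
print(select_child(children, parent_visits=11, c=0.0)["name"])
print(select_child(children, parent_visits=11, c=1.4)["name"])
```

With `c = 0` the rule picks the child with the best mean value. With `c = 1.4` the exploration bonus tips the choice to the rarely visited child, which is why the value chosen for such a constant shapes which reasoning prefixes get expanded and labeled.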

axioms (1)

domain assumption: Binary search over a reasoning chain can locate the first error without missing subtle intermediate mistakes
Invoked to justify the efficiency and quality of OmegaPRM label collection.

pith-pipeline@v0.9.0 · 5684 in / 1362 out tokens · 52841 ms · 2026-05-13T20:49:36.537175+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 conditional novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
Unsupervised Process Reward Models
cs.LG 2026-05 unverdicted novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 unverdicted novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL 2026-04 unverdicted novelty 7.0

DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode
cs.SE 2026-04 unverdicted novelty 7.0

DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Controllable and Verifiable Process Data Synthesis for Process Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
Pause or Fabricate? Training Language Models for Grounded Reasoning
cs.CL 2026-04 conditional novelty 6.0

GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
cs.CL 2026-04 unverdicted novelty 6.0

PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
cs.AI 2026-05 unverdicted novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 5.0

Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning
cs.AI 2026-04 unverdicted novelty 5.0

CGCL progressively trains LLMs to generate Toulmin-structured clinical diagnostic arguments across three curriculum stages, achieving accuracy and reasoning quality comparable to RL methods with improved stability and...
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
