pith. machine review for the scientific record.

arxiv: 2312.08935 · v3 · submitted 2023-12-14 · 💻 cs.AI · cs.CL · cs.LG

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Pith reviewed 2026-05-14 22:29 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords process reward model · automatic supervision · math reasoning · step-by-step PPO · LLM verification · GSM8K · MATH · reinforcement learning

The pith

Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Math-Shepherd, a process reward model that scores each individual step in a math solution path. It trains this model on step labels built automatically from whether sampled solution trajectories reach the ground-truth final answer, removing the need for human step labels. The trained model supports verification, by reranking multiple LLM outputs according to their aggregated step scores, and reinforcement, by providing per-step reward signals during PPO training. Experiments report accuracy gains on GSM8K and MATH for models including Mistral-7B.
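
The verification use can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: this page does not state how per-step scores are combined into one solution score, so the aggregation rule is left as a parameter, and the function name `rerank` and the example scores are hypothetical.

```python
from math import prod
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, List[float]]  # (solution text, per-step reward scores)


def rerank(candidates: Sequence[Candidate],
           aggregate: Callable[[Sequence[float]], float] = min) -> Candidate:
    """Return the candidate whose aggregated step score is highest.

    `aggregate` is an assumption: min and product are two common choices,
    and the rule actually used by the paper is not given on this page.
    Each candidate must have at least one step score.
    """
    return max(candidates, key=lambda c: aggregate(c[1]))


# Hypothetical step scores for three sampled solutions to one problem.
candidates = [
    ("solution A", [0.9, 0.8, 0.2]),
    ("solution B", [0.7, 0.7, 0.7]),
    ("solution C", [0.99, 0.5, 0.99]),
]

best_by_min = rerank(candidates, min)
best_by_product = rerank(candidates, prod)
```

With these made-up scores the two rules disagree (min prefers the uniformly decent solution B, product prefers solution C), which is why the aggregation choice needs to be stated for results to be reproducible.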

Core claim

Math-Shepherd is a process-oriented reward model trained with automatically constructed process-wise supervision data that labels individual reasoning steps as correct or incorrect. When applied to verification through reranking of LLM outputs or to step-by-step PPO reinforcement, it produces measurable accuracy improvements such as raising Mistral-7B from 77.9 percent to 84.1 percent on GSM8K and from 28.6 percent to 33.0 percent on MATH, with further gains to 89.1 percent and 43.5 percent when verification is added.
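
The headline numbers can be broken down with simple arithmetic. The sketch below uses only the accuracies quoted above; the relative error reduction is a derived quantity added here for illustration and is not a figure reported by the paper.

```python
# Accuracies (percent) for Mistral-7B as quoted in the core claim.
results = {
    "GSM8K": {"before": 77.9, "after_ppo": 84.1, "after_ppo_and_verification": 89.1},
    "MATH": {"before": 28.6, "after_ppo": 33.0, "after_ppo_and_verification": 43.5},
}


def breakdown(r: dict) -> dict:
    """Split the total improvement into a PPO part and a verification part."""
    ppo_gain = r["after_ppo"] - r["before"]
    verification_gain = r["after_ppo_and_verification"] - r["after_ppo"]
    # Derived metric (not from the paper): share of remaining errors removed by PPO.
    relative_error_reduction = 100.0 * ppo_gain / (100.0 - r["before"])
    return {
        "ppo_gain_points": round(ppo_gain, 1),
        "verification_gain_points": round(verification_gain, 1),
        "ppo_relative_error_reduction_pct": round(relative_error_reduction, 1),
    }


summary = {name: breakdown(r) for name, r in results.items()}
```

Read this way, step-by-step PPO accounts for 6.2 points on GSM8K and 4.4 points on MATH, and verification adds a further 5.0 and 10.5 points respectively.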

What carries the argument

Math-Shepherd, a process reward model that assigns a scalar score to each reasoning step using automatically generated supervision signals.
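
To make the difference between step-level and outcome-level reward concrete, the following sketch lays out where reward lands along a generated token sequence. It is an assumption-laden illustration: placing each step's score on the last token of that step is one plausible convention, and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def stepwise_token_rewards(step_lengths: Sequence[int],
                           step_scores: Sequence[float]) -> List[float]:
    """Place each step's score on the final token of that step (assumed convention)."""
    if len(step_lengths) != len(step_scores):
        raise ValueError("need one score per step")
    rewards: List[float] = []
    for length, score in zip(step_lengths, step_scores):
        if length < 1:
            raise ValueError("each step must contain at least one token")
        rewards.extend([0.0] * (length - 1))  # no reward inside the step
        rewards.append(float(score))          # reward at the step boundary
    return rewards


def outcome_only_token_rewards(step_lengths: Sequence[int],
                               final_score: float) -> List[float]:
    """Baseline for comparison: a single reward on the very last token."""
    total = sum(step_lengths)
    if total < 1:
        raise ValueError("sequence must contain at least one token")
    return [0.0] * (total - 1) + [float(final_score)]


# Hypothetical solution with two steps of 3 and 2 tokens.
dense = stepwise_token_rewards([3, 2], [0.9, 0.4])
sparse = outcome_only_token_rewards([3, 2], 1.0)
```

The dense layout gives the policy feedback at every step boundary, whereas the sparse layout defers all credit to the end of the solution.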

Load-bearing premise

Automatically constructed process-wise supervision data accurately labels correct versus incorrect reasoning steps without systematic bias or noise from the generation process itself.

What would settle it

A large-scale human annotation study on held-out solution steps that finds the automatic labels disagree with expert judgments on a substantial fraction of steps would falsify the central claim.
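
A study of this kind reduces to estimating a disagreement rate with its uncertainty. The sketch below computes the observed rate and a Wilson score interval; the function name and the toy labels are hypothetical, and the threshold for a "substantial fraction" is not specified on this page, so none is applied.

```python
import math
from typing import Sequence, Tuple


def disagreement_rate(auto_labels: Sequence[int],
                      human_labels: Sequence[int],
                      z: float = 1.96) -> Tuple[float, float, float]:
    """Return (rate, lower, upper): observed disagreement and a Wilson interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(auto_labels)
    k = sum(a != h for a, h in zip(auto_labels, human_labels))
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Toy data: 10 steps, automatic and human labels differ on 2 of them.
auto = [1] * 10
human = [1] * 8 + [0] * 2
rate, lower, upper = disagreement_rate(auto, human)
```

With only a handful of annotated steps the interval is wide, which is the practical argument for the large-scale study described above.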

Original abstract

In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Math-Shepherd, a process reward model for step-level supervision in mathematical reasoning. It is trained solely on automatically constructed process-wise labels derived from sampling multiple solution trajectories per problem and back-propagating final-answer correctness, without human annotations. The model is applied in two settings: verification via reranking of LLM outputs and reinforcement learning via step-by-step PPO. Reported results include accuracy gains for Mistral-7B from 77.9% to 84.1% on GSM8K and 28.6% to 33.0% on MATH via PPO, with further lifts to 89.1% and 43.5% when combined with verification.
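
The labeling idea summarized above can be sketched as follows. This is one plausible reading, not the paper's exact rule: it assumes that several continuations are sampled from each step and that the step's label depends on how many of them reach the ground-truth answer. Both a binary and a continuous variant are shown because this page does not say which is used; all names and the toy data are hypothetical.

```python
from typing import List, Sequence


def label_step(completion_answers: Sequence[str], gold: str,
               mode: str = "hard") -> float:
    """Label one step from the final answers of continuations sampled after it.

    mode="hard": 1.0 if any continuation reaches the gold answer, else 0.0.
    mode="soft": fraction of continuations that reach the gold answer.
    """
    if not completion_answers:
        raise ValueError("need at least one sampled continuation")
    hits = sum(answer == gold for answer in completion_answers)
    if mode == "hard":
        return 1.0 if hits > 0 else 0.0
    if mode == "soft":
        return hits / len(completion_answers)
    raise ValueError(f"unknown mode: {mode}")


def label_solution(answers_per_step: Sequence[Sequence[str]], gold: str,
                   mode: str = "hard") -> List[float]:
    """Label every step of one solution; no human annotation is involved."""
    return [label_step(answers, gold, mode) for answers in answers_per_step]


# Toy data: final answers of 4 continuations sampled after each of 3 steps.
answers_per_step = [
    ["18", "18", "20", "18"],  # after step 1: most continuations succeed
    ["18", "20", "20", "20"],  # after step 2: one continuation succeeds
    ["20", "20", "20", "20"],  # after step 3: none succeed
]
hard_labels = label_solution(answers_per_step, gold="18")
soft_labels = label_solution(answers_per_step, gold="18", mode="soft")
```

In this toy case the hard labels are `[1.0, 1.0, 0.0]` and the soft labels are `[0.75, 0.25, 0.0]`, so the two variants already disagree about how much to trust step 2.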

Significance. If the automatic labeling procedure reliably captures step correctness, the work would be significant for scaling process supervision in LLMs by removing the annotation bottleneck. The concrete benchmark lifts on GSM8K and MATH, achieved with open-source models and reproducible PPO training, would demonstrate practical value for both verification and RL pipelines in mathematical reasoning.

major comments (2)
  1. [§3] §3 (process-wise supervision construction): the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise (correct early steps followed by later errors receive negative labels; incorrect steps compensated later receive positive labels). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.
  2. [Experimental results] Experimental setup and results sections: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing because the reported PPO improvements (e.g., Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment without such checks.
minor comments (2)
  1. [Abstract] Abstract and §4: results are detailed only for Mistral-7B while the text refers to 'a series of open-source LLMs'; listing the full set of evaluated models and their individual gains would improve completeness.
  2. [§3] Notation in §3: the precise definition of step reward (binary vs. continuous) and how it is aggregated across sampled trajectories should be stated more explicitly to allow reproduction.
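
The noise pattern raised in major comment 1 can be illustrated with a deliberately coarse model in which every step simply inherits the outcome of its trajectory. This is a worst-case reading used for illustration only, not a description of the paper's pipeline; the human labels are invented and the function names are hypothetical.

```python
from typing import List, Sequence


def propagate_outcome(num_steps: int, final_correct: bool) -> List[int]:
    """Coarsest labeling: every step inherits the trajectory's final outcome."""
    return [1 if final_correct else 0] * num_steps


def mislabeled_steps(auto: Sequence[int], human: Sequence[int]) -> List[int]:
    """Indices where the automatic label disagrees with the human label."""
    if len(auto) != len(human):
        raise ValueError("label lists must have equal length")
    return [i for i, (a, h) in enumerate(zip(auto, human)) if a != h]


# Failing trajectory: two correct steps, then an error that is never repaired.
human_failing = [1, 1, 0, 0]
auto_failing = propagate_outcome(len(human_failing), final_correct=False)
false_negatives = mislabeled_steps(auto_failing, human_failing)

# Successful trajectory: step 2 is wrong but a later step compensates.
human_lucky = [1, 0, 1]
auto_lucky = propagate_outcome(len(human_lucky), final_correct=True)
false_positives = mislabeled_steps(auto_lucky, human_lucky)
```

Comparing such index lists against a human-annotated sample is the kind of quantitative validation the comment asks for.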

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment below and commit to revisions that strengthen the presentation of our automatic labeling approach and experimental controls.

Point-by-point responses
  1. Referee: [§3] §3 (process-wise supervision construction): the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise (correct early steps followed by later errors receive negative labels; incorrect steps compensated later receive positive labels). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.

    Authors: We acknowledge that propagating final-answer correctness to individual steps can introduce label noise: correct early steps in failing trajectories receive negative labels, and erroneous steps that are compensated later in successful trajectories receive positive labels. This is an inherent trade-off of our fully automatic method, which avoids human annotations. We maintain that the resulting process reward model still provides net-positive step-level signals on average, as shown by consistent gains in both the verification (reranking) and RL (PPO) settings. In revision we will add an explicit limitations subsection discussing this noise source and will include a small-scale human validation study on a random sample of 200 steps to report estimated label accuracy.
    revision: partial

  2. Referee: [Experimental results] Experimental setup and results sections: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing because the reported PPO improvements (e.g., Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment without such checks.

    Authors: All training trajectories are generated exclusively from the official training splits of GSM8K and MATH; the test sets are held out entirely. We will add an explicit statement and table confirming this separation in the revised experimental setup. For auto-labeling error rates we will add an analysis that measures label consistency across multiple independent trajectory samples per problem and reports the fraction of steps whose label flips when a different successful or unsuccessful trajectory is chosen. These additions will allow readers to assess the reliability of the step-level credit assignment.
    revision: partial
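
Both checks promised in the second response are straightforward to prototype. The sketch below is a hedged illustration under simple assumptions: a step counts as "flipped" if independent relabeling runs do not all agree, and leakage is detected only as exact matches after whitespace and case normalization. The function names and toy inputs are hypothetical.

```python
from typing import List, Sequence


def flip_fraction(labels_per_step: Sequence[Sequence[int]]) -> float:
    """Fraction of steps whose label is not identical across relabeling runs."""
    if not labels_per_step:
        raise ValueError("need labels for at least one step")
    flipped = sum(len(set(labels)) > 1 for labels in labels_per_step)
    return flipped / len(labels_per_step)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def overlapping_problems(train: Sequence[str], test: Sequence[str]) -> List[str]:
    """Test problems that also appear verbatim (after normalization) in train."""
    train_set = {normalize(problem) for problem in train}
    return [problem for problem in test if normalize(problem) in train_set]


# Toy data: labels for 4 steps, each relabeled in 3 independent runs.
runs = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
fraction = flip_fraction(runs)

leaked = overlapping_problems(
    train=["What is 2 + 2?"],
    test=["what is  2 + 2?", "What is 3 + 3?"],
)
```

Exact-match overlap only catches verbatim duplication; paraphrased or templated leakage would need a fuzzier comparison, which is outside the scope of this sketch.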

Circularity Check

0 steps flagged

No significant circularity; empirical gains on external benchmarks

Full rationale

The paper's central claims rest on measured accuracy improvements for Mistral-7B and other LLMs on the fixed external benchmarks GSM8K and MATH. The automatic construction of process-wise labels (via sampling trajectories and final-answer matching to ground truth) is an input to training the reward model; the subsequent PPO and verification steps are evaluated against those same independent benchmarks rather than against quantities defined from the fitted model itself. No equation or derivation reduces a claimed prediction to a fitted parameter by construction, and no load-bearing self-citation chain is required for the reported results. The reported gains therefore remain falsifiable against external benchmarks rather than holding by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that automatically generated step labels are sufficiently accurate to train a useful reward model; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Automatically generated process-wise supervision data accurately distinguishes correct from incorrect reasoning steps.
    The entire training pipeline rests on this premise to replace human annotations.

pith-pipeline@v0.9.0 · 5570 in / 1150 out tokens · 41096 ms · 2026-05-14T22:29:38.954816+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....

  2. POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

    cs.SE 2026-05 unverdicted novelty 7.0

    POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

  3. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  4. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  5. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    cs.CL 2024-10 conditional novelty 7.0

    Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

  6. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  7. CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

    cs.CV 2026-05 unverdicted novelty 6.0

    CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

  8. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  9. Process Supervision of Confidence Margin for Calibrated LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

  10. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  11. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  15. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  16. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  17. ReMedi: Reasoner for Medical Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 5.0

    ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.

  18. Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

    cs.CL 2026-04 unverdicted novelty 5.0

    Groupwise Ranking Reward reduces reasoning-answer inconsistency in multimodal models and raises reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.

  19. Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.

  20. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  21. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  22. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08
