Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Pith reviewed 2026-05-18 23:53 UTC · model grok-4.3
The pith
Step-wise preference optimization on individual reasoning steps improves long-chain mathematical accuracy in LLMs more effectively than whole-answer DPO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Step-DPO reframes preference optimization so that each individual reasoning step becomes the unit of comparison rather than the full final answer. The authors construct a dataset of 10K step-wise preference pairs and show that training on self-generated pairs yields better results than out-of-distribution data. When applied to Qwen2-72B-Instruct the resulting model reaches 70.8 percent on the MATH test set and 94.0 percent on GSM8K, exceeding GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
What carries the argument
Step-wise preference pairs that contrast a correct reasoning step with an incorrect one at the identical position in the chain, allowing Direct Preference Optimization to operate at process granularity instead of outcome granularity.
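As a rough illustration of what operating at process granularity means, the sketch below applies the standard DPO loss form to the log-probabilities of a single candidate step instead of a whole answer. The function names, the scalar log-probability inputs, and the default `beta` are assumptions made for this example, not values or code from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def step_dpo_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,  # assumed value for illustration, not from the paper
) -> float:
    """DPO-style loss for one step-wise preference pair.

    Every argument is the log-probability of ONE candidate step, conditioned
    on the problem and the shared prefix of earlier steps. In whole-answer
    DPO the same formula would take log-probabilities of complete answers.
    """
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    return -log_sigmoid(beta * (win_margin - lose_margin))
```

When the policy equals the reference model, both margins are zero and the loss is log 2; it falls as the policy shifts probability toward the preferred step and away from the rejected one.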
If this is right
- Models learn to detect and avoid specific errors inside long reasoning chains rather than only judging final answers.
- Only 10K step-wise pairs and under 500 training steps suffice for a nearly 3 percent accuracy increase on MATH for models exceeding 70B parameters.
- Self-generated data outperforms human-written or GPT-4-generated data for this style of preference optimization.
- Open models can reach or exceed the math performance of several closed-source frontier models.
Where Pith is reading between the lines
- The same step-level signal could be applied to other sequential tasks such as code generation or multi-step scientific reasoning where error localization matters.
- Automated ways to generate or verify step labels might remove the remaining human effort in the pipeline and allow further scaling.
- Process-level preference data may reduce the total volume of feedback needed for alignment compared with outcome-only methods.
Load-bearing premise
The pipeline that creates the step-wise preference pairs must label correct and incorrect steps accurately and without introducing systematic errors or shifts in data distribution.
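To make the premise concrete, here is a minimal sketch of how a step-wise pair might be assembled once the first erroneous step has been located. The callables `is_step_correct` and `propose_correct_step` are hypothetical stand-ins for whatever verifier and generator a real pipeline uses; any mistake they make flows directly into the training pair, which is why labeling accuracy matters so much.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepPair:
    prompt: str
    prefix_steps: List[str]   # steps shared by both candidates
    chosen_step: str          # preferred step at position k
    rejected_step: str        # erroneous step at position k


def first_error_index(
    steps: List[str],
    is_step_correct: Callable[[List[str], str], bool],  # hypothetical verifier
) -> Optional[int]:
    """Return the index of the first step the verifier rejects, if any."""
    for i, step in enumerate(steps):
        if not is_step_correct(steps[:i], step):
            return i
    return None


def build_step_pair(
    prompt: str,
    wrong_solution_steps: List[str],
    is_step_correct: Callable[[List[str], str], bool],
    propose_correct_step: Callable[[str, List[str]], Optional[str]],  # hypothetical
) -> Optional[StepPair]:
    """Pair the first erroneous step with a replacement at the same position."""
    k = first_error_index(wrong_solution_steps, is_step_correct)
    if k is None:
        return None  # no error found, so no pair can be built
    prefix = wrong_solution_steps[:k]
    chosen = propose_correct_step(prompt, prefix)
    if chosen is None:
        return None  # generator failed to produce a usable step
    return StepPair(prompt, prefix, chosen, wrong_solution_steps[k])
```

Both candidates share the same prompt and prefix, so the only difference the optimizer sees is the step at position k.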
What would settle it
If training with the Step-DPO pairs produced no accuracy gain, or a loss, on the MATH test set relative to standard DPO or the untuned base model, the core claim would not hold.
Original abstract
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Step-DPO, an extension of Direct Preference Optimization that operates on individual reasoning steps rather than complete answers, to improve long-chain mathematical reasoning in LLMs. It presents a custom pipeline for constructing 10K step-wise preference pairs (emphasizing self-generated data over GPT-4 or human data), reports that fewer than 500 training steps on these pairs yield nearly 3% accuracy gains on MATH for >70B models, and claims that Step-DPO applied to Qwen2-72B-Instruct reaches 70.8% on MATH and 94.0% on GSM8K, surpassing GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
Significance. If the step-level labels are reliable, the work demonstrates a data-efficient route to process supervision within the DPO framework for complex reasoning, with the self-generated data observation providing a useful practical insight. Public release of code, data, and models is a clear strength that aids reproducibility and follow-up work.
major comments (2)
- [Section 3] Data construction pipeline (Section 3): The manuscript describes generating step-wise preference pairs by locating the first erroneous step but provides no quantitative validation of labeling accuracy, such as human agreement rates on a held-out sample, error analysis of mislabeled pairs, or checks for systematic biases (e.g., overlooking subtle arithmetic mistakes). This validation is load-bearing for the central claim that the 10K pairs produce genuine process-level supervision rather than spurious signals.
- [Section 4] Experiments and ablations (Section 4): While headline results on MATH and GSM8K are reported, the paper supplies limited controls to isolate the effect of step-wise versus answer-wise DPO or to rule out confounding factors such as the specific distribution of self-generated data versus the baseline training distribution. Additional ablations (e.g., random step labeling or answer-level DPO on the same 10K pairs) would strengthen the attribution of gains to the step-wise formulation.
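One way to report the agreement rates asked for in the first major comment is Cohen's kappa between the pipeline's step labels and a human annotator's labels on the same held-out sample. The sketch below is a plain-Python illustration; the label strings and sample are made up for the example and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two labelers on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each labeler's marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative data: pipeline labels vs. human labels for four steps.
pipeline = ["ok", "ok", "err", "err"]
human = ["ok", "ok", "err", "ok"]
kappa = cohens_kappa(pipeline, human)
```

Raw agreement alone can look high when most steps are correct, which is why a chance-corrected statistic is the more informative number to report.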
minor comments (2)
- [Section 2] Notation for the step-wise preference loss could be clarified with an explicit equation contrasting it to standard DPO (Eq. 1 in the paper).
- [Figure 2] Figure 2 or the data pipeline diagram would benefit from an example of a correctly versus incorrectly labeled step pair to illustrate the labeling rule.
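The contrast requested in the first minor comment could be written along the following lines. The first line is the standard DPO objective; the second is a step-wise analogue in which the compared units are single steps conditioned on the problem and the preceding steps. The symbols s_w, s_l, and s_{1:k-1} are notation chosen for this sketch and may differ from the paper's.

```latex
\begin{align}
\mathcal{L}_{\mathrm{DPO}}(\theta) &=
  -\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma \left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right] \\
\mathcal{L}_{\mathrm{step}}(\theta) &=
  -\mathbb{E}_{(x,\, s_{1:k-1},\, s_w,\, s_l) \sim D}
  \left[ \log \sigma \left(
    \beta \log \frac{\pi_\theta(s_w \mid x, s_{1:k-1})}{\pi_{\mathrm{ref}}(s_w \mid x, s_{1:k-1})}
    - \beta \log \frac{\pi_\theta(s_l \mid x, s_{1:k-1})}{\pi_{\mathrm{ref}}(s_l \mid x, s_{1:k-1})}
  \right) \right]
\end{align}
```

Here y_w and y_l are the preferred and rejected full answers, s_w and s_l are the preferred and rejected candidates for step k, and s_{1:k-1} is the shared prefix of earlier steps.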
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our Step-DPO paper. The suggestions regarding validation of the data construction pipeline and the need for additional controls in the experiments are helpful for strengthening the manuscript. We address each major comment below and have revised the paper accordingly to incorporate quantitative validation and further ablations.
- Referee: [Section 3] Data construction pipeline (Section 3): The manuscript describes generating step-wise preference pairs by locating the first erroneous step but provides no quantitative validation of labeling accuracy, such as human agreement rates on a held-out sample, error analysis of mislabeled pairs, or checks for systematic biases (e.g., overlooking subtle arithmetic mistakes). This validation is load-bearing for the central claim that the 10K pairs produce genuine process-level supervision rather than spurious signals.
  Authors: We agree that quantitative validation of the labeling accuracy is important to support the claim of reliable process-level supervision. In the revised manuscript, we have added a dedicated subsection in Section 3 describing a human evaluation study performed on a held-out sample of the preference pairs. This includes inter-annotator agreement rates, an error analysis of mislabeled cases, and explicit checks for systematic biases such as the potential overlooking of subtle arithmetic mistakes. The pipeline description has also been expanded to explain the multi-stage verification steps used to mitigate such biases. These additions provide direct evidence that the 10K pairs deliver genuine process supervision.
  Revision: yes
- Referee: [Section 4] Experiments and ablations (Section 4): While headline results on MATH and GSM8K are reported, the paper supplies limited controls to isolate the effect of step-wise versus answer-wise DPO or to rule out confounding factors such as the specific distribution of self-generated data versus the baseline training distribution. Additional ablations (e.g., random step labeling or answer-level DPO on the same 10K pairs) would strengthen the attribution of gains to the step-wise formulation.
  Authors: We acknowledge that stronger controls would better isolate the contribution of the step-wise formulation and rule out potential confounders from the data distribution. In the revised Section 4, we have added ablations that apply answer-level DPO to the exact same 10K preference pairs for direct comparison, as well as a random step labeling baseline. These experiments help demonstrate that the observed gains are attributable to accurate step-wise supervision rather than the self-generated data distribution alone. We have also clarified the distinctions between the training distributions in the discussion of results.
  Revision: yes
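A random step labeling control could take several forms. The sketch below shows one possible reading, in which the chosen and rejected steps of each pair are swapped with some probability so that the preference signal no longer tracks correctness. The `StepPair` container, `randomize_labels`, and `flip_prob` are hypothetical names for this illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class StepPair:
    prompt: str
    prefix_steps: Tuple[str, ...]
    chosen_step: str
    rejected_step: str


def randomize_labels(
    pairs: List[StepPair],
    seed: int = 0,
    flip_prob: float = 0.5,  # assumed setting: labels become uninformative
) -> List[StepPair]:
    """Return a copy of the data with chosen/rejected swapped at random."""
    rng = random.Random(seed)  # local generator keeps the control reproducible
    out = []
    for pair in pairs:
        if rng.random() < flip_prob:
            pair = replace(
                pair,
                chosen_step=pair.rejected_step,
                rejected_step=pair.chosen_step,
            )
        out.append(pair)
    return out


data = [StepPair("q", ("2+2=4",), "4*3=12", "4*3=11")] * 3
control = randomize_labels(data, seed=0)
```

If training on such a control recovered most of the reported gain, the improvement would be better explained by exposure to self-generated text than by accurate step-level supervision.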
Circularity Check
No significant circularity: Step-DPO extends DPO empirically to step pairs with held-out benchmark gains
The paper introduces Step-DPO as an application of the existing DPO objective to newly constructed step-level preference pairs generated via a custom pipeline. All reported performance numbers (e.g., 70.8% on MATH, 94.0% on GSM8K for Qwen2-72B-Instruct) are measured on standard held-out test sets that are independent of the training objective and data construction. No derivation step, equation, or claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central results remain externally falsifiable through benchmark evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of training steps: fewer than 500
Forward citations
Cited by 20 Pith papers
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
  RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
- Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
  CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
- On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
  Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.
- PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
  PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
- Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning
  Future Policy Approximation (FPA) improves offline RL for LLM mathematical reasoning by extrapolating future policies in logit space to proactively reweight gradients, yielding consistent gains over DPO, RPO, KTO and ...
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
  ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
- YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
  YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
  Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.
- Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
  A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
- SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
  SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
  A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
- Sample-efficient LLM Optimization with Reset Replay
  LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...
- From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
  A DPO framework augmented with curriculum learning and two new loss parameters generates veracity explanations for Hindi news using LLMs and PLMs.
- MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
  MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
- Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
  Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
- From System 1 to System 2: A Survey of Reasoning Large Language Models
  The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.