WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Pith reviewed 2026-05-17 03:56 UTC · model grok-4.3
The pith
WizardMath applies reinforced evol-instruct feedback to boost LLMs' math chain-of-thought reasoning without external tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to the math domain, WizardMath enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools, yielding WizardMath-Mistral 7B, which surpasses top-tier open-source LLMs, and WizardMath 70B, which outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and GPT-4-early-version on GSM8k and MATH.
What carries the argument
Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which evolves math instructions iteratively and reinforces correct reasoning steps through feedback.
Load-bearing premise
The reported performance gains stem primarily from the RLEIF procedure rather than from differences in the base model, data mixture, or evaluation protocol.
What would settle it
A controlled experiment that fine-tunes the identical base models on the same evolved instructions but omits the reinforcement learning feedback loop and checks whether the large accuracy lifts on GSM8K and MATH still appear.
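The settling experiment amounts to an ablation plus a significance check. A minimal sketch of the comparison, assuming per-problem 0/1 grading records from both runs are available (the accuracy figures below are illustrative placeholders, not numbers reported by the paper):

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, seed=0):
    """95% CI for accuracy(a) - accuracy(b) via bootstrap over problem indices."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Hypothetical per-problem 0/1 outcomes on a 1000-problem slice of GSM8k.
rleif_run = [1] * 840 + [0] * 160     # 84.0% accuracy (placeholder)
sft_only_run = [1] * 790 + [0] * 210  # 79.0% accuracy (placeholder)
lo, hi = bootstrap_diff_ci(rleif_run, sft_only_run)
print(f"accuracy lift 95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero once the SFT-only baseline is held to the identical evolved data, the lift is attributable to the RL stage rather than to data quality alone.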
read the original abstract
Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details refer to https://github.com/nlpxucan/WizardLM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WizardMath, which applies Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to boost chain-of-thought mathematical reasoning in LLMs without external tools. It reports that WizardMath-Mistral-7B substantially outperforms leading open-source models on GSM8k and MATH, while WizardMath-70B surpasses GPT-3.5-Turbo, Claude 2, Gemini Pro, and an early GPT-4 variant; a preliminary analysis emphasizes the roles of instruction evolution and process supervision.
Significance. If the gains are shown to stem specifically from RLEIF rather than from data quality or base-model differences, the work would provide a practical recipe for elevating open-source mathematical reasoning to near-proprietary levels using only evolved instructions and process-level RL. The preliminary ablation-style exploration of instruction evolution and process supervision would then serve as a useful starting point for follow-on research.
major comments (2)
- [Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.
- [Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.
minor comments (2)
- [Conclusion] The GitHub link is referenced but the paper would benefit from an explicit statement of which artifacts (code, data splits, evaluation prompts) are released.
- [Method] Notation for the process-supervision reward model could be introduced earlier and used consistently in the method description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.
read point-by-point responses
-
Referee: [Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.
Authors: We agree that an explicit SFT-only baseline on the identical dataset would help isolate the contribution of the RLEIF stage. In the revised manuscript we will add results from such a baseline trained on the same Evol-Instruct data, allowing direct comparison of performance before and after the reinforcement learning phase. revision: yes
-
Referee: [Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.
Authors: We acknowledge the value of these details for assessing reliability. The revised version will include error bars from repeated evaluations where computationally feasible, explicit prompt formatting descriptions, data exclusion criteria, and full training curves in the appendix to support the reported comparisons. revision: yes
Circularity Check
No circularity: empirical claims rest on external benchmarks without self-referential derivations
full rationale
The paper introduces the RLEIF procedure and reports accuracy numbers on GSM8k and MATH, comparing WizardMath variants against GPT-3.5-Turbo, Claude 2, Gemini Pro and early GPT-4. No equations appear that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The derivation chain consists of standard RL training steps whose outputs are evaluated on independent test sets; therefore the headline performance numbers are not equivalent to the inputs by construction.
Forward citations
Cited by 18 Pith papers
-
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
CRPS synthesizes reasoning paths by contrasting high- and low-quality MCTS trajectories, enabling models trained on 60K examples to match or exceed those trained on 590K standard examples with better out-of-domain gen...
-
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
-
Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
-
CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
ARMove: Learning to Predict Human Mobility through Agentic Reasoning
ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
-
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Reference graph
Works this paper leans on
-
[1]
URL https://api.semanticscholar.org/CorpusID:266818336. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Wint...
work page 2020
-
[2]
Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421. Published as a conference paper at ICLR 2025. Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884, 2024. Rohan Taori, Ishaan Gulrajani, Tiany...
-
[3]
Instruction Evolution and SFT In the first step, we apply upward and downward instruction evolution on the GSM8k and MATH datasets, generating evolved instructions for the SFT. On the leftmost side of Figure 1, the three blue arrows, from top to bottom, represent: (a) the adoption of the instruction evolution technique, (b) the generation of evolved instr...
-
[4]
“A” represents the original instruction, while “B,
Reward Model Training The second step involves two reward models: the Instruction Quality Scoring Reward Model (IRM) and the Process-Supervised Reward Model (PRM), depicted in the central section of Figure 1. • IRM: We employ upward and downward evolution on a seed instruction, yielding five instructions (original + evolved). These instructions are ranked...
-
[5]
Reinforcement Learning with PPO In the final step, we integrate the IRM and PRM within a Proximal Policy Optimization (PPO)-based reinforcement learning framework. As depicted in the far-right section of Figure 1, the process is as follows: (a) The first blue arrow represents instruction scoring by the IRM. (b) The second blue arrow shows PPO initializati...
work page 2025
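The excerpted pipeline combines the two reward models at PPO time, but the exact combination rule is truncated above. The sketch below assumes a multiplicative combination of the instruction-level IRM score with an aggregate of the step-level PRM scores; the class and function names (`RLEIFReward`, `score_instruction`, `score_step`) are illustrative, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RLEIFReward:
    """Combine instruction-quality (IRM) and process-supervision (PRM) scores.

    score_instruction: maps an instruction to a quality score in [0, 1].
    score_step: maps (instruction, steps-so-far) to a per-step score in [0, 1].
    """
    score_instruction: Callable[[str], float]
    score_step: Callable[[str, List[str]], float]

    def __call__(self, instruction: str, steps: List[str]) -> float:
        r_irm = self.score_instruction(instruction)
        # Score each growing prefix of the solution, then aggregate.
        step_scores = [self.score_step(instruction, steps[: i + 1])
                       for i in range(len(steps))]
        # min() is a conservative aggregate: one bad step sinks the trajectory.
        r_prm = min(step_scores) if step_scores else 0.0
        # Product combination of the two rewards is an assumption here;
        # the excerpt truncates before the exact formula.
        return r_irm * r_prm

# Toy scorers standing in for the trained reward models.
reward = RLEIFReward(
    score_instruction=lambda ins: 0.9,
    score_step=lambda ins, prefix: 0.8,
)
print(reward("Solve 2x+3=7", ["2x = 4", "x = 2"]))  # product of 0.9 and 0.8
```

In a PPO loop, this scalar would replace the usual outcome-only reward on each sampled rollout.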
-
[6]
• On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math
Performance Comparison: • On Llama-2-7B and Mistral-7B-v0.1, WizardMath-SFT performs marginally below SOTA models (i.e., Xwin-Math and Skywork-Math) and outperforms existing other excellent models (i.e., DART-Math). • On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math. • On all various base models, WizardMath-SFT s...
-
[7]
Comparison with advanced data synthesis methods (i.e., DART-Math, MetaMath) As shown in the following Table 15, DART-Math demonstrates strong performance across various base models and the data synthesis method proposed by DART-Math shows the effectiveness and outstanding performance. Meanwhile, WizardMath-SFT demonstrates comparable or superior performan...
work page 2025
-
[8]
It also significantly enhances the mathematical reasoning capabilities of our models
The proposed Math Evol Instruct data synthesis method is also as effective and practical as the current state-of-the-art data synthesis methods, such as DART-Math, Skywork-Math and Xwin-Math in the SFT stage. It also significantly enhances the mathematical reasoning capabilities of our models
-
[9]
The proposed IRM and PRM models substantially improve performance during the reinforcement learning phase. They not only continuously enhance the mathematical reasoning abilities of our … Table 18: The performance comparison of WizardMath-SFT with DART-Math, Xwin-Math, and Skywork-Math on the Llama2-7B base mo...
work page 2025
-
[10]
In Table 6, we provide a detailed analysis of the effects of downward evolution
Unlike WizardLM/WizardCoder, which primarily focus on increasing instruction difficulty, we are the first to propose the novel concept of downward evolution, a major distinction in instruction evolution. In Table 6, we provide a detailed analysis of the effects of downward evolution. Specifically, two rounds of downward evolution led to a remarkable impro...
-
[11]
In reinforcement learning (RL) training, we firstly propose the instruction quality scoring reward model (IRM) combined with the process supervision reward model (PRM) further enhancing WizardMath mathematical reasoning ability. As demonstrated in Table 3, our method achieves a remarkable 5%–8% improvement in GSM8k and MATH performance over the SFT backbo...
-
[12]
Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems
We firstly propose to use AI to annotate the step-level PRM training data. Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems. This fully AI-automated data generation pipeline ensures scalability
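One concrete reading of "use AI to annotate the step-level PRM training data" is Monte-Carlo completion labeling in the style of Math-Shepherd (cited in the forward references): a step is marked positive if at least one of k sampled continuations from that prefix reaches the reference answer. The sketch below assumes a hypothetical `sample_completion` callable standing in for the annotator model; it is not the paper's exact procedure.

```python
from typing import Callable, List

def label_steps(
    steps: List[str],
    reference_answer: str,
    sample_completion: Callable[[List[str]], str],
    k: int = 8,
) -> List[int]:
    """Assign a 0/1 label to each solution step.

    A step prefix is labeled 1 if at least one of k sampled completions
    from that prefix ends in the reference answer (hard estimation).
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hit = any(sample_completion(prefix) == reference_answer
                  for _ in range(k))
        labels.append(1 if hit else 0)
    return labels

# Toy completer: recovers only when the prefix contains the correct branch.
completer = lambda prefix: "x = 2" if "2x = 4" in prefix else "x = 5"
print(label_steps(["2x = 4", "x = 2"], "x = 2", completer))  # [1, 1]
print(label_steps(["2x = 10"], "x = 2", completer))          # [0]
```

The resulting (prefix, label) pairs form the fully synthetic PRM training set, with no human step annotations in the loop.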
-
[13]
WizardMath demonstrates outstanding performance across a wide range of model scales, from 100M to 1B and 70B parameters, on the benchmarks such as GSM8k, MATH, and out-of-distribution (OOD) tasks like MWPBench (Tang et al., 2024). It surpasses all existing open-source state-of-the-art models, showcasing the effectiveness and robustness of the RLEIF approa...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.