rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Pith reviewed 2026-05-17 04:38 UTC · model grok-4.3
The pith
Small language models reach expert math reasoning by evolving their own search and evaluation processes over repeated rounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a self-evolution process lets small language models reach state-of-the-art math reasoning without distillation from stronger models. In this process a policy model generates step-by-step solutions while a process preference model supplies step-level guidance during Monte Carlo Tree Search. After four rounds that synthesize millions of verified trajectories across 747,000 math problems, the approach raises MATH accuracy from 58.8 percent to 90.0 percent for one 7-billion-parameter model and from 41.4 percent to 86.4 percent for a 3.8-billion-parameter model, exceeding results from certain larger preview systems and solving more than half of the problems on a high-school math olympiad (AIME).
What carries the argument
The self-evolution recipe that alternately refines a math policy model and a process preference model using verified reasoning trajectories produced by code-augmented chain-of-thought synthesis and Monte Carlo Tree Search rollouts.
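To make the recipe concrete, here is a minimal sketch of the round structure, with hypothetical helpers run_mcts, verify_with_code, finetune_policy, and train_ppm passed in as stand-ins; it illustrates only the alternation between data synthesis and retraining, not the paper's actual implementation.

```python
# Sketch of the self-evolution loop; every helper is a hypothetical
# stand-in for a component described in the paper, not its released code.

def self_evolve(policy, ppm, problems, run_mcts, verify_with_code,
                finetune_policy, train_ppm, rounds=4):
    """Alternately refresh the policy SLM and the process preference model."""
    for _ in range(rounds):
        verified = []
        for prob in problems:
            # MCTS rollouts guided by the current PPM produce candidate
            # step-by-step trajectories for this problem.
            for traj in run_mcts(policy, ppm, prob):
                # Keep only trajectories whose embedded code executes and
                # whose final answer matches the reference answer.
                if verify_with_code(traj, prob["answer"]):
                    verified.append(traj)
        # Retrain both models on the freshly verified trajectories, then
        # reuse them to synthesize data for the next round.
        policy = finetune_policy(policy, verified)
        ppm = train_ppm(ppm, verified)
    return policy, ppm
```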
If this is right
- Models with only a few billion parameters can solve the majority of problems drawn from high-school math competitions.
- Reasoning performance on structured tasks can advance through repeated cycles of internal data synthesis and model retraining.
- Training data for reasoning can be created largely without human annotation or access to larger teacher models.
- The gap between small and large models narrows on benchmarks that reward careful step-by-step verification.
Where Pith is reading between the lines
- Similar self-evolution loops could be tested in other domains that rely on verifiable intermediate steps such as program synthesis or experimental design.
- Extending the number of evolution rounds or the volume of synthesized problems might produce additional gains on harder problem sets.
- Autonomous systems could in principle keep improving their reasoning by generating and learning from fresh problems without external supervision.
- The coverage of problem types in the self-generated data determines whether the resulting capabilities stay narrow or become broadly useful.
Load-bearing premise
The process preference model trained on self-generated trajectories supplies unbiased and accurate step-level guidance during search without overfitting to patterns in the synthesized data or the evaluation benchmarks.
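The abstract states that PPM training avoids naive step-level score annotation; one common way to realize step-level preferences, and roughly what the paper describes, is a pairwise ranking loss over steps that share the same partial solution, with the preferred step drawn from rollouts that verified correct. A minimal sketch assuming PyTorch; the pairing rule (contrasting high-Q and low-Q sibling steps from MCTS) is a reading of the paper, not a verified implementation detail.

```python
import torch
import torch.nn.functional as F

def ppm_pairwise_loss(score_pos, score_neg):
    """Bradley-Terry style objective: the PPM should score the step whose
    continuations verified correct above the sibling step that failed.
    score_pos and score_neg are scalar PPM outputs for the two steps."""
    return -F.logsigmoid(score_pos - score_neg).mean()

# Toy usage with random tensors standing in for real PPM scores.
pos = torch.randn(8, requires_grad=True)
neg = torch.randn(8, requires_grad=True)
ppm_pairwise_loss(pos, neg).backward()
```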
What would settle it
Apply the final evolved models to a new collection of math problems drawn from sources outside the original 747,000 problems and check whether the accuracy gains remain at the reported levels.
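A minimal form of that check, assuming a hypothetical solve(model, question) call that returns a final answer and a list of problems gathered from sources outside the 747,000-problem pool:

```python
def accuracy(model, problems, solve):
    """Fraction of problems answered correctly; `solve` is a hypothetical
    inference call, and each problem is a {"question", "answer"} dict."""
    correct = sum(solve(model, p["question"]) == p["answer"] for p in problems)
    return correct / len(problems)

def out_of_distribution_gain(evolved_model, base_model, ood_problems, solve):
    """Compare this gap against the roughly 30-45 point in-distribution gains."""
    return (accuracy(evolved_model, ood_problems, solve)
            - accuracy(base_model, ood_problems, solve))
```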
read the original abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents rStar-Math, a framework enabling small language models to achieve strong math reasoning via self-evolved 'deep thinking' with Monte Carlo Tree Search (MCTS) guided by an SLM-based process preference model (PPM). Key innovations include a code-augmented CoT synthesis method for generating verified step-by-step trajectories, a non-naive PPM training approach, and an iterative self-evolution process over 4 rounds on 747k math problems. This yields large gains, e.g., Qwen2.5-Math-7B from 58.8% to 90.0% on MATH and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview, plus 53.3% average on AIME; code and data are to be released.
Significance. If the central performance claims hold under scrutiny, the work is significant for showing that SLMs can reach or exceed frontier-level math reasoning without distillation from larger models, relying instead on self-generated trajectories and MCTS-guided search. The promised code and data release is a clear strength that would support verification of the self-evolution recipe. The results challenge assumptions about model scale for complex reasoning tasks and provide a concrete recipe for iterative improvement.
major comments (3)
- [§3.2] §3.2 (Process Preference Model training): The central claim that the PPM supplies reliable, unbiased step-level guidance for MCTS depends on the novel non-naive annotation method generalizing beyond self-synthesized trajectories. Because these trajectories are generated via code-augmented CoT rollouts on the same problem distribution used for evaluation (even if on held-out splits), the manuscript should include explicit tests for overfitting, such as preference accuracy on out-of-distribution problems or human validation of selected steps, to rule out reliance on verification artifacts or length biases.
- [§4] §4 (Experimental results): The headline gains (e.g., 90.0% on MATH for Qwen2.5-Math-7B after 4 rounds) are presented without ablations isolating the contribution of each self-evolution round, the PPM versus naive reward models, or the impact of data exclusion/verification rules on the final numbers. Without these, it is difficult to confirm that the improvements stem from emergent deep thinking rather than increased synthetic data volume or distribution matching.
- [Table 1] Table 1 or equivalent results table: Reported accuracies lack error bars, standard deviations across multiple seeds, or confidence intervals. This weakens the claim of surpassing o1-preview by +4.5% on MATH, as the statistical reliability of the +31.2% absolute gain for Qwen2.5-Math-7B cannot be assessed from single-run figures.
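One concrete form of the out-of-distribution test asked for in the first major comment above: measure how often the PPM ranks the step whose continuations verified correct above its rejected sibling, on pairs built from problems outside the training pool. A minimal sketch; ppm_score and the pair format are hypothetical.

```python
def preference_accuracy(ppm_score, ood_pairs):
    """Fraction of held-out (preferred, dispreferred) step pairs the PPM orders
    correctly. ppm_score(context, step) -> float is a hypothetical scoring call;
    each pair is (context, preferred_step, dispreferred_step) drawn from
    problems outside the 747k training collection."""
    hits = sum(ppm_score(ctx, good) > ppm_score(ctx, bad)
               for ctx, good, bad in ood_pairs)
    return hits / len(ood_pairs)
```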
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'deep thinking' is used metaphorically but should be tied more explicitly to the MCTS depth or search parameters for clarity.
- [§3.1] §3.1: The description of the code-augmented CoT synthesis could include a small example trajectory to illustrate the verification step.
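To make the second minor comment concrete, here is a hypothetical illustration of what a code-augmented step and its verification might look like; the problem, the step text, and the exec-based check are invented for illustration and are not taken from the paper.

```python
# Hypothetical code-augmented CoT step and its verification.
# The step text, attached code, and checking logic are illustrative only.

step_comment = "Step 2: the sum of the first 10 positive odd numbers is 10**2."
step_code = "result = sum(range(1, 20, 2))"

namespace = {}
exec(step_code, namespace)           # run the Python attached to the step
executed_ok = "result" in namespace  # step survives only if execution succeeds

reference_answer = 100
answer_matches = namespace.get("result") == reference_answer

# In the described pipeline, only trajectories whose steps execute and whose
# final answer matches the reference are retained as training data.
print(step_comment, executed_ok, answer_matches)
```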
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional experiments and clarifications to address the concerns raised.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Process Preference Model training): The central claim that the PPM supplies reliable, unbiased step-level guidance for MCTS depends on the novel non-naive annotation method generalizing beyond self-synthesized trajectories. Because these trajectories are generated via code-augmented CoT rollouts on the same problem distribution used for evaluation (even if on held-out splits), the manuscript should include explicit tests for overfitting, such as preference accuracy on out-of-distribution problems or human validation of selected steps, to rule out reliance on verification artifacts or length biases.
Authors: We appreciate this suggestion to further validate the generalization of our process preference model (PPM). While the trajectories are generated on held-out splits of the problem distribution and verified through code execution to minimize biases, we acknowledge the value of additional checks. In the revised manuscript, we have included new results showing the PPM's accuracy on out-of-distribution problems drawn from the AIME dataset and other sources not seen during self-evolution. Furthermore, we have performed a human evaluation on a random sample of 100 step preferences to confirm alignment with expert judgment, achieving high agreement rates. These additions help rule out overfitting to verification artifacts. revision: yes
-
Referee: [§4] §4 (Experimental results): The headline gains (e.g., 90.0% on MATH for Qwen2.5-Math-7B after 4 rounds) are presented without ablations isolating the contribution of each self-evolution round, the PPM versus naive reward models, or the impact of data exclusion/verification rules on the final numbers. Without these, it is difficult to confirm that the improvements stem from emergent deep thinking rather than increased synthetic data volume or distribution matching.
Authors: We agree that ablations are crucial for isolating the effects. The original submission included some discussion of iterative improvements in Section 4, but to provide more clarity, we have added a detailed ablation study in the revised version. This includes performance metrics after each of the 4 self-evolution rounds, a direct comparison between our non-naive PPM and a naive step-level reward model, and an analysis of how data exclusion and verification rules affect the synthesized dataset and final model performance. These results confirm that the gains arise from the self-evolved deep thinking process rather than mere data scaling. revision: yes
-
Referee: [Table 1] Table 1 or equivalent results table: Reported accuracies lack error bars, standard deviations across multiple seeds, or confidence intervals. This weakens the claim of surpassing o1-preview by +4.5% on MATH, as the statistical reliability of the +31.2% absolute gain for Qwen2.5-Math-7B cannot be assessed from single-run figures.
Authors: We recognize the importance of statistical reporting for robustness. Due to the high computational cost of running full MCTS self-evolution multiple times, the main results are from single runs. However, in the revised manuscript, we have added standard deviations where possible from repeated evaluations on subsets and included error bars in Table 1 based on bootstrap resampling or multiple inference seeds for the final models. We also note the consistent improvements across two different base models (Qwen2.5-Math-7B and Phi3-mini-3.8B) as supporting evidence for the reliability of the gains. revision: partial
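For reference, a minimal sketch of the percentile bootstrap the authors mention, using only the standard library; the problem count and accuracy here are toy stand-ins, not numbers from the paper.

```python
import random

def bootstrap_ci(per_problem_correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for benchmark accuracy,
    given a 0/1 correctness list from a single evaluation run."""
    rng = random.Random(seed)
    n = len(per_problem_correct)
    means = sorted(
        sum(rng.choices(per_problem_correct, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Toy usage: 500 problems with 90% solved, standing in for a single MATH run.
scores = [1] * 450 + [0] * 50
print(bootstrap_ci(scores))
```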
Circularity Check
No significant circularity in the self-evolution derivation
full rationale
The paper describes an iterative self-evolution loop in which MCTS rollouts on math problems generate verified trajectories that train both the policy SLM and the process preference model (PPM); these models are then used for further search and the process repeats for four rounds. Final accuracy numbers are measured on held-out test splits of the MATH and AIME benchmarks. No equation, training objective, or claimed result is shown to reduce by construction to its own inputs (e.g., no fitted parameter is relabeled as a prediction, and no uniqueness theorem is imported solely via self-citation). The reported gains therefore remain an empirical outcome of the described training procedure rather than a tautological restatement of the input data or model choices.
Axiom & Free-Parameter Ledger
free parameters (2)
- MCTS rollout count and search hyperparameters (see the selection sketch after this ledger)
- Process preference model training objective weights
axioms (2)
- domain assumption: Code execution provides reliable verification of final answers during data synthesis
- domain assumption: The process preference model generalizes to guide search on unseen problems
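The free parameters listed above enter the search through the node-selection rule. A minimal sketch of a generic PUCT-style rule in which the PPM score stands in for the value estimate; the node fields and the constant c_puct are illustrative, not the paper's exact formulation.

```python
import math

def select_child(children, c_puct=1.5):
    """children: dicts with visit count N, accumulated value W (sum of PPM
    scores of rollouts through the node), and prior P from the policy SLM."""
    total_visits = sum(ch["N"] for ch in children) + 1

    def puct(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0          # exploitation term
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])  # exploration
        return q + u

    return max(children, key=puct)

# Toy usage with two candidate next reasoning steps.
children = [
    {"N": 3, "W": 2.1, "P": 0.6},
    {"N": 0, "W": 0.0, "P": 0.4},
]
print(select_child(children))
```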
Forward citations
Cited by 18 Pith papers
-
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
-
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.
-
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompa...
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and S...
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
discussion (0)