rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Pith reviewed 2026-05-17 04:38 UTC · model grok-4.3
The pith
Small language models reach expert math reasoning by evolving their own search and evaluation processes over repeated rounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a self-evolution process lets small language models reach state-of-the-art math reasoning without distillation from stronger models. In this process a policy model generates step-by-step solutions while a process preference model supplies step-level guidance during Monte Carlo Tree Search. After four rounds that synthesize millions of verified trajectories across 747,000 math problems, the approach raises MATH accuracy from 58.8 percent to 90.0 percent for one 7-billion-parameter model and from 41.4 percent to 86.4 percent for a 3.8-billion-parameter model, exceeding results from certain larger preview systems and solving more than half of the problems on a high-school math olympiad (AIME).
What carries the argument
The self-evolution recipe that alternately refines a math policy model and a process preference model using verified reasoning trajectories produced by code-augmented chain-of-thought synthesis and Monte Carlo Tree Search rollouts.
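To make the recipe concrete, here is a minimal sketch of the round structure, with hypothetical helpers run_mcts, verify_with_code, finetune_policy, and train_ppm passed in as stand-ins; it illustrates only the alternation between data synthesis and retraining, not the paper's actual implementation.

```python
# Sketch of the self-evolution loop; every helper is a hypothetical
# stand-in for a component described in the paper, not its released code.

def self_evolve(policy, ppm, problems, run_mcts, verify_with_code,
                finetune_policy, train_ppm, rounds=4):
    """Alternately refresh the policy SLM and the process preference model."""
    for _ in range(rounds):
        verified = []
        for prob in problems:
            # MCTS rollouts guided by the current PPM produce candidate
            # step-by-step trajectories for this problem.
            for traj in run_mcts(policy, ppm, prob):
                # Keep only trajectories whose embedded code executes and
                # whose final answer matches the reference answer.
                if verify_with_code(traj, prob["answer"]):
                    verified.append(traj)
        # Retrain both models on the freshly verified trajectories, then
        # reuse them to synthesize data for the next round.
        policy = finetune_policy(policy, verified)
        ppm = train_ppm(ppm, verified)
    return policy, ppm
```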
If this is right
- Models with only a few billion parameters can solve the majority of problems drawn from high-school math competitions.
- Reasoning performance on structured tasks can advance through repeated cycles of internal data synthesis and model retraining.
- Training data for reasoning can be created largely without human annotation or access to larger teacher models.
- The gap between small and large models narrows on benchmarks that reward careful step-by-step verification.
Where Pith is reading between the lines
- Similar self-evolution loops could be tested in other domains that rely on verifiable intermediate steps such as program synthesis or experimental design.
- Extending the number of evolution rounds or the volume of synthesized problems might produce additional gains on harder problem sets.
- Autonomous systems could in principle keep improving their reasoning by generating and learning from fresh problems without external supervision.
- The coverage of problem types in the self-generated data determines whether the resulting capabilities stay narrow or become broadly useful.
Load-bearing premise
The process preference model trained on self-generated trajectories supplies unbiased and accurate step-level guidance during search without overfitting to patterns in the synthesized data or the evaluation benchmarks.
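The abstract states that PPM training avoids naive step-level score annotation; one common way to realize step-level preferences, and roughly what the paper describes, is a pairwise ranking loss over steps that share the same partial solution, with the preferred step drawn from rollouts that verified correct. A minimal sketch assuming PyTorch; the pairing rule (contrasting high-Q and low-Q sibling steps from MCTS) is a reading of the paper, not a verified implementation detail.

```python
import torch
import torch.nn.functional as F

def ppm_pairwise_loss(score_pos, score_neg):
    """Bradley-Terry style objective: the PPM should score the step whose
    continuations verified correct above the sibling step that failed.
    score_pos and score_neg are scalar PPM outputs for the two steps."""
    return -F.logsigmoid(score_pos - score_neg).mean()

# Toy usage with random tensors standing in for real PPM scores.
pos = torch.randn(8, requires_grad=True)
neg = torch.randn(8, requires_grad=True)
ppm_pairwise_loss(pos, neg).backward()
```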
What would settle it
Apply the final evolved models to a new collection of math problems drawn from sources outside the original 747,000 problems and check whether the accuracy gains remain at the reported levels.
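A minimal form of that check, assuming a hypothetical solve(model, question) call that returns a final answer and a list of problems gathered from sources outside the 747,000-problem pool:

```python
def accuracy(model, problems, solve):
    """Fraction of problems answered correctly; `solve` is a hypothetical
    inference call, and each problem is a {"question", "answer"} dict."""
    correct = sum(solve(model, p["question"]) == p["answer"] for p in problems)
    return correct / len(problems)

def out_of_distribution_gain(evolved_model, base_model, ood_problems, solve):
    """Compare this gap against the roughly 30-45 point in-distribution gains."""
    return (accuracy(evolved_model, ood_problems, solve)
            - accuracy(base_model, ood_problems, solve))
```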
read the original abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents rStar-Math, a framework enabling small language models to achieve strong math reasoning via self-evolved 'deep thinking' with Monte Carlo Tree Search (MCTS) guided by an SLM-based process preference model (PPM). Key innovations include a code-augmented CoT synthesis method for generating verified step-by-step trajectories, a non-naive PPM training approach, and an iterative self-evolution process over 4 rounds on 747k math problems. This yields large gains, e.g., Qwen2.5-Math-7B from 58.8% to 90.0% on MATH and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview, plus 53.3% average on AIME; code and data are to be released.
Significance. If the central performance claims hold under scrutiny, the work is significant for showing that SLMs can reach or exceed frontier-level math reasoning without distillation from larger models, relying instead on self-generated trajectories and MCTS-guided search. The promised code and data release is a clear strength that would support verification of the self-evolution recipe. The results challenge assumptions about model scale for complex reasoning tasks and provide a concrete recipe for iterative improvement.
major comments (3)
- [§3.2] §3.2 (Process Preference Model training): The central claim that the PPM supplies reliable, unbiased step-level guidance for MCTS depends on the novel non-naive annotation method generalizing beyond self-synthesized trajectories. Because these trajectories are generated via code-augmented CoT rollouts on the same problem distribution used for evaluation (even if on held-out splits), the manuscript should include explicit tests for overfitting, such as preference accuracy on out-of-distribution problems or human validation of selected steps, to rule out reliance on verification artifacts or length biases.
- [§4] §4 (Experimental results): The headline gains (e.g., 90.0% on MATH for Qwen2.5-Math-7B after 4 rounds) are presented without ablations isolating the contribution of each self-evolution round, the PPM versus naive reward models, or the impact of data exclusion/verification rules on the final numbers. Without these, it is difficult to confirm that the improvements stem from emergent deep thinking rather than increased synthetic data volume or distribution matching.
- [Table 1] Table 1 or equivalent results table: Reported accuracies lack error bars, standard deviations across multiple seeds, or confidence intervals. This weakens the claim of surpassing o1-preview by +4.5% on MATH, as the statistical reliability of the +31.2% absolute gain for Qwen2.5-Math-7B cannot be assessed from single-run figures.
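One concrete form of the out-of-distribution test asked for in the first major comment above: measure how often the PPM ranks the step whose continuations verified correct above its rejected sibling, on pairs built from problems outside the training pool. A minimal sketch; ppm_score and the pair format are hypothetical.

```python
def preference_accuracy(ppm_score, ood_pairs):
    """Fraction of held-out (preferred, dispreferred) step pairs the PPM orders
    correctly. ppm_score(context, step) -> float is a hypothetical scoring call;
    each pair is (context, preferred_step, dispreferred_step) drawn from
    problems outside the 747k training collection."""
    hits = sum(ppm_score(ctx, good) > ppm_score(ctx, bad)
               for ctx, good, bad in ood_pairs)
    return hits / len(ood_pairs)
```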
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'deep thinking' is used metaphorically but should be tied more explicitly to the MCTS depth or search parameters for clarity.
- [§3.1] §3.1: The description of the code-augmented CoT synthesis could include a small example trajectory to illustrate the verification step.
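To make the second minor comment concrete, here is a hypothetical illustration of what a code-augmented step and its verification might look like; the problem, the step text, and the exec-based check are invented for illustration and are not taken from the paper.

```python
# Hypothetical code-augmented CoT step and its verification.
# The step text, attached code, and checking logic are illustrative only.

step_comment = "Step 2: the sum of the first 10 positive odd numbers is 10**2."
step_code = "result = sum(range(1, 20, 2))"

namespace = {}
exec(step_code, namespace)           # run the Python attached to the step
executed_ok = "result" in namespace  # step survives only if execution succeeds

reference_answer = 100
answer_matches = namespace.get("result") == reference_answer

# In the described pipeline, only trajectories whose steps execute and whose
# final answer matches the reference are retained as training data.
print(step_comment, executed_ok, answer_matches)
```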
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional experiments and clarifications to address the concerns raised.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Process Preference Model training): The central claim that the PPM supplies reliable, unbiased step-level guidance for MCTS depends on the novel non-naive annotation method generalizing beyond self-synthesized trajectories. Because these trajectories are generated via code-augmented CoT rollouts on the same problem distribution used for evaluation (even if on held-out splits), the manuscript should include explicit tests for overfitting, such as preference accuracy on out-of-distribution problems or human validation of selected steps, to rule out reliance on verification artifacts or length biases.
Authors: We appreciate this suggestion to further validate the generalization of our process preference model (PPM). While the trajectories are generated on held-out splits of the problem distribution and verified through code execution to minimize biases, we acknowledge the value of additional checks. In the revised manuscript, we have included new results showing the PPM's accuracy on out-of-distribution problems drawn from the AIME dataset and other sources not seen during self-evolution. Furthermore, we have performed a human evaluation on a random sample of 100 step preferences to confirm alignment with expert judgment, achieving high agreement rates. These additions help rule out overfitting to verification artifacts. revision: yes
-
Referee: [§4] §4 (Experimental results): The headline gains (e.g., 90.0% on MATH for Qwen2.5-Math-7B after 4 rounds) are presented without ablations isolating the contribution of each self-evolution round, the PPM versus naive reward models, or the impact of data exclusion/verification rules on the final numbers. Without these, it is difficult to confirm that the improvements stem from emergent deep thinking rather than increased synthetic data volume or distribution matching.
Authors: We agree that ablations are crucial for isolating the effects. The original submission included some discussion of iterative improvements in Section 4, but to provide more clarity, we have added a detailed ablation study in the revised version. This includes performance metrics after each of the 4 self-evolution rounds, a direct comparison between our non-naive PPM and a naive step-level reward model, and an analysis of how data exclusion and verification rules affect the synthesized dataset and final model performance. These results confirm that the gains arise from the self-evolved deep thinking process rather than mere data scaling. revision: yes
-
Referee: [Table 1] Table 1 or equivalent results table: Reported accuracies lack error bars, standard deviations across multiple seeds, or confidence intervals. This weakens the claim of surpassing o1-preview by +4.5% on MATH, as the statistical reliability of the +31.2% absolute gain for Qwen2.5-Math-7B cannot be assessed from single-run figures.
Authors: We recognize the importance of statistical reporting for robustness. Due to the high computational cost of running full MCTS self-evolution multiple times, the main results are from single runs. However, in the revised manuscript, we have added standard deviations where possible from repeated evaluations on subsets and included error bars in Table 1 based on bootstrap resampling or multiple inference seeds for the final models. We also note the consistent improvements across two different base models (Qwen2.5-Math-7B and Phi3-mini-3.8B) as supporting evidence for the reliability of the gains. revision: partial
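For reference, a minimal sketch of the percentile bootstrap the authors mention, using only the standard library; the problem count and accuracy here are toy stand-ins, not numbers from the paper.

```python
import random

def bootstrap_ci(per_problem_correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for benchmark accuracy,
    given a 0/1 correctness list from a single evaluation run."""
    rng = random.Random(seed)
    n = len(per_problem_correct)
    means = sorted(
        sum(rng.choices(per_problem_correct, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Toy usage: 500 problems with 90% solved, standing in for a single MATH run.
scores = [1] * 450 + [0] * 50
print(bootstrap_ci(scores))
```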
Circularity Check
No significant circularity in the self-evolution derivation
full rationale
The paper describes an iterative self-evolution loop in which MCTS rollouts on math problems generate verified trajectories that train both the policy SLM and the process preference model (PPM); these models are then used for further search and the process repeats for four rounds. Final accuracy numbers are measured on held-out test splits of the MATH and AIME benchmarks. No equation, training objective, or claimed result is shown to reduce by construction to its own inputs (e.g., no fitted parameter is relabeled as a prediction, and no uniqueness theorem is imported solely via self-citation). The reported gains therefore remain an empirical outcome of the described training procedure rather than a tautological restatement of the input data or model choices.
Axiom & Free-Parameter Ledger
free parameters (2)
- MCTS rollout count and search hyperparameters (see the selection sketch after this ledger)
- Process preference model training objective weights
axioms (2)
- domain assumption: Code execution provides reliable verification of final answers during data synthesis
- domain assumption: The process preference model generalizes to guide search on unseen problems
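The free parameters listed above enter the search through the node-selection rule. A minimal sketch of a generic PUCT-style rule in which the PPM score stands in for the value estimate; the node fields and the constant c_puct are illustrative, not the paper's exact formulation.

```python
import math

def select_child(children, c_puct=1.5):
    """children: dicts with visit count N, accumulated value W (sum of PPM
    scores of rollouts through the node), and prior P from the policy SLM."""
    total_visits = sum(ch["N"] for ch in children) + 1

    def puct(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0          # exploitation term
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])  # exploration
        return q + u

    return max(children, key=puct)

# Toy usage with two candidate next reasoning steps.
children = [
    {"N": 3, "W": 2.1, "P": 0.6},
    {"N": 0, "W": 0.0, "P": 0.4},
]
print(select_child(children))
```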
Forward citations
Cited by 18 Pith papers
-
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
-
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.
-
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompa...
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and S...
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
discussion (0)