hub Canonical reference

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu · 2025 · cs.CL · arXiv 2501.04519

Canonical reference. 88% of citing Pith papers cite this work as background.

29 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 29 citing papers arXiv PDF

abstract

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data sythesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8

citation-polarity summary

background 7 unclear 1

representative citing papers

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

cs.CL · 2026-05-13 · conditional · novelty 7.0

Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.

On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.

PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents

q-fin.CP · 2025-12-11 · unverdicted · novelty 7.0

PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

PriorZero: Bridging Language Priors and World Models for Decision Making

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompanied by a new 29k-problem multilingual dataset.

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

cs.LG · 2025-11-27 · unverdicted · novelty 6.0

TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

cs.LG · 2025-10-21 · unverdicted · novelty 6.0

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

cs.AI · 2025-09-29 · unverdicted · novelty 6.0

DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

cs.CL · 2025-06-02 · conditional · novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

cs.AI · 2025-02-27 · unverdicted · novelty 6.0

Meta-Reasoner applies contextual multi-armed bandits to adaptively select LLM reasoning strategies at inference time, reporting 9-12% accuracy gains and 28-35% time reductions on math and science benchmarks.

Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search

cs.CL · 2025-02-02 · unverdicted · novelty 6.0

DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

cs.AI · 2026-04-11 · unverdicted · novelty 5.0

SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

cs.CL · 2025-05-21 · unverdicted · novelty 5.0

TemplateRL extracts interpretable templates via MCTS on seed problems and injects them into RL policy optimization to raise high-quality rollout rates, reporting 99% gain over GRPO on AIME and 41% on AMC.

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

cs.LG · 2025-05-16 · unverdicted · novelty 5.0

TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

citing papers explorer

Showing 29 of 29 citing papers.

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn cs.CL · 2026-05-13 · conditional · none · ref 28 · internal anchor
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators cs.LG · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
Fine-Tuning Small Reasoning Models for Quantum Field Theory cs.LG · 2026-04-21 · unverdicted · none · ref 49 · internal anchor
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning cs.AI · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency cs.LG · 2026-01-29 · unverdicted · none · ref 10 · internal anchor
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents q-fin.CP · 2025-12-11 · unverdicted · none · ref 1 · internal anchor
PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play cs.AI · 2026-05-16 · unverdicted · none · ref 64 · internal anchor
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 53 · internal anchor
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
PriorZero: Bridging Language Priors and World Models for Decision Making cs.LG · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning cs.CL · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompanied by a new 29k-problem multilingual dataset.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws cs.LG · 2026-04-27 · unverdicted · none · ref 75 · internal anchor
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation cs.LG · 2025-11-27 · unverdicted · none · ref 21 · internal anchor
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs cs.LG · 2025-10-21 · unverdicted · none · ref 19 · internal anchor
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search cs.AI · 2025-09-29 · unverdicted · none · ref 7 · internal anchor
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 202 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning cs.CL · 2025-06-02 · conditional · none · ref 8 · internal anchor
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models cs.AI · 2025-02-27 · unverdicted · none · ref 2 · internal anchor
Meta-Reasoner applies contextual multi-armed bandits to adaptively select LLM reasoning strategies at inference time, reporting 9-12% accuracy gains and 28-35% time reductions on math and science benchmarks.
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search cs.CL · 2025-02-02 · unverdicted · none · ref 5 · internal anchor
DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning cs.AI · 2026-04-11 · unverdicted · none · ref 6 · internal anchor
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
Sample-efficient LLM Optimization with Reset Replay cs.LG · 2025-08-08 · unverdicted · none · ref 8 · internal anchor
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning cs.CL · 2025-05-21 · unverdicted · none · ref 1 · internal anchor
TemplateRL extracts interpretable templates via MCTS on seed problems and injects them into RL policy optimization to raise high-quality rollout rates, reporting 99% gain over GRPO on AIME and 41% on AMC.
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning cs.LG · 2025-05-16 · unverdicted · none · ref 14 · internal anchor
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 226 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Improving Language Models with Intentional Analysis cs.CL · 2025-02-07 · unverdicted · none · ref 4 · internal anchor
Intentional Analysis improves language model task performance by explicitly adding intent-aware analysis and reasoning, outperforming Chain-of-Thought and working synergistically with it even on frontier models.
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping cs.LG · 2025-10-29 · unverdicted · none · ref 8 · internal anchor
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 234 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Phi-4-reasoning Technical Report cs.AI · 2025-04-30 · unverdicted · none · ref 20 · internal anchor
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 143 · internal anchor
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer