Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
hub Canonical reference
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data sythesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
hub tools
citation-role summary
citation-polarity summary
roles
background 8representative citing papers
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompanied by a new 29k-problem multilingual dataset.
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Meta-Reasoner applies contextual multi-armed bandits to adaptively select LLM reasoning strategies at inference time, reporting 9-12% accuracy gains and 28-35% time reductions on math and science benchmarks.
DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
TemplateRL extracts interpretable templates via MCTS on seed problems and injects them into RL policy optimization to raise high-quality rollout rates, reporting 99% gain over GRPO on AIME and 41% on AMC.
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
-
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
CoTEvol: Self-Evolving Chain-of-Thoughts for Data Synthesis in Mathematical Reasoning
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
PyFi generates a 600K pyramid QA dataset for financial images using adversarial MCTS agents, allowing fine-tuned VLMs to decompose complex questions and achieve 19.52% and 8.06% accuracy gains on Qwen2.5-VL models.
-
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
-
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
-
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
-
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompanied by a new 29k-problem multilingual dataset.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obtained from covering numbers.
-
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.
-
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
-
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
-
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
Meta-Reasoner applies contextual multi-armed bandits to adaptively select LLM reasoning strategies at inference time, reporting 9-12% accuracy gains and 28-35% time reductions on math and science benchmarks.
-
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Sample-efficient LLM Optimization with Reset Replay
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
-
TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
TemplateRL extracts interpretable templates via MCTS on seed problems and injects them into RL policy optimization to raise high-quality rollout rates, reporting 99% gain over GRPO on AIME and 41% on AMC.
-
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Improving Language Models with Intentional Analysis
Intentional Analysis improves language model task performance by explicitly adding intent-aware analysis and reasoning, outperforming Chain-of-Thought and working synergistically with it even on frontier models.
-
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.