hub

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun · 2023 · cs.CL · arXiv 2309.05653

35 Pith papers cite this work. Polarity classification is still indexing.

35 Pith papers citing it

open full Pith review browse 35 citing papers arXiv PDF

abstract

We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

Transformers Provably Learn to Internalize Chain-of-Thought

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

cs.CL · 2026-04-11 · unverdicted · novelty 7.0

Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

EDUMATH: Generating Standards-aligned Educational Math Word Problems

cs.CL · 2025-10-08 · conditional · novelty 7.0

EDUMATH introduces the first teacher-annotated dataset for standards-aligned math word problem generation and demonstrates that it enables smaller open LLMs to match larger models while producing problems students prefer over human-written ones.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

cs.CV · 2024-03-21 · conditional · novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

Geometric Latent Reasoning Induces Shorter Generations in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

GLR formulates latent reasoning as geometric path approximation in pretrained embedding space and reports shorter LLM generations on math tasks without an explicit length penalty.

Manifold-Guided Attention Steering

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

cs.CV · 2026-02-04 · unverdicted · novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

cs.CV · 2025-03-19 · unverdicted · novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

cs.LG · 2024-06-26 · conditional · novelty 6.0

Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

cs.CL · 2024-02-05 · unverdicted · novelty 6.0

DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

cs.AI · 2023-12-14 · conditional · novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

cs.CL · 2023-09-29 · conditional · novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

cs.AI · 2026-05-31 · unverdicted · novelty 5.0

DAG-MoE uses a lightweight module to learn DAG-based structural aggregation of selected experts, expanding combination space and enabling intra-layer multi-step reasoning compared to standard weighted-sum MoE.

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

citing papers explorer

Showing 9 of 9 citing papers after filters.

Transformers Provably Learn to Internalize Chain-of-Thought cs.LG · 2026-05-27 · unverdicted · none · ref 54 · internal anchor
L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
HARP: Efficient Data Selection for Finetuning Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 36 · internal anchor
HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.
RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning cs.LG · 2026-06-05 · unverdicted · none · ref 40 · internal anchor
RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.
Manifold-Guided Attention Steering cs.LG · 2026-05-20 · unverdicted · none · ref 23 · internal anchor
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 55 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity cs.LG · 2026-05-13 · unverdicted · none · ref 58 · internal anchor
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 31 · internal anchor
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation cs.LG · 2026-04-10 · unverdicted · none · ref 52 · internal anchor
EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower training budget.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs cs.LG · 2026-04-08 · unverdicted · none · ref 88 · internal anchor
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer