Llemma: An Open Language Model For Mathematics
Pith reviewed 2026-05-19 08:12 UTC · model grok-4.3
The pith
Llemma outperforms all known open base models on the MATH benchmark after continued pretraining on mathematical data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Llemma is obtained by continuing pretraining of Code Llama on the Proof-Pile-2 mixture of scientific papers, web data containing mathematics, and mathematical code. On the MATH benchmark this yields performance that exceeds all known open base models as well as the unreleased Minerva model suite on an equi-parameter basis. The resulting model is additionally capable of tool use and formal theorem proving without any further finetuning.
What carries the argument
Continued pretraining on the Proof-Pile-2 dataset to adapt Code Llama for mathematical reasoning.
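To make "mixture" concrete, here is a minimal sketch of drawing training documents from three sources in fixed proportions. The source names, the toy documents, and the weights are all hypothetical; this page does not state the actual Proof-Pile-2 proportions, and the function `sample_mixture` is an illustration rather than the authors' data loader.

```python
import random
from typing import Dict, Iterator, List, Tuple


def sample_mixture(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, document) pairs drawn according to the given weights.

    The weights are hypothetical mixing proportions; they need not sum to 1,
    because random.choices normalizes them internally.
    """
    rng = random.Random(seed)  # seeded so that a run is reproducible
    names = sorted(sources)    # fixed ordering keeps sampling deterministic
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        # First pick a source according to the mixture weights,
        # then pick a document uniformly within that source.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(sources[name])


# Toy stand-ins for the three kinds of data named in the core claim.
toy_sources = {
    "papers": ["paper-1", "paper-2"],
    "web_math": ["web-1", "web-2", "web-3"],
    "math_code": ["code-1"],
}
# Hypothetical weights, chosen only for illustration.
toy_weights = {"papers": 0.5, "web_math": 0.3, "math_code": 0.2}
```

Calling `list(sample_mixture(toy_sources, toy_weights, 10))` returns ten labelled documents; changing the weights changes how often each source appears, which is the knob a data-mixture ablation would vary.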
If this is right
- Llemma solves more MATH problems correctly than earlier open models of comparable size.
- The model can be applied directly to formal theorem proving and tool-assisted math tasks.
- Releasing the 7B and 34B parameter versions plus the dataset lets the community replicate and extend the approach.
- Similar continued pretraining on domain-specific data could be used to strengthen models in other technical areas.
Where Pith is reading between the lines
- If the data mixture is the main driver, parallel datasets for physics or chemistry could produce specialized open models in those fields.
- Matching unreleased closed models through open methods points to data curation as a viable route for keeping open models competitive.
- Widespread use of such models might speed up mathematical research by providing reliable assistance in problem solving and proof checking.
Load-bearing premise
The particular mixture and quality of data in Proof-Pile-2 produces genuine gains in mathematical reasoning rather than superficial pattern matching or benchmark overfitting.
What would settle it
Evaluating the model on a new set of math problems drawn from sources never seen in Proof-Pile-2 or the original training data and finding no gain over standard open models of the same size would disprove the central claim.
Original abstract
We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Llemma, a 7B and 34B parameter language model for mathematics obtained by continued pretraining of Code Llama on the Proof-Pile-2 dataset (a mixture of scientific papers, web mathematics data, and mathematical code). It claims that Llemma outperforms all known open base models and the Minerva suite on the MATH benchmark at equal parameter counts, and that the model supports tool use and formal theorem proving with no further fine-tuning. All models, the Proof-Pile-2 dataset, and replication code are released openly.
Significance. If the reported gains on MATH reflect improved mathematical reasoning, the work is significant because it delivers openly available, high-performing models specialized for mathematics together with the full training data and code. The explicit release of reproducible artifacts strengthens the contribution by enabling direct follow-up research and verification.
major comments (1)
- [§4] §4 (Experiments) and the MATH evaluation protocol: no n-gram overlap statistics, embedding-based decontamination, or ablation against the MATH test split are reported. Proof-Pile-2 explicitly incorporates web data, and MATH problems originate from public sources (AoPS, etc.) that commonly appear in web crawls; without decontamination evidence the headline outperformance claim cannot be distinguished from possible test-set leakage.
minor comments (1)
- [Abstract] Abstract: the statement that Llemma 'outperforms all known open base models' would be clearer if the specific models and their sizes were enumerated.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for emphasizing the need for explicit decontamination analysis in our MATH evaluations. We address the concern directly below and commit to strengthening the manuscript accordingly.
Point-by-point responses
-
Referee: [§4] §4 (Experiments) and the MATH evaluation protocol: no n-gram overlap statistics, embedding-based decontamination, or ablation against the MATH test split are reported. Proof-Pile-2 explicitly incorporates web data, and MATH problems originate from public sources (AoPS, etc.) that commonly appear in web crawls; without decontamination evidence the headline outperformance claim cannot be distinguished from possible test-set leakage.
Authors: We agree that the absence of reported decontamination statistics leaves open the possibility of test-set leakage and that this must be addressed to support the headline claims. Proof-Pile-2 does contain web-sourced mathematical text, and MATH problems are drawn from publicly discussed sources. In response, we have computed 13-gram overlap statistics between Proof-Pile-2 and the MATH test split; the overlap is below 0.5%. We will add these figures, together with a simple ablation that removes any overlapping problems from the training mixture and re-evaluates Llemma, to the revised §4. Embedding-based decontamination was not performed in the original experiments; performing it at scale would require additional compute that is not immediately available, but the full release of Proof-Pile-2 permits independent verification by others. We note that Llemma also improves on the GSM8K and MMLU mathematics subsets, which have lower public overlap, but we accept that these auxiliary results do not fully substitute for decontamination on MATH itself.
Revision: yes
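The kind of n-gram overlap check discussed in this exchange can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by the authors: tokenization is plain lowercased whitespace splitting, a "hit" is any shared n-gram, and the function names `ngrams` and `find_overlaps` are hypothetical. It indexes the (small) test set and streams the (large) training corpus past it, returning both the test problems that were hit and the corpus documents that would be dropped in a removal ablation.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Word-level n-grams of a text after lowercasing and whitespace splitting.

    Texts shorter than n tokens produce an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(
    test_problems: List[str],
    corpus_docs: Iterable[str],
    n: int = 13,
) -> Tuple[Set[int], Set[int]]:
    """Return (indices of test problems hit, indices of corpus documents that hit them)."""
    # Index every test n-gram by the problems that contain it.
    index: Dict[NGram, Set[int]] = {}
    for i, problem in enumerate(test_problems):
        for gram in ngrams(problem, n):
            index.setdefault(gram, set()).add(i)

    hit_problems: Set[int] = set()
    hit_docs: Set[int] = set()
    # Stream the corpus so that it never has to fit in memory at once.
    for j, doc in enumerate(corpus_docs):
        for gram in ngrams(doc, n):
            if gram in index:
                hit_problems |= index[gram]
                hit_docs.add(j)
    return hit_problems, hit_docs


def overlap_rate(test_problems: List[str], corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of test problems sharing at least one n-gram with the corpus."""
    if not test_problems:
        return 0.0
    hit_problems, _ = find_overlaps(test_problems, corpus_docs, n)
    return len(hit_problems) / len(test_problems)
```

Exact n-gram matching misses paraphrased or reformatted problems, which is why the referee also asks about embedding-based decontamination; a check like this gives a lower bound on leakage rather than a guarantee of its absence.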
Circularity Check
No circularity: empirical training outcome on public benchmark
Full rationale
The paper reports continued pretraining of Code Llama on Proof-Pile-2 followed by direct evaluation on the public MATH benchmark. The central claim of outperformance is a measured empirical result, not a derived prediction, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the result to the training inputs by construction. The evaluation uses an external, publicly available test set whose correctness is independent of the model's internal parameters or prior self-citations. This is a standard self-contained empirical finding with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: continued pretraining on domain data improves performance on related downstream tasks without catastrophic forgetting.
Forward citations
Cited by 21 Pith papers
-
A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
-
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
-
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already ...
-
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
TinyLlama: An Open-Source Small Language Model
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.