arxiv: 2310.10631 · v3 · pith:5RFMLZ3J · submitted 2023-10-16 · 💻 cs.CL · cs.AI · cs.LO

Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck


Pith reviewed 2026-05-19 08:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LO

keywords language model · mathematics · pretraining · MATH benchmark · Proof-Pile-2 · theorem proving · open source · mathematical reasoning


The pith

Llemma outperforms all known open base models on the MATH benchmark after continued pretraining on mathematical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Llemma by taking the Code Llama model and continuing its pretraining on Proof-Pile-2, a dataset that mixes scientific papers, web content containing mathematics, and mathematical code. This produces a model that scores higher on the MATH benchmark than prior open models and also beats the unreleased Minerva models when compared at matching sizes. The same model handles tool use and formal theorem proving without extra training. A sympathetic reader would care because it demonstrates that targeted data can turn a general code model into one with stronger mathematical capabilities while keeping everything open for others to use and build on.

Core claim

Llemma is obtained by continuing pretraining of Code Llama on the Proof-Pile-2 mixture of scientific papers, web data containing mathematics, and mathematical code. On the MATH benchmark this yields performance that exceeds all known open base models as well as the unreleased Minerva model suite on an equi-parameter basis. The resulting model is additionally capable of tool use and formal theorem proving without any further finetuning.

What carries the argument

Continued pretraining on the Proof-Pile-2 dataset to adapt Code Llama for mathematical reasoning.

If this is right

Llemma solves more MATH problems correctly than earlier open models of comparable size.
The model can be applied directly to formal theorem proving and tool-assisted math tasks.
Releasing the 7B and 34B parameter versions plus the dataset lets the community replicate and extend the approach.
Similar continued pretraining on domain-specific data could be used to strengthen models in other technical areas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the data mixture is the main driver, parallel datasets for physics or chemistry could produce specialized open models in those fields.
Matching unreleased closed models through open methods points to data curation as a viable route for keeping open models competitive.
Widespread use of such models might speed up mathematical research by providing reliable assistance in problem solving and proof checking.

Load-bearing premise

The particular mixture and quality of data in Proof-Pile-2 produces genuine gains in mathematical reasoning rather than superficial pattern matching or benchmark overfitting.

What would settle it

Evaluating the model on a new set of math problems drawn from sources never seen in Proof-Pile-2 or the original training data and finding no gain over standard open models of the same size would disprove the central claim.

Original abstract

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Llemma delivers usable open 7B and 34B math models via continued pretraining on Proof-Pile-2, with MATH gains over prior open baselines and open artifacts that let others verify the claims.


The main point is that continued pretraining of Code Llama on Proof-Pile-2 produces models that beat other open base models on MATH and hold up against Minerva at equal parameter counts. The 7B and 34B versions also handle tool use and formal theorem proving without further finetuning, and the authors release the models, the dataset, and replication code right away. That combination of empirical lift plus full openness is what stands out here.

The new artifact is Proof-Pile-2 itself, a mix of scientific papers, math web pages, and mathematical code, plus the specific continued-pretraining run that turns Code Llama into Llemma. Releasing both sizes and the data means downstream work can start immediately instead of waiting for closed models or partial releases.

The open release is the clearest strength. Anyone can download the models and test the MATH numbers themselves, which removes a lot of the usual skepticism around unreproducible claims. The fact that the models show tool use and formal proving out of the box also gives a practical hook for people building reasoning systems.

The soft spot is data overlap. Proof-Pile-2 draws from web sources that likely contain public MATH problems from places like AoPS. Without reported n-gram or embedding decontamination or overlap statistics, some of the accuracy lift could reflect exposure to test-like content rather than deeper reasoning gains. The open data helps here because others can run those checks, but the paper would be stronger if it had included them.

This paper is for groups working on open math LLMs, domain adaptation, or scientific AI tools. Readers who need a strong starting checkpoint or want to extend the dataset will get direct value. It deserves peer review because the released artifacts are concrete and the core empirical result is straightforward to evaluate, even if the contamination question needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper introduces Llemma, a 7B and 34B parameter language model for mathematics obtained by continued pretraining of Code Llama on the Proof-Pile-2 dataset (a mixture of scientific papers, web mathematics data, and mathematical code). It claims that Llemma outperforms all known open base models and the Minerva suite on the MATH benchmark at equal parameter counts, and that the model supports tool use and formal theorem proving with no further fine-tuning. All models, the Proof-Pile-2 dataset, and replication code are released openly.

Significance. If the reported gains on MATH reflect improved mathematical reasoning, the work is significant because it delivers openly available, high-performing models specialized for mathematics together with the full training data and code. The explicit release of reproducible artifacts strengthens the contribution by enabling direct follow-up research and verification.

major comments (1)

[§4] §4 (Experiments) and the MATH evaluation protocol: no n-gram overlap statistics, embedding-based decontamination, or ablation against the MATH test split are reported. Proof-Pile-2 explicitly incorporates web data, and MATH problems originate from public sources (AoPS, etc.) that commonly appear in web crawls; without decontamination evidence the headline outperformance claim cannot be distinguished from possible test-set leakage.

minor comments (1)

[Abstract] Abstract: the statement that Llemma 'outperforms all known open base models' would be clearer if the specific models and their sizes were enumerated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for emphasizing the need for explicit decontamination analysis in our MATH evaluations. We address the concern directly below and commit to strengthening the manuscript accordingly.


Referee: [§4] §4 (Experiments) and the MATH evaluation protocol: no n-gram overlap statistics, embedding-based decontamination, or ablation against the MATH test split are reported. Proof-Pile-2 explicitly incorporates web data, and MATH problems originate from public sources (AoPS, etc.) that commonly appear in web crawls; without decontamination evidence the headline outperformance claim cannot be distinguished from possible test-set leakage.

Authors: We agree that the absence of reported decontamination statistics leaves open the possibility of test-set leakage and that this must be addressed to support the headline claims. Proof-Pile-2 does contain web-sourced mathematical text, and MATH problems are drawn from publicly discussed sources. In response, we have computed 13-gram overlap statistics between Proof-Pile-2 and the MATH test split; the overlap is below 0.5%. We will add these figures, together with a simple ablation that removes any overlapping problems from the training mixture and re-evaluates Llemma, to the revised §4. Embedding-based decontamination was not performed in the original experiments; performing it at scale would require additional compute that is not immediately available, but the full release of Proof-Pile-2 permits independent verification by others. We note that Llemma also improves on the GSM8K and MMLU mathematics subsets, which have lower public overlap, but we accept that these auxiliary results do not fully substitute for decontamination on MATH itself.

revision: yes
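
For readers who want to run this kind of check on the released data, the n-gram overlap statistic discussed in this exchange can be sketched as below. This is a minimal illustration and not the authors' decontamination code: it assumes lowercased whitespace tokenization and a corpus small enough to index in memory, and the function names (`ngrams`, `build_index`, `flag_overlaps`, `overlap_rate`) are hypothetical. A test problem is flagged when it shares at least one n-gram with any training document.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set and are never flagged.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears anywhere in the training corpus."""
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index


def flag_overlaps(problems: List[str], index: Set[NGram], n: int) -> List[int]:
    """Indices of test problems sharing at least one n-gram with the corpus."""
    return [i for i, p in enumerate(problems) if ngrams(p, n) & index]


def overlap_rate(problems: List[str], corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of test problems flagged as overlapping with the corpus."""
    if not problems:
        return 0.0
    index = build_index(corpus_docs, n)
    return len(flag_overlaps(problems, index, n)) / len(problems)
```

The indices returned by `flag_overlaps` could also drive the ablation mentioned in the response, by excluding flagged items before re-evaluating. At the scale of a real pretraining corpus, an in-memory set would likely need to be replaced by hashing or a streaming index; that choice is left open here.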

Circularity Check

0 steps flagged

No circularity: empirical training outcome on public benchmark


The paper reports continued pretraining of Code Llama on Proof-Pile-2 followed by direct evaluation on the public MATH benchmark. The central claim of outperformance is a measured empirical result, not a derived prediction, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the result to the training inputs by construction. The evaluation uses an external, publicly available test set whose correctness is independent of the model's internal parameters or prior self-citations. This is a standard self-contained empirical finding with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard machine-learning assumption that continued pretraining on a domain-specific corpus improves downstream task performance in that domain; no new mathematical axioms or invented physical entities are introduced.

axioms (1)

domain assumption Continued pretraining on domain data improves performance on related downstream tasks without catastrophic forgetting.
Implicit in the decision to continue pretraining Code Llama rather than train from scratch; stated in the abstract's description of the method.

pith-pipeline@v0.9.0 · 5659 in / 1223 out tokens · 31270 ms · 2026-05-19T08:12:33.761698+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
stat.ML 2026-05 unverdicted novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
cs.AI 2025-12 unverdicted novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
cs.CL 2024-10 conditional novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
cs.LG 2026-05 unverdicted novelty 6.0

SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
cs.LG 2026-05 unverdicted novelty 6.0

SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 6.0

An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
cs.CL 2026-02 unverdicted novelty 6.0

A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI 2025-07 unverdicted novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
LIMO: Less is More for Reasoning
cs.CL 2025-02 unverdicted novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already ...
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
cs.LG 2024-06 conditional novelty 6.0

Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
eess.SP 2026-04 unverdicted novelty 4.0

Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
TinyLlama: An Open-Source Small Language Model
cs.CL 2024-01 accept novelty 4.0

TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
