Generative Language Modeling for Automated Theorem Proving

Ilya Sutskever; Stanislas Polu

arxiv: 2009.03393 · v1 · pith:EHSUJMURnew · submitted 2020-09-07 · 💻 cs.LG · cs.AI· cs.CL· stat.ML

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu , Ilya Sutskever This is my paper

Pith reviewed 2026-05-23 05:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLstat.ML

keywords automated theorem provinglanguage modelingMetamathtransformersformal mathematicsproof generationgenerative models

0 comments

The pith

A transformer language model generated new short proofs accepted into the Metamath library.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether generative language models can address a key gap in automated theorem provers: the creation of original mathematical terms rather than just verification. The authors train a transformer on Metamath proofs to build GPT-f, an automated prover and proof assistant. GPT-f produced new short proofs that were accepted into the main Metamath library. This is presented as the first documented case of a deep-learning system contributing proofs adopted by a formal mathematics community. A sympathetic reader would care if this shows language models can move beyond search and verification into generating genuinely new formal content.

Core claim

GPT-f is a transformer-based language model trained to generate sequences in the Metamath formal language. When used as an automated prover, it found new short proofs for theorems that were subsequently accepted into the main Metamath library. The authors state this constitutes the first instance in which a deep-learning based system has contributed proofs adopted by a formal mathematics community.

What carries the argument

GPT-f, a transformer language model fine-tuned on Metamath proof sequences to generate complete proofs from theorem statements.

If this is right

Transformer models can generate original terms inside formal proof languages.
Proofs produced by such models can meet the syntactic and consistency standards required for library inclusion.
Language-model-based generation can be combined with existing Metamath proof search procedures.
Deep learning systems can contribute content directly usable by human-maintained formal libraries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training approach might transfer to other formal systems whose proof corpora are large enough to fine-tune on.
Future work could measure how much the model relies on memorization versus recombination of existing proof fragments.
If scaling holds, larger models could target harder open conjectures once sufficient formal data exists.

Load-bearing premise

That acceptance by Metamath library maintainers establishes both the correctness and the genuine novelty of the generated proofs.

What would settle it

An independent search showing that each accepted proof is equivalent to one already present in the library or derivable by existing non-language-model tactics without shortening or new structure.

read the original abstract

We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans -- the generation of original mathematical terms -- might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a language model generating Metamath proofs that library maintainers accepted, but supplies almost no internal metrics or comparisons to back the claim of meaningful novelty or shortness.

read the letter

The central takeaway is that GPT-f produced some proofs accepted into the main Metamath library, presented as the first deep-learning system to achieve community adoption of its output. That external acceptance is the concrete new element here; prior work had not reached that milestone according to the abstract. The motivation is also clear: using generative models to handle original term creation, which search-based provers struggle with. The approach of training on existing Metamath proofs and then sampling new steps is a direct attempt to address that gap, and the fact that any of the outputs survived human review is worth noting as a proof of concept. The paper does not appear to overclaim internal performance numbers in the abstract, which keeps the result grounded in an observable external event rather than self-reported accuracy. The soft spot is exactly the one flagged in the stress-test note. Acceptance by maintainers confirms syntactic validity and consistency with the library, but does not require or demonstrate that the proofs are shorter than existing ones or that the generated terms are original rather than recombinations already reachable by standard search. No counts of total candidates, rejection rates, length comparisons, or maintainer review criteria are supplied, so the significance rests on an interpretation of acceptance that the paper itself does not verify. This makes the result more of an existence proof than a quantified advance. The work is aimed at people already interested in machine learning for formal mathematics and would be useful in a reading group to discuss how to move from library acceptance to measurable improvement. It is coherent on its own terms and shows honest engagement with the formal-system setting, so it deserves a serious referee even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper explores the use of transformer-based generative language models for automated theorem proving in the Metamath formal system. It introduces GPT-f, an automated prover and proof assistant, and reports that the model generated new short proofs accepted into the main Metamath library—the first such contribution from a deep-learning system to a formal mathematics community.

Significance. If substantiated, the result would indicate that language models can generate original mathematical terms and proofs that integrate into established formal libraries, addressing a key limitation of traditional ATP systems. The work provides a concrete demonstration of applying generative modeling to formal mathematics, with potential implications for scaling proof search beyond exhaustive enumeration.

major comments (2)

[Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.
[Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.

minor comments (1)

[§2 or §3] Notation for the Metamath language and proof representation could be clarified with a brief example in §2 or §3 to aid readers unfamiliar with the formal system.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions where the manuscript can be strengthened without misrepresenting our results.

read point-by-point responses

Referee: [Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.

Authors: We agree that explicit comparisons would make the claim more robust. The proofs are new in the sense that they were absent from the library prior to submission and were reviewed and accepted as additions by the maintainers. Shortness is relative to the library's preference for concise proofs that pass verification. The manuscript does not include direct length comparisons or an enumeration. We will revise to add a table listing the contributed theorems, their proof lengths, and comparisons to any pre-existing proofs for the same statements (noting cases where none existed). We will also briefly describe the Metamath acceptance process, which requires syntactic validity, consistency, and favors brevity. revision: partial
Referee: [Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.

Authors: The paper's primary evaluation is the external validation via library acceptance as evidence that the model addresses term generation. The evaluation sections already contain proof-search success rates and some model performance metrics. We acknowledge that additional context on candidate generation volume, rejection rates, and explicit ablations would help readers assess the bottleneck claim. We will incorporate available experimental statistics on generation volume and rejections into the revised evaluation section and ensure ablation results are clearly referenced. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central result relies on external library acceptance rather than internal self-reference.

full rationale

The paper reports an empirical application of a transformer model to Metamath proof generation. Its primary claim—that generated proofs were accepted into the main library—depends on external maintainer review for syntactic validity and consistency, which is independent of the model's training procedure or any fitted parameters. No equations, predictions, or uniqueness arguments reduce by construction to the inputs; the result is not a renaming of known patterns or a self-citation chain. The derivation chain from training data to generated proofs remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the external fact of library acceptance rather than on any modeled quantity.

pith-pipeline@v0.9.0 · 5626 in / 1023 out tokens · 52830 ms · 2026-05-23T05:14:07.468281+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.
Foundation.DimensionForcing dimension_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We explore the application of transformer-based language models to automated theorem proving.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Advancing Mathematics Research with AI-Driven Formal Proof Search
cs.AI 2026-05 unverdicted novelty 7.0

LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
What are the Right Symmetries for Formal Theorem Proving?
cs.LG 2026-05 unverdicted novelty 7.0

Introduces rewriting categories to formalize proof equivariance and success invariance, shows LLM provers violate both, and demonstrates test-time aggregation recovers invariance and boosts performance.
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
cs.AI 2026-05 accept novelty 7.0

CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
ABD: Default Exception Abduction in Finite First Order Worlds
cs.AI 2026-02 unverdicted novelty 7.0

ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.
Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search
cs.LO 2026-05 unverdicted novelty 6.0

Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.
OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL 2026-05 unverdicted novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
cs.AI 2026-05 unverdicted novelty 6.0

Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
Measuring Representation Robustness in Large Language Models for Geometry
cs.CL 2026-04 unverdicted novelty 6.0

LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
Neuro-Symbolic Proof Generation for Scaling Systems Software Verification
cs.AI 2026-03 conditional novelty 6.0

A neuro-symbolic system using LLM-guided best-first search and Isabelle tools proves up to 77.6% of theorems on the seL4 benchmark, outperforming prior LLM methods and Sledgehammer.
VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
cs.CL 2026-01 unverdicted novelty 6.0

VERGE decomposes LLM outputs into atomic claims, autoformalizes them to first-order logic, verifies with SMT solvers and consensus, and refines via minimal correction subsets, yielding 18.7% average uplift on reasonin...
Aristotle: IMO-level Automated Theorem Proving
cs.AI 2025-10 unverdicted novelty 6.0

Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
Llemma: An Open Language Model For Mathematics
cs.CL 2023-10 unverdicted novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
cs.CL 2023-09 conditional novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on...
Measuring Coding Challenge Competence With APPS
cs.SE 2021-05 unverdicted novelty 6.0

APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving
cs.LG 2026-04 unverdicted novelty 5.0

OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
eess.SP 2026-04 unverdicted novelty 4.0

Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
AI for Mathematics: Progress, Challenges, and Prospects
math.HO 2026-01 unverdicted novelty 4.0

AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
When control meets large language models: From words to dynamics
eess.SY 2026-02 unverdicted novelty 3.0

The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
cs.CL 2025-02 unverdicted novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 22 Pith papers · 19 internal anchors

[1]

Imagenet classiﬁcation with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classiﬁcation with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

work page 2012
[2]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016
[3]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems , pages 3104–3112, 2014

work page 2014
[4]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[5]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Towards end-to-end speech recognition with recurrent neural networks

Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning , pages 1764–1772, 2014

work page 2014
[7]

Deep speech 2: End-to-end speech recognition in english and mandarin

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016

work page 2016
[8]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014

work page 2014
[9]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[12]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020

work page 2020
[13]

Mas- tering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016

work page 2016
[14]

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[15]

Grandmaster level in starcraft ii using multi-agent reinforcement learning

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

work page 2019
[16]

Solving Rubik's Cube with a Robot Hand

Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019. 16

work page internal anchor Pith review Pith/arXiv arXiv 1910
[17]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research , 17(1):1334–1373, 2016

work page 2016
[18]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017

work page 2017
[19]

Improving language understanding by generative pre-training, 2018

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

work page 2018
[20]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

work page 2019
[21]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[22]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

History of interactive theorem proving

John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In Computational Logic, volume 9, pages 135–214, 2014

work page 2014
[25]

Learning to prove theorems by learning to generate theorems

Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems. arXiv preprint arXiv:2002.07019, 2020

work page arXiv 2002
[26]

Deepmath-deep sequence models for premise selection

Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016

work page 2016
[27]

Premise selection for theorem proving by deep graph embedding

Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems , pages 2786–2796, 2017

work page 2017
[28]

Learning a SAT Solver from Single-Bit Supervision

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Deep Network Guided Proof Search

Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Holist: An environment for machine learning of higher order logic theorem proving

Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pages 454–463, 2019

work page 2019
[31]

Learning to reason in large theories without imitation

Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501, 2019

work page arXiv 1905
[32]

Mathematical reasoning via self-supervised skip-tree training

Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020

work page arXiv 2006
[33]

GamePad: A Learning Environment for Theorem Proving

Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[34]

Learning to Prove Theorems via Interacting with Proof Assistants

Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. arXiv preprint arXiv:1905.09381, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[35]

Holophrasm: a neural Automated Theorem Prover for higher-order logic

Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016. 17

work page internal anchor Pith review Pith/arXiv arXiv 2016
[36]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[37]

Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio- nale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[39]

Deep learning for symbolic mathematics

Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019

work page arXiv 1912
[40]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers. arXiv preprint arXiv:1807.03819, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

Analysing Mathematical Reasoning Abilities of Neural Models

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[42]

Megill and David A

Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe- matics, 2019. http://us.metamath.org/downloads/metamath.pdf

work page 2019
[43]

How Metamath Proofs Work , 2006

Norman Megill. How Metamath Proofs Work , 2006. http://us.metamath.org/mpeuni/mmset.html#proofs

work page 2006
[44]

de Bruijn factor

Freek Wiedijk. The "de Bruijn factor" , 2014. http://www.cs.ru.nl/ freek/factor/

work page 2014
[45]

Int: An inequality benchmark for evaluating generalization in theorem proving

Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020

work page arXiv 2007
[46]

Thinking fast and slow with deep learning and tree search

Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems , pages 5360–5370, 2017. A Key Results Table 13: Key results described in this paper (on the valid set) with a summary of the source of performance gains. Model Performance Gain Main ablation MetaGen...

work page 2017

[1] [1]

Imagenet classiﬁcation with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classiﬁcation with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

work page 2012

[2] [2]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016

[3] [3]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems , pages 3104–3112, 2014

work page 2014

[4] [4]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[5] [5]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Towards end-to-end speech recognition with recurrent neural networks

Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning , pages 1764–1772, 2014

work page 2014

[7] [7]

Deep speech 2: End-to-end speech recognition in english and mandarin

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016

work page 2016

[8] [8]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014

work page 2014

[9] [9]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019

[12] [12]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020

work page 2020

[13] [13]

Mas- tering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016

work page 2016

[14] [14]

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[15] [15]

Grandmaster level in starcraft ii using multi-agent reinforcement learning

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

work page 2019

[16] [16]

Solving Rubik's Cube with a Robot Hand

Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019. 16

work page internal anchor Pith review Pith/arXiv arXiv 1910

[17] [17]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research , 17(1):1334–1373, 2016

work page 2016

[18] [18]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017

work page 2017

[19] [19]

Improving language understanding by generative pre-training, 2018

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

work page 2018

[20] [20]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

work page 2019

[21] [21]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[22] [22]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

History of interactive theorem proving

John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In Computational Logic, volume 9, pages 135–214, 2014

work page 2014

[25] [25]

Learning to prove theorems by learning to generate theorems

Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems. arXiv preprint arXiv:2002.07019, 2020

work page arXiv 2002

[26] [26]

Deepmath-deep sequence models for premise selection

Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016

work page 2016

[27] [27]

Premise selection for theorem proving by deep graph embedding

Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems , pages 2786–2796, 2017

work page 2017

[28] [28]

Learning a SAT Solver from Single-Bit Supervision

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

Deep Network Guided Proof Search

Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Holist: An environment for machine learning of higher order logic theorem proving

Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pages 454–463, 2019

work page 2019

[31] [31]

Learning to reason in large theories without imitation

Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501, 2019

work page arXiv 1905

[32] [32]

Mathematical reasoning via self-supervised skip-tree training

Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020

work page arXiv 2006

[33] [33]

GamePad: A Learning Environment for Theorem Proving

Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[34] [34]

Learning to Prove Theorems via Interacting with Proof Assistants

Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. arXiv preprint arXiv:1905.09381, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[35] [35]

Holophrasm: a neural Automated Theorem Prover for higher-order logic

Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016. 17

work page internal anchor Pith review Pith/arXiv arXiv 2016

[36] [36]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910

[37] [37]

Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio- nale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[39] [39]

Deep learning for symbolic mathematics

Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019

work page arXiv 1912

[40] [40]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers. arXiv preprint arXiv:1807.03819, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[41] [41]

Analysing Mathematical Reasoning Abilities of Neural Models

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[42] [42]

Megill and David A

Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe- matics, 2019. http://us.metamath.org/downloads/metamath.pdf

work page 2019

[43] [43]

How Metamath Proofs Work , 2006

Norman Megill. How Metamath Proofs Work , 2006. http://us.metamath.org/mpeuni/mmset.html#proofs

work page 2006

[44] [44]

de Bruijn factor

Freek Wiedijk. The "de Bruijn factor" , 2014. http://www.cs.ru.nl/ freek/factor/

work page 2014

[45] [45]

Int: An inequality benchmark for evaluating generalization in theorem proving

Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020

work page arXiv 2007

[46] [46]

Thinking fast and slow with deep learning and tree search

Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems , pages 5360–5370, 2017. A Key Results Table 13: Key results described in this paper (on the valid set) with a summary of the source of performance gains. Model Performance Gain Main ablation MetaGen...

work page 2017