pith. sign in

arxiv: 2009.03393 · v1 · pith:EHSUJMURnew · submitted 2020-09-07 · 💻 cs.LG · cs.AI· cs.CL· stat.ML

Generative Language Modeling for Automated Theorem Proving

Pith reviewed 2026-05-23 05:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLstat.ML
keywords automated theorem provinglanguage modelingMetamathtransformersformal mathematicsproof generationgenerative models
0
0 comments X

The pith

A transformer language model generated new short proofs accepted into the Metamath library.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether generative language models can address a key gap in automated theorem provers: the creation of original mathematical terms rather than just verification. The authors train a transformer on Metamath proofs to build GPT-f, an automated prover and proof assistant. GPT-f produced new short proofs that were accepted into the main Metamath library. This is presented as the first documented case of a deep-learning system contributing proofs adopted by a formal mathematics community. A sympathetic reader would care if this shows language models can move beyond search and verification into generating genuinely new formal content.

Core claim

GPT-f is a transformer-based language model trained to generate sequences in the Metamath formal language. When used as an automated prover, it found new short proofs for theorems that were subsequently accepted into the main Metamath library. The authors state this constitutes the first instance in which a deep-learning based system has contributed proofs adopted by a formal mathematics community.

What carries the argument

GPT-f, a transformer language model fine-tuned on Metamath proof sequences to generate complete proofs from theorem statements.

If this is right

  • Transformer models can generate original terms inside formal proof languages.
  • Proofs produced by such models can meet the syntactic and consistency standards required for library inclusion.
  • Language-model-based generation can be combined with existing Metamath proof search procedures.
  • Deep learning systems can contribute content directly usable by human-maintained formal libraries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training approach might transfer to other formal systems whose proof corpora are large enough to fine-tune on.
  • Future work could measure how much the model relies on memorization versus recombination of existing proof fragments.
  • If scaling holds, larger models could target harder open conjectures once sufficient formal data exists.

Load-bearing premise

That acceptance by Metamath library maintainers establishes both the correctness and the genuine novelty of the generated proofs.

What would settle it

An independent search showing that each accepted proof is equivalent to one already present in the library or derivable by existing non-language-model tactics without shortening or new structure.

read the original abstract

We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans -- the generation of original mathematical terms -- might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper explores the use of transformer-based generative language models for automated theorem proving in the Metamath formal system. It introduces GPT-f, an automated prover and proof assistant, and reports that the model generated new short proofs accepted into the main Metamath library—the first such contribution from a deep-learning system to a formal mathematics community.

Significance. If substantiated, the result would indicate that language models can generate original mathematical terms and proofs that integrate into established formal libraries, addressing a key limitation of traditional ATP systems. The work provides a concrete demonstration of applying generative modeling to formal mathematics, with potential implications for scaling proof search beyond exhaustive enumeration.

major comments (2)
  1. [Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.
  2. [Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.
minor comments (1)
  1. [§2 or §3] Notation for the Metamath language and proof representation could be clarified with a brief example in §2 or §3 to aid readers unfamiliar with the formal system.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions where the manuscript can be strengthened without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.

    Authors: We agree that explicit comparisons would make the claim more robust. The proofs are new in the sense that they were absent from the library prior to submission and were reviewed and accepted as additions by the maintainers. Shortness is relative to the library's preference for concise proofs that pass verification. The manuscript does not include direct length comparisons or an enumeration. We will revise to add a table listing the contributed theorems, their proof lengths, and comparisons to any pre-existing proofs for the same statements (noting cases where none existed). We will also briefly describe the Metamath acceptance process, which requires syntactic validity, consistency, and favors brevity. revision: partial

  2. Referee: [Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.

    Authors: The paper's primary evaluation is the external validation via library acceptance as evidence that the model addresses term generation. The evaluation sections already contain proof-search success rates and some model performance metrics. We acknowledge that additional context on candidate generation volume, rejection rates, and explicit ablations would help readers assess the bottleneck claim. We will incorporate available experimental statistics on generation volume and rejections into the revised evaluation section and ensure ablation results are clearly referenced. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central result relies on external library acceptance rather than internal self-reference.

full rationale

The paper reports an empirical application of a transformer model to Metamath proof generation. Its primary claim—that generated proofs were accepted into the main library—depends on external maintainer review for syntactic validity and consistency, which is independent of the model's training procedure or any fitted parameters. No equations, predictions, or uniqueness arguments reduce by construction to the inputs; the result is not a renaming of known patterns or a self-citation chain. The derivation chain from training data to generated proofs remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the external fact of library acceptance rather than on any modeled quantity.

pith-pipeline@v0.9.0 · 5626 in / 1023 out tokens · 52830 ms · 2026-05-23T05:14:07.468281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.

  • Foundation.DimensionForcing dimension_forced unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We explore the application of transformer-based language models to automated theorem proving.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Advancing Mathematics Research with AI-Driven Formal Proof Search

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.

  2. What are the Right Symmetries for Formal Theorem Proving?

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces rewriting categories to formalize proof equivariance and success invariance, shows LLM provers violate both, and demonstrates test-time aggregation recovers invariance and boosts performance.

  3. CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

    cs.AI 2026-05 accept novelty 7.0

    CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.

  4. ABD: Default Exception Abduction in Finite First Order Worlds

    cs.AI 2026-02 unverdicted novelty 7.0

    ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.

  5. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  6. Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

    cs.LO 2026-05 unverdicted novelty 6.0

    Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.

  7. OProver: A Unified Framework for Agentic Formal Theorem Proving

    cs.CL 2026-05 unverdicted novelty 6.0

    OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...

  8. Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

    cs.AI 2026-05 unverdicted novelty 6.0

    Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

  9. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  10. Neuro-Symbolic Proof Generation for Scaling Systems Software Verification

    cs.AI 2026-03 conditional novelty 6.0

    A neuro-symbolic system using LLM-guided best-first search and Isabelle tools proves up to 77.6% of theorems on the seL4 benchmark, outperforming prior LLM methods and Sledgehammer.

  11. VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

    cs.CL 2026-01 unverdicted novelty 6.0

    VERGE decomposes LLM outputs into atomic claims, autoformalizes them to first-order logic, verifies with SMT solvers and consensus, and refines via minimal correction subsets, yielding 18.7% average uplift on reasonin...

  12. Aristotle: IMO-level Automated Theorem Proving

    cs.AI 2025-10 unverdicted novelty 6.0

    Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.

  13. Llemma: An Open Language Model For Mathematics

    cs.CL 2023-10 unverdicted novelty 6.0

    Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

  14. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

    cs.CL 2023-09 conditional novelty 6.0

    ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on...

  15. Measuring Coding Challenge Competence With APPS

    cs.SE 2021-05 unverdicted novelty 6.0

    APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.

  16. OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

    cs.LG 2026-04 unverdicted novelty 5.0

    OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.

  17. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  18. AI for Mathematics: Progress, Challenges, and Prospects

    math.HO 2026-01 unverdicted novelty 4.0

    AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.

  19. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  20. Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

    cs.AI 2026-05 unverdicted novelty 3.0

    Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

  21. When control meets large language models: From words to dynamics

    eess.SY 2026-02 unverdicted novelty 3.0

    The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.

  22. Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    cs.CL 2025-02 unverdicted novelty 2.0

    Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 22 Pith papers · 19 internal anchors

  1. [1]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012

  2. [2]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

  3. [3]

    Sequence to sequence learning with neural networks

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems , pages 3104–3112, 2014

  4. [4]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

  5. [5]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  6. [6]

    Towards end-to-end speech recognition with recurrent neural networks

    Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning , pages 1764–1772, 2014

  7. [7]

    Deep speech 2: End-to-end speech recognition in english and mandarin

    Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016

  8. [8]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014

  9. [9]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015

  10. [10]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017

  11. [11]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019

  12. [12]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020

  13. [13]

    Mas- tering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016

  14. [14]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

  15. [15]

    Grandmaster level in starcraft ii using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  16. [16]

    Solving Rubik's Cube with a Robot Hand

    Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019. 16

  17. [17]

    End-to-end training of deep visuomotor policies

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research , 17(1):1334–1373, 2016

  18. [18]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017

  19. [19]

    Improving language understanding by generative pre-training, 2018

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

  20. [20]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

  21. [21]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  22. [22]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  23. [23]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  24. [24]

    History of interactive theorem proving

    John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In Computational Logic, volume 9, pages 135–214, 2014

  25. [25]

    Learning to prove theorems by learning to generate theorems

    Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems. arXiv preprint arXiv:2002.07019, 2020

  26. [26]

    Deepmath-deep sequence models for premise selection

    Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016

  27. [27]

    Premise selection for theorem proving by deep graph embedding

    Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems , pages 2786–2796, 2017

  28. [28]

    Learning a SAT Solver from Single-Bit Supervision

    Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018

  29. [29]

    Deep Network Guided Proof Search

    Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017

  30. [30]

    Holist: An environment for machine learning of higher order logic theorem proving

    Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pages 454–463, 2019

  31. [31]

    Learning to reason in large theories without imitation

    Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501, 2019

  32. [32]

    Mathematical reasoning via self-supervised skip-tree training

    Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020

  33. [33]

    GamePad: A Learning Environment for Theorem Proving

    Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018

  34. [34]

    Learning to Prove Theorems via Interacting with Proof Assistants

    Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. arXiv preprint arXiv:1905.09381, 2019

  35. [35]

    Holophrasm: a neural Automated Theorem Prover for higher-order logic

    Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016. 17

  36. [36]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

  37. [37]

    Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio- nale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017

  38. [38]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

  39. [39]

    Deep learning for symbolic mathematics

    Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019

  40. [40]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers. arXiv preprint arXiv:1807.03819, 2018

  41. [41]

    Analysing Mathematical Reasoning Abilities of Neural Models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019

  42. [42]

    Megill and David A

    Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe- matics, 2019. http://us.metamath.org/downloads/metamath.pdf

  43. [43]

    How Metamath Proofs Work , 2006

    Norman Megill. How Metamath Proofs Work , 2006. http://us.metamath.org/mpeuni/mmset.html#proofs

  44. [44]

    de Bruijn factor

    Freek Wiedijk. The "de Bruijn factor" , 2014. http://www.cs.ru.nl/ freek/factor/

  45. [45]

    Int: An inequality benchmark for evaluating generalization in theorem proving

    Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020

  46. [46]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems , pages 5360–5370, 2017. A Key Results Table 13: Key results described in this paper (on the valid set) with a summary of the source of performance gains. Model Performance Gain Main ablation MetaGen...