Generative Language Modeling for Automated Theorem Proving
Pith reviewed 2026-05-23 05:14 UTC · model grok-4.3
The pith
A transformer language model generated new short proofs accepted into the Metamath library.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-f is a transformer-based language model trained to generate sequences in the Metamath formal language. When used as an automated prover, it found new short proofs for theorems that were subsequently accepted into the main Metamath library. The authors state this constitutes the first instance in which a deep-learning based system has contributed proofs adopted by a formal mathematics community.
What carries the argument
GPT-f, a transformer language model fine-tuned on Metamath proof sequences to generate complete proofs from theorem statements.
If this is right
- Transformer models can generate original terms inside formal proof languages.
- Proofs produced by such models can meet the syntactic and consistency standards required for library inclusion.
- Language-model-based generation can be combined with existing Metamath proof search procedures.
- Deep learning systems can contribute content directly usable by human-maintained formal libraries.
Where Pith is reading between the lines
- The same training approach might transfer to other formal systems whose proof corpora are large enough to fine-tune on.
- Future work could measure how much the model relies on memorization versus recombination of existing proof fragments.
- If scaling holds, larger models could target harder open conjectures once sufficient formal data exists.
Load-bearing premise
That acceptance by Metamath library maintainers establishes both the correctness and the genuine novelty of the generated proofs.
What would settle it
An independent search showing that each accepted proof is equivalent to one already present in the library or derivable by existing non-language-model tactics without shortening or new structure.
read the original abstract
We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans -- the generation of original mathematical terms -- might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores the use of transformer-based generative language models for automated theorem proving in the Metamath formal system. It introduces GPT-f, an automated prover and proof assistant, and reports that the model generated new short proofs accepted into the main Metamath library—the first such contribution from a deep-learning system to a formal mathematics community.
Significance. If substantiated, the result would indicate that language models can generate original mathematical terms and proofs that integrate into established formal libraries, addressing a key limitation of traditional ATP systems. The work provides a concrete demonstration of applying generative modeling to formal mathematics, with potential implications for scaling proof search beyond exhaustive enumeration.
major comments (2)
- [Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.
- [Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.
minor comments (1)
- [§2 or §3] Notation for the Metamath language and proof representation could be clarified with a brief example in §2 or §3 to aid readers unfamiliar with the formal system.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and proposed revisions where the manuscript can be strengthened without misrepresenting our results.
read point-by-point responses
-
Referee: [Abstract, §1] Abstract and §1: The central claim that GPT-f 'found new short proofs' accepted into the Metamath library is not supported by any explicit length comparisons to existing proofs for the same statements, enumeration of contributed proofs, or description of the maintainers' review criteria. Library acceptance confirms syntactic validity and consistency but does not establish novelty or shortness relative to prior entries.
Authors: We agree that explicit comparisons would make the claim more robust. The proofs are new in the sense that they were absent from the library prior to submission and were reviewed and accepted as additions by the maintainers. Shortness is relative to the library's preference for concise proofs that pass verification. The manuscript does not include direct length comparisons or an enumeration. We will revise to add a table listing the contributed theorems, their proof lengths, and comparisons to any pre-existing proofs for the same statements (noting cases where none existed). We will also briefly describe the Metamath acceptance process, which requires syntactic validity, consistency, and favors brevity. revision: partial
-
Referee: [Abstract, evaluation sections] The abstract and evaluation sections report only the outcome of acceptance without quantitative baselines, error rates, number of candidate proofs generated/rejected, or ablation studies on the language model components. This makes it impossible to assess whether the result supports the broader claim that language models address the term-generation bottleneck in ATP.
Authors: The paper's primary evaluation is the external validation via library acceptance as evidence that the model addresses term generation. The evaluation sections already contain proof-search success rates and some model performance metrics. We acknowledge that additional context on candidate generation volume, rejection rates, and explicit ablations would help readers assess the bottleneck claim. We will incorporate available experimental statistics on generation volume and rejections into the revised evaluation section and ensure ablation results are clearly referenced. revision: yes
Circularity Check
No significant circularity; central result relies on external library acceptance rather than internal self-reference.
full rationale
The paper reports an empirical application of a transformer model to Metamath proof generation. Its primary claim—that generated proofs were accepted into the main library—depends on external maintainer review for syntactic validity and consistency, which is independent of the model's training procedure or any fitted parameters. No equations, predictions, or uniqueness arguments reduce by construction to the inputs; the result is not a renaming of known patterns or a self-citation chain. The derivation chain from training data to generated proofs remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.
-
Foundation.DimensionForcingdimension_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We explore the application of transformer-based language models to automated theorem proving.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
Advancing Mathematics Research with AI-Driven Formal Proof Search
LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
-
What are the Right Symmetries for Formal Theorem Proving?
Introduces rewriting categories to formalize proof equivariance and success invariance, shows LLM provers violate both, and demonstrates test-time aggregation recovers invariance and boosts performance.
-
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
-
ABD: Default Exception Abduction in Finite First Order Worlds
ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search
Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Neuro-Symbolic Proof Generation for Scaling Systems Software Verification
A neuro-symbolic system using LLM-guided best-first search and Isabelle tools proves up to 77.6% of theorems on the seL4 benchmark, outperforming prior LLM methods and Sledgehammer.
-
VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
VERGE decomposes LLM outputs into atomic claims, autoformalizes them to first-order logic, verifies with SMT solvers and consensus, and refines via minimal correction subsets, yielding 18.7% average uplift on reasonin...
-
Aristotle: IMO-level Automated Theorem Proving
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on...
-
Measuring Coding Challenge Competence With APPS
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
-
OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving
OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.
-
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
-
AI for Mathematics: Progress, Challenges, and Prospects
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
-
When control meets large language models: From words to dynamics
The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
-
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
Reference graph
Works this paper leans on
-
[1]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems , pages 1097–1105, 2012
work page 2012
-
[2]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016
work page 2016
-
[3]
Sequence to sequence learning with neural networks
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems , pages 3104–3112, 2014
work page 2014
-
[4]
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[5]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[6]
Towards end-to-end speech recognition with recurrent neural networks
Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning , pages 1764–1772, 2014
work page 2014
-
[7]
Deep speech 2: End-to-end speech recognition in english and mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016
work page 2016
-
[8]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014
work page 2014
-
[9]
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[10]
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[11]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019
work page 2019
-
[12]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020
work page 2020
-
[13]
Mas- tering the game of go with deep neural networks and tree search
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016
work page 2016
-
[14]
Dota 2 with Large Scale Deep Reinforcement Learning
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[15]
Grandmaster level in starcraft ii using multi-agent reinforcement learning
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019
work page 2019
-
[16]
Solving Rubik's Cube with a Robot Hand
Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019. 16
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[17]
End-to-end training of deep visuomotor policies
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research , 17(1):1334–1373, 2016
work page 2016
-
[18]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017
work page 2017
-
[19]
Improving language understanding by generative pre-training, 2018
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018
work page 2018
-
[20]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019
work page 2019
-
[21]
Language Models are Few-Shot Learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[22]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[23]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[24]
History of interactive theorem proving
John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In Computational Logic, volume 9, pages 135–214, 2014
work page 2014
-
[25]
Learning to prove theorems by learning to generate theorems
Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems. arXiv preprint arXiv:2002.07019, 2020
-
[26]
Deepmath-deep sequence models for premise selection
Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016
work page 2016
-
[27]
Premise selection for theorem proving by deep graph embedding
Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems , pages 2786–2796, 2017
work page 2017
-
[28]
Learning a SAT Solver from Single-Bit Supervision
Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[29]
Deep Network Guided Proof Search
Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
Holist: An environment for machine learning of higher order logic theorem proving
Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pages 454–463, 2019
work page 2019
-
[31]
Learning to reason in large theories without imitation
Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501, 2019
-
[32]
Mathematical reasoning via self-supervised skip-tree training
Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020
-
[33]
GamePad: A Learning Environment for Theorem Proving
Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[34]
Learning to Prove Theorems via Interacting with Proof Assistants
Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants. arXiv preprint arXiv:1905.09381, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[35]
Holophrasm: a neural Automated Theorem Prover for higher-order logic
Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016. 17
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[36]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[37]
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio- nale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[39]
Deep learning for symbolic mathematics
Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019
-
[40]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers. arXiv preprint arXiv:1807.03819, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[41]
Analysing Mathematical Reasoning Abilities of Neural Models
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[42]
Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe- matics, 2019. http://us.metamath.org/downloads/metamath.pdf
work page 2019
-
[43]
How Metamath Proofs Work , 2006
Norman Megill. How Metamath Proofs Work , 2006. http://us.metamath.org/mpeuni/mmset.html#proofs
work page 2006
-
[44]
Freek Wiedijk. The "de Bruijn factor" , 2014. http://www.cs.ru.nl/ freek/factor/
work page 2014
-
[45]
Int: An inequality benchmark for evaluating generalization in theorem proving
Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020
-
[46]
Thinking fast and slow with deep learning and tree search
Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems , pages 5360–5370, 2017. A Key Results Table 13: Key results described in this paper (on the valid set) with a summary of the source of performance gains. Model Performance Gain Main ablation MetaGen...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.