pith. sign in

hub Canonical reference

Generative Language Modeling for Automated Theorem Proving

Canonical reference. 80% of citing Pith papers cite this work as background.

28 Pith papers citing it
Background 80% of classified citations
abstract

We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans -- the generation of original mathematical terms -- might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.

hub tools

citation-role summary

background 4 method 1

citation-polarity summary

representative citing papers

What are the Right Symmetries for Formal Theorem Proving?

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Introduces rewriting categories to formalize proof equivariance and success invariance, shows LLM provers violate both, and demonstrates test-time aggregation recovers invariance and boosts performance.

ABD: Default Exception Abduction in Finite First Order Worlds

cs.AI · 2026-02-21 · unverdicted · novelty 7.0

ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

RMA: an Agentic System for Research-Level Mathematical Problems

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Aristotle: IMO-level Automated Theorem Proving

cs.AI · 2025-10-01 · unverdicted · novelty 6.0

Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

cs.CL · 2023-09-29 · conditional · novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.

Solving math word problems with process- and outcome-based feedback

cs.LG · 2022-11-25 · unverdicted · novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.

Measuring Coding Challenge Competence With APPS

cs.SE · 2021-05-20 · unverdicted · novelty 6.0

APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.

citing papers explorer

Showing 28 of 28 citing papers.