MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
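To make the task concrete, here is what statement-level autoformalization looks like in Lean 4 with Mathlib. The example statement is our own illustration, not an item drawn from MathAtlas:

```lean
-- Natural-language statement: "Every finite integral domain is a field."
-- A candidate Lean 4 / Mathlib statement-level autoformalization
-- (our own illustrative example, not taken from the benchmark):
theorem finite_integralDomain_isField
    (R : Type*) [CommRing R] [IsDomain R] [Finite R] :
    IsField R := by
  sorry -- MathAtlas grades the statement; the proof is out of scope here
```

The benchmark scores whether the formal statement faithfully captures the informal one, which is why the proof body can be left open.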
14 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
-
MathAtlas: A Benchmark for Autoformalization in the Wild
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM-generated proofs for hard math problems differ widely on quality metrics such as conciseness and cognitive simplicity that correctness-only evaluation misses, and quality often trades off against correctness.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and has helped solve open problems in early tests.
-
Automatic Textbook Formalization
A multi-agent AI system formalizes an entire 500-page graduate algebraic combinatorics textbook into Lean, producing 130K lines of code in one week at human-expert cost.
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods; engineering tuning proves far easier for them than genuine invention.
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.
-
Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
The Obfuscated Natural Number Game shows that reasoning LLMs retain proof accuracy when semantic cues are stripped while general-purpose models degrade, establishing a metric for architectural reasoning in unfamiliar math domains.
-
Ablation and the Meno: Tools for Empirical Metamathematics
Meno and tactic ablation applied to Tao's Analysis I generate proof populations that embed on low-dimensional (one- or two-dimensional) submanifolds far from human constructions in Goedel Prover embedding space.
-
Scaling Self-Play with Self-Guidance
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
-
The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving
OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.
-
On Reasoning-Centric LLM-based Automated Theorem Proving
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
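Several entries above report sampling-based success metrics (e.g., OptProver's Pass@1 and Pass@32). For reference, this is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021); the function name is ours:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn without replacement from n attempts
    (of which c are correct) succeeds."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 attempts of which 1 is correct, `pass_at_k(2, 1, 1)` gives 0.5, matching the intuitive single-draw success rate.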
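The MCPP entry describes Monte Carlo simulation of workflow executions to estimate the probability of finishing within a constraint. A toy sketch of that idea; the exponential step-duration model and all names here are our assumptions for illustration, not MCPP's published method:

```python
import random

def completion_probability(step_means, budget, trials=10_000, seed=0):
    """Monte Carlo estimate of the chance that a sequential workflow
    finishes within `budget` time units, assuming (hypothetically)
    exponentially distributed step durations with the given means."""
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(trials)
        if sum(rng.expovariate(1.0 / m) for m in step_means) <= budget
    )
    return hits / trials
```

A planner in this style would recompute such estimates as steps complete and reallocate the remaining budget when the estimated probability drops below a target.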