LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
Lemmabench: A live, research-level benchmark to evaluate LLM capabilities in mathematics.CoRR, abs/2602.24173
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
Matlas introduces a semantic retrieval system over 8.07 million mathematical statements from papers and textbooks, using dependency graphs and topological unfolding for self-contained search via natural language queries.
citing papers explorer
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
Matlas: A Semantic Search Engine for Mathematics
Matlas introduces a semantic retrieval system over 8.07 million mathematical statements from papers and textbooks, using dependency graphs and topological unfolding for self-contained search via natural language queries.