LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.
citing papers explorer
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
-
STAR-P\'olyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.