Harp: A challenging human-annotated math reasoning benchmark

Albert S Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, Aaditya K Singh · 2024 · arXiv 2412.08819

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

cs.AI · 2025-05-29 · unverdicted · novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics cs.AI · 2026-05-09 · unverdicted · none · ref 25
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 50
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions cs.AI · 2025-05-29 · unverdicted · none · ref 36
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

Harp: A challenging human-annotated math reasoning benchmark

fields

years

verdicts

representative citing papers

citing papers explorer