A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

cs.AI · 2025-05-29 · unverdicted · novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

citing papers explorer

Showing 2 of 2 citing papers.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 51
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions cs.AI · 2025-05-29 · unverdicted · none · ref 37
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

fields

years

verdicts

representative citing papers

citing papers explorer