StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Changyu Zhu; Fan Zhou; Jiayi Xiang; Run Yang; Shuguang Yu; Wenxin E; Yichen Zhang; Yuchen Lu; Ziwei Wang

Despite rapid advances in large language models (LLMs), statistical reasoning remains underrepresented in existing LLM benchmarks, which often do not reflect the layered, proof-driven nature of real statistical practice. To address this gap, we introduce StatEval, the first large-scale benchmark for statistical reasoning across curricular and research-level settings. StatEval includes over 100,000 curated problems, with 20,000+ foundational questions spanning undergraduate and graduate curricula and 80,000+ research-level proof tasks extracted from leading statistical journals. To construct StatEval, we develop TRACE (Topology and Reasoning-Aware Context Extractor), a multi-agent pipeline with human-in-the-loop validation that converts unstructured academic texts into self-contained theorem-level reasoning tasks. We also propose an Adaptive Process-Based Scoring Pipeline for complex statistical proofs, enabling fine-grained evaluation beyond final-answer matching. Experiments show that while LLMs perform reasonably on foundational tasks, they struggle with rigorous research-level reasoning. Beyond evaluation, StatEval serves as a resource for improving reasoning, as retrieval-augmented generation and domain-specific alignment consistently enhance performance. Together, these results establish StatEval as both a benchmark and an infrastructure for advancing statistical reasoning in LLMs.
