Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap

Saurabh Srivastava, Annarose M B, Anto P V , Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, Sooraj Thomas · 2024 · arXiv 2402.19450

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

representative citing papers

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

cs.LG · 2024-10-07 · accept · novelty 7.0

LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

cs.LG · 2026-05-14 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.

Riemann-Bench: A Benchmark for Moonshot Mathematics

cs.AI · 2026-04-08 · conditional · novelty 5.0

Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

citing papers explorer

Showing 6 of 6 citing papers.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 34
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Robust Reasoning Benchmark cs.LG · 2026-03-26 · unverdicted · none · ref 47 · 2 links
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 39
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models cs.LG · 2024-10-07 · accept · none · ref 94
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 19
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Riemann-Bench: A Benchmark for Moonshot Mathematics cs.AI · 2026-04-08 · conditional · none · ref 15
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap

fields

years

verdicts

representative citing papers

citing papers explorer