U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background 1
citing papers explorer
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
  SuperGPQA is a new benchmark that tests LLMs on graduate questions from 285 disciplines after human-LLM filtering, with current best models scoring 61.82 percent.
- Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
  A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.