U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

URL https://api · 2024 · arXiv 2412.03205

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

cs.CL · 2025-02-20 · conditional · novelty 6.0

SuperGPQA is a new benchmark that tests LLMs on graduate-level questions from 285 disciplines after human-LLM filtering, with the current best models scoring 61.82 percent.

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

cs.AI · 2026-04-06 · unverdicted · novelty 5.0

A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

citing papers explorer

Showing 2 of 2 citing papers.

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines · cs.CL · 2025-02-20 · conditional · none · ref 1
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis · cs.AI · 2026-04-06 · unverdicted · none · ref 2
