Pith Number

pith:EWBEHR7L

pith:2026:EWBEHR7LLQH5LMGW7QVW2S3D46
not attested · not anchored · not stored · refs pending

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Akari Asai, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Catherine Arnett, Chaeyoung Han, Christian Stump, Dmitrii Karp, Dohyun Kwon, DoYong Kwon, Duk-Soon Oh, Giovanni Resta, Graham Neubig, Greta Panova, Guijin Son, Hanearl Jung, Huiyun Noh, Hyein Lee, Hyeonah Kang, Hyungryul Baik, Hyungsun Bae, Hyunwoo Ko, Inomov Mashrafdzhon, Jeewon Kim, Jiang Longxi, Jiaqi Liu, Jieui Kang, Ji Eun Lee, Jimin Kim, Jin Yun, Jon-Lark Kim, JungYup Lee, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Kyungmin Lee, Mario Kummer, Max Mercer, Minjun Kim, Nahyun Lee, Ng Ze-An, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Ruichen Zhang, Sam Yoosuk Kim, Sang Park, Sean Welleck, Sejin Park, Seonguk Seo, Seunghyeok Hong, Seungjae Lee, Seungone Kim, Seungyeop Yi, Shinae Shin, Shin Jaehoon, Sunatullo, SunHye Bok, Sunyoung Shin, Taewoong Eom, Yeachan Park, Yonghoon Ji, Yongseok Jang, Youchan Oh, Youngjae Yu, Youngtaek Kim, Zhaoyang Wang, Zoltán Kovács

A benchmark of 439 original research-level math problems shows that frontier LLMs solve roughly 30 percent at best and recognize ill-posed questions no more than half the time.

arxiv:2605.09063 v2 · 2026-05-09 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EWBEHR7LLQH5LMGW7QVW2S3D46}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while on the refusal subset no model exceeds 50%.

C2 · weakest assumption

That the 439 problems are genuinely original, uncontaminated by training data, and correctly classified as research-level by the 64 mathematicians who authored them.

C3 · one line summary

Soohak is a new 439-problem, mathematician-authored benchmark showing that frontier LLMs reach about 30% at best on research-level math and that none exceeds 50% at refusing ill-posed questions.

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:03:16.130063Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

258243c7eb5c0fd5b0d6fc2b6d4b63e7bb88cfad578d639315e4abdd2a1e60f3

Aliases

arxiv: 2605.09063 · arxiv_version: 2605.09063v2 · doi: 10.48550/arxiv.2605.09063 · pith_short_12: EWBEHR7LLQH5 · pith_short_16: EWBEHR7LLQH5LMGW · pith_short_8: EWBEHR7L
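The identifier and its short aliases can be related to the canonical hash with a few lines of Python. This is a sketch based only on the values shown on this page: the full Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the SHA-256 digest, and each `pith_short_N` alias is its N-character prefix. Treat that relationship as an observation rather than a documented rule; the helper names below are hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "258243c7eb5c0fd5b0d6fc2b6d4b63e7bb88cfad578d639315e4abdd2a1e60f3"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: base32-encode the raw digest, keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, sizes=(8, 12, 16)) -> dict:
    """Hypothetical helper: derive the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


full = pith_number_from_hash(CANONICAL_HASH)
print(full)                 # EWBEHR7LLQH5LMGW7QVW2S3D46
print(short_aliases(full))  # pith_short_8 -> EWBEHR7L, pith_short_12 -> EWBEHR7LLQH5, ...
```

The 26-character length is an assumption read off the identifier on this page; other records may follow a different convention.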
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EWBEHR7LLQH5LMGW7QVW2S3D46 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 258243c7eb5c0fd5b0d6fc2b6d4b63e7bb88cfad578d639315e4abdd2a1e60f3
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "63dd3bf500c9c94be883301624a100886a0cc429833dbc7ba241cf1220cb0756",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-09T17:14:22Z",
    "title_canon_sha256": "e17853d658a72e039e57bb88f262c9f91032ab756dda7b0d58b0eed805030eb8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.09063",
    "kind": "arxiv",
    "version": 2
  }
}
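
The shell pipeline under "Verify this Pith Number yourself" can also be run offline as a small Python function. The sketch below applies the same canonicalization as that one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record shown above. `canonical_sha256` is a hypothetical name, and the sketch assumes the served `canonical_record` is exactly this JSON.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "63dd3bf500c9c94be883301624a100886a0cc429833dbc7ba241cf1220cb0756",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-09T17:14:22Z",
    "title_canon_sha256": "e17853d658a72e039e57bb88f262c9f91032ab756dda7b0d58b0eed805030eb8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.09063",
    "kind": "arxiv",
    "version": 2
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields arrive, which is what lets any mirror recompute the same value.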