Pith Number

pith:BOZBDNTQ

pith:2024:BOZBDNTQTSVGQ3OV3K34FZNYHQ
not attested · not anchored · not stored · refs resolved

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Baobao Chang, Benyou Wang, Bofei Gao, Chenghao Ma, Daoguang Zan, Feifan Song, Ge Zhang, Lei Li, Lei Sha, Liang Chen, Qingxiu Dong, Runxin Xu, Shanghaoran Quan, Tianyu Liu, Xuancheng Ren, Yibo Miao, Yichang Zhang, Zefan Cai, Zhengyang Tang, Zhe Yang

A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.

arxiv:2410.07985 v3 · 2024-10-10 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{BOZBDNTQTSVGQ3OV3K34FZNYHQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy

C2 · weakest assumption

The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics, with human annotation free of selection bias or verification errors.

C3 · one-line summary

Omni-MATH supplies 4428 human-verified Olympiad-level math problems on which even the strongest models tested, OpenAI o1-mini and OpenAI o1-preview, reach only 60.54% and 52.55% accuracy.

References

77 extracted · 77 resolved · 15 Pith anchors

[1] Training Verifiers to Solve Math Word Problems. 2021.
[2] Measuring Mathematical Problem Solving With the MATH Dataset. 2021.
[3] Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models. 2023.
[4] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. 2022.
[5] ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. 2023.

Cited by

29 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:52.937178Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b

Aliases

arxiv: 2410.07985 · arxiv_version: 2410.07985v3 · doi: 10.48550/arxiv.2410.07985 · pith_short_12: BOZBDNTQTSVG · pith_short_16: BOZBDNTQTSVGQ3OV · pith_short_8: BOZBDNTQ
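
The identifiers on this page appear to be related to the canonical hash. For this record, the 26-character Pith Number equals the unpadded Base32 encoding of the first 16 bytes of the canonical SHA-256 hash, and the pith_short_8, pith_short_12, and pith_short_16 aliases are prefixes of it. The Python sketch below checks that relationship. It is an observation about this one record, not a documented rule; the function names pith_from_hash and short_aliases are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b"
PITH_NUMBER = "BOZBDNTQTSVGQ3OV3K34FZNYHQ"


def pith_from_hash(hex_digest: str) -> str:
    """Assumed rule: Base32 of the first 16 digest bytes, '=' padding removed."""
    head = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(head).decode("ascii").rstrip("=")


def short_aliases(pith: str) -> dict:
    """Short aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}


derived = pith_from_hash(CANONICAL_HASH)
aliases = short_aliases(derived)
print(derived == PITH_NUMBER)
print(aliases)
```

If the assumed rule held in general, a mirror could derive the identifier and its short forms from the canonical hash alone, without a lookup.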
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "82870184c474010992b21114f4f6a26d9e67b812d579908e14bf4a1353f907c0",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/publicdomain/zero/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-10-10T14:39:33Z",
    "title_canon_sha256": "e103455cf7c83326169aaa95e18f53b32cf2ecf649d774e9bcbbffd9d9379194"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.07985",
    "kind": "arxiv",
    "version": 3
  }
}
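
The shell one-liner under "Verify this Pith Number yourself" can also be run offline against the JSON shown above. The Python sketch below applies the same serialization as that one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping) and compares the digest with the published canonical hash. The function name canonical_sha256 is hypothetical, and whether the comparison succeeds depends on the JSON on this page being byte-for-byte the record the service hashed, which is an assumption.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "82870184c474010992b21114f4f6a26d9e67b812d579908e14bf4a1353f907c0",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/publicdomain/zero/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-10-10T14:39:33Z",
        "title_canon_sha256": "e103455cf7c83326169aaa95e18f53b32cf2ecf649d774e9bcbbffd9d9379194",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.07985", "kind": "arxiv", "version": 3},
}
PUBLISHED_HASH = "0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b"


def canonical_sha256(record: dict) -> str:
    """Hash the record using the serialization from the page's shell command."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print("computed: ", digest)
print("published:", PUBLISHED_HASH)
print("match:", digest == PUBLISHED_HASH)

# Key order in the input does not affect the digest because keys are sorted.
shuffled = {
    "source": RECORD["source"],
    "schema_version": RECORD["schema_version"],
    "metadata": RECORD["metadata"],
}
same_after_shuffle = canonical_sha256(shuffled) == digest
```

Sorting keys and fixing the separators is what makes the digest independent of how a mirror happens to order or pretty-print the JSON.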