{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:BOZBDNTQTSVGQ3OV3K34FZNYHQ","short_pith_number":"pith:BOZBDNTQ","schema_version":"1.0","canonical_sha256":"0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b","source":{"kind":"arxiv","id":"2410.07985","version":3},"attestation_state":"computed","paper":{"title":"Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models","license":"http://creativecommons.org/publicdomain/zero/1.0/","headline":"A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Baobao Chang, Benyou Wang, Bofei Gao, Chenghao Ma, Daoguang Zan, Feifan Song, Ge Zhang, Lei Li, Lei Sha, Liang Chen, Qingxiu Dong, Runxin Xu, Shanghaoran Quan, Tianyu Liu, Xuancheng Ren, Yibo Miao, Yichang Zhang, Zefan Cai, Zhengyang Tang, Zhe Yang","submitted_at":"2024-10-10T14:39:33Z","abstract_excerpt":"Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8\\% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a va"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2410.07985","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/publicdomain/zero/1.0/","primary_cat":"cs.CL","submitted_at":"2024-10-10T14:39:33Z","cross_cats_sorted":[],"title_canon_sha256":"e103455cf7c83326169aaa95e18f53b32cf2ecf649d774e9bcbbffd9d9379194","abstract_canon_sha256":"82870184c474010992b21114f4f6a26d9e67b812d579908e14bf4a1353f907c0"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.937760Z","signature_b64":"yAJ9gIHyNMj4pD4ELp4bktTHpVI8SFnLFTt9mTI007oU6uIU/kUwYitAi8y33nIAi0gT2i8KpxBAyGeEkh+qDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0bb211b6709caa686dd5dab7c2e5b83c3018ee224315ed0473d3973dd3e1623b","last_reissued_at":"2026-05-17T23:38:52.937178Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.937178Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models","license":"http://creativecommons.org/publicdomain/zero/1.0/","headline":"A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Baobao Chang, Benyou Wang, Bofei Gao, Chenghao Ma, Daoguang Zan, Feifan Song, Ge Zhang, Lei Li, Lei Sha, Liang Chen, Qingxiu Dong, Runxin Xu, Shanghaoran Quan, Tianyu Liu, Xuancheng Ren, Yibo Miao, Yichang Zhang, Zefan Cai, Zhengyang Tang, Zhe Yang","submitted_at":"2024-10-10T14:39:33Z","abstract_excerpt":"Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8\\% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a va"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics, with human annotation free of selection bias or verification errors.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"295e7fa6fb1bef9d8f966868a98ad7c8d9c16be9a13b32cbfc20acd9ef01dfe0"},"source":{"id":"2410.07985","kind":"arxiv","version":3},"verdict":{"id":"0e53d206-3ba1-4673-84e9-33b811ac3427","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T09:04:10.133306Z","strongest_claim":"even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy","one_line_summary":"Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics, with human annotation free of selection bias or verification errors.","pith_extraction_headline":"A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy."},"references":{"count":77,"sample":[{"doi":"","year":2021,"title":"Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=","work_id":"0d46a05b-859d-423c-ae95-e7bb4f120561","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Measuring Mathematical Problem Solving With the MATH Dataset , author=. 2021 , eprint=","work_id":"96984740-1b0a-4e03-9a02-9a0f6b7c8314","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models , author=. 2023 , eprint=","work_id":"b4fa534b-c783-45a0-b643-85cee2c4d782","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics , author=. 2022 , eprint=","work_id":"70cbb291-0c52-4b71-9a74-14949ec2d28b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics , author=. 2023 , eprint=","work_id":"186db717-bd16-4a71-9357-e93fdf5c2941","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":77,"snapshot_sha256":"ea96e6fad5a9729ddf9a4593fb01df5b0cc3d64fd83aa8084bfb087f22ed3f72","internal_anchors":15},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.07985","created_at":"2026-05-17T23:38:52.937279+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.07985v3","created_at":"2026-05-17T23:38:52.937279+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.07985","created_at":"2026-05-17T23:38:52.937279+00:00"},{"alias_kind":"pith_short_12","alias_value":"BOZBDNTQTSVG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"BOZBDNTQTSVGQ3OV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"BOZBDNTQ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2605.22875","citing_title":"RMA: an Agentic System for Research-Level Mathematical Problems","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23904","citing_title":"SkillOpt: Executive Strategy for Self-Evolving Agent Skills","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2503.16549","citing_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21380","citing_title":"Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2509.17677","citing_title":"EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2503.04697","citing_title":"L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07536","citing_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2601.18832","citing_title":"The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2504.11456","citing_title":"DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2504.20571","citing_title":"Reinforcement Learning for Reasoning in Large Language Models with One Training Example","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23281","citing_title":"MathArena: Evaluating LLMs on Uncontaminated Math Competitions","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12524","citing_title":"Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2512.15745","citing_title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11625","citing_title":"Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08686","citing_title":"Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09292","citing_title":"Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09544","citing_title":"TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06116","citing_title":"Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22597","citing_title":"Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04992","citing_title":"You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00365","citing_title":"Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21510","citing_title":"OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20183","citing_title":"Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19087","citing_title":"OLLM: Options-based Large Language Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18584","citing_title":"MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval","ref_index":1,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ","json":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ.json","graph_json":"https://pith.science/api/pith-number/BOZBDNTQTSVGQ3OV3K34FZNYHQ/graph.json","events_json":"https://pith.science/api/pith-number/BOZBDNTQTSVGQ3OV3K34FZNYHQ/events.json","paper":"https://pith.science/paper/BOZBDNTQ"},"agent_actions":{"view_html":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ","download_json":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ.json","view_paper":"https://pith.science/paper/BOZBDNTQ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.07985&json=true","fetch_graph":"https://pith.science/api/pith-number/BOZBDNTQTSVGQ3OV3K34FZNYHQ/graph.json","fetch_events":"https://pith.science/api/pith-number/BOZBDNTQTSVGQ3OV3K34FZNYHQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ/action/storage_attestation","attest_author":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ/action/author_attestation","sign_citation":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ/action/citation_signature","submit_replication":"https://pith.science/pith/BOZBDNTQTSVGQ3OV3K34FZNYHQ/action/replication_record"}},"created_at":"2026-05-17T23:38:52.937279+00:00","updated_at":"2026-05-17T23:38:52.937279+00:00"}