{"paper":{"title":"Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers.","cross_cats":["cs.AI","cs.CL","cs.LG","math.HO"],"primary_cat":"cs.CV","authors_text":"Hongsheng Li, Junting Pan, Ke Wang, Mingjie Zhan, Weikang Shi, Zimu Lu","submitted_at":"2024-02-22T18:56:38Z","abstract_excerpt":"Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The curation process from real competitions produces a representative and unbiased sample of visual mathematical reasoning challenges without introducing selection effects that favor certain problem types.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MATH-Vision is a new benchmark of 3,040 visual mathematical competition problems that reveals substantial gaps between large multimodal models and human performance in mathematical reasoning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a2b4abad794b3fe49eb2943d1a65570254906397f11cb93987467acf9d202cb9"},"source":{"id":"2402.14804","kind":"arxiv","version":1},"verdict":{"id":"11a322dc-5d41-4afc-9f9a-9b10b6216dc7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:40:56.021007Z","strongest_claim":"Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.","one_line_summary":"MATH-Vision is a new benchmark of 3,040 visual mathematical competition problems that reveals substantial gaps between large multimodal models and human performance in mathematical reasoning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The curation process from real competitions produces a representative and unbiased sample of visual mathematical reasoning challenges without introducing selection effects that favor certain problem types.","pith_extraction_headline":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers."},"references":{"count":27,"sample":[{"doi":"","year":2023,"title":"GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning","work_id":"dffe6af3-2c37-4256-b87a-6eab51b0f488","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos","work_id":"f6366d6b-34c7-4db1-8b33-2ceadd5f3d7c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","ref_index":3,"cited_arxiv_id":"2310.02255","is_internal_anchor":true},{"doi":"","year":2022,"title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","work_id":"da087b16-ea05-4064-980e-ce1d6e281d49","ref_index":4,"cited_arxiv_id":"2311.16502","is_internal_anchor":true},{"doi":"","year":2010,"title":"\". If it is a multiple choice question, only one letter is allowed in the","work_id":"543e34b9-911b-41f2-8648-f19f9827ed4b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":27,"snapshot_sha256":"3b8e6e724b0aba510a4a8f9d99696ee061a6e64361c6f73baf1a4f7072e76fe5","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a77b415df275f68d79864ad0d3029cd3094d051a1b1f6b780fbd98381b1c3f62"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}