{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:JATB3AB7PTUEINKWCUPAMJTI3S","short_pith_number":"pith:JATB3AB7","schema_version":"1.0","canonical_sha256":"48261d803f7ce8443556151e062668dcb02f1c5399b48b445cea402c82a6e058","source":{"kind":"arxiv","id":"2402.14804","version":1},"attestation_state":"computed","paper":{"title":"Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers.","cross_cats":["cs.AI","cs.CL","cs.LG","math.HO"],"primary_cat":"cs.CV","authors_text":"Hongsheng Li, Junting Pan, Ke Wang, Mingjie Zhan, Weikang Shi, Zimu Lu","submitted_at":"2024-02-22T18:56:38Z","abstract_excerpt":"Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2402.14804","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.CV","submitted_at":"2024-02-22T18:56:38Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG","math.HO"],"title_canon_sha256":"cfba9a1c2871cc9808744fdd45c3ddacdb2bc0720bce42dbe33b3e93b236e091","abstract_canon_sha256":"ab8db3b1d9ffae8b1caabd3e948c239ccc005888bae1db02f29069b09bca720b"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.128095Z","signature_b64":"RBoVxXKoXBJ+XJt9IaEn3rQfBtg0FW8XTTkLgWyUM80TQjaw1e8NevbEJKJbnPFMglzyDKzSWMs2WhprN3bsAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"48261d803f7ce8443556151e062668dcb02f1c5399b48b445cea402c82a6e058","last_reissued_at":"2026-05-17T23:38:13.127495Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.127495Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers.","cross_cats":["cs.AI","cs.CL","cs.LG","math.HO"],"primary_cat":"cs.CV","authors_text":"Hongsheng Li, Junting Pan, Ke Wang, Mingjie Zhan, Weikang Shi, Zimu Lu","submitted_at":"2024-02-22T18:56:38Z","abstract_excerpt":"Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The curation process from real competitions produces a representative and unbiased sample of visual mathematical reasoning challenges without introducing selection effects that favor certain problem types.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MATH-Vision is a new benchmark of 3,040 visual mathematical competition problems that reveals substantial gaps between large multimodal models and human performance in mathematical reasoning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a2b4abad794b3fe49eb2943d1a65570254906397f11cb93987467acf9d202cb9"},"source":{"id":"2402.14804","kind":"arxiv","version":1},"verdict":{"id":"11a322dc-5d41-4afc-9f9a-9b10b6216dc7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:40:56.021007Z","strongest_claim":"Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.","one_line_summary":"MATH-Vision is a new benchmark of 3,040 visual mathematical competition problems that reveals substantial gaps between large multimodal models and human performance in mathematical reasoning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The curation process from real competitions produces a representative and unbiased sample of visual mathematical reasoning challenges without introducing selection effects that favor certain problem types.","pith_extraction_headline":"The MATH-Vision dataset of 3,040 competition-sourced visual math problems reveals a large performance gap between current large multimodal models and human solvers."},"references":{"count":27,"sample":[{"doi":"","year":2023,"title":"GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning","work_id":"dffe6af3-2c37-4256-b87a-6eab51b0f488","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos","work_id":"f6366d6b-34c7-4db1-8b33-2ceadd5f3d7c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","ref_index":3,"cited_arxiv_id":"2310.02255","is_internal_anchor":true},{"doi":"","year":2022,"title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","work_id":"da087b16-ea05-4064-980e-ce1d6e281d49","ref_index":4,"cited_arxiv_id":"2311.16502","is_internal_anchor":true},{"doi":"","year":2010,"title":"\". If it is a multiple choice question, only one letter is allowed in the","work_id":"543e34b9-911b-41f2-8648-f19f9827ed4b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":27,"snapshot_sha256":"3b8e6e724b0aba510a4a8f9d99696ee061a6e64361c6f73baf1a4f7072e76fe5","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a77b415df275f68d79864ad0d3029cd3094d051a1b1f6b780fbd98381b1c3f62"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2402.14804","created_at":"2026-05-17T23:38:13.127585+00:00"},{"alias_kind":"arxiv_version","alias_value":"2402.14804v1","created_at":"2026-05-17T23:38:13.127585+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2402.14804","created_at":"2026-05-17T23:38:13.127585+00:00"},{"alias_kind":"pith_short_12","alias_value":"JATB3AB7PTUE","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"JATB3AB7PTUEINKW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"JATB3AB7","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2509.22746","citing_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23322","citing_title":"Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2501.00321","citing_title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14164","citing_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","ref_index":117,"is_internal_anchor":true},{"citing_arxiv_id":"2406.16860","citing_title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","ref_index":133,"is_internal_anchor":true},{"citing_arxiv_id":"2411.10442","citing_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03128","citing_title":"Self-Distilled RLVR","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11301","citing_title":"LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2502.16982","citing_title":"Muon is Scalable for LLM Training","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12358","citing_title":"Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10228","citing_title":"SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10219","citing_title":"Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01006","citing_title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13106","citing_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","ref_index":117,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02276","citing_title":"Kimi K2.5: Visual Agentic Intelligence","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10479","citing_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","ref_index":120,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05271","citing_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","ref_index":246,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18265","citing_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","ref_index":135,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S","json":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S.json","graph_json":"https://pith.science/api/pith-number/JATB3AB7PTUEINKWCUPAMJTI3S/graph.json","events_json":"https://pith.science/api/pith-number/JATB3AB7PTUEINKWCUPAMJTI3S/events.json","paper":"https://pith.science/paper/JATB3AB7"},"agent_actions":{"view_html":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S","download_json":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S.json","view_paper":"https://pith.science/paper/JATB3AB7","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2402.14804&json=true","fetch_graph":"https://pith.science/api/pith-number/JATB3AB7PTUEINKWCUPAMJTI3S/graph.json","fetch_events":"https://pith.science/api/pith-number/JATB3AB7PTUEINKWCUPAMJTI3S/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S/action/timestamp_anchor","attest_storage":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S/action/storage_attestation","attest_author":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S/action/author_attestation","sign_citation":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S/action/citation_signature","submit_replication":"https://pith.science/pith/JATB3AB7PTUEINKWCUPAMJTI3S/action/replication_record"}},"created_at":"2026-05-17T23:38:13.127585+00:00","updated_at":"2026-05-17T23:38:13.127585+00:00"}