{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:YPVVT7ZXFHYHHO7BQ3DGEIG62C","short_pith_number":"pith:YPVVT7ZX","schema_version":"1.0","canonical_sha256":"c3eb59ff3729f073bbe186c66220ded0ab065cbf552b72ccd37c1d4642381502","source":{"kind":"arxiv","id":"2407.01284","version":1},"attestation_state":"computed","paper":{"title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Most large multimodal models solve visual math by rote memorization rather than grasping underlying concepts.","cross_cats":["cs.CL","cs.CV","cs.LG","cs.SC"],"primary_cat":"cs.AI","authors_text":"Chen Li, Chong Sun, Guanting Dong, Honggang Zhang, Miaoxuan Zhang, Minhui Wu, Muxi Diao, Qiuna Tan, Runfeng Qiao, Runqi Qiao, Shanglin Lei, Xiaoshuai Song, Xiao Zong, Yida Xu, Yifan Zhang, Zhe Wei, Zhimin Bao, Zhuoma Gongque","submitted_at":"2024-07-01T13:39:08Z","abstract_excerpt":"Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.01284","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08Z","cross_cats_sorted":["cs.CL","cs.CV","cs.LG","cs.SC"],"title_canon_sha256":"2569764b94b36b7df5a6002cad881f0376ed566570187feb8b19985df880ebaf","abstract_canon_sha256":"7c039bb5895f2369789dd0dab3df8b0c5538a688f2f9fd9d3c53c8c91157a385"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.457079Z","signature_b64":"mu3MxBgR7DM9jApaDGfLIeXPUpLTIkRIliKs1it7l0QlNVQiF3Jb2m9A6rDlzrQo+Vs3nnQmElH9tp7HB+72Dg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c3eb59ff3729f073bbe186c66220ded0ab065cbf552b72ccd37c1d4642381502","last_reissued_at":"2026-05-17T23:38:46.456615Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.456615Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Most large multimodal models solve visual math by rote memorization rather than grasping underlying concepts.","cross_cats":["cs.CL","cs.CV","cs.LG","cs.SC"],"primary_cat":"cs.AI","authors_text":"Chen Li, Chong Sun, Guanting Dong, Honggang Zhang, Miaoxuan Zhang, Minhui Wu, Muxi Diao, Qiuna Tan, Runfeng Qiao, Runqi Qiao, Shanglin Lei, Xiaoshuai Song, Xiao Zong, Yida Xu, Yifan Zhang, Zhe Wei, Zhimin Bao, Zhuoma Gongque","submitted_at":"2024-07-01T13:39:08Z","abstract_excerpt":"Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That decomposing composite problems into sub-problems according to the required knowledge concepts accurately isolates inherent reasoning issues rather than introducing artifacts from visual parsing errors or ambiguous concept boundaries.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Most large multimodal models solve visual math by rote memorization rather than grasping underlying concepts.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5f4163e86b9b61e95a7e42966b13aa5f0d51e17f72c351f32204a0bc169a4260"},"source":{"id":"2407.01284","kind":"arxiv","version":1},"verdict":{"id":"722db9c0-a55c-4891-bcdc-09cd5961df1c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:52:03.607052Z","strongest_claim":"the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems.","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That decomposing composite problems into sub-problems according to the required knowledge concepts accurately isolates inherent reasoning issues rather than introducing artifacts from visual parsing errors or ambiguous concept boundaries.","pith_extraction_headline":"Most large multimodal models solve visual math by rote memorization rather than grasping underlying concepts."},"references":{"count":166,"sample":[{"doi":"","year":2015,"title":"Deep learning.nature, 521(7553):436–444, 2015","work_id":"8c42ff53-c495-4b0d-8fa1-03b2d8f9af31","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1998,"title":"Gradient-based learning applied to document recognition","work_id":"2fdc4d60-bf35-48a0-bf09-40fe4cd1de32","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Attention is all you need","work_id":"ac9d72e5-eb60-417e-963d-25671207074e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, ","work_id":"9a5da3ad-8044-47e0-96db-ee6aae074eed","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":5,"cited_arxiv_id":"2303.08774","is_internal_anchor":true}],"resolved_work":166,"snapshot_sha256":"ce7947c3037ecd1fa7a4d4e6a1243b7db5aaeeee49d7a5446b85b1385bfdcee5","internal_anchors":24},"formal_canon":{"evidence_count":2,"snapshot_sha256":"e9d873d779809b744bd3f9aab1abbe428542722a3b5140916a11427db5834d46"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.01284","created_at":"2026-05-17T23:38:46.456694+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.01284v1","created_at":"2026-05-17T23:38:46.456694+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.01284","created_at":"2026-05-17T23:38:46.456694+00:00"},{"alias_kind":"pith_short_12","alias_value":"YPVVT7ZXFHYH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YPVVT7ZXFHYHHO7B","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YPVVT7ZX","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"2410.04509","citing_title":"ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2502.02871","citing_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","ref_index":155,"is_internal_anchor":true},{"citing_arxiv_id":"2503.16549","citing_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2508.03556","citing_title":"VRPRM: Process Reward Modeling via Visual Reasoning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21924","citing_title":"Visual-Advantage On-Policy Distillation for Vision-Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16371","citing_title":"GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19852","citing_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19461","citing_title":"Beyond Mode Collapse: Distribution Matching for Diverse Reasoning","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2503.17352","citing_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06448","citing_title":"Perception-Aware Policy Optimization for Multimodal Reasoning","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22746","citing_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23322","citing_title":"Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10606","citing_title":"ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2511.19972","citing_title":"Boosting Reasoning in Large Multimodal Models via Activation Replay","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2411.10442","citing_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2511.05271","citing_title":"DeepEyesV2: Toward Agentic Multimodal Model","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2503.10615","citing_title":"R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01070","citing_title":"How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12163","citing_title":"Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03128","citing_title":"Self-Distilled RLVR","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05366","citing_title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11856","citing_title":"UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12163","citing_title":"Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09262","citing_title":"Reinforcing Multimodal Reasoning Against Visual Degradation","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2505.14362","citing_title":"DeepEyes: Incentivizing \"Thinking with Images\" via Reinforcement Learning","ref_index":17,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C","json":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C.json","graph_json":"https://pith.science/api/pith-number/YPVVT7ZXFHYHHO7BQ3DGEIG62C/graph.json","events_json":"https://pith.science/api/pith-number/YPVVT7ZXFHYHHO7BQ3DGEIG62C/events.json","paper":"https://pith.science/paper/YPVVT7ZX"},"agent_actions":{"view_html":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C","download_json":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C.json","view_paper":"https://pith.science/paper/YPVVT7ZX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.01284&json=true","fetch_graph":"https://pith.science/api/pith-number/YPVVT7ZXFHYHHO7BQ3DGEIG62C/graph.json","fetch_events":"https://pith.science/api/pith-number/YPVVT7ZXFHYHHO7BQ3DGEIG62C/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C/action/storage_attestation","attest_author":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C/action/author_attestation","sign_citation":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C/action/citation_signature","submit_replication":"https://pith.science/pith/YPVVT7ZXFHYHHO7BQ3DGEIG62C/action/replication_record"}},"created_at":"2026-05-17T23:38:46.456694+00:00","updated_at":"2026-05-17T23:38:46.456694+00:00"}