{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:Y2PRIO4E3LEENHWXTSUJMEA4GQ","short_pith_number":"pith:Y2PRIO4E","schema_version":"1.0","canonical_sha256":"c69f143b84dac8469ed79ca896101c343937602e2da37b54f59641f4f9c4056c","source":{"kind":"arxiv","id":"2103.07191","version":2},"attestation_state":"computed","paper":{"title":"Are NLP Models really able to Solve Simple Math Word Problems?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"NLP solvers for simple math word problems achieve high benchmark scores by exploiting shallow patterns instead of actual reasoning.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Arkil Patel, Navin Goyal, Satwik Bhattamishra","submitted_at":"2021-03-12T10:23:47Z","abstract_excerpt":"The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered \"solved\" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performanc"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2103.07191","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2021-03-12T10:23:47Z","cross_cats_sorted":[],"title_canon_sha256":"27961b34ee3f17ffb2c39dc34923abd3cbb8e795187eb39ceb96d48b3f33953d","abstract_canon_sha256":"620056386ef10602f08752d34cf4fdc9b37ad1f22161507fa11deb87ca0ad34a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.116211Z","signature_b64":"P9H3kPswR3+6ZHFos/bmehUKd/3z6BEoXc7BOFZ2fJj+LvjWyoYVQH3ZppqPvqrHuTp3qPQoMw9FBym+VPzsAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c69f143b84dac8469ed79ca896101c343937602e2da37b54f59641f4f9c4056c","last_reissued_at":"2026-05-17T23:38:47.115752Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.115752Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Are NLP Models really able to Solve Simple Math Word Problems?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"NLP solvers for simple math word problems achieve high benchmark scores by exploiting shallow patterns instead of actual reasoning.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Arkil Patel, Navin Goyal, Satwik Bhattamishra","submitted_at":"2021-03-12T10:23:47Z","abstract_excerpt":"The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered \"solved\" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performanc"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the carefully chosen variations used to create SVAMP are sufficient to block all shallow heuristics while still testing the intended arithmetic reasoning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"NLP models for elementary math word problems rely on shallow heuristics rather than genuine understanding, performing well without questions or as bag-of-words but dropping substantially on the new SVAMP variation dataset.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"NLP solvers for simple math word problems achieve high benchmark scores by exploiting shallow patterns instead of actual reasoning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"88643d083ba5b2aa8260df09b6d37f9a40b8e50c0e30100e3091326dd646f777"},"source":{"id":"2103.07191","kind":"arxiv","version":2},"verdict":{"id":"60bd8bfe-824d-4207-bfb1-52a7d9f03f46","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T17:26:35.756651Z","strongest_claim":"MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP.","one_line_summary":"NLP models for elementary math word problems rely on shallow heuristics rather than genuine understanding, performing well without questions or as bag-of-words but dropping substantially on the new SVAMP variation dataset.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the carefully chosen variations used to create SVAMP are sufficient to block all shallow heuristics while still testing the intended arithmetic reasoning.","pith_extraction_headline":"NLP solvers for simple math word problems achieve high benchmark scores by exploiting shallow patterns instead of actual reasoning."},"references":{"count":12,"sample":[{"doi":"","year":2018,"title":"Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A","work_id":"94e45806-43ab-42d2-a622-ba654423f9cb","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"In Proceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, On- line","work_id":"03bc4244-11f6-46ce-846f-67799b8784dc","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"In Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 3702–3710, Online","work_id":"c7adf57e-989f-47fa-bdd4-c2556f46654a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"IEEE Transac- tions on Pattern Analysis and Machine Intelligence , 42(9):2287–2305","work_id":"a3266838-e9a2-466d-ba59-76b2d13439fc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"B Implementation Details We use 8 NVIDIA Tesla P100 GPUs each with 16 GB memory to run our experiments","work_id":"6167010d-9c6b-43c9-8e30-20d05cb93cfd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":12,"snapshot_sha256":"bac01043592fe10ecc8f06eb5051cfc959091af85eb122088c5d7771a87da948","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"26efa7f76fb43cb2be17d454d0d5e33c052606cf54467f76be5bbd2950da5f53"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2103.07191","created_at":"2026-05-17T23:38:47.115831+00:00"},{"alias_kind":"arxiv_version","alias_value":"2103.07191v2","created_at":"2026-05-17T23:38:47.115831+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2103.07191","created_at":"2026-05-17T23:38:47.115831+00:00"},{"alias_kind":"pith_short_12","alias_value":"Y2PRIO4E3LEE","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"Y2PRIO4E3LEENHWX","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"Y2PRIO4E","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":40,"internal_anchor_count":40,"sample":[{"citing_arxiv_id":"2410.13181","citing_title":"AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2411.01141","citing_title":"Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2510.07962","citing_title":"LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2305.02301","citing_title":"Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21350","citing_title":"Factored Causal Representation Learning for Robust Reward Modeling in RLHF","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08324","citing_title":"Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15706","citing_title":"Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2309.17452","citing_title":"ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2502.10248","citing_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","ref_index":260,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18629","citing_title":"HyperAdapt: Simple High-Rank Adaptation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2510.06965","citing_title":"EDUMATH: Generating Standards-aligned Educational Math Word Problems","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00739","citing_title":"Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05653","citing_title":"MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2502.21074","citing_title":"CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14846","citing_title":"Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21285","citing_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2512.02764","citing_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2512.18857","citing_title":"CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2303.09014","citing_title":"ART: Automatic multi-step reasoning and tool-use for large language models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03559","citing_title":"DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03682","citing_title":"From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2210.03493","citing_title":"Automatic Chain of Thought Prompting in Large Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06192","citing_title":"The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16105","citing_title":"Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2408.08435","citing_title":"Automated Design of Agentic Systems","ref_index":193,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ","json":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ.json","graph_json":"https://pith.science/api/pith-number/Y2PRIO4E3LEENHWXTSUJMEA4GQ/graph.json","events_json":"https://pith.science/api/pith-number/Y2PRIO4E3LEENHWXTSUJMEA4GQ/events.json","paper":"https://pith.science/paper/Y2PRIO4E"},"agent_actions":{"view_html":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ","download_json":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ.json","view_paper":"https://pith.science/paper/Y2PRIO4E","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2103.07191&json=true","fetch_graph":"https://pith.science/api/pith-number/Y2PRIO4E3LEENHWXTSUJMEA4GQ/graph.json","fetch_events":"https://pith.science/api/pith-number/Y2PRIO4E3LEENHWXTSUJMEA4GQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ/action/storage_attestation","attest_author":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ/action/author_attestation","sign_citation":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ/action/citation_signature","submit_replication":"https://pith.science/pith/Y2PRIO4E3LEENHWXTSUJMEA4GQ/action/replication_record"}},"created_at":"2026-05-17T23:38:47.115831+00:00","updated_at":"2026-05-17T23:38:47.115831+00:00"}