{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:3YDXEUQSWA63WKHNHAQJUHOC3G","short_pith_number":"pith:3YDXEUQS","schema_version":"1.0","canonical_sha256":"de07725212b03dbb28ed38209a1dc2d9ad3b9bc9050282169cf2c3b6cf22949e","source":{"kind":"arxiv","id":"2210.03057","version":1},"attestation_state":"computed","paper":{"title":"Language Models are Multilingual Chain-of-Thought Reasoners","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large language models gain step-by-step reasoning ability across many languages as they scale up.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Denny Zhou, Dipanjan Das, Freda Shi, Hyung Won Chung, Jason Wei, Markus Freitag, Mirac Suzgun, Sebastian Ruder, Soroush Vosoughi, Suraj Srivats, Xuezhi Wang, Yi Tay","submitted_at":"2022-10-06T17:03:34Z","abstract_excerpt":"We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abil"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2210.03057","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2022-10-06T17:03:34Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"781e2d2fc0372114d9092ec058b5b0795ebb6bd530c29b7bd58c4064d348483d","abstract_canon_sha256":"992c71f9089ceabc44e1ab244d0fc7bd5f3d12c95c1151a0918fb0675d2fa896"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.174218Z","signature_b64":"l3PQ/aL81/CZ5IvFSLrpvIJ4RfCYxmQq2mFwmeH1AQdBnBzEYy5EvKzmtgcIXdW2ovMeTI7D4+n/aVx1ulRwDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"de07725212b03dbb28ed38209a1dc2d9ad3b9bc9050282169cf2c3b6cf22949e","last_reissued_at":"2026-05-17T23:38:50.173722Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.173722Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Language Models are Multilingual Chain-of-Thought Reasoners","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large language models gain step-by-step reasoning ability across many languages as they scale up.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Denny Zhou, Dipanjan Das, Freda Shi, Hyung Won Chung, Jason Wei, Markus Freitag, Mirac Suzgun, Sebastian Ruder, Soroush Vosoughi, Suraj Srivats, Xuezhi Wang, Yi Tay","submitted_at":"2022-10-06T17:03:34Z","abstract_excerpt":"We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abil"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The manual translations preserve the original semantic meaning, logical structure, and difficulty level of the problems without introducing translation artifacts that would make the task easier or harder in non-English languages.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Large language models show strong chain-of-thought reasoning on math problems across ten languages, with abilities emerging at larger scales and extending to other reasoning tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models gain step-by-step reasoning ability across many languages as they scale up.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4ea2e0772861ba44e2f429c2c5eb49075ee60d0ff4e8aba6886d4b8fb14e8d76"},"source":{"id":"2210.03057","kind":"arxiv","version":1},"verdict":{"id":"fe781533-2a31-46f0-947a-c93b09b9a8c9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T21:02:25.175118Z","strongest_claim":"the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili","one_line_summary":"Large language models show strong chain-of-thought reasoning on math problems across ten languages, with abilities emerging at larger scales and extending to other reasoning tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The manual translations preserve the original semantic meaning, logical structure, and difficulty level of the problems without introducing translation artifacts that would make the task easier or harder in non-English languages.","pith_extraction_headline":"Large language models gain step-by-step reasoning ability across many languages as they scale up."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"715a8b5436ec6450059d21727722178f6161cc06ae5fe91f5467fb17fbeb70f3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2210.03057","created_at":"2026-05-17T23:38:50.173801+00:00"},{"alias_kind":"arxiv_version","alias_value":"2210.03057v1","created_at":"2026-05-17T23:38:50.173801+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2210.03057","created_at":"2026-05-17T23:38:50.173801+00:00"},{"alias_kind":"pith_short_12","alias_value":"3YDXEUQSWA63","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"3YDXEUQSWA63WKHN","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"3YDXEUQS","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2603.06610","citing_title":"CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23019","citing_title":"PACE: Two-Timescale Self-Evolution for Small Language Model Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2309.10305","citing_title":"Baichuan 2: Open Large-scale Language Models","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07731","citing_title":"Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15031","citing_title":"Attention Residuals","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2511.01831","citing_title":"Routing-Based Continual Learning for Multimodal Large Language Models","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22972","citing_title":"Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2504.21318","citing_title":"Phi-4-reasoning Technical Report","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2601.06767","citing_title":"GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2601.13262","citing_title":"CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21225","citing_title":"MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2504.19678","citing_title":"From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":143,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16392","citing_title":"RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10774","citing_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12227","citing_title":"Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2305.10403","citing_title":"PaLM 2 Technical Report","ref_index":138,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20090","citing_title":"Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20720","citing_title":"COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13286","citing_title":"English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2206.07682","citing_title":"Emergent Abilities of Large Language Models","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2210.09261","citing_title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08881","citing_title":"Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07766","citing_title":"Sensitivity-Positional Co-Localization in GQA Transformers","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02971","citing_title":"Multilingual Safety Alignment via Self-Distillation","ref_index":48,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G","json":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G.json","graph_json":"https://pith.science/api/pith-number/3YDXEUQSWA63WKHNHAQJUHOC3G/graph.json","events_json":"https://pith.science/api/pith-number/3YDXEUQSWA63WKHNHAQJUHOC3G/events.json","paper":"https://pith.science/paper/3YDXEUQS"},"agent_actions":{"view_html":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G","download_json":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G.json","view_paper":"https://pith.science/paper/3YDXEUQS","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2210.03057&json=true","fetch_graph":"https://pith.science/api/pith-number/3YDXEUQSWA63WKHNHAQJUHOC3G/graph.json","fetch_events":"https://pith.science/api/pith-number/3YDXEUQSWA63WKHNHAQJUHOC3G/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G/action/storage_attestation","attest_author":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G/action/author_attestation","sign_citation":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G/action/citation_signature","submit_replication":"https://pith.science/pith/3YDXEUQSWA63WKHNHAQJUHOC3G/action/replication_record"}},"created_at":"2026-05-17T23:38:50.173801+00:00","updated_at":"2026-05-17T23:38:50.173801+00:00"}