MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
Pith reviewed 2026-05-16 10:18 UTC · model grok-4.3
The pith
Varying digits in the same math problem causes large performance drops for low-resource languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MGSM-Pro applies the GSM-Symbolic method to the MGSM dataset to produce five instantiations per question by varying names, digits, and irrelevant context. Across nine languages, many low-resource languages show large performance drops on digit-different versions compared with the original test set. Robustness observed in high-resource language settings does not necessarily translate to low-resource languages. Among proprietary models, Gemini 3.0 Pro is more stable under digit changes than Gemini 2.5 Flash and GPT-4.1; among open models, GPT-OSS 120B and DeepSeek v3 show stronger robustness.
What carries the argument
MGSM-Pro dataset of five GSM-Symbolic instantiations per original MGSM question, created by varying names, digits, and context to expose evaluation instability.
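A minimal sketch of how five instantiations might be drawn from one annotated template. It assumes a `{type, value}` placeholder style like the one quoted in the annotation guide further down; the template text, the name list, the value range, and the helper names `instantiate` and `make_instantiations` are hypothetical and are not taken from the paper's code.

```python
import random
import re

# Hypothetical template in an assumed "{type, value}" style.
TEMPLATE = (
    "{name, James} buys {x, 3} apples and {y, 4} pears. "
    "How many pieces of fruit does {name, James} buy in total?"
)

# Matches "{variable, original value}" and captures the variable name.
PLACEHOLDER = re.compile(r"\{(\w+),\s*[^{}]+\}")


def instantiate(template, names, low, high, rng):
    """Fill every placeholder; a repeated variable keeps the same value."""
    chosen = {}

    def fill(match):
        var = match.group(1)
        if var not in chosen:
            # Assumption: "name" is the only non-numeric variable type here.
            chosen[var] = rng.choice(names) if var == "name" else rng.randint(low, high)
        return str(chosen[var])

    return PLACEHOLDER.sub(fill, template), chosen


def make_instantiations(template, answer_fn, names, k=5, seed=0):
    """Draw k variants and recompute the gold answer for each one."""
    rng = random.Random(seed)
    variants = []
    for _ in range(k):
        question, values = instantiate(template, names, 2, 20, rng)
        variants.append({"question": question, "answer": answer_fn(values)})
    return variants


variants = make_instantiations(
    TEMPLATE,
    answer_fn=lambda v: v["x"] + v["y"],  # the problem's logic stays fixed
    names=["Amina", "Jacques", "Wei"],    # illustrative names only
)
for v in variants:
    print(v["answer"], "|", v["question"])
```

Real templates would likely need per-problem constraints on the sampled digits (for example, divisibility or non-negative results), which this sketch leaves out.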
If this is right
- Low-resource language performance is more sensitive to numerical changes than high-resource language performance.
- Model robustness in high-resource languages does not reliably predict performance under digit variation in low-resource languages.
- Single-instance evaluations overestimate multilingual mathematical reasoning capabilities.
- Using at least five digit-varying instantiations per problem yields a more realistic assessment.
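To make the last point concrete, here is a small sketch of how accuracy on the original test set could be compared with accuracy over five digit-varying instantiations. The function names and the toy correctness flags are invented for illustration and are not results from the paper.

```python
from statistics import pstdev


def accuracy(flags):
    """Fraction of problems answered correctly (flags are 0/1)."""
    return sum(flags) / len(flags)


def summarize(original, variants):
    """Compare the original set with k instantiations of the same problems.

    original: list of 0/1 flags, one per problem.
    variants: list of k lists of 0/1 flags, one list per instantiation.
    """
    original_acc = accuracy(original)
    variant_accs = [accuracy(v) for v in variants]
    variant_mean = sum(variant_accs) / len(variant_accs)
    return {
        "original": original_acc,
        "variant_mean": variant_mean,
        "variant_std": pstdev(variant_accs),  # spread across instantiations
        "drop": original_acc - variant_mean,  # positive = worse on variants
    }


# Toy flags for four problems; made up, not taken from the paper.
original = [1, 1, 1, 0]
variants = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
summary = summarize(original, variants)
print(summary)
```

Reporting the mean and spread over instantiations, rather than a single score, is what exposes the kind of instability the paper describes.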
Where Pith is reading between the lines
- Models may be matching memorized number patterns rather than applying general reasoning, especially in lower-resource settings.
- Benchmark design for all languages should default to multiple numerical variations to avoid overestimating progress.
- Real-world multilingual math applications may require explicit testing against number format changes.
Load-bearing premise
The specific variations in names, digits, and irrelevant context are enough to capture the main sources of instability in multilingual math reasoning.
What would settle it
If models maintain consistent accuracy across five or more digit-different versions of the same problems in low-resource languages with no large drops, that would falsify the claim that single-version testing is insufficient.
Original abstract
Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that model robustness in the high-resource language (HRL) setting does not necessarily translate to low-resource languages (LRL). Moreover, proprietary models such as Gemini 2.5 Flash and GPT-4.1 are less robust to digit changes, whereas Gemini 3.0 Pro is more robust. Among open models, GPT-OSS 120B and DeepSeek v3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MGSM-Pro, an extension of the MGSM dataset that applies the GSM-Symbolic method to generate five instantiations per original problem by varying names, digits, and irrelevant context. Evaluations across nine languages show that low-resource languages suffer large performance drops on digit-different instantiations, that high-resource robustness does not reliably transfer to low-resource settings, and that robustness differs among proprietary and open models (e.g., Gemini 3.0 Pro is more robust than Gemini 2.5 Flash). The authors recommend evaluating each problem with at least five digit-varying instantiations for more robust assessment.
Significance. If the empirical observations hold, the work usefully demonstrates that single-instantiation multilingual math benchmarks can overestimate model capabilities, especially in low-resource languages, and supplies concrete model robustness rankings that can inform evaluation protocols. The simple extension of an existing dataset makes the contribution accessible, though its scope is limited to GSM-Symbolic-style perturbations.
major comments (3)
- [§3] §3 (Dataset Construction): the exact procedure for generating and validating the five instantiations per MGSM problem in each of the nine languages is not described in sufficient detail, including how names are chosen for cultural/linguistic naturalness and how digits are varied in non-Latin scripts; this directly affects reproducibility and the ability to assess whether the reported drops are method-specific.
- [§4] §4 (Experiments and Results): no statistical tests, confidence intervals, or variance measures are reported for the performance drops between original and varied instantiations, so it is unclear whether the large drops in low-resource languages are statistically reliable or driven by a small number of problems.
- [§5] §5 (Discussion and Recommendation): the recommendation to use at least five digit-varying instantiations rests on the untested assumption that GSM-Symbolic variations capture the primary sources of multilingual instability; the paper provides no ablation against alternative variation methods (e.g., human rephrasing or tokenization-aware changes) that could produce different variance patterns.
minor comments (3)
- The full list of the nine languages and their high/low-resource classification should be stated explicitly in the introduction rather than only in later tables.
- [Tables 2-4] Table captions and axis labels in the result figures would benefit from clearer indication of whether scores are averaged across the five instantiations or shown per instantiation.
- A small number of model-name inconsistencies appear (e.g., GPT-4.1 vs. GPT-OSS 120B); standardize formatting throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have updated the manuscript to improve clarity, add statistical rigor, and better scope the recommendations.
-
Referee: [§3] §3 (Dataset Construction): the exact procedure for generating and validating the five instantiations per MGSM problem in each of the nine languages is not described in sufficient detail, including how names are chosen for cultural/linguistic naturalness and how digits are varied in non-Latin scripts; this directly affects reproducibility and the ability to assess whether the reported drops are method-specific.
Authors: We agree that additional detail is needed for reproducibility. In the revised manuscript, Section 3 now includes a step-by-step description of the template adaptation from GSM-Symbolic, the source lists used for culturally appropriate names per language (drawn from public corpora and verified by native speakers), and the digit-variation procedure that preserves numerical semantics across scripts (e.g., mapping Arabic-Indic digits while keeping equation structure intact). We also added a validation subsection describing the native-speaker review process for a random sample of 50 problems per language. revision: yes
-
Referee: [§4] §4 (Experiments and Results): no statistical tests, confidence intervals, or variance measures are reported for the performance drops between original and varied instantiations, so it is unclear whether the large drops in low-resource languages are statistically reliable or driven by a small number of problems.
Authors: We accept this criticism. The revised Section 4 now reports bootstrap confidence intervals (1,000 resamples) around all accuracy figures, per-language standard deviations across the five instantiations, and paired t-test p-values comparing original vs. varied sets. These additions confirm that the drops observed in low-resource languages remain statistically significant (p < 0.01) and are not driven by outliers. revision: yes
-
Referee: [§5] §5 (Discussion and Recommendation): the recommendation to use at least five digit-varying instantiations rests on the untested assumption that GSM-Symbolic variations capture the primary sources of multilingual instability; the paper provides no ablation against alternative variation methods (e.g., human rephrasing or tokenization-aware changes) that could produce different variance patterns.
Authors: We acknowledge the limitation. The revised discussion explicitly states that the five-instantiation recommendation is tied to the GSM-Symbolic perturbation style we adopted and does not claim it exhausts all sources of instability. We note that exploring human rephrasing or tokenization-specific ablations would require new data collection outside the current scope and is left for future work. The practical recommendation remains useful for the simple, reproducible extension presented. revision: partial
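The response on §4 above mentions bootstrap confidence intervals with 1,000 resamples. A minimal sketch of one way to do this for the accuracy drop follows, assuming a percentile bootstrap that resamples problems in pairs. The rebuttal does not specify the exact procedure, so the function name, the pairing choice, and the synthetic data are assumptions.

```python
import random


def bootstrap_drop_ci(original, varied, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(original) - accuracy(varied).

    original, varied: 0/1 correctness flags for the same problems, in the
    same order. Problems are resampled jointly so the pairing is preserved.
    """
    assert len(original) == len(varied)
    rng = random.Random(seed)
    n = len(original)
    drops = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        drops.append(sum(original[i] - varied[i] for i in idx) / n)
    drops.sort()
    lower = drops[int((alpha / 2) * n_resamples)]
    upper = drops[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic flags for 225 problems (the template count per language);
# the correctness pattern is made up and is not a result from the paper.
original = [1] * 180 + [0] * 45
varied = [1] * 150 + [0] * 75
lo, hi = bootstrap_drop_ci(original, varied)
print(f"observed drop = {30 / 225:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero suggests the drop is not an artifact of which problems happened to be sampled; a paired significance test, as the authors describe, would complement it.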
Circularity Check
No circularity: purely empirical benchmark construction and evaluation
The paper introduces MGSM-Pro by applying the existing GSM-Symbolic variation method (names, digits, irrelevant context) to the MGSM dataset to produce five instantiations per question, then directly measures model accuracy across nine languages. No derivations, equations, fitted parameters, or predictions are present; all findings consist of observed performance drops on external model outputs. The recommendation for five instantiations follows from these measurements rather than reducing to any self-referential input by construction. This is standard empirical benchmark work with no load-bearing self-citation chains or self-definitional steps.
A MGSM-Pro Dataset
A.1 Language Details
The resource levels and language families of the nine languages in MGSM-Pro are shown in Table 3. Each language has 225 question templates out of the 250 MGSM questions.
Language | Code | Language Family | Joshi Class
English | eng_Latn | Indo-European | Class 5
Chinese | zh...
- Tag all placeholders from English in the native sentence
- Names may differ across languages (e.g. James→Jacques); match by position
- Always tag the first word if it's a person name
- Do not reword the native sentence in any way; you should just be inserting the variable names and brackets
Input: {english template}
Native: {native question}
Output:
C Instructions for Annotators
This section provides a brief introduction to the annotation guide for the MGSM-Pro dataset. We categorize the MGSM-Pro annotation process into two main tasks: 1...
providing native names
C.1 Template Correction Annotation
The goal of our study is to create variations of the same problem by changing names and numbers within the math problem while keeping the logic intact. The templates from your annotation process will be used in the future to create a math dataset. For each problem template corre...
- English Template: This is the gold template. You should make sure the native language template is as similar to the English template as possible.
- Original Native Question: This is the original native question in the dataset. You should use this as a reference alongside the English template to judge if the native language template is correct.
- Native Language Template: This is a machine-created native language template. It could very likely contain errors. This is the template that you will judge if it is correct or not.
Below are the five criteria the native language template must achieve in order to be considered correct:
- Native Language Templates will need to contain the original question, i.e. the wording of the native template should not change from the native question; the template should only be adding in the variable names. If this is not the case, you should ignore the Native Language Template and please provide the new annotated templ...
- No missing variable annotation, i.e. all names or digits tagged in the English template are tagged in the native language template. You should add the corresponding {type, value} annotation around the target language word or number.
- No extra annotation, i.e. there are no extra variables annotated in the translation that were not in the English template. You should remove any {} markers around words or numbers that were not annotated in English.
- No incorrect bracket {} span, i.e. the annotated span is not too long or too short. You should adjust the braces so they exactly enclose the intended word or number, matching the English span.
C.2 Native Name Annotation
You will be given eight types of name. You will need to provide 10 names for each type that fit into your nat...
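The "no missing" and "no extra" annotation criteria in C.1 could be pre-checked automatically before a human reviews a template. The sketch below assumes placeholders are written as `{type, value}`; the regular expression, the function names, and the example sentences are hypothetical, and the check does not cover the bracket-span criterion or matching by position.

```python
import re
from collections import Counter

# Assumed placeholder syntax: "{variable, value}".
PLACEHOLDER = re.compile(r"\{\s*(\w+)\s*,[^{}]*\}")


def variable_counts(template):
    """Count how often each variable is annotated in a template."""
    return Counter(m.group(1) for m in PLACEHOLDER.finditer(template))


def check_template(english, native):
    """Report variables missing from, or extra in, the native template."""
    en, nat = variable_counts(english), variable_counts(native)
    missing = en - nat  # annotated in English but not in the native template
    extra = nat - en    # annotated in the native template but not in English
    return {
        "missing": dict(missing),
        "extra": dict(extra),
        "ok": not missing and not extra,
    }


# Illustrative templates; the French sentence leaves the number untagged.
english = "{name, James} has {x, 3} apples."
native = "{name, Jacques} a 3 pommes."
report = check_template(english, native)
print(report)
```

A check like this only flags candidates for correction; the annotator still decides how the native template should be fixed.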