
arxiv: 2601.21225 · v2 · submitted 2026-01-29 · 💻 cs.CL · cs.AI

MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Pith reviewed 2026-05-16 10:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords MGSM-Pro · multilingual math reasoning · GSM-Symbolic · low-resource languages · digit variation · evaluation robustness · LLM benchmarks · performance stability

The pith

Varying digits in the same math problem causes large performance drops for low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the MGSM benchmark into MGSM-Pro by generating five versions of each question through changes to names, digits, and irrelevant context. Evaluations across nine languages show that many low-resource languages suffer sharp accuracy drops when the digits differ from those in the original test set, and that robustness in high-resource languages does not necessarily transfer to low-resource ones. Among proprietary models, Gemini 2.5 Flash and GPT-4.1 are less stable under digit changes, whereas Gemini 3.0 Pro holds up better; among open models, GPT-OSS 120B and DeepSeek v3 show stronger robustness. The authors conclude that single-version testing overstates actual reasoning ability and recommend at least five digit-varying instantiations per problem for reliable results.

Core claim

MGSM-Pro applies the GSM-Symbolic method to the MGSM dataset to produce five instantiations per question by varying names, digits, and irrelevant context. Across nine languages, many low-resource languages show large performance drops on digit-different versions compared with the original test set. Robustness observed in high-resource language settings does not necessarily translate to low-resource languages, with Gemini 3.0 Pro displaying greater stability to digit changes than other proprietary models and GPT-OSS 120B and DeepSeek v3 showing stronger robustness among open models.

What carries the argument

MGSM-Pro dataset of five GSM-Symbolic instantiations per original MGSM question, created by varying names, digits, and context to expose evaluation instability.
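
As a rough illustration of what "five instantiations per question" can look like in practice, here is a minimal sketch of template instantiation. Everything in it is an assumption made for illustration: the placeholder syntax, the toy problem, the name list, the value ranges, and the placement of the irrelevant-context (IC) sentence before the final question. It is not the paper's actual template format or generation code.

```python
import random

# Hypothetical template pieces; the paper's real templates are not shown here.
BODY = "{name} has {n1} apples and buys {n2} more."
QUESTION = "How many apples does {name} have now?"
IC_SENTENCE = "{name}'s neighbour owns {k} bicycles."  # irrelevant context


def instantiate(rng, names, with_ic=False):
    """Fill the template with a sampled name and digits; return (question, answer)."""
    values = {
        "name": rng.choice(names),
        "n1": rng.randint(2, 99),
        "n2": rng.randint(2, 99),
    }
    parts = [BODY.format(**values)]
    if with_ic:
        # The IC sentence reuses the name but must not affect the answer.
        parts.append(IC_SENTENCE.format(name=values["name"], k=rng.randint(2, 9)))
    parts.append(QUESTION.format(**values))
    # The gold answer is recomputed from the sampled digits.
    return " ".join(parts), values["n1"] + values["n2"]


def make_instantiations(seed, names, count=5, with_ic=False):
    """Produce `count` instantiations of the same underlying problem."""
    rng = random.Random(seed)  # seeded so the set is reproducible
    return [instantiate(rng, names, with_ic) for _ in range(count)]


if __name__ == "__main__":
    for question, gold in make_instantiations(0, ["Amina", "Kofi", "Mei"], with_ic=True):
        print(gold, "|", question)
```

The point of the sketch is that the reasoning chain stays fixed while surface values change, so any accuracy difference between instantiations reflects sensitivity to those surface values.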

If this is right

  • Low-resource language performance is more sensitive to numerical changes than high-resource language performance.
  • Model robustness in high-resource languages does not reliably predict performance under digit variation in low-resource languages.
  • Single-instance evaluations overestimate multilingual mathematical reasoning capabilities.
  • Using at least five digit-varying instantiations per problem yields a more realistic assessment.
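
The implications above can be made concrete with a small scoring sketch that compares accuracy on the original set with accuracy averaged over several instantiations. The function names, the input layout (one list of per-problem correctness flags per instantiation), and the definition of relative drop as (original − mean of variants) / original are assumptions for illustration; the paper's exact metric may differ. The data below is made up.

```python
from statistics import mean, stdev


def accuracy(correct_flags):
    """Fraction of problems answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def robustness_summary(original_flags, variant_flags):
    """Compare original-set accuracy with accuracy across instantiations.

    original_flags: list of bools, one per problem, on the original test set.
    variant_flags: list of lists of bools, one inner list per instantiation.
    """
    original_acc = accuracy(original_flags)
    variant_accs = [accuracy(flags) for flags in variant_flags]
    variant_mean = mean(variant_accs)
    return {
        "original": original_acc,
        "variant_mean": variant_mean,
        # Spread across instantiations; zero if only one was supplied.
        "variant_std": stdev(variant_accs) if len(variant_accs) > 1 else 0.0,
        # Assumed definition of relative decrease from the original accuracy.
        "relative_drop": (original_acc - variant_mean) / original_acc
        if original_acc > 0
        else 0.0,
    }


if __name__ == "__main__":
    # Toy data: 10 problems, 5 digit-varying instantiations.
    original = [True] * 8 + [False] * 2
    variants = [
        [True] * 6 + [False] * 4,
        [True] * 7 + [False] * 3,
        [True] * 5 + [False] * 5,
        [True] * 6 + [False] * 4,
        [True] * 6 + [False] * 4,
    ]
    print(robustness_summary(original, variants))
```

Reporting the spread alongside the mean shows whether a single instantiation could have given a misleadingly high or low score.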

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may be matching memorized number patterns rather than applying general reasoning, especially in lower-resource settings.
  • Benchmark design for all languages should default to multiple numerical variations to avoid overestimating progress.
  • Real-world multilingual math applications may require explicit testing against number format changes.

Load-bearing premise

The specific variations in names, digits, and irrelevant context are enough to capture the main sources of instability in multilingual math reasoning.

What would settle it

If models maintain consistent accuracy across five or more digit-different versions of the same problems in low-resource languages with no large drops, that would falsify the claim that single-version testing is insufficient.

Figures

Figures reproduced from arXiv: 2601.21225 by Alfred Malengo Kondoro, Ayodele Awokoya, Catherine Nana Nyaah Essuman, David Ifeoluwa Adelani, Ganiyat Afolabi, Ifeoma Okoh, Kosei Uemura, Tadesse Destaw Belay, Tianyi Xu.

Figure 1. Relative decrease in accuracy from the origi…
Figure 2. Workflow diagram illustrating the template…
Figure 3. Figure 3 shows the effect of scaling of model sizes and robustness to change in names and numbers (IC_N#). There is no clear pattern across different model architectures. For Gemma family of models, the drop in performance gets worse as the model parameters increases from 4B, 12B and 27B. However, for GPT-OSS, we have the opposite trend where bigger model size is more robust to the performance drop (see Appendix …
Figure 4. Example of a MGSM-Pro question template alongside its IC sentence template. Each question in the MGSM-Pro dataset is paired with a corresponding IC sentence template. The curation of IC sentence template follows the methodology of (Shi et al., 2023), where we ensure that the irrelevant sentences have: 1) some related connection with the problem and 2) uses names found in the question. An examplar is shown i…
Figure 5. Figure 5 shows the relationship between model size and robustness to name and number variations within the GPT-OSS family. Unlike the trends observed in the Gemma 3 family, larger GPT-OSS models show better robustness to performance drops. The contradictory findings across different model families suggests that simply increasing model scale does not automatically improve robustness; instead, other factors like …
Original abstract

Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed a strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of MGSM dataset with GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that models robustness in HRL setting do not necessarily translate to LRL. Moreover, proprietary models, such as Gemini 2.5 Flash and GPT-4.1 are less robust to digit, whereas Gemini 3.0 Pro is more robust. Among open models, GPT-OSS 120B and DeepSeek v3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MGSM-Pro, an extension of the MGSM dataset that applies the GSM-Symbolic method to generate five instantiations per original problem by varying names, digits, and irrelevant context. Evaluations across nine languages show three things: large performance drops for low-resource languages on digit-different instantiations; robustness in high-resource languages that does not reliably transfer to low-resource settings; and differences in robustness among proprietary and open models (e.g., Gemini 3.0 Pro is more robust than Gemini 2.5 Flash). The authors recommend evaluating each problem with at least five digit-varying instantiations for more robust assessment.

Significance. If the empirical observations hold, the work usefully demonstrates that single-instantiation multilingual math benchmarks can overestimate model capabilities, especially in low-resource languages, and supplies concrete model robustness rankings that can inform evaluation protocols. The simple extension of an existing dataset makes the contribution accessible, though its scope is limited to GSM-Symbolic-style perturbations.

major comments (3)
  1. §3 (Dataset Construction): the exact procedure for generating and validating the five instantiations per MGSM problem in each of the nine languages is not described in sufficient detail, including how names are chosen for cultural/linguistic naturalness and how digits are varied in non-Latin scripts; this directly affects reproducibility and the ability to assess whether the reported drops are method-specific.
  2. §4 (Experiments and Results): no statistical tests, confidence intervals, or variance measures are reported for the performance drops between original and varied instantiations, so it is unclear whether the large drops in low-resource languages are statistically reliable or driven by a small number of problems.
  3. §5 (Discussion and Recommendation): the recommendation to use at least five digit-varying instantiations rests on the untested assumption that GSM-Symbolic variations capture the primary sources of multilingual instability; the paper provides no ablation against alternative variation methods (e.g., human rephrasing or tokenization-aware changes) that could produce different variance patterns.
minor comments (3)
  1. The full list of the nine languages and their high/low-resource classification should be stated explicitly in the introduction rather than only in later tables.
  2. [Tables 2-4] Table captions and axis labels in the result figures would benefit from clearer indication of whether scores are averaged across the five instantiations or shown per instantiation.
  3. A small number of model-name inconsistencies appear (e.g., GPT-4.1 vs. GPT-OSS 120B); standardize formatting throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have updated the manuscript to improve clarity, add statistical rigor, and better scope the recommendations.

Point-by-point responses
  1. Referee: §3 (Dataset Construction): the exact procedure for generating and validating the five instantiations per MGSM problem in each of the nine languages is not described in sufficient detail, including how names are chosen for cultural/linguistic naturalness and how digits are varied in non-Latin scripts; this directly affects reproducibility and the ability to assess whether the reported drops are method-specific.

    Authors: We agree that additional detail is needed for reproducibility. In the revised manuscript, Section 3 now includes a step-by-step description of the template adaptation from GSM-Symbolic, the source lists used for culturally appropriate names per language (drawn from public corpora and verified by native speakers), and the digit-variation procedure that preserves numerical semantics across scripts (e.g., mapping Arabic-Indic digits while keeping equation structure intact). We also added a validation subsection describing the native-speaker review process for a random sample of 50 problems per language. revision: yes

  2. Referee: §4 (Experiments and Results): no statistical tests, confidence intervals, or variance measures are reported for the performance drops between original and varied instantiations, so it is unclear whether the large drops in low-resource languages are statistically reliable or driven by a small number of problems.

    Authors: We accept this criticism. The revised Section 4 now reports bootstrap confidence intervals (1,000 resamples) around all accuracy figures, per-language standard deviations across the five instantiations, and paired t-test p-values comparing original vs. varied sets. These additions confirm that the drops observed in low-resource languages remain statistically significant (p < 0.01) and are not driven by outliers. revision: yes

  3. Referee: [§5] §5 (Discussion and Recommendation): the recommendation to use at least five digit-varying instantiations rests on the untested assumption that GSM-Symbolic variations capture the primary sources of multilingual instability; the paper provides no ablation against alternative variation methods (e.g., human rephrasing or tokenization-aware changes) that could produce different variance patterns.

    Authors: We acknowledge the limitation. The revised discussion explicitly states that the five-instantiation recommendation is tied to the GSM-Symbolic perturbation style we adopted and does not claim it exhausts all sources of instability. We note that exploring human rephrasing or tokenization-specific ablations would require new data collection outside the current scope and is left for future work. The practical recommendation remains useful for the simple, reproducible extension presented. revision: partial
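
The simulated response to point 2 mentions bootstrap confidence intervals with 1,000 resamples. For readers unfamiliar with the technique, here is a minimal sketch of a percentile bootstrap for the accuracy drop between the original and a varied set, resampling problems with replacement and keeping each problem's two outcomes paired. The function name, the 0/1 input encoding, the pairing by problem, and the toy data are all assumptions for illustration; the source does not specify how the intervals would be computed.

```python
import random


def bootstrap_drop_ci(original, varied, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(original) - mean(varied).

    original, varied: equal-length lists of 0/1 outcomes, where index i
    refers to the same underlying problem in both lists (paired design).
    """
    if len(original) != len(varied):
        raise ValueError("original and varied must cover the same problems")
    rng = random.Random(seed)
    n = len(original)
    drops = []
    for _ in range(resamples):
        # Resample problem indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(sum(original[i] - varied[i] for i in idx) / n)
    drops.sort()
    lower = drops[int((alpha / 2) * resamples)]
    upper = drops[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lower, upper


if __name__ == "__main__":
    # Toy data: 225 problems, 180 correct originally, 135 correct after variation.
    original = [1] * 180 + [0] * 45
    varied = [1] * 135 + [0] * 90
    print(bootstrap_drop_ci(original, varied))
```

If the resulting interval excludes zero, the drop is unlikely to be explained by which problems happened to be sampled, which is the concern the referee raises.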

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper introduces MGSM-Pro by applying the existing GSM-Symbolic variation method (names, digits, irrelevant context) to the MGSM dataset to produce five instantiations per question, then directly measures model accuracy across nine languages. No derivations, equations, fitted parameters, or predictions are present; all findings consist of observed performance drops on external model outputs. The recommendation for five instantiations follows from these measurements rather than reducing to any self-referential input by construction. This is standard empirical benchmark work with no load-bearing self-citation chains or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters are fitted, no mathematical axioms are invoked, and no new entities are postulated. The work relies on standard NLP evaluation practices and the prior MGSM and GSM-Symbolic datasets.


