MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

2026 · cs.CL · arXiv 2601.21225

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that a model's robustness in the high-resource language (HRL) setting does not necessarily translate to low-resource languages (LRL). Moreover, proprietary models such as Gemini 2.5 Flash and GPT-4.1 are less robust to digit variation, whereas Gemini 3.0 Pro is more robust. Among open models, GPT-OSS 120B and DeepSeek v3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
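The recommendation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the template, the name list, the digit ranges, and the function names (`instantiate`, `make_instantiations`, `accuracy_per_instantiation`, `summarize`) are all hypothetical, chosen only to show how one question template can yield five name- and digit-varying instantiations, and how accuracy can be reported per instantiation rather than on a single fixed test set.

```python
import random
import statistics

# Hypothetical template with placeholders for a name and two digits.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)
# Hypothetical name pool; a real dataset would use language-appropriate names.
NAMES = ["Amina", "Kofi", "Lena", "Mateo", "Yuki"]


def instantiate(template, rng):
    """Fill the template with a random name and digits; recompute the answer."""
    a = rng.randint(2, 99)
    b = rng.randint(2, 99)
    name = rng.choice(NAMES)
    return {"question": template.format(name=name, a=a, b=b), "answer": a + b}


def make_instantiations(template, k=5, seed=0):
    """Create k instantiations of one question (k=5 follows the abstract)."""
    rng = random.Random(seed)
    return [instantiate(template, rng) for _ in range(k)]


def accuracy_per_instantiation(problems, solver):
    """problems[i][j] is the j-th instantiation of question i.

    Returns one accuracy value per instantiation index j.
    """
    k = len(problems[0])
    return [
        sum(solver(p[j]["question"]) == p[j]["answer"] for p in problems)
        / len(problems)
        for j in range(k)
    ]


def summarize(accuracies):
    """Report the spread across instantiations, not just a single score."""
    return {
        "mean": statistics.mean(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
        "stdev": statistics.pstdev(accuracies),
    }
```

Here `solver` stands in for any callable that maps a question string to a numeric answer, such as a wrapper around a model call. A large gap between `min` and `max` in the summary would indicate the kind of sensitivity to digit changes that the abstract describes.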

representative citing papers

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

ReverseMath uses answer inversion to generate paired original and reversed math problems with known answers for detecting memorization and improving LLM reasoning via data augmentation.

citing papers explorer

Showing 1 of 1 citing paper.

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

cs.CL · 2026-05-26 · unverdicted · none · ref 28 · internal anchor
