Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
Pith reviewed 2026-05-15 22:01 UTC · model grok-4.3
The pith
Large language models perform basic arithmetic well across languages but show marked drops in complex reasoning for Sinhala and Tamil.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages.
What carries the argument
A parallel dataset of math problems independently authored in Sinhala, Tamil, and English by native speakers with mathematical expertise, paired with a taxonomy of six problem types ranging from basic arithmetic to optimization.
If this is right
- Basic arithmetic problems can be reliably solved by LLMs in Sinhala and Tamil.
- Complex math problems require language-specific evaluation before LLM use in education.
- Performance gaps differ by model, so selection matters for multilingual settings.
- AI tutoring tools need language-specific benchmarks to ensure equity in low-resource contexts.
Where Pith is reading between the lines
- Similar patterns may hold for other low-resource languages with limited training data.
- Targeted fine-tuning on native-language math problems could mitigate the degradation.
- This underscores the importance of diverse data sources beyond English-centric corpora for reasoning capabilities.
Load-bearing premise
The problems independently authored in each language have equivalent difficulty and mathematical complexity.
What would settle it
If a new set of problems rated for equivalent difficulty by experts shows no significant performance difference across languages, this would falsify the degradation finding.
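One way such a check could be run is sketched below. Because the dataset is parallel, each problem yields a matched pair of outcomes (one per language), so a paired test is a natural fit. The sketch uses an exact McNemar test on the discordant pairs. The function name `mcnemar_exact` and the boolean-list input format are assumptions made for illustration, and nothing here is taken from the paper's own analysis.

```python
from math import comb


def mcnemar_exact(ref_correct, other_correct):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    ref_correct[i] and other_correct[i] are booleans for the same problem
    solved in the reference language and in the comparison language.
    Returns (b, c, p_value), where b counts problems solved only in the
    reference language and c counts problems solved only in the other one.
    """
    if len(ref_correct) != len(other_correct):
        raise ValueError("outcome lists must have the same length")

    pairs = list(zip(ref_correct, other_correct))
    b = sum(1 for ref, other in pairs if ref and not other)
    c = sum(1 for ref, other in pairs if other and not ref)
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0

    # Under the null hypothesis, each discordant pair is a fair coin flip.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A large p-value on a difficulty-matched problem set would be consistent with "no significant difference", although it does not prove equivalence. With several models and problem types, a multiple-comparison correction would also be needed.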
Original abstract
Large language models (LLMs) have achieved strong results in mathematical reasoning, and are increasingly deployed as tutoring and learning support tools in educational settings. However, their reliability for students working in non-English languages, especially low-resource languages, remains poorly understood. We examine this gap by evaluating mathematical reasoning in Sinhala and Tamil -- two languages widely used in South Asian schools but underrepresented in artificial intelligence (AI) research. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset in which each problem is independently authored in Sinhala and Tamil by native speakers, and in English by fluent speakers, all with strong mathematical backgrounds. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages. These findings have direct implications for the deployment of AI tools in multilingual classrooms, and highlight the need for language-specific evaluation before adopting large language models as math tutoring aids in non-English educational contexts.
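The comparison described in the abstract amounts to an accuracy table indexed by language and problem type, plus the gap of each language against English. A minimal sketch of that bookkeeping follows. The record format `(language, problem_type, is_correct)` and both function names are hypothetical, and the snippet does not reproduce the paper's scoring procedure.

```python
from collections import defaultdict


def accuracy_by_cell(records):
    """Return accuracy for every (language, problem_type) cell.

    records is an iterable of (language, problem_type, is_correct) tuples,
    one per model answer. This layout is an assumption for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for language, problem_type, is_correct in records:
        cell = counts[(language, problem_type)]
        cell[0] += int(bool(is_correct))
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def gap_vs_reference(accuracy, reference="English"):
    """Return reference accuracy minus each other language's accuracy.

    Positive values mean the model did worse than in the reference language
    on that problem type. Cells with no reference counterpart are skipped.
    """
    gaps = {}
    for (language, problem_type), value in accuracy.items():
        if language == reference:
            continue
        ref_value = accuracy.get((reference, problem_type))
        if ref_value is not None:
            gaps[(language, problem_type)] = ref_value - value
    return gaps
```

Keeping one table per model makes it easy to see whether a gap is specific to a model, to a problem type, or to both, which is the pattern the abstract reports.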
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four prominent LLMs on mathematical reasoning across English, Sinhala, and Tamil using a taxonomy of six problem types (basic arithmetic through complex unit-conflict and optimization tasks). It constructs a parallel dataset in which each problem is independently authored by native-speaker experts with strong mathematical backgrounds, avoiding translation artifacts. The central claim is that basic arithmetic reasoning transfers robustly across languages while complex reasoning exhibits significant degradation in Tamil and Sinhala, with failure patterns varying by model and problem type.
Significance. If the empirical patterns hold after validation of the dataset, the work would be significant for AI-assisted education: it supplies direct evidence that English-centric LLM performance does not guarantee reliability in low-resource languages, with concrete implications for deploying tutoring tools in South Asian classrooms. The independently authored parallel dataset is a methodological strength that sidesteps translation-quality confounds.
major comments (1)
- §3 (Dataset Construction): The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the value of our independently authored parallel dataset. We address the concern regarding validation of problem equivalence below.
Point-by-point responses
Referee: The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.
Authors: We agree that the original manuscript lacks explicit quantitative checks for cross-language equivalence of mathematical difficulty. Problems were independently authored by native-speaker experts with strong mathematical backgrounds to match the same concepts and intended difficulty levels, but we did not report supporting metrics. In the revised version we will add: (1) average statement lengths (in words and tokens) for each language version of every problem type; (2) counts of key mathematical concepts per problem; and (3) inter-author difficulty ratings collected during dataset creation. Pilot accuracy rates were not collected prior to the main study, so we cannot retroactively supply them; however, the added metrics will allow readers to assess whether systematic formulation differences could explain the observed gaps.
Revision: yes
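The first promised metric, average statement length per language and problem type, could be computed along the following lines. This is a sketch under stated assumptions: the input layout `(language, problem_type, statement_text)` and the function name `mean_word_length` are hypothetical, and whitespace splitting stands in for a word count. Token counts would additionally require the tokenizer of each evaluated model, which is not shown here.

```python
from collections import defaultdict
from statistics import mean


def mean_word_length(problems):
    """Average whitespace-delimited word count per (language, problem_type).

    problems is an iterable of (language, problem_type, statement_text).
    Whitespace splitting is a rough proxy for word count; it is an
    assumption of this sketch, not the authors' stated procedure.
    """
    lengths = defaultdict(list)
    for language, problem_type, text in problems:
        lengths[(language, problem_type)].append(len(text.split()))
    return {key: mean(values) for key, values in lengths.items()}
```

Comparing these averages across the three language versions of each problem type gives a first, coarse signal of whether formulation differences could be contributing to the observed gaps.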
Circularity Check
No circularity: purely empirical evaluation study
This paper is a standard empirical comparison: it constructs a parallel dataset of independently authored math problems in English, Sinhala, and Tamil, then measures LLM accuracy across six problem types. No equations, fitted parameters, or predictions appear anywhere in the provided text. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claim (basic arithmetic transfers while complex reasoning degrades) is supported solely by direct performance deltas on the new dataset; it does not reduce to any prior quantity by construction. Per the hard rules, absence of any load-bearing derivation or self-referential reduction yields score 0.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "We develop a taxonomy of six math word problem types... construct a parallel dataset... test four leading LLMs using zero-shot prompting"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.