arxiv: 2602.14517 · v3 · submitted 2026-02-16 · 💻 cs.CL · cs.LG

Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil

Pith reviewed 2026-05-15 22:01 UTC · model grok-4.3

keywords: large language models, mathematical reasoning, low-resource languages, Sinhala, Tamil, multilingual AI, math education, model evaluation

The pith

Large language models perform basic arithmetic well across languages but show marked drops in complex reasoning for Sinhala and Tamil.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests four large language models on mathematical problems in English, Sinhala, and Tamil using a purpose-built parallel dataset. Each problem was written independently in each language (by native speakers for Sinhala and Tamil, and by fluent speakers for English) so that the comparison reflects language capability rather than translation quality. The results indicate that simple calculations transfer reliably, yet tasks involving unit conflicts, optimization, or multi-step logic degrade substantially in the two South Asian languages. Because these models are increasingly proposed as classroom aids, the uneven performance raises questions about their readiness for non-English educational use. The variation across models and problem types points to the need for targeted testing in each language before widespread adoption.

Core claim

While basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages.

What carries the argument

A parallel dataset of math problems independently authored in Sinhala and Tamil by native speakers and in English by fluent speakers, all with strong mathematical backgrounds, paired with a taxonomy of six problem types ranging from basic arithmetic to optimization.
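
To make the comparison concrete, the sketch below shows one way per-problem outcomes could be aggregated into an accuracy grid over model, problem type, and language, similar in shape to what Figure 2 reports. The `Result` record, the `accuracy_grid` helper, the model label, and the toy outcomes are all hypothetical; the paper's actual data format and scoring procedure are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical record layout)."""
    model: str         # placeholder model label, not a model from the paper
    problem_type: str  # one of the six types in the paper's taxonomy
    language: str      # ISO 639-1 code: "en", "si" (Sinhala), "ta" (Tamil)
    correct: bool


def accuracy_grid(results):
    """Return {(model, problem_type, language): accuracy in percent}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.model, r.problem_type, r.language)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Made-up outcomes for illustration only; these are not the paper's results.
results = [
    Result("model_a", "basic_arithmetic", "en", True),
    Result("model_a", "basic_arithmetic", "si", True),
    Result("model_a", "unit_conflict", "en", True),
    Result("model_a", "unit_conflict", "si", False),
    Result("model_a", "unit_conflict", "si", True),
]
grid = accuracy_grid(results)
for key in sorted(grid):
    print(key, f"{grid[key]:.1f}%")
```

Keying the grid on all three dimensions keeps the per-cell comparison explicit, so a drop from English to Sinhala or Tamil can be read off for each model and problem type separately.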

If this is right

  • Basic arithmetic problems can be reliably solved by LLMs in Sinhala and Tamil.
  • Complex math problems require language-specific evaluation before LLM use in education.
  • Performance gaps differ by model, so selection matters for multilingual settings.
  • AI tutoring tools need language-specific benchmarks to ensure equity in low-resource contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar patterns may hold for other low-resource languages with limited training data.
  • Targeted fine-tuning on native-language math problems could mitigate the degradation.
  • This underscores the importance of diverse data sources beyond English-centric corpora for reasoning capabilities.

Load-bearing premise

The problems independently authored in each language have equivalent difficulty and mathematical complexity.

What would settle it

If a new set of problems rated for equivalent difficulty by experts shows no significant performance difference across languages, this would falsify the degradation finding.
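
One way to operationalize "no significant performance difference" is a two-proportion z-test on accuracy in two languages. The sketch below is an illustration, not the paper's analysis: the function name and the example counts are assumptions, and the normal approximation is only reasonable for moderately large samples. Because the problems are parallel across languages, a paired test such as McNemar's may be more appropriate in practice.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracy rates.

    Returns (z, p_value) using the pooled-variance normal approximation.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 45/50 correct in English vs 30/50 in Sinhala.
z, p = two_proportion_z_test(45, 50, 30, 50)
print(f"z = {z:.3f}, p = {p:.5f}")
```

A large p-value on a difficulty-matched problem set would be consistent with the settling condition above; a small one would support the degradation finding.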

Figures

Figures reproduced from arXiv: 2602.14517 by Asela Hevapathige, Buddhi Jayasekara, Kumar Thushalika, Sukumar Kishanthan.

Figure 1: Sample problems from each of the six problem types in [...]
Figure 2: Accuracy (%) across four LLMs, six problem types, and three languages. Darker green indicates higher accuracy; red [...]
Figure 3: Radar plots comparing model accuracy (%) across six problem types for each language. Polygon shrinkage from English [...]
Figure 4: Representative example of a unit conflict problem

Original abstract

Large language models (LLMs) have achieved strong results in mathematical reasoning, and are increasingly deployed as tutoring and learning support tools in educational settings. However, their reliability for students working in non-English languages, especially low-resource languages, remains poorly understood. We examine this gap by evaluating mathematical reasoning in Sinhala and Tamil -- two languages widely used in South Asian schools but underrepresented in artificial intelligence (AI) research. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset in which each problem is independently authored in Sinhala and Tamil by native speakers, and in English by fluent speakers, all with strong mathematical backgrounds. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages. These findings have direct implications for the deployment of AI tools in multilingual classrooms, and highlight the need for language-specific evaluation before adopting large language models as math tutoring aids in non-English educational contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript evaluates four prominent LLMs on mathematical reasoning across English, Sinhala, and Tamil using a taxonomy of six problem types (basic arithmetic through complex unit-conflict and optimization tasks). It constructs a parallel dataset in which each problem is independently authored in Sinhala and Tamil by native speakers and in English by fluent speakers, all with strong mathematical backgrounds, which avoids translation artifacts. The central claim is that basic arithmetic reasoning transfers robustly across languages while complex reasoning exhibits significant degradation in Tamil and Sinhala, with failure patterns varying by model and problem type.

Significance. If the empirical patterns hold after validation of the dataset, the work would be significant for AI-assisted education: it supplies direct evidence that English-centric LLM performance does not guarantee reliability in low-resource languages, with concrete implications for deploying tutoring tools in South Asian classrooms. The independently authored parallel dataset is a methodological strength that sidesteps translation-quality confounds.

major comments (1)
  1. §3 (Dataset Construction): The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our independently authored parallel dataset. We address the concern regarding validation of problem equivalence below.

Point-by-point responses
  1. Referee: The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.

    Authors: We agree that the original manuscript lacks explicit quantitative checks for cross-language equivalence of mathematical difficulty. Problems were independently authored by native-speaker experts with strong mathematical backgrounds to match the same concepts and intended difficulty levels, but we did not report supporting metrics. In the revised version we will add: (1) average statement lengths (in words and tokens) for each language version of every problem type; (2) counts of key mathematical concepts per problem; and (3) inter-author difficulty ratings collected during dataset creation. Pilot accuracy rates were not collected before the main study, so we cannot supply them retroactively; however, the added metrics will allow readers to assess whether systematic formulation differences could explain the observed gaps.

    Revision: yes
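
The first of the metrics discussed above, average statement length per language and problem type, could be computed along the following lines. This is a minimal sketch under stated assumptions: the `mean_word_counts` helper, the tuple layout, and the toy English statements are hypothetical, and words are counted by whitespace splitting. Token counts would additionally depend on each model's tokenizer and are not covered here.

```python
from collections import defaultdict
from statistics import mean


def mean_word_counts(problems):
    """Average whitespace-delimited word count per (language, problem_type).

    `problems` is an iterable of (language, problem_type, statement_text).
    """
    lengths = defaultdict(list)
    for language, problem_type, text in problems:
        lengths[(language, problem_type)].append(len(text.split()))
    return {key: mean(values) for key, values in lengths.items()}


# Toy statements for illustration only; not drawn from the paper's dataset.
problems = [
    ("en", "unit_conflict", "Add 3 and 4"),
    ("en", "unit_conflict", "A rope is 2 m long"),
    ("en", "basic_arithmetic", "What is 7 plus 5"),
]
stats = mean_word_counts(problems)
for key in sorted(stats):
    print(key, stats[key])
```

Raw word counts are only a rough proxy across languages: Sinhala and Tamil pack more morphology into each word than English does, so equal word counts do not imply equal information content.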

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation study

This paper is a standard empirical comparison: it constructs a parallel dataset of independently authored math problems in English, Sinhala, and Tamil, then measures LLM accuracy across six problem types. No equations, fitted parameters, or predictions appear anywhere in the provided text. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claim (basic arithmetic transfers while complex reasoning degrades) is supported solely by direct performance deltas on the new dataset; it does not reduce to any prior quantity by construction. Per the hard rules, absence of any load-bearing derivation or self-referential reduction yields score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. No free parameters, axioms, or invented entities are introduced.


