Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil

Asela Hevapathige; Buddhi Jayasekara; Kumar Thushalika; Sukumar Kishanthan

arxiv: 2602.14517 · v3 · submitted 2026-02-16 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-15 22:01 UTC · model grok-4.3


keywords: large language models, mathematical reasoning, low-resource languages, Sinhala, Tamil, multilingual AI, math education, model evaluation


The pith

Large language models perform basic arithmetic well across languages but show marked drops in complex reasoning for Sinhala and Tamil.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests four large language models on mathematical problems in English, Sinhala, and Tamil using a specially built parallel dataset. Each problem was written independently by native speakers to ensure the comparison reflects actual language capability rather than translation quality. The results indicate that simple calculations transfer reliably, yet tasks involving unit conflicts, optimization, or multi-step logic degrade substantially in the two South Asian languages. Because these models are increasingly proposed as classroom aids, the uneven performance raises questions about their readiness for non-English educational use. The variation across models and problem types points to the need for targeted testing in each language before widespread adoption.

Core claim

While basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages.
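The comparison behind this claim can be sketched as a per-cell accuracy table plus a gap measured against English. This is a minimal illustration, not the authors' evaluation code: the model name, language codes, problem-type labels, and every row in `RESULTS` are made-up placeholders.

```python
from collections import defaultdict

# Hypothetical graded results: (model, language, problem_type, is_correct).
# These rows are placeholders for illustration, not data from the paper.
RESULTS = [
    ("model_a", "en", "basic_arithmetic", True),
    ("model_a", "en", "basic_arithmetic", True),
    ("model_a", "si", "basic_arithmetic", True),
    ("model_a", "si", "basic_arithmetic", True),
    ("model_a", "en", "unit_conflict", True),
    ("model_a", "en", "unit_conflict", True),
    ("model_a", "si", "unit_conflict", True),
    ("model_a", "si", "unit_conflict", False),
]


def accuracy_table(results):
    """Return {(model, language, problem_type): accuracy in percent}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for model, lang, ptype, ok in results:
        cell = counts[(model, lang, ptype)]
        cell[0] += int(ok)
        cell[1] += 1
    return {key: 100.0 * c / n for key, (c, n) in counts.items()}


def gap_from_english(table, english="en"):
    """Return English accuracy minus each non-English cell's accuracy."""
    gaps = {}
    for (model, lang, ptype), acc in table.items():
        if lang == english:
            continue
        base = table.get((model, english, ptype))
        if base is not None:
            gaps[(model, lang, ptype)] = base - acc
    return gaps


table = accuracy_table(RESULTS)
gaps = gap_from_english(table)
for key in sorted(gaps):
    print(key, f"{gaps[key]:+.1f} points")
```

A positive gap means the model did worse in that language than in English on the same problem type. With so few placeholder rows the numbers carry no meaning; the point is only the shape of the computation.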

What carries the argument

A parallel dataset of math problems independently authored in Sinhala, Tamil, and English by native speakers with mathematical expertise, paired with a taxonomy of six problem types ranging from basic arithmetic to optimization.

If this is right

Basic arithmetic problems can be reliably solved by LLMs in Sinhala and Tamil.
Complex math problems require language-specific evaluation before LLM use in education.
Performance gaps differ by model, so selection matters for multilingual settings.
AI tutoring tools need language-specific benchmarks to ensure equity in low-resource contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar patterns may hold for other low-resource languages with limited training data.
Targeted fine-tuning on native-language math problems could mitigate the degradation.
This underscores the importance of diverse data sources beyond English-centric corpora for reasoning capabilities.

Load-bearing premise

The problems independently authored in each language have equivalent difficulty and mathematical complexity.

What would settle it

If a new set of problems rated for equivalent difficulty by experts shows no significant performance difference across languages, this would falsify the degradation finding.
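One way such a check might be run is a two-proportion z-test on accuracy in two languages. The sketch below uses only the standard library and assumes the two samples are independent; the function name and the counts in the usage lines are hypothetical, and the paper does not say which test, if any, it uses.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracy rates.

    Returns (z, p_value). Assumes independent samples and counts large
    enough for the normal approximation to be reasonable.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-correct or all-wrong: no observable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 90/100 correct in English vs 60/100 in another language.
z, p = two_proportion_z_test(90, 100, 60, 100)
print(f"z = {z:.2f}, p = {p:.2g}")
```

If the problems are matched item by item across languages, a paired test such as McNemar's would likely fit the design better than this unpaired comparison.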

Figures

Figures reproduced from arXiv: 2602.14517 by Asela Hevapathige, Buddhi Jayasekara, Kumar Thushalika, Sukumar Kishanthan.

**Figure 2.** Accuracy (%) across four LLMs, six problem types, and three languages. Darker green indicates higher accuracy; red …

**Figure 3.** Radar plots comparing model accuracy (%) across six problem types for each language. Polygon shrinkage from English …

**Figure 4.** Representative example of a unit conflict problem.

Original abstract

Large language models (LLMs) have achieved strong results in mathematical reasoning, and are increasingly deployed as tutoring and learning support tools in educational settings. However, their reliability for students working in non-English languages, especially low-resource languages, remains poorly understood. We examine this gap by evaluating mathematical reasoning in Sinhala and Tamil -- two languages widely used in South Asian schools but underrepresented in artificial intelligence (AI) research. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset in which each problem is independently authored in Sinhala and Tamil by native speakers, and in English by fluent speakers, all with strong mathematical backgrounds. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that strong performance in English does not guarantee reliable performance across languages. These findings have direct implications for the deployment of AI tools in multilingual classrooms, and highlight the need for language-specific evaluation before adopting large language models as math tutoring aids in non-English educational contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs keep basic arithmetic working across Sinhala and Tamil but lose ground on complex math tasks, using a fresh parallel dataset written independently in each language.


The main point is straightforward: basic arithmetic transfers reasonably well to Sinhala and Tamil, but harder problems like unit conflicts or optimization show noticeable drops compared with the English versions. The authors built a parallel set of problems written from scratch by native speakers with math backgrounds, which sidesteps the usual translation noise that muddies these comparisons. That construction is the clearest new piece here, and it targets a practical gap for school-level math support in two languages that get little attention in LLM work.

The six-type taxonomy is simple and covers the right range from easy to tricky school problems, so the setup feels grounded in real classroom needs rather than abstract benchmarks. The pattern they report aligns with what many people already suspect about English-heavy models, but having data on these specific languages makes it more useful for anyone thinking about deployment in South Asia.

The soft spot is the missing check on whether the independently written problems really match in difficulty. Without ratings, pilot runs, or any complexity metrics, some of the performance gap could trace back to how the Sinhala or Tamil versions were phrased rather than to language effects alone. The abstract also gives no numbers, sample sizes, or error bars, which leaves the size and reliability of the effect hard to judge until the full results are in.

This is the kind of paper that matters for people building or evaluating multilingual tools for education. It does not invent new methods, but the dataset and the basic cross-language comparison are worth referee time to verify the details and tighten the controls. I would send it to review.
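The letter's point about missing sample sizes and error bars can be made concrete with a confidence interval on each reported accuracy. The sketch below computes a Wilson score interval, a common choice for proportions from small samples; the function name and the counts in the usage line are hypothetical, and nothing here reflects numbers from the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total.

    z=1.96 corresponds to roughly 95% coverage. Returns (low, high)
    as fractions between 0 and 1.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical cell: 8 of 10 problems answered correctly.
low, high = wilson_interval(8, 10)
print(f"80% accuracy, 95% interval {low:.2f} to {high:.2f}")
```

With only ten problems in a cell, the interval spans roughly 0.49 to 0.94, which is why per-cell sample sizes matter when reading an accuracy heatmap.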

Referee Report

1 major / 0 minor

Summary. The manuscript evaluates four prominent LLMs on mathematical reasoning across English, Sinhala, and Tamil using a taxonomy of six problem types (basic arithmetic through complex unit-conflict and optimization tasks). It constructs a parallel dataset in which each problem is independently authored by native-speaker experts with strong mathematical backgrounds, avoiding translation artifacts. The central claim is that basic arithmetic reasoning transfers robustly across languages while complex reasoning exhibits significant degradation in Tamil and Sinhala, with failure patterns varying by model and problem type.

Significance. If the empirical patterns hold after validation of the dataset, the work would be significant for AI-assisted education: it supplies direct evidence that English-centric LLM performance does not guarantee reliability in low-resource languages, with concrete implications for deploying tutoring tools in South Asian classrooms. The independently authored parallel dataset is a methodological strength that sidesteps translation-quality confounds.

major comments (1)

§3 (Dataset Construction): The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our independently authored parallel dataset. We address the concern regarding validation of problem equivalence below.


Referee: The central claim that performance gaps reflect language-specific limitations rather than problem design rests on the assumption that independently authored problems maintain equivalent mathematical difficulty and complexity across Sinhala, Tamil, and English. The manuscript provides no quantitative validation (e.g., average statement length, concept rarity counts, pilot accuracy rates, or inter-author difficulty ratings) to support this equivalence; without such checks, degradation on complex tasks could arise from systematic differences in problem formulation.

Authors: We agree that the original manuscript lacks explicit quantitative checks for cross-language equivalence of mathematical difficulty. Problems were independently authored by native-speaker experts with strong mathematical backgrounds to match the same concepts and intended difficulty levels, but we did not report supporting metrics. In the revised version we will add: (1) average statement lengths (in words and tokens) for each language version of every problem type; (2) counts of key mathematical concepts per problem; and (3) inter-author difficulty ratings collected during dataset creation. No pilot runs were conducted before the main study, so we cannot retroactively supply pilot accuracy rates; however, the added metrics will allow readers to assess whether systematic formulation differences could explain the observed gaps.

Revision: yes
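Two of the promised checks, statement length and difficulty ratings, could be summarized per problem type and language along the following lines. This is a sketch under stated assumptions: the record layout, the rating scale, and all rows in `PROBLEMS` are hypothetical, the non-English statements are ASCII placeholders rather than real Sinhala or Tamil text, and word counts use whitespace splitting only. Token counts would depend on each model's tokenizer and are left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (language, problem_type, statement, difficulty_rating).
# Non-English statements are placeholders, not real Sinhala or Tamil text.
PROBLEMS = [
    ("en", "unit_conflict", "A tank holds 2 litres and loses 250 ml per hour", 3),
    ("si", "unit_conflict", "w1 w2 w3 w4 w5 w6 w7 w8", 3),
    ("ta", "unit_conflict", "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12", 4),
]


def equivalence_summary(problems):
    """Return {(problem_type, language): {n, mean_words, mean_rating}}."""
    groups = defaultdict(lambda: {"words": [], "ratings": []})
    for lang, ptype, text, rating in problems:
        group = groups[(ptype, lang)]
        group["words"].append(len(text.split()))  # whitespace word count
        group["ratings"].append(rating)
    return {
        key: {
            "n": len(g["words"]),
            "mean_words": mean(g["words"]),
            "mean_rating": mean(g["ratings"]),
        }
        for key, g in groups.items()
    }


summary = equivalence_summary(PROBLEMS)
for key in sorted(summary):
    print(key, summary[key])
```

Large differences in mean length or mean rating within one problem type would suggest that formulation, not language alone, may contribute to an accuracy gap. Raw word counts are only a rough proxy, since languages differ in how much meaning a single word carries.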

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation study


This paper is a standard empirical comparison: it constructs a parallel dataset of independently authored math problems in English, Sinhala, and Tamil, then measures LLM accuracy across six problem types. No equations, fitted parameters, or predictions appear anywhere in the provided text. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claim (basic arithmetic transfers while complex reasoning degrades) is supported solely by direct performance deltas on the new dataset; it does not reduce to any prior quantity by construction. Per the hard rules, absence of any load-bearing derivation or self-referential reduction yields score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5536 in / 1038 out tokens · 19939 ms · 2026-05-15T22:01:56.218062+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We develop a taxonomy of six math word problem types... construct a parallel dataset... test four leading LLMs using zero-shot prompting"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

