BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Jann Railey Montalan; Jian Gang Ngui; Peerat Limkonchotiwat; Thura Aung

arxiv: 2602.18788 · v3 · pith:UWPTMCCMnew · submitted 2026-02-21 · 💻 cs.CL

Pith reviewed 2026-05-25 07:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords Burmese NLP · LLM evaluation · low-resource languages · benchmark · natural language understanding · machine translation · sentiment analysis

The pith

Burmese LLM performance depends more on architecture, language representation, and instruction tuning than on model scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BURMESE-SAN as the first benchmark to test large language models on Burmese across understanding, reasoning, and generation. It assembles seven subtasks, several new to Burmese, using native-speaker review to keep the material natural and free of translation distortions. Large-scale tests on open and closed models show that regional fine-tuning and newer designs produce clearer gains than simply making models bigger. This matters for low-resource languages where pretraining data is sparse and morphology is complex. The benchmark is released as a public leaderboard to track ongoing work.

Core claim

BURMESE-SAN is the first holistic benchmark for Burmese NLP that evaluates LLMs on seven subtasks spanning natural language understanding, reasoning, and generation. Constructed via a native-speaker-driven process to ensure naturalness and cultural authenticity, the benchmark reveals that Burmese performance hinges more on architectural design, language representation, and instruction tuning than on model scale alone, with Southeast Asia regional fine-tuning and newer model generations producing substantial gains.

What carries the argument

BURMESE-SAN benchmark of seven subtasks (Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, Machine Translation) built through native-speaker review.

If this is right

Southeast Asia regional fine-tuning yields substantial gains on Burmese tasks.
Newer model generations outperform older ones even at comparable sizes.
Architectural design and instruction tuning outweigh raw parameter count for Burmese.
The public leaderboard enables tracking of progress on Burmese and other low-resource languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pattern may appear when similar benchmarks are built for other Southeast Asian languages with limited data.
Model developers could shift priority toward language-specific tuning rather than uniform scaling for underrepresented languages.
Adding targeted Burmese data in pretraining might alter how strongly scale predicts performance.

Load-bearing premise

The seven subtasks and native-speaker construction accurately represent core Burmese NLP abilities without biases from task selection or translation effects.

What would settle it

Finding a much larger model that outperforms smaller regionally-tuned models across most of the seven subtasks would show scale matters more than the claimed factors.
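The test described above can be sketched as a per-subtask comparison. This is a minimal illustration, not the paper's evaluation code: the function name `larger_model_wins`, the "strictly more than half of the subtasks" threshold, and every score below are assumptions made for the example. The numbers are placeholders, not results from the paper.

```python
# Subtask names follow the seven tasks listed on this page.
SUBTASKS = [
    "question_answering",
    "sentiment_analysis",
    "toxicity_detection",
    "causal_reasoning",
    "natural_language_inference",
    "abstractive_summarization",
    "machine_translation",
]


def larger_model_wins(large_scores, small_tuned_scores, subtasks=SUBTASKS):
    """Return (verdict, wins).

    verdict is True when the larger model scores strictly higher than the
    smaller regionally tuned model on more than half of the subtasks.
    wins lists the subtasks where the larger model is ahead.
    """
    wins = [t for t in subtasks if large_scores[t] > small_tuned_scores[t]]
    return len(wins) > len(subtasks) / 2, wins


# Placeholder numbers for illustration only; they are NOT results from the paper.
large_untuned = dict(zip(SUBTASKS, [40.0, 55.0, 50.0, 30.0, 35.0, 20.0, 25.0]))
small_sea_tuned = dict(zip(SUBTASKS, [45.0, 60.0, 48.0, 42.0, 41.0, 28.0, 33.0]))

verdict, wins = larger_model_wins(large_untuned, small_sea_tuned)
print(verdict, wins)
```

A single such comparison would not settle the question by itself; the referee's point about co-varying factors applies to any pair of models that differ in more than size.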

Figures

Figures reproduced from arXiv: 2602.18788 by Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat, Thura Aung.

**Figure 1.** BURMESE-SAN Benchmark (Left) and Dataset Curation Process for the benchmark (Right). BURMESE-SAN is a benchmark that holistically evaluates LLM performance across a wide range of Burmese language tasks. The evaluation is based on native Burmese text, with prompts written in formal Burmese to ensure clarity and grammatical correctness.

**Figure 2.** Left: Comparison of original models against SEA-fine-tuned variants, and Right: SEA-LION

**Figure 3.** Acceptable and Not Acceptable Grammar Errors in the Dataset.

**Figure 4.** Different Types of Spelling Errors. Each task may include different types of linguistic issues. QC members need to fix Grammar and Spelling errors, and the fixed datasets are later used for evaluating LLMs.

**Figure 5.** Examples of variation in Burmese translations by native speakers. (a) Differences in technical

**Figure 6.** Example Data Samples for each task in BURMESE-SAN.

**Figure 7.** Prompt templates used for BURMESE-SAN. English prompt versions are also provided.

Original abstract

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BURMESE-SAN gives the field a usable first benchmark for Burmese but its headline claim on scale versus tuning rests on uncontrolled comparisons.

The paper's main contribution is BURMESE-SAN, the first benchmark that pulls together seven Burmese tasks across understanding, reasoning, and generation, with several tasks newly created. The native-speaker construction process is a clear strength because it targets naturalness and reduces translation artifacts that often plague low-resource work. Releasing the leaderboard publicly also gives the community a concrete place to track progress on a language with thin pretraining coverage and complex morphology.

Referee Report

2 major / 2 minor

Summary. The paper introduces BURMESE-SAN, the first holistic benchmark for evaluating LLMs on Burmese across NLU, NLR, and NLG competencies via seven subtasks (Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, Machine Translation). The benchmark is built through a native-speaker-driven process to ensure linguistic naturalness and minimize translation artifacts. Large-scale evaluations of open-weight and commercial LLMs are reported, leading to the claim that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone, with substantial gains from Southeast Asia regional fine-tuning and newer model generations. The benchmark is released as a public leaderboard.

Significance. If the benchmark construction and evaluations hold, BURMESE-SAN addresses a clear gap for a low-resource language with rich morphology and limited pretraining coverage, providing a reproducible public resource that can support systematic progress. The emphasis on native-speaker involvement and the public leaderboard are concrete strengths that enhance utility for the community.

major comments (2)

[Results / model comparison tables] Results section (comparative model evaluations): The central claim that performance 'depends more on architectural design, language representation, and instruction tuning than on model scale alone' is load-bearing but not supported by the evidence. Newer model generations co-vary with increases in scale, data, and tuning; without matched pairs (identical base architecture and instruction-tuning regime, differing only in parameter count) or a regression isolating scale while controlling for generation and SEA fine-tuning, the attribution cannot be established. This identification problem directly undermines the 'more than scale alone' conclusion.
[Benchmark construction] Benchmark construction (§3 or equivalent): The claim that the seven subtasks and native-speaker process 'accurately capture core Burmese NLP competencies without introducing selection biases or translation artifacts' lacks quantitative validation (e.g., inter-annotator agreement metrics, artifact detection rates, or comparison to machine-translated baselines). This is load-bearing for the benchmark's validity as a reliable evaluation suite.
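One of the quantitative checks the second comment asks for, inter-annotator agreement, is often reported as Cohen's kappa. The sketch below shows how that statistic is computed for two annotators over the same items. Whether the authors use kappa or another agreement measure is not stated on this page, and the label lists are made-up placeholders.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports it as 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder sentiment labels, not data from the benchmark.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for tasks such as toxicity detection where one label may dominate.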

minor comments (2)

[Abstract / Introduction] Abstract and introduction: The description of the seven subtasks would benefit from a brief table or enumerated list with example sizes or sources to improve readability.
[Evaluation methodology] Evaluation setup: Error bars, statistical significance tests, or variance across runs are not mentioned for the reported scores; adding these would strengthen the comparative claims.
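The error bars requested in the second minor comment could be produced in several ways. One common option is a paired bootstrap over test examples, sketched here under the assumption that per-example scores are available for both models on the same items. The function name, the resample count, and the scores are illustrative and do not come from the paper.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    Both lists must hold per-example scores for the same test items,
    in the same order. Returns (observed_diff, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    # Resample examples with replacement and recompute the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Placeholder per-example correctness (1 = correct), not benchmark data.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(paired_bootstrap_diff(model_a, model_b))
```

If the resulting interval excludes zero, the gap between the two models is unlikely to be explained by the sampling of test items alone. Variance across decoding runs would need a separate analysis.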

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and indicate planned revisions where appropriate.

Referee: [Results / model comparison tables] The central claim that performance 'depends more on architectural design, language representation, and instruction tuning than on model scale alone' is load-bearing but not supported by the evidence. Newer model generations co-vary with increases in scale, data, and tuning; without matched pairs or a regression isolating scale while controlling for generation and SEA fine-tuning, the attribution cannot be established.

Authors: We agree that the available models do not permit perfectly matched pairs isolating scale from generation and tuning, and that our observational comparisons cannot establish strict causal attribution. The manuscript's claim is grounded in patterns across the evaluated models, including cases where smaller or similarly sized newer models with improved representation and tuning outperform larger older ones, as well as gains from SEA fine-tuning. To address the identification concern, we will revise the relevant sections to qualify the language (e.g., 'our results indicate' rather than 'show that performance depends more') and add an explicit limitations paragraph discussing co-varying factors. This is a partial revision. revision: partial
Referee: [Benchmark construction] The claim that the seven subtasks and native-speaker process 'accurately capture core Burmese NLP competencies without introducing selection biases or translation artifacts' lacks quantitative validation (e.g., inter-annotator agreement metrics, artifact detection rates, or comparison to machine-translated baselines).

Authors: The construction process emphasized native-speaker review to promote naturalness and reduce artifacts, as detailed in §3. While the initial submission focused on the qualitative process, we collected inter-annotator agreement statistics for several subtasks and performed preliminary native-vs-translated comparisons. We will add these quantitative metrics and an explicit artifact analysis section in the revision to strengthen the validity claims. revision: yes
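The matched-pair comparison raised in the first exchange can be made concrete with a small filter over a model table. In this sketch, the `Model` fields, the model names, and all scores are hypothetical. It keeps only pairs that share a base family and tuning regime and differ in parameter count, which is the condition the referee describes for isolating scale.

```python
from itertools import combinations
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    family: str      # base architecture and generation
    tuning: str      # instruction-tuning regime
    params_b: float  # parameter count in billions
    score: float     # overall benchmark score


def matched_pairs(models):
    """Return (smaller, larger) pairs that differ only in parameter count."""
    pairs = []
    for a, b in combinations(models, 2):
        same_recipe = a.family == b.family and a.tuning == b.tuning
        if same_recipe and a.params_b != b.params_b:
            small, large = sorted((a, b), key=lambda m: m.params_b)
            pairs.append((small, large))
    return pairs


def scale_effects(models):
    """Score gain attributable to size within each matched pair."""
    return [(s.name, l.name, l.score - s.score) for s, l in matched_pairs(models)]


# Hypothetical models and scores for illustration; NOT taken from the paper.
models = [
    Model("fam-x-4b-it", "fam-x", "instruct", 4, 20.0),
    Model("fam-x-12b-it", "fam-x", "instruct", 12, 27.5),
    Model("fam-x-12b-sea", "fam-x", "sea-finetuned", 12, 33.0),
    Model("fam-y-9b-it", "fam-y", "instruct", 9, 15.0),
]
print(scale_effects(models))
```

With few matched pairs available, as the authors acknowledge, such a table mostly serves to show how little of the evidence isolates scale from generation and tuning.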

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and evaluation

The paper introduces BURMESE-SAN as a new benchmark with seven subtasks, describes native-speaker construction, and reports empirical LLM evaluations. No equations, parameter fitting, derivations, or predictions appear. Claims about architecture/tuning vs. scale rest on comparative results across models rather than reducing to inputs by construction. No self-citation chains or ansatzes are load-bearing. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper with no mathematical free parameters, axioms, or invented entities; all content is descriptive of tasks and evaluations.

pith-pipeline@v0.9.0 · 5759 in / 1061 out tokens · 65019 ms · 2026-05-25T07:18:40.947425+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1]

Prompt Templates. For BURMESE-SAN, we designed task-specific prompt templates entirely in Burmese

Experimental Setup 4.1. Prompt Templates. For BURMESE-SAN, we designed task-specific prompt templates entirely in Burmese. These templates were carefully aligned with the principles of prompt design established in SEA-HELM (Susanto et al., 2025) to ensure consistency between tasks. In particular, we translate the task prompts from SEA-HELM into the native Bu...
[2]

Tables 3, 4, and 5 report performance across all NLP tasks, where the MY column denotes overall performance

Evaluation Results. We present our findings organized around five research questions that examine key aspects of Burmese language model capabilities as described in Section 1. Tables 3, 4, and 5 report performance across all NLP tasks, where the MY column denotes overall performance. Figure 2 compares original models with their SEA-fine-tuned variants (left) and SEA-...
[3]

Commercial models continue to achieve the strongest performance, but the gap with open-weight models is steadily narrowing as architectures and training strategies improve

4B achieves 26.24%, substantially exceeding SEA-LION v3 (Gemma 2) 9B at 15.40%. Commercial models continue to achieve the strongest performance, but the gap with open-weight models is steadily narrowing as architectures and training strategies improve. While model scale remains relevant, performance depends on architectural design, data quality, and in...
[4]

Conclusion. We introduce BURMESE-SAN, the first comprehensive benchmark for evaluating large language models on Burmese across NLU, NLR, and NLG tasks, constructed with high-quality, linguistically natural data spanning diverse domains. Our evaluation reveals clear performance gaps between model families and generations, demonstrating that Burmese capab...
[5]

Bibliographical References. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46. Sara Court and Micha Elsner. 2024. Shortcomings of LLMs for low-resource translation: Retrieval and understanding are both the problem. In Proceedings of the Ninth Conference on Machine Translation, pages 133...
[6]

Is small language model the silver bullet to low-resource languages machine translation? Open AI Team. 2025. Openai gpt-5 system card. Language Resource References. Anthropic. 2025. System card: Claude Opus 4 & Claude Sonnet 4. Accessed: 2026-02-18. Thura Aung, Ye Kyaw Thu, and Myat Noe Oo. 2024. myocr: Optical character recognition for myanmar language with post...
[7]

In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online

XL-Sum: Large-scale Multilingual Abstractive Summarization for 44 Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics. Zar Zar Hlaing, Ye Kyaw Thu, Thepchai Supnithi, and Ponrudee Netisopakul. 2022. Improving neural machine translation with pos-tag ...
[8]

Retrieved 2024-12-06

De Gruyter Mouton. Retrieved 2024-12-06. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William E...
[9]

In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6119–6136, Albuquerque, New Mexico

SeaExam and SeaBench: Benchmarking LLMs with local multilingual questions in Southeast Asia. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6119–6136, Albuquerque, New Mexico. Association for Computational Linguistics. Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, El...
[10]

MULTICSD Project Team

Batayan: A filipino nlp benchmark for evaluating large language models. MULTICSD Project Team. 2025. Burmese (myanmar). https://sites.google.com/view/multicsd/global-languages/burmese-myanmar. Accessed: 2025-07-12. Raymond Ng, Thanh Ngan Nguyen, Huang Yuli, Tai Ngee Chia, Leong Wai Yi, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nichola...
[11]

SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 370–390, Mexico City, Mexico. Association for Computational Linguistics. Bryan Wilie, Karissa...
[12]

Theoretical

IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics...
