pith. sign in

arxiv: 2601.14958 · v3 · submitted 2026-01-21 · 💻 cs.CL · cs.AI

Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala

Pith reviewed 2026-05-16 12:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Sinhalascript sensitivitylanguage modelsRomanized textUnicodeperplexitylow-resource languages
0
0 comments X

The pith

Language models show over 300x performance drop on Romanized Sinhala versus Unicode, unrelated to model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks 24 open-source language models on Sinhala using perplexity across Unicode, Romanized, and mixed scripts from diverse sources. It finds median performance degrades by more than 300 times on Romanized text compared to Unicode. Model size shows no correlation with handling these variations, as smaller models often outperform much larger ones. Unicode results predict mixed-script behavior but not Romanized capability. The work supplies baselines for Sinhala and shows why single-script tests underestimate practical challenges.

Core claim

Open-source language models display strong script sensitivity on Sinhala. Perplexity rises by a median factor exceeding 300 when text shifts from Unicode to Romanized form. Model scale does not predict script competence, with smaller models frequently surpassing architectures up to 28 times larger. Unicode performance forecasts mixed-script robustness but fails to forecast Romanized results, demonstrating that single-script evaluations underestimate real-world deployment issues for this low-resource language.

What carries the argument

Perplexity evaluation of 24 language models on Unicode, Romanized, and mixed-script Sinhala text drawn from multiple sources.

If this is right

  • Model selection for Sinhala applications must test across scripts rather than rely on size or Unicode scores alone.
  • Single-script benchmarks substantially underestimate challenges in mixed-script social media settings.
  • Smaller models can be viable choices for Romanized Sinhala tasks despite lower parameter counts.
  • Unicode results alone cannot serve as a reliable proxy for full multi-script deployment readiness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training corpora for Sinhala models would benefit from greater inclusion of Romanized examples to lower sensitivity.
  • Languages with similar script duality may require parallel multi-script testing to avoid underestimating practical performance.
  • Standard benchmarks for low-resource models could add mixed-script and Romanized tracks as routine checks.

Load-bearing premise

The selected text sources and perplexity metric give an unbiased view of real-world Sinhala usage without domain or selection biases.

What would settle it

A re-evaluation using different text sources or a downstream metric such as translation accuracy that finds no large degradation between scripts or no size independence would falsify the central claims.

Figures

Figures reproduced from arXiv: 2601.14958 by Minuri Rajapakse, Ruvan Weerasinghe.

Figure 1
Figure 1. Figure 1: Benchmark methodology workflow from data collection through [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Llama family scaling across script conditions (log scale). [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Correlation between Unicode and other script conditions. (a) [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript benchmarks 24 open-source language models on Sinhala text in Unicode, Romanized, and mixed scripts using perplexity evaluation across diverse text sources. It reports a median performance degradation exceeding 300 times from Unicode to Romanized text, finds no correlation between model size and script-handling competence (with smaller models sometimes outperforming much larger ones), and shows that Unicode performance strongly predicts mixed-script robustness but not Romanized capability.

Significance. If the results hold after addressing domain controls, the work supplies useful empirical baselines for low-resource morphologically rich languages under real-world script variation, with direct implications for model selection and evaluation practices in multi-script deployment scenarios.

major comments (3)
  1. [Abstract] Abstract and Evaluation section: the 300x median degradation claim is presented without details on exact data sources, sample sizes, statistical tests, or error bars, making the quantitative result unverifiable from the provided information.
  2. [Evaluation Setup] Evaluation Setup: the attribution of the perplexity gap solely to script sensitivity is undermined by the lack of explicit domain or style matching between Unicode sources (typically formal/news) and Romanized sources (typically informal social media), as the abstract refers only to 'diverse text sources' without parallel content or controls; this is load-bearing for the central 'script sensitivity' claim.
  3. [Results] Results section: the claim of no correlation between model size and script-handling competence (including smaller models outperforming 28x larger architectures) requires specification of the exact models, architectures, and any controls for factors other than parameter count to support the conclusion.
minor comments (1)
  1. [Abstract] Abstract: consider specifying the perplexity metric (e.g., token-level or character-level) and the exact number of models evaluated for immediate clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We have carefully considered each comment and provide point-by-point responses below. We will make revisions to the manuscript to address the concerns raised, particularly by adding more details and clarifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: the 300x median degradation claim is presented without details on exact data sources, sample sizes, statistical tests, or error bars, making the quantitative result unverifiable from the provided information.

    Authors: Thank you for highlighting this. In the full manuscript, the Evaluation section describes the data sources: Unicode text from formal news corpora and Romanized from informal social media, with sample sizes of 5,000-10,000 tokens per source type. The median is computed across all 24 models. We will revise the abstract and add a table with exact sample sizes, sources, and interquartile ranges as error bars. No parametric statistical tests were used; we will note this explicitly. revision: yes

  2. Referee: [Evaluation Setup] Evaluation Setup: the attribution of the perplexity gap solely to script sensitivity is undermined by the lack of explicit domain or style matching between Unicode sources (typically formal/news) and Romanized sources (typically informal social media), as the abstract refers only to 'diverse text sources' without parallel content or controls; this is load-bearing for the central 'script sensitivity' claim.

    Authors: We acknowledge that domain and style differences exist between the sources and could contribute to the observed gap. However, the paper emphasizes real-world script variation, where such domain associations are inherent. To strengthen the claim, we will add an analysis using any available parallel sentences in both scripts and include a discussion of this potential confound in the revised Evaluation Setup section. We believe the consistency of the degradation across multiple diverse sources supports the script sensitivity interpretation, but we will qualify our conclusions. revision: partial

  3. Referee: [Results] Results section: the claim of no correlation between model size and script-handling competence (including smaller models outperforming 28x larger architectures) requires specification of the exact models, architectures, and any controls for factors other than parameter count to support the conclusion.

    Authors: The full paper provides a table (Table 1) listing all 24 models with their exact parameter counts, architectures (primarily decoder-only transformers from families like LLaMA, Gemma, Mistral, etc.), and other relevant details. We controlled for evaluation by using the same perplexity computation and tokenizers where possible. We will add a scatter plot in the Results section showing model size vs. degradation factor, with the correlation coefficient, and explicitly name the models where smaller ones outperform larger ones (e.g., a 3B model vs. an 84B model). revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with direct measurements

full rationale

The paper conducts a standard benchmarking study by computing perplexity on held-out text sources for Unicode, Romanized, and mixed-script Sinhala across 24 open-source LMs. No derivations, equations, fitted parameters, or self-referential predictions appear in the abstract or described methodology. Results are direct empirical measurements rather than reductions of outputs to inputs by construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The evaluation is self-contained against external model inferences and data sources, qualifying for the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmarking study and introduces no free parameters, new axioms beyond standard LM evaluation practices, or invented entities.

axioms (1)
  • domain assumption Perplexity is a valid proxy for language model quality across different scripts.
    Used as the sole evaluation metric for all script variants.

pith-pipeline@v0.9.0 · 5460 in / 1178 out tokens · 68752 ms · 2026-05-16T12:33:12.404526+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    SiTSE: Sinhala Text Simplification Dataset and Evaluation,

    S. Ranathunga et al., “SiTSE: Sinhala Text Simplification Dataset and Evaluation,”ACM Trans. Asian Low-Resour . Lang. Inf. Process., vol. 24, no. 5, pp. 1–19, May 2025, doi: 10.1145/3723160

  2. [2]

    SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala,

    A. Pramodya et al., “SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 32931–32949, Jan. 2025, doi: 10.18653/v1/2025.emnlp-main.1673

  3. [3]

    Handling Rare Word Problem Using Synthetic Training Data for Sinhala and Tamil Neural Machine Translation,

    P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, and S. Ranathunga, “Handling Rare Word Problem Using Synthetic Training Data for Sinhala and Tamil Neural Machine Translation,”Lang. Resour . Eval., May 2018

  4. [4]

    Fine-Tuning Self-Supervised Multilingual Sequence-to-Sequence Models for Ex- tremely Low-Resource NMT,

    S. Thillainathan, S. Ranathunga, and S. Jayasena, “Fine-Tuning Self-Supervised Multilingual Sequence-to-Sequence Models for Ex- tremely Low-Resource NMT,” inProc. Moratuwa Eng. Res. Conf. (MERCon), vol. 2, pp. 432–437, Jul. 2021, doi: 10.1109/mer- con52712.2021.9525720

  5. [5]

    Sinhala Transliteration: A Comparative Analysis Between Rule-Based and Seq2Seq Approaches,

    D. Mel, K. Wickramasinghe, N. de Silva, and S. Ranathunga, “Sinhala Transliteration: A Comparative Analysis Between Rule-Based and Seq2Seq Approaches,” arXiv:2501.00529, Dec. 2024

  6. [6]

    Social, Economic, and Demographic Factors Drive the Emergence of Hinglish Code-Mixing on Social Media,

    A. Sengupta, S. Das, Md. S. Akhtar, and T. Chakraborty, “Social, Economic, and Demographic Factors Drive the Emergence of Hinglish Code-Mixing on Social Media,”Humanit. Soc. Sci. Commun., vol. 11, no. 1, May 2024, doi: 10.1057/s41599-024-03058-6

  7. [7]

    Auto- matic Transliteration of Romanized Dialectal Arabic,

    M. Al-Badrashiny, R. Eskander, N. Habash, and O. Rambow, “Auto- matic Transliteration of Romanized Dialectal Arabic,” inProc. 18th Conf. Comput. Natural Lang. Learn., pp. 30–38, Jan. 2014, doi: 10.3115/v1/w14-1604

  8. [8]

    Improving Infor- mally Romanized Language Identification,

    A. Benton, A. Gutkin, C. Kirov, and B. Roark, “Improving Infor- mally Romanized Language Identification,” inProc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 2318–2336, 2025, doi: 10.18653/v1/2025.emnlp-main.117

  9. [9]

    The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Bench- marks,

    N. Selvam, S. Dev, D. Khashabi, T. Khot, and K.-W. Chang, “The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Bench- marks,” inProc. 61st Annu. Meeting Assoc. Comput. Linguist. (ACL), pp. 1373–1389, Jan. 2023, doi: 10.18653/v1/2023.acl-short.118

  10. [10]

    Language Model Tokenizers Introduce Unfairness Between Languages

    A. Petrov, E. La Malfa, P. H. S. Torr, and A. Bibi, “Lan- guage Model Tokenizers Introduce Unfairness Between Languages,” arXiv:2305.15425, Oct. 2023

  11. [11]

    Effect of Unknown and Fragmented Tokens on the Performance of Multilingual Language Models at Low-Resource Tasks,

    A. Nag, B. Samanta, A. Mukherjee, N. Ganguly, and S. Chakrabarty, “Effect of Unknown and Fragmented Tokens on the Performance of Multilingual Language Models at Low-Resource Tasks,” inProc. Int. Conf. Comput. Linguist. Intell. Text Process., pp. 95–107, Jun. 2024, doi: 10.1007/978-3-031-64451-1 5

  12. [12]

    Processing South Asian Languages Written in the Latin Script: The Dakshina Dataset,

    B. Roark et al., “Processing South Asian Languages Written in the Latin Script: The Dakshina Dataset,” inProc. 12th Lang. Resour . Eval. Conf. (LREC), pp. 2413–2423, Jul. 2020

  13. [13]

    Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting,

    M. Khullar, U. Desai, P. Malviya, A. Dalmia, and Z. R. Shi, “Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting,” arXiv:2512.10780, 2025

  14. [14]

    RomanSetu: Efficiently Unlocking Multilingual Capabilities of Large Language Models via Romanization,

    J. Jaavid et al., “RomanSetu: Efficiently Unlocking Multilingual Capabilities of Large Language Models via Romanization,” inProc. 62nd Annu. Meeting Assoc. Comput. Linguist. (ACL), pp. 15593– 15615, Jan. 2024, doi: 10.18653/v1/2024.acl-long.833

  15. [15]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” inProc. EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Netw. NLP, pp. 353–355, Nov. 2018

  16. [16]

    XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization,

    J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, “XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization,” inProc. 37th Int. Conf. Mach. Learn. (ICML), pp. 4411–4421, 2020

  17. [17]

    Measuring Massive Multitask Language Under- standing,

    D. Hendrycks et al., “Measuring Massive Multitask Language Under- standing,” inProc. Int. Conf. Learn. Represent. (ICLR), 2021

  18. [18]

    Ranathunga and N

    S. Ranathunga and N. de Silva, ”Some Languages are More Equal than Others,” in Proc. AACL-IJCNLP, pp. 823–848, 2022, doi: 10.18653/v1/2022.aacl-main.62

  19. [19]

    Sea-helm: Southeast asian holistic evaluation of language models,

    Y . Susanto et al., “SEA-HELM: Southeast Asian Holistic Evaluation of Language Models,” arXiv:2502.14301, Feb. 2025

  20. [20]

    Rajapakse, ”Sinhala perplexity test dataset,” Hugging Face, 2026

    M. Rajapakse, ”Sinhala perplexity test dataset,” Hugging Face, 2026. [Online]. Available: https://hf.co/datasets/Minuri/sinhala-perplexity-test-dataset

  21. [21]

    Rajapakse, ”Sinhala mixedscript,” Hugging Face, 2026

    M. Rajapakse, ”Sinhala mixedscript,” Hugging Face, 2026. [Online]. Available: https://hf.co/datasets/Minuri/sinhala-mixedscript

  22. [22]

    Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

    N. Reimers and I. Gurevych, ”Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proc. EMNLP-IJCNLP, 2019, doi: 10.18653/v1/d19-1410