The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?
Pith reviewed 2026-05-16 14:56 UTC · model grok-4.3
The pith
Multilingual model gaps often shrink when tokenization, encoding, and data exposure are equalized across languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performance disparities across languages in multilingual LMs arise primarily from representation and allocation choices such as tokenization, encoding, data exposure, and parameter sharing rather than from inherent linguistic complexity, because gaps shrink when these factors are normalized in the studies examined.
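As an illustration of the "data exposure" factor, here is a minimal sketch of exponent-smoothed language sampling, a widely used scheme in which a language with n_i training tokens is drawn with probability proportional to n_i raised to a power alpha. The function name, the alpha values, and the corpus sizes are hypothetical; the survey is not confirmed to recommend this particular scheme.

```python
def smoothed_sampling_probs(token_counts, alpha=0.3):
    """Return per-language sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces raw corpus proportions; alpha = 0.0 samples all
    languages uniformly; values in between upweight low-resource languages.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    weights = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical corpus sizes (in tokens), chosen only for illustration.
corpus = {"en": 1_000_000, "el": 10_000, "yo": 100}

proportional = smoothed_sampling_probs(corpus, alpha=1.0)
smoothed = smoothed_sampling_probs(corpus, alpha=0.3)
uniform = smoothed_sampling_probs(corpus, alpha=0.0)

for lang in corpus:
    print(f"{lang}: proportional={proportional[lang]:.5f} "
          f"smoothed={smoothed[lang]:.5f} uniform={uniform[lang]:.5f}")
```

Lowering alpha trades exposure for the largest language against exposure for the smallest one, which is the kind of allocation choice the core claim treats as a modeling decision rather than a property of the language.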
What carries the argument
Normalization of segmentation, encoding, and data exposure, which serves as the test that isolates modeling artifacts from linguistic features.
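One concrete encoding effect that such normalization targets can be checked directly: UTF-8 spends different numbers of bytes per code point depending on the script. The sketch below is an illustrative assumption about what an "encoding" disparity looks like, not an experiment taken from the paper; the helper name and sample greetings are hypothetical.

```python
def utf8_cost(text):
    """Return (code points, UTF-8 bytes, bytes per code point) for a string."""
    n_chars = len(text)
    if n_chars == 0:
        raise ValueError("text must be non-empty")
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars


# Short greetings in four scripts, chosen only for illustration.
samples = {
    "English (Latin)": "hello",
    "Greek": "γεια",
    "Hindi (Devanagari)": "नमस्ते",
    "Chinese (Hanzi)": "你好",
}

for name, text in samples.items():
    n_chars, n_bytes, ratio = utf8_cost(text)
    print(f"{name}: {n_chars} code points, {n_bytes} bytes, "
          f"{ratio:.1f} bytes per code point")
```

A byte-level model therefore sees longer sequences for the same number of characters in some scripts, independent of any linguistic property of the language.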
Load-bearing premise
The reviewed studies cover a representative range of typologically diverse languages and the observed gap reductions generalize beyond the specific normalization experiments cited.
What would settle it
A controlled experiment that applies uniform tokenization, encoding, and data sampling to a fresh typologically diverse language set and still finds large persistent gaps would falsify the central claim.
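A controlled comparison of this kind needs a metric that does not change with the tokenizer. The sketch below contrasts per-token perplexity with a loss normalized by a tokenizer-independent unit (here, characters). The function names and all numbers are hypothetical, and the choice of unit (bytes, characters, or something else) is itself a design decision that the source does not specify.

```python
import math


def per_token_perplexity(total_nll_nats, n_tokens):
    """Perplexity per token; depends on how finely the tokenizer splits text."""
    return math.exp(total_nll_nats / n_tokens)


def bits_per_unit(total_nll_nats, n_units):
    """Total negative log-likelihood in bits, divided by a fixed unit count."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return total_nll_nats / (math.log(2) * n_units)


# Hypothetical setup: two tokenizers assign the same total loss to one text,
# but split it into different numbers of tokens.
text = "hello world"
total_nll = 22 * math.log(2)   # assumed total loss: 22 bits, expressed in nats

coarse_tokens = 2              # e.g. a word-level split
fine_tokens = len(text)        # e.g. a character-level split

print(per_token_perplexity(total_nll, coarse_tokens))  # large
print(per_token_perplexity(total_nll, fine_tokens))    # small
print(bits_per_unit(total_nll, len(text)))             # identical for both
```

Per-token perplexity differs by orders of magnitude between the two splits even though the model assigns the text the same probability, so cross-language gaps reported per token can partly reflect segmentation rather than modeling difficulty.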
Original abstract
Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a literature survey examining performance disparities in multilingual language models. It investigates whether gaps arise from intrinsic linguistic difficulty or from design choices in representation and allocation (e.g., tokenization, encoding, data exposure, parameter sharing). The survey reviews linguistic features such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. It synthesizes evidence that gaps often shrink when segmentation, encoding, and data exposure are normalized, and concludes with design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
Significance. If the synthesis holds, the work is significant for multilingual NLP by consolidating evidence that many apparent disparities are addressable through modeling choices rather than inherent complexity. It aggregates patterns from existing studies showing gap reductions under normalized conditions and translates these into practical recommendations, providing a useful reference for researchers aiming to improve equity in multilingual systems. The survey's focus on linking linguistic features directly to mechanisms adds clarity to an active area of research.
major comments (1)
- [Literature synthesis on typological distance] The central claim that gap reductions generalize beyond the cited experiments depends on the representativeness of the reviewed studies for typologically diverse languages. The manuscript should include an explicit discussion or summary (e.g., in the section reviewing typological distance) of the language families, scripts, and resource levels covered in the aggregated experiments to support the generalization that 'much apparent difficulty stems from current modeling choices.'
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement on the scope of the review (e.g., approximate number of studies or languages covered) to help readers assess the breadth of the synthesis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the survey. We agree that explicitly summarizing the language coverage will strengthen the generalization regarding typological distance and will incorporate this in the revision.
Point-by-point responses
- Referee: [Literature synthesis on typological distance] The central claim that gap reductions generalize beyond the cited experiments depends on the representativeness of the reviewed studies for typologically diverse languages. The manuscript should include an explicit discussion or summary (e.g., in the section reviewing typological distance) of the language families, scripts, and resource levels covered in the aggregated experiments to support the generalization that 'much apparent difficulty stems from current modeling choices.'
  Authors: We agree that an explicit summary of the language families, scripts, and resource levels in the reviewed studies would better support the generalization. In the revised manuscript, we will add a dedicated paragraph and accompanying table in the typological distance section. The table will catalog the primary studies cited, listing covered language families (e.g., Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, Afro-Asiatic), scripts (Latin, Cyrillic, Arabic, Devanagari, Hanzi, etc.), and resource tiers (high-, medium-, and low-resource). This addition will demonstrate the breadth of the evidence base and clarify that the observed gap reductions hold across diverse typological settings.
  Revision: yes
Circularity Check
No significant circularity identified
Full rationale
The paper is a literature survey synthesizing existing studies on multilingual LM performance gaps. Its central claim—that gaps often shrink under normalized segmentation, encoding, and data exposure—is presented as an observed pattern across reviewed external work rather than a new derivation, fitted parameter, or self-referential equation. No load-bearing steps reduce by construction to the paper's own inputs, self-citations, or ansatzes; the argument relies on the fidelity of cited experiments, which is standard for surveys and does not constitute circularity.
Forward citations
Cited by 1 Pith paper
- Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models
  Maistros 8B is a new state-of-the-art open-weights Greek LLM built via knowledge distillation from large reasoning models on the CulturaQA dataset.