The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Chen Shani; Dan Jurafsky; Ekaterina Shutova; Nathan Roll; Yuval Reif

arxiv: 2601.07220 · v3 · submitted 2026-01-12 · 💻 cs.CL

Chen Shani, Yuval Reif, Nathan Roll, Dan Jurafsky, Ekaterina Shutova

Pith reviewed 2026-05-16 14:56 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual language models · performance gaps · tokenization · data exposure · typological diversity · modeling choices · segmentation

The pith

Multilingual model gaps often shrink when tokenization, encoding, and data exposure are equalized across languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews the sources of uneven performance in multilingual language models. It separates intrinsic linguistic traits such as morphology and syntax from modeling decisions including tokenization, encoding, data sampling, and parameter sharing. The reviewed evidence shows that performance differences frequently narrow once segmentation and data exposure are normalized. The analysis concludes that many observed disparities reflect current design choices rather than fixed linguistic difficulty. From these patterns the paper draws concrete recommendations for tokenizers, sampling strategies, architectures, and evaluation protocols aimed at more balanced coverage.

Core claim

Performance disparities across languages in multilingual LMs arise primarily from representation and allocation choices such as tokenization, encoding, data exposure, and parameter sharing rather than from inherent linguistic complexity, because gaps shrink when these factors are normalized in the studies examined.
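One way to make the representation side of this claim concrete is to measure how much a fixed encoding and tokenizer charge each language per character or per word. The sketch below is illustrative only and is not taken from the paper: `utf8_bytes_per_char`, `fertility`, and `char_tokenize` are hypothetical helper names, and the toy character tokenizer stands in for whatever subword tokenizer is under study.

```python
from typing import Callable, List


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a list of tokens;
    it is a placeholder for a real subword tokenizer.
    """
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer (an assumption for the demo): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Same concept written in four scripts; byte cost per character differs.
    for sample in ["language", "язык", "भाषा", "语言"]:
        print(sample, utf8_bytes_per_char(sample))
    print(fertility("two words", char_tokenize))
```

Whitespace-based word counts are themselves a modeling choice and do not apply cleanly to scripts written without spaces, so fertility computed this way is only comparable across languages that delimit words similarly.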

What carries the argument

Normalization of segmentation, encoding, and data exposure, which serves as the test that isolates modeling artifacts from linguistic features.
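The normalization test can be illustrated with a small calculation. Per-token loss depends on how finely a tokenizer segments a language, so cross-language comparisons are often reported per character or per byte instead. The sketch below assumes a summed negative log-likelihood in nats is already available for a text; the function names are hypothetical, and the studies the survey reviews may normalize differently.

```python
import math
from typing import Dict


def bits_per_unit(total_nll_nats: float, n_units: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per unit."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return total_nll_nats / (n_units * math.log(2))


def compare_normalizations(text: str, n_tokens: int, total_nll_nats: float) -> Dict[str, float]:
    """Report the same total loss under three different denominators."""
    return {
        "bits_per_token": bits_per_unit(total_nll_nats, n_tokens),
        "bits_per_char": bits_per_unit(total_nll_nats, len(text)),
        "bits_per_byte": bits_per_unit(total_nll_nats, len(text.encode("utf-8"))),
    }
```

Per-byte figures still depend on the encoding, since UTF-8 spends more bytes per character on some scripts than on others, so no single denominator is neutral across languages.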

Load-bearing premise

The reviewed studies cover a representative range of typologically diverse languages and the observed gap reductions generalize beyond the specific normalization experiments cited.

What would settle it

A controlled experiment that applies uniform tokenization, encoding, and data sampling to a fresh typologically diverse language set and still finds large persistent gaps would falsify the central claim.
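For the data-sampling arm of such an experiment, one commonly used way to control exposure is exponent-smoothed sampling, where a language with corpus size n is drawn with probability proportional to n raised to a power alpha. The sketch below is an assumption about how exposure could be controlled, not the paper's protocol; the function name is hypothetical and the corpus sizes are made-up placeholders.

```python
from typing import Dict


def smoothed_sampling_probs(sizes: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Return sampling probabilities proportional to size ** alpha.

    alpha = 1.0 reproduces the raw corpus proportions;
    alpha = 0.0 gives every language the same probability.
    """
    if not sizes:
        raise ValueError("sizes must be non-empty")
    if any(n <= 0 for n in sizes.values()):
        raise ValueError("all corpus sizes must be positive")
    weights = {lang: float(n) ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


if __name__ == "__main__":
    # Placeholder corpus sizes (hypothetical, not from the paper).
    corpus_sizes = {"lang_a": 1_000_000, "lang_b": 10_000, "lang_c": 100}
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, smoothed_sampling_probs(corpus_sizes, alpha))
```

Setting alpha to 0 equalizes sampling probability but not necessarily the amount of distinct text seen, because small corpora are then repeated many times; a controlled comparison would need to report both.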

Original abstract

Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey shows many multilingual LM gaps shrink under normalized tokenization and data exposure, pointing more to design choices than intrinsic language difficulty.

The main takeaway is that performance gaps across languages often narrow when segmentation, encoding, and data exposure are balanced, which suggests current modeling decisions drive more of the disparity than any fixed linguistic complexity. The paper organizes existing studies around linguistic features like morphology, orthography, and syntax, then links each one to specific mechanisms such as tokenization and parameter sharing. It collects cases where gaps close under more even conditions and turns those patterns into practical recommendations for tokenizers, sampling strategies, and evaluation.

Referee Report

1 major / 1 minor

Summary. The paper is a literature survey examining performance disparities in multilingual language models. It investigates whether gaps arise from intrinsic linguistic difficulty or from design choices in representation and allocation (e.g., tokenization, encoding, data exposure, parameter sharing). The survey reviews linguistic features such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. It synthesizes evidence that gaps often shrink when segmentation, encoding, and data exposure are normalized, and concludes with design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.

Significance. If the synthesis holds, the work is significant for multilingual NLP by consolidating evidence that many apparent disparities are addressable through modeling choices rather than inherent complexity. It aggregates patterns from existing studies showing gap reductions under normalized conditions and translates these into practical recommendations, providing a useful reference for researchers aiming to improve equity in multilingual systems. The survey's focus on linking linguistic features directly to mechanisms adds clarity to an active area of research.

major comments (1)

[Literature synthesis on typological distance] The central claim that gap reductions generalize beyond the cited experiments depends on the representativeness of the reviewed studies for typologically diverse languages. The manuscript should include an explicit discussion or summary (e.g., in the section reviewing typological distance) of the language families, scripts, and resource levels covered in the aggregated experiments to support the generalization that 'much apparent difficulty stems from current modeling choices.'

minor comments (1)

[Abstract] The abstract would benefit from a brief statement on the scope of the review (e.g., approximate number of studies or languages covered) to help readers assess the breadth of the synthesis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the survey. We agree that explicitly summarizing the language coverage will strengthen the generalization regarding typological distance and will incorporate this in the revision.

Referee: [Literature synthesis on typological distance] The central claim that gap reductions generalize beyond the cited experiments depends on the representativeness of the reviewed studies for typologically diverse languages. The manuscript should include an explicit discussion or summary (e.g., in the section reviewing typological distance) of the language families, scripts, and resource levels covered in the aggregated experiments to support the generalization that 'much apparent difficulty stems from current modeling choices.'

Authors: We agree that an explicit summary of the language families, scripts, and resource levels in the reviewed studies would better support the generalization. In the revised manuscript, we will add a dedicated paragraph and accompanying table in the typological distance section. The table will catalog the primary studies cited, listing covered language families (e.g., Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, Afro-Asiatic), scripts (Latin, Cyrillic, Arabic, Devanagari, Hanzi, etc.), and resource tiers (high-, medium-, and low-resource). This addition will demonstrate the breadth of the evidence base and clarify that the observed gap reductions hold across diverse typological settings.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper is a literature survey synthesizing existing studies on multilingual LM performance gaps. Its central claim, that gaps often shrink under normalized segmentation, encoding, and data exposure, is presented as an observed pattern across reviewed external work rather than as a new derivation, fitted parameter, or self-referential equation. No load-bearing step reduces by construction to the paper's own inputs, self-citations, or ansatzes. The argument relies on the fidelity of the cited experiments, which is standard for surveys and does not constitute circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a literature survey and introduces no new free parameters, axioms, or invented entities; all claims rest on cited external studies.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models
cs.CL · 2026-05 · unverdicted · novelty 6.0

Maistros 8B is a new state-of-the-art open-weights Greek LLM built via knowledge distillation from large reasoning models on the CulturaQA dataset.
