Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Anastasiia Sedova; Maartje ter Hoeve; Natalie Schluter; Skyler Seto

arxiv: 2605.23885 · v1 · pith:CTN4BLK3new · submitted 2026-05-22 · 💻 cs.CL

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Anastasiia Sedova , Natalie Schluter , Skyler Seto , Maartje ter Hoeve This is my paper

Pith reviewed 2026-05-25 04:08 UTC · model grok-4.3

classification 💻 cs.CL

keywords cross-lingual knowledge transferlexical interventionmultilingual pretraininglow-resource languagesbilingual vocabularydata-level interventionknowledge transfer

0 comments

The pith

Random lexical swaps in English pretraining data improve knowledge transfer to eight low-resource languages without extra training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LINK, a method that randomly replaces some English words with their translations from a bilingual vocabulary during the pretraining phase on high-resource data. This intervention is intended to help models acquire target-language versions of knowledge that would otherwise come only from scarce target-language text. The approach needs nothing beyond an existing bilingual word list, which can be built at almost no cost. Experiments across eight languages and five model sizes report gains on downstream tasks plus up to a 2x reduction in the training steps needed to reach a given performance level.

Core claim

LINK improves cross-lingual knowledge transfer by performing random word-level lexical substitutions on a portion of the English pretraining corpus at a chosen replacement ratio. Selected English words are swapped with their translations drawn from a bilingual vocabulary, after which the mixed corpus is used for standard pretraining. The method requires no parallel sentences, no translation models, and no additional training stages.

What carries the argument

LINK, the data-level intervention that applies random lexical substitutions from bilingual vocabularies to the high-resource portion of pretraining data.

If this is right

Downstream task performance improves in the target language for eight languages and five model sizes.
Training reaches equivalent performance levels up to twice as fast.
Only a bilingual vocabulary is needed; no parallel data or extra model stages are required.
The intervention works when target-language data is scarce.
The method can be applied during ordinary pretraining at negligible added cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same substitution trick could be tested on pairs of languages that both have moderate data, not just English to low-resource.
If the effect holds, multilingual pretraining pipelines could add target-language coverage by extending existing bilingual dictionaries rather than collecting new corpora.
The result suggests that surface lexical overlap may help models align deeper reasoning structures across languages.

Load-bearing premise

Random word-level swaps using a bilingual vocabulary transfer complex knowledge such as scientific reasoning without adding harmful noise.

What would settle it

A controlled run in which the same pretraining data and schedule, but with substitutions disabled or replaced by random non-translation words, yields identical or better downstream results in the target languages.

read the original abstract

Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK - a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in high-resource part of pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LINK is a low-cost lexical substitution trick during pretraining that claims to speed up knowledge transfer to low-resource languages, but the abstract supplies no experimental details or controls to show the gains are real rather than noise.

read the letter

The paper's main contribution is LINK, a data intervention that randomly replaces some English words in the pretraining corpus with their target-language translations drawn from a bilingual vocabulary. This requires nothing beyond the vocab itself and no extra training stages or parallel data. That setup is genuinely new in this specific use for cross-lingual knowledge transfer, and it is the kind of minimal intervention that could matter for languages where even small dictionaries are the only resource available. The reported results across eight languages and five model sizes, with gains on downstream tasks and up to 2x faster convergence, would be practically useful if they hold.

Referee Report

2 major / 0 minor

Summary. The paper proposes LINK, a data-level intervention for cross-lingual knowledge transfer that randomly substitutes words in portions of the English pretraining corpus with translations drawn from a bilingual vocabulary at a chosen replacement ratio. The method requires no additional training stages, parallel data, or auxiliary models. Evaluation across eight languages and five model sizes is reported to yield notable gains on downstream tasks in the target language together with up to 2x speedup to reach equivalent performance.

Significance. If the empirical results prove robust, the approach would supply a near-zero-cost, training-free mechanism for leveraging high-resource data to improve low-resource language performance on knowledge-intensive tasks, broadening access to effective multilingual models without reliance on scarce parallel resources.

major comments (2)

Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.
Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate where revisions will be made.

read point-by-point responses

Referee: Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.

Authors: We agree that the manuscript does not provide explicit analysis of gradient preservation or coherence after substitutions. The empirical results across eight languages and five model sizes show consistent downstream gains, which indirectly supports that learning signals remain effective, but this is insufficient. We will add an ablation varying the replacement ratio (e.g., 0%, 5%, 10%, 20%) with corresponding downstream performance curves, and include a simple coherence metric such as the fraction of substituted sentences whose local syntax remains intact (measured via dependency parsing) or a comparison of perplexity on held-out monolingual data. These additions will appear in a new subsection of the method and experiments. revision: yes
Referee: Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.

Authors: We acknowledge that the current evaluation lacks several standard controls. The manuscript already evaluates across eight languages and five model sizes with a fixed total token budget, but does not report baselines, run counts, or statistical tests. In revision we will (1) add a no-intervention baseline with identical data volume, (2) report means and standard deviations over three random seeds, (3) include paired t-tests or Wilcoxon tests for significance, (4) explicitly state that replacement ratio is varied while holding total tokens constant, and (5) add a short discussion of potential confounds including vocabulary overlap and the effect of code-switching on attention patterns. These changes will be incorporated into Section 4 and the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical intervention evaluated on external downstream tasks

full rationale

The paper presents LINK as a data-level intervention that performs random word-level lexical substitutions from bilingual vocabularies during pretraining of high-resource data. Claims of improved knowledge transfer and training speedup are assessed solely via empirical evaluation on downstream tasks across eight languages and five model sizes. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the method's value is not established by construction or internal redefinition but by external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the stated assumption that bilingual vocabularies are cheaply available; the replacement ratio is mentioned but not quantified or fitted.

axioms (1)

domain assumption Bilingual vocabularies can be obtained at near-zero cost for virtually any language
Invoked in the abstract to justify the method's practicality without additional resources.

pith-pipeline@v0.9.0 · 5727 in / 1302 out tokens · 42937 ms · 2026-05-25T04:08:56.100299+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 9 internal anchors

[1]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BigScience Workshop , :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. URL https://arxiv.org/abs/2303.12712

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

When is multilinguality a curse? language modeling for 250 high-and low-resource languages

Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high-and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 4074--4096, 2024

work page 2024
[6]

Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz Daud, Abosede Grace Olanihun, et al

Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz Daud, Abosede Grace Olanihun, et al. Global piqa: Evaluating physical commonsense reasoning across 100+ languages and cultures. ArXiv, abs/2510.24081, 2025. URL https://api.semanticscholar.org/CorpusID:282401377

work page arXiv 2025
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Cross-lingual language model pretraining

Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook, NY, USA, 2019

work page 2019
[9]

Unsupervised Cross-lingual Representation Learning at Scale , booktitle =

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the As...

work page doi:10.18653/v1/2020.acl-main.747 2020
[10]

Emerging Cross-lingual Structure in Pretrained Language Models , booktitle =

Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging Cross -lingual Structure in Pretrained Language Models . In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pp.\ 6022--6034, Online, July 2020 b . Associatio...

work page doi:10.18653/v1/2020.acl-main.536 2020
[11]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. Deepseek-v3 technical report, 2025. URL https://arx...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol...

work page doi:10.18653/v1/n19-1423 2019
[13]

Data augmentation for low-resource neural machine translation

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Data augmentation for low-resource neural machine translation. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 567--573, Vancouver, Canada, July 2017. Association for Computational Linguistic...

work page doi:10.18653/v1/p17-2090 2017
[14]

Gemini Team , Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, et al. Gemini: A family of highly capable multimoda...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Task-adaptive pretrained language models via clustered-importance sampling

David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=p6ncr0eTKE

work page 2025
[16]

Rae, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

work page 2022
[17]

Explicit alignment objectives for multilingual bidirectional encoders

Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Siddhant, and Graham Neubig. Explicit alignment objectives for multilingual bidirectional encoders. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the Nort...

work page doi:10.18653/v1/2021.naacl-main.284 2021
[18]

Contextual augmentation: Data augmentation by words with paradigmatic relations

Sosuke Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pp.\ 452--457, New Orleans, Louisi...

work page doi:10.18653/v1/n18-2072 2018
[19]

Dimakis, Yair Carmon, Achal Dav, Ludwig Schmidt, and Vaishaal Shankar

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

work page 2024
[20]

P re A lign: Boosting cross-lingual transfer by early establishment of multilingual alignment

Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. P re A lign: Boosting cross-lingual transfer by early establishment of multilingual alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 10246--10257, Miami, Florida, USA, Nove...

work page doi:10.18653/v1/2024.emnlp-main.572 2024
[21]

Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLM s

Danni Liu and Jan Niehues. Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLM s. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 15979--15996, Vienna, Austria, July 202...

work page doi:10.18653/v1/2025.acl-long.778 2025
[22]

ATLAS : Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, and Sayna Ebrahimi. ATLAS : Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. In The Fourteenth International Conference on Learning Representations, 2026. URL https://op...

work page 2026
[23]

Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In Phil Blunsom, Shay Cohen, Paramveer Dhillon, and Percy Liang (eds.), Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp.\ 151--159, Denver, Colorado, June 2015. Association for Computational Ling...

work page doi:10.3115/v1/w15-1521 2015
[24]

Natural language processing applications for low-resource languages

Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. Natural language processing applications for low-resource languages. Natural Language Processing, 31 0 (2): 0 183–197, 2025. doi:10.1017/nlp.2024.33

work page doi:10.1017/nlp.2024.33 2025
[25]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Li...

work page doi:10.18653/v1/p16-1144 2016
[26]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydl \' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?...

work page 2024
[27]

Fineweb2: One pipeline to scale them all -- adapting pre-training data processing to every language, 2025

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. Fineweb2: One pipeline to scale them all -- adapting pre-training data processing to every language, 2025. URL https://arxiv.org/abs/2506.20920

work page arXiv 2025
[28]

Winogrande: an adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64 0 (9): 0 99–106, August 2021. ISSN 0001-0782. doi:10.1145/3474381. URL https://doi.org/10.1145/3474381

work page doi:10.1145/3474381 2021
[29]

Training bilingual LM s with data constraints in the targeted language

Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual LM s with data constraints in the targeted language. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 19096--19122, Vienna, Austria, July 20...

work page doi:10.18653/v1/2025.findings-acl.977 2025
[30]

A benchmark for learning to translate a new language from one grammar book

Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to translate a new language from one grammar book. In Arxiv, 2023

work page 2023
[31]

Qwen2.5 Technical Report

Team Qwen , :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianh...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Aya model: An instruction finetuned open-access multilingual language model

Ahmet \"U st \"u n, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D ' souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model. In Lun-Wei ...

work page doi:10.18653/v1/2024.acl-long.845 2024
[34]

Recent advancements and challenges of T urkic C entral A sian language processing

Yana Veitsman and Mareike Hartmann. Recent advancements and challenges of T urkic C entral A sian language processing. In Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, and Lasitha Uyangodage (eds.), Proceedings of the First Workshop on Language Models for Low-Resource Languages, pp...

work page 2025
[35]

S witch O ut: an efficient data augmentation algorithm for neural machine translation

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. S witch O ut: an efficient data augmentation algorithm for neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 856--861, Brussels, Belgium, October-Novemb...

work page doi:10.18653/v1/d18-1100 2018
[36]

Investigating and scaling up code-switching for multilingual language model pre-training

Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, and Shujian Huang. Investigating and scaling up code-switching for multilingual language model pre-training. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.575 2025
[37]

EDA : Easy data augmentation techniques for boosting performance on text classification tasks

Jason Wei and Kai Zou. EDA : Easy data augmentation techniques for boosting performance on text classification tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP...

work page doi:10.18653/v1/d19-1670 2019
[38]

Liu, and Matt Gardner

Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, pp.\ 94--106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-4413. URL https://...

work page doi:10.18653/v1/w17-4413 2017
[39]

CCN et: Extracting high quality monolingual datasets from web crawl data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzm \'a n, Armand Joulin, and Edouard Grave. CCN et: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Fr \'e d \'e ric B \'e chet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Be...

work page 2020
[40]

Wiktionary: The free dictionary, 2025

Wikimedia Foundation . Wiktionary: The free dictionary, 2025. URL https://www.wiktionary.org. Accessed: 2025

work page 2025
[41]

Wang, Jiwei Li, Daniel L \'e vy, Aiming Nie, Dan Jurafsky, and Andrew Y

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel L \'e vy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as smoothing in neural network language models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1VyHY9gg

work page 2017
[43]

m T 5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. m T 5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Procee...

work page doi:10.18653/v1/2021.naacl-main.41 2021
[44]

Code-switching curriculum learning for multilingual transfer in LLM s

Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in LLM s. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 7816--7836, Vienna, Austria, July 2025. Association for Com...

work page doi:10.18653/v1/2025.findings-acl.407 2025
[45]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. H ella S wag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Llu \'i s M \`a rquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.\ 4791--4800, Florence, Italy, July 2019. Association for Computational...

work page doi:10.18653/v1/p19-1472 2019

[1] [1]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BigScience Workshop , :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. URL https://arxiv.org/abs/2303.12712

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

When is multilinguality a curse? language modeling for 250 high-and low-resource languages

Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high-and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 4074--4096, 2024

work page 2024

[6] [6]

Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz Daud, Abosede Grace Olanihun, et al

Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz Daud, Abosede Grace Olanihun, et al. Global piqa: Evaluating physical commonsense reasoning across 100+ languages and cultures. ArXiv, abs/2510.24081, 2025. URL https://api.semanticscholar.org/CorpusID:282401377

work page arXiv 2025

[7] [7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Cross-lingual language model pretraining

Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook, NY, USA, 2019

work page 2019

[9] [9]

Unsupervised Cross-lingual Representation Learning at Scale , booktitle =

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the As...

work page doi:10.18653/v1/2020.acl-main.747 2020

[10] [10]

Emerging Cross-lingual Structure in Pretrained Language Models , booktitle =

Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging Cross -lingual Structure in Pretrained Language Models . In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pp.\ 6022--6034, Online, July 2020 b . Associatio...

work page doi:10.18653/v1/2020.acl-main.536 2020

[11] [11]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. Deepseek-v3 technical report, 2025. URL https://arx...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol...

work page doi:10.18653/v1/n19-1423 2019

[13] [13]

Data augmentation for low-resource neural machine translation

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Data augmentation for low-resource neural machine translation. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 567--573, Vancouver, Canada, July 2017. Association for Computational Linguistic...

work page doi:10.18653/v1/p17-2090 2017

[14] [14]

Gemini Team , Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, et al. Gemini: A family of highly capable multimoda...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Task-adaptive pretrained language models via clustered-importance sampling

David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=p6ncr0eTKE

work page 2025

[16] [16]

Rae, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

work page 2022

[17] [17]

Explicit alignment objectives for multilingual bidirectional encoders

Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Siddhant, and Graham Neubig. Explicit alignment objectives for multilingual bidirectional encoders. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the Nort...

work page doi:10.18653/v1/2021.naacl-main.284 2021

[18] [18]

Contextual augmentation: Data augmentation by words with paradigmatic relations

Sosuke Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pp.\ 452--457, New Orleans, Louisi...

work page doi:10.18653/v1/n18-2072 2018

[19] [19]

Dimakis, Yair Carmon, Achal Dav, Ludwig Schmidt, and Vaishaal Shankar

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

work page 2024

[20] [20]

P re A lign: Boosting cross-lingual transfer by early establishment of multilingual alignment

Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. P re A lign: Boosting cross-lingual transfer by early establishment of multilingual alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 10246--10257, Miami, Florida, USA, Nove...

work page doi:10.18653/v1/2024.emnlp-main.572 2024

[21] [21]

Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLM s

Danni Liu and Jan Niehues. Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLM s. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 15979--15996, Vienna, Austria, July 202...

work page doi:10.18653/v1/2025.acl-long.778 2025

[22] [22]

ATLAS : Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, and Sayna Ebrahimi. ATLAS : Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. In The Fourteenth International Conference on Learning Representations, 2026. URL https://op...

work page 2026

[23] [23]

Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In Phil Blunsom, Shay Cohen, Paramveer Dhillon, and Percy Liang (eds.), Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp.\ 151--159, Denver, Colorado, June 2015. Association for Computational Ling...

work page doi:10.3115/v1/w15-1521 2015

[24] [24]

Natural language processing applications for low-resource languages

Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. Natural language processing applications for low-resource languages. Natural Language Processing, 31 0 (2): 0 183–197, 2025. doi:10.1017/nlp.2024.33

work page doi:10.1017/nlp.2024.33 2025

[25] [25]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Li...

work page doi:10.18653/v1/p16-1144 2016

[26] [26]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydl \' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?...

work page 2024

[27] [27]

Fineweb2: One pipeline to scale them all -- adapting pre-training data processing to every language, 2025

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. Fineweb2: One pipeline to scale them all -- adapting pre-training data processing to every language, 2025. URL https://arxiv.org/abs/2506.20920

work page arXiv 2025

[28] [28]

Winogrande: an adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64 0 (9): 0 99–106, August 2021. ISSN 0001-0782. doi:10.1145/3474381. URL https://doi.org/10.1145/3474381

work page doi:10.1145/3474381 2021

[29] [29]

Training bilingual LM s with data constraints in the targeted language

Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual LM s with data constraints in the targeted language. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 19096--19122, Vienna, Austria, July 20...

work page doi:10.18653/v1/2025.findings-acl.977 2025

[30] [30]

A benchmark for learning to translate a new language from one grammar book

Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to translate a new language from one grammar book. In Arxiv, 2023

work page 2023

[31] [31]

Qwen2.5 Technical Report

Team Qwen , :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianh...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Aya model: An instruction finetuned open-access multilingual language model

Ahmet \"U st \"u n, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D ' souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model. In Lun-Wei ...

work page doi:10.18653/v1/2024.acl-long.845 2024

[34] [34]

Recent advancements and challenges of T urkic C entral A sian language processing

Yana Veitsman and Mareike Hartmann. Recent advancements and challenges of T urkic C entral A sian language processing. In Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, and Lasitha Uyangodage (eds.), Proceedings of the First Workshop on Language Models for Low-Resource Languages, pp...

work page 2025

[35] [35]

S witch O ut: an efficient data augmentation algorithm for neural machine translation

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. S witch O ut: an efficient data augmentation algorithm for neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 856--861, Brussels, Belgium, October-Novemb...

work page doi:10.18653/v1/d18-1100 2018

[36] [36]

Investigating and scaling up code-switching for multilingual language model pre-training

Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, and Shujian Huang. Investigating and scaling up code-switching for multilingual language model pre-training. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.575 2025

[37] [37]

EDA : Easy data augmentation techniques for boosting performance on text classification tasks

Jason Wei and Kai Zou. EDA : Easy data augmentation techniques for boosting performance on text classification tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP...

work page doi:10.18653/v1/d19-1670 2019

[38] [38]

Liu, and Matt Gardner

Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, pp.\ 94--106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-4413. URL https://...

work page doi:10.18653/v1/w17-4413 2017

[39] [39]

CCN et: Extracting high quality monolingual datasets from web crawl data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzm \'a n, Armand Joulin, and Edouard Grave. CCN et: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Fr \'e d \'e ric B \'e chet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Be...

work page 2020

[40] [40]

Wiktionary: The free dictionary, 2025

Wikimedia Foundation . Wiktionary: The free dictionary, 2025. URL https://www.wiktionary.org. Accessed: 2025

work page 2025

[41] [41]

Wang, Jiwei Li, Daniel L \'e vy, Aiming Nie, Dan Jurafsky, and Andrew Y

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel L \'e vy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as smoothing in neural network language models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1VyHY9gg

work page 2017

[42] [43]

m T 5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. m T 5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Procee...

work page doi:10.18653/v1/2021.naacl-main.41 2021

[43] [44]

Code-switching curriculum learning for multilingual transfer in LLM s

Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in LLM s. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 7816--7836, Vienna, Austria, July 2025. Association for Com...

work page doi:10.18653/v1/2025.findings-acl.407 2025

[44] [45]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. H ella S wag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Llu \'i s M \`a rquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.\ 4791--4800, Florence, Italy, July 2019. Association for Computational...

work page doi:10.18653/v1/p19-1472 2019