
arxiv: 2605.23885 · v1 · pith:CTN4BLK3new · submitted 2026-05-22 · 💻 cs.CL

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Pith reviewed 2026-05-25 04:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords cross-lingual knowledge transfer · lexical intervention · multilingual pretraining · low-resource languages · bilingual vocabulary · data-level intervention · knowledge transfer

The pith

Random lexical swaps in English pretraining data improve knowledge transfer to eight low-resource languages without extra training stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LINK, a method that randomly replaces some English words with their translations from a bilingual vocabulary during the pretraining phase on high-resource data. This intervention is intended to help models acquire target-language versions of knowledge that would otherwise come only from scarce target-language text. The approach needs nothing beyond an existing bilingual word list, which can be built at almost no cost. Experiments across eight languages and five model sizes report gains on downstream tasks plus up to a 2x reduction in the training steps needed to reach a given performance level.

Core claim

LINK improves cross-lingual knowledge transfer by performing random word-level lexical substitutions on a portion of the English pretraining corpus at a chosen replacement ratio. Selected English words are swapped with their translations drawn from a bilingual vocabulary, after which the mixed corpus is used for standard pretraining. The method requires no parallel sentences, no translation models, and no additional training stages.
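The substitution step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: whitespace tokenization, the function names `substitute_words` and `apply_to_corpus`, the uniform choice among multiple translations, and the toy English-to-Spanish vocabulary are all assumptions made for the example.

```python
import random

def substitute_words(text, bilingual_vocab, replacement_ratio, rng):
    """Swap a random subset of words for dictionary translations (assumed design)."""
    words = text.split()  # assumption: simple whitespace tokenization
    # Only words with an entry in the bilingual vocabulary can be swapped.
    candidates = [i for i, w in enumerate(words) if w.lower() in bilingual_vocab]
    # Assumption: the ratio is measured against all words in the text.
    n_swap = min(len(candidates), round(replacement_ratio * len(words)))
    for i in rng.sample(candidates, n_swap):
        words[i] = rng.choice(bilingual_vocab[words[i].lower()])
    return " ".join(words)

def apply_to_corpus(documents, bilingual_vocab, portion, replacement_ratio, seed=0):
    """Apply substitutions to a randomly chosen portion of the documents."""
    rng = random.Random(seed)
    n_mixed = round(portion * len(documents))
    chosen = set(rng.sample(range(len(documents)), n_mixed))
    return [
        substitute_words(doc, bilingual_vocab, replacement_ratio, rng) if i in chosen else doc
        for i, doc in enumerate(documents)
    ]

# Hypothetical toy vocabulary, for illustration only.
vocab = {"water": ["agua"], "boils": ["hierve"], "heat": ["calor"]}
docs = ["Water boils when heat is applied", "Heat moves from hot to cold"]
for line in apply_to_corpus(docs, vocab, portion=0.5, replacement_ratio=0.3):
    print(line)
```

The mixed corpus would then go through ordinary pretraining unchanged. Note that this sketch replaces single words independently, so it can split multi-word terms, which is the risk the referee report raises.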

What carries the argument

LINK, the data-level intervention that applies random lexical substitutions from bilingual vocabularies to the high-resource portion of pretraining data.

If this is right

  • Downstream task performance improves in the target language for eight languages and five model sizes.
  • Training reaches equivalent performance levels up to twice as fast.
  • Only a bilingual vocabulary is needed; no parallel data or extra model stages are required.
  • The intervention works when target-language data is scarce.
  • The method can be applied during ordinary pretraining at negligible added cost.
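One way to read the "twice as fast" claim is as a ratio of training steps needed to reach the same score. The sketch below assumes that definition, which the abstract does not spell out; the helper names and the checkpoint numbers are made up for illustration and are not results from the paper.

```python
def steps_to_reach(curve, target):
    """Return the first training step whose score meets the target, else None."""
    for step, score in curve:
        if score >= target:
            return step
    return None

def speedup(baseline_curve, link_curve):
    """Steps the baseline needs to reach its final score, divided by LINK's steps."""
    target = baseline_curve[-1][1]  # assumption: target is the baseline's final score
    base_steps = steps_to_reach(baseline_curve, target)
    link_steps = steps_to_reach(link_curve, target)
    if link_steps is None:
        return None  # the LINK run never reached the baseline's final score
    return base_steps / link_steps

# Made-up (step, accuracy) checkpoints; not numbers from the paper.
baseline = [(1000, 0.30), (2000, 0.34), (3000, 0.37), (4000, 0.40)]
with_link = [(1000, 0.33), (2000, 0.40), (3000, 0.42), (4000, 0.43)]
print(speedup(baseline, with_link))
```

Because the result depends on checkpoint spacing and on which target score is chosen, a reported speedup is only comparable across runs that share both.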

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same substitution trick could be tested on pairs of languages that both have moderate data, not just English to low-resource.
  • If the effect holds, multilingual pretraining pipelines could add target-language coverage by extending existing bilingual dictionaries rather than collecting new corpora.
  • The result suggests that surface lexical overlap may help models align deeper reasoning structures across languages.

Load-bearing premise

Random word-level swaps drawn from a bilingual vocabulary can transfer complex knowledge, such as scientific reasoning, without adding harmful noise.

What would settle it

A controlled run that uses the same pretraining data and schedule, but with substitutions either disabled or replaced by random non-translation words, and that yields identical or better downstream results in the target languages.
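The "random non-translation words" control could be built as follows. This is a sketch under stated assumptions: the function name `random_word_control`, whitespace tokenization, and the toy vocabulary are hypothetical. It swaps words at the same kind of positions and draws from the same target-language word set, but deliberately avoids the correct translation, so any remaining gain could not come from lexical alignment.

```python
import random

def random_word_control(text, bilingual_vocab, replacement_ratio, rng):
    """Control condition: insert target-language words that are NOT translations."""
    # Pool of all target-language words in the vocabulary (sorted for determinism).
    target_words = sorted({t for ts in bilingual_vocab.values() for t in ts})
    words = text.split()  # assumption: simple whitespace tokenization
    candidates = [i for i, w in enumerate(words) if w.lower() in bilingual_vocab]
    n_swap = min(len(candidates), round(replacement_ratio * len(words)))
    for i in rng.sample(candidates, n_swap):
        true_translations = set(bilingual_vocab[words[i].lower()])
        distractors = [t for t in target_words if t not in true_translations]
        if distractors:  # leave the word unchanged if no distractor exists
            words[i] = rng.choice(distractors)
    return " ".join(words)

# Hypothetical toy vocabulary, for illustration only.
vocab = {"water": ["agua"], "boils": ["hierve"], "heat": ["calor"]}
print(random_word_control("Water boils when heat is applied", vocab, 0.5, random.Random(0)))
```

Training on this corpus with the same token budget and schedule, alongside a run with no substitutions at all, would give the two comparison points the test calls for.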

Original abstract

Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK - a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in high-resource part of pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes LINK, a data-level intervention for cross-lingual knowledge transfer that randomly substitutes words in portions of the English pretraining corpus with translations drawn from a bilingual vocabulary at a chosen replacement ratio. The method requires no additional training stages, parallel data, or auxiliary models. Evaluation across eight languages and five model sizes is reported to yield notable gains on downstream tasks in the target language together with up to 2x speedup to reach equivalent performance.

Significance. If the empirical results prove robust, the approach would supply a near-zero-cost mechanism, requiring no additional training stages, for leveraging high-resource data to improve low-resource language performance on knowledge-intensive tasks, broadening access to effective multilingual models without reliance on scarce parallel resources.

major comments (2)
  1. Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.
  2. Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.

    Authors: We agree that the manuscript does not provide explicit analysis of gradient preservation or coherence after substitutions. The empirical results across eight languages and five model sizes show consistent downstream gains, which indirectly supports that learning signals remain effective, but this is insufficient. We will add an ablation varying the replacement ratio (e.g., 0%, 5%, 10%, 20%) with corresponding downstream performance curves, and include a simple coherence metric such as the fraction of substituted sentences whose local syntax remains intact (measured via dependency parsing) or a comparison of perplexity on held-out monolingual data. These additions will appear in a new subsection of the method and experiments. revision: yes

  2. Referee: Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.

    Authors: We acknowledge that the current evaluation lacks several standard controls. The manuscript already evaluates across eight languages and five model sizes with a fixed total token budget, but does not report baselines, run counts, or statistical tests. In revision we will (1) add a no-intervention baseline with identical data volume, (2) report means and standard deviations over three random seeds, (3) include paired t-tests or Wilcoxon tests for significance, (4) explicitly state that replacement ratio is varied while holding total tokens constant, and (5) add a short discussion of potential confounds including vocabulary overlap and the effect of code-switching on attention patterns. These changes will be incorporated into Section 4 and the appendix. revision: yes
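The seed-level reporting promised in the rebuttal (means, standard deviations, and a paired test) can be sketched with the Python standard library alone. The scores below are made up for illustration, and pairing runs by seed is an assumption; the sketch stops at the t statistic, since turning it into a p-value needs a t-distribution with n - 1 degrees of freedom (for example via `scipy.stats.ttest_rel`, or `scipy.stats.wilcoxon` for the rank-based alternative).

```python
import math
import statistics

def paired_t_statistic(baseline_scores, treated_scores):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [t - b for b, t in zip(baseline_scores, treated_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1

# Made-up target-language accuracies for three seeds; not results from the paper.
baseline = [0.40, 0.42, 0.41]
with_link = [0.43, 0.44, 0.45]

print("baseline  mean/std:", statistics.mean(baseline), statistics.stdev(baseline))
print("with LINK mean/std:", statistics.mean(with_link), statistics.stdev(with_link))
t_stat, dof = paired_t_statistic(baseline, with_link)
print("paired t:", t_stat, "dof:", dof)
```

With only three seeds the test has two degrees of freedom, so even large t values correspond to weak evidence; pooling paired differences across the eight languages would give more statistical power.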

Circularity Check

0 steps flagged

No circularity: empirical intervention evaluated on external downstream tasks

full rationale

The paper presents LINK as a data-level intervention that performs random word-level lexical substitutions from bilingual vocabularies during pretraining of high-resource data. Claims of improved knowledge transfer and training speedup are assessed solely via empirical evaluation on downstream tasks across eight languages and five model sizes. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the method's value is not established by construction or internal redefinition but by external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the stated assumption that bilingual vocabularies are cheaply available; the replacement ratio is mentioned but not quantified or fitted.

axioms (1)
  • domain assumption: Bilingual vocabularies can be obtained at near-zero cost for virtually any language
    Invoked in the abstract to justify the method's practicality without additional resources.

pith-pipeline@v0.9.0 · 5727 in / 1302 out tokens · 42937 ms · 2026-05-25T04:08:56.100299+00:00 · methodology

