Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
Pith reviewed 2026-05-25 04:08 UTC · model grok-4.3
The pith
Random lexical swaps in English pretraining data improve knowledge transfer to eight low-resource languages without extra training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LINK improves cross-lingual knowledge transfer by performing random word-level lexical substitutions on a portion of the English pretraining corpus at a chosen replacement ratio. Selected English words are swapped with their translations drawn from a bilingual vocabulary, after which the mixed corpus is used for standard pretraining. The method requires no parallel sentences, no translation models, and no additional training stages.
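The substitution step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, whitespace tokenization, the one-translation-per-word dictionary, and the choice to measure the replacement ratio over dictionary-covered words (rather than over all words) are assumptions made here for clarity.

```python
import random


def link_substitute(text, bilingual_vocab, replacement_ratio, rng):
    """Swap a random share of dictionary-covered words for their translations."""
    words = text.split()  # assumption: plain whitespace tokenization
    # Positions whose lowercased form has an entry in the bilingual vocabulary.
    candidates = [i for i, w in enumerate(words) if w.lower() in bilingual_vocab]
    # Assumption: the ratio is taken over translatable words, not all words.
    n_swap = round(replacement_ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        words[i] = bilingual_vocab[words[i].lower()]
    return " ".join(words)


def apply_to_corpus(docs, bilingual_vocab, doc_fraction, replacement_ratio, seed=0):
    """Apply substitutions to roughly `doc_fraction` of the documents."""
    rng = random.Random(seed)
    mixed = []
    for doc in docs:
        if rng.random() < doc_fraction:
            mixed.append(link_substitute(doc, bilingual_vocab, replacement_ratio, rng))
        else:
            mixed.append(doc)
    return mixed
```

With a toy dictionary such as `{"water": "maji", "fire": "moto"}` and a replacement ratio of 1.0, the sentence "Water puts out fire" becomes "maji puts out moto". Words with attached punctuation are not matched in this sketch, which is one way a real pipeline would need more careful tokenization.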
What carries the argument
LINK, the data-level intervention that applies random lexical substitutions from bilingual vocabularies to the high-resource portion of pretraining data.
If this is right
- Downstream task performance in the target language improves across eight languages and five model sizes.
- Training reaches equivalent performance levels up to twice as fast.
- Only a bilingual vocabulary is needed; no parallel data or extra model stages are required.
- The intervention works when target-language data is scarce.
- The method can be applied during ordinary pretraining at negligible added cost.
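One way to make the "twice as fast" claim concrete is to compare how many training tokens each run needs before it first reaches a given score. The sketch below assumes learning curves are available as lists of `(tokens_seen, score)` pairs; the function names and this definition of speedup are illustrative assumptions, and the paper may measure speedup differently.

```python
def tokens_to_reach(curve, target):
    """Return the first training-token count at which the score meets the target."""
    for tokens, score in sorted(curve):
        if score >= target:
            return tokens
    return None  # target never reached on this curve


def speedup(baseline_curve, intervention_curve, target):
    """Ratio of baseline tokens to intervention tokens needed to hit the target."""
    base = tokens_to_reach(baseline_curve, target)
    link = tokens_to_reach(intervention_curve, target)
    if base is None or link is None:
        return None  # undefined when either run never reaches the target
    return base / link
```

Under this definition a value of 2.0 means the run with substitutions reached the target score after half as many tokens as the baseline.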
Where Pith is reading between the lines
- The same substitution trick could be tested on pairs of languages that both have moderate data, not just English to low-resource.
- If the effect holds, multilingual pretraining pipelines could add target-language coverage by extending existing bilingual dictionaries rather than collecting new corpora.
- The result suggests that surface lexical overlap may help models align deeper reasoning structures across languages.
Load-bearing premise
Random word-level swaps using a bilingual vocabulary transfer complex knowledge such as scientific reasoning without adding harmful noise.
What would settle it
A controlled run that uses the same pretraining data and schedule, but with substitutions disabled or replaced by random non-translation words, and that yields identical or better downstream results in the target languages.
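The "random non-translation words" arm of such a control could be built by deliberately mismatching the dictionary, so that substituted tokens still come from the target language but carry no translation signal. The following is a hypothetical sketch of that idea; `mismatched_vocab` is a name introduced here, not something taken from the paper.

```python
import random


def mismatched_vocab(bilingual_vocab, seed=0):
    """Return a control dictionary that pairs each word with another word's translation."""
    if len(bilingual_vocab) < 2:
        raise ValueError("need at least two entries to build a mismatched mapping")
    rng = random.Random(seed)
    keys = list(bilingual_vocab)
    rng.shuffle(keys)
    # Shift by one position in the shuffled order, so no key keeps its own entry.
    return {
        key: bilingual_vocab[keys[(i + 1) % len(keys)]]
        for i, key in enumerate(keys)
    }
```

The control keeps the same set of target-language strings and the same substitution rate, which isolates whether correct word alignment matters. If two source words share an identical translation string, a key can still receive a correct translation by coincidence, so a real control would need to check for that.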
Original abstract
Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target-language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LINK, a data-level intervention for cross-lingual knowledge transfer that randomly substitutes words in portions of the English pretraining corpus with translations drawn from a bilingual vocabulary at a chosen replacement ratio. The method requires no additional training stages, parallel data, or auxiliary models. Evaluation across eight languages and five model sizes is reported to yield notable gains on downstream tasks in the target language together with up to 2x speedup to reach equivalent performance.
Significance. If the empirical results prove robust, the approach would supply a near-zero-cost, training-free mechanism for leveraging high-resource data to improve low-resource language performance on knowledge-intensive tasks, broadening access to effective multilingual models without reliance on scarce parallel resources.
major comments (2)
- Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.
- Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate where revisions will be made.
- Referee: Abstract and method description: the central claim that random word-level lexical substitutions transfer scientific reasoning and commonsense knowledge rests on the unexamined assumption that mixed-language sequences preserve the gradient signals needed for higher-order reasoning; no analysis, ablation on replacement ratio, or coherence metric is supplied to address the risk that single-word swaps disrupt multi-word terms and local syntax.
Authors: We agree that the manuscript does not provide explicit analysis of gradient preservation or coherence after substitutions. The empirical results across eight languages and five model sizes show consistent downstream gains, which indirectly suggests that learning signals remain effective, but this is insufficient. We will add an ablation varying the replacement ratio (e.g., 0%, 5%, 10%, 20%) with corresponding downstream performance curves, and include a simple coherence metric such as the fraction of substituted sentences whose local syntax remains intact (measured via dependency parsing) or a comparison of perplexity on held-out monolingual data. These additions will appear in a new subsection of the method and experiments. revision: yes
- Referee: Evaluation section: the reported performance gains and 2x speedup are stated without baselines, number of runs, statistical tests, controls for data volume or replacement ratio, or discussion of potential confounds, rendering it impossible to determine whether the results support the claimed improvements.
Authors: We acknowledge that the current evaluation lacks several standard controls. The manuscript already evaluates across eight languages and five model sizes with a fixed total token budget, but does not report baselines, run counts, or statistical tests. In revision we will (1) add a no-intervention baseline with identical data volume, (2) report means and standard deviations over three random seeds, (3) include paired t-tests or Wilcoxon tests for significance, (4) explicitly state that replacement ratio is varied while holding total tokens constant, and (5) add a short discussion of potential confounds including vocabulary overlap and the effect of code-switching on attention patterns. These changes will be incorporated into Section 4 and the appendix. revision: yes
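The seed-level reporting and paired test promised in the response above can be illustrated with the Python standard library alone. This is a sketch under stated assumptions: the function names are introduced here, scores are assumed to be aligned by random seed, and only the t statistic is computed (a p-value would need a t table or a library routine such as SciPy's `scipy.stats.ttest_rel`).

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline_scores, treated_scores):
    """t statistic for paired samples; scores must be aligned by seed."""
    if len(baseline_scores) != len(treated_scores):
        raise ValueError("paired samples must have equal length")
    # Per-seed difference between the run with substitutions and the baseline.
    diffs = [t - b for b, t in zip(baseline_scores, treated_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # raises if n < 2; sd == 0 would divide by zero below
    return statistics.mean(diffs) / (sd / math.sqrt(n))
```

With three seeds the test has only two degrees of freedom, so even a large t statistic gives weak evidence; this is worth keeping in mind when reading any significance claim based on so few runs.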
Circularity Check
No circularity: empirical intervention evaluated on external downstream tasks
The paper presents LINK as a data-level intervention that performs random word-level lexical substitutions from bilingual vocabularies during pretraining of high-resource data. Claims of improved knowledge transfer and training speedup are assessed solely via empirical evaluation on downstream tasks across eight languages and five model sizes. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the method's value is not established by construction or internal redefinition but by external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Bilingual vocabularies can be obtained at near-zero cost for virtually any language.