How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

Artur Kulmizev; Esther Ploeger; Heather Lent; Jiaming Luo; Johannes Bjerva; Kushal Tatariya; Marcel Bollmann; Miryam de Lhoneux; Wessel Poelman

arxiv: 2411.05527 · v3 · submitted 2024-11-08 · 💻 cs.CL

Pith reviewed 2026-05-23 17:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords: wikipedia · data quality · low-resource languages · multilingual NLP · data filtering · language modeling · bot content · quality ranking

The pith

Filtering Wikipedia data for quality issues matches or improves language model performance, especially in lower-quality editions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a web-text filtering process across every non-English Wikipedia edition and uncovers recurring problems such as script mixing, placeholder articles, and bot-written pages. It groups the editions into a four-level quality scale that lines up with other quality measures. In language modeling tests, models trained on the cleaned portions largely match or outperform models trained on the full raw text, with the clearest improvements in the weaker editions. This matters because Wikipedia serves as a default training source for many multilingual systems, so knowing where the data actually holds up affects how datasets get built. The results point toward treating Wikipedia quality as something that needs active checking rather than assuming it is uniformly high.

Core claim

Subjecting the full non-English Wikipedia to a filtering procedure normally used on noisy web text removes a large share of the data while exposing systematic issues including script and language contamination, repeated template articles, and heavy bot-generated content. These observations are organized into a 4-level quality ranking that tracks closely with other quality signals. Language models trained on the filtered subsets largely match or exceed the performance of models trained on the unfiltered versions, and the advantage is largest for the lower-ranked language editions.
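To make "script contamination" concrete, here is a minimal sketch of one way such a check could work. It is not the paper's filter: the function names, the `max_off_script` threshold of 0.3, and the use of the first word of each Unicode character name as a rough script label are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_profile(text: str) -> Counter:
    """Count alphabetic characters per rough script label.

    The label is the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC'), which is a heuristic, not a true
    Unicode Script property lookup.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def off_script_ratio(text: str, expected_script: str) -> float:
    """Fraction of alphabetic characters outside the expected script."""
    counts = script_profile(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no letters at all, nothing to judge
    return 1.0 - counts[expected_script] / total


def is_contaminated(text: str, expected_script: str,
                    max_off_script: float = 0.3) -> bool:
    """Flag a document when too many letters fall outside the expected script.

    max_off_script is a hypothetical threshold chosen for this sketch.
    """
    return off_script_ratio(text, expected_script) > max_off_script
```

A check like this only catches script mixing. Language contamination within a single script (for example, English text in a Latin-script edition of another language) would need a language identification model instead.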

What carries the argument

The 4-level quality ranking of Wikipedia language editions produced by running web-text filtering across the entire non-English collection.

If this is right

Filtered data yields performance that largely matches or exceeds that of raw Wikipedia in three language modeling setups.
The largest gains appear in lower-quality language editions.
The quality ranking aligns with independent heuristics and can guide data selection.
Bot-generated and template content is concentrated enough to be removed at scale.
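The claim that the ranking aligns with independent heuristics is the kind of statement a rank correlation can test. The sketch below computes Spearman's rho with tie handling, which matters because a 4-level ranking assigns many editions the same tier. The function names are illustrative, and nothing here reflects which statistic, if any, the paper uses.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)
```

Called as `spearman(tiers, heuristic_scores)` with one entry per language edition, a value near 1 or -1 (depending on the direction of each scale) would indicate agreement between the tiers and the alternative measure.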

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same filtering approach could be tested on other widely used multilingual corpora to check whether similar quality patterns appear.
The ranking might serve as a quick proxy for predicting which Wikipedia editions will benefit most from cleaning in tasks beyond language modeling.
Dataset creators could adopt automated quality audits as a standard first step rather than a one-off study.

Load-bearing premise

The argument assumes that a filtering method designed for noisy web pages works equally well on Wikipedia, without discarding useful material or adding new biases.

What would settle it

A controlled experiment showing that language models trained on the filtered Wikipedia data consistently underperform models trained on the raw data across several low-resource language editions would falsify the performance claim.

Original abstract

Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in NLP. However, in recent years, such assumptions of high quality have become the subject of scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web-text -- a process which removes a large percentage of the collection's data. In analysing the removed data, we reveal numerous systematic quality issues, such as script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content. We consolidate these findings into a 4-level quality ranking of Wikipedia, which shows strong correspondence with alternative quality measures and heuristics. Lastly, we evaluate the downstream impact of quality filtering in three practical language modelling scenarios, showing that models trained on filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains observed for lower-quality language editions. Ultimately, our experiments serve as a first step in establishing quality-aware best practices for Wikipedia utilization in NLP, laying groundwork that can inform future dataset creation and curation efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper applies a web-text filtering procedure to the entirety of non-English Wikipedia, analyzes the removed content to document systematic issues such as script/language contamination, template/placeholder articles, and bot-generated content, derives a 4-level quality ranking that correlates with alternative measures, and reports language-model experiments in three scenarios where models trained on the filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains for lower-quality language editions.

Significance. If the central empirical claim holds after controlling for data volume, the work supplies a practical quality ranking and evidence-based guidance on filtering Wikipedia for multilingual and low-resource NLP, backed by transparent analysis of removed data and multi-scenario evaluation. The comprehensive scope across all non-English editions and the validation of the ranking against heuristics are strengths that could inform future dataset curation.

major comments (2)

[Language modelling experiments] (final section): The claim that filtered data 'largely match or outperform' raw Wikipedia, with the largest gains for lower-quality editions, cannot be separated from the effect of reduced corpus size. Filtering removes a large percentage of tokens (abstract and analysis section), yet no size-matched random-subsample control from the raw corpus is reported. Without it, attributing the gains to quality filtering rather than to training dynamics on smaller data remains untested, and that attribution is load-bearing for the main result.
[Analysis of removed data] Quality ranking construction (analysis section): The 4-level ranking is presented as consolidating the identified issues, but the manuscript does not report the relative prevalence or weighting of each issue type (e.g., fraction of removals due to bots vs. templates vs. contamination) or the exact decision thresholds used. This makes it difficult to assess the reproducibility or sensitivity of the ranking.

minor comments (2)

[Abstract] The abstract lacks quantitative anchors (token removal percentages, number of languages, performance deltas, or dataset sizes), which reduces its immediate informativeness even though the full manuscript supplies them.
Figure captions and legends for the quality ranking and LM performance plots should explicitly state the number of languages per quality tier and whether error bars reflect multiple seeds or cross-validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and revise the manuscript accordingly to strengthen the empirical claims and transparency.

Point-by-point responses

Referee: [Language modelling experiments] (final section): The claim that filtered data 'largely match or outperform' raw Wikipedia, with the largest gains for lower-quality editions, cannot be separated from the effect of reduced corpus size. Filtering removes a large percentage of tokens (abstract and analysis section), yet no size-matched random-subsample control from the raw corpus is reported. Without it, attributing the gains to quality filtering rather than to training dynamics on smaller data remains untested, and that attribution is load-bearing for the main result.

Authors: We agree that the absence of a size-matched control leaves open the possibility that performance differences arise from training on smaller data volumes rather than from quality improvements. In the revised manuscript we will add language-modeling runs on randomly subsampled raw Wikipedia corpora whose token counts exactly match those of the filtered versions. These controls will be reported alongside the existing results, allowing direct attribution of gains (especially for lower-quality editions) to the filtering procedure itself.
Revision: yes

Referee: [Analysis of removed data] Quality ranking construction (analysis section): The 4-level ranking is presented as consolidating the identified issues, but the manuscript does not report the relative prevalence or weighting of each issue type (e.g., fraction of removals due to bots vs. templates vs. contamination) or the exact decision thresholds used. This makes it difficult to assess the reproducibility or sensitivity of the ranking.

Authors: We acknowledge that explicit quantification of issue prevalence and precise thresholds would improve reproducibility. The revised manuscript will add a table (or expanded section) reporting the percentage of removed tokens attributable to each category (script/language contamination, template/placeholder articles, and bot-generated content), together with the exact decision rules and numeric thresholds applied at each filtering stage. This will also include a brief sensitivity analysis of the ranking under modest threshold perturbations.
Revision: yes
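The size-matched control discussed in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: `size_matched_subsample` and its parameters are hypothetical names, tokens are counted by whitespace splitting unless a `count_tokens` callable is supplied, and sampling is done at whole-document granularity.

```python
import random


def size_matched_subsample(raw_docs, target_tokens, seed=0, count_tokens=None):
    """Randomly pick raw documents until the token budget is (nearly) filled.

    raw_docs      : list of document strings from the unfiltered corpus
    target_tokens : token count of the filtered corpus to match
    count_tokens  : optional callable doc -> int; defaults to whitespace
                    splitting, an assumption made for this sketch
    Returns (sampled_docs, total_tokens) with total_tokens <= target_tokens.
    """
    if count_tokens is None:
        count_tokens = lambda doc: len(doc.split())
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    order = list(range(len(raw_docs)))
    rng.shuffle(order)

    sample, total = [], 0
    for idx in order:
        n = count_tokens(raw_docs[idx])
        if total + n > target_tokens:
            continue  # this document would overshoot the budget, try the next
        sample.append(raw_docs[idx])
        total += n
        if total == target_tokens:
            break
    return sample, total
```

Because whole documents are kept, the sample can fall slightly short of the target; an exact token match would require truncating one document. Skipping documents that overshoot also tilts the tail of the sample toward shorter articles, so several seeds would be worth comparing before drawing conclusions.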

Circularity Check

0 steps flagged

Empirical audit with no derivation chain or self-referential predictions

Full rationale

This paper is an empirical audit: it applies a web-text filter to Wikipedia editions, inspects removed tokens for contamination patterns, produces a 4-level quality ranking, and reports LM perplexity/performance deltas on filtered vs. raw corpora. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the abstract or described methodology. The reported outcomes are direct experimental measurements, not quantities forced by construction from inputs within the paper. Self-citations, if present, are not load-bearing for any central claim. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The filtering procedure itself likely contains implicit thresholds or rules that function as free parameters, but none are stated.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Factual Inconsistencies in Multilingual Wikipedia Tables
cs.CL 2025-07 unverdicted novelty 4.0

The study introduces a method for detecting and categorizing cross-lingual factual inconsistencies in Wikipedia tables using alignment techniques and metrics on sample data.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

01.AI , Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zo...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

David Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba Alabi, Yanke Mao, Haonan Gao, and En-Shiun Lee. 2024. https://aclanthology.org/2024.eacl-long.14 SIB -200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects . In Proceedings of the 18th Conference of the European Chapter of the Associat...

work page 2024
[5]

Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F

David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, ...

work page doi:10.18653/v1/2022.emnlp-main.298 2022
[6]

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunn...

work page doi:10.18653/v1/2023.ijcnlp-main.10 2023
[7]

Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina Espa \ n a-Bonet. 2020. https://aclanthology.org/2020.lrec-1.335 Massive vs. curated embeddings for low-resourced languages: the case of Y or \`u b \'a and T wi . In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2754--2762, Marseille, France. European L...

work page 2020
[8]

Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow

Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. https://aclanthology.org/2022.coling-1.382 Adapting pre-trained language models to A frican languages via multilingual adaptive fine-tuning . In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336--4349, Gyeongju, Republic of Korea. ...

work page 2022
[9]

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. http://arxiv.org/abs/2402.16827 A survey on data selection for language models

work page arXiv 2024
[10]

Saied Alshahrani, Norah Alshahrani, and Jeanna Matthews. 2023. https://doi.org/10.18653/v1/2023.trustnlp-1.16 DEPTH +: An enhanced depth metric for W ikipedia corpora quality . In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 175--189, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.trustnlp-1.16 2023
[11]

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de Vi \ n aspre, and Aitor Soroa. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.499 Does corpus quality really matter for low-resource languages? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7383--7390, Abu Dhabi, United Arab Emirates. Associa...

work page doi:10.18653/v1/2022.emnlp-main.499 2022
[12]

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. https://doi.org/10.18653/v1/2020.acl-main.421 On the cross-lingual transferability of monolingual representations . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623--4637, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.acl-main.421 2020
[13]

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The belebele benchmark: a parallel reading comprehension dataset in 122 language variants . In Proceedings of the 62nd Annual Meeting of t...

work page doi:10.18653/v1/2024.acl-long.44 2024
[14]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. https://doi.org/10.1162/tacl_a_00317 T y D i QA : A benchmark for information-seeking question answering in typologically diverse languages . Transactions of the Association for Computational Linguistics, 8:454--470

work page doi:10.1162/tacl_a_00317 2020
[15]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. http://arxiv.org/abs/1911.02116 Unsupervised cross-lingual representation learning at scale

work page internal anchor Pith review Pith/arXiv arXiv 2020
[16]

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc

work page 2019
[17]

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. https://doi.org/10.18653/v1/D18-1269 XNLI : Evaluating cross-lingual sentence representations . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475--2485, Brussels, Belgium. Association...

work page doi:10.18653/v1/d18-1269 2018
[18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

work page doi:10.18653/v1/n19-1423 2019
[19]

Cheikh M. Bamba Dione, David Ifeoluwa Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munk...

work page doi:10.18653/v1/2023.acl-long.609 2023
[20]

Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2024. http://arxiv.org/abs/2306.16388 Towards measuring the representation of subj...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. https://openreview.net/forum?id=XPZIaotutsD Deberta: Decoding-enhanced bert with disentangled attention . In International Conference on Learning Representations

work page 2021
[22]

William Held, Camille Harris, Michael Best, and Diyi Yang. 2023. http://arxiv.org/abs/2311.08391 A material lens on coloniality in NLP

work page arXiv 2023
[23]

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. https://doi.org/10.18653/v1/P16-1145 W iki R eading: A novel large-scale language understanding task over W ikipedia . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1:...

work page doi:10.18653/v1/p16-1145 2016
[24]

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.831 How to train BERT with an academic budget . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10644--10652, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

work page doi:10.18653/v1/2021.emnlp-main.831 2021
[25]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...

work page doi:10.18653/v1/p17-1147 2017
[26]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

work page doi:10.18653/v1/2020.acl-main.560 2020
[27]

Amir Hossein Kargaran, Ayyoob Imani, Fran c ois Yvon, and Hinrich Schuetze. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.410 G lot LID : Language identification for low-resource languages . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6155--6218, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-emnlp.410 2023
[29]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyonga...

work page doi:10.1162/tacl_a_00447 2022
[30]

Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
[31]

Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat

Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. http://arxiv.org/abs/2309.04662 Madlad-400: A multilingual and document-level large audited dataset

work page arXiv 2023
[32]

Guillaume Lample and Alexis Conneau. 2019. https://api.semanticscholar.org/CorpusID:58981712 Cross-lingual language model pretraining . ArXiv, abs/1901.07291

work page internal anchor Pith review Pith/arXiv arXiv 2019
[33]

Jens Lehmann, Dhananjay Bhandiwad, Preetam Gattogi, and Sahar Vahdati. 2024. https://doi.org/10.1162/tacl_a_00671 Beyond boundaries: A human-like approach for question answering over structured and unstructured information sources . Transactions of the Association for Computational Linguistics, 12:786--802

work page doi:10.1162/tacl_a_00671 2024
[34]

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, and Johannes Bjerva. 2024. http://arxiv.o...

work page arXiv 2024
[35]

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. https://doi.org/10.18653/v1/2020.acl-main.653 MLQA : Evaluating cross-lingual extractive question answering . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315--7330, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.acl-main.653 2020
[36]

Constantine Lignos, Nolan Holley, Chester Palen-Michel, and Jonne S \"a lev \"a . 2022. https://doi.org/10.18653/v1/2022.findings-acl.44 Toward more meaningful resources for lower-resourced languages . In Findings of the Association for Computational Linguistics: ACL 2022, pages 523--532, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-acl.44 2022
[37]

Constantine Lignos, Maya Kruse, and Andrew Rueda. 2023. https://doi.org/10.18653/v1/2023.nlposs-1.17 Improving NER research workflows with S eq S core . In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 147--152, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.nlposs-1.17 2023
[38]

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. 2024. https://doi.org/10.18653/v1/2024.naacl-long.179 A pretrainer ' s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity . In Proceedings of the 2024 Co...

work page doi:10.18653/v1/2024.naacl-long.179 2024
[39]

Max Marion, Ahmet \"U st \"u n, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. 2023. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564

work page arXiv 2023
[40]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843

work page internal anchor Pith review Pith/arXiv arXiv 2016
[41]

Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Destaw Belay, Wendimu Messell...

work page doi:10.18653/v1/2023.emnlp-main.862 2023
[42]

Rossi, and Thien Huu Nguyen

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. https://aclanthology.org/2024.lrec-main.377 C ultura X : A cleaned, enormous, and multilingual dataset for large language models in 167 languages . In Proceedings of the 2024 Joint International Conference on Computationa...

work page 2024
[43]

Gabriel Nicholas and Aliya Bhatia. 2023. http://arxiv.org/abs/2306.07377 Lost in translation: Large language models in non-english content analysis

work page arXiv 2023
[44]

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljube s i \'c , Miquel Espl \`a -Gomis, Gema Ram \' rez-S \'a nchez, and Antonio Toral. 2024. https://aclanthology.org/2024.lrec-main.465 Do language models care about text quality? evaluating web-crawled corpora across 11 languages . In Proceedings of the 2024 Joint International Conference on Computationa...

work page 2024
[45]

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. https://doi.org/10.18653/v1/P17-1178 Cross-lingual name tagging and linking for 282 languages . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946--1958, Vancouver, Canada. Association for Com...

work page doi:10.18653/v1/p17-1178 2017
[46]

Guilherme Penedo, Hynek Kydl \' c ek, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811--30849

work page 2024
[47]

Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. https://doi.org/10.18653/v1/2022.naacl-main.255 Lifting the curse of multilinguality by pre-training modular transformers . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

work page doi:10.18653/v1/2022.naacl-main.255 2022
[48]

Jonas Pfeiffer, Ivan Vuli \'c , Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X : A n A dapter- B ased F ramework for M ulti- T ask C ross- L ingual T ransfer . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...

work page doi:10.18653/v1/2020.emnlp-main.617 2020
[49]

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1--67

work page 2020
[51]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. http://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. https://doi.org/10.18653/v1/P19-1015 Massively multilingual transfer for NER . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151--164, Florence, Italy. Association for Computational Linguistics

work page doi:10.18653/v1/p19-1015 2019
[53]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQ u AD : 100,000+ questions for machine comprehension of text . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392, Austin, Texas. Association for Computational Linguistics

work page doi:10.18653/v1/d16-1264 2016
[54]

David Samuel, Andrey Kutuzov, Lilja vrelid, and Erik Velldal. 2023. https://doi.org/10.18653/v1/2023.findings-eacl.146 Trained on 100 million words and still in shape: BERT meets B ritish N ational C orpus . In Findings of the Association for Computational Linguistics: EACL 2023, pages 1954--1974, Dubrovnik, Croatia. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-eacl.146 2023
[55]

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2021. https://doi.org/10.18653/v1/2021.eacl-main.115 WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1...

[56]

Kushal Tatariya, Heather Lent, and Miryam de Lhoneux. 2023. https://doi.org/10.18653/v1/2023.wassa-1.32 Transfer learning for code-mixed data: Do pretraining languages matter? In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis , pages 365--378, Toronto, Canada. Association for Computational ...

[57]

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. 2023. http://arxiv.org/abs/2308.12284 D4: Improving llm pretraining via document de-duplication and diversification

[58]

Together Computer. 2023. https://github.com/togethercomputer/RedPajama-Data Redpajama: an open dataset for training large language models

[59]

Denny Vrandečić and Markus Krötzsch. 2014. https://doi.org/10.1145/2629489 Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85

[60]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

[61]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/W18-5446 GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353--355, Brussels, Be...

[62]

Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, and Ryan Cotterell, editors. 2023. https://aclanthology.org/2023.conll-babylm.0 Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning . Association for Compu...

[63]

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. https://aclanthology.org/2020.lrec-1.494 CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003--4012, Marseille, F...

[64]

Kyle Wilson. 2019. https://www.theverge.com/2019/5/8/18526739/wikipedia-translation-tool-machine-learning-ai-english Wikipedia has a Google Translate problem

[65]

George Kingsley Zipf. 1949. https://pure.mpg.de/rest/items/item_2407822_4/component/file_2562959/content Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press, Cambridge, Massachusetts

[66]

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. https://doi.org/10.18653/v1/W17-2512 Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora . In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60--67, Vancouver, Canada. Association for Computational Linguistics



[3] [3]

01.AI , Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zo...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

David Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba Alabi, Yanke Mao, Haonan Gao, and En-Shiun Lee. 2024. https://aclanthology.org/2024.eacl-long.14 SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. In Proceedings of the 18th Conference of the European Chapter of the Associat...


[5] [5]

David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, ...

work page doi:10.18653/v1/2022.emnlp-main.298 2022

[6] [6]

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunn...

work page doi:10.18653/v1/2023.ijcnlp-main.10 2023

[7] [7]

Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. 2020. https://aclanthology.org/2020.lrec-1.335 Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2754--2762, Marseille, France. European L...


[8] [8]

Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. https://aclanthology.org/2022.coling-1.382 Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336--4349, Gyeongju, Republic of Korea. ...


[9] [9]

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. http://arxiv.org/abs/2402.16827 A survey on data selection for language models


[10] [10]

Saied Alshahrani, Norah Alshahrani, and Jeanna Matthews. 2023. https://doi.org/10.18653/v1/2023.trustnlp-1.16 DEPTH+: An enhanced depth metric for Wikipedia corpora quality. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 175--189, Toronto, Canada. Association for Computational Linguistics


[11] [11]

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, and Aitor Soroa. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.499 Does corpus quality really matter for low-resource languages? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7383--7390, Abu Dhabi, United Arab Emirates. Associa...


[12] [12]

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. https://doi.org/10.18653/v1/2020.acl-main.421 On the cross-lingual transferability of monolingual representations . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623--4637, Online. Association for Computational Linguistics


[13] [13]

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The belebele benchmark: a parallel reading comprehension dataset in 122 language variants . In Proceedings of the 62nd Annual Meeting of t...


[14] [14]

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. https://doi.org/10.1162/tacl_a_00317 TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454--470


[15] [15]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. http://arxiv.org/abs/1911.02116 Unsupervised cross-lingual representation learning at scale


[16] [16]

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc


[17] [17]

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. https://doi.org/10.18653/v1/D18-1269 XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475--2485, Brussels, Belgium. Association...


[18] [18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...


[19] [19]

Cheikh M. Bamba Dione, David Ifeoluwa Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munk...

work page doi:10.18653/v1/2023.acl-long.609 2023

[20] [20]

Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2024. http://arxiv.org/abs/2306.16388 Towards measuring the representation of subj...


[21] [21]

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. https://openreview.net/forum?id=XPZIaotutsD Deberta: Decoding-enhanced bert with disentangled attention . In International Conference on Learning Representations


[22] [22]

William Held, Camille Harris, Michael Best, and Diyi Yang. 2023. http://arxiv.org/abs/2311.08391 A material lens on coloniality in NLP


[23] [23]

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. https://doi.org/10.18653/v1/P16-1145 WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1:...


[24] [24]

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.831 How to train BERT with an academic budget . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10644--10652, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics


[25] [25]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...


[26] [26]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...


[27] [27]

Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schuetze. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.410 GlotLID: Language identification for low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6155--6218, Singapore. Association for Computational Linguistics


[28] [29]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyonga...

work page doi:10.1162/tacl_a_00447 2022

[29] [30]

Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...


[30] [31]

Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. http://arxiv.org/abs/2309.04662 Madlad-400: A multilingual and document-level large audited dataset


[31] [32]

Guillaume Lample and Alexis Conneau. 2019. https://api.semanticscholar.org/CorpusID:58981712 Cross-lingual language model pretraining . ArXiv, abs/1901.07291


[32] [33]

Jens Lehmann, Dhananjay Bhandiwad, Preetam Gattogi, and Sahar Vahdati. 2024. https://doi.org/10.1162/tacl_a_00671 Beyond boundaries: A human-like approach for question answering over structured and unstructured information sources . Transactions of the Association for Computational Linguistics, 12:786--802


[33] [34]

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, and Johannes Bjerva. 2024. http://arxiv.o...


[34] [35]

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. https://doi.org/10.18653/v1/2020.acl-main.653 MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315--7330, Online. Association for Computational Linguistics


[35] [36]

Constantine Lignos, Nolan Holley, Chester Palen-Michel, and Jonne Sälevä. 2022. https://doi.org/10.18653/v1/2022.findings-acl.44 Toward more meaningful resources for lower-resourced languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 523--532, Dublin, Ireland. Association for Computational Linguistics


[36] [37]

Constantine Lignos, Maya Kruse, and Andrew Rueda. 2023. https://doi.org/10.18653/v1/2023.nlposs-1.17 Improving NER research workflows with SeqScore. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 147--152, Singapore. Association for Computational Linguistics


[37] [38]

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. 2024. https://doi.org/10.18653/v1/2024.naacl-long.179 A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. In Proceedings of the 2024 Co...


[38] [39]

Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. 2023. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564


[39] [40]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843


[40] [41]

Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Destaw Belay, Wendimu Messell...

work page doi:10.18653/v1/2023.emnlp-main.862 2023

[41] [42]

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. https://aclanthology.org/2024.lrec-main.377 CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computationa...


[42] [43]

Gabriel Nicholas and Aliya Bhatia. 2023. http://arxiv.org/abs/2306.07377 Lost in translation: Large language models in non-english content analysis


[43] [44]

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, and Antonio Toral. 2024. https://aclanthology.org/2024.lrec-main.465 Do language models care about text quality? evaluating web-crawled corpora across 11 languages. In Proceedings of the 2024 Joint International Conference on Computationa...


[44] [45]

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. https://doi.org/10.18653/v1/P17-1178 Cross-lingual name tagging and linking for 282 languages . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946--1958, Vancouver, Canada. Association for Com...


[45] [46]

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811--30849


[46] [47]

Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. https://doi.org/10.18653/v1/2022.naacl-main.255 Lifting the curse of multilinguality by pre-training modular transformers . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...


[47] [48]

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...

