How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
Pith reviewed 2026-05-23 17:33 UTC · model grok-4.3
The pith
Filtering Wikipedia data for quality issues matches or improves language model performance, especially in lower-quality editions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Subjecting the full non-English Wikipedia to a filtering procedure normally used on noisy web text removes a large share of the data while exposing systematic issues including script and language contamination, repeated template articles, and heavy bot-generated content. These observations are organized into a 4-level quality ranking that tracks closely with other quality signals. Language models trained on the filtered subsets largely match or exceed the performance of models trained on the unfiltered versions, and the advantage is largest for the lower-ranked language editions.
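To make one of the listed issues concrete, the sketch below shows a simple way script contamination might be flagged. It is not the paper's filter: the function names `script_of` and `foreign_script_ratio` are hypothetical, and taking the first word of a character's Unicode name as its script label is a rough stand-in for a proper script or language identifier.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's Unicode name.

    Assumption: this is only a crude proxy (e.g. "LATIN", "CYRILLIC", "ARABIC").
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def foreign_script_ratio(text: str, expected_script: str) -> float:
    """Fraction of alphabetic characters whose script label differs from expected_script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for ch in letters if script_of(ch) != expected_script)
    return foreign / len(letters)


# Example: 5 Latin letters and 3 Cyrillic letters, measured against Latin.
print(foreign_script_ratio("hello мир", "LATIN"))
```

An article whose ratio exceeds some chosen cutoff could then be set aside for inspection; any such cutoff would be a free parameter, not a value taken from the paper.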
What carries the argument
The 4-level quality ranking of Wikipedia language editions produced by running web-text filtering across the entire non-English collection.
If this is right
- Filtered data yields performance that is at least as good as raw Wikipedia in three language modeling setups.
- The largest gains appear in lower-quality language editions.
- The quality ranking aligns with independent heuristics and can guide data selection.
- Bot-generated and template content is concentrated enough to be removed at scale.
Where Pith is reading between the lines
- The same filtering approach could be tested on other widely used multilingual corpora to check whether similar quality patterns appear.
- The ranking might serve as a quick proxy for predicting which Wikipedia editions will benefit most from cleaning in tasks beyond language modeling.
- Dataset creators could adopt automated quality audits as a standard first step rather than a one-off study.
Load-bearing premise
That a filtering method designed for noisy web pages works equally well on Wikipedia without discarding useful material or adding new biases.
What would settle it
A controlled experiment showing that language models trained on the filtered Wikipedia data consistently underperform models trained on the raw data across several low-resource language editions would falsify the performance claim.
Original abstract
Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in NLP. However, in recent years, such assumptions of high quality have become the subject of scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web-text -- a process which removes a large percentage of the collection's data. In analysing the removed data, we reveal numerous systematic quality issues, such as script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content. We consolidate these findings into a 4-level quality ranking of Wikipedia, which shows strong correspondence with alternative quality measures and heuristics. Lastly, we evaluate the downstream impact of quality filtering in three practical language modelling scenarios, showing that models trained on filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains observed for lower-quality language editions. Ultimately, our experiments serve as a first step in establishing quality-aware best practices for Wikipedia utilization in NLP, laying groundwork that can inform future dataset creation and curation efforts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies a web-text filtering procedure to the entirety of non-English Wikipedia, analyzes the removed content to document systematic issues such as script/language contamination, template/placeholder articles, and bot-generated content, derives a 4-level quality ranking that correlates with alternative measures, and reports language-model experiments in three scenarios where models trained on the filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains for lower-quality language editions.
Significance. If the central empirical claim holds after controlling for data volume, the work supplies a practical quality ranking and evidence-based guidance on filtering Wikipedia for multilingual and low-resource NLP, backed by transparent analysis of removed data and multi-scenario evaluation. The comprehensive scope across all non-English editions and the validation of the ranking against heuristics are strengths that could inform future dataset curation.
major comments (2)
- [Language modelling experiments] Language modelling experiments (final section): the claim that filtered data 'largely match or outperform' raw Wikipedia, with largest gains for lower-quality editions, cannot be isolated from the effect of reduced corpus size. Filtering removes a large percentage of tokens (abstract and analysis section), yet no size-matched random-subsample control from the raw corpus is reported; without it the causal attribution to quality filtering rather than training dynamics on smaller data remains untested and is load-bearing for the main result.
- [Analysis of removed data] Quality ranking construction (analysis section): the 4-level ranking is presented as consolidating the identified issues, but the manuscript does not report the relative prevalence or weighting of each issue type (e.g., fraction of removals due to bots vs. templates vs. contamination) or the exact decision thresholds used, making it difficult to assess reproducibility or sensitivity of the ranking.
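The size-matched control requested in the first major comment could be built roughly as follows. This is a minimal sketch under stated assumptions: `size_matched_subsample` is a hypothetical helper, whitespace splitting stands in for whatever tokenizer the experiments actually use, and documents are drawn in a seeded random order with the last one truncated so the token count meets the budget exactly.

```python
import random
from typing import List, Sequence


def size_matched_subsample(
    raw_docs: Sequence[str], token_budget: int, seed: int = 0
) -> List[str]:
    """Randomly draw raw documents until their token count equals token_budget.

    Assumptions: whitespace tokens approximate the real tokenizer, and the
    final document is truncated so the budget is met exactly.
    """
    rng = random.Random(seed)
    order = list(range(len(raw_docs)))
    rng.shuffle(order)

    sample: List[str] = []
    total = 0
    for idx in order:
        remaining = token_budget - total
        if remaining <= 0:
            break
        # Keep at most `remaining` tokens from this document.
        tokens = raw_docs[idx].split()[:remaining]
        if tokens:
            sample.append(" ".join(tokens))
            total += len(tokens)
    return sample


# Usage: match the raw sample to the token count of a (toy) filtered corpus.
raw = ["a b c", "d e", "f g h i", "j"]
filtered = ["a b c", "j"]
budget = sum(len(doc.split()) for doc in filtered)
control = size_matched_subsample(raw, budget, seed=1)
```

Training on `control` and on the filtered corpus would then compare equal amounts of data, so a remaining gap is easier to attribute to filtering. Running several seeds would show how much the control itself varies.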
minor comments (2)
- [Abstract] Abstract lacks any quantitative anchors (token removal percentages, number of languages, performance deltas, or dataset sizes), which reduces immediate informativeness even though the full manuscript supplies them.
- Figure captions and legends for the quality ranking and LM performance plots should explicitly state the number of languages per quality tier and whether error bars reflect multiple seeds or cross-validation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and revise the manuscript accordingly to strengthen the empirical claims and transparency.
- Referee: [Language modelling experiments] Language modelling experiments (final section): the claim that filtered data 'largely match or outperform' raw Wikipedia, with largest gains for lower-quality editions, cannot be isolated from the effect of reduced corpus size. Filtering removes a large percentage of tokens (abstract and analysis section), yet no size-matched random-subsample control from the raw corpus is reported; without it the causal attribution to quality filtering rather than training dynamics on smaller data remains untested and is load-bearing for the main result.
  Authors: We agree that the absence of a size-matched control leaves open the possibility that performance differences arise from training on smaller data volumes rather than from quality improvements. In the revised manuscript we will add language-modeling runs on randomly subsampled raw Wikipedia corpora whose token counts exactly match those of the filtered versions. These controls will be reported alongside the existing results, allowing direct attribution of gains (especially for lower-quality editions) to the filtering procedure itself.
  revision: yes
- Referee: [Analysis of removed data] Quality ranking construction (analysis section): the 4-level ranking is presented as consolidating the identified issues, but the manuscript does not report the relative prevalence or weighting of each issue type (e.g., fraction of removals due to bots vs. templates vs. contamination) or the exact decision thresholds used, making it difficult to assess reproducibility or sensitivity of the ranking.
  Authors: We acknowledge that explicit quantification of issue prevalence and precise thresholds would improve reproducibility. The revised manuscript will add a table (or expanded section) reporting the percentage of removed tokens attributable to each category (script/language contamination, template/placeholder articles, and bot-generated content), together with the exact decision rules and numeric thresholds applied at each filtering stage. This will also include a brief sensitivity analysis of the ranking under modest threshold perturbations.
  revision: yes
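The per-category breakdown promised in the second response could be tabulated along these lines. This is a sketch, not the authors' procedure: `removal_shares` is a hypothetical helper, the category labels are illustrative, and the token counts in the usage example are made up purely to show the arithmetic.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def removal_shares(removed: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each removal category to its share of all removed tokens.

    `removed` yields (category, n_tokens) pairs, one per removed document.
    """
    totals: Counter = Counter()
    for category, n_tokens in removed:
        totals[category] += n_tokens

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: n / grand_total for category, n in totals.items()}


# Usage with invented counts (not figures from the paper).
log = [("bot", 60), ("template", 30), ("script", 10), ("bot", 100)]
for category, share in sorted(removal_shares(log).items()):
    print(f"{category}: {share:.1%}")
```

Recomputing these shares after nudging each filter threshold would give a first, simple view of how sensitive the breakdown is to those choices.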
Circularity Check
Empirical audit with no derivation chain or self-referential predictions
This paper is an empirical audit: it applies a web-text filter to Wikipedia editions, inspects removed tokens for contamination patterns, produces a 4-level quality ranking, and reports LM perplexity/performance deltas on filtered vs. raw corpora. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the abstract or described methodology. The reported outcomes are direct experimental measurements, not quantities forced by construction from inputs within the paper. Self-citations, if present, are not load-bearing for any central claim. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Forward citations
Cited by 1 Pith paper
- Factual Inconsistencies in Multilingual Wikipedia Tables
  The study introduces a method for detecting and categorizing cross-lingual factual inconsistencies in Wikipedia tables using alignment techniques and metrics on sample data.