arxiv: 2103.12028 · v4 · pith:KXAFCM2Vnew · submitted 2021-03-22 · 💻 cs.CL · cs.AI

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

classification 💻 cs.CL cs.AI
keywords corpora audit datasets multilingual quality issues language text

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  2. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  3. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  4. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...