Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Anastasia Stasenko; Carlos Rosas Hinostroza; Catherine Arnett; David Mach; Eliot Krzystof Jones; Irène Girard; Ivan P. Yamshchikov; Mattia Nee; Pavel Chizhov; Pierre-Carl Langlais

arxiv: 2506.01732 · v3 · pith:3NEXC3OSnew · submitted 2025-06-02 · 💻 cs.CL

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

Pith reviewed 2026-05-22 01:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords: open dataset · LLM pre-training · ethical data · multilingual corpus · code data · two trillion tokens · data curation

The pith

Common Corpus compiles about two trillion tokens of uncopyrighted or open-licensed data for LLM pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Common Corpus as the largest open dataset assembled for pre-training large language models. This collection reaches approximately two trillion tokens by drawing exclusively from sources that are either uncopyrighted or released under open licenses. It spans many languages, including some low-resource ones, and incorporates a substantial volume of code data from various domains and time periods. Sympathetic readers would value this because it addresses legal and ethical concerns around using copyrighted material in model training, potentially allowing more open and compliant development of LLMs. Small models trained on the dataset achieve performance levels similar to those trained on other collections of comparable size.

Core claim

The central claim is that Common Corpus represents the largest open dataset for LLM pre-training, with roughly two trillion tokens sourced from uncopyrighted or openly licensed materials. The dataset features diversity in languages from high-resource European to low-resource ones, plus extensive code content. Detailed information on data provenance, filtering, and curation is provided, and training experiments with two small language models confirm that performance matches that of similar-sized models trained elsewhere, suggesting suitability for multilingual pretraining.

What carries the argument

The assembly process that gathers and filters data from multiple open sources to produce a large, diverse, and legally compliant pre-training corpus.

If this is right

Researchers gain access to a massive scale of open data for pre-training without copyright restrictions.
Models can be developed in ways that comply with data security regulations.
Training becomes possible across a wide range of languages and includes code capabilities.
New opportunities arise for research and applications in varied knowledge domains.
Reproducible experiments in LLM training are supported by the dataset's public nature.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could encourage more academic groups to experiment with large-scale pre-training using fully open resources.
Comparisons between models trained on this corpus and those using mixed-license data might reveal effects of data licensing on model behavior.
Extensions might involve adding more recent data sources while maintaining open license compliance.

Load-bearing premise

The sources chosen for inclusion must actually be free of copyright claims and the filtering process must keep enough high-quality material to support effective LLM pre-training.

What would settle it

A direct comparison showing that models trained on Common Corpus consistently underperform models trained on similar volumes of data from other sources on standard language modeling benchmarks would indicate the dataset may not be suitable.

Original abstract

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Common Corpus puts out a 2T-token open dataset with decent provenance details, but the ethical claims rest on unverified license assumptions at scale.

The main thing to know is that this paper releases Common Corpus, a roughly two-trillion-token collection drawn from sources presented as uncopyrighted or under open licenses, spanning multiple languages and a large code portion. They document the sources, domains, time periods, and the filtering steps used to build it. Training two small models on the data and reporting performance comparable to other models of similar size gives at least a basic signal that the corpus supports pre-training without obvious quality collapse. That part is straightforward and useful for anyone tracking open data options.

The soft spot is the compliance argument. The paper relies on source-level selection and automated filters to claim everything meets the ethical standard, but it does not report sampled legal checks, quantitative compliance rates, or an independent audit. At this volume, especially with web-crawled material, that leaves room for non-negligible violations that would undercut the central claim. The model results also stay high-level without detailed baselines or error analysis in the sections I checked.

This is mainly for groups building or studying open LLMs who need large-scale alternatives to copyrighted corpora. A reader focused on dataset curation would get practical value from the provenance and filtering descriptions. It deserves peer review because the release itself is substantial and the topic matters, even if the verification side needs more scrutiny from referees.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Common Corpus, claimed to be the largest open dataset for LLM pre-training, consisting of approximately two trillion tokens of uncopyrighted or openly licensed data. It covers diverse languages (high-resource European to low-resource) and substantial code data from web, code, and multilingual sources. The authors detail the provenance, filtering, and curation processes, and report that two small language models trained on the dataset perform comparably to other models of similar size, indicating suitability for multilingual pre-training.

Significance. If the license compliance and content quality claims hold, this dataset would be a valuable contribution to open LLM research by enabling large-scale pre-training without copyright concerns. The multilingual and code coverage addresses gaps in existing resources and supports broader research and entrepreneurial applications. The small-model training provides initial usability evidence, though more detailed quantitative validation would increase its utility for the field.

major comments (2)

[Abstract and provenance/filtering sections] The central claim that the full ~2T-token corpus consists exclusively of uncopyrighted or properly licensed material rests on source-level selection and automated filters without any reported quantitative compliance audit, sampled verification rate, or error analysis. This is load-bearing for the 'ethical' and 'largest open' designation (see Abstract and the provenance/filtering description).
[Model training and evaluation section] The suitability claim based on training two small models is stated without specific metrics, baselines, or error analysis (e.g., no perplexity scores, downstream task results, or comparison models named). This weakens the evidence that the curation preserves high-quality content for effective pre-training.
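
One way to report the sampled verification rate requested in the first major comment is a binomial confidence interval over manually reviewed documents. The sketch below is illustrative only: the function name and the per-category counts are hypothetical placeholders, not figures from the paper, and the Wilson score interval is a method chosen here for illustration, not one the authors describe.

```python
import math


def wilson_interval(compliant: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a compliance rate.

    compliant: documents judged license-compliant by manual review
    sampled:   total documents manually reviewed
    z:         normal quantile (1.96 gives roughly 95% coverage)
    """
    if sampled <= 0 or not 0 <= compliant <= sampled:
        raise ValueError("need 0 <= compliant <= sampled and sampled > 0")
    p = compliant / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit counts per source category (placeholders, not paper data).
audit = {"web": (480, 500), "code": (495, 500), "multilingual": (470, 500)}

for category, (ok, n) in audit.items():
    low, high = wilson_interval(ok, n)
    print(f"{category}: {ok / n:.3f} observed, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting an interval rather than a bare rate makes explicit how much a manual sample of a few hundred documents can say about each source category.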

minor comments (2)

[Dataset description] Provide a clearer token-count breakdown by source category (web, code, multilingual) and time period to improve transparency and reproducibility.
[References and methods] Add explicit citations for all data sources and filtering tools mentioned to support independent verification.
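
The token-count breakdown requested in the first minor comment could be produced with a simple aggregation over per-document records. This is a minimal sketch under stated assumptions: the `(category, token_count)` record layout and the function name are hypothetical, and the counts are made-up placeholders rather than statistics from the paper.

```python
from collections import defaultdict


def token_breakdown(records):
    """Aggregate token counts per source category.

    records: iterable of (category, token_count) pairs (assumed layout).
    Returns {category: (total_tokens, share_of_corpus)}.
    """
    totals = defaultdict(int)
    for category, tokens in records:
        totals[category] += tokens
    grand_total = sum(totals.values())
    return {
        category: (count, count / grand_total if grand_total else 0.0)
        for category, count in totals.items()
    }


# Placeholder records for illustration only.
records = [("web", 3), ("code", 1), ("web", 4)]

for category, (count, share) in token_breakdown(records).items():
    print(f"{category}: {count} tokens ({share:.1%})")
```

The same aggregation can be keyed on time period instead of source category by changing what the first element of each record holds.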

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our claims regarding license compliance and model evaluation. We address each major comment below and will incorporate revisions to improve clarity and evidence.

Referee: [Abstract and provenance/filtering sections] The central claim that the full ~2T-token corpus consists exclusively of uncopyrighted or properly licensed material rests on source-level selection and automated filters without any reported quantitative compliance audit, sampled verification rate, or error analysis. This is load-bearing for the 'ethical' and 'largest open' designation (see Abstract and the provenance/filtering description).

Authors: We appreciate the referee identifying this as a load-bearing claim. The Common Corpus was assembled exclusively from sources that are either in the public domain or released under open licenses, with full provenance and filtering details provided in the manuscript. We acknowledge that the original submission did not include a quantitative compliance audit, sampled verification rate, or formal error analysis. In the revised manuscript, we will add a dedicated subsection on compliance verification. This will describe our sampling approach for manual review of documents from each source category, report the verification rate and any discrepancies found, and include an error analysis of the automated filters. These additions will provide more transparent support for the ethical and open designation of the dataset.
Revision: yes

Referee: [Model training and evaluation section] The suitability claim based on training two small models is stated without specific metrics, baselines, or error analysis (e.g., no perplexity scores, downstream task results, or comparison models named). This weakens the evidence that the curation preserves high-quality content for effective pre-training.

Authors: We agree that greater specificity in the evaluation would strengthen the evidence. The manuscript reports that two small models trained on Common Corpus perform comparably to other models of similar size, but does not detail the metrics or baselines. In the revised version, we will expand the model training and evaluation section to include concrete quantitative results such as perplexity scores on held-out validation data, performance on downstream tasks, and direct comparisons to named models of equivalent scale. We will also add an error analysis discussing any observed quality issues and how the curation steps addressed them. This will better substantiate the suitability of the dataset for multilingual pre-training.
Revision: yes
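
For reference, the held-out perplexity mentioned in the second response is the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available from a model; the function name and the sample values are hypothetical and do not come from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)) over the N held-out tokens.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among four options.
sample = [math.log(0.25)] * 4
print(perplexity(sample))
```

Perplexities are only comparable between models that share a tokenizer and a held-out set, which is one reason the referee asks for named comparison models.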

Circularity Check

0 steps flagged

No circularity: dataset assembly paper with no derivations or self-referential predictions

This is a data curation and assembly paper whose central contribution is the collection, filtering, and licensing verification of approximately two trillion tokens from openly licensed or uncopyrighted sources. The abstract and described sections detail provenance, curation steps, multilingual and code coverage, and small-scale model training to show suitability. No equations, first-principles derivations, fitted parameters, or predictions appear that could reduce to the inputs by construction. Claims rest on empirical description of sources and external benchmarks rather than self-citation chains or definitional loops. The work is self-contained as an empirical contribution; license verification concerns are matters of correctness and auditability, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that chosen sources meet open-license or public-domain criteria and that curation does not destroy utility.

axioms (1)

Domain assumption: Data sources are either uncopyrighted or released under open licenses that permit LLM pre-training use.
This premise underpins the claim of ethical compliance and is invoked when describing data assembly.

pith-pipeline@v0.9.0 · 5794 in / 1162 out tokens · 49035 ms · 2026-05-22T01:13:05.185346+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear

The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens... detailed provenance of data assembling and the details of dataset filtering and curation.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

We developed a multilingual toxicity classifier, Celadon... OCR correction... PII Removal

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
cs.CL 2026-04 unverdicted novelty 8.0

RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.

A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
cs.CR 2025-06 unverdicted novelty 6.0

An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available...

The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings
cs.HC 2026-04 unverdicted novelty 4.0

Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.