Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Pith reviewed 2026-05-22 01:13 UTC · model grok-4.3
The pith
Common Corpus compiles about two trillion tokens of uncopyrighted or open-licensed data for LLM pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Common Corpus is the largest open dataset for LLM pre-training, with roughly two trillion tokens drawn from uncopyrighted or openly licensed materials. The dataset spans languages from high-resource European ones to low-resource ones, and includes a large amount of code. The paper documents data provenance, filtering, and curation, and reports that two small language models trained on the corpus perform comparably to other models of similar size, which the authors take as evidence of suitability for multilingual pre-training.
What carries the argument
The assembly process that gathers and filters data from multiple open sources to produce a large, diverse, and legally compliant pre-training corpus.
If this is right
- Researchers gain access to a massive scale of open data for pre-training without copyright restrictions.
- Models can be developed in ways that comply with data security regulations.
- Training becomes possible across a wide range of languages and includes code capabilities.
- New opportunities arise for research and applications in varied knowledge domains.
- Reproducible experiments in LLM training are supported by the dataset's public nature.
Where Pith is reading between the lines
- This could encourage more academic groups to experiment with large-scale pre-training using fully open resources.
- Comparisons between models trained on this corpus and those using mixed-license data might reveal effects of data licensing on model behavior.
- Extensions might involve adding more recent data sources while maintaining open license compliance.
Load-bearing premise
The sources chosen for inclusion must actually be free of copyright claims and the filtering process must keep enough high-quality material to support effective LLM pre-training.
What would settle it
A direct comparison showing that models trained on Common Corpus consistently underperform models trained on similar volumes of data from other sources on standard language modeling benchmarks would indicate the dataset may not be suitable.
Original abstract
Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Common Corpus, claimed to be the largest open dataset for LLM pre-training, consisting of approximately two trillion tokens of uncopyrighted or openly licensed data. It covers diverse languages (high-resource European to low-resource) and substantial code data from web, code, and multilingual sources. The authors detail the provenance, filtering, and curation processes, and report that two small language models trained on the dataset perform comparably to other models of similar size, indicating suitability for multilingual pre-training.
Significance. If the license compliance and content quality claims hold, this dataset would be a valuable contribution to open LLM research by enabling large-scale pre-training without copyright concerns. The multilingual and code coverage addresses gaps in existing resources and supports broader research and entrepreneurial applications. The small-model training provides initial usability evidence, though more detailed quantitative validation would increase its utility for the field.
major comments (2)
- [Abstract and provenance/filtering sections] The central claim that the full ~2T-token corpus consists exclusively of uncopyrighted or properly licensed material rests on source-level selection and automated filters without any reported quantitative compliance audit, sampled verification rate, or error analysis. This is load-bearing for the 'ethical' and 'largest open' designation (see Abstract and the provenance/filtering description).
- [Model training and evaluation section] The suitability claim based on training two small models is stated without specific metrics, baselines, or error analysis (e.g., no perplexity scores, downstream task results, or comparison models named). This weakens the evidence that the curation preserves high-quality content for effective pre-training.
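The sampled compliance audit requested in the first major comment could be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names `sample_for_review` and `wilson_interval` are hypothetical, the manual review step is assumed to happen outside the code, and the Wilson score interval is one common choice for bounding a violation rate estimated from a small sample.

```python
import math
import random


def sample_for_review(doc_ids, k, seed=0):
    """Draw up to k document ids uniformly at random for manual license review.

    Hypothetical helper: in a real audit this would be run once per source
    category so that each category gets its own estimate.
    """
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(k, len(doc_ids)))


def wilson_interval(violations, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = violations / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Toy usage: reviewers checked 100 sampled documents and flagged 5.
reviewed = sample_for_review(list(range(10_000)), 100)
low, high = wilson_interval(violations=5, n=len(reviewed))
```

Reporting the interval rather than the raw rate matters when zero violations are found: a clean sample of 100 documents still leaves an upper bound of a few percent, which is a large absolute number of documents at the two-trillion-token scale.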
minor comments (2)
- [Dataset description] Provide a clearer token-count breakdown by source category (web, code, multilingual) and time period to improve transparency and reproducibility.
- [References and methods] Add explicit citations for all data sources and filtering tools mentioned to support independent verification.
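The token-count breakdown requested in the first minor comment amounts to a grouped sum. The sketch below assumes each document record carries hypothetical `category`, `period`, and `n_tokens` fields; these names and the toy numbers are illustrative and do not come from the paper.

```python
from collections import Counter


def token_breakdown(records, keys=("category",)):
    """Sum token counts per group and report each group's share of the total.

    `records` is an iterable of dicts; `keys` names the fields to group by.
    Field names here are assumptions, not the dataset's actual schema.
    """
    totals = Counter()
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += rec["n_tokens"]
    grand_total = sum(totals.values())
    return {
        group: {"tokens": n, "share": n / grand_total if grand_total else 0.0}
        for group, n in totals.items()
    }


# Toy records, not real Common Corpus statistics.
toy_records = [
    {"category": "web", "period": "2000s", "n_tokens": 400},
    {"category": "web", "period": "2010s", "n_tokens": 200},
    {"category": "code", "period": "2010s", "n_tokens": 300},
    {"category": "multilingual", "period": "1900s", "n_tokens": 100},
]
by_category = token_breakdown(toy_records)
by_category_and_period = token_breakdown(toy_records, keys=("category", "period"))
```

Passing `keys=("category", "period")` gives the two-way table the referee asks for without changing the function.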
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our claims regarding license compliance and model evaluation. We address each major comment below and will incorporate revisions to improve clarity and evidence.
Point-by-point responses
- Referee: [Abstract and provenance/filtering sections] The central claim that the full ~2T-token corpus consists exclusively of uncopyrighted or properly licensed material rests on source-level selection and automated filters without any reported quantitative compliance audit, sampled verification rate, or error analysis. This is load-bearing for the 'ethical' and 'largest open' designation (see Abstract and the provenance/filtering description).
  Authors: We appreciate the referee identifying this as a load-bearing claim. The Common Corpus was assembled exclusively from sources that are either in the public domain or released under open licenses, with full provenance and filtering details provided in the manuscript. We acknowledge that the original submission did not include a quantitative compliance audit, sampled verification rate, or formal error analysis. In the revised manuscript, we will add a dedicated subsection on compliance verification. This will describe our sampling approach for manual review of documents from each source category, report the verification rate and any discrepancies found, and include an error analysis of the automated filters. These additions will provide more transparent support for the ethical and open designation of the dataset.
  Revision: yes
- Referee: [Model training and evaluation section] The suitability claim based on training two small models is stated without specific metrics, baselines, or error analysis (e.g., no perplexity scores, downstream task results, or comparison models named). This weakens the evidence that the curation preserves high-quality content for effective pre-training.
  Authors: We agree that greater specificity in the evaluation would strengthen the evidence. The manuscript reports that two small models trained on Common Corpus perform comparably to other models of similar size, but does not detail the metrics or baselines. In the revised version, we will expand the model training and evaluation section to include concrete quantitative results such as perplexity scores on held-out validation data, performance on downstream tasks, and direct comparisons to named models of equivalent scale. We will also add an error analysis discussing any observed quality issues and how the curation steps addressed them. This will better substantiate the suitability of the dataset for multilingual pre-training.
  Revision: yes
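For reference, the held-out perplexity mentioned in the second response is the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities have already been obtained from some model; the function names are hypothetical and no particular evaluation library is implied.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from a flat list of per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def corpus_perplexity(docs_logprobs):
    """Token-weighted perplexity over several documents.

    Pools all tokens before averaging, so long documents count in
    proportion to their length (averaging per-document perplexities
    would weight short documents too heavily).
    """
    pooled = [lp for doc in docs_logprobs for lp in doc]
    return perplexity(pooled)


# Toy check: a model that assigns probability 1/4 to every token
# has perplexity 4 regardless of document boundaries.
toy_docs = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
toy_ppl = corpus_perplexity(toy_docs)
```

Perplexities are only comparable between models that share a tokenizer and validation set, which is one reason the referee's request for named baselines matters.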
Circularity Check
No circularity: dataset assembly paper with no derivations or self-referential predictions
Full rationale
This is a data curation and assembly paper whose central contribution is the collection, filtering, and licensing verification of approximately two trillion tokens from openly licensed or uncopyrighted sources. The abstract and described sections detail provenance, curation steps, multilingual and code coverage, and small-scale model training to show suitability. No equations, first-principles derivations, fitted parameters, or predictions appear that could reduce to the inputs by construction. Claims rest on empirical description of sources and external benchmarks rather than self-citation chains or definitional loops. The work is self-contained as an empirical contribution; license verification concerns are matters of correctness and auditability, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Data sources are either uncopyrighted or released under open licenses that permit LLM pre-training use.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens... detailed provenance of data assembling and the details of dataset filtering and curation."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We developed a multilingual toxicity classifier, Celadon... OCR correction... PII Removal"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
  RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.
- A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
  An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available...
- The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings
  Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.