pith:3NEXC3OS
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Common Corpus assembles the largest open dataset for LLM pre-training: roughly two trillion tokens drawn from uncopyrighted or openly licensed sources.
arxiv:2506.01732 v3 · 2025-06-02 · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{3NEXC3OS7AUBW2G66RVX6GXNAK}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Common Corpus is the largest open dataset for LLM pre-training; the assembled data are either uncopyrighted or under open licenses, total about two trillion tokens, and small models trained on it perform comparably to other models of their size, indicating suitability for multilingual pretraining.
The curation and filtering process preserves enough quality, diversity, and legal compliance that results from two small models indicate the dataset is suitable for large-scale LLM pre-training.
Common Corpus is a 2-trillion-token open dataset for LLM pre-training compiled from uncopyrighted and openly licensed sources across diverse languages, domains, and code.
Receipt and verification
| First computed | 2026-05-20T00:01:34.372681Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "2ad1075618137072bb13d0fcb1772234cfe4e2818596d23de85aa28fca48a779",
"cross_cats_sorted": [],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.CL",
"submitted_at": "2025-06-02T14:43:15Z",
"title_canon_sha256": "afd5491f413a573398212bec22c7f48b78553c5917cc12c09eb91b003e254cad"
},
"schema_version": "1.0",
"source": {
"id": "2506.01732",
"kind": "arxiv",
"version": 3
}
}
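The shell one-liner above can also be sketched offline in Python, as a rough illustration rather than the registry's official tooling. The sketch assumes the canonical form is exactly what the one-liner produces: keys sorted, compact `,` and `:` separators, non-ASCII characters left unescaped, UTF-8 encoded, then SHA-256. The names `RECORD_TEXT`, `EXPECTED_HASH`, `canonical_bytes`, `canonical_hash`, and `verify` are hypothetical helpers introduced here, and the record is copied from the JSON shown above instead of being fetched.

```python
import hashlib
import json

# Canonical record copied from the "Canonical record JSON" section above.
RECORD_TEXT = """
{
  "metadata": {
    "abstract_canon_sha256": "2ad1075618137072bb13d0fcb1772234cfe4e2818596d23de85aa28fca48a779",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-06-02T14:43:15Z",
    "title_canon_sha256": "afd5491f413a573398212bec22c7f48b78553c5917cc12c09eb91b003e254cad"
  },
  "schema_version": "1.0",
  "source": {"id": "2506.01732", "kind": "arxiv", "version": 3}
}
"""

# Hash listed under "Canonical hash" on this page.
EXPECTED_HASH = "db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators (assumed canonical form)."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify(record: dict, expected: str) -> bool:
    """Compare the recomputed digest with the published one."""
    return canonical_hash(record) == expected


if __name__ == "__main__":
    record = json.loads(RECORD_TEXT)
    print(canonical_hash(record))
    print("match" if verify(record, EXPECTED_HASH) else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON you paste in. A mismatch would mean either the record differs from the published one or the canonicalization assumption above does not hold.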