Pith Number

pith:3NEXC3OS

pith:2025:3NEXC3OS7AUBW2G66RVX6GXNAK
not attested · not anchored · not stored · refs pending

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Anastasia Stasenko, Carlos Rosas Hinostroza, Catherine Arnett, David Mach, Eliot Krzystof Jones, Irène Girard, Ivan P. Yamshchikov, Mattia Nee, Pavel Chizhov, Pierre-Carl Langlais

Common Corpus assembles roughly two trillion tokens from uncopyrighted or openly licensed sources, presented as the largest open dataset for LLM pre-training.

arxiv:2506.01732 v3 · 2025-06-02 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{3NEXC3OS7AUBW2G66RVX6GXNAK}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Common Corpus is the largest open dataset for LLM pre-training; the assembled data are either uncopyrighted or under open licenses, total about two trillion tokens, and small models trained on it perform comparably to other models of their size, indicating suitability for multilingual pre-training.

C2 · weakest assumption

The curation and filtering process preserves enough quality, diversity, and legal compliance that results from two small models generalize, indicating that the dataset is suitable for large-scale LLM pre-training.

C3 · one-line summary

Common Corpus is a 2-trillion-token open dataset for LLM pre-training compiled from uncopyrighted and openly licensed sources across diverse languages, domains, and code.

Formal links

2 machine-checked theorem links

Cited by

3 papers in Pith

Receipt and verification
First computed: 2026-05-20T00:01:34.372681Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85

Aliases

arxiv: 2506.01732 · arxiv_version: 2506.01732v3 · doi: 10.48550/arxiv.2506.01732 · pith_short_12: 3NEXC3OS7AUB · pith_short_16: 3NEXC3OS7AUBW2G6 · pith_short_8: 3NEXC3OS
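
The identifier and its short aliases appear to be related to the canonical hash by a simple encoding. For this record, Base32-encoding the 32 hash bytes (RFC 4648 alphabet) and keeping the first 26 characters yields the Pith Number, and each pith_short_N alias is its first N characters. The page does not document this derivation, so the sketch below is an observation about this one record, not the builder's specified algorithm; the function names are illustrative.

```python
import base64

# Canonical hash as listed on this record.
CANONICAL_HASH = "db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: this rule is inferred from the values on this page,
    not taken from any published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


pith_number = pith_number_from_hash(CANONICAL_HASH)
print(pith_number)  # 3NEXC3OS7AUBW2G66RVX6GXNAK
print(short_aliases(pith_number))
```

If the relationship holds in general, a client could detect a mismatched identifier locally before fetching the record; that generalization is an assumption.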
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2ad1075618137072bb13d0fcb1772234cfe4e2818596d23de85aa28fca48a779",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-06-02T14:43:15Z",
    "title_canon_sha256": "afd5491f413a573398212bec22c7f48b78553c5917cc12c09eb91b003e254cad"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.01732",
    "kind": "arxiv",
    "version": 3
  }
}
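
The verification command above hashes a canonical serialization of this record: keys sorted, no whitespace around separators, non-ASCII characters left unescaped, then UTF-8 encoded and passed to SHA-256. The following is a minimal sketch of the same steps as reusable functions. The names canonical_bytes and canonical_sha256 are illustrative, and whether the digest matches the published canonical hash depends on the served record being identical to the JSON shown here, which is not verified in this sketch.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, as in the one-liner."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "2ad1075618137072bb13d0fcb1772234cfe4e2818596d23de85aa28fca48a779",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-06-02T14:43:15Z",
        "title_canon_sha256": "afd5491f413a573398212bec22c7f48b78553c5917cc12c09eb91b003e254cad",
    },
    "schema_version": "1.0",
    "source": {"id": "2506.01732", "kind": "arxiv", "version": 3},
}

# Same content with top-level and nested keys in a different order.
shuffled = {
    "source": {"version": 3, "kind": "arxiv", "id": "2506.01732"},
    "schema_version": "1.0",
    "metadata": dict(reversed(list(record["metadata"].items()))),
}

# Sorting keys makes the digest independent of insertion order.
print(canonical_sha256(record) == canonical_sha256(shuffled))  # True
```

Sorting keys and fixing the separators is what lets any mirror recompute the same bytes from the same logical record, regardless of how its JSON library orders or spaces the output.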