{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:3NEXC3OS7AUBW2G66RVX6GXNAK","short_pith_number":"pith:3NEXC3OS","schema_version":"1.0","canonical_sha256":"db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85","source":{"kind":"arxiv","id":"2506.01732","version":3},"attestation_state":"computed","paper":{"title":"Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Common Corpus assembles the largest open dataset of roughly two trillion tokens from uncopyrighted or openly licensed sources for LLM pre-training.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Anastasia Stasenko, Carlos Rosas Hinostroza, Catherine Arnett, David Mach, Eliot Krzystof Jones, Ir\\`ene Girard, Ivan P. Yamshchikov, Mattia Nee, Pavel Chizhov, Pierre-Carl Langlais","submitted_at":"2025-06-02T14:43:15Z","abstract_excerpt":"Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The da"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.01732","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-06-02T14:43:15Z","cross_cats_sorted":[],"title_canon_sha256":"afd5491f413a573398212bec22c7f48b78553c5917cc12c09eb91b003e254cad","abstract_canon_sha256":"2ad1075618137072bb13d0fcb1772234cfe4e2818596d23de85aa28fca48a779"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:01:34.373349Z","signature_b64":"Wiytqysys/AVamoAlIXfZtjF7HL/nIsVaQjYYuxiZvmfoDTbGmjhSb78RhBkrKnup/5aZ9lFLFZrRT9xGkrhDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"db49716dd2f8281b68def46b7f1aed029a1496f3caa63a09017173f3f3d00a85","last_reissued_at":"2026-05-20T00:01:34.372681Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:01:34.372681Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Common Corpus assembles the largest open dataset of roughly two trillion tokens from uncopyrighted or openly licensed sources for LLM pre-training.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Anastasia Stasenko, Carlos Rosas Hinostroza, Catherine Arnett, David Mach, Eliot Krzystof Jones, Ir\\`ene Girard, Ivan P. Yamshchikov, Mattia Nee, Pavel Chizhov, Pierre-Carl Langlais","submitted_at":"2025-06-02T14:43:15Z","abstract_excerpt":"Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The da"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Common Corpus is the largest open dataset for LLM pre-training; the assembled data are either uncopyrighted or under open licenses, total about two trillion tokens, and small models trained on it perform comparably to other models of their size, indicating suitability for multilingual pretraining.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The curation and filtering process preserves sufficient quality, diversity, and legal compliance such that performance on two small models generalizes to indicate the dataset is suitable for large-scale LLM pre-training.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Common Corpus is a 2-trillion-token open dataset for LLM pre-training compiled from uncopyrighted and openly licensed sources across diverse languages, domains, and code.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Common Corpus assembles the largest open dataset of roughly two trillion tokens from uncopyrighted or openly licensed sources for LLM pre-training.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"280400731004efd0463821c8b983324fee2fe3162fef7790a570e85ea6fbb2d7"},"source":{"id":"2506.01732","kind":"arxiv","version":3},"verdict":{"id":"512b5f89-6149-4d2f-9270-c5d67fec6c57","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T10:51:44.515334Z","strongest_claim":"Common Corpus is the largest open dataset for LLM pre-training; the assembled data are either uncopyrighted or under open licenses, total about two trillion tokens, and small models trained on it perform comparably to other models of their size, indicating suitability for multilingual pretraining.","one_line_summary":"Common Corpus is a 2-trillion-token open dataset for LLM pre-training compiled from uncopyrighted and openly licensed sources across diverse languages, domains, and code.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The curation and filtering process preserves sufficient quality, diversity, and legal compliance such that performance on two small models generalizes to indicate the dataset is suitable for large-scale LLM pre-training.","pith_extraction_headline":"Common Corpus assembles the largest open dataset of roughly two trillion tokens from uncopyrighted or openly licensed sources for LLM pre-training."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2506.01732/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"09b8c5b65a7f5c2ef116c538b1845928f082e94823cbb4d89a0535491ae55fbc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.01732","created_at":"2026-05-20T00:01:34.372796+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.01732v3","created_at":"2026-05-20T00:01:34.372796+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.01732","created_at":"2026-05-20T00:01:34.372796+00:00"},{"alias_kind":"pith_short_12","alias_value":"3NEXC3OS7AUB","created_at":"2026-05-20T00:01:34.372796+00:00"},{"alias_kind":"pith_short_16","alias_value":"3NEXC3OS7AUBW2G6","created_at":"2026-05-20T00:01:34.372796+00:00"},{"alias_kind":"pith_short_8","alias_value":"3NEXC3OS","created_at":"2026-05-20T00:01:34.372796+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":3,"internal_anchor_count":3,"sample":[{"citing_arxiv_id":"2506.17185","citing_title":"A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15460","citing_title":"The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20738","citing_title":"RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering","ref_index":14,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK","json":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK.json","graph_json":"https://pith.science/api/pith-number/3NEXC3OS7AUBW2G66RVX6GXNAK/graph.json","events_json":"https://pith.science/api/pith-number/3NEXC3OS7AUBW2G66RVX6GXNAK/events.json","paper":"https://pith.science/paper/3NEXC3OS"},"agent_actions":{"view_html":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK","download_json":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK.json","view_paper":"https://pith.science/paper/3NEXC3OS","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.01732&json=true","fetch_graph":"https://pith.science/api/pith-number/3NEXC3OS7AUBW2G66RVX6GXNAK/graph.json","fetch_events":"https://pith.science/api/pith-number/3NEXC3OS7AUBW2G66RVX6GXNAK/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK/action/storage_attestation","attest_author":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK/action/author_attestation","sign_citation":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK/action/citation_signature","submit_replication":"https://pith.science/pith/3NEXC3OS7AUBW2G66RVX6GXNAK/action/replication_record"}},"created_at":"2026-05-20T00:01:34.372796+00:00","updated_at":"2026-05-20T00:01:34.372796+00:00"}