The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Aaron Gokaslan; A. Feder Cooper; Alon Albalak; Aviya Skowron; Baber Abbasi; Bhavya Kailkhura; Brian Lester; Brian R. Bartoldson; Colin Raffel; Dashiell Stander

arxiv: 2506.05209 · v1 · pith:L7QFMMIInew · submitted 2025-06-05 · 💻 cs.CL · cs.LG

Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray

keywords: text, common, llms, pile, comma, licensed, models, openly

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models perform competitively with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models
cs.SD 2026-07 conditional novelty 6.0

A text-to-procedural-audio system using LLMs to emit controllable categorical configurations, with live crossfading generator and three interchangeable backends for uninterrupted performance.

Security-Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense
cs.CR 2026-06 unverdicted novelty 6.0

Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.