Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
Hash functions are used to obtain a signature for each document: for each hash function, the smallest value is kept from hashing every unique n-gram in the document
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2023 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.