Documenting large webtext corpora: A case study on the colossal clean crawled corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner · 2021

1 Pith paper cites this work. Polarity classification is still indexing.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

citing papers explorer

DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 55
