pith. machine review for the scientific record.

arxiv: 2306.01116 · v1 · submitted 2023-06-01 · 💻 cs.CL · cs.AI

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Pith reviewed 2026-05-13 20:37 UTC · model grok-4.3

keywords: web data, data filtering, deduplication, language model pretraining, CommonCrawl, RefinedWeb, zero-shot generalization

The pith

Properly filtered and deduplicated web data alone can train powerful language models that outperform those trained on curated corpora like The Pile.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are usually trained on a mix of web data and curated sources such as books or technical papers, because curation is thought to be essential for strong performance. The paper argues that curation is not necessary: web data processed through careful filtering and deduplication produces models that outperform state-of-the-art models trained on The Pile. The authors extract five trillion tokens from CommonCrawl, release a 600 billion token extract of their RefinedWeb dataset, and train 1.3 billion and 7.5 billion parameter models on it. This finding matters because it questions whether curation will remain feasible, or even required, as models grow to need trillions of tokens.

Core claim

At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.

What carries the argument

RefinedWeb, a dataset extracted from CommonCrawl by applying extensive filtering and deduplication to isolate high-quality web text for pretraining.
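To make the "filter, then deduplicate" shape concrete, here is a minimal Python sketch. It is not the authors' pipeline: the heuristic (a minimum word count and a maximum symbol ratio), both thresholds, and the function names are assumptions chosen for illustration, and exact hashing of normalized text stands in for whatever deduplication the paper actually applies.

```python
import hashlib
import re


def passes_quality_heuristics(text, min_words=5, max_symbol_ratio=0.1):
    """Toy document-level filter; both thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_and_deduplicate(documents):
    """Keep documents that pass the filter and were not already seen."""
    seen = set()
    kept = []
    for doc in documents:
        if not passes_quality_heuristics(doc):
            continue
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate up to case/spacing
    "$$$ ### @@@ !!! %%% ^^^ &&& ***",                # mostly symbols
    "Too short.",
]
kept = filter_and_deduplicate(docs)
print(kept)
```

Only the first document survives here: the second is an exact duplicate once normalized, and the last two fail the toy filter. How aggressive such rules are is exactly what determines whether the result still counts as "web data only".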

If this is right

  • Language model pretraining can scale to trillions of tokens without depending on curated high-quality sources.
  • Web data remains abundant after filtering, yielding at least five trillion usable tokens from CommonCrawl.
  • Releasing large filtered web datasets enables direct comparison and further experimentation with web-only training.
  • Performance advantages on zero-shot tasks suggest that processed web text can match or exceed the generalization value of mixed curated corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work may shift emphasis toward improving filtering algorithms rather than acquiring new curated sources.
  • Web-only pipelines could reduce pressure on limited high-quality curated data pools for very large models.
  • The same filtering approach could be applied and tested on other large web archives or for multilingual and code data.

Load-bearing premise

The filtering and deduplication steps, which the abstract does not specify, produce data whose quality and diversity are truly comparable or superior to curated corpora, and any performance difference is not caused by variations in training procedure or evaluation.

What would settle it

Train two models under identical procedures and compute budgets, one on the released RefinedWeb extract and one on The Pile, then compare zero-shot benchmark scores to isolate whether the data source alone accounts for the reported gains.
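A matched comparison of this kind could be checked along the following lines. This is a hedged sketch, not a procedure taken from the paper: the config keys, the helper names, and the choice of a paired percentile bootstrap are assumptions, and the scores are placeholder numbers, not measurements.

```python
import random


def configs_match(config_a, config_b, ignore=("dataset",)):
    """True when two training configs agree on every key except those ignored."""
    keys = (set(config_a) | set(config_b)) - set(ignore)
    return all(config_a.get(k) == config_b.get(k) for k in keys)


def paired_bootstrap_mean_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Mean per-task difference (a - b) with a 95% percentile bootstrap interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same tasks")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample tasks with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), lower, upper


# Placeholder configs and scores for illustration only; not measurements.
cfg_web = {"dataset": "refinedweb", "params": "1.3B", "optimizer": "adamw", "tokens": 1000}
cfg_pile = {"dataset": "pile", "params": "1.3B", "optimizer": "adamw", "tokens": 1000}
matched = configs_match(cfg_web, cfg_pile)

web_scores = [0.52, 0.61, 0.47, 0.70]
pile_scores = [0.50, 0.60, 0.48, 0.66]
mean_diff, lower, upper = paired_bootstrap_mean_diff(web_scores, pile_scores)
print(matched, mean_diff, lower, upper)
```

The config check guards the "identical procedures" condition, and pairing the scores by task keeps task difficulty from masking or inflating the data-source effect. An interval that excludes zero would support attributing the gap to the data.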

Original abstract

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents RefinedWeb, a dataset derived solely from CommonCrawl through filtering and deduplication, claiming that 1.3B and 7.5B parameter language models trained exclusively on this web data significantly outperform state-of-the-art models trained on the curated Pile corpus. The authors extract five trillion tokens from the web, publicly release 600 billion tokens of RefinedWeb along with the trained models, and argue that high-quality web data is plentiful and sufficient for powerful zero-shot generalization without curated sources.

Significance. If the reported gains are shown to arise from data quality under matched training conditions, the result would be significant: it challenges the prevailing view that curated corpora are necessary for performant LLMs and indicates that properly processed web data remains abundant even at trillion-token scales. The public release of both the 600B-token extract and the models constitutes a concrete contribution that enables direct replication and further study.

major comments (2)
  1. [Abstract] The claim that RefinedWeb models 'significantly outperform' SOTA Pile-trained models is load-bearing for the central thesis, yet the abstract supplies no statement that architecture, optimizer, learning-rate schedule, batch size, or total tokens were identical to the cited baselines; without this, performance differences cannot be attributed to data quality.
  2. [Abstract] The filtering and deduplication pipeline is described only at a high level; exact heuristics, classifier thresholds, and deduplication radius are not provided, leaving open whether the extracted 600B tokens are effectively more curated than acknowledged and undermining the 'web data only' framing.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by explicitly naming the exact Pile-trained baseline models and papers used for comparison.
  2. Consider adding a table in the experimental section that lists key training hyperparameters side-by-side for the RefinedWeb and Pile runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the training conditions and data pipeline details. Revisions have been made to the abstract to improve transparency while preserving the manuscript's core claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that RefinedWeb models 'significantly outperform' SOTA Pile-trained models is load-bearing for the central thesis, yet the abstract supplies no statement that architecture, optimizer, learning-rate schedule, batch size, or total tokens were identical to the cited baselines; without this, performance differences cannot be attributed to data quality.

    Authors: We agree that the abstract should explicitly note the matched conditions to allow direct attribution of gains to data quality. Section 3 of the manuscript details that the 1.3B and 7.5B models use identical architecture (decoder-only transformer), optimizer (AdamW), learning-rate schedule, batch size, and total tokens as the Pile baselines. We have revised the abstract to state: 'significantly outperforming models from the state-of-the-art trained on The Pile under identical training conditions.' This change strengthens the central thesis without altering the reported results.

    revision: yes

  2. Referee: [Abstract] The filtering and deduplication pipeline is described only at a high level; exact heuristics, classifier thresholds, and deduplication radius are not provided, leaving open whether the extracted 600B tokens are effectively more curated than acknowledged and undermining the 'web data only' framing.

    Authors: The abstract is intentionally concise, but the full pipeline (including URL filtering heuristics, quality classifier thresholds, and MinHash-based deduplication radius) is specified in Section 2. We have added to the abstract: 'obtained via a multi-stage filtering and deduplication pipeline applied to CommonCrawl.' All data remains exclusively from web sources with no curated additions, preserving the 'web data only' claim. If the referee requires the exact numerical thresholds in the abstract, we can include them in a revised version.

    revision: partial
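
For readers unfamiliar with the MinHash technique named in the rebuttal, the sketch below shows how signature agreement estimates the Jaccard similarity that a deduplication threshold would be applied to. It is a generic illustration, not the paper's configuration: the word 3-gram shingles, the 128 hash functions, and the use of salted `hashlib.blake2b` as the hash family are all assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams from a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # blake2b accepts a salt of up to 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


a = "the cat sat on the mat and looked out of the window"
b = "the cat sat on the mat and looked out of the door"
c = "completely unrelated sentence about training language models on web data"
sig_a, sig_b, sig_c = (minhash_signature(shingles(t)) for t in (a, b, c))
near_dup = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
print(near_dup, unrelated)
```

Texts `a` and `b` share 9 of 11 distinct shingles, so their estimate should land near 0.8, while `a` and `c` share none. A pipeline would drop one document of any pair whose estimate exceeds a chosen threshold, and that threshold is the kind of detail the referee asks the authors to state.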

Circularity Check

0 steps flagged

No circularity in empirical dataset and training comparison

Full rationale

The paper presents an empirical study: it describes filtering and deduplication of CommonCrawl to produce RefinedWeb, trains 1.3B/7.5B models on it, and reports zero-shot performance outperforming Pile-trained baselines. No equations, derivations, or first-principles predictions appear in the provided text. Claims rest on experimental outcomes rather than any reduction of a result to its own inputs by definition, fitted parameters, or self-citation chains. The work is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no methods section or equations available to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5510 in / 1048 out tokens · 36754 ms · 2026-05-13T20:37:51.963076+00:00 · methodology

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

    cs.CL 2026-04 unverdicted novelty 8.0

    AfriVoices-KE is a 3,000-hour multilingual speech dataset for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali with 750 hours scripted and 2,250 hours spontaneous speech from 4,777 speakers.

  2. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  3. MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    cs.CL 2026-05 unverdicted novelty 7.0

    MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

  4. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    cs.AR 2026-03 unverdicted novelty 7.0

    ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  6. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

    cs.AI 2026-05 unverdicted novelty 6.0

    MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

  7. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  8. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  9. MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

    cs.CE 2026-04 unverdicted novelty 6.0

    MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.

  10. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

  11. Spike-driven Large Language Model

    cs.NE 2026-04 unverdicted novelty 6.0

    SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike...

  12. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  13. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  14. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  15. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  16. A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

    cs.CL 2026-05 unverdicted novelty 5.0

    Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.

  17. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  18. Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy

    cs.AI 2026-04 unverdicted novelty 5.0

    LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.

  19. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  20. InternLM2 Technical Report

    cs.CL 2024-03 unverdicted novelty 5.0

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  21. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  22. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  23. Reflections and New Directions for Human-Centered Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.

  24. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  25. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  26. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 26 Pith papers · 13 internal anchors

  1. [1]

    1 – 9, Mannheim,

    Limerick, 12 July 2021 (Online-Event), pp. 1 – 9, Mannheim,

  2. [2]

    doi: 10.14618/ ids-pub-10468

    Leibniz-Institut f ¨ur Deutsche Sprache. doi: 10.14618/ ids-pub-10468. URL https://nbn-resolving. org/urn:nbn:de:bsz:mh39-104688. Abadji, J., Ortiz Suarez, P., Romary, L., and Sagot, B. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints, art. arXiv:2201.06642, January

  3. [3]

    Abbas, A. K. M., Tirumala, K., Simig, D., Ganguli, S., and Morcos, A. S. Semdedup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models,

  4. [4]

    arXiv preprint arXiv:2001.09977 , year=

    Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y ., et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977,

  5. [6]

    Allamanis, M

    URL https://www.aleph-alpha.com/pdf/2023_ 02_AA_Benchmarks_doc.pdf. Allamanis, M. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153,

  6. [7]

    Mathqa: Towards interpretable math word problem solving with operation-based for- malisms

    Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y ., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based for- malisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

  7. [8]

    Prost: Physical reasoning about objects through space and time

    Aroca-Ouellette, S., Paik, C., Roncone, A., and Kann, K. Prost: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4597–4608,

  8. [9]

    Efficient large scale language modeling with mixtures of experts

    Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X. V ., Du, J., Iyer, S., Pasunuru, R., et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684,

  9. [10]

    acl-demo.15

    URL https://aclanthology.org/2021. acl-demo.15. Beltagy, I., Lo, K., and Cohan, A. Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 3615–3620,

  10. [11]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for ana- lyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373,

  11. [12]

    org/10.5281/zenodo.5297715

    URL https: //doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata. Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. Challenges & Perspectives in Creating Large Language Models, pp. 95,

  12. [13]

    Broder, A. Z. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of Sequences 1997, pp. 21–29. IEEE,

  13. [14]

    D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

  14. [15]

    Quantifying Memorization Across Neural Language Models

    Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neu- ral language models. arXiv preprint arXiv:2202.07646,

  15. [16]

    One billion word benchmark for measuring progress in statistical language modeling

    Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005,

  16. [17]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

  17. [18]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  18. [19]

    Bert: Pre-training of deep bidirectional transformers for lan- guage understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for lan- guage understanding. In Proceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long and Short Papers), pp. 4171–4186,

  19. [20]

    Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster

    Dey, N., Gosal, G., Khachane, H., Marshall, W., Pathria, R., Tom, M., Hestness, J., et al. Cerebras-gpt: Open compute- optimal language models trained on the cerebras wafer- scale cluster. arXiv preprint arXiv:2304.03208,

  20. [21]

    Docu- menting large webtext corpora: A case study on the colos- sal clean crawled corpus

    Dodge, J., Sap, M., Marasovi´c, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Docu- menting large webtext corpora: A case study on the colos- sal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305,

  21. [22]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  22. [23]

    doi:10.5281/zenodo.5371628 , url =

    URL https: //doi.org/10.5281/zenodo.5371628. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D., and Crawford, K. Datasheets for datasets. Communications of the ACM , 64(12):86–92,

  23. [24]

    Semeval- 2012 task 7: Choice of plausible alternatives: An evalua- tion of commonsense causal reasoning

    Gordon, A., Kozareva, Z., and Roemmele, M. Semeval- 2012 task 7: Choice of plausible alternatives: An evalua- tion of commonsense causal reasoning. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main confer- ence and the shared task, and Volume 2: Proceedings of the Sixth International Worksho...

  24. [25]

    Learning word vectors for 157 languages

    Grave, ´E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018),

  25. [26]

    Computing Research Repository , eprint=

    Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, The RefinedWeb dataset for Falcon LLM Z., Henighan, T., Hume, T., et al. Scaling laws and inter- pretability of learning from repeated data. arXiv preprint arXiv:2205.10487,

  26. [27]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,

  27. [28]

    Pub- medqa: A dataset for biomedical research question an- swering

    Jin, Q., Dhingra, B., Liu, Z., Cohen, W., and Lu, X. Pub- medqa: A dataset for biomedical research question an- swering. In Proceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2567–2577,

  28. [29]

    FastText.zip: Compressing text classification models

    Joulin, A., Grave, E., Bojanowski, P., Douze, M., J ´egou, H., and Mikolov, T. Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651,

  29. [30]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  30. [31]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Liu, Y ., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  31. [32]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

  32. [33]

    Adversarial nli: A new benchmark for natural language understanding

    Nie, Y ., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. Adversarial nli: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599,

  33. [34]

    9 – 16, Mannheim,

    Cardiff, 22nd July 2019, pp. 9 – 16, Mannheim,

  34. [35]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    doi: 10.48550/ARXIV .2112.11446. URL https://arxiv.org/abs/2112.11446. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P. J. Ex- ploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67,

  35. [36]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ili´c, S., Hesslow, D., Castagn´e, R., Luccioni, A. S., Yvon, F., Gall ´e, M., et al. Bloom: A 176b-parameter open-access multilin- gual language model. arXiv preprint arXiv:2211.05100, 2022a. Scao, T. L., Wang, T., Hesslow, D., Saulnier, L., Bekman, S., Bari, M. S., Bideman, S., Elsahar, H., Muennighoff, N., ...

  36. [37]

    com/CLD2Owners/cld2 (last updated on August 2015),

    Software available at https://github. com/CLD2Owners/cld2 (last updated on August 2015),

  37. [38]

    LaMDA: Language Models for Dialog Applications

    Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kul- shreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y ., et al. Lamda: Language models for dialog appli- cations. arXiv preprint arXiv:2201.08239,

  38. [39]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models. arXiv preprint arXiv:2302.13971,

  39. [40]

    Trinh, T. H. and Le, Q. V . A simple method for common- sense reasoning. arXiv preprint arXiv:1806.02847,

  40. [41]

    Will we run out of data? limits of llm scaling based on human-generated data

    Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M., and Ho, A. Will we run out of data? an analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325,

  41. [42]

    A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S

    The RefinedWeb dataset for Falcon LLM Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2447–2469,

  42. [43]

    mt5: A massively multilingual pre-trained text-to-text transformer

    Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mt5: A massively multilingual pre-trained text-to-text transformer. In Pro- ceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498,

  43. [44]

    Pangu- alpha: Large-scale autoregressive pretrained chinese lan- guage models with auto-parallel computation

    Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y ., Wang, Z., Jiang, X., Yang, Z., Wang, K., Zhang, X., et al. Pangu- alpha: Large-scale autoregressive pretrained chinese lan- guage models with auto-parallel computation. arXiv preprint arXiv:2104.12369,

  44. [45]

    ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

    Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Van Durme, B. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885,

  45. [46]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

  46. [47]

    RefinedWeb Datasheet. MOTIVATION. For what purpose was the dataset created? RefinedWeb was created to serve as a large-scale dataset for the pretraining of large language models

    A. RefinedWeb Datasheet. MOTIVATION. For what purpose was the dataset created? RefinedWeb was created to serve as a large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow). Who created the dataset and on behalf of...

  47. [48]

    After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al.,

    to extract content from pages, and perform language identification with the fastText classifier from CCNet (Wenzek et al., 2020). After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al.,

  48. [49]

    and our own line-wise corrections (Appendix G.2). Finally, we run extensive deduplication, removing URLs revisited across dumps (Section 3.3) and then performing fuzzy and exact substring deduplication, with each stage drawing from Lee et al. (2022). See Section 3 for further details and Table 2 for an outline. Was the “raw” data saved in additi...

  49. [50]

    Architecture based on GPT-3 (Brown et al., 2020), with ALiBi positional encodings (Press et al.,

    Model type and information about training: Falcon-RW are autoregressive Transformer models trained with a causal language modeling objective. Architecture based on GPT-3 (Brown et al., 2020), with ALiBi positional encodings (Press et al.,

  50. [51]

    See Section 4.1 for details

    and FlashAttention (Dao et al., 2022). See Section 4.1 for details. Licence: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0). Point of contact: falconllm@tii.ae. INTENDED USE. Primary intended uses: research on large language models, and the influence of adequately filtered and deduplicated web data on the properties of large language models (fairne...

  51. [52]

    Preprocessing: We use the default prompts and setup of Gao et al

    Motivation: We selected and aggregated tasks to build comparisons with other models in the literature (see Section 4.1; Appendix F.1 for details). Preprocessing: We use the default prompts and setup of Gao et al. (2021). TRAINING DATA: See the dedicated datasheet in Table

  52. [53]

    Table 7: Model card for Falcon-RW, following the framework introduced by Mitchell et al. (2019). C. Dataset analysis. The large-scale and diverse nature of web corpora makes them difficult to document and analyse as a whole; we provide some key metrics in this section, focusing on document lengths in Figure 5(a), and a b...

  53. [54]

    [Figure residue] Panel (a), Document Lengths: x-axis is document length in tokens (log scale, 10^0 to 10^6); series are RefinedWeb, RW-Filtered, RW-Raw, The Pile, C4, OSCAR 2019, and OSCAR 22.01. Domain labels: blogspot.com, wordpress.com, google.com, youtube.com, scribd.com, issuu.com, biomedcentral.com, yahoo.com, typepad.com, archive.org, wikia.com, fanfiction.net, cnn.com, nytimes.com, science.gov, ufl.edu, fandom.com, slashdot.org, thegua...

  54. [55]

    (a) We find the OSCAR datasets and RW-Raw to have similar document length distributions; following filtering, most of the short documents are discarded from RW-Filtered

    Make-up of RefinedWeb in document lengths (left) and top domains (right). (a) We find the OSCAR datasets and RW-Raw to have similar document length distributions; following filtering, most of the short documents are discarded from RW-Filtered. As deduplication removes spans, it reintroduces shorter documents to RefinedWeb. We note the make-up of C4 and Re...

  55. [56]

    Perplexity in bits-per-byte on Wikitext (wiki-bpb, lower is better). 1B model on The Pile: 0.64; 1B model on RW: 0.66; 7B model on RW: 0.60.

    Models trained on RefinedWeb achieve performance close to models trained on The Pile on Wikitext, despite not having seen any content from Wikipedia. Perplexity in bits-per-byte on Wikitext (wiki-bpb, lower is better). wiki-bpb ↓ by model size and dataset: 1B on The Pile, 0.64; 1B on RW, 0.66; 7B on RW, 0.60. E.3. Does deduplication help with...
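For readers unfamiliar with the wiki-bpb metric, the sketch below shows one common way to turn a summed language-modelling loss into bits per byte. It assumes the loss is a sum of per-token negative log-likelihoods in nats and that bytes are counted on the UTF-8 encoded evaluation text; the function name and inputs are illustrative and are not taken from the paper.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per UTF-8 byte.

    Assumption: `total_nll_nats` is the sum of per-token losses over `text`.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

Normalising by bytes rather than tokens keeps the number comparable across models that use different tokenizers.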

  56. [57]

    However, our experiments were only performed at small scale (1B models trained on 30GT), and we see high variability in outcomes across tasks

    Deduplication may reduce the degradation in performance incurred by multiple epochs. However, our experiments were only performed at small scale (1B models trained on 30GT), and we see high variability in outcomes across tasks. Zero-shot performance measured on the agg-dev-2 aggregate (HellaSwag, PIQA, ARC, BoolQ, COPA, MRPC, SciQ). Individual curves for ...

  57. [58]

    also comes in a smaller series, up to 2.7B parameters, following the recommendations of µ-parametrization (Yang et al., 2021). As we found the performance of this smaller series to be close to the main series of models (see Figure 8), and as it does not include models of a similar compute scale as the ones we compare to, we chose not to report it in our m...

  58. [59]

    Note in Figure 1 that the results from the GPT-3 paper are still ahead of results obtained through the API with the EAI evaluation harness

    We source evaluation results from a variety of papers across the literature, maximizing task coverage. Although most results come from the EAI Evaluation Harness (Gao et al., 2021), results from PaLM and GPT-3 are sourced from their respective papers. Note in Figure 1 that the results from the GPT-3 paper are still ahead of results obtained through the API...

  59. [60]

    We posit a few possible hypotheses: • Differences between curated and web data

    Nevertheless, a difference between our findings and theirs remains. We posit a few possible hypotheses: • Differences between curated and web data. It is possible that web data is more sensitive to duplicates. For instance, the most common duplicates in web data (e.g., spam) may be more detrimental than the most common duplicates in curated data. This suggests ...

  60. [61]

    Finally, we note that Biderman et al

    • Differences in pretraining. Finally, we note that Biderman et al. (2023) choose to perform a partial extra epoch on the deduplicated data to reach 300GT, while we always perform a single epoch. Their setting corresponds to a data-constrained scenario, which is more realistic for the curated data they study; for us, web data is plentiful, so deduplicati...

  61. [62]

    Zero-shot performance on our core aggregate, gap between Cerebras-GPT with µ-param and without

    slightly improves performance in the Cerebras-GPT series (Dey et al., 2023). Zero-shot performance on our core aggregate, gap between Cerebras-GPT with µ-param and without. Individual curves for per-task results and 1-σ standard deviation across all tasks in the aggregate in transparent. [Figure residue] x-axis: Compute [PF-days] (log scale); y-axis label truncated: "Absolute deduplication 0..."

  62. [63]

    Zero-shot performance on our core aggregate, gap between Pythia trained on the deduplicated and vanilla Pile

    In our core aggregate, deduplication brings a small improvement to the Pythia suite (Biderman et al., 2023). Zero-shot performance on our core aggregate, gap between Pythia trained on the deduplicated and vanilla Pile. Individual curves for per-task results and 1-σ standard deviation across all tasks in the aggregate in transparent. ...

  63. [64]

    Full-scale models trained on RefinedWeb (Falcon-RW) and other models from the state-of-the-art. Across models trained on The Pile, the Pythia models are the closest to our architecture: they use FlashAttention with rotary embeddings, the only notable exception being the use of parallel attention and feedforward in their models. Training budget C in PF-days ...

  64. [65]

    Datasets such as OSCAR and C4 also have significant multilingual versions, which have enjoyed wide adoption (Xue et al., 2021)

    Common massive web-scrape and LLM English datasets. Datasets such as OSCAR and C4 also have significant multilingual versions, which have enjoyed wide adoption (Xue et al., 2021). For OSCAR, the size corresponds to the non-deduplicated version, and is estimated from the number of words × 0.75 (average number of words per token). General information Web da...

  65. [66]

    GPT-J (Wang & Komatsuzaki, 2021), GPT-NeoX-20B (Black et al., 2022), Pythia (Biderman et al., 2023), Cerebras-GPT (Dey et al.,

  66. [67]

    Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al.,

  67. [68]

    Details of the Macrodata Refinement pipeline G.1

    G. Details of the Macrodata Refinement pipeline. G.1. URL filtering. As discussed in Section 3.1, we base our filtering of adult documents only on the URL itself, and not on the content of the documents. This design choice was motivated by: (1) challenges in avoiding o...

  68. [69]

    This blocklist is applied at the URL filtering stage, along with the adult content blocklist

    RefinedWeb is stripped of common so-called high-quality sources to simplify combining it with existing curated corpora. This blocklist is applied at the URL filtering stage, along with the adult content blocklist. Curated data source and blocked domain names: arxiv: arxiv.org; AskUbuntu: askubuntu.com; StackOverflow: stackoverflow.com, stackapps.com, stackexchange.c...
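A domain blocklist of this kind can be sketched as a hostname check. The sketch below is illustrative only: the set contains just the domains legible in the excerpt (the full list is truncated there), and matching subdomains as well as exact hostnames is an assumption rather than something the excerpt states.

```python
from urllib.parse import urlsplit

# Domains copied from the excerpt; the real blocklist is longer (truncated in the source).
BLOCKED_DOMAINS = {"arxiv.org", "askubuntu.com", "stackoverflow.com", "stackapps.com"}


def is_blocked(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: subdomains (e.g. export.arxiv.org) are treated like the parent domain.
    return any(host == domain or host.endswith("." + domain) for domain in blocked)
```

Checking the parsed hostname rather than searching the raw URL string avoids false positives when a blocked domain name merely appears in a path or query string.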

  69. [70]

    Hash functions are used to obtain a signature for each document: for each hash function, the smallest value is kept from hashing every unique n-gram in the document

    and obtain the set of unique n-grams for each document. Hash functions are used to obtain a signature for each document: for each hash function, the smallest value is kept from hashing every unique n-gram in the document. If two documents are similar, then there is a high probability that they will have the same minimum hash (MinHash) for at least some of...
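The signature computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the n-gram size of 5, the 64 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for readability.

```python
import hashlib


def unique_ngrams(tokens, n=5):
    """Return the set of unique n-grams (as tuples) in a list of string tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def seeded_hash(seed, gram):
    """One member of a hash family, selected by `seed` (illustrative choice)."""
    data = (str(seed) + "\x1f" + "\x1f".join(gram)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64, n=5):
    """For each hash function, keep the smallest hash over all unique n-grams."""
    grams = unique_ngrams(tokens, n)
    if not grams:
        raise ValueError("document has fewer than n tokens")
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The fraction of agreeing positions between two signatures serves as an estimate of how much the documents' n-gram sets overlap, which is why similar documents tend to share at least some minimum hashes.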

  70. [71]

    Since comparing MinHash signatures between every possible document pair is computationally expensive, we apply a locality sensitive hashing version of MinHash, MinHash LSH

    of the sets of their unique n-grams (the sets being $d_i$ and $d_j$): $J(d_i, d_j) = \frac{|d_i \cap d_j|}{|d_i \cup d_j|}$ (1) Matching. Since comparing MinHash signatures between every possible document pair is computationally expensive, we apply a locality sensitive hashing version of MinHash, MinHash LSH. A document signature is split into r buckets, each with b minhashes. Docum...
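As a rough illustration of Equation (1) and of the bucketing step, the sketch below computes the Jaccard index of two sets and finds candidate duplicate pairs from precomputed signatures. The excerpt is truncated before it states the matching rule, so the rule used here (two documents are candidates if they agree on every minhash in at least one bucket) is the standard MinHash LSH convention and should be read as an assumption; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def candidate_pairs(signatures: dict, num_buckets: int) -> set:
    """Map {doc_id: signature} to the set of doc-id pairs sharing a full bucket."""
    if not signatures:
        return set()
    sig_len = len(next(iter(signatures.values())))
    if sig_len % num_buckets:
        raise ValueError("signature length must be divisible by num_buckets")
    width = sig_len // num_buckets  # minhashes per bucket
    pairs = set()
    for k in range(num_buckets):
        table = defaultdict(list)
        for doc_id, sig in signatures.items():
            table[tuple(sig[k * width:(k + 1) * width])].append(doc_id)
        for doc_ids in table.values():
            pairs.update(combinations(sorted(doc_ids), 2))
    return pairs
```

Grouping by bucket contents means only documents that collide in some bucket are ever compared, which avoids the all-pairs comparison the text describes as too expensive.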

  71. [72]

    Finally, we cluster documents across all buckets — if documents A and B match in one bucket and B and C in another, A-B-C becomes a cluster

    This means that for each document, we compute a total of 9000 minhashes, and that the probability that a document pair with similarity 0.75 or 0.8 will be marked as duplicates will be 76% and 99.4% (respectively), diminishing rapidly for smaller similarity values. Finally, we cluster documents across all buckets — if documents A and B match in one bucket ...
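The quoted probabilities and the clustering rule can be checked with a short sketch. The bucket configuration is not visible in this excerpt, so the split used below (450 buckets of 20 minhashes each) is an assumption: it multiplies to the stated 9000 minhashes and, under the standard LSH formula 1 - (1 - s^b)^r, gives roughly 0.76 and 0.995 for similarities 0.75 and 0.8, in line with the figures above. The union-find clustering is likewise one possible way to merge matches transitively, not necessarily the authors' implementation.

```python
def match_probability(similarity: float, num_buckets: int, hashes_per_bucket: int) -> float:
    """Probability that two documents agree on all minhashes of at least one bucket."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def cluster(pairs):
    """Merge matched pairs transitively (A-B and B-C give {A, B, C}) with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Assumed configuration: 450 buckets x 20 minhashes = 9000 minhashes per document.
p_075 = match_probability(0.75, num_buckets=450, hashes_per_bucket=20)
p_080 = match_probability(0.80, num_buckets=450, hashes_per_bucket=20)
```

Because matches are merged transitively, two documents in the same cluster are not guaranteed to have matched each other directly.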

  72. [73]

    in linear time—an array of the indexes to a lexicographical ordering of all the suffixes in the sequence. Finally, duplicate sequences can also be found in linear time using the suffix array, by simply traversing the ordered list of suffixes and comparing the beginning of each pair of two consecutive suffixes. We apply the same normalization and tokenizat...
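The traversal described above can be illustrated with a deliberately naive sketch. It builds the suffix array by plain sorting, which is far from linear time, and compares each pair of consecutive suffixes; the minimum match length `min_len` and the function names are hypothetical parameters for illustration only.

```python
def suffix_array(seq):
    """Indexes of all suffixes of `seq`, in lexicographical order (naive sort)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def common_prefix_len(seq, i, j):
    """Length of the shared prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n


def duplicated_spans(seq, min_len):
    """Start positions (first, second) of adjacent suffixes sharing >= min_len tokens."""
    order = suffix_array(seq)
    spans = []
    for a, b in zip(order, order[1:]):
        if common_prefix_len(seq, a, b) >= min_len:
            spans.append((min(a, b), max(a, b)))
    return sorted(spans)
```

Sorting brings suffixes with a long shared prefix next to each other, so checking only consecutive entries is enough to surface repeated spans; a production pipeline would use a linear-time construction instead of `sorted`.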

  73. [74]

    URLs with different GET parameters don’t always result in significantly different page content. http://gamesandbiz.blogspot.com/2010/07/bad-reviews-can-hurt-game-sales.html?showComment=1278486430242 http://gamesandbiz.blogspot.com/2010/07/bad-reviews-can-hurt-game-sales.html?showComment=1278499674195 https://www.ocean-oxygen.org/home;jsessionid=1E3290...
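To make the observation concrete, the sketch below strips query strings, path parameters, and fragments so that URLs like the ones above collapse to one key. It is purely illustrative: the excerpt is truncated and does not say that the pipeline normalises URLs this way, and the function name is hypothetical.

```python
from urllib.parse import urlparse, urlunparse


def strip_url_parameters(url: str) -> str:
    """Drop path parameters (;k=v), the query string (?k=v), and the fragment (#...)."""
    parts = urlparse(url)
    return urlunparse(parts._replace(params="", query="", fragment=""))
```

Such aggressive stripping would also merge pages whose content genuinely depends on a query parameter, so it is best seen as a way to surface candidates rather than as a safe deduplication rule.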