arxiv: 2107.06499 · v2 · pith:KGI2ZLF7new · submitted 2021-07-14 · 💻 cs.CL · cs.LG

Deduplicating Training Data Makes Language Models Better

Pith reviewed 2026-05-24 13:35 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords language models, deduplication, memorization, training data, evaluation, data cleaning, near-duplicates

The pith

Deduplicating training datasets reduces language model memorization by a factor of ten.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard language modeling datasets contain many near-duplicate examples and long repetitive substrings, which cause models to copy more than 1 percent of their unprompted output directly from training data. By building tools to remove these repeats, the authors demonstrate that the resulting models produce memorized text ten times less often while reaching target accuracy in fewer training steps. The same process also shrinks train-test overlap that affects more than 4 percent of typical validation examples, yielding cleaner performance measurements. A sympathetic reader would care because the work isolates a concrete data-cleaning step that simultaneously improves efficiency, reduces leakage, and strengthens evaluation.

Core claim

Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. Removing these duplicates with the authors' tools produces models that emit memorized text ten times less frequently and require fewer training steps to reach the same or better accuracy. It also reduces train-test overlap, which affects over 4 percent of the validation examples in standard datasets.

What carries the argument

The deduplication process that identifies and removes near-duplicate examples together with long repetitive substrings from training corpora.
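
As a rough illustration of the "long repetitive substrings" half of this machinery, the sketch below finds every token span of length k that occurs more than once in a corpus by sorting suffix start positions. The function name `repeated_spans`, the toy corpus, and the naive sort-based suffix array are assumptions made for illustration; this is not the paper's released tool, which has to work at corpus scale.

```python
from typing import List, Set, Tuple


def repeated_spans(tokens: List[str], k: int) -> Set[Tuple[str, ...]]:
    """Return every length-k token span that occurs at least twice in `tokens`."""
    # Naive suffix array: sort start positions by the suffix they begin.
    # Only the first k tokens matter for length-k repeats, so the key is truncated.
    starts = range(len(tokens) - k + 1)
    suffix_array = sorted(starts, key=lambda i: tokens[i:i + k])

    repeats: Set[Tuple[str, ...]] = set()
    # After sorting, identical spans sit next to each other.
    for a, b in zip(suffix_array, suffix_array[1:]):
        span_a = tuple(tokens[a:a + k])
        if span_a == tuple(tokens[b:b + k]):
            repeats.add(span_a)
    return repeats


corpus = "the cat sat on the mat and the cat sat on the rug".split()
print(repeated_spans(corpus, 4))  # two repeated 4-token spans
print(repeated_spans(corpus, 6))  # set(): no 6-token span repeats
```

Raising k makes the filter stricter: in the toy corpus the two sentences share four- and five-token spans but no six-token span.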

If this is right

  • Models emit memorized text ten times less frequently after deduplication.
  • Fewer training steps suffice to reach the same or better accuracy.
  • Train-test overlap drops, allowing more reliable evaluation of model quality.
  • Extreme repeats can be removed, such as a single 61-word English sentence that appears over 60,000 times in C4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standard data pipelines for language models may need routine deduplication to limit unintended copying of training content.
  • The same cleaning step could be tested on non-language tasks to check whether repetition removal yields similar efficiency gains.
  • Reduced memorization might lower the risk of models reproducing private or copyrighted material present in the original data.
  • Future experiments could vary the strictness of deduplication thresholds to measure the point at which further removal begins to hurt diversity.
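
The threshold experiment in the last point can be sketched on a toy scale. The code below scores document pairs by exact Jaccard similarity over word trigrams and greedily drops any document at or above a chosen threshold. The helper names (`shingles`, `jaccard`, `dedup`), the trigram size, and the greedy keep-first policy are all assumptions for illustration. Exact pairwise comparison is quadratic in the number of documents, so large corpora usually rely on approximate schemes such as MinHash instead.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams for a document (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup(docs: List[str], threshold: float, n: int = 3) -> List[str]:
    """Keep a document only if its similarity to every kept document is below `threshold`."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about language models",
]
for threshold in (0.5, 0.8):
    print(threshold, len(dedup(docs, threshold)))  # 0.5 keeps 2 docs, 0.8 keeps 3
```

The first two documents share 6 of 8 distinct trigrams (similarity 0.75), so a threshold of 0.5 removes one of them while 0.8 keeps both. Sweeping the threshold this way shows how the strictness setting trades removed repetition against retained variety.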

Load-bearing premise

The observed drops in memorization and training steps are produced by the removal of duplicates themselves rather than by incidental shifts in data distribution or training dynamics.

What would settle it

Train identical model architectures on the original dataset and on its deduplicated version, then compare the fraction of unprompted generations that match training text verbatim and the number of steps needed to reach a fixed validation accuracy.
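
A minimal sketch of the first measurement, assuming that "matches training text verbatim" means a generation shares at least one k-token window with some training document. The names `ngram_index` and `memorized_fraction`, the in-memory set index, and the toy value of k are assumptions; the paper's exact protocol is not restated on this page.

```python
from typing import Iterable, List, Set, Tuple


def ngram_index(train_docs: Iterable[List[str]], k: int) -> Set[Tuple[str, ...]]:
    """Collect every k-token window that appears in the training documents."""
    index: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        for i in range(len(doc) - k + 1):
            index.add(tuple(doc[i:i + k]))
    return index


def memorized_fraction(
    generations: List[List[str]], index: Set[Tuple[str, ...]], k: int
) -> float:
    """Fraction of generations containing at least one k-token window from the index."""
    if not generations:
        return 0.0
    hits = sum(
        any(tuple(g[i:i + k]) in index for i in range(len(g) - k + 1))
        for g in generations
    )
    return hits / len(generations)


train = ["a b c d e f".split()]
gens = ["x a b c y".split(), "p q r s t".split()]
index = ngram_index(train, k=3)
print(memorized_fraction(gens, index, k=3))  # 0.5: only the first generation overlaps
```

Running the same function on generations from the model trained on the original data and from the model trained on the deduplicated data, against the same training index, gives the two rates to compare.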

Figures

Figures reproduced from arXiv: 2107.06499 by Andrew Nystrom, Chiyuan Zhang, Chris Callison-Burch, Daphne Ippolito, Douglas Eck, Katherine Lee, Nicholas Carlini.

Figure 1. The distribution of near-duplicate cluster… (caption truncated in source)
Figure 2. Impact of deduplicating the training set on… (caption truncated in source)
Figure 3. The proportion of generations which have… (caption truncated in source)
Figure 4. Histograms of document similarities.
Figure 5. For each substring of length k, we plot the probability that there exists a second identical length-k substring in the same train set. Matches with length under 10 subword tokens are common, and account for 90% of tokens. We choose a threshold of 50 for experiments.
Figure 6. Memorized continuations distribution.
Figure 7. Impact of deduplicating the training set on validation perplexity.
Figure 8. The distribution of near-duplicate cluster… (caption truncated in source)
Original abstract

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard language modeling datasets such as C4 contain extensive near-duplicates and long repetitive substrings, causing trained models to emit verbatim memorized text in over 1% of unprompted outputs. The authors introduce two deduplication tools, apply them to remove highly repeated content (e.g., a 61-word sentence repeated >60k times), and report that models trained on the resulting deduplicated data emit memorized text ten times less frequently, reach equivalent or better accuracy in fewer training steps, and exhibit reduced train-test overlap (affecting >4% of validation sets). Code for deduplication and reproduction is released.

Significance. If the central empirical results hold after addressing controls, the work is significant because it identifies a pervasive data-quality issue in LM pretraining corpora and supplies practical, open-source tools that measurably reduce memorization while improving training efficiency and evaluation validity. The public release of code is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.
  2. [§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” in verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.
minor comments (2)
  1. [§3] The paper would benefit from an explicit statement of the deduplication thresholds and hash parameters used for the C4 experiments so that the exact data reduction can be reproduced from the released code.
  2. [§5] Table or figure captions reporting the 10× memorization reduction should include the exact number of evaluation prompts and the definition of “emitted memorized text” for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and outline revisions to strengthen the causal claims in the manuscript.

  1. Referee: [Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.

    Authors: We agree that the absence of a size-matched or frequency-matched control leaves open the possibility that some effects could arise from distribution shifts rather than duplicate removal alone. In the revised manuscript we will add an ablation that randomly subsamples the original C4 to the same token count as the deduplicated version and report memorization rates, downstream accuracy, and training efficiency for direct comparison. This will isolate the contribution of duplicate removal from simple size reduction. revision: yes

  2. Referee: [§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” in verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.

    Authors: Section 5 already specifies the evaluation protocol (100-token prompts, exact 50-token overlap detection, and the >1% to 0.1% figures). We nevertheless accept that a frequency-matched non-deduplicated baseline would provide stronger evidence that the reduction is attributable to duplicate removal. In revision we will construct such a baseline by re-weighting the original C4 to preserve n-gram statistics while retaining duplicates, rerun the memorization evaluation, and include the results. revision: yes
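
The size-matched control proposed in the first response could be sketched as follows: shuffle the original documents with a fixed seed and keep whole documents until a token budget (the token count of the deduplicated set) is reached. The function name `subsample_to_token_budget`, the whole-document granularity, and the skip-on-overshoot policy are assumptions for illustration, not a procedure taken from the paper.

```python
import random
from typing import List


def subsample_to_token_budget(
    docs: List[List[str]], token_budget: int, seed: int = 0
) -> List[List[str]]:
    """Randomly pick whole documents whose total token count stays within `token_budget`."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    order = list(range(len(docs)))
    rng.shuffle(order)

    picked: List[List[str]] = []
    total = 0
    for idx in order:
        n = len(docs[idx])
        if total + n > token_budget:
            continue  # this document would overshoot; a shorter one may still fit
        picked.append(docs[idx])
        total += n
    return picked


original = [["tok"] * n for n in (5, 3, 8, 2, 7)]
control = subsample_to_token_budget(original, token_budget=10, seed=1)
print(sum(len(d) for d in control))  # never more than 10
```

Skipping documents that would overshoot the budget tilts the sample slightly toward shorter documents near the end of the pass. A stricter control could stop at the first overshoot and accept a small shortfall instead.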

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential claims


The paper reports experimental results from training language models on C4 and other datasets before and after applying deduplication heuristics. All central claims (reduced verbatim memorization, faster convergence, lower train-test overlap) are direct measurements from these runs rather than outputs of any equation, fitted parameter, or uniqueness theorem. No load-bearing steps reduce to self-citation chains, ansatzes, or renamings; the work is self-contained against external benchmarks via the released code and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the empirical observation that duplicates drive memorization and on the assumption that the deduplication procedure isolates that causal factor without introducing new biases.

axioms (1)
  • Domain assumption: Near-duplicates and repetitive substrings in training data are the primary cause of elevated verbatim memorization rates in language models.
    This premise directly links the dataset analysis to the reported 10× reduction in copied output.



Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  2. Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

    cs.LG 2025-09 unverdicted novelty 7.0

    Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.

  3. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    cs.AI 2024-06 conditional novelty 7.0

    LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

  4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  5. Quantifying Memorization Across Neural Language Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

  6. Improving language models by retrieving from trillions of tokens

    cs.CL 2021-12 unverdicted novelty 7.0

    RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.

  7. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  8. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  9. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  10. Provable Knowledge Acquisition and Extraction in One-Layer Transformers

    cs.LG 2025-07 unverdicted novelty 6.0

    In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds ...

  11. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    cs.AI 2025-07 conditional novelty 6.0

    Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

  12. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  13. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  14. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  15. The False Promise of Imitating Proprietary LLMs

    cs.CL 2023-05 conditional novelty 6.0

    Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.

  16. SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    cs.LG 2023-03 unverdicted novelty 6.0

    SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

  17. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  18. Scaling Laws and Interpretability of Learning from Repeated Data

    cs.LG 2022-05 accept novelty 6.0

    Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

  19. GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    cs.CL 2022-04 accept novelty 6.0

    GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

  20. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  21. Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions

    cs.HC 2026-04 unverdicted novelty 5.0

    Warned AI-assisted writers had their documents selected as human 54.13% of the time by judges versus 45.87% for unwarned writers, despite no measurable differences in text features.

  22. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    cs.AI 2023-08 accept novelty 5.0

    Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

  23. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  24. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  25. Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

    cs.CL 2026-05 unverdicted novelty 4.0

    Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.

  26. Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

    cs.CL 2026-05 unverdicted novelty 4.0

    Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.

  27. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  28. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  29. Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

    cs.CL 2025-09 unverdicted novelty 3.0

    Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

  30. Data-Centric Foundation Models in Computational Healthcare: A Survey

    cs.LG 2024-01 unverdicted novelty 3.0

    The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

  31. Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

    cs.DC 2025-03 unverdicted novelty 2.0

    Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
