SEDD: Scalable and Efficient Dataset Deduplication with GPUs
Pith reviewed 2026-05-23 06:43 UTC · model grok-4.3
The pith
A streaming-based GPU framework for MinHash deduplication runs up to 158 times faster than a CPU baseline, while its duplicate sets keep a Jaccard similarity above 0.95 with those found by standard MinHash.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed framework accelerates MinHash-based deduplication through four components: a streaming-based approach that replaces physical data shuffling, a computationally efficient, partially reusable hash function, highly optimized GPU kernels, and a hardware-aware automatic parameter selection mechanism. On 30 million documents, it reports speedups of up to 158 times over the CPU-based tool in SlimPajama and up to 7.8 times over the GPU-based tool in NVIDIA NeMo Curator, while its duplicate sets reach Jaccard similarities above 0.95 with those found by standard MinHash.
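For readers unfamiliar with the baseline, here is a minimal sketch of standard MinHash signature generation in pure Python. It is not SEDD's partially reusable hash function or its GPU kernel; the character 5-gram shingling, the seeded BLAKE2b hash, and every function name are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of `text` (assumed shingling scheme)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash64(token, seed):
    """Deterministic 64-bit hash of `token`, keyed by an integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=128, k=5):
    """One minimum per seeded hash function, taken over all shingles."""
    tokens = shingles(text, k)
    return [min(_hash64(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(text_a, text_b, k=5):
    """Exact Jaccard similarity of the two shingle sets, for comparison."""
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    return len(sa & sb) / len(sa | sb)
```

The cost is one hash per shingle per hash function, which is why signature generation dominates CPU pipelines and is the phase the paper reports accelerating the most.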
What carries the argument
The streaming-based replacement for physical data shuffling in distributed MinHash LSH, supported by a partially reusable hash function and hardware-aware parameter selection.
Load-bearing premise
The streaming-based replacement for physical data shuffling preserves the exact duplicate sets that the original MinHash LSH algorithm would have found without introducing additional false negatives.
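One way to see when this premise can hold: in LSH banding, a bucket key depends only on a document's signature, not on the order or grouping in which documents are processed. The sketch below compares an all-at-once grouping (standing in for a shuffle-style group-by) with a chunked, band-by-band pass. It is an illustration under that assumption only; the chunking scheme and all names are hypothetical and do not describe SEDD's actual streaming design.

```python
from collections import defaultdict


def band_key(signature, band, rows):
    """Key for one band: the band index plus that band's slice of the signature."""
    return (band, tuple(signature[band * rows:(band + 1) * rows]))


def buckets_global(signatures, bands, rows):
    """Group every document at once, as a shuffle-style group-by would."""
    table = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            table[band_key(sig, band, rows)].add(doc_id)
    # Keep only buckets that hold at least two documents (candidate duplicates).
    return {k: frozenset(v) for k, v in table.items() if len(v) > 1}


def buckets_streaming(signatures, bands, rows, chunk_size):
    """Process one band at a time, reading documents in fixed-size chunks."""
    doc_ids = list(signatures)
    result = {}
    for band in range(bands):
        table = defaultdict(set)
        for start in range(0, len(doc_ids), chunk_size):
            for doc_id in doc_ids[start:start + chunk_size]:
                table[band_key(signatures[doc_id], band, rows)].add(doc_id)
        for key, members in table.items():
            if len(members) > 1:
                result[key] = frozenset(members)
    return result
```

If a real streaming pipeline drops, truncates, or approximates any key along the way, the two results can diverge, which is exactly the case the premise has to rule out.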
What would settle it
A side-by-side run on the same dataset where the new method's duplicate sets show Jaccard similarity below 0.95 with standard MinHash or where known duplicates are missed.
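Such a side-by-side check could be sketched as follows, assuming the 0.95 figure is a Jaccard similarity between the sets of document IDs that each method flags as duplicates. The paper may define the metric differently, and the function names here are hypothetical.

```python
def set_jaccard(a, b):
    """Jaccard similarity of two collections of document IDs."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def compare_duplicate_sets(baseline_ids, candidate_ids, threshold=0.95):
    """Report the set-level score and the baseline duplicates the candidate missed."""
    score = set_jaccard(baseline_ids, candidate_ids)
    missed = sorted(set(baseline_ids) - set(candidate_ids))
    return {"jaccard": score, "missed": missed, "passes": score >= threshold}
```

A run in which the baseline flags documents 0 to 99 and the new method flags 2 to 99 scores 0.98 and passes the threshold while still missing two known duplicates, so both conditions need to be checked separately.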
Original abstract
Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain suboptimal due to high communication overhead from physical data shuffling and underutilization of GPU resources. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, alongside highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly mitigates communication bottlenecks. Our framework outperforms the CPU-based deduplication tool in SlimPajama by up to 158× and the GPU-based tool in NVIDIA NeMo Curator by up to 7.8× when processing 30 million documents on a node with four GPUs. Notably, SEDD dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speedups of up to 375× over the CPU baseline. Despite these gains in efficiency, SEDD maintains high deduplication fidelity, with duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 3 hours on an 8-node 32-GPU V100 cluster. The related code is publicly available on GitHub (https://github.com/mcrl/SEDD).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SEDD, a GPU-accelerated MinHash LSH deduplication framework for large-scale datasets. It introduces a reusable hash function, optimized GPU kernels, automatic parameter selection, and a streaming replacement for physical data shuffling to reduce communication overhead. Empirical results claim up to 158× speedup over SlimPajama (CPU), 7.8× over NeMo Curator (GPU) on 30M documents with 4 GPUs, 375× faster MinHash signature generation, and processing of 1.2T tokens in 3 hours on an 8-node 32-GPU cluster, while reporting duplicate document sets with Jaccard similarity >0.95 versus standard MinHash.
Significance. If the performance numbers hold under identical conditions and the fidelity claim is rigorously validated, SEDD would provide a practically useful tool for efficient distributed deduplication in LLM data pipelines, addressing known bottlenecks in GPU utilization and inter-node communication.
major comments (1)
- [Abstract and fidelity evaluation section] The claim that duplicate detection quality is preserved rests on duplicate document sets achieving Jaccard similarity >0.95 versus standard MinHash. This aggregate set-level metric does not establish that the exact duplicate sets (or the underlying pair-wise Jaccard distribution) match those of the original MinHash LSH algorithm, nor does it exclude additional false negatives introduced by the streaming replacement for physical shuffling. A pair-level agreement analysis or formal argument for equivalence is required to secure the central fidelity guarantee.
minor comments (1)
- The manuscript states that code is publicly available on GitHub but does not provide the repository URL or commit hash in the main text or appendix, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comment regarding the fidelity evaluation. We address the concern directly below and outline the planned revision.
Point-by-point responses
- Referee: [Abstract and fidelity evaluation section] The claim that duplicate detection quality is preserved rests on duplicate document sets achieving Jaccard similarity >0.95 versus standard MinHash. This aggregate set-level metric does not establish that the exact duplicate sets (or the underlying pair-wise Jaccard distribution) match those of the original MinHash LSH algorithm, nor does it exclude additional false negatives introduced by the streaming replacement for physical shuffling. A pair-level agreement analysis or formal argument for equivalence is required to secure the central fidelity guarantee.
Authors: We agree that the reported set-level Jaccard similarity (>0.95) on the final duplicate document sets is an aggregate metric and does not directly quantify pairwise agreement or rule out additional false negatives. The streaming replacement for physical shuffling is intended to preserve equivalence by computing MinHash signatures locally and exchanging only the necessary hash values for LSH bucketing without reordering the underlying data; however, we recognize that an explicit verification is needed. In the revised manuscript we will add a pair-level agreement analysis (precision/recall of duplicate pairs and Jaccard distribution over pairs) on a held-out subset, together with a short formal argument showing that the streaming procedure computes identical signatures and bucket assignments to the shuffled baseline.
Revision: yes
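The pair-level agreement analysis proposed in the response could look roughly like the sketch below. It assumes each method outputs duplicate clusters as lists of document IDs and treats the standard MinHash output as the reference; the function names and output format are hypothetical, not taken from the paper or its repository.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Expand clusters of document IDs into the set of unordered duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def pair_agreement(baseline_clusters, candidate_clusters):
    """Precision and recall of candidate duplicate pairs against the baseline."""
    base = duplicate_pairs(baseline_clusters)
    cand = duplicate_pairs(candidate_clusters)
    true_positives = len(base & cand)
    return {
        "precision": true_positives / len(cand) if cand else 1.0,
        "recall": true_positives / len(base) if base else 1.0,
        # Pairs the baseline found but the candidate did not (false negatives).
        "missed_pairs": sorted(base - cand),
    }
```

Recall below 1.0 here would directly expose the additional false negatives that the set-level Jaccard score cannot rule out.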
Circularity Check
No circularity; empirical performance and fidelity claims are self-contained
The paper introduces SEDD as a GPU-optimized MinHash LSH deduplication system with streaming replacement for shuffling, reusable hash functions, and GPU kernels. All central claims (speedups of 158× vs SlimPajama, 7.8× vs NeMo Curator, 375× for signature generation, and Jaccard >0.95 fidelity) are direct wall-clock and quality measurements from experiments on fixed document sets against external baselines. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the evaluation chain consists solely of implementation, execution, and comparison, remaining independent of its own outputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "By replacing traditional data shuffling with a streaming-based approach... duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "FED compares all pairs of documents in the bucket in O(B²)... optimized GPU comparison kernel"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
  SubQuad delivers near-subquadratic retrieval and equity-aware clustering for adaptive immune repertoires, achieving faster throughput and memory use while maintaining recall and subgroup balance on viral and tumor data.