
arxiv: 2501.01046 · v4 · pith:S3XDXIQYnew · submitted 2025-01-02 · 💻 cs.CL

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

Pith reviewed 2026-05-23 06:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords dataset deduplication · MinHash LSH · GPU kernels · streaming approach · distributed computing · data quality · large language models · hash functions

The pith

A streaming-based GPU framework for MinHash LSH deduplication runs up to 158 times faster than a CPU baseline while producing duplicate sets whose Jaccard similarity to standard MinHash results exceeds 0.95.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a GPU-accelerated framework for dataset deduplication using the MinHash LSH algorithm. It replaces physical data shuffling with a streaming approach to reduce communication overhead in distributed settings. The framework also features a partially reusable hash function, optimized GPU kernels, and automatic parameter selection based on hardware. If correct, this enables much faster processing of large corpora for training language models without sacrificing the quality of duplicate removal. Experiments show it can handle 1.2 trillion tokens in 3 hours on a 32-GPU cluster.

Core claim

The proposed framework accelerates MinHash-based deduplication through four components: a streaming-based approach that replaces physical data shuffling, a computationally efficient, partially reusable hash function, highly optimized GPU kernels, and a hardware-aware automatic parameter selection mechanism. On 30 million documents processed on a node with four GPUs, it runs up to 158 times faster than the CPU-based tool in SlimPajama and up to 7.8 times faster than the GPU-based tool in NVIDIA NeMo Curator, and its duplicate sets reach Jaccard similarities above 0.95 relative to standard MinHash results.

What carries the argument

The streaming-based replacement for physical data shuffling in distributed MinHash LSH, supported by a partially reusable hash function and hardware-aware parameter selection.

Load-bearing premise

The streaming-based replacement for physical data shuffling preserves the exact duplicate sets that the original MinHash LSH algorithm would have found without introducing additional false negatives.

What would settle it

A side-by-side run on the same dataset where the new method's duplicate sets show Jaccard similarity below 0.95 with standard MinHash or where known duplicates are missed.
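The set-level check described above can be sketched in a few lines. This is a minimal illustration, assuming the comparison is made over sets of document IDs flagged as duplicates by each method, which the summary does not spell out. The function names and the document IDs are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def missed_duplicates(reference: set, candidate: set) -> set:
    """Documents the reference method flags as duplicates but the candidate does not."""
    return reference - candidate


# Hypothetical document IDs flagged as duplicates by each method.
baseline_dups = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}    # standard MinHash (assumed)
new_method_dups = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}  # method under test (assumed)

score = jaccard(baseline_dups, new_method_dups)
missed = missed_duplicates(baseline_dups, new_method_dups)
print(f"Jaccard = {score:.3f}, missed = {sorted(missed)}")
```

A score below 0.95, or a non-empty `missed` set containing known duplicates, would be the kind of evidence this section asks for.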

Figures

Figures reproduced from arXiv: 2501.01046 by Chaewon Kim, Jaejin Lee, Youngjun Son.

Figure 1: The process of MinHash generation.
RealNews Dataset index: 12347 text: Margins matter. The more Synovis Life Technologies (Nasdaq: SYNO) keeps of each buck it earns in revenue, the more money it has to invest in growth, fund new strategic plans, or (gasp!) distribute to shareholders. Healthy margins often separate pretenders from the best stocks in the market. That’s why we check up on margins at least o…
Figure 2: Examples of duplicate documents.
…ponents, known as shingles or n-grams and represented by the set of its shingles. Then, the shingles are mapped to zero or positive integers using H hash functions, f1, f2, ..., fH. As a result, we have H sequences of integers to represent a document. A MinHash signature for the document is a sequence of the minimum values obtained from each hash function, resulting in…
Figure 3: Hashing by MinHash LSH.
LSH combines MinHash with Locality-Sensitive Hashing (LSH) (Indyk and Motwani, 1998). The critical difference between MinHash and MinHash LSH lies in the stage of generating duplicate pairs. In MinHash LSH, the signature column vector in the signature matrix of each document is divided into b bands, each of which has r integers. As shown in…
Figure 4: The overview of FED.
3.1 Overview of FED’s Minhash LSH. Assume that the raw dataset consists of multiple files and that they are stored in a suitable format, such as JSONL, as shown in…
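The figure excerpts above describe two steps: building a MinHash signature from H hash functions over a document's shingles, then splitting the signature into b bands of r integers to find candidate duplicate pairs. The sketch below illustrates both steps on the CPU. It is not the paper's implementation: the character 5-gram shingling, the seeded BLAKE2b hash (a stand-in, not the partially reusable hash function), the parameter values H = 128, b = 32, r = 4, and all function names are assumptions chosen for illustration. The `candidate_probability` helper is the standard LSH estimate and is not taken from the source.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 5) -> set:
    """Return the set of character n-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(shingle: str, seed: int) -> int:
    """Map a shingle to a non-negative 64-bit integer; one seed per hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int, n: int = 5) -> list:
    """Signature = minimum hash value over all shingles, for each hash function."""
    grams = shingles(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Split each signature into `bands` bands of `rows` integers and pair up
    documents that agree on every integer of at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Standard LSH estimate: chance that two documents with the given Jaccard
    similarity share at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Toy corpus (made-up texts); doc_a and doc_b are exact copies.
docs = {
    "doc_a": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc_b": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc_c": "zzzzzzzzzzzzzzzzzzzz",
}
BANDS, ROWS = 32, 4  # illustrative values, H = BANDS * ROWS = 128
signatures = {d: minhash_signature(t, BANDS * ROWS) for d, t in docs.items()}
pairs = lsh_candidate_pairs(signatures, BANDS, ROWS)
print(pairs)
```

Identical documents always share every band, so they are guaranteed to be paired. Near-duplicates are paired with a probability that rises steeply with their Jaccard similarity, which is what the choice of b and r controls.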
Original abstract

Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain suboptimal due to high communication overhead from physical data shuffling and underutilization of GPU resources. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, alongside highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly mitigates communication bottlenecks. Our framework outperforms the CPU-based deduplication tool in SlimPajama by up to 158× and the GPU-based tool in NVIDIA NeMo Curator by up to 7.8× when processing 30 million documents on a node with four GPUs. Notably, SEDD dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speedups of up to 375× over the CPU baseline. Despite these gains in efficiency, SEDD maintains high deduplication fidelity, with duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 3 hours on an 8-node 32-GPU V100 cluster. The related code is publicly available on GitHub (https://github.com/mcrl/SEDD).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SEDD, a GPU-accelerated MinHash LSH deduplication framework for large-scale datasets. It introduces a partially reusable hash function, optimized GPU kernels, hardware-aware automatic parameter selection, and a streaming replacement for physical data shuffling to reduce communication overhead. The reported results are: up to 158× speedup over SlimPajama (CPU) and 7.8× over NeMo Curator (GPU) on 30M documents with 4 GPUs, up to 375× faster MinHash signature generation than the CPU baseline, and deduplication of 1.2T tokens in 3 hours on an 8-node, 32-GPU V100 cluster, with duplicate document sets reaching Jaccard similarity >0.95 versus standard MinHash.

Significance. If the performance numbers hold under identical conditions and the fidelity claim is rigorously validated, SEDD would provide a practically useful tool for efficient distributed deduplication in LLM data pipelines, addressing known bottlenecks in GPU utilization and inter-node communication.

major comments (1)
  1. [Abstract and fidelity evaluation section] The claim that duplicate detection quality is preserved rests on duplicate document sets achieving Jaccard similarity >0.95 versus standard MinHash. This aggregate set-level metric does not establish that the exact duplicate sets (or the underlying pair-wise Jaccard distribution) match those of the original MinHash LSH algorithm, nor does it exclude additional false negatives introduced by the streaming replacement for physical shuffling. A pair-level agreement analysis or formal argument for equivalence is required to secure the central fidelity guarantee.
minor comments (1)
  1. The manuscript states that code is publicly available on GitHub but does not provide the repository URL or commit hash in the main text or appendix, which would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding the fidelity evaluation. We address the concern directly below and outline the planned revision.

Point-by-point responses
  1. Referee: [Abstract and fidelity evaluation section] The claim that duplicate detection quality is preserved rests on duplicate document sets achieving Jaccard similarity >0.95 versus standard MinHash. This aggregate set-level metric does not establish that the exact duplicate sets (or the underlying pair-wise Jaccard distribution) match those of the original MinHash LSH algorithm, nor does it exclude additional false negatives introduced by the streaming replacement for physical shuffling. A pair-level agreement analysis or formal argument for equivalence is required to secure the central fidelity guarantee.

    Authors: We agree that the reported set-level Jaccard similarity (>0.95) on the final duplicate document sets is an aggregate metric and does not directly quantify pairwise agreement or rule out additional false negatives. The streaming replacement for physical shuffling is intended to preserve equivalence by computing MinHash signatures locally and exchanging only the necessary hash values for LSH bucketing without reordering the underlying data; however, we recognize that an explicit verification is needed. In the revised manuscript we will add a pair-level agreement analysis (precision/recall of duplicate pairs and Jaccard distribution over pairs) on a held-out subset, together with a short formal argument showing that the streaming procedure computes identical signatures and bucket assignments to the shuffled baseline.

    Revision: yes
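The pair-level agreement analysis mentioned in the response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' evaluation code: it assumes each method outputs clusters of duplicate document IDs, expands them into unordered pairs, and scores the candidate method against the baseline. The function names and document IDs are hypothetical.

```python
from itertools import combinations


def pairs_from_clusters(clusters) -> set:
    """Expand duplicate clusters into the set of unordered document pairs."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(set(cluster)), 2))
    return pairs


def pair_precision_recall(reference_pairs: set, candidate_pairs: set) -> tuple:
    """Precision and recall of candidate pairs, treating the reference as ground truth."""
    true_positives = len(reference_pairs & candidate_pairs)
    precision = true_positives / len(candidate_pairs) if candidate_pairs else 1.0
    recall = true_positives / len(reference_pairs) if reference_pairs else 1.0
    return precision, recall


# Hypothetical clusters: baseline (shuffled MinHash LSH) versus method under test.
reference = pairs_from_clusters([["d1", "d2", "d3"], ["d4", "d5"]])
candidate = pairs_from_clusters([["d1", "d2"], ["d4", "d5"], ["d6", "d7"]])

precision, recall = pair_precision_recall(reference, candidate)
false_negatives = reference - candidate
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print("missed pairs:", sorted(false_negatives))
```

Recall below 1.0 exposes exactly the false negatives the referee is concerned about, which a set-level Jaccard score can hide when the sets are large.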

Circularity Check

0 steps flagged

No circularity; empirical performance and fidelity claims are self-contained

Full rationale

The paper introduces SEDD as a GPU-optimized MinHash LSH deduplication system with streaming replacement for shuffling, reusable hash functions, and GPU kernels. All central claims (speedups of 158× vs SlimPajama, 7.8× vs NeMo Curator, 375× for signature generation, and Jaccard >0.95 fidelity) are direct wall-clock and quality measurements from experiments on fixed document sets against external baselines. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the evaluation chain consists solely of implementation, execution, and comparison, remaining independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on the standard correctness properties of MinHash LSH and on the assumption that GPU kernels and streaming communication can be implemented without changing those properties; no new free parameters, axioms, or invented entities are introduced beyond ordinary systems tuning.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

    cs.LG · 2026-02 · unverdicted · novelty 5.0

    SubQuad delivers near-subquadratic retrieval and equity-aware clustering for adaptive immune repertoires, achieving faster throughput and memory use while maintaining recall and subgroup balance on viral and tumor data.
