Deduplicating Training Data Makes Language Models Better
Pith reviewed 2026-05-24 13:35 UTC · model grok-4.3
The pith
Deduplicating training datasets makes language models emit memorized text ten times less frequently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. Removing them with the authors' two deduplication tools produces models that emit memorized text ten times less frequently and require fewer training steps to reach the same or better accuracy. Deduplication also reduces train-test overlap, which affects over 4 percent of the validation set of standard datasets.
What carries the argument
Two deduplication tools: exact substring matching, which removes long repetitive substrings, and approximate full-document matching, which uses hash-based techniques (Broder, 1997) to remove near-duplicate examples from training corpora.
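A minimal sketch of the exact-substring side of this idea, assuming whitespace tokens and an illustrative window length `k`. It stores fixed-length token windows in a Python set, which is a stand-in chosen for brevity and not the authors' released tool; the function names `find_repeated_windows` and `drop_repeats` are hypothetical.

```python
from collections import defaultdict


def find_repeated_windows(docs, k=5):
    """Return every k-token window that occurs more than once,
    mapped to its (document index, token offset) positions."""
    positions = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = text.split()
        for i in range(len(tokens) - k + 1):
            positions[tuple(tokens[i:i + k])].append((doc_id, i))
    return {w: p for w, p in positions.items() if len(p) > 1}


def drop_repeats(docs, k=5):
    """Keep the first occurrence of each k-token window and delete
    the tokens covered by any later occurrence."""
    seen = set()
    cleaned = []
    for text in docs:
        tokens = text.split()
        drop = [False] * len(tokens)
        for i in range(len(tokens) - k + 1):
            window = tuple(tokens[i:i + k])
            if window in seen:
                # Later occurrence: mark all k tokens for removal.
                for j in range(i, i + k):
                    drop[j] = True
            else:
                seen.add(window)
        cleaned.append(" ".join(t for t, d in zip(tokens, drop) if not d))
    return cleaned
```

The set of windows grows with corpus size, so this version only suits small experiments; it is meant to show what "removing long repetitive substrings" means operationally, not how to do it at the scale of C4.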
If this is right
- Models emit memorized text ten times less frequently after deduplication.
- Fewer training steps suffice to reach the same or better accuracy.
- Train-test overlap drops, allowing more reliable evaluation of model quality.
- Extreme repeats become removable: C4 contains a single 61-word English sentence that is repeated over 60,000 times.
Where Pith is reading between the lines
- Standard data pipelines for language models may need routine deduplication to limit unintended copying of training content.
- The same cleaning step could be tested on non-language tasks to check whether repetition removal yields similar efficiency gains.
- Reduced memorization might lower the risk of models reproducing private or copyrighted material present in the original data.
- Future experiments could vary the strictness of deduplication thresholds to measure the point at which further removal begins to hurt diversity.
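The threshold experiment in the last bullet could be prototyped along these lines. This is a sketch under stated assumptions: it computes exact Jaccard similarity over word n-gram shingles, the quantity that hash-based sketches are typically used to approximate, and the threshold, the shingle size `n`, and the function names are all illustrative choices rather than values from the paper.

```python
def shingles(text, n=3):
    """Set of word n-grams in a document (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_pairs(docs, threshold=0.8, n=3):
    """All index pairs (i, j), i < j, whose similarity meets the threshold.
    Quadratic in the number of documents, so only for small samples."""
    sets = [shingles(d, n) for d in docs]
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

Sweeping `threshold` downward flags progressively looser matches, which is one way to look for the point at which removal starts to cost diversity.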
Load-bearing premise
The observed drops in memorization and training steps are produced by the removal of duplicates themselves rather than by incidental shifts in data distribution or training dynamics.
What would settle it
Train identical model architectures on the original dataset and on its deduplicated version, then compare the fraction of unprompted generations that match training text verbatim and the number of steps needed to reach a fixed validation accuracy.
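The memorization half of that comparison might be scored as follows. This is a hedged sketch: it counts a generation as memorized if it shares any run of `span` consecutive whitespace tokens with the training text, where `span` and the function names are assumptions introduced here for illustration.

```python
def build_span_index(train_docs, span=50):
    """Set of every span-token window that appears in the training text."""
    index = set()
    for text in train_docs:
        tokens = text.split()
        for i in range(len(tokens) - span + 1):
            index.add(tuple(tokens[i:i + span]))
    return index


def memorized_fraction(generations, index, span=50):
    """Fraction of generations containing at least one training window."""
    if not generations:
        return 0.0
    hits = 0
    for text in generations:
        tokens = text.split()
        if any(
            tuple(tokens[i:i + span]) in index
            for i in range(len(tokens) - span + 1)
        ):
            hits += 1
    return hits / len(generations)
```

Running `memorized_fraction` on samples from the two models, each against an index built from its own training set, gives the pair of numbers the test calls for.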
Original abstract
We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard language modeling datasets such as C4 contain extensive near-duplicates and long repetitive substrings, and that as a result over 1% of the unprompted output of models trained on them is copied verbatim from the training data. The authors introduce two deduplication tools, apply them to remove highly repeated content (e.g., a 61-word sentence repeated over 60,000 times), and report that models trained on the deduplicated data emit memorized text ten times less frequently and reach equivalent or better accuracy in fewer training steps. Deduplication also reduces train-test overlap, which affects over 4% of the validation set of standard datasets. Code for deduplication and reproduction is released.
Significance. If the central empirical results hold after addressing controls, the work is significant because it identifies a pervasive data-quality issue in LM pretraining corpora and supplies practical, open-source tools that measurably reduce memorization while improving training efficiency and evaluation validity. The public release of code is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.
- [§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.
minor comments (2)
- [§3] The paper would benefit from an explicit statement of the deduplication thresholds and hash parameters used for the C4 experiments so that the exact data reduction can be reproduced from the released code.
- [§5] Table or figure captions reporting the 10× memorization reduction should include the exact number of evaluation prompts and the definition of “emitted memorized text” for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and outline revisions to strengthen the causal claims in the manuscript.
- Referee: [Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.
  Authors: We agree that the absence of a size-matched or frequency-matched control leaves open the possibility that some effects could arise from distribution shifts rather than duplicate removal alone. In the revised manuscript we will add an ablation that randomly subsamples the original C4 to the same token count as the deduplicated version and report memorization rates, downstream accuracy, and training efficiency for direct comparison. This will isolate the contribution of duplicate removal from simple size reduction.
  Revision: yes
- Referee: [§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.
  Authors: Section 5 already specifies the evaluation protocol (100-token prompts, exact 50-token overlap detection, and the >1% to 0.1% figures). We nevertheless accept that a frequency-matched non-deduplicated baseline would provide stronger evidence that the reduction is attributable to duplicate removal. In revision we will construct such a baseline by re-weighting the original C4 to preserve n-gram statistics while retaining duplicates, rerun the memorization evaluation, and include the results.
  Revision: yes
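The size-matched control proposed in the first response could be built roughly as follows. This is only a sketch: it assumes whitespace token counts, a fixed random seed, and a hypothetical helper name `subsample_to_token_budget`; none of these come from the paper or its released code.

```python
import random


def subsample_to_token_budget(docs, token_budget, seed=0):
    """Randomly pick documents from the original (non-deduplicated) corpus
    until the token budget of the deduplicated corpus is reached.
    Returns the kept document indices and the tokens actually used."""
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)
    kept, used = [], 0
    for idx in order:
        n_tokens = len(docs[idx].split())
        if used + n_tokens > token_budget:
            continue  # skip documents that would overshoot the budget
        kept.append(idx)
        used += n_tokens
    return sorted(kept), used
```

Skipping oversized documents slightly favors short ones near the end of the pass, and the result can fall a little short of the budget; a stricter control would need to account for both.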
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential claims
The paper reports experimental results from training language models on C4 and other datasets before and after applying deduplication heuristics. All central claims (reduced verbatim memorization, faster convergence, lower train-test overlap) are direct measurements from these runs rather than outputs of any equation, fitted parameter, or uniqueness theorem. No load-bearing steps reduce to self-citation chains, ansatzes, or renamings; the work is self-contained against external benchmarks via the released code and datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Near-duplicates and repetitive substrings in training data are the primary cause of elevated verbatim memorization rates in language models.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  unclear: relation between the paper passage and the cited Recognition theorem.
  We develop two tools... Exact substring matching... Approximate full document matching uses hash-based techniques (Broder, 1997)...
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  unclear: relation between the paper passage and the cited Recognition theorem.
  Deduplication allows us to train models that emit memorized text ten times less frequently...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Quantifying Memorization Across Neural Language Models
Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
-
Improving language models by retrieving from trillions of tokens
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
Provable Knowledge Acquisition and Extraction in One-Layer Transformers
In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds ...
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
The False Promise of Imitating Proprietary LLMs
Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
-
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss, and yields efficiency gains on C4.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities that are present in large language models but absent in smaller ones, and that cannot be predicted by extrapolating the performance of smaller models.
-
Scaling Laws and Interpretability of Learning from Repeated Data
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions
Warned AI-assisted writers had their documents selected as human 54.13% of the time by judges versus 45.87% for unwarned writers, despite no measurable differences in text features.
-
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
-
Data-Centric Foundation Models in Computational Healthcare: A Survey
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
-
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
Reference graph
Works this paper leans on
-
[3]
Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pages 143--153
-
[4]
Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233--242. PMLR
-
[5]
Jack Bandy and Nicholas Vincent. 2021. http://arxiv.org/abs/2105.05241 Addressing "documentation debt" in machine learning research: A retrospective datasheet for bookcorpus
-
[6]
Emily M. Bender and Batya Friedman. 2018. https://doi.org/10.1162/tacl_a_00041 Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604
-
[7]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, page 610–623, New York, NY, ...
-
[8]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. http://github.com/eleutherai/gpt-neo GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow
-
[9]
Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426
-
[10]
Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21--29. IEEE
-
[11]
Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What does it mean for a language model to preserve privacy? arXiv preprint
-
[12]
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33
-
[14]
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005
-
[15]
Hung Chim and Xiaotie Deng. 2007. https://doi.org/10.1145/1242572.1242590 A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, page 121–130, New York, NY, USA. Association for Computing Machinery
-
[16]
Edith Cohen. 2016. http://www.cohenwang.com/edith/Surveys/minhash.pdf Min-hash sketches: A brief survey
-
[17]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860
-
[20]
Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems
-
[21]
Rodney A. Gabriel, Tsung-Ting Kuo, Julian McAuley, and Chun-Nan Hsu. 2018. https://doi.org/10.1016/j.jbi.2018.04.009 Identifying and characterizing highly similar notes in big clinical note datasets. Journal of Biomedical Informatics, 82:63--69
-
[22]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027
-
[24]
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34
-
[25]
Mandy Guo, Zihang Dai, Denny Vrandecic, and Rami Al-Rfou. 2020. http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.296.pdf Wiki-40b: Multilingual language model dataset. In LREC 2020
-
[26]
Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 901--910
-
[27]
Paul Jaccard. 1912. The distribution of the flora in the alpine zone. New phytologist, 11(2):37--50
-
[28]
Juha Kärkkäinen and Peter Sanders. 2003. Simple linear work suffix array construction. In International colloquium on automata, languages, and programming, pages 943--955. Springer
-
[29]
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Annual Symposium on Combinatorial Pattern Matching, pages 200--210. Springer
-
[30]
Udi Manber and Gene Myers. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935--948
-
[31]
Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In 2009 data compression conference, pages 193--202. IEEE
-
[32]
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. http://arxiv.org/abs/2104.10350 Carbon emissions and large neural network training
-
[33]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9
-
[34]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67
-
[35]
Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596--4604. PMLR
-
[37]
Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3--18. IEEE
-
[38]
Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. 2021. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations
-
[39]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. http://arxiv.org/abs/1906.02243 Energy and policy considerations for deep learning in nlp
-
[40]
Piotr Teterwak, Chiyuan Zhang, Dilip Krishnan, and Michael C Mozer. 2021. Understanding invariance via feedforward inversion of discriminatively trained classifiers. In International Conference on Machine Learning, pages 10225--10235. PMLR
-
[42]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762
-
[43]
Yannick Versley and Yana Panchenko. 2012. Not just bigger: Towards better-quality web corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7), pages 44--52
-
[45]
Ryan Webster, Julien Rabin, Loïc Simon, and Frédéric Jurie. 2019. https://doi.org/10.1109/CVPR.2019.01153 Detecting overfitting of deep generative networks via latent recovery. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11265--11274
-
[46]
Peter Weiner. 1973. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (swat 1973), pages 1--11. IEEE
-
[48]
Mikio Yamamoto and Kenneth W Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1--30
-
[50]
Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie...
-
[51]
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27
-
[52]
Jakub Łącki, Vahab Mirrokni, and Michał Włodarczyk. 2018. http://arxiv.org/abs/1807.10727 Connected components at scale via local contractions