Deduplicating Training Data Makes Language Models Better

Andrew Nystrom; Chiyuan Zhang; Chris Callison-Burch; Daphne Ippolito; Douglas Eck; Katherine Lee; Nicholas Carlini

arxiv: 2107.06499 · v2 · pith:KGI2ZLF7new · submitted 2021-07-14 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-24 13:35 UTC · model grok-4.3

keywords language models · deduplication · memorization · training data · evaluation · data cleaning · near-duplicates

The pith

Deduplicating training datasets reduces language model memorization by a factor of ten.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard language modeling datasets contain many near-duplicate examples and long repetitive substrings, which cause models to copy more than 1 percent of their unprompted output directly from training data. By building tools to remove these repeats, the authors demonstrate that the resulting models produce memorized text ten times less often while reaching target accuracy in fewer training steps. The same process also shrinks train-test overlap that affects more than 4 percent of typical validation examples, yielding cleaner performance measurements. A sympathetic reader would care because the work isolates a concrete data-cleaning step that simultaneously improves efficiency, reduces leakage, and strengthens evaluation.

Core claim

Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. Removing these duplicates with the authors' tools produces models that emit memorized text ten times less frequently, require fewer train steps to reach the same or better accuracy, and exhibit reduced train-test overlap that affects over 4 percent of standard validation sets.

What carries the argument

The deduplication process that identifies and removes near-duplicate examples together with long repetitive substrings from training corpora.
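
To make the near-duplicate half of that process concrete, here is a minimal sketch of MinHash-style similarity estimation, the family of hash-based techniques the paper attributes to Broder (1997). The shingle size, the number of hash functions, the BLAKE2b-based hashing, and every function name are assumptions chosen for illustration; this is not the authors' released tool.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Return the set of space-joined word n-grams in `text` (n is illustrative)."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """Keep the minimum hash value under each of `num_hashes` salted hashes."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # hypothetical seeding scheme
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated similarity exceeds some threshold would be treated as near-duplicates; the threshold itself is a tuning choice that this sketch leaves open.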

If this is right

Models emit memorized text ten times less frequently after deduplication.
Fewer training steps suffice to reach the same or better accuracy.
Train-test overlap drops, allowing more reliable evaluation of model quality.
A single repeated sentence can be removed from a corpus even when it appears over 60,000 times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standard data pipelines for language models may need routine deduplication to limit unintended copying of training content.
The same cleaning step could be tested on non-language tasks to check whether repetition removal yields similar efficiency gains.
Reduced memorization might lower the risk of models reproducing private or copyrighted material present in the original data.
Future experiments could vary the strictness of deduplication thresholds to measure the point at which further removal begins to hurt diversity.

Load-bearing premise

The observed drops in memorization and training steps are produced by the removal of duplicates themselves rather than by incidental shifts in data distribution or training dynamics.

What would settle it

Train identical model architectures on the original dataset and on its deduplicated version, then compare the fraction of unprompted generations that match training text verbatim and the number of steps needed to reach a fixed validation accuracy.
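
The comparison described above needs a way to score generations for verbatim overlap with the training set. The following is a small sketch of one way to do that over token-id sequences. The in-memory set of k-token windows, the function names, and the default window length are assumptions (50 mirrors the threshold mentioned in the Figure 5 caption); a corpus at C4 scale would need a more compact index.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngram_index(docs: Iterable[Sequence[int]], k: int) -> Set[Tuple[int, ...]]:
    """Collect every length-k token window that occurs in the training documents."""
    index: Set[Tuple[int, ...]] = set()
    for doc in docs:
        for i in range(len(doc) - k + 1):
            index.add(tuple(doc[i:i + k]))
    return index


def is_memorized(generation: Sequence[int], index: Set[Tuple[int, ...]], k: int) -> bool:
    """True if any length-k window of the generation appears verbatim in the index."""
    return any(
        tuple(generation[i:i + k]) in index
        for i in range(len(generation) - k + 1)
    )


def memorized_fraction(
    generations: Iterable[Sequence[int]],
    train_docs: Iterable[Sequence[int]],
    k: int = 50,
) -> float:
    """Fraction of generations containing at least one verbatim k-token match."""
    gens = list(generations)
    if not gens:
        return 0.0
    index = ngram_index(train_docs, k)
    return sum(is_memorized(g, index, k) for g in gens) / len(gens)
```

Running `memorized_fraction` once against a model trained on the original data and once against a model trained on the deduplicated data, with the same `k`, gives the two numbers the comparison calls for.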

Figures

Figures reproduced from arXiv: 2107.06499 by Andrew Nystrom, Chiyuan Zhang, Chris Callison-Burch, Daphne Ippolito, Douglas Eck, Katherine Lee, Nicholas Carlini.

**Figure 2.** Impact of deduplicating the training set on …

**Figure 3.** The proportion of generations which have …

**Figure 4.** Histograms of document similarities. … don't build the data S = S1 || S2 but rather let S1′ = S1 || S2[up to K] for some K greater than the longest substring match. Then we build the arrays on S1′ and S2. To merge the arrays together we can remove the items from the first array after index |S1| and merge-sort insert them into the second. Parallel merge of partial suffix arrays. We now merge these separate…

**Figure 5.** For each substring of length k, we plot the probability that there exists a second identical length-k substring in the same train set. Matches with length under 10 subword tokens are common, and account for 90% of tokens. We choose a threshold of 50 for experiments. … one; formally: m(k) = Pr_{i∈[N]} …

**Figure 6.** Memorized continuations distribution … train examples without any duplicates, validation examples with duplicates in train, and validation examples without any duplicates. URLs with many duplicates …

**Figure 7.** Impact of deduplicating the training set on validation perplexity. In …

**Figure 8.** The distribution of near-duplicate cluster …
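
The quantity in the Figure 5 caption, the chance that a length-k substring has a second identical copy in the same set, can be illustrated on a toy token sequence. This sketch uses a naive suffix array built by sorting suffixes directly, which is quadratic or worse and only suitable for small inputs; the linear-time constructions and the parallel merge mentioned in the captions are not reproduced here. Function names are hypothetical.

```python
from typing import List, Sequence


def suffix_array(tokens: Sequence[int]) -> List[int]:
    """Naive construction: sort suffix start positions by the suffix itself."""
    toks = list(tokens)
    return sorted(range(len(toks)), key=lambda i: toks[i:])


def repeated_positions(tokens: Sequence[int], k: int) -> List[int]:
    """Start positions whose length-k window also occurs at another position.

    Suffixes that share a length-k prefix sit next to each other in the
    suffix array, so comparing adjacent entries is enough to find them.
    """
    toks = list(tokens)
    n = len(toks)
    sa = suffix_array(toks)
    repeated = set()
    for a, b in zip(sa, sa[1:]):
        if n - a >= k and n - b >= k and toks[a:a + k] == toks[b:b + k]:
            repeated.update((a, b))
    return sorted(repeated)


def repeat_rate(tokens: Sequence[int], k: int) -> float:
    """Fraction of length-k windows that have at least one identical copy."""
    n_windows = len(tokens) - k + 1
    if n_windows <= 0:
        return 0.0
    return len(repeated_positions(tokens, k)) / n_windows
```

For the sequence `[1, 2, 3, 9, 1, 2, 3]` with `k = 3`, the windows starting at positions 0 and 4 are identical, so two of the five windows are repeated.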

Original abstract

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Deduplication cuts verbatim memorization roughly 10x on C4-scale data and ships usable tools, but the experiments do not yet separate duplicate removal from the distribution shifts it creates.

The core result here is that standard LM corpora contain heavy duplication, leading to over 1% verbatim copying in model outputs, and that removing those duplicates drops the rate by about 10x while also trimming the steps needed to reach target accuracy. They also flag that over 4% of validation examples overlap with training data in common benchmarks. The practical tools for finding and stripping near-duplicates at C4 scale, plus the released code, are the clearest new pieces; prior work had noted duplication in other domains but not quantified its effect on LM memorization at this granularity or supplied drop-in fixes for the datasets people actually use.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard language modeling datasets such as C4 contain extensive near-duplicates and long repetitive substrings, causing trained models to emit verbatim memorized text in over 1% of unprompted outputs. The authors introduce two deduplication tools, apply them to remove highly repeated content (e.g., a 61-word sentence repeated >60k times), and report that models trained on the resulting deduplicated data emit memorized text ten times less frequently, reach equivalent or better accuracy in fewer training steps, and exhibit reduced train-test overlap (affecting >4% of validation sets). Code for deduplication and reproduction is released.

Significance. If the central empirical results hold after addressing controls, the work is significant because it identifies a pervasive data-quality issue in LM pretraining corpora and supplies practical, open-source tools that measurably reduce memorization while improving training efficiency and evaluation validity. The public release of code is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.
[§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.

minor comments (2)

[§3] The paper would benefit from an explicit statement of the deduplication thresholds and hash parameters used for the C4 experiments so that the exact data reduction can be reproduced from the released code.
[§5] Table or figure captions reporting the 10× memorization reduction should include the exact number of evaluation prompts and the definition of “emitted memorized text” for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and outline revisions to strengthen the causal claims in the manuscript.

Referee: [Abstract and §4 (Experiments)] The central claim that deduplication itself produces the 10× drop in verbatim memorization and the reduction in required training steps rests on a direct comparison of original C4 versus deduplicated C4; no ablation is reported that applies an equivalent reduction in dataset size or matches n-gram/token-frequency distributions while preserving duplicates. Without such a control, the observed effects remain consistent with incidental distribution shifts rather than the removal of duplicates per se.

Authors: We agree that the absence of a size-matched or frequency-matched control leaves open the possibility that some effects could arise from distribution shifts rather than duplicate removal alone. In the revised manuscript we will add an ablation that randomly subsamples the original C4 to the same token count as the deduplicated version and report memorization rates, downstream accuracy, and training efficiency for direct comparison. This will isolate the contribution of duplicate removal from simple size reduction. Revision: yes

Referee: [§3 (Deduplication tools) and §5 (Results on memorization)] The quantitative claim of a reduction “from >1% to 0.1%” verbatim copying is load-bearing for the main result, yet the precise measurement protocol (prompting strategy, length of copied spans, exact definition of “verbatim”) is not cross-checked against a frequency-matched non-deduplicated baseline, leaving the causal attribution under-supported.

Authors: Section 5 already specifies the evaluation protocol (100-token prompts, exact 50-token overlap detection, and the >1% to 0.1% figures). We nevertheless accept that a frequency-matched non-deduplicated baseline would provide stronger evidence that the reduction is attributable to duplicate removal. In revision we will construct such a baseline by re-weighting the original C4 to preserve n-gram statistics while retaining duplicates, rerun the memorization evaluation, and include the results. Revision: yes
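
The size-matched control proposed in the first response could be set up along the following lines. This is only a sketch of the idea, assuming documents are already tokenized and that sampling happens at the document level; the function name, the seed handling, and the rule of stopping once the token budget is reached (which can overshoot by at most one document) are illustrative choices, not something the paper or the rebuttal specifies.

```python
import random
from typing import List, Sequence


def size_matched_subsample(
    docs: Sequence[Sequence[int]],
    target_tokens: int,
    seed: int = 0,
) -> List[int]:
    """Pick random whole documents until the token budget is reached.

    Returns the sorted indices of the chosen documents. Duplicates in `docs`
    are left in place, so the sample shrinks the corpus without deduplicating it.
    """
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the control reproducible
    chosen: List[int] = []
    total = 0
    for idx in order:
        if total >= target_tokens:
            break
        chosen.append(idx)
        total += len(docs[idx])
    return sorted(chosen)
```

Setting `target_tokens` to the token count of the deduplicated corpus yields a control set of roughly the same size that still contains duplicates, so any remaining gap in memorization can be attributed to duplicate removal rather than to having less data.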

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential claims

The paper reports experimental results from training language models on C4 and other datasets before and after applying deduplication heuristics. All central claims (reduced verbatim memorization, faster convergence, lower train-test overlap) are direct measurements from these runs rather than outputs of any equation, fitted parameter, or uniqueness theorem. No load-bearing steps reduce to self-citation chains, ansatzes, or renamings; the work is self-contained against external benchmarks via the released code and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the empirical observation that duplicates drive memorization and on the assumption that the deduplication procedure isolates that causal factor without introducing new biases.

axioms (1)

Domain assumption: Near-duplicates and repetitive substrings in training data are the primary cause of elevated verbatim memorization rates in language models.
This premise directly links the dataset analysis to the reported 10x reduction in copied output.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We develop two tools... Exact substring matching... Approximate full document matching uses hash-based techniques (Broder, 1997)...

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Deduplication allows us to train models that emit memorized text ten times less frequently...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
cs.LG 2025-09 unverdicted novelty 7.0

Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI 2024-06 conditional novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
Improving language models by retrieving from trillions of tokens
cs.CL 2021-12 unverdicted novelty 7.0

RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG 2021-10 conditional novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Provable Knowledge Acquisition and Extraction in One-Layer Transformers
cs.LG 2025-07 unverdicted novelty 6.0

In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds ...
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI 2025-07 conditional novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
cs.CV 2024-03 conditional novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0

SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Scaling Laws and Interpretability of Learning from Repeated Data
cs.LG 2022-05 accept novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
cs.CL 2022-04 accept novelty 6.0

GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions
cs.HC 2026-04 unverdicted novelty 5.0

Warned AI-assisted writers had their documents selected as human 54.13% of the time by judges versus 45.87% for unwarned writers, despite no measurable differences in text features.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI 2023-08 accept novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
cs.CL 2026-05 unverdicted novelty 4.0

Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
cs.CL 2026-05 unverdicted novelty 4.0

Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
cs.CL 2025-09 unverdicted novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG 2024-01 unverdicted novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC 2025-03 unverdicted novelty 2.0

Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 31 Pith papers · 7 internal anchors

[3]

Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pages 143--153

work page 2019
[4]

Devansh Arpit, Stanis aw Jastrz e bski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233--242. PMLR

work page 2017
[5]

documentation debt

Jack Bandy and Nicholas Vincent. 2021. http://arxiv.org/abs/2105.05241 Addressing "documentation debt" in machine learning research: A retrospective datasheet for bookcorpus

work page arXiv 2021
[6]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. 2018. https://doi.org/10.1162/tacl_a_00041 Data statements for natural language processing: Toward mitigating system bias and enabling better science . Transactions of the Association for Computational Linguistics, 6:587--604

work page doi:10.1162/tacl_a_00041 2018
[7]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? -5pt [scale=0.1] parrot.png . In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, page 610–623, New York, NY, ...

work page doi:10.1145/3442188.3445922 2021
[8]

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. http://github.com/eleutherai/gpt-neo GPT-Neo : Large scale autoregressive language modeling with mesh-tensorflow

work page 2021
[9]

Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426

work page 1970
[10]

Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21--29. IEEE

work page 1997
[11]

Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What does it mean for a language model to preserve privacy? arXiv preprint

work page 2022
[12]

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33

work page 2020
[13]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. http://arxiv.org/abs/2012.07805 Extracting training data from large language models

work page arXiv 2020
[14]

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005

work page internal anchor Pith review Pith/arXiv arXiv 2013
[15]

Hung Chim and Xiaotie Deng. 2007. https://doi.org/10.1145/1242572.1242590 A new suffix tree similarity measure for document clustering . In Proceedings of the 16th International Conference on World Wide Web, WWW '07, page 121–130, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/1242572.1242590 2007
[16]

Edith Cohen. 2016. http://www.cohenwang.com/edith/Surveys/minhash.pdf Min-hash sketches: A brief survey

work page 2016
[17]

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

work page internal anchor Pith review Pith/arXiv arXiv 2019
[19]

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021 b . http://arxiv.org/abs/2104.08758 Documenting the english colossal clean crawled corpus . arXiv preprint arXiv:2104.08758

work page arXiv 2021
[20]

Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems

work page 2020
[21]

Gabriel, Tsung-Ting Kuo, Julian McAuley, and Chun-Nan Hsu

Rodney A. Gabriel, Tsung-Ting Kuo, Julian McAuley, and Chun-Nan Hsu. 2018. https://doi.org/https://doi.org/10.1016/j.jbi.2018.04.009 Identifying and characterizing highly similar notes in big clinical note datasets . Journal of Biomedical Informatics, 82:63--69

work page doi:10.1016/j.jbi.2018.04.009 2018
[22]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The P ile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020
[23]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III au2, and Kate Crawford. 2020. http://arxiv.org/abs/1803.09010 Datasheets for datasets

work page arXiv 2020
[24]

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34

work page 2003
[25]

Mandy Guo, Zihang Dai, Denny Vrandecic, and Rami Al-Rfou. 2020. http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.296.pdf Wiki-40b: Multilingual language model dataset . In LREC 2020

work page 2020
[26]

Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 901--910

work page 2020
[27]

Paul Jaccard. 1912. The distribution of the flora in the alpine zone. New phytologist, 11(2):37--50

work page 1912
[28]

Juha K \"a rkk \"a inen and Peter Sanders. 2003. Simple linear work suffix array construction. In International colloquium on automata, languages, and programming, pages 943--955. Springer

work page 2003
[29]

Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Annual Symposium on Combinatorial Pattern Matching, pages 200--210. Springer

work page 2003
[30]

Udi Manber and Gene Myers. 1993. Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5):935--948

work page 1993
[31]

Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In 2009 data compression conference, pages 193--202. IEEE

[32]

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training. http://arxiv.org/abs/2104.10350

[33]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

[34]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67. http://jmlr.org/papers/v21/20-074.html

[35]

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596--4604. PMLR

[36]

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. arXiv preprint arXiv:2005.00268

[37]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3--18. IEEE

[38]

Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. 2021. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations

[39]

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. http://arxiv.org/abs/1906.02243

[40]

Piotr Teterwak, Chiyuan Zhang, Dilip Krishnan, and Michael C Mozer. 2021. Understanding invariance via feedforward inversion of discriminatively trained classifiers. In International Conference on Machine Learning, pages 10225--10235. PMLR

[41]

Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847

[42]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762

[43]

Yannick Versley and Yana Panchenko. 2012. Not just bigger: Towards better-quality web corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7), pages 44--52

[44]

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125

[45]

Ryan Webster, Julien Rabin, Loïc Simon, and Frédéric Jurie. 2019. Detecting overfitting of deep generative networks via latent recovery. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11265--11274. https://doi.org/10.1109/CVPR.2019.01153

[46]

Peter Weiner. 1973. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1--11. IEEE

[47]

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934

[48]

Mikio Yamamoto and Kenneth W Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1--30

[49]

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616

[50]

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie...

[51]

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27

[52]

Jakub Łącki, Vahab Mirrokni, and Michał Włodarczyk. 2018. Connected components at scale via local contractions. http://arxiv.org/abs/1807.10727



[3]

Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pages 143--153

[4]

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233--242. PMLR

[5]

Jack Bandy and Nicholas Vincent. 2021. Addressing "documentation debt" in machine learning research: A retrospective datasheet for BookCorpus. http://arxiv.org/abs/2105.05241

[6]

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604. https://doi.org/10.1162/tacl_a_00041

[7]

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610--623, New York, NY, ... https://doi.org/10.1145/3442188.3445922

[8]

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. http://github.com/eleutherai/gpt-neo

[9]

Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426

[10]

Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21--29. IEEE

[11]

Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What does it mean for a language model to preserve privacy? arXiv preprint

[12]

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33

[13]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models. http://arxiv.org/abs/2012.07805

[14]

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005

[15]

Hung Chim and Xiaotie Deng. 2007. A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 121--130, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/1242572.1242590

[16]

Edith Cohen. 2016. Min-hash sketches: A brief survey. http://www.cohenwang.com/edith/Surveys/minhash.pdf

[17]

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

[19]

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021b. Documenting the English colossal clean crawled corpus. arXiv preprint arXiv:2104.08758. http://arxiv.org/abs/2104.08758

[20]

Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems

[21]

Rodney A. Gabriel, Tsung-Ting Kuo, Julian McAuley, and Chun-Nan Hsu. 2018. Identifying and characterizing highly similar notes in big clinical note datasets. Journal of Biomedical Informatics, 82:63--69. https://doi.org/10.1016/j.jbi.2018.04.009
