Quantifying Memorization Across Neural Language Models
Pith reviewed 2026-05-13 22:00 UTC · model grok-4.3
The pith
Memorization in language models increases log-linearly with model size, data duplication, and prompt length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes more complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations.
What carries the argument
log-linear relationships quantifying memorization rate as a function of model capacity, duplication count, and prompt context length
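As a rough illustration of what a log-linear relationship means here, the sketch below fits rate = a + b * log10(x) by ordinary least squares, where x could be parameter count, duplication count, or prompt length. The function name fit_log_linear and all numbers are made up for this example; they are not results from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_log_linear(xs, ys):
    """Least-squares fit of y = a + b * log10(x); returns (a, b)."""
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((lx - mean_x) ** 2 for lx in logs)
    sxy = sum((lx - mean_x) * (y - mean_y) for lx, y in zip(logs, ys))
    b = sxy / sxx            # change in rate per tenfold increase in x
    a = mean_y - b * mean_x  # intercept at log10(x) = 0
    return a, b

# Hypothetical model sizes and memorization rates (illustrative only).
params = [1e8, 1e9, 1e10]
rates = [0.01, 0.02, 0.03]
a, b = fit_log_linear(params, rates)
```

In this toy data each tenfold increase in size adds the same fixed amount to the rate, which is the defining property of a log-linear trend.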
If this is right
- Larger models will emit more memorized training data verbatim.
- Training examples that appear multiple times are memorized at higher rates.
- Longer context prompts increase the rate at which memorized sequences are emitted.
- The precise scaling behavior differs across distinct model families.
Where Pith is reading between the lines
- Training data pipelines may need systematic deduplication to slow the growth of memorization.
- Privacy protections for user data in training sets will require active interventions rather than relying on scale alone.
- The trends could be tested on future models to confirm whether they persist beyond current sizes.
Load-bearing premise
That verbatim emission under the chosen prompting and matching criteria accurately captures the privacy, utility, and fairness harms, and that the log-linear trends will continue to hold at larger scales without additional confounding factors.
What would settle it
Measuring the memorization rate on a model with twice the capacity of the largest tested model and checking whether it continues to follow the same log-linear increase.
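One way to make this test concrete, assuming the fitted relationship takes the form below (the symbols r, N, \alpha and \beta are introduced here for illustration and are not the paper's notation): if r is the memorization rate and N the model capacity, doubling N should raise the rate by a fixed increment that depends only on the fitted slope.

```latex
r(N) = \alpha + \beta \log_{10} N
\quad\Longrightarrow\quad
r(2N) - r(N) = \beta \log_{10} 2 \approx 0.301\,\beta
```

A measured increment far from this value on the larger model would indicate that the trend does not extrapolate.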
Original abstract
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes more complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that large language models emit memorized training data verbatim when prompted appropriately, and identifies three log-linear relationships quantifying this: memorization increases with model capacity, the number of times a training example is duplicated, and the number of context tokens used in the prompt. It reports that these trends hold within model families but become more complicated across families, concluding that memorization is more prevalent than previously believed and will likely worsen with continued scaling absent mitigations.
Significance. If the reported log-linear trends are robust, the work supplies a quantitative basis for predicting memorization risks as models scale, directly relevant to privacy, utility, and fairness concerns in LM deployment. The empirical framing across multiple model families and duplication regimes strengthens its potential impact on understanding scaling laws for memorization.
major comments (3)
- [Methods (memorization measurement and prompting procedure)] The central operationalization of memorization (exact string match between model output and training example after a k-token prefix prompt) is load-bearing for all three log-linear claims; the manuscript should include sensitivity checks on the matching threshold, decoding method (e.g., greedy vs. sampling), and prefix selection strategy, as these choices could artifactually produce or alter the reported slopes.
- [Results (cross-family comparison) and Discussion] The abstract notes that results become complicated when generalizing across model families, yet the manuscript provides limited analysis of potential confounders such as optimizer choice, data ordering, or regularization; without such controls, the within-family log-linear fits cannot reliably support claims of generality or predict behavior at larger scales.
- [Experimental results (capacity, duplication, and context scaling plots)] The log-linear relationships are fitted directly to the observed emission rates; the paper should report goodness-of-fit statistics, confidence intervals on the slopes, and any ablation on the duplication-count and context-length regimes to confirm the trends are not driven by a small number of high-duplication outliers.
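The operationalization discussed in the first major comment (prompt with a k-token prefix, then compare the output to the true continuation by exact match) can be sketched roughly as follows. This is a minimal sketch under assumptions: generate stands in for any decoding routine, the lookup-table toy_generate is a hypothetical stand-in for a model, and none of the names come from the paper.

```python
from typing import Callable, List, Sequence

Generate = Callable[[List[int], int], List[int]]

def is_extractable(generate: Generate, example: Sequence[int],
                   prefix_len: int, suffix_len: int) -> bool:
    """True if the continuation of the prefix equals the true suffix exactly."""
    prefix = list(example[:prefix_len])
    true_suffix = list(example[prefix_len:prefix_len + suffix_len])
    if len(true_suffix) < suffix_len:
        return False  # example too short to evaluate
    continuation = list(generate(prefix, suffix_len))
    return continuation[:suffix_len] == true_suffix

def extraction_rate(generate: Generate, examples: Sequence[Sequence[int]],
                    prefix_len: int, suffix_len: int) -> float:
    """Fraction of examples whose suffix is reproduced verbatim."""
    hits = sum(is_extractable(generate, ex, prefix_len, suffix_len)
               for ex in examples)
    return hits / len(examples)

def toy_generate(prefix: List[int], n: int) -> List[int]:
    """Hypothetical 'model' that has memorized exactly one token sequence."""
    memorized = [1, 2, 3, 4, 5, 6]
    if prefix == memorized[:len(prefix)]:
        return memorized[len(prefix):len(prefix) + n]
    return [0] * n

examples = [[1, 2, 3, 4, 5, 6], [9, 8, 7, 6, 5, 4]]
rate = extraction_rate(toy_generate, examples, prefix_len=3, suffix_len=3)
```

Swapping in a different generate function (for example, sampling instead of greedy decoding) or a different prefix_len is how the sensitivity checks requested by the referee could be run against the same metric.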
minor comments (2)
- [Figures 2-4] Figure axes and legends should explicitly label the log scales and indicate the exact matching criterion used for each data point to improve readability.
- [Related Work] The related-work section should more explicitly contrast the chosen exact-match criterion with prior definitions of memorization that incorporate semantic similarity or partial matches.
Simulated Author's Rebuttal
Thank you for the detailed and constructive referee report. We appreciate the suggestions for strengthening the manuscript and have revised it to address the major comments as detailed below.
-
Referee: The central operationalization of memorization (exact string match between model output and training example after a k-token prefix prompt) is load-bearing for all three log-linear claims; the manuscript should include sensitivity checks on the matching threshold, decoding method (e.g., greedy vs. sampling), and prefix selection strategy, as these choices could artifactually produce or alter the reported slopes.
Authors: We agree that the definition of memorization is central to our results. In the revised manuscript, we have added a new appendix section with sensitivity analyses on the matching threshold (comparing exact match to edit-distance thresholds of 1-5 tokens), decoding strategies (greedy vs. top-p sampling with p=0.9), and prefix selection (randomly sampled prefixes vs. the original fixed ones). These checks confirm that the log-linear trends persist across variations, although absolute emission rates shift modestly; the slopes remain within 10% of the original values.
Revision: yes
-
Referee: The abstract notes that results become complicated when generalizing across model families, yet the manuscript provides limited analysis of potential confounders such as optimizer choice, data ordering, or regularization; without such controls, the within-family log-linear fits cannot reliably support claims of generality or predict behavior at larger scales.
Authors: We acknowledge the difficulty of cross-family generalization and the potential role of confounders. The manuscript already highlights this complication in the abstract and Section 5. Performing fully controlled retraining experiments across families (matching optimizer, data order, and regularization) is infeasible within the scope of this study due to the prohibitive compute cost of training multiple large models from scratch. We have expanded the discussion section to more explicitly caution against overgeneralization and to frame the within-family results as the primary, more reliable contribution.
Revision: partial
-
Referee: The log-linear relationships are fitted directly to the observed emission rates; the paper should report goodness-of-fit statistics, confidence intervals on the slopes, and any ablation on the duplication-count and context-length regimes to confirm the trends are not driven by a small number of high-duplication outliers.
Authors: We have updated all scaling plots to include R² goodness-of-fit values and 95% confidence intervals on the fitted slopes. We also added an ablation study (now in the appendix) that removes the top 5% of highest-duplication examples and refits the lines; the log-linear trends remain statistically significant with only minor changes to the slopes. Similar ablations for context-length regimes are included.
Revision: yes
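The kind of check described in the last response (a goodness-of-fit value plus a refit after dropping the highest-duplication points) might look like the sketch below. It is an assumption-laden illustration: the data are synthetic, the helper names are hypothetical, the trimming operates on plotted (duplication count, rate) points rather than raw training examples, and confidence intervals are not computed here.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of y = a + b * log10(x) over (x, y) points."""
    xs = [math.log10(x) for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def r_squared(points, a, b):
    """Coefficient of determination of the fitted line in log10(x) space."""
    ys = [y for _, y in points]
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * math.log10(x))) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def drop_top_fraction(points, frac=0.05):
    """Remove the ceil(frac * n) points with the largest x value."""
    k = math.ceil(frac * len(points))
    ordered = sorted(points, key=lambda p: p[0])
    return ordered[:len(ordered) - k]

# Synthetic (duplication count, extraction rate) pairs, illustrative only.
points = [(1, 0.0), (10, 0.1), (100, 0.2), (1000, 0.3)]
a, b = fit_log_linear(points)
r2 = r_squared(points, a, b)
trimmed = drop_top_fraction(points)
a_trim, b_trim = fit_log_linear(trimmed)
```

If the slope from the trimmed fit stays close to the original slope, the trend is unlikely to be driven by a handful of high-duplication outliers.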
Circularity Check
No significant circularity in empirical quantification of memorization trends
The paper reports three log-linear relationships as direct experimental observations obtained by training models of varying capacity, duplicating examples a controlled number of times, prompting with varying context lengths, and measuring exact string matches between outputs and training data. These measurements are not derived from parameters fitted to the same data in a self-referential loop, nor do they rely on self-citations for load-bearing uniqueness theorems or ansatzes. The findings are presented as empirical quantifications rather than first-principles derivations, making the reported trends independent of any circular reduction to their own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- memorization matching threshold
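To show why the matching threshold acts as a free parameter, the sketch below relaxes exact match to a token-level edit-distance tolerance, in the spirit of the edit-distance thresholds mentioned in the simulated rebuttal. The function names and the max_edits parameter are hypothetical, and this is not presented as the paper's definition of extractability.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cost = 0 if tok_a == tok_b else 1
            cur.append(min(prev[j] + 1,          # delete tok_a
                           cur[j - 1] + 1,       # insert tok_b
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

def matches_within(generated, reference, max_edits=0):
    """max_edits=0 reduces to exact match; larger values loosen the criterion."""
    return edit_distance(generated, reference) <= max_edits
```

With max_edits=0 a single wrong token makes a sequence count as not memorized, while max_edits=1 would count it, so the measured rate can only rise as the threshold is loosened.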
Forward citations
Cited by 39 Pith papers
-
Privacy Auditing with Zero (0) Training Run
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
-
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
RC-RAG boosts long-tail relation completion by infusing paraphrases into RAG stages, yielding up to 40.6 EM gains on benchmarks across five LLMs with no fine-tuning.
-
Memory Dial: A Training Framework for Controllable Memorization in Language Models
Memory Dial is a new training method that makes memorization pressure an explicit, controllable variable during language model training, with experiments showing increased accuracy on seen data while unseen performanc...
-
When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
-
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.
-
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
-
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
Scaling Laws and Interpretability of Learning from Repeated Data
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Runtime-Structured Task Decomposition for Agentic Coding Systems
Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
-
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
-
Gemma 3 Technical Report
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and ...
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Towards the Anonymization of the Language Modeling
Authors introduce MLM and CLM specialization methods that avoid memorizing identifiers in sensitive training data while aiming for a privacy-utility tradeoff on medical datasets.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
Data-Centric Foundation Models in Computational Healthcare: A Survey
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
Reference graph
Works this paper leans on
-
[1]
Deep learning with differential privacy
Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
-
[2]
Large-scale differentially private BERT
Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private BERT. arXiv preprint arXiv:2108.01624, 2021.
-
[3]
What does it mean for a language model to preserve privacy?
URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata. Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. What does it mean for a language model to preserve privacy?,
-
[4]
Extracting training data from large language models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, et al. Extracting training data from large language models. arXiv preprint arXiv:2012.07805,
-
[5]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
-
[6]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961,
-
[7]
Karan Ganju, Qi Wang, Wei Yang, Carl A Gunter, and Nikita Borisov. Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 619–633, 2018.
-
[8]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
-
[9]
Ethical challenges in data-driven dialogue systems
Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129, 2018.
-
[10]
Auditing differentially private machine learning: How private is private SGD?
Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private SGD? arXiv preprint arXiv:2006.07709,
-
[11]
Evaluating differentially private machine learning in practice
Published as a conference paper at ICLR 2023.
Bargav Jayaraman and David Evans. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pages 1895–1912,
-
[12]
Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. arXiv preprint arXiv:2202.06539,
-
[14]
URL https://arxiv.org/abs/2107.06499. R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. CoRR, abs/2111.09509,
-
[15]
URL https://arxiv.org/abs/2111.09509. Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary instantiation: Lower bounds for differentially private machine learning. arXiv preprint arXiv:2101.04535,
-
[16]
URL http://jmlr.org/papers/v21/20-074.html. Swaroop Ramaswamy, Om Thakkar, Rajiv Mathews, Galen Andrew, H Brendan McMahan, and Françoise Beaufays. Training production language models without memorizing user data. arXiv preprint arXiv:2009.10031,
-
[17]
Membership inference attacks against machine learning models
Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.
-
[18]
Privacy risk in machine learning: Analyzing the connection to overfitting
Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.
-
[19]
Counterfactual memorization in neural language models
Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938,
-
[20]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,
-
[21]
how many times is this sequence present in the training dataset
A IMPLEMENTATION DETAILS FOR DATASET CREATION
Intuitively speaking, it is straightforward to construct a dataset containing specifiable proportions of documents at various frequencies. We need only enumerate all sequences repeated various numbers of times, and then sample uniformly at random from each of the...
-
[22]
tokens. We do not see significant differences between the fraction of extractable tokens with varying prompt lengths across various sequence lengths.
Prompt Continuation (== 6B) 2.7B 1.3B 125M Gallery "Though defensive violence will always be 'a sad necessity' in the eyes of men of principle, it would be stil...
-
[23]
compared to sequences of length 100 (prompt length = 50). Alternate definition of extractability. Our main experiments report a sequence as “extractable” if the model’s generated continuation is identical to the true suffix within that training example. This method is a loose lower bound on memorization. Consider two sequences x1, x2 both contained in the t...
-
[24]
Prompt Continuation (== 6B) 2.7B 1.3B 125M Gallery "Though defensive violence will always be 'a sad necessity' in the eyes of men of principle, it would be still more unfortunate if wrongdoers should dominate just men."- St. Augustine "A new idea is first condemned as ridiculous, and then dismissed as trivial...
-
[25]
, such as Google, Bing and Yahoo!, use crawlers to find pages for their algorithmic search results
Prompt Continuation (== 6B) 2.7B 1.3B 125M _GPL(crypto_unregister_alg); int crypto_register_template(struct crypto_template *tmpl) { struct crypto_template *q; int err = -EEXIST; down_write(&crypto_alg_sem); list_for_each_entry(q, &crypto_template_list, list) { if (q == tmpl) list_for_each_entry(q, &crypto_a...
-
[26]
Prompt 6B 2.7B 1.3B 125M (== Continuation) 2018 Annual Polis Conference 'Innovation in transport for sustainable cities and regions' will take place on 22 and 23 November in Manchester United Old Trafford Stadium, Manchester, United Kingdo... The 2018 Annual Polis Conference 'Innovation in transport for sustainable cities and regions' will take place on 2...