pith. sign in

arxiv: 2606.06286 · v1 · pith:WSYXKACKnew · submitted 2026-06-04 · 💻 cs.CL · cs.AI

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Pith reviewed 2026-06-28 01:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords memorizationLLMspropensitytraining data leakageevaluation frameworkprefix attacksnon-adversarial prompts
0
0 comments X

The pith

LLMs leak training data much more under prefix attacks than under ordinary prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PropMe, a framework that measures not only whether language models can reproduce training data when forced, but how likely they are to do so under typical user interactions. It contrasts prefix-based attacks that maximize extraction with generic and dataset-specific prompts that mimic normal use. Across two open models and two large corpora, the authors show a clear gap: capability to leak is high when attacked, yet propensity remains low in non-adversarial settings. They also observe that continued pre-training on shifted data can lower both memorization and its propensity. The work concludes that audits need to report both worst-case extractability and ordinary leakage rates.

Core claim

Prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. Continued pre-training from one model to another on partially different data reduces both memorization capability and propensity for the original corpus.

What carries the argument

PropMe, a propensity-aware evaluation framework that applies a metric transformation to existing memorization functions to quantify leakage likelihood under non-adversarial prompts, paired with the SimpleTrace pipeline that attributes generations to training corpora via infini-gram.

If this is right

  • Memorization capability can decrease when later training emphasizes partially different data.
  • Audits that only measure worst-case extractability will overstate leakage risk in ordinary use.
  • Propensity metrics provide a distinct signal from raw capability metrics.
  • Lightweight tracing pipelines can scale attribution to large training corpora for both verbatim and transformed metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the low propensity holds across more models, standard extraction benchmarks may need recalibration to avoid overstating privacy risks in deployed systems.
  • The reduction in memorization after continued pre-training suggests a possible mitigation approach worth testing on additional architectures and data mixtures.
  • Extending the propensity transformation to other existing metrics could allow direct comparison of leakage tendencies across published studies.

Load-bearing premise

The chosen generic and dataset-specific prompts constitute a representative sample of ordinary, non-adversarial user interactions that would occur in deployment.

What would settle it

A large-scale collection of model outputs from users employing only generic or dataset-specific prompts that shows high rates of verbatim or near-verbatim reproduction of training data.

Figures

Figures reproduced from arXiv: 2606.06286 by Gianluca Barmina, Lukas Galke Poech, Peter Schneider-Kamp.

Figure 1
Figure 1. Figure 1: Left: PROPME framework overview with propensity and capability prompts, back-tracing to full training set and memorization/propensity measurements. Right: propensity metrics results for different combinations of models and dataset, this tells us what is the propensity of a given model to leak data of a certain dataset. The metrics used are defined and detailed in Sections 2, 3.2 4.3 requires integrity, con… view at source ↗
Figure 2
Figure 2. Figure 2: Evaluating overlapping between prompts and datasets across all prompt settings (Dynaword top, Common [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Memorization metrics for Dynaword across three training stages of the DFM Decoder model. [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Propensity scores for Dynaword across train [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Span length distributions for Dynaword across training stages and prompt settings. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Memorization metrics for Common Pile across three training stages of the DFM Decoder model. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Propensity scores for Common Pile across [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Span length distributions for Common Pile across training stages and prompt settings. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Span length distributions for Dynaword (DFM [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Span length distributions for Common Pile [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Span length distributions for Common Pile and Dynaword in DFM Decoder across generic, specific, and [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Span length distributions for Common Pile memorization in Comma and DFM Decoder across generic, [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
read the original abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, allows to create propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully-open models: Comma and DFM Decoder on two datasets: Common Pile and Dynaword in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the PropMe framework for propensity-aware memorization evaluation in LLMs, which contrasts prefix-based capability attacks with non-adversarial (generic and dataset-specific) prompts. It proposes a metric transformation applied to existing memorization functions to derive propensity scores, and presents SimpleTrace, a tracing pipeline using infini-gram for deterministic attribution of generations to training corpora. Evaluations on the fully-open Comma and DFM Decoder models across Common Pile and Dynaword datasets (in two languages) report a consistent gap: prefix attacks produce substantially stronger memorization signals while propensity scores remain low overall. The work also observes reduced memorization and propensity in DFM Decoder (continually pretrained from Comma) on Common Pile, and recommends that audits report both worst-case extractability and ordinary propensity.

Significance. If the central empirical gap holds after addressing prompt representativeness, the distinction between capability and propensity offers a practical refinement for assessing real-world leakage risks, moving beyond purely adversarial evaluations. The provision of an open tracing pipeline and results on fully-open models are concrete strengths that could support reproducible follow-up work. The observation that continual pretraining can lower memorization propensity is a falsifiable claim worth testing in other settings.

major comments (2)
  1. [Abstract] Abstract: The headline claim that models 'rarely' leak training data in 'more common non-adversarial settings' rests on the assumption that the chosen generic and dataset-specific prompts are representative of ordinary user interactions, yet no sampling procedure, coverage argument, or comparison against real query logs is supplied to justify this choice.
  2. [Evaluation] Evaluation section (implied by abstract results): Without the full specification of data exclusion rules, exact prompt templates, and statistical testing procedures for the reported gaps, it is impossible to determine whether the consistent capability-propensity separation is robust to post-hoc analysis choices or dataset construction decisions.
minor comments (1)
  1. [Abstract] The abstract states evaluations occur 'in two languages' but does not name the languages or break out results by language, reducing clarity on the scope of the findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The distinction between capability and propensity is central to our contribution, and we address the concerns about prompt representativeness and methodological transparency below.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that models 'rarely' leak training data in 'more common non-adversarial settings' rests on the assumption that the chosen generic and dataset-specific prompts are representative of ordinary user interactions, yet no sampling procedure, coverage argument, or comparison against real query logs is supplied to justify this choice.

    Authors: We agree that the abstract phrasing implies broader representativeness than our evaluation directly supports. The generic and dataset-specific prompts were constructed as illustrative non-adversarial baselines rather than a statistically sampled distribution of user queries. We will revise the abstract to qualify the claim (e.g., 'in the non-adversarial prompt settings we study') and add a limitations paragraph discussing the absence of real query-log validation while noting that the capability-propensity gap is robust across the two prompt families we tested. revision: yes

  2. Referee: [Evaluation] Without the full specification of data exclusion rules, exact prompt templates, and statistical testing procedures for the reported gaps, it is impossible to determine whether the consistent capability-propensity separation is robust to post-hoc analysis choices or dataset construction decisions.

    Authors: We will expand the Evaluation section (and add an appendix) with the complete data exclusion criteria, verbatim prompt templates for both generic and dataset-specific conditions, and the exact statistical procedures (including any multiple-comparison corrections) used to assess the gaps. These details exist in our internal experimental logs but were not fully reproduced in the submitted manuscript; the revision will make the pipeline fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies independent transformation to existing metrics.

full rationale

The paper defines PropMe as a framework that applies a proposed metric transformation to existing memorization functions in order to produce propensity metrics, then evaluates the resulting quantities on Comma and DFM Decoder using SimpleTrace on Common Pile and Dynaword. No equation or definition reduces the propensity scores to quantities fitted from the same evaluation data by construction, nor does any central claim rest on a self-citation chain, imported uniqueness theorem, or renamed known result. The gap between capability (prefix attacks) and propensity (generic/dataset-specific prompts) is an empirical comparison whose validity depends on prompt representativeness rather than on any definitional or fitting circularity. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The framework relies on the unstated assumption that infini-gram provides deterministic verbatim attribution at scale.

pith-pipeline@v0.9.1-grok · 5798 in / 990 out tokens · 24358 ms · 2026-06-28T01:38:38.684017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    arXiv preprint arXiv:2012.07805 , year=

    Extracting Training Data from Large Language Models , author=. arXiv preprint arXiv:2012.07805 , year=

  2. [2]

    Proceedings of the 40th International Conference on Machine Learning (ICML) , year=

    Quantifying Memorization Across Neural Language Models , author=. Proceedings of the 40th International Conference on Machine Learning (ICML) , year=

  3. [3]

    International Conference on Learning Representations (ICLR) , year=

    Detecting Pretraining Data from Large Language Models , author=. International Conference on Learning Representations (ICLR) , year=

  4. [4]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

    Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

  5. [5]

    Proceedings of the 17th International Natural Language Generation Conference (INLG) , year=

    A Comprehensive Analysis of Memorization in Large Language Models , author=. Proceedings of the 17th International Natural Language Generation Conference (INLG) , year=

  6. [6]

    arXiv preprint arXiv:2504.07096 , year=

    OLMOTRACE: Tracing Language Model Outputs Back to Trillions of Training Tokens , author=. arXiv preprint arXiv:2504.07096 , year=

  7. [7]

    arXiv preprint arXiv:2601.02671 , year=

    Extracting books from production language models , author=. arXiv preprint arXiv:2601.02671 , year=

  8. [8]

    ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

    Extracting memorized pieces of (copyrighted) books from open-weight language models , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

  9. [9]

    arXiv preprint arXiv:2310.13771 , year=

    Copyright Violations and Large Language Models , author=. arXiv preprint arXiv:2310.13771 , year=

  10. [10]

    International Conference on Learning Representations (ICLR) , year=

    Privacy Auditing of Large Language Models , author=. International Conference on Learning Representations (ICLR) , year=

  11. [11]

    Proceedings of the 39th International Conference on Machine Learning (ICML) , year=

    Deduplicating Training Data Mitigates Privacy Risks in Language Models , author=. Proceedings of the 39th International Conference on Machine Learning (ICML) , year=

  12. [12]

    , title =

    Wolfe, Cameron R. , title =. 2025 , note =

  13. [13]

    arXiv preprint arXiv:2401.17377 , year=

    Infini-gram: Scaling unbounded n-gram language models to a trillion tokens , author=. arXiv preprint arXiv:2401.17377 , year=

  14. [14]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Demystifying verbatim memorization in large language models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  15. [15]

    Alpaca against vicuna: Using llms to uncover memorization of llms , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  16. [16]

    Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference , year=

    Evaluating LLM Memorization Using Soft Token Sparsity , author=. Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference , year=

  17. [17]

    arXiv preprint arXiv:2407.14985 , year=

    Generalization vs Memorization: Tracing Language Models' Capabilities Back to Pretraining Data , author=. arXiv preprint arXiv:2407.14985 , year=

  18. [18]

    Analyzing memorization in large language models through the lens of model attribution , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  19. [19]

    arXiv preprint arXiv:2009.03300 , year=

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  20. [20]

    arXiv preprint arXiv:2508.02271 , year=

    Dynaword: From One-shot to Continuously Developed Datasets , author=. arXiv preprint arXiv:2508.02271 , year=

  21. [21]

    2022 IEEE symposium on security and privacy (SP) , pages=

    Membership inference attacks from first principles , author=. 2022 IEEE symposium on security and privacy (SP) , pages=. 2022 , organization=

  22. [22]

    Membership Inference Attacks Against Machine Learning Models , year=

    Shokri, Reza and Stronati, Marco and Song, Congzheng and Shmatikov, Vitaly , booktitle=. Membership Inference Attacks Against Machine Learning Models , year=

  23. [23]

    The Thirteenth International Conference on Learning Representations , year=

    Scalable extraction of training data from aligned, production language models , author=. The Thirteenth International Conference on Learning Representations , year=

  24. [24]

    arXiv preprint arXiv:2411.10242 , year=

    Measuring non-adversarial reproduction of training data in large language models , author=. arXiv preprint arXiv:2411.10242 , year=

  25. [25]

    Proceedings of the 16th International Natural Language Generation Conference , pages=

    Preventing generation of verbatim memorization in language models gives a false sense of privacy , author=. Proceedings of the 16th International Natural Language Generation Conference , pages=

  26. [26]

    1: An 8tb dataset of public domain and openly licensed text , author=

    The common pile v0. 1: An 8tb dataset of public domain and openly licensed text , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    2026 , month = apr, day =

  28. [28]

    arXiv preprint arXiv:2602.18182 , year =

    Capabilities Ain't All You Need: Measuring Propensities in AI , author =. arXiv preprint arXiv:2602.18182 , year =. doi:10.48550/arXiv.2602.18182 , archivePrefix =. 2602.18182 , primaryClass =

  29. [29]

    arXiv preprint arXiv:2603.00063 , year =

    Measuring What AI Systems Might Do: Towards A Measurement Science in AI , author =. arXiv preprint arXiv:2603.00063 , year =. doi:10.48550/arXiv.2603.00063 , archivePrefix =. 2603.00063 , primaryClass =

  30. [30]

    arXiv preprint arXiv:2305.15324 , year =

    Model Evaluation for Extreme Risks , author =. arXiv preprint arXiv:2305.15324 , year =. doi:10.48550/arXiv.2305.15324 , archivePrefix =. 2305.15324 , primaryClass =

  31. [31]

    Advances in Neural Information Processing Systems 37 , year =

    Stress-Testing Capability Elicitation With Password-Locked Models , author =. Advances in Neural Information Processing Systems 37 , year =

  32. [32]

    Proceedings of the 42nd International Conference on Machine Learning , year =

    The Elicitation Game: Evaluating Capability Elicitation Techniques , author =. Proceedings of the 42nd International Conference on Machine Learning , year =

  33. [33]

    arXiv preprint arXiv:2406.07358 , year =

    AI Sandbagging: Language Models can Strategically Underperform on Evaluations , author =. arXiv preprint arXiv:2406.07358 , year =. doi:10.48550/arXiv.2406.07358 , archivePrefix =. 2406.07358 , primaryClass =

  34. [34]

    arXiv preprint arXiv:2505.23836 , year =

    Large Language Models Often Know When They Are Being Evaluated , author =. arXiv preprint arXiv:2505.23836 , year =. doi:10.48550/arXiv.2505.23836 , archivePrefix =. 2505.23836 , primaryClass =

  35. [35]

    Frontier Models are Capable of In-context Scheming

    Frontier Models are Capable of In-context Scheming , author =. arXiv preprint arXiv:2412.04984 , year =. doi:10.48550/arXiv.2412.04984 , archivePrefix =. 2412.04984 , primaryClass =

  36. [36]

    arXiv preprint arXiv:2603.01608 , year =

    Evaluating and Understanding Scheming Propensity in LLM Agents , author =. arXiv preprint arXiv:2603.01608 , year =. doi:10.48550/arXiv.2603.01608 , archivePrefix =. 2603.01608 , primaryClass =

  37. [37]

    AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

    AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents , author =. arXiv preprint arXiv:2506.04018 , year =. doi:10.48550/arXiv.2506.04018 , archivePrefix =. 2506.04018 , primaryClass =

  38. [38]

    Propensity Inference: Environmental Contributors to LLM Behaviour

    Propensity Inference: Environmental Contributors to LLM Behaviour , author =. arXiv preprint arXiv:2604.21098 , year =. doi:10.48550/arXiv.2604.21098 , archivePrefix =. 2604.21098 , primaryClass =

  39. [39]

    2016 , date =

    Official Journal of the European Union , number =. 2016 , date =

  40. [40]

    2024 , date =

    Official Journal of the European Union , number =. 2024 , date =

  41. [41]

    2019 , date =

    Official Journal of the European Union , number =. 2019 , date =