
arxiv: 2606.10062 · v1 · pith:MEQ57VBYnew · submitted 2026-06-08 · 💻 cs.AI · cs.MA

Deployment-Time Memorization in Foundation-Model Agents

Pith reviewed 2026-06-27 16:19 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords foundation model agents · memorization · privacy · deletion · summarization · personalization · extraction risk

The pith

Key-fact summarization in agent memory reduces extraction by 76% on Gemma 3 12B while preserving nearly all personalization recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how memory-design choices in long-lived foundation-model agents jointly determine personalization utility, extraction risk, and deletion fidelity. It sweeps summarization aggressiveness, retrieval breadth, and deletion mode on the LongMemEval benchmark while tracking Personalization Recall, Adversarial Extraction Rate, and Forgetting Residue Score. Key-fact summarization cuts canary extraction sharply on two models without meaningful loss in recall utility. After compression, retrieving more items stops increasing leakage. Raw-only deletion still leaves summary copies recoverable in about 20% of cases, and only full-pipeline or tombstone deletion brings residue to zero.

Core claim

By treating agent memory as an explicit deployment-time function and measuring it on the privacy-utility frontier with Personalization Recall and Adversarial Extraction Rate, plus the new Forgetting Residue Score, the work shows that key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall. Once content is compressed away, increasing retrieval breadth no longer restores leakage. Raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero.
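The headline percentages read most naturally as relative reductions in AER between the unsummarized setting and key-fact summarization, although the summary above does not spell out the formula. Under that assumption, and writing S=0 and S=1 for the two settings as in the Figure 1 caption:

```latex
\[
\text{relative reduction} \;=\; 1 - \frac{\mathrm{AER}_{S=1}}{\mathrm{AER}_{S=0}}
\]
```

For example, a purely hypothetical drop from AER = 0.50 to AER = 0.12 would give 1 − 0.12/0.50 = 0.76, that is, 76%. These two rates are illustrative only and are not taken from the paper.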

What carries the argument

The privacy-utility frontier defined by Personalization Recall (PR) and Adversarial Extraction Rate (AER), extended by the Forgetting Residue Score (FRS) to track deletion across memory tiers, controlled through the three knobs of summarization aggressiveness, retrieval breadth k, and deletion mode.
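As a rough illustration of what sweeping the three knobs could look like, the sketch below enumerates a configuration grid. The k values are the ones listed in the Figure 1 caption; the summarization levels, the deletion-mode identifiers, and the `sweep_configs` helper are hypothetical names chosen for this example and are not the paper's code.

```python
from itertools import product

# k values come from the Figure 1 caption. Everything else is an assumption:
# the summary only says that S=0 and S>=1 exist, so the level list is a guess.
SUMMARIZATION_LEVELS = [0, 1, 2]  # hypothetical: 0 = raw, 1 = key-fact, 2 = more aggressive
RETRIEVAL_BREADTHS = [1, 3, 6, 10, 25]
DELETION_MODES = ["raw_only", "full_pipeline_purge", "tombstone_redaction"]


def sweep_configs():
    """Yield one dict per point of the (S, k, deletion mode) grid."""
    for s, k, mode in product(SUMMARIZATION_LEVELS, RETRIEVAL_BREADTHS, DELETION_MODES):
        yield {"S": s, "k": k, "deletion_mode": mode}


configs = list(sweep_configs())
print(len(configs), "configurations; first:", configs[0])
```

Each configuration would then be scored on PR, AER, and FRS, so that a full sweep yields one frontier point per setting.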

If this is right

  • Key-fact summarization can be applied to lower extraction risk substantially while retaining most personalization utility.
  • Once summarization is in place, increasing retrieval breadth no longer increases extraction risk.
  • Deletion must target derived memory copies, not only raw entries, to drive forgetting residue to zero.
  • Agent memory systems should be evaluated on recall, extractability, and erasability together rather than in isolation.
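The deletion point can be made concrete with a toy two-tier store. This is a minimal sketch under stated assumptions: the `TieredMemory` class, the mode names, and the substring-based `residue` check are all hypothetical stand-ins, and the paper's actual FRS definition and recoverability probe are not reproduced here.

```python
from dataclasses import dataclass, field

TOMBSTONE = "[REDACTED]"


@dataclass
class TieredMemory:
    """Toy store with a raw tier and a derived summary tier (assumed layout)."""
    raw: dict = field(default_factory=dict)      # entry_id -> original text
    summary: dict = field(default_factory=dict)  # entry_id -> derived key-fact text

    def write(self, entry_id, text, key_fact):
        self.raw[entry_id] = text
        self.summary[entry_id] = key_fact

    def delete(self, entry_id, mode):
        if mode == "raw_only":
            # Removes the source entry but leaves its derived copy behind.
            self.raw.pop(entry_id, None)
        elif mode == "full_pipeline":
            # Removes the entry from every tier.
            self.raw.pop(entry_id, None)
            self.summary.pop(entry_id, None)
        elif mode == "tombstone":
            # Keeps the slot but overwrites the content in every tier.
            for tier in (self.raw, self.summary):
                if entry_id in tier:
                    tier[entry_id] = TOMBSTONE
        else:
            raise ValueError(f"unknown deletion mode: {mode}")


def residue(mem, secrets):
    """Worst-tier fraction of deleted secrets still present as a substring."""
    if not secrets:
        return 0.0
    worst = 0.0
    for tier in (mem.raw, mem.summary):
        hits = sum(any(s in text for text in tier.values()) for s in secrets)
        worst = max(worst, hits / len(secrets))
    return worst


def run(mode):
    mem = TieredMemory()
    mem.write("e1", "My passport number is CANARY-1234.", "passport: CANARY-1234")
    mem.write("e2", "I prefer window seats.", "prefers window seats")
    mem.delete("e1", mode)
    return residue(mem, ["CANARY-1234"])


results = {mode: run(mode) for mode in ("raw_only", "full_pipeline", "tombstone")}
print(results)
```

In this toy the summary always retains the canary, so raw-only deletion leaves full residue. In the paper the summary retains it only in roughly 20% of instances, which is why the reported raw-only residue is partial rather than total.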

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent builders could default to summarized memory tiers to improve baseline privacy properties.
  • The same compression and deletion patterns may appear in other retrieval-augmented or memory-based systems outside the tested models.
  • Benchmarks focused only on recall or extraction without deletion testing would miss the residue failure mode shown here.

Load-bearing premise

The LongMemEval benchmark, the chosen memory-design knobs, and the metrics PR, AER, and FRS are representative of real deployment scenarios and capture the relevant privacy-utility tradeoffs without missing important failure modes.

What would settle it

Re-running the full sweep of summarization levels, k values, and deletion modes on a dataset of actual long-term user-agent conversations and checking whether the 76% and 64% extraction reductions and the 20% residue rate still appear.

Figures

Figures reproduced from arXiv: 2606.10062 by Alet Blanken, Amine Anoun, Andy Olsen, Dalmo Cirne, Guilin Zhang, Jerry Ting, Kai Zhao, Lei (Rachel) Chen, Xu Chu, Zeke Miller.

Figure 1: Privacy–utility frontier and probe decomposition. Top: Frontier on Gemma 3 12B and GPT-4o-mini; color encodes S, each line traces k ∈ {1, 3, 6, 10, 25}. Under S=0 the line stretches diagonally (PR/AER both rise with k); under S≥1 it collapses to a near-point (k-flatness). Bottom: AER by probe type at S ∈ {0, 1}. The direct–indirect spread ∆DI := |AER_direct − AER_indirect| collapses under summarization on b…
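The direct–indirect spread in the caption is a simple difference of per-probe extraction rates. The sketch below computes it from a list of probe outcomes; the function names and the toy outcomes are made up for illustration, and the paper's probe construction is not shown here.

```python
from collections import defaultdict


def aer_by_probe(outcomes):
    """outcomes: iterable of (probe_type, extracted) pairs -> rate per probe type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for probe_type, extracted in outcomes:
        totals[probe_type] += 1
        hits[probe_type] += int(extracted)
    return {p: hits[p] / totals[p] for p in totals}


def direct_indirect_spread(outcomes):
    """Delta_DI := |AER_direct - AER_indirect|, as defined in the caption."""
    rates = aer_by_probe(outcomes)
    return abs(rates["direct"] - rates["indirect"])


# Made-up outcomes: 3 of 4 direct probes and 1 of 4 indirect probes succeed.
toy = [
    ("direct", True), ("direct", True), ("direct", False), ("direct", True),
    ("indirect", True), ("indirect", False), ("indirect", False), ("indirect", False),
]
rates = aer_by_probe(toy)
spread = direct_indirect_spread(toy)
print(rates, spread)
```

A spread near zero means direct and indirect probes succeed about equally often, which is the pattern the caption reports under summarization.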
Original abstract

Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames deployment-time memorization in long-lived foundation-model agents as a privacy-utility frontier. It introduces Personalization Recall (PR), Adversarial Extraction Rate (AER), and Forgetting Residue Score (FRS), then sweeps three memory-design knobs (summarization aggressiveness, retrieval breadth k, deletion mode) on the LongMemEval benchmark. Using Gemma 3 12B and GPT-4o-mini, it reports that key-fact summarization cuts canary extraction by 76% and 64% respectively while preserving nearly all PR; post-compression, larger k does not restore leakage; raw-only deletion leaves ~20% FRS residue from derived summaries, while full-pipeline purge or tombstone redaction drives residue to zero.

Significance. If the empirical results hold, the work supplies a concrete, knob-swept characterization of how persistent agent memory jointly affects utility, extractability, and true deletability. Its two main contributions are FRS, which quantifies deletion fidelity across memory tiers, and the demonstration that summarization can decouple extraction risk from retrieval breadth. Both yield falsifiable, deployment-relevant predictions for agent memory design.

major comments (2)
  1. [Evaluation on LongMemEval] Evaluation section (LongMemEval experiments): the headline AER reductions (76% on Gemma 3 12B, 64% on GPT-4o-mini) and the 20% FRS residue under raw deletion are measured exclusively on this single benchmark with the three newly defined scalar metrics. No cross-benchmark validation or mapping to real deployment axes (adaptive summary-targeted adversaries, cross-session derived copies, heterogeneous user distributions) is provided, so the claimed privacy-utility frontier and deletion-fidelity conclusion risk being benchmark-specific.
  2. [Deletion Fidelity Experiments] Deletion-fidelity experiments: the claim that only full-pipeline purge or tombstone redaction drives worst-tier residue to zero rests on FRS capturing all recoverable derived copies. The manuscript does not test whether FRS misses other failure modes (e.g., partial summary regeneration or multi-agent sharing), which is load-bearing for the central argument that compression induces a deletion-fidelity failure.
minor comments (2)
  1. [Abstract] Abstract: the quantitative claims appear without reference to error bars, number of trials, or statistical tests; adding a brief clause on these would improve immediate readability.
  2. [Introduction] Notation: PR, AER, and FRS are introduced without a compact formal definition in the abstract or early sections; a one-sentence inline definition would aid readers.
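On the error-bar point, one common way to attach an interval to a rate such as AER or FRS is the Wilson score interval for a binomial proportion. The sketch below is a generic illustration and not the authors' procedure; the trial count of 100 is hypothetical, since the number of trials is not given here.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical: 20 recoverable instances out of 100 deletion trials.
low, high = wilson_interval(20, 100)
print(f"rate=0.20, interval=({low:.3f}, {high:.3f})")
```

With counts of this size the interval is wide enough that a reported "approximately 20%" would benefit from stating the number of trials alongside it.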

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

  1. Referee: [Evaluation on LongMemEval] Evaluation section (LongMemEval experiments): the headline AER reductions (76% on Gemma 3 12B, 64% on GPT-4o-mini) and the 20% FRS residue under raw deletion are measured exclusively on this single benchmark with the three newly defined scalar metrics. No cross-benchmark validation or mapping to real deployment axes (adaptive summary-targeted adversaries, cross-session derived copies, heterogeneous user distributions) is provided, so the claimed privacy-utility frontier and deletion-fidelity conclusion risk being benchmark-specific.

    Authors: LongMemEval is a benchmark specifically constructed for evaluating long-term memory in agents, which aligns directly with the deployment-time memorization setting studied here. Results are reported consistently across two models (Gemma 3 12B and GPT-4o-mini), providing internal validation of the observed trends. We agree that cross-benchmark experiments and explicit mapping to additional deployment axes would strengthen generalizability claims. In revision we will add a limitations subsection that states the single-benchmark scope and lists the suggested axes as priorities for follow-on work; the core empirical claims will be qualified accordingly. revision: partial

  2. Referee: [Deletion Fidelity Experiments] Deletion-fidelity experiments: the claim that only full-pipeline purge or tombstone redaction drives worst-tier residue to zero rests on FRS capturing all recoverable derived copies. The manuscript does not test whether FRS misses other failure modes (e.g., partial summary regeneration or multi-agent sharing), which is load-bearing for the central argument that compression induces a deletion-fidelity failure.

    Authors: FRS is defined and computed strictly over the memory tiers present in the evaluated agent architecture (raw entries and their derived summaries). The reported experiments show that raw-only deletion leaves measurable residue in those summaries while full-pipeline purge removes it. We did not evaluate additional failure modes such as partial regeneration outside the tested pipeline or cross-agent sharing. In revision we will explicitly bound the claim to the tiers and deletion modes studied, and add a sentence noting that broader failure modes remain open for future investigation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements of new metrics on benchmark


The paper is an empirical study that defines three new scalar metrics (PR, AER, FRS), introduces three memory-design knobs, and reports direct experimental measurements on the LongMemEval benchmark. No derivation chain, first-principles prediction, fitted parameter renamed as prediction, or self-citation is invoked to support the central claims; the reported reductions (76%/64% AER, ~20% FRS residue) are measured outcomes rather than algebraic identities or self-referential constructions. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claims rest on the validity of the newly introduced metrics and the representativeness of the LongMemEval benchmark and chosen design knobs. No axioms or invented entities are explicitly described in the abstract; the two free parameters listed below are swept design knobs rather than fitted constants.

free parameters (2)
  • summarization aggressiveness
    One of the three memory-design knobs swept in the experiments.
  • retrieval breadth k
    One of the three memory-design knobs swept in the experiments.

pith-pipeline@v0.9.1-grok · 5815 in / 1264 out tokens · 18874 ms · 2026-06-27T16:19:12.379855+00:00 · methodology

