Deployment-Time Memorization in Foundation-Model Agents

Alet Blanken; Amine Anoun; Andy Olsen; Dalmo Cirne; Guilin Zhang; Jerry Ting; Kai Zhao; Lei (Rachel) Chen; Xu Chu; Zeke Miller

arxiv: 2606.10062 · v1 · pith:MEQ57VBYnew · submitted 2026-06-08 · 💻 cs.AI · cs.MA

Pith reviewed 2026-06-27 16:19 UTC · model grok-4.3

keywords foundation model agents, memorization, privacy, deletion, summarization, personalization, extraction risk

The pith

Key-fact summarization in agent memory reduces extraction by 76% on Gemma 3 12B while preserving nearly all personalization recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how memory-design choices in long-lived foundation-model agents jointly determine personalization utility, extraction risk, and deletion fidelity. It sweeps summarization aggressiveness, retrieval breadth, and deletion mode on the LongMemEval benchmark while tracking Personalization Recall, Adversarial Extraction Rate, and Forgetting Residue Score. Key-fact summarization cuts canary extraction sharply on two models without meaningful loss in recall utility. After compression, retrieving more items stops increasing leakage. Raw-only deletion still leaves summary copies recoverable in about 20% of cases, and only full-pipeline or tombstone deletion brings residue to zero.

Core claim

By treating agent memory as an explicit deployment-time function and measuring it on the privacy-utility frontier with Personalization Recall and Adversarial Extraction Rate, plus the new Forgetting Residue Score, the work shows that key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall. Once content is compressed away, increasing retrieval breadth no longer restores leakage. Raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero.

What carries the argument

The privacy-utility frontier defined by Personalization Recall (PR) and Adversarial Extraction Rate (AER), extended by the Forgetting Residue Score (FRS) to track deletion across memory tiers, controlled through the three knobs of summarization aggressiveness, retrieval breadth k, and deletion mode.
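The page does not reproduce the paper's formal metric definitions, so the following is only a minimal sketch under stated assumptions: AER is taken to be the fraction of adversarial probes whose reply contains the planted canary verbatim, PR would be the same kind of fraction over personalization questions answered correctly, and the headline percentages are read as relative reductions. The function names and the toy replies are hypothetical, not taken from the paper.

```python
from typing import Sequence


def rate(hits: Sequence[bool]) -> float:
    """Fraction of trials that succeeded; 0.0 for an empty list."""
    return sum(hits) / len(hits) if hits else 0.0


def extraction_hits(responses: Sequence[str], canaries: Sequence[str]) -> list[bool]:
    """Assumed AER criterion: the planted canary appears verbatim in the reply."""
    return [c.lower() in r.lower() for r, c in zip(responses, canaries)]


def relative_reduction(baseline: float, treated: float) -> float:
    """Relative drop from baseline to treated, e.g. 0.50 -> 0.12 is 0.76."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - treated) / baseline


# Hypothetical toy data, not results from the paper.
canaries = ["ZX-4411", "ZX-9020", "ZX-1377", "ZX-5558"]
raw_replies = ["Your code is ZX-4411.", "It was zx-9020.", "I cannot share that.", "ZX-5558"]
summarized_replies = ["I cannot share that.", "It was ZX-9020.", "No record.", "No record."]

aer_raw = rate(extraction_hits(raw_replies, canaries))         # 3 of 4 probes leak
aer_sum = rate(extraction_hits(summarized_replies, canaries))  # 1 of 4 probes leak
print(aer_raw, aer_sum, relative_reduction(aer_raw, aer_sum))
```

Reading the reduction as relative rather than absolute matters: a 76% relative drop says nothing about the starting AER, which is why the frontier plots PR and AER themselves.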

If this is right

Key-fact summarization can be applied to lower extraction risk substantially while retaining most personalization utility.
Once summarization is in place, increasing retrieval breadth no longer increases extraction risk.
Deletion must target derived memory copies, not only raw entries, to drive forgetting residue to zero.
Agent memory systems should be evaluated on recall, extractability, and erasability together rather than in isolation.
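The deletion point above can be illustrated with a toy two-tier store. This is a sketch under assumptions, not the paper's architecture: the class `TieredMemory`, the mode names `raw_only`, `full_pipeline`, and `tombstone`, and the first-sentence "summarizer" are all hypothetical stand-ins chosen to show why a derived copy can outlive its raw entry.

```python
from dataclasses import dataclass, field

TOMBSTONE = "[REDACTED]"


@dataclass
class TieredMemory:
    """Toy two-tier store: raw entries plus summaries derived from them."""
    raw: dict[str, str] = field(default_factory=dict)
    summaries: dict[str, str] = field(default_factory=dict)

    def write(self, entry_id: str, text: str) -> None:
        self.raw[entry_id] = text
        # Stand-in for key-fact summarization: keep the first sentence only.
        self.summaries[entry_id] = text.split(".")[0].strip()

    def delete(self, entry_id: str, mode: str) -> None:
        if mode == "raw_only":
            self.raw.pop(entry_id, None)          # derived copy survives
        elif mode == "full_pipeline":
            self.raw.pop(entry_id, None)
            self.summaries.pop(entry_id, None)    # purge every tier
        elif mode == "tombstone":
            for tier in (self.raw, self.summaries):
                if entry_id in tier:
                    tier[entry_id] = TOMBSTONE    # keep the slot, redact the content
        else:
            raise ValueError(f"unknown deletion mode: {mode}")

    def recoverable(self, secret: str) -> bool:
        """True if the secret still appears in any tier."""
        tiers = (self.raw, self.summaries)
        return any(secret in text for tier in tiers for text in tier.values())


def residue_after(mode: str) -> bool:
    """Write one entry, delete it with the given mode, and check for leftovers."""
    mem = TieredMemory()
    mem.write("m1", "User's badge number is 7731. They mentioned it on Monday.")
    mem.delete("m1", mode)
    return mem.recoverable("7731")


for mode in ("raw_only", "full_pipeline", "tombstone"):
    print(mode, residue_after(mode))
```

In this toy setup only `raw_only` leaves the secret recoverable, because the summary tier was never touched.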

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent builders could default to summarized memory tiers to improve baseline privacy properties.
The same compression and deletion patterns may appear in other retrieval-augmented or memory-based systems outside the tested models.
Benchmarks focused only on recall or extraction without deletion testing would miss the residue failure mode shown here.

Load-bearing premise

The LongMemEval benchmark, the chosen memory-design knobs, and the metrics PR, AER, and FRS are representative of real deployment scenarios and capture the relevant privacy-utility tradeoffs without missing important failure modes.

What would settle it

Re-running the full sweep of summarization levels, k values, and deletion modes on a dataset of actual long-term user-agent conversations and checking whether the 76% and 64% extraction reductions and the 20% residue rate still appear.
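A re-run of this kind amounts to enumerating a configuration grid and scoring each cell. The sketch below is hypothetical scaffolding: the k values come from the Figure 1 caption, while the summarization levels, the deletion-mode names, and the `evaluate` callback are assumptions, since the page does not list them.

```python
from itertools import product

# k values are taken from the Figure 1 caption; S levels and mode names are assumed.
K_VALUES = (1, 3, 6, 10, 25)
S_LEVELS = (0, 1, 2)
DELETION_MODES = ("raw_only", "full_pipeline", "tombstone")


def sweep_configs():
    """Yield one config dict per (S, k, deletion mode) combination."""
    for s, k, mode in product(S_LEVELS, K_VALUES, DELETION_MODES):
        yield {"S": s, "k": k, "deletion_mode": mode}


def run_sweep(evaluate):
    """Score every config; `evaluate` is a caller-supplied function that
    returns a dict of metrics (for example PR, AER, FRS) for one config."""
    return [{**cfg, **evaluate(cfg)} for cfg in sweep_configs()]


configs = list(sweep_configs())
print(len(configs))  # 3 * 5 * 3 combinations
```

Keeping the evaluator as a callback lets the same grid be replayed on a different dataset, which is the comparison this section proposes.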

Figures

Figures reproduced from arXiv: 2606.10062 by Alet Blanken, Amine Anoun, Andy Olsen, Dalmo Cirne, Guilin Zhang, Jerry Ting, Kai Zhao, Lei (Rachel) Chen, Xu Chu, Zeke Miller.

**Figure 1.** Privacy–utility frontier and probe decomposition. Top: frontier on Gemma 3 12B and GPT-4o-mini; color encodes S, and each line traces k ∈ {1, 3, 6, 10, 25}. Under S=0 the line stretches diagonally (PR and AER both rise with k); under S≥1 it collapses to a near-point (k-flatness). Bottom: AER by probe type at S ∈ {0, 1}. The direct–indirect spread Δ_DI := |AER_direct − AER_indirect| collapses under summarization on b…

Original abstract

Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines three new metrics for agent memory and shows summarization cuts extraction while creating deletion residue on LongMemEval.

The main takeaway is that this work treats agent memory as an active deployment mechanism rather than just model weights. They introduce Personalization Recall, Adversarial Extraction Rate, and Forgetting Residue Score, then measure how summarization aggressiveness, retrieval k, and deletion mode jointly affect utility, extraction, and true forgetting.

What stands out is the joint sweep and the specific numbers. Key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall. Once content is compressed, larger k no longer restores leakage. The same compression creates a deletion problem: raw-only deletion leaves derived summary copies recoverable in roughly 20% of cases, and only full-pipeline purge or tombstone redaction drives residue to zero. These are concrete, previously unmeasured effects.

The central claims hold on the reported data. The metrics are new and the setup connects three aspects that prior work handled separately.

The soft spots are around scope. Everything runs on LongMemEval with the three chosen knobs, so the observed frontier could shift under adaptive attacks on summaries, heterogeneous user data, or multi-agent sharing. The paper would be stronger with more datasets or explicit discussion of how well these metrics proxy real deployments. The abstract gives clean percentages, but the full methods section needs to show error bars and statistical detail for the results to land solidly.

This is for researchers working on agent memory design and privacy controls. Readers focused on those tradeoffs will find the new evaluation surface and the reported numbers useful.

It deserves peer review. The work is grounded enough in new measurements to warrant referee time.

Referee Report

2 major / 2 minor

Summary. The paper frames deployment-time memorization in long-lived foundation-model agents as a privacy-utility frontier. It introduces Personalization Recall (PR), Adversarial Extraction Rate (AER), and Forgetting Residue Score (FRS), then sweeps three memory-design knobs (summarization aggressiveness, retrieval breadth k, deletion mode) on the LongMemEval benchmark. Using Gemma 3 12B and GPT-4o-mini, it reports that key-fact summarization cuts canary extraction by 76% and 64% respectively while preserving nearly all PR; post-compression, larger k does not restore leakage; raw-only deletion leaves ~20% FRS residue from derived summaries, while full-pipeline purge or tombstone redaction drives residue to zero.

Significance. If the empirical results hold, the work supplies a concrete, knob-swept characterization of how persistent agent memory jointly affects utility, extractability, and true deletability. Two contributions deserve explicit credit: the introduction of FRS for quantifying deletion fidelity across memory tiers, and the demonstration that summarization can decouple extraction risk from retrieval breadth. These findings supply falsifiable, deployment-relevant predictions for agent memory design.

major comments (2)

[Evaluation on LongMemEval] Evaluation section (LongMemEval experiments): the headline AER reductions (76% on Gemma 3 12B, 64% on GPT-4o-mini) and the 20% FRS residue under raw deletion are measured exclusively on this single benchmark with the three newly defined scalar metrics. No cross-benchmark validation or mapping to real deployment axes (adaptive summary-targeted adversaries, cross-session derived copies, heterogeneous user distributions) is provided, so the claimed privacy-utility frontier and deletion-fidelity conclusion risk being benchmark-specific.
[Deletion Fidelity Experiments] Deletion-fidelity experiments: the claim that only full-pipeline purge or tombstone redaction drives worst-tier residue to zero rests on FRS capturing all recoverable derived copies. The manuscript does not test whether FRS misses other failure modes (e.g., partial summary regeneration or multi-agent sharing), which is load-bearing for the central argument that compression induces a deletion-fidelity failure.

minor comments (2)

[Abstract] Abstract: the quantitative claims appear without reference to error bars, number of trials, or statistical tests; adding a brief clause on these would improve immediate readability.
[Introduction] Notation: PR, AER, and FRS are introduced without a compact formal definition in the abstract or early sections; a one-sentence inline definition would aid readers.
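The request for error bars could be met in several ways; one common option is a percentile bootstrap over per-probe outcomes. The sketch below assumes each probe yields a binary outcome (extracted or not) and uses made-up data; the paper's actual trial counts and statistical procedure are not stated on this page.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of binary outcomes.

    Returns (point_estimate, lower, upper).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical data: 12 successful extractions out of 100 probes.
outcomes = [1] * 12 + [0] * 88
point, lower, upper = bootstrap_ci(outcomes)
print(f"AER = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the number of probes and resamples would address the readability concern without changing any headline figure.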

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Referee: [Evaluation on LongMemEval] Evaluation section (LongMemEval experiments): the headline AER reductions (76% on Gemma 3 12B, 64% on GPT-4o-mini) and the 20% FRS residue under raw deletion are measured exclusively on this single benchmark with the three newly defined scalar metrics. No cross-benchmark validation or mapping to real deployment axes (adaptive summary-targeted adversaries, cross-session derived copies, heterogeneous user distributions) is provided, so the claimed privacy-utility frontier and deletion-fidelity conclusion risk being benchmark-specific.

Authors: LongMemEval is a benchmark specifically constructed for evaluating long-term memory in agents, which aligns directly with the deployment-time memorization setting studied here. Results are reported consistently across two models (Gemma 3 12B and GPT-4o-mini), providing internal validation of the observed trends. We agree that cross-benchmark experiments and explicit mapping to additional deployment axes would strengthen generalizability claims. In revision we will add a limitations subsection that states the single-benchmark scope and lists the suggested axes as priorities for follow-on work; the core empirical claims will be qualified accordingly. revision: partial
Referee: [Deletion Fidelity Experiments] Deletion-fidelity experiments: the claim that only full-pipeline purge or tombstone redaction drives worst-tier residue to zero rests on FRS capturing all recoverable derived copies. The manuscript does not test whether FRS misses other failure modes (e.g., partial summary regeneration or multi-agent sharing), which is load-bearing for the central argument that compression induces a deletion-fidelity failure.

Authors: FRS is defined and computed strictly over the memory tiers present in the evaluated agent architecture (raw entries and their derived summaries). The reported experiments show that raw-only deletion leaves measurable residue in those summaries while full-pipeline purge removes it. We did not evaluate additional failure modes such as partial regeneration outside the tested pipeline or cross-agent sharing. In revision we will explicitly bound the claim to the tiers and deletion modes studied, and add a sentence noting that broader failure modes remain open for future investigation. revision: partial
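To make the "bounded to the tiers studied" point concrete, here is a minimal sketch of a per-tier residue score with a worst-tier aggregate. It assumes FRS is the fraction of deleted items still recoverable from a tier and that the worst tier is the maximum over tiers; the paper's exact formula is not given on this page, and the function names and toy data are hypothetical.

```python
def tier_residue(tier_texts, deleted_secrets):
    """Fraction of deleted secrets that still appear somewhere in one tier."""
    if not deleted_secrets:
        return 0.0
    found = sum(
        any(secret in text for text in tier_texts) for secret in deleted_secrets
    )
    return found / len(deleted_secrets)


def forgetting_residue(tiers, deleted_secrets):
    """Return (per-tier scores, worst-tier score) over the given tiers only."""
    per_tier = {
        name: tier_residue(texts, deleted_secrets) for name, texts in tiers.items()
    }
    worst = max(per_tier.values(), default=0.0)
    return per_tier, worst


# Hypothetical state after a raw-only deletion of four secrets:
# the raw tier is clean, but one summary still carries a deleted value.
deleted = ["7731", "4402", "9915", "1288"]
tiers = {
    "raw": ["User likes hiking."],
    "summary": ["Badge 7731; prefers morning meetings.", "Likes hiking."],
}
per_tier, worst = forgetting_residue(tiers, deleted)
print(per_tier, worst)
```

Because the score only inspects the tiers passed in, any copy held elsewhere (for example in another agent's store) would be invisible to it, which is the limitation the authors agree to state.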

Circularity Check

0 steps flagged

No circularity: empirical measurements of new metrics on benchmark

The paper is an empirical study that defines three new scalar metrics (PR, AER, FRS), introduces three memory-design knobs, and reports direct experimental measurements on the LongMemEval benchmark. No derivation chain, first-principles prediction, fitted parameter renamed as prediction, or self-citation is invoked to support the central claims; the reported reductions (76%/64% AER, ~20% FRS residue) are measured outcomes rather than algebraic identities or self-referential constructions. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claims rest on the validity of the newly introduced metrics and the representativeness of the LongMemEval benchmark and chosen design knobs. The abstract describes no axioms or invented entities; the two free parameters listed below are swept design knobs rather than fitted values.

free parameters (2)

summarization aggressiveness
One of the three memory-design knobs swept in the experiments.
retrieval breadth k
One of the three memory-design knobs swept in the experiments.

pith-pipeline@v0.9.1-grok · 5815 in / 1264 out tokens · 18874 ms · 2026-06-27T16:19:12.379855+00:00 · methodology

