
arxiv: 2606.15903 · v2 · pith:ABQ7MLLVnew · submitted 2026-06-14 · 💻 cs.CL · cs.AI

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

Pith reviewed 2026-06-27 03:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords agent memory · forgetting failures · control-plane placement · mutation-time hook · ForgetEval · intent-aware deletion · adversarial evaluation

The pith

Placing LLM assistance at the mutation-time control plane recovers intent-aware deletion and achieves 91.7-93.2% overall forgetting accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how the placement of LLM assistance within an agent memory pipeline determines which forgetting failure modes the system recovers. Across thirteen configurations it identifies three regimes: deterministic primitives handle lexical and temporal categories but fail on canonicalization; inscribe-time LLM placement recovers canonicalization but not intent-aware deletion; and mutation-time placement recovers intent-aware deletion. A mutation-time hook achieves 78-85% recovery on intent-aware deletion and 91.7-93.2% overall accuracy. The authors argue this matters because production failures in agents are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. They release ForgetEval to enable systematic testing of these failure modes.

Core claim

Control-plane placement shapes forgetting recovery in agent memory systems. Comparing thirteen configurations on a 385-case adversarial surface reveals three regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail on canonicalization; inscribe-time LLM recovers canonicalization but not intent-aware deletion; mutation-time hooks recover intent-aware deletion at 78-85% and achieve 91.7-93.2% overall, at modest cost, without changing the recall path. The evaluation uses ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer, scored by deterministic substring match.
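To make "scored by deterministic substring match" concrete, here is a minimal sketch of what such a scorer could look like. The `ForgetCase` record, its field names, and the case-folding step are assumptions made for illustration; the page does not describe ForgetEval's actual schema or normalization rules.

```python
from dataclasses import dataclass, field


@dataclass
class ForgetCase:
    """Hypothetical case record; field names are illustrative, not ForgetEval's schema."""
    must_be_absent: list[str]  # strings that should no longer be recallable
    must_be_present: list[str] = field(default_factory=list)  # strings that should survive


def score_case(case: ForgetCase, recalled_text: str) -> bool:
    """Pass only if every forgotten string is absent and every retained string is present."""
    haystack = recalled_text.casefold()  # assumption: case-insensitive matching
    leaked = any(s.casefold() in haystack for s in case.must_be_absent)
    kept = all(s.casefold() in haystack for s in case.must_be_present)
    return (not leaked) and kept


def accuracy(results: list[bool]) -> float:
    """Percentage of passing cases; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

A check of this kind is cheap and reproducible, but a paraphrase such as "alice at example dot com" would pass a case whose forbidden string is "alice@example.com". That is the limitation the referee report raises for the intent-aware categories.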

What carries the argument

Control-plane placement of LLM assistance for memory mutations (supersede, release, purge) in agent systems.
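As a rough illustration of what "mutation-time placement" means, the sketch below puts a pluggable resolver in front of `purge` while leaving `recall` untouched. Everything here is assumed for the example: the `ToyMemory` class, the resolver signature, the prefix-matching baseline, and the stub that stands in for an LLM call. The `user_1` / `user_10` scenario is only one plausible reading of a prefix collision, not a case taken from the paper.

```python
from typing import Callable, Optional

# A resolver maps (request, stored keys) -> the keys the request actually targets.
# In the paper's setting this step would be LLM-assisted; here it is any callable.
Resolver = Callable[[str, list[str]], list[str]]


def prefix_resolver(request: str, keys: list[str]) -> list[str]:
    """Deterministic baseline (assumed): target every key that starts with the request."""
    return [k for k in keys if k.startswith(request)]


def exact_intent_stub(request: str, keys: list[str]) -> list[str]:
    """Stand-in for an LLM call: pretend the model understood only the exact key is meant."""
    return [k for k in keys if k == request]


class ToyMemory:
    """Illustrative store; not the paper's implementation."""

    def __init__(self, mutation_hook: Optional[Resolver] = None) -> None:
        self._facts: dict[str, str] = {}
        # The hook runs only on mutations, so recall() behaves the same with or without it.
        self._resolve: Resolver = mutation_hook or prefix_resolver

    def inscribe(self, key: str, value: str) -> None:
        self._facts[key] = value

    def recall(self, key: str) -> Optional[str]:
        return self._facts.get(key)

    def purge(self, request: str) -> list[str]:
        """Delete whatever the resolver says the request targets; return the deleted keys."""
        targets = self._resolve(request, list(self._facts))
        for k in targets:
            del self._facts[k]
        return targets
```

With the default resolver, purging `user_1` also removes `user_10`; with the stub hook only `user_1` is removed. In both cases `recall` is the same code path, which is the sense in which the recall path stays unchanged.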

If this is right

  • Deterministic primitives achieve high accuracy on lexical and temporal forgetting categories but only 5% on identifier-obfuscation and 0% on cross-lingual.
  • Inscribe-time LLM use achieves 100% on canonicalization but 0% on prefix-collision and compound-fact for intent-aware deletion.
  • Mutation-time hook achieves 78-85% on intent-aware deletion and 91.7-93.2% overall.
  • The mutation-time hook costs $0.17 per 385-case run and adds 2.3 s/case of mutation latency, versus 64-191 ms/case for deterministic configurations.
  • The recall path remains unchanged.
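The cost and latency figures in the list above are easier to compare per case. The short calculation below uses only the numbers reported on this page; the variable names are ours.

```python
RUN_COST_USD = 0.17             # reported cost per adversarial run
CASES_PER_RUN = 385             # reported adversarial cases per run
HOOK_LATENCY_S = 2.3            # reported mutation latency per case with the hook
DET_LATENCY_S = (0.064, 0.191)  # reported deterministic range, converted to seconds

cost_per_case = RUN_COST_USD / CASES_PER_RUN
slowdown_low = HOOK_LATENCY_S / DET_LATENCY_S[1]   # against the slowest deterministic figure
slowdown_high = HOOK_LATENCY_S / DET_LATENCY_S[0]  # against the fastest deterministic figure

print(f"cost per case: ${cost_per_case:.5f}")
print(f"latency ratio: {slowdown_low:.0f}x to {slowdown_high:.0f}x")
```

This works out to roughly $0.0004 per case and a mutation-path slowdown of about 12x to 36x relative to the deterministic range, which is the trade-off a designer would weigh against the coverage gain.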

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designers could combine multiple placements for broader coverage without increasing recall latency.
  • The ForgetEval suite could be extended to test other memory operations beyond forgetting.
  • Production systems might benefit from monitoring which forgetting modes occur most frequently to choose placement.
  • The asymmetry in canonicalization recovery suggests that inscription and mutation phases have distinct roles in memory integrity.

Load-bearing premise

The 385-case adversarial surface and 1000-case ForgetEval suite, scored by deterministic substring match, capture the forgetting failure modes that arise in actual production agent deployments.

What would settle it

A production deployment log or new test set on which the mutation-time hook recovers intent-aware deletion at rates clearly below the reported 78-85%.

Figures

Figures reproduced from arXiv: 2606.15903 by Dongxu Yang.

Figure 1: Recall–forgetting trade-off characterization

Original abstract

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
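The abstract reports inter-annotator agreement as Fleiss' kappa. For readers unfamiliar with the statistic, the sketch below implements the standard formula from a table of per-item category counts. The input layout and function name are ours, and the abstract does not say which label set the ten annotators used.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must have the same number of ratings (at least two). The statistic is
    undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings, at least two")

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_cats = len(counts[0])
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cats)

    return (p_bar - p_exp) / (1 - p_exp)
```

A value of 1.0 means perfect agreement and 0 means agreement no better than chance, so the reported 0.958 indicates that the annotators almost always gave the same label.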

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the architectural placement of the control plane (handling supersede, release, purge mutations) relative to the recall plane in LLM agent memory systems determines recoverable forgetting failure modes. Comparing 13 configurations on ForgetEval—a 1000-case templated suite plus 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted, oracle-validated), scored by deterministic substring match—the authors identify three regimes: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but not intent-aware deletion (0% on prefix-collision/compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) while achieving 91.7-93.2% overall, at $0.17 per 385-case run and 2.3s/case latency (vs. 64-191ms deterministic), with recall path unchanged. Additional contributions include a 6-method Adapter Protocol, 10-annotator IAA (Fleiss' kappa 0.958), 77-case external replication (+27.8 pt joint-placement lift), and public release of ForgetEval/adapters under MIT.

Significance. If the results hold under robust evaluation, the work usefully demonstrates that forgetting is architecturally distinct from recall and that placement choices yield complementary coverage of failure modes, with concrete cost/latency trade-offs. The release of the benchmark, adapters, and external replication data strengthens reproducibility and could inform agent system design. The honest N/A scoring and high IAA are positive features.

major comments (3)
  1. [Evaluation / ForgetEval scoring] Evaluation section (scoring protocol for ForgetEval): The central quantitative claims (78-85% recovery on intent-aware deletion; 91.7-93.2% overall) rest on deterministic substring match. This metric is least secure for prefix-collision and compound-fact categories, where semantic leakage or paraphrased retention could occur without a substring hit yet still be counted as recovered; the paper provides no semantic validation or human judgment baseline to confirm that substring absence reliably indicates successful intent-aware deletion.
  2. [Results / Placement regimes] Results on placement regimes (Table or Figure reporting per-category rates): The distinction among the three regimes depends on the 385-case adversarial surface plus 1000-case templated suite being representative; however, no mapping or justification is given showing that these cases cover the forgetting failure modes observed in production agent deployments, weakening the claim that the mutation-time hook 'brightens nearly all categories simultaneously.'
  3. [External validation] External replication (77-case subset): While the external-authored cases replicate the canonicalization asymmetry, the manuscript does not detail selection criteria, exclusion rules, or raw per-case scores for this subset, making it difficult to assess whether it independently corroborates the main findings or introduces post-hoc selection effects.
minor comments (2)
  1. [Abstract / Cost analysis] The abstract reports specific numbers ($0.17 per run, 2.3s/case) without a corresponding breakdown or table in the main text showing how these were computed across the 13 configurations.
  2. [Methods / Adapter Protocol] Notation for the six-method Adapter Protocol is introduced without a dedicated diagram or pseudocode listing the 130 lines, which would aid readers implementing heterogeneous stores.
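In the spirit of minor comment 2, here is a hedged sketch of how a six-method adapter interface with N/A scoring might be laid out. The method names `supersede`, `release`, and `purge` come from the abstract; `inscribe`, `recall`, and `reset`, the two toy adapters, and the rule that unsupported operations are excluded from the denominator are all assumptions, since the page does not list the protocol's actual methods or define "honest N/A scoring".

```python
from typing import Optional, Protocol


class MemoryAdapter(Protocol):
    """Hypothetical six-method interface; the paper's real method list is not given here."""

    def inscribe(self, key: str, value: str) -> None: ...
    def recall(self, query: str) -> str: ...
    def supersede(self, key: str, value: str) -> None: ...
    def release(self, key: str) -> None: ...
    def purge(self, key: str) -> None: ...
    def reset(self) -> None: ...


class DictAdapter:
    """Toy store that supports every operation."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def inscribe(self, key: str, value: str) -> None:
        self._facts[key] = value

    def recall(self, query: str) -> str:
        return self._facts.get(query, "")

    def supersede(self, key: str, value: str) -> None:
        self._facts[key] = value

    def release(self, key: str) -> None:
        self._facts.pop(key, None)

    def purge(self, key: str) -> None:
        self._facts.pop(key, None)

    def reset(self) -> None:
        self._facts.clear()


class AppendOnlyAdapter(DictAdapter):
    """Toy store with no deletion primitives, so release and purge are unsupported."""

    def release(self, key: str) -> None:
        raise NotImplementedError("no release primitive")

    def purge(self, key: str) -> None:
        raise NotImplementedError("no purge primitive")


def run_purge_case(adapter: MemoryAdapter, key: str, value: str) -> Optional[bool]:
    """Return True/False for pass/fail, or None (N/A) if the store cannot purge at all."""
    adapter.reset()
    adapter.inscribe(key, value)
    try:
        adapter.purge(key)
    except NotImplementedError:
        return None
    return value not in adapter.recall(key)


def summarize(results: list[Optional[bool]]) -> dict[str, int]:
    """Report N/A separately instead of folding it into passes or failures."""
    scored = [r for r in results if r is not None]
    return {"passed": sum(scored), "scored": len(scored), "na": len(results) - len(scored)}
```

Reporting N/A as its own count keeps a store without deletion primitives from being credited with a pass (nothing leaked because nothing ran) or penalized with a failure for an operation it never claimed to support.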

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation rigor, representativeness, and reproducibility. We address each major comment below with proposed revisions to strengthen the manuscript where the points identify genuine gaps.

  1. Referee: [Evaluation / ForgetEval scoring] Evaluation section (scoring protocol for ForgetEval): The central quantitative claims (78-85% recovery on intent-aware deletion; 91.7-93.2% overall) rest on deterministic substring match. This metric is least secure for prefix-collision and compound-fact categories, where semantic leakage or paraphrased retention could occur without a substring hit yet still be counted as recovered; the paper provides no semantic validation or human judgment baseline to confirm that substring absence reliably indicates successful intent-aware deletion.

    Authors: We agree that deterministic substring match can overstate recovery on intent-aware categories if models paraphrase while retaining semantics. The metric is conservative for detecting retention (any exact substring triggers failure) but does not rule out paraphrased leakage. We will revise the evaluation section to acknowledge this limitation explicitly and add a human judgment baseline: two annotators will score a random sample of 50 cases from prefix-collision and compound-fact categories for semantic retention, reporting agreement with the substring metric. revision: yes

  2. Referee: [Results / Placement regimes] Results on placement regimes (Table or Figure reporting per-category rates): The distinction among the three regimes depends on the 385-case adversarial surface plus 1000-case templated suite being representative; however, no mapping or justification is given showing that these cases cover the forgetting failure modes observed in production agent deployments, weakening the claim that the mutation-time hook 'brightens nearly all categories simultaneously.'

    Authors: ForgetEval cases were derived from failure modes documented in prior agent memory literature (canonicalization, temporal, lexical, and intent-aware deletion). The adversarial layer was built to target these via hand-crafted and LLM-drafted examples with oracle validation. We will add a dedicated paragraph in the benchmark section mapping each category to cited production-style failure reports from the literature. However, a direct empirical mapping to specific proprietary production logs is outside the paper's scope and would require access to non-public data; we will note this as a limitation on generalizability to all deployments. revision: partial

  3. Referee: [External validation] External replication (77-case subset): While the external-authored cases replicate the canonicalization asymmetry, the manuscript does not detail selection criteria, exclusion rules, or raw per-case scores for this subset, making it difficult to assess whether it independently corroborates the main findings or introduces post-hoc selection effects.

    Authors: We will expand the external validation subsection to specify: selection was stratified random sampling from the 385 adversarial cases (ensuring coverage of all 8 categories); exclusion applied only to the 12 cases where the four blind contributors failed to reach consensus on oracle labels (leaving 77); and raw per-case scores plus contributor annotations will be released in the public repository. This removes any ambiguity about post-hoc selection. revision: yes
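The human-judgment baseline proposed in response 1 amounts to comparing two binary verdicts per case: the substring metric's and the annotators'. A minimal sketch of that comparison follows, using Cohen's kappa as one possible agreement statistic. The choice of statistic and the function names are our assumptions; the rebuttal says only that agreement will be reported.

```python
def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa between two equal-length lists of binary verdicts.

    Undefined (division by zero) when chance agreement is exactly 1,
    for example when both lists are all True.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # chance agreement from the marginals
    return (observed - expected) / (1 - expected)


def overstated(substring_pass: list[bool], human_pass: list[bool]) -> int:
    """Count cases the substring metric passes but a human judges as still retained."""
    return sum(s and not h for s, h in zip(substring_pass, human_pass))
```

The second function isolates the direction of disagreement the referee is worried about: cases counted as successfully forgotten by substring absence even though the content survives in paraphrase.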

Circularity Check

0 steps flagged

Empirical architectural comparison with no derivation chain or self-referential fitting

The paper is an empirical study comparing thirteen memory-system configurations on the released ForgetEval benchmark (1000 templated + 385 adversarial cases scored by deterministic substring match). No equations, derivations, fitted parameters, or predictions appear in the provided text; results are direct observations of performance differences across placement regimes. The Adapter Protocol and scoring method are explicitly defined and released rather than derived from prior self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results is present. The work is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the adversarial test cases and the assumption that deterministic substring scoring captures intent-aware deletion accurately.

axioms (1)
  • domain assumption: The 385 adversarial cases and 1000-case templated suite capture the relevant forgetting failure modes in real agent systems.
    The paper's conclusion that mutation-time placement brightens nearly all categories rests on these cases distinguishing the regimes.

