arxiv: 2606.24775 · v1 · pith:FVTECA7Mnew · submitted 2026-06-23 · 💻 cs.CL · cs.DB · cs.IR

Are We Ready For An Agent-Native Memory System?

Pith reviewed 2026-06-25 23:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.DB · cs.IR
keywords agent memory · LLM agents · memory systems · data management · evaluation framework · workload alignment · maintenance strategies · retrieval systems

The pith

Agent memory systems succeed when their structure matches the workload's dominant bottleneck; no single design wins across all cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic experimental study of memory for LLM agents by decomposing the systems into four functional modules and testing twelve representative implementations plus two reference baselines. It runs the systems on five benchmark workloads spanning eleven datasets to measure how module choices affect representation fidelity, retrieval precision, update correctness, and long-horizon stability. The results show that performance varies with how well the memory organization matches the dominant demands of each workload. The work also compares maintenance approaches and finds that handling updates locally is more cost-efficient than performing global reorganization. These observations point toward memory designs tuned to agent execution patterns rather than memory treated as a black-box add-on.

Core claim

Memory for large language model (LLM) agents has evolved into a full data management system, yet evaluations have remained limited to end-to-end task metrics. Decomposing memory into four modules (representation and storage, extraction, retrieval and routing, and maintenance) and testing twelve systems across five workloads shows that effectiveness is determined by the alignment between memory structure and workload bottleneck, and that localized maintenance is more cost-efficient than global reorganization.
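One way to read the "no single architecture dominates" part of the claim is as a check over a table of per-workload scores. The sketch below is illustrative only: the function name `dominating_systems`, the system and workload labels, and every number are hypothetical and are not results from the paper.

```python
def dominating_systems(scores: dict[str, dict[str, float]]) -> list[str]:
    """Return the systems that are at least tied for best on every workload."""
    workloads = list(next(iter(scores.values())))
    best = {w: max(s[w] for s in scores.values()) for w in workloads}
    return [name for name, s in scores.items()
            if all(s[w] >= best[w] for w in workloads)]


# Hypothetical scores (higher is better); NOT taken from the paper.
scores = {
    "graph_memory": {"temporal_qa": 0.71, "long_context": 0.52},
    "summary_memory": {"temporal_qa": 0.58, "long_context": 0.66},
}

# Each system wins on a different workload, so nothing dominates.
winners = dominating_systems(scores)
```

Under this reading, the claim would be contradicted by any evaluated system for which the returned list is non-empty across the full workload set.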

What carries the argument

The four-module analytical framework that decomposes agent memory into representation and storage, extraction, retrieval and routing, and maintenance.
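To make the decomposition concrete, here is a minimal sketch of how the four modules might fit together in a toy memory. It assumes a plain dictionary as the store and keyword matching as retrieval; the class `ToyAgentMemory`, its methods, and the sample observations are hypothetical and do not correspond to any system evaluated in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    key: str   # e.g. a topic or entity name
    text: str  # the remembered statement


@dataclass
class ToyAgentMemory:
    """Toy composition of the four modules; every name here is hypothetical."""
    # Representation and storage: a flat dict keyed by topic.
    store: dict[str, MemoryItem] = field(default_factory=dict)

    def extract(self, observation: str) -> MemoryItem:
        # Extraction: turn "key: statement" text into a structured item.
        key, _, text = observation.partition(":")
        return MemoryItem(key.strip().lower(), text.strip())

    def maintain(self, item: MemoryItem) -> None:
        # Maintenance: overwrite only the entry with the same key.
        self.store[item.key] = item

    def retrieve(self, query: str) -> list[MemoryItem]:
        # Retrieval and routing: naive keyword match against stored keys.
        words = set(query.lower().split())
        return [item for key, item in self.store.items() if key in words]

    def observe(self, observation: str) -> None:
        self.maintain(self.extract(observation))


memory = ToyAgentMemory()
memory.observe("city: Alice lives in Paris")
memory.observe("city: Alice moved to Berlin")  # revises the earlier fact
memory.observe("pet: Alice has a cat")
answers = [item.text for item in memory.retrieve("which city now")]
```

Swapping one method (for example, replacing the dict with a graph, or the keyword match with embedding search) while leaving the others untouched is the kind of module-level variation the framework is meant to isolate.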

If this is right

  • Different agent workloads require different memory architectures depending on their dominant bottleneck.
  • Localized maintenance strategies deliver lower operational cost than global reorganization under realistic update loads.
  • Fine-grained module-level measurements are needed to predict effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability.
  • End-to-end task success metrics alone cannot reveal the system-level cost and robustness trade-offs in agent memory.
  • Future agent memory systems should be designed around workload-specific module combinations rather than monolithic structures.
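The cost gap between localized maintenance and global reorganization can be illustrated with a toy model that counts entries written per update. This is only a sketch under a strong assumption: write count stands in for operational cost, whereas this summary does not say which cost metric the paper reports. All function names and sizes below are hypothetical.

```python
def localized_update(store: dict[str, str], key: str, value: str) -> int:
    """Apply one update in place; return the number of entries written."""
    store[key] = value
    return 1


def global_reorganize(store: dict[str, str], key: str, value: str) -> int:
    """Apply one update, then rebuild the whole store; return entries written."""
    store[key] = value
    # Stand-in for re-summarizing or re-indexing the full memory state.
    rebuilt = {k: store[k] for k in sorted(store)}
    store.clear()
    store.update(rebuilt)
    return len(rebuilt)


def total_writes(strategy, n_existing: int, n_updates: int) -> int:
    """Run n_updates revisions of existing facts and sum the writes."""
    store = {f"fact-{i}": "v0" for i in range(n_existing)}
    return sum(
        strategy(store, f"fact-{u % n_existing}", f"v{u + 1}")
        for u in range(n_updates)
    )


local_cost = total_writes(localized_update, n_existing=1000, n_updates=50)
global_cost = total_writes(global_reorganize, n_existing=1000, n_updates=50)
```

In this toy setting 50 updates against 1,000 stored facts cost 50 writes locally and 50,000 writes with a rebuild after every update; the ratio grows with store size, which is the intuition behind the second bullet.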

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent builders may benefit from profiling expected workloads first and then selecting or composing memory modules to match the identified bottleneck.
  • The module decomposition could be tested for completeness by adding workloads that involve frequent cross-agent knowledge sharing.
  • Cost measurements under the localized approach may encourage deployment choices that favor incremental updates in production agent systems.
  • Extending the evaluation to include multi-agent coordination scenarios could reveal additional interactions between retrieval and maintenance modules.

Load-bearing premise

The twelve chosen memory systems, five workloads, and eleven datasets adequately represent the range of agent memory scenarios and the four-module breakdown captures the essential functional aspects without important omissions or interactions.

What would settle it

A workload or dataset in which a poorly aligned architecture still achieves high performance on the measured metrics, or in which global reorganization proves cheaper than localized maintenance under the same update patterns.

Figures

Figures reproduced from arXiv: 2606.24775 by Fan Wu, Feiyu Xiong, Guoliang Li, Hongming Xu, Shaokun Han, Wei Zhou, Xuanhe Zhou, Zhiyu Li.

Figure 1: Typical Execution Workflows of Agent Memory.

Figure 2: Memory Representation Methods.
  … selecting prompts, tool descriptions, and retrieved facts) to mitigate context rot [2]. In contrast, an agent memory system (1) is a persistent and updatable infrastructure for managing agent-specific state over time and (2) governs the full long-term memory lifecycle, including memory representation, storage, retrieval, and maintenance, rather than merely packing the curre…

Figure 3: Memory Storage Methods.
  ▶ Temporal Knowledge Graphs. This sub-category models memory using graph topologies to map entities and their interconnections, natively supporting temporal reasoning and conflict detection. For example, Zep partitions memory into formally defined, temporally-aware knowledge graphs (e.g., episode, entity, and community subgraphs). Similarly, Mem0𝑔 formalizes memory as a directed lab…

Figure 4: Memory Extraction Methods.
  MemOS proposes the MemCube, a unified data object that organizes memory into three distinct payloads (plain-text, activation, and parametric memory) alongside structured details (e.g., ID tags). 3.1.2 Physical Storage and Indexing. As shown in …

Figure 6: Memory Maintenance Methods.
  ▶ Generative Query Expansion. Unlike rigid function calling, this approach uses natural language generation to synthesize intermediate clues or decompose complex intents before mapping them to the index (e.g., rewriting vague prompts into descriptive search strings). SimpleMem uses an Intent-Aware Retrieval Planning module where the LLM dissects queries, calculates adaptive se…

Figure 7: Effectiveness of Memory Systems over LoCoMo, MemoryAgentBench (LongMemEval), LifeLongAgentBench (DB-Bench).
  … workload coverage, MemoryOS and MemOS remain closest to the frontier overall, suggesting that robustness comes not from a single universal memory form, but from preserving the right evidence at the right level of abstraction before final matching. In particular, (1) Temporal or graph-organized memor…

Figure 8: Retrieval Results of Memory Systems over LoCoMo.

Figure 9: Ablation of LLM Backbones.
  … experiments: (1) Update Robustness Comparison, which evaluates whether systems can absorb fact revisions and answer temporally grounded queries after updates; and (2) Backbone Robustness Ablation, which tests whether this behavior remains stable when only the LLM backbone changes. In (1) Update Robustness Comparison, …

Figure 10: (a) Context-length robustness on LongBench; (b) Session-history growth on LongMemEval; (c) Temporal evidence-distance drift on LoCoMo.
  … sessions earlier), while hierarchical or summary-first organization preserves session-level structure (e.g., first locating the relevant session before resolving a specific local detail) so the LLM can narrow attention before final generation. Pure long-context prompting …

Figure 11: Operation Cost of Memory Systems.
  … repeatedly reorganize a large global state are the least efficient. As shown in …

Figure 12: Ablation of Maintenance Strategies.
  … Planning + Reflect, which further introduces a lightweight reflection stage. We evaluate on LongMemEval to measure scattered-history retrieval relevance, and on LoCoMo to assess provenance-sensitive memory access and supporting-memory identification, reporting overlap-based measures (e.g., Substr. EM, ROUGE-L F1). O10-(Planning and Fusion): Explicit planning and balance…
Original abstract

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that agent memory for LLM agents should be analyzed via a four-module decomposition (representation/storage, extraction, retrieval/routing, maintenance); an evaluation of 12 representative systems plus baselines on five workloads spanning 11 datasets shows that no architecture dominates across scenarios, with performance instead depending on alignment to workload-specific bottlenecks, and that localized maintenance is more cost-efficient than global reorganization. These conclusions rest on end-to-end metrics, module ablations, and cost analyses, with code released publicly.

Significance. If the chosen systems and workloads prove representative, the work would supply concrete empirical evidence on architectural trade-offs and cost-performance frontiers that current end-to-end task-success benchmarks obscure. The public code release strengthens reproducibility and enables follow-on studies. The result would usefully shift the field from monolithic black-box evaluations toward modular, bottleneck-aware design.

major comments (2)
  1. [Experimental evaluation and workload description] The central claim that 'no single architecture dominates' (Abstract) is load-bearing on the representativeness of the 12 systems, 5 workloads, and 11 datasets; the manuscript must explicitly demonstrate that these selections expose distinct per-module bottlenecks and include scenarios such as continual lifelong consolidation with conflicting updates or strict latency constraints, otherwise the observed lack of dominance could be an artifact of the chosen slice rather than a general property.
  2. [Framework definition and ablation studies] The four-module decomposition is presented as capturing the essential functional aspects, yet the paper does not show that cross-module feedback loops (e.g., maintenance affecting retrieval routing) are negligible; if such interactions dominate in untested regimes, the fine-grained ablation results on individual module effects would not generalize.
minor comments (1)
  1. [Evaluation methodology] The abstract describes an 'extensive end-to-end evaluation' and 'fine-grained ablation studies'; the main text should report the exact statistical controls, the number of runs, and any post-hoc workload filtering so that readers can assess robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Experimental evaluation and workload description] The central claim that 'no single architecture dominates' (Abstract) is load-bearing on the representativeness of the 12 systems, 5 workloads, and 11 datasets; the manuscript must explicitly demonstrate that these selections expose distinct per-module bottlenecks and include scenarios such as continual lifelong consolidation with conflicting updates or strict latency constraints, otherwise the observed lack of dominance could be an artifact of the chosen slice rather than a general property.

    Authors: We agree that demonstrating representativeness is essential to support the central claim. The five workloads were selected to target different module-level bottlenecks (e.g., storage/retrieval intensity in long-context QA, update frequency in conversational settings, and maintenance load in multi-session tasks), with the 11 datasets providing coverage across domains. In the revision we will add an explicit mapping in Section 3 (and a new table) that links each workload to the primary bottlenecks it stresses, supported by the module ablation results already reported. However, scenarios involving continual lifelong consolidation with conflicting updates or strict latency constraints are not present in the current evaluation suite. We will add a limitations paragraph acknowledging that the 'no single architecture dominates' finding is scoped to the tested workloads and may not extend to these unexamined regimes; we view these as valuable directions for follow-on work rather than changes feasible within the current experimental budget. revision: partial

  2. Referee: [Framework definition and ablation studies] The four-module decomposition is presented as capturing the essential functional aspects, yet the paper does not show that cross-module feedback loops (e.g., maintenance affecting retrieval routing) are negligible; if such interactions dominate in untested regimes, the fine-grained ablation results on individual module effects would not generalize.

    Authors: The four-module framework is offered as an analytical lens for isolating effects rather than an assertion of complete independence. Our ablations hold three modules fixed while varying the fourth, and the resulting performance patterns remain stable across the five workloads. That said, we did not explicitly measure or bound cross-module interactions such as maintenance-induced changes to retrieval routing. In the revised manuscript we will expand the framework section with a short discussion of potential feedback loops, cite any indirect evidence from the existing ablations (e.g., cases where maintenance updates produced only marginal shifts in retrieval metrics), and add a caveat that strong interactions in untested regimes could limit the generalizability of the per-module conclusions. revision: yes
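The ablation design described in this response (three modules fixed, one varied) and the referee's concern about untested interactions can be contrasted with a short sketch. The option names below are hypothetical placeholders, and treating the first option in each list as the baseline is an assumption made for illustration.

```python
from itertools import product

# Hypothetical module options; the first entry of each list is the assumed baseline.
OPTIONS = {
    "representation_storage": ["flat_text", "knowledge_graph", "hierarchical"],
    "extraction": ["raw_turns", "llm_summaries"],
    "retrieval_routing": ["dense_search", "query_expansion"],
    "maintenance": ["localized_update", "global_reorganization"],
}


def one_factor_configs(options):
    """Vary one module at a time while the other three stay at baseline."""
    baseline = {module: choices[0] for module, choices in options.items()}
    configs = [baseline]
    for module, choices in options.items():
        for choice in choices[1:]:
            configs.append({**baseline, module: choice})
    return configs


def full_factorial_configs(options):
    """Every combination of options, which also covers cross-module pairings."""
    modules = list(options)
    return [dict(zip(modules, combo)) for combo in product(*options.values())]


one_factor = one_factor_configs(OPTIONS)
full = full_factorial_configs(OPTIONS)
```

With these assumed options the one-factor design needs 6 configurations and the full grid needs 24. The 18 configurations that only the full grid contains are exactly those where two or more modules change together, which is where feedback loops such as maintenance affecting retrieval routing would show up.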

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential predictions

full rationale

The paper is an experimental benchmarking study that decomposes agent memory into four modules, evaluates 12 existing systems on 5 workloads across 11 datasets, and reports observed performance trade-offs. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. Conclusions follow directly from the reported end-to-end and ablation results rather than reducing to any quantity defined by the paper's own inputs. This matches the default case of a self-contained empirical analysis with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper's main addition is the proposed four-module framework plus the empirical comparison; free parameters are the concrete choices of which systems and workloads to include.

free parameters (2)
  • Selection of the 12 memory systems
    Chosen as representative; the set directly determines which architectures are compared and which conclusions can be drawn.
  • Selection of the 5 workloads and 11 datasets
    Defines the scenarios under which performance and cost are measured.
axioms (1)
  • Domain assumption: The four modules (representation/storage, extraction, retrieval/routing, maintenance) form a complete and non-overlapping decomposition of agent memory functionality.
    This decomposition is the foundation of the entire analytical framework.

pith-pipeline@v0.9.1-grok · 5830 in / 1240 out tokens · 30392 ms · 2026-06-25T23:50:18.158299+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Claude Code. (Anthropic). https://www.claude.com/product/claude-code

  2. [2]

    Anthropic Engineering. 2025. Effective context engineering for AI agents. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

  3. [3]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In ACL (1). Association for Computational Linguistics, 3119–3137

  4. [4]

    Liana Caminal et al. 2025. Filtered Vector Search: State-of-the-art and Research Opportunities. In Proceedings of the VLDB Endowment, Vol. 18. 5488–5491

  5. [5]

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413 (2025)

  7. [7]

    Pengfei Du. 2026. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv preprint arXiv:2603.07670 (2026)

  8. [8]

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. LightMem: Lightweight and Efficient Memory-Augmented Generation. CoRR abs/2510.18866 (2025). arXiv:2510.18866 doi:10.48550/ARXIV.2510.18866

  10. [10]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. CoRR abs/2312.10997 (2023)

  11. [11]

    Google. 2025. Memory – Agent Development Kit (ADK). https://google.github.io/adk-docs/sessions/memory/

  12. [12]

    Yifan Hu, Siyin Liu, Yifei Yue, Guoqiang Zhang, Benyou Liu, Fengbin Zhu, Jingkuan Lin, et al. 2025. Memory in the Age of AI Agents. arXiv preprint arXiv:2512.13564 (2025)

  13. [13]

    Guozhang Kang, Zhenying Ge, Jie Hu, Xinyuan Zhang, Li Wang, and Jianfeng Zhan. 2025. BigVectorBench: Heterogeneous Data Embedding and Compound Queries are Essential in Evaluating Vector Databases. Proceedings of the VLDB Endowment 18, 6 (2025), 1536–1549

  14. [14]

    Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. 2025. Memory OS of AI Agent. In EMNLP. Association for Computational Linguistics, 25961–25970

  15. [15]

    Arijit Khan, Yuyu Luo, Wenjie Zhang, Mingjie Zhou, and Xiaofang Zhou. 2025. Retrieval-augmented Generation (RAG): What is There for Data Management Researchers? ACM SIGMOD Record 54, 4 (2025)

  16. [16]

    Guoliang Li, Xuanhe Zhou, and Xinyang Zhao. 2024. LLM for Data Management. Proc. VLDB Endow. 17, 12 (2024), 4213–4216

  17. [17]

    Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan...

  18. [18]

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. 2026. SimpleMem: Efficient Lifelong Memory for LLM Agents. CoRR abs/2601.02553 (2026)

  19. [19]

    Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, and Aditya G. Parameswaran. 2026. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First. In Proceedings of the 16th Annual Conference ...

  20. [20]

    Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. 2023. MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation. CoRR abs/2308.08239 (2023). arXiv:2308.08239 doi:10.48550/ARXIV.2308.08239

  21. [21]

    Yuyu Luo, Guoliang Li, Ju Fan, and Nan Tang. 2026. Data Agents: Levels, State of the Art, and Open Problems. arXiv preprint arXiv:2602.04261 (2026). SIGMOD 2026 Tutorial

  22. [22]

    Adyasha Maharana, Dong-Ho Lee, Sergey Turishcheva, Kezhen Nham, Golnaz Jandaghi, Jay Pujara, and Xiang Ren. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  23. [23]

    Vasilije Markovic, Lazar Obradovic, László Hajdu, and Jovan Pavlovic. 2025. Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. CoRR abs/2505.24478 (2025)

  24. [24]

    MemoryAgentBench Team. 2026. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. In Fourteenth International Conference on Learning Representations (ICLR)

  25. [25]

    Microsoft. 2025. Introducing Copilot Memory: A More Productive and Personalized AI. https://techcommunity.microsoft.com/blog/microsoft365copilotblog/introducing-copilot-memory

  26. [26]

    OpenAI. 2026. Context Engineering for Personalization – State Management with Long-Term Memory Notes using OpenAI Agents SDK. https://developers.openai.com/cookbook/examples/agents_sdk/context_personalization/

  27. [27]

    Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560 (2023)

  28. [28]

    Preston Rasmussen, Pavel Paliychuk, Travis Beauvais, and Jesse Ryan. 2025. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv preprint arXiv:2501.13956 (2025)

  29. [29]

    Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. 2025. From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=moXtEmCleY

  30. [30]

    Harmanpreet Singh, Nikhil Verma, Yixiao Wang, Manasa Bharadwaj, Homa Fashandi, Kevin Ferreira, and Chul Lee. 2024. Personal Large Language Model Agents: A Case Study on Tailored Travel Planning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Association for Computational Linguistics, Miami, Florid...

  31. [31]

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. In ACL (Findings) (Findings of ACL). Association for Computational Linguistics, 19336–19352

  33. [33]

    Zhiwei Tang et al. 2026. LLM Agent Memory: A Survey from a Unified Representation. arXiv preprint arXiv:2603.0359 (2026)

  34. [34]

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, and Kai-Wei Chang. 2024. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv preprint arXiv:2410.10813 (2024)

  35. [35]

    Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, and Yixiang Fang. 2026. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework. Proceedings of the VLDB Endowment (2026)

  36. [36]

    Wujiang Xu et al. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv preprint arXiv:2502.12110 (2025)

  37. [37]

    Chao Yang, Chuan Zhou, Yanghua Xiao, Shuai Dong, Liang Zhuang, et al. 2026. Graph-based Agent Memory: Taxonomy, Techniques, and Applications. arXiv preprint arXiv:2602.05665 (2026)

  38. [38]

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. 2025. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. CoRR abs/2507.02259 (2025)

  39. [39]

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2025. A Survey on the Memory Mechanism of Large Language Model based Agents. ACM Transactions on Information Systems (2025)

  40. [40]

    Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, Zhong-Zhi Li, Yingying Zhang, Le Song, and Qianli Ma. 2025. LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners. CoRR abs/2505.11942 (2025)

  41. [41]

    Junhao Zheng, Chengming Shi, Xidi Cai, Qiuke Li, Duzhen Zhang, Chenxing Li, Dong Yu, and Qianli Ma. 2025. Lifelong Learning of Large Language Model based Agents: A Roadmap. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  42. [42]

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. MemoryBank: Enhancing Large Language Models with Long-Term Memory. In AAAI. AAAI Press, 19724–19731

  43. [43]

    Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, and Fan Wu. 2026. Automating Database-Native Function Code Synthesis with LLMs. Proc. ACM Manag. Data 3, 4 (2026), 141:1–141:26

  44. [44]

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. 2025. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. CoRR abs/2506.15841 (2025)