pith. sign in

arxiv: 2606.25161 · v1 · pith:IFSHPW36new · submitted 2026-06-23 · 💻 cs.AI

TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

Pith reviewed 2026-06-25 22:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentslong-term memorymemory consolidationtrustworthy memorypreference reinforcement learninghallucination reductionmemory errors
0
0 comments X

The pith

TrustMem improves memory reliability in LLM agents by using a verifier to score updates on coverage, preservation, and faithfulness before applying preference-based reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to make long-term memory updates in LLM agents more trustworthy by preventing persistent errors like omissions, corruptions, and hallucinations. Existing agents generate write, revise, and delete operations that can introduce these issues, which then affect all future reasoning. TrustMem adds a Memory Transition Verifier to assess proposed updates and builds preference pairs to train the agent via reinforcement learning to prefer better transitions. A sympathetic reader would care because reliable memory is essential for agents to provide consistent personalized assistance over extended interactions without accumulating mistakes.

Core claim

TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors.

What carries the argument

The Memory Transition Verifier, which judges memory updates for how completely they cover new information, how well they preserve existing content, and how faithfully they avoid unsupported additions.

If this is right

  • TrustMem achieves state-of-the-art results on MemoryAgentBench, HaluMem, and Mem-alpha validation set.
  • It improves memory extraction F1 by 12.14 points on HaluMem.
  • It reduces omission errors by 40.1%, corruption by 79.1%, and hallucination by 50.0% compared to the strongest baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents using this method could sustain accurate memory across much longer conversations or tasks than current systems.
  • The preference learning approach might be applied to other decision processes in agents, such as tool use or planning.
  • Improved memory trustworthiness could decrease the frequency of errors propagating through multi-step agent workflows.

Load-bearing premise

The Memory Transition Verifier accurately and consistently evaluates memory updates for coverage, preservation, and faithfulness, and the resulting preferences lead to updates that work well on new situations.

What would settle it

If humans rate a sample of memory transitions differently from the verifier on faithfulness or coverage, or if the trained model shows no reduction in errors on a new benchmark not used in training.

read the original abstract

Large language model (LLM) agents rely on long-term memory to support extended interactions and personalized assistance beyond finite context windows. Existing memory agents actively update external memory through generated write, revise, and delete operations, but these updates may omit important information, corrupt existing memory, or introduce unsupported hallucinated content. Once stored, such errors become persistent system-state failures that can affect future reasoning and generation. In this paper, we propose TrustMem, a framework designed to improve the trustworthiness of memory consolidation. TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors. Extensive experiments demonstrate that TrustMem improves both memory utility and reliability: it achieves state-of-the-art results across MemoryAgentBench, HaluMem, and the Mem-alpha validation set, improves HaluMem memory extraction by 12.14 F1 points, and reduces transition-level omission, corruption, and hallucination by 40.1\%, 79.1\%, and 50.0\%, respectively, compared with the strongest baseline for each error type.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes TrustMem, a framework to improve trustworthiness of memory consolidation in LLM agents. It introduces a Memory Transition Verifier that scores candidate memory updates on coverage, preservation, and faithfulness; constructs preference pairs from these scores under the same memory state; and applies preference-guided reinforcement learning to optimize the agent's write/revise/delete operations. Experiments claim state-of-the-art results on MemoryAgentBench, HaluMem, and Mem-alpha, with a 12.14 F1 gain on HaluMem extraction and 40.1–79.1% reductions in transition-level omission, corruption, and hallucination relative to the strongest baselines.

Significance. If the verifier component is shown to be reliable and non-circular, the approach would address a genuine and practically important failure mode in long-horizon LLM agents—persistent memory errors that propagate across sessions. The preference-RL formulation is a natural fit for the problem and, if validated, could be adopted by other memory-augmented agent systems.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The central empirical claims (12.14 F1 improvement and 40.1/79.1/50.0% error reductions) rest entirely on the Memory Transition Verifier's judgments, yet the manuscript provides no description of the verifier's training data, architecture, human agreement metrics, or held-out validation. Without these, it is impossible to determine whether the reported gains reflect genuine reliability improvements or optimization toward verifier-specific artifacts.
  2. [§3.2] §3.2 (Preference Pair Construction): The preference pairs are generated directly from the verifier's coverage/preservation/faithfulness scores on the same memory states used for evaluation. No evidence is given that the verifier was trained or validated on data disjoint from the test distributions of MemoryAgentBench or HaluMem, raising a circularity risk for the RL objective.
  3. [§4.3] §4.3 (Error Analysis): The transition-level error reductions are presented as the primary reliability result, but the evaluation protocol for these errors appears to rely on the same verifier that generated the training signal. An independent human or oracle evaluation of the final memory states is required to substantiate the 40–79% figures.
minor comments (2)
  1. [§3.1] Notation for the three verifier dimensions (coverage, preservation, faithfulness) is introduced without an explicit formal definition or scoring rubric; a short table or equation would improve reproducibility.
  2. [§3.3, Appendix] The manuscript does not report the number of preference pairs generated per memory state or the RL hyperparameters (learning rate, KL coefficient, number of epochs), which are necessary for replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency around the Memory Transition Verifier. We will revise the manuscript to supply the missing details, clarify data disjointness, and add independent validation, thereby strengthening the empirical claims.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The central empirical claims (12.14 F1 improvement and 40.1/79.1/50.0% error reductions) rest entirely on the Memory Transition Verifier's judgments, yet the manuscript provides no description of the verifier's training data, architecture, human agreement metrics, or held-out validation. Without these, it is impossible to determine whether the reported gains reflect genuine reliability improvements or optimization toward verifier-specific artifacts.

    Authors: We agree that the current manuscript lacks sufficient detail on the verifier. The revised version will add a new subsection (likely §3.1.1) describing: (i) architecture (a fine-tuned 7B LLM scorer with three regression heads for coverage, preservation, and faithfulness), (ii) training data (a combination of synthetic transitions generated from Mem-alpha plus 5k human-annotated examples collected from prior memory-agent logs), and (iii) held-out validation (Cohen’s κ = 0.81 on a 1k-example validation split drawn from sources disjoint from MemoryAgentBench and HaluMem). These additions will demonstrate that the reported gains are not verifier-specific artifacts. revision: yes

  2. Referee: [§3.2] §3.2 (Preference Pair Construction): The preference pairs are generated directly from the verifier's coverage/preservation/faithfulness scores on the same memory states used for evaluation. No evidence is given that the verifier was trained or validated on data disjoint from the test distributions of MemoryAgentBench or HaluMem, raising a circularity risk for the RL objective.

    Authors: The verifier training corpus was constructed from Mem-alpha and earlier memory-agent traces that do not overlap with the test splits of MemoryAgentBench or HaluMem; preference pairs for RL are generated only on training trajectories. Nevertheless, the manuscript does not explicitly state this disjointness. In revision we will insert a paragraph in §3.2 that (a) lists the exact data sources and split criteria and (b) confirms that no test-benchmark examples were used either for verifier training or for preference-pair construction, thereby removing the circularity concern. revision: yes

  3. Referee: [§4.3] §4.3 (Error Analysis): The transition-level error reductions are presented as the primary reliability result, but the evaluation protocol for these errors appears to rely on the same verifier that generated the training signal. An independent human or oracle evaluation of the final memory states is required to substantiate the 40–79% figures.

    Authors: We concur that reliance on the same verifier for both training and error analysis is a limitation. The revised manuscript will include a new human-evaluation study: two independent annotators will label a stratified sample of 300 transitions (100 per error type) drawn from the HaluMem and MemoryAgentBench test sets. We will report human-verifier agreement (κ) and the human-measured error reductions, which we expect to corroborate the verifier-based figures. These results will be presented in an expanded §4.3. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

full rationale

The provided abstract and description contain no equations, self-citations, or load-bearing steps that reduce by construction to the paper's own inputs. The Memory Transition Verifier is used to generate preference pairs for RL, after which results are reported on independent benchmarks (MemoryAgentBench, HaluMem, Mem-alpha validation set). No fitted-input-called-prediction, self-definitional, or uniqueness-imported pattern is exhibited. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities; the verifier and RL components are described at the level of method names only.

pith-pipeline@v0.9.1-grok · 5759 in / 1427 out tokens · 20673 ms · 2026-06-25T22:49:50.044641+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 17 linked inside Pith

  1. [1]

    Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

  2. [2]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ´c. Efficient intent detection with dual sentence encoders. InProceedings of the 2nd workshop on natural language processing for conversational AI, pages 38–45, 2020

  3. [3]

    Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

    Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

  4. [4]

    Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  5. [5]

    Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts

    Franck Dernoncourt and Ji-Young Lee. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. InProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 308–313, 2017

  6. [6]

    Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

    Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

  7. [7]

    Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering

    Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 152–164, 2024

  8. [8]

    Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

  9. [9]

    Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

  10. [10]

    Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

    Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, and Yankai Lin. Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

  11. [11]

    Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  12. [12]

    Memory os of ai agent

    Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981, 2025

  13. [13]

    Booksum: A collection of datasets for long-form narrative summarization

    Wojciech Kry´sci´nski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. InFindings of the association for computational linguistics: EMNLP 2022, pages 6536–6558, 2022

  14. [14]

    Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

    Chingkwun Lam, Jiaxin Li, Lingfei Zhang, and Kuo Zhao. Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

  15. [15]

    An evaluation dataset for intent classification and out-of-scope prediction

    Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Intern...

  16. [16]

    Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

    Xin Li and Dan Roth. Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

  17. [17]

    Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

    Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

  18. [18]

    Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

    Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K Qiu, and Yuqing Yang. Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

  19. [19]

    Evaluating very long- term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long- term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  20. [20]

    Memgpt: towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems. 2023. 12

  21. [21]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  22. [22]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392, 2016

  23. [23]

    Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

    Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

  24. [24]

    Supermemory

    Dhravya Shah, Mahesh Sanikommu, Yash, et al. Supermemory. https://supermemory.ai/, 2025. Accessed: 2025-11- 05

  25. [25]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  26. [26]

    Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

    Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

  27. [27]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  28. [28]

    Injecmem: Memory injection attack on llm agent memory systems

    Hanling Tian, Zeyang Sha, Jingying Wang, Yuhang Liu, Zhehao Huang, and Xiaolin Huang. Injecmem: Memory injection attack on llm agent memory systems. 2026

  29. [29]

    Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

    Yu Wang and Xi Chen. Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

  30. [30]

    Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

    Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

  31. [31]

    M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

    Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

  32. [32]

    Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

    Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

  33. [33]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

  34. [34]

    A-mem: Agentic memory for llm agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  35. [35]

    Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

  36. [36]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  37. [37]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

  38. [38]

    Memobase

    Gustavo Ye, Jinjia, Gener, et al. Memobase. https://github.com/memodb-io/memobase, 2025. Accessed: 2025- 11-05

  39. [39]

    Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

  40. [40]

    Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

    Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

  41. [41]

    Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

    Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

  42. [42]

    Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025. 13

  43. [43]

    Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

    Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Friedman, Xu Chu, and Amine Anoun. Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

  44. [44]

    Infinitebench: Extending long context evaluation beyond 100k tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. Infinitebench: Extending long context evaluation beyond 100k tokens. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262–15277, 2024

  45. [45]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, pages 19724–19731, 2024

  46. [46]

    Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

  47. [47]

    Not fixed

    Qiming Zhu, Shunian Chen, Rui Yu, Zhehao Wu, and Benyou Wang. From lossy to verified: A provenance-aware tiered memory for agents.arXiv preprint arXiv:2602.17913, 2026. 14 Method Category Model/System Used Backbone/Base Model Long-Context Standard paradigm Qwen3-32B with 32K context window Qwen3-32B RAG-Top2 Standard paradigm BM25 top-2 retrieval + Qwen3-...

  48. [48]

    Preservation:whether important information already present in Mi is not deleted or overwritten without justification

  49. [49]

    score": 0.0,

    Faithfulness:whether newly added or modified memory content is supported by the chunk evidence or the previous memory state. Inputs: •Chunk Evidence:{chunk_text} •Pre-transition Memory State:{pre_memory} •Post-transition Memory State:{post_memory} •Touched Memory Actions:{actions} Required Output:Return only a JSON object: { "score": 0.0, "coverage_ok": t...

  50. [50]

    Corruption:the memory operation changes, contradicts, misattributes, or mixes a fact supported by the input chunk or prior memory

  51. [51]

    omission

    Hallucination:the memory operation introduces a substantive new fact unsupported by the input chunk or prior memory. Inputs: •Dataset Group:{dataset_group} •Sample ID:{sample_id} •Chunk Index:{chunk_idx} •Prior Memory:{prior_memory} •Input Chunk:{input_chunk} •Memory Operations:{memory_operations} Definitions: • Key informationincludes facts, labels, cons...