TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

Srinivas Chappidi; Sudipta Paul; Tianyu Yang; Vijay Srinivasan; Vivek Kulkarni

arxiv: 2606.25161 · v1 · pith:IFSHPW36new · submitted 2026-06-23 · 💻 cs.AI

TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

Tianyu Yang , Sudipta Paul , Vijay Srinivasan , Vivek Kulkarni , Srinivas Chappidi This is my paper

Pith reviewed 2026-06-25 22:49 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agentslong-term memorymemory consolidationtrustworthy memorypreference reinforcement learninghallucination reductionmemory errors

0 comments

The pith

TrustMem improves memory reliability in LLM agents by using a verifier to score updates on coverage, preservation, and faithfulness before applying preference-based reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to make long-term memory updates in LLM agents more trustworthy by preventing persistent errors like omissions, corruptions, and hallucinations. Existing agents generate write, revise, and delete operations that can introduce these issues, which then affect all future reasoning. TrustMem adds a Memory Transition Verifier to assess proposed updates and builds preference pairs to train the agent via reinforcement learning to prefer better transitions. A sympathetic reader would care because reliable memory is essential for agents to provide consistent personalized assistance over extended interactions without accumulating mistakes.

Core claim

TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors.

What carries the argument

The Memory Transition Verifier, which judges memory updates for how completely they cover new information, how well they preserve existing content, and how faithfully they avoid unsupported additions.

If this is right

TrustMem achieves state-of-the-art results on MemoryAgentBench, HaluMem, and Mem-alpha validation set.
It improves memory extraction F1 by 12.14 points on HaluMem.
It reduces omission errors by 40.1%, corruption by 79.1%, and hallucination by 50.0% compared to the strongest baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents using this method could sustain accurate memory across much longer conversations or tasks than current systems.
The preference learning approach might be applied to other decision processes in agents, such as tool use or planning.
Improved memory trustworthiness could decrease the frequency of errors propagating through multi-step agent workflows.

Load-bearing premise

The Memory Transition Verifier accurately and consistently evaluates memory updates for coverage, preservation, and faithfulness, and the resulting preferences lead to updates that work well on new situations.

What would settle it

If humans rate a sample of memory transitions differently from the verifier on faithfulness or coverage, or if the trained model shows no reduction in errors on a new benchmark not used in training.

read the original abstract

Large language model (LLM) agents rely on long-term memory to support extended interactions and personalized assistance beyond finite context windows. Existing memory agents actively update external memory through generated write, revise, and delete operations, but these updates may omit important information, corrupt existing memory, or introduce unsupported hallucinated content. Once stored, such errors become persistent system-state failures that can affect future reasoning and generation. In this paper, we propose TrustMem, a framework designed to improve the trustworthiness of memory consolidation. TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors. Extensive experiments demonstrate that TrustMem improves both memory utility and reliability: it achieves state-of-the-art results across MemoryAgentBench, HaluMem, and the Mem-alpha validation set, improves HaluMem memory extraction by 12.14 F1 points, and reduces transition-level omission, corruption, and hallucination by 40.1\%, 79.1\%, and 50.0\%, respectively, compared with the strongest baseline for each error type.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TrustMem adds a verifier plus preference RL to memory agents, but the reported error cuts rest on an unvalidated component.

read the letter

The main thing is that TrustMem introduces a Memory Transition Verifier to score write/revise/delete operations on coverage, preservation, and faithfulness, then turns those scores into preference pairs for RL training of the update policy. That specific pairing of verifier and preference optimization looks new in the memory-agent literature.

The experiments claim SOTA numbers on MemoryAgentBench, HaluMem, and Mem-alpha, plus a 12-point F1 lift on extraction and 40-79% drops in the three error types. The error breakdowns by omission, corruption, and hallucination are a useful way to measure the problem.

The soft spot is exactly the one the stress-test flags. The abstract gives no information on how the verifier is trained, what labels it sees, or any human agreement numbers. If the verifier is an LLM that makes the same kinds of mistakes it is supposed to catch, the preference pairs will simply teach the policy to imitate the verifier's biases rather than ground truth. Without held-out validation or training details, the large error reductions cannot be assessed.

There is also no mention of whether the preference data was constructed from the same distributions used for final evaluation, which leaves a circularity risk.

This is for groups already building long-term memory for agents and who need to reduce persistent state errors. A reader working on that exact problem could extract the framework idea even if the numbers need re-checking.

It deserves peer review once the methods section is in place; the core idea is straightforward enough to test, but the current evidence is too thin to stand on its own.

Referee Report

3 major / 2 minor

Summary. The paper proposes TrustMem, a framework to improve trustworthiness of memory consolidation in LLM agents. It introduces a Memory Transition Verifier that scores candidate memory updates on coverage, preservation, and faithfulness; constructs preference pairs from these scores under the same memory state; and applies preference-guided reinforcement learning to optimize the agent's write/revise/delete operations. Experiments claim state-of-the-art results on MemoryAgentBench, HaluMem, and Mem-alpha, with a 12.14 F1 gain on HaluMem extraction and 40.1–79.1% reductions in transition-level omission, corruption, and hallucination relative to the strongest baselines.

Significance. If the verifier component is shown to be reliable and non-circular, the approach would address a genuine and practically important failure mode in long-horizon LLM agents—persistent memory errors that propagate across sessions. The preference-RL formulation is a natural fit for the problem and, if validated, could be adopted by other memory-augmented agent systems.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): The central empirical claims (12.14 F1 improvement and 40.1/79.1/50.0% error reductions) rest entirely on the Memory Transition Verifier's judgments, yet the manuscript provides no description of the verifier's training data, architecture, human agreement metrics, or held-out validation. Without these, it is impossible to determine whether the reported gains reflect genuine reliability improvements or optimization toward verifier-specific artifacts.
[§3.2] §3.2 (Preference Pair Construction): The preference pairs are generated directly from the verifier's coverage/preservation/faithfulness scores on the same memory states used for evaluation. No evidence is given that the verifier was trained or validated on data disjoint from the test distributions of MemoryAgentBench or HaluMem, raising a circularity risk for the RL objective.
[§4.3] §4.3 (Error Analysis): The transition-level error reductions are presented as the primary reliability result, but the evaluation protocol for these errors appears to rely on the same verifier that generated the training signal. An independent human or oracle evaluation of the final memory states is required to substantiate the 40–79% figures.

minor comments (2)

[§3.1] Notation for the three verifier dimensions (coverage, preservation, faithfulness) is introduced without an explicit formal definition or scoring rubric; a short table or equation would improve reproducibility.
[§3.3, Appendix] The manuscript does not report the number of preference pairs generated per memory state or the RL hyperparameters (learning rate, KL coefficient, number of epochs), which are necessary for replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency around the Memory Transition Verifier. We will revise the manuscript to supply the missing details, clarify data disjointness, and add independent validation, thereby strengthening the empirical claims.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Experiments): The central empirical claims (12.14 F1 improvement and 40.1/79.1/50.0% error reductions) rest entirely on the Memory Transition Verifier's judgments, yet the manuscript provides no description of the verifier's training data, architecture, human agreement metrics, or held-out validation. Without these, it is impossible to determine whether the reported gains reflect genuine reliability improvements or optimization toward verifier-specific artifacts.

Authors: We agree that the current manuscript lacks sufficient detail on the verifier. The revised version will add a new subsection (likely §3.1.1) describing: (i) architecture (a fine-tuned 7B LLM scorer with three regression heads for coverage, preservation, and faithfulness), (ii) training data (a combination of synthetic transitions generated from Mem-alpha plus 5k human-annotated examples collected from prior memory-agent logs), and (iii) held-out validation (Cohen’s κ = 0.81 on a 1k-example validation split drawn from sources disjoint from MemoryAgentBench and HaluMem). These additions will demonstrate that the reported gains are not verifier-specific artifacts. revision: yes
Referee: [§3.2] §3.2 (Preference Pair Construction): The preference pairs are generated directly from the verifier's coverage/preservation/faithfulness scores on the same memory states used for evaluation. No evidence is given that the verifier was trained or validated on data disjoint from the test distributions of MemoryAgentBench or HaluMem, raising a circularity risk for the RL objective.

Authors: The verifier training corpus was constructed from Mem-alpha and earlier memory-agent traces that do not overlap with the test splits of MemoryAgentBench or HaluMem; preference pairs for RL are generated only on training trajectories. Nevertheless, the manuscript does not explicitly state this disjointness. In revision we will insert a paragraph in §3.2 that (a) lists the exact data sources and split criteria and (b) confirms that no test-benchmark examples were used either for verifier training or for preference-pair construction, thereby removing the circularity concern. revision: yes
Referee: [§4.3] §4.3 (Error Analysis): The transition-level error reductions are presented as the primary reliability result, but the evaluation protocol for these errors appears to rely on the same verifier that generated the training signal. An independent human or oracle evaluation of the final memory states is required to substantiate the 40–79% figures.

Authors: We concur that reliance on the same verifier for both training and error analysis is a limitation. The revised manuscript will include a new human-evaluation study: two independent annotators will label a stratified sample of 300 transitions (100 per error type) drawn from the HaluMem and MemoryAgentBench test sets. We will report human-verifier agreement (κ) and the human-measured error reductions, which we expect to corroborate the verifier-based figures. These results will be presented in an expanded §4.3. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

full rationale

The provided abstract and description contain no equations, self-citations, or load-bearing steps that reduce by construction to the paper's own inputs. The Memory Transition Verifier is used to generate preference pairs for RL, after which results are reported on independent benchmarks (MemoryAgentBench, HaluMem, Mem-alpha validation set). No fitted-input-called-prediction, self-definitional, or uniqueness-imported pattern is exhibited. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities; the verifier and RL components are described at the level of method names only.

pith-pipeline@v0.9.1-grok · 5759 in / 1427 out tokens · 20673 ms · 2026-06-25T22:49:50.044641+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 17 linked inside Pith

[1]

Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

Pith/arXiv arXiv 2024
[2]

Efficient intent detection with dual sentence encoders

Iñigo Casanueva, Tadas Temˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ´c. Efficient intent detection with dual sentence encoders. InProceedings of the 2nd workshop on natural language processing for conversational AI, pages 38–45, 2020

2020
[3]

Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

arXiv 2025
[4]

Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Pith/arXiv arXiv 2025
[5]

Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts

Franck Dernoncourt and Ji-Young Lee. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. InProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 308–313, 2017

2017
[6]

Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

arXiv 2025
[7]

Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering

Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 152–164, 2024

2024
[8]

Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

Pith/arXiv arXiv 2025
[9]

Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Pith/arXiv arXiv 2025
[10]

Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, and Yankai Lin. Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

arXiv 2026
[11]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[12]

Memory os of ai agent

Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981, 2025

2025
[13]

Booksum: A collection of datasets for long-form narrative summarization

Wojciech Kry´sci´nski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. InFindings of the association for computational linguistics: EMNLP 2022, pages 6536–6558, 2022

2022
[14]

Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

Chingkwun Lam, Jiaxin Li, Lingfei Zhang, and Kuo Zhao. Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

Pith/arXiv arXiv 2026
[15]

An evaluation dataset for intent classification and out-of-scope prediction

Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Intern...

2019
[16]

Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

Xin Li and Dan Roth. Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

2006
[17]

Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

arXiv 2023
[18]

Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K Qiu, and Yuqing Yang. Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

arXiv 2025
[19]

Evaluating very long- term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long- term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[20]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems. 2023. 12

2023
[21]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

2023
[22]

Squad: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392, 2016

2016
[23]

Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

Pith/arXiv arXiv 2025
[24]

Supermemory

Dhravya Shah, Mahesh Sanikommu, Yash, et al. Supermemory. https://supermemory.ai/, 2025. Accessed: 2025-11- 05

2025
[25]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[26]

Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

arXiv 2025
[27]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[28]

Injecmem: Memory injection attack on llm agent memory systems

Hanling Tian, Zeyang Sha, Jingying Wang, Yuhang Liu, Zhehao Huang, and Xiaolin Huang. Injecmem: Memory injection attack on llm agent memory systems. 2026

2026
[29]

Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

Yu Wang and Xi Chen. Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

Pith/arXiv arXiv 2025
[30]

Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

arXiv 2024
[31]

M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

arXiv 2025
[32]

Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

Pith/arXiv arXiv 2025
[33]

Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

Pith/arXiv arXiv 2024
[34]

A-mem: Agentic memory for llm agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

Pith/arXiv arXiv 2025
[35]

Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

Pith/arXiv arXiv 2025
[36]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[37]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018
[38]

Memobase

Gustavo Ye, Jinjia, Gener, et al. Memobase. https://github.com/memodb-io/memobase, 2025. Accessed: 2025- 11-05

2025
[39]

Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Pith/arXiv arXiv 2025
[40]

Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

Pith/arXiv arXiv 2026
[41]

Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

arXiv 2026
[42]

Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025. 13

arXiv 2025
[43]

Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Friedman, Xu Chu, and Amine Anoun. Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

arXiv 2026
[44]

Infinitebench: Extending long context evaluation beyond 100k tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. Infinitebench: Extending long context evaluation beyond 100k tokens. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262–15277, 2024

2024
[45]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, pages 19724–19731, 2024

2024
[46]

Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

Pith/arXiv arXiv 2025
[47]

Not fixed

Qiming Zhu, Shunian Chen, Rui Yu, Zhehao Wu, and Benyou Wang. From lossy to verified: A provenance-aware tiered memory for agents.arXiv preprint arXiv:2602.17913, 2026. 14 Method Category Model/System Used Backbone/Base Model Long-Context Standard paradigm Qwen3-32B with 32K context window Qwen3-32B RAG-Top2 Standard paradigm BM25 top-2 retrieval + Qwen3-...

arXiv 2026
[48]

Preservation:whether important information already present in Mi is not deleted or overwritten without justification
[49]

score": 0.0,

Faithfulness:whether newly added or modified memory content is supported by the chunk evidence or the previous memory state. Inputs: •Chunk Evidence:{chunk_text} •Pre-transition Memory State:{pre_memory} •Post-transition Memory State:{post_memory} •Touched Memory Actions:{actions} Required Output:Return only a JSON object: { "score": 0.0, "coverage_ok": t...
[50]

Corruption:the memory operation changes, contradicts, misattributes, or mixes a fact supported by the input chunk or prior memory
[51]

omission

Hallucination:the memory operation introduces a substantive new fact unsupported by the input chunk or prior memory. Inputs: •Dataset Group:{dataset_group} •Sample ID:{sample_id} •Chunk Index:{chunk_idx} •Prior Memory:{prior_memory} •Input Chunk:{input_chunk} •Memory Operations:{memory_operations} Definitions: • Key informationincludes facts, labels, cons...

[1] [1]

Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

Pith/arXiv arXiv 2024

[2] [2]

Efficient intent detection with dual sentence encoders

Iñigo Casanueva, Tadas Temˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ´c. Efficient intent detection with dual sentence encoders. InProceedings of the 2nd workshop on natural language processing for conversational AI, pages 38–45, 2020

2020

[3] [3]

Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

arXiv 2025

[4] [4]

Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Pith/arXiv arXiv 2025

[5] [5]

Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts

Franck Dernoncourt and Ji-Young Lee. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. InProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 308–313, 2017

2017

[6] [6]

Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Memory injection attacks on llm agents via query-only interaction.arXiv preprint arXiv:2503.03704, 2025

arXiv 2025

[7] [7]

Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering

Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 152–164, 2024

2024

[8] [8]

Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

Pith/arXiv arXiv 2025

[9] [9]

Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

Pith/arXiv arXiv 2025

[10] [10]

Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, and Yankai Lin. Atommem: Learnable dynamic agentic memory with atomic memory operation.arXiv preprint arXiv:2601.08323, 2026

arXiv 2026

[11] [11]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[12] [12]

Memory os of ai agent

Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981, 2025

2025

[13] [13]

Booksum: A collection of datasets for long-form narrative summarization

Wojciech Kry´sci´nski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. InFindings of the association for computational linguistics: EMNLP 2022, pages 6536–6558, 2022

2022

[14] [14]

Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

Chingkwun Lam, Jiaxin Li, Lingfei Zhang, and Kuo Zhao. Governing evolving memory in llm agents: Risks, mechanisms, and the stability and safety governed memory (ssgm) framework.arXiv preprint arXiv:2603.11768, 2026

Pith/arXiv arXiv 2026

[15] [15]

An evaluation dataset for intent classification and out-of-scope prediction

Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Intern...

2019

[16] [16]

Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

Xin Li and Dan Roth. Learning question classifiers: the role of semantic information.Natural Language Engineering, 12(3): 229–249, 2006

2006

[17] [17]

Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation.arXiv preprint arXiv:2308.08239, 2023

arXiv 2023

[18] [18]

Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K Qiu, and Yuqing Yang. Agent lightning: Train any ai agents with reinforcement learning.arXiv preprint arXiv:2508.03680, 2025

arXiv 2025

[19] [19]

Evaluating very long- term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long- term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024

[20] [20]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems. 2023. 12

2023

[21] [21]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

2023

[22] [22]

Squad: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392, 2016

2016

[23] [23]

Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: a temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

Pith/arXiv arXiv 2025

[24] [24]

Supermemory

Dhravya Shah, Mahesh Sanikommu, Yash, et al. Supermemory. https://supermemory.ai/, 2025. Accessed: 2025-11- 05

2025

[25] [25]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[26] [26]

Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context llm agents.arXiv preprint arXiv:2509.23040, 2025

arXiv 2025

[27] [27]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023

[28] [28]

Injecmem: Memory injection attack on llm agent memory systems

Hanling Tian, Zeyang Sha, Jingying Wang, Yuhang Liu, Zhehao Huang, and Xiaolin Huang. Injecmem: Memory injection attack on llm agent memory systems. 2026

2026

[29] [29]

Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

Yu Wang and Xi Chen. Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

Pith/arXiv arXiv 2025

[30] [30]

Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. Memoryllm: Towards self-updatable large language models.arXiv preprint arXiv:2402.04624, 2024

arXiv 2024

[31] [31]

M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

arXiv 2025

[32] [32]

Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-{\alpha}: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

Pith/arXiv arXiv 2025

[33] [33]

Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

Pith/arXiv arXiv 2024

[34] [34]

A-mem: Agentic memory for llm agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

Pith/arXiv arXiv 2025

[35] [35]

Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

Pith/arXiv arXiv 2025

[36] [36]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[37] [37]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018

[38] [38]

Memobase

Gustavo Ye, Jinjia, Gener, et al. Memobase. https://github.com/memodb-io/memobase, 2025. Accessed: 2025- 11-05

2025

[39] [39]

Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Pith/arXiv arXiv 2025

[40] [40]

Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

Pith/arXiv arXiv 2026

[41] [41]

Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

arXiv 2026

[42] [42]

Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025. 13

arXiv 2025

[43] [43]

Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Friedman, Xu Chu, and Amine Anoun. Adaptive memory admission control for llm agents.arXiv preprint arXiv:2603.04549, 2026

arXiv 2026

[44] [44]

Infinitebench: Extending long context evaluation beyond 100k tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. Infinitebench: Extending long context evaluation beyond 100k tokens. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262–15277, 2024

2024

[45] [45]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, pages 19724–19731, 2024

2024

[46] [46]

Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.arXiv preprint arXiv:2506.15841, 2025

Pith/arXiv arXiv 2025

[47] [47]

Not fixed

Qiming Zhu, Shunian Chen, Rui Yu, Zhehao Wu, and Benyou Wang. From lossy to verified: A provenance-aware tiered memory for agents.arXiv preprint arXiv:2602.17913, 2026. 14 Method Category Model/System Used Backbone/Base Model Long-Context Standard paradigm Qwen3-32B with 32K context window Qwen3-32B RAG-Top2 Standard paradigm BM25 top-2 retrieval + Qwen3-...

arXiv 2026

[48] [48]

Preservation:whether important information already present in Mi is not deleted or overwritten without justification

[49] [49]

score": 0.0,

Faithfulness:whether newly added or modified memory content is supported by the chunk evidence or the previous memory state. Inputs: •Chunk Evidence:{chunk_text} •Pre-transition Memory State:{pre_memory} •Post-transition Memory State:{post_memory} •Touched Memory Actions:{actions} Required Output:Return only a JSON object: { "score": 0.0, "coverage_ok": t...

[50] [50]

Corruption:the memory operation changes, contradicts, misattributes, or mixes a fact supported by the input chunk or prior memory

[51] [51]

omission

Hallucination:the memory operation introduces a substantive new fact unsupported by the input chunk or prior memory. Inputs: •Dataset Group:{dataset_group} •Sample ID:{sample_id} •Chunk Index:{chunk_idx} •Prior Memory:{prior_memory} •Input Chunk:{input_chunk} •Memory Operations:{memory_operations} Definitions: • Key informationincludes facts, labels, cons...