Recognition: no theorem link
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Pith reviewed 2026-05-16 21:17 UTC · model grok-4.3
The pith
A new benchmark shows that current LLM memory agents fall short on four core competencies drawn from cognitive science.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemoryAgentBench reformats existing long-context datasets into incremental multi-turn interactions and adds newly constructed tasks, producing the first benchmark to cover all four memory competencies. On it, current agent architectures, from basic context use to tool-integrated external memory, consistently fail to perform well across accurate retrieval, test-time learning, long-range understanding, and selective forgetting.
What carries the argument
MemoryAgentBench, the benchmark that turns static long-context datasets into incremental multi-turn interactions to test the four memory competencies together.
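The transformation described above can be pictured with a small sketch. This is an illustration, not the paper's implementation: the function name `to_multi_turn`, the character-based chunking, and the turn schema are assumptions; the paper only specifies that static long-context examples become incremental multi-turn interactions.

```python
def to_multi_turn(document: str, question: str, chunk_chars: int = 2000) -> list[dict]:
    """Split a static long-context example into incremental turns.

    Illustrative sketch: each 'ingest' turn delivers one chunk of the
    source document; a final 'query' turn asks the question, so the
    agent must answer from whatever memory it built along the way
    rather than from the full context at once.
    """
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    turns = [{"role": "ingest", "content": c} for c in chunks]
    turns.append({"role": "query", "content": question})
    return turns

# A 5000-character document becomes 3 ingestion turns plus 1 query turn.
turns = to_multi_turn("A" * 5000, "What was stated earlier?")
```

Chunk granularity matters in such a setup: finer chunks mean more turns and a heavier memory-management burden, which is exactly the axis an incremental benchmark stresses.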
If this is right
- Agents require integrated designs that handle retrieval, updates, long-range connections, and forgetting at the same time rather than in isolation.
- Benchmarks for agents should shift from single-turn long-context tests to incremental multi-turn evaluations.
- Future memory modules need explicit mechanisms for test-time learning and selective forgetting to close the observed gaps.
- Evaluation across diverse agent types highlights that external memory tools alone do not solve the full set of competencies.
Where Pith is reading between the lines
- Agents built with better memory handling could sustain coherent performance over much longer interaction histories without repeated errors.
- The benchmark setup could be adapted to test memory demands in multi-agent collaboration or tool-using environments.
- Links to human memory research suggest combining neural retrieval with explicit forgetting rules might address the shortfalls more directly than scaling context alone.
Load-bearing premise
The four competencies drawn from memory science form the complete essential set for agents, and converting static datasets to multi-turn format keeps the original measurement properties intact.
What would settle it
An agent architecture that scores strongly on all four competencies within MemoryAgentBench while also showing stable performance in open-ended real-world multi-turn conversations would support the results; failure to correlate with external memory tasks would challenge them.
read the original abstract
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemoryAgentBench, a benchmark for evaluating memory in LLM agents. Drawing on memory science, it defines four core competencies (accurate retrieval, test-time learning, long-range understanding, and selective forgetting), transforms existing long-context datasets into incremental multi-turn interactions, and evaluates a range of agents from simple context/RAG systems to those with external memory modules. The central empirical claim is that current methods fall short of mastering all four competencies.
Significance. If the dataset transformations preserve the original information structure and retrieval demands without introducing artifacts, the benchmark would fill a notable gap by providing the first systematic coverage of all four competencies in an interactive setting. The evaluation across diverse agent architectures supplies concrete evidence of current limitations and could usefully direct future work on memory mechanisms.
major comments (2)
- [§3] §3 (Benchmark Construction): The transformation of static long-context datasets into incremental multi-turn format is presented without explicit validation steps such as information-theoretic equivalence checks, dependency-chain preservation tests, or controlled ablations on turn structure. This assumption is load-bearing for the claim that scores reflect the intended four competencies rather than new artifacts (e.g., artificial recency biases).
- [§4] §4 (Experiments and Results): The reported performance shortfalls across agents lack accompanying statistical controls, confidence intervals, or ablation studies that isolate memory-specific effects from confounding factors such as varying context lengths or prompt formatting. Without these, the conclusion that agents 'fall short of mastering all four competencies' rests on descriptive comparisons whose robustness is unclear.
minor comments (2)
- [Abstract / §2] The abstract and §2 could more precisely state the selection criteria used when curating and transforming the source datasets.
- [Figures / Tables] Figure and table captions would benefit from explicit mention of the exact metrics (e.g., accuracy, F1) and number of runs underlying each reported score.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the benchmark construction and experimental analysis would benefit from additional validation and statistical controls, and we plan to incorporate these elements in the revised manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Benchmark Construction): The transformation of static long-context datasets into incremental multi-turn format is presented without explicit validation steps such as information-theoretic equivalence checks, dependency-chain preservation tests, or controlled ablations on turn structure. This assumption is load-bearing for the claim that scores reflect the intended four competencies rather than new artifacts (e.g., artificial recency biases).
Authors: We acknowledge that the current version describes the dataset transformations and curation process but does not report the explicit validation steps suggested. In the revision we will add information-theoretic equivalence checks between original and transformed versions, dependency-chain preservation tests, and controlled ablations varying turn structure. These additions will demonstrate that the multi-turn format preserves the original retrieval demands and does not introduce artifacts such as artificial recency biases. revision: yes
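A minimal version of the promised preservation checks could look like the following sketch (hypothetical, not from the paper): lossless reconstruction of the source document from the ingestion turns is a necessary, though not sufficient, condition for the multi-turn transformation to preserve the original retrieval demands.

```python
def preserves_content(original: str, turns: list[str]) -> bool:
    """Necessary (not sufficient) validation check: concatenating the
    ingestion turns must reconstruct the original document exactly,
    so the multi-turn format cannot drop or duplicate information."""
    return "".join(turns) == original

doc = "Fact A. Fact B. Fact C."
turns = [doc[i:i + 8] for i in range(0, len(doc), 8)]
assert preserves_content(doc, turns)           # lossless split passes
assert not preserves_content(doc, turns[:-1])  # dropping a turn fails
```

Equivalence of information content and of dependency chains would need stronger tests than string equality, but a check of this shape would already rule out silent truncation during conversion.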
-
Referee: [§4] §4 (Experiments and Results): The reported performance shortfalls across agents lack accompanying statistical controls, confidence intervals, or ablation studies that isolate memory-specific effects from confounding factors such as varying context lengths or prompt formatting. Without these, the conclusion that agents 'fall short of mastering all four competencies' rests on descriptive comparisons whose robustness is unclear.
Authors: We agree that the experimental section would be strengthened by statistical controls. In the revision we will report confidence intervals (computed over multiple random seeds where applicable), include ablation studies that isolate memory-module effects while holding context length and prompt format fixed, and add controls for the identified confounding factors. These changes will provide a clearer basis for the claim that current agents fall short on all four competencies. revision: yes
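The seed-level confidence intervals the authors commit to could be computed with a standard percentile bootstrap. A standard-library sketch follows; the scores are made-up placeholders, not numbers from the paper:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score
    across runs (e.g., one benchmark score per random seed)."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-seed accuracies for one agent on one competency.
scores = [0.62, 0.58, 0.65, 0.60, 0.63]
lo, hi = bootstrap_ci(scores)
assert lo <= sum(scores) / len(scores) <= hi
```

With only a handful of seeds the percentile bootstrap is crude, but it makes overlap (or separation) between agents' score distributions explicit instead of leaving comparisons purely descriptive.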
Circularity Check
No circularity: empirical benchmark construction from external datasets and cognitive framing
full rationale
The paper presents an empirical benchmark (MemoryAgentBench) by curating and transforming static long-context datasets into incremental multi-turn interactions to cover four competencies drawn from the memory science literature. No equations, fitted parameters, or predictions are defined; the central claims rest on evaluation results across agent types rather than on any self-referential derivation. Dataset transformation and competency selection are presented as design choices justified by external cognitive science, not by internal reduction or self-citation chains. The work is validated against external benchmarks and does not rename known results or smuggle in assumptions through load-bearing citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Four core competencies (accurate retrieval, test-time learning, long-range understanding, selective forgetting) are essential for memory agents, based on classic theories from memory science and cognitive science.
Forward citations
Cited by 18 Pith papers
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
When to Forget: A Memory Governance Primitive
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
-
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
-
Stateless Decision Memory for Enterprise AI Agents
Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
-
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
-
Opal: Private Memory for Personal AI
Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Reference graph
Works this paper leans on
-
[1]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding.arXiv preprint arXiv:2308.14508,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding andreasoningonrealisticlong-contextmultitasks.arXiv preprint arXiv:2412.15204,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Amanda Bertsch, Maor Ivgi, Emily Xiao, Uri Alon, Jonathan Berant, Matthew R Gorm- ley, and Graham Neubig. In-context learning with long-context models: An in-depth exploration.arXiv preprint arXiv:2405.00200,
-
[4]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Association for Computational Linguis- tics. doi: 10.18653/v1/2020.nlp4convai-1.5. URLhttps://aclanthology.org/2020. nlp4convai-1.5/. Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.nlp4convai-1.5 2020
- [5]
-
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
URLhttps://deepmind.google/technologies/gemini/ pro/. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 ...
work page 2019
-
[7]
URL https://aclanthology.org/2024.sighan-1.18/
Association for Computational Linguistics. URL https://aclanthology.org/2024.sighan-1.18/. Hermann Ebbinghaus. Memory: A contribution to experimental psychology.Annals of Neurosciences, 20(4):155–156,
work page 2024
-
[8]
URLhttps: //pubmed.ncbi.nlm.nih.gov/25206041/
doi: 10.5214/ans.0972.7531.200408. URLhttps: //pubmed.ncbi.nlm.nih.gov/25206041/. Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From lo- cal to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130,
-
[9]
Alphaedit: Null-spaceconstrainedknowledgeeditingforlanguage models.arXiv preprint arXiv:2410.02355,
Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, andTat-SengChua. Alphaedit: Null-spaceconstrainedknowledgeeditingforlanguage models.arXiv preprint arXiv:2410.02355,
-
[10]
Robert Friel, Masha Belyi, and Atindriyo Sanyal. Ragbench: Explainable benchmark for retrieval-augmented generation systems.arXiv preprint arXiv:2407.11005,
-
[11]
From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models.arXiv preprint arXiv:2502.14802,
work page internal anchor Pith review arXiv
-
[12]
RULER: What's the Real Context Size of Your Long-Context Language Models?
URLhttp://arxiv.org/abs/2404.06654. arXiv:2404.06654 [cs]. Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Ping Luo, and Guohao Li. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Unsupervised Dense Information Retrieval with Contrastive Learning
URLhttps://github.com/camel-ai/owl. Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Ar- mand Joulin, and Edouard Grave. Unsupervised dense information retrieval with con- trastive learning.arXiv preprint arXiv:2112.09118,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
URL https://books.google.com/books?id=JO1RL9BcI44C. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Memory os of ai agent.arXiv preprint arXiv:2506.06326,
Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent.arXiv preprint arXiv:2506.06326,
-
[16]
12 Published as a conference paper at ICLR 2026 Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. One thousand and one pairs: A" novel" challenge for long-context language models.arXiv preprint arXiv:2406.16264,
-
[17]
URLhttps://github.com/kingjulio8238/Memary. GitHub repository. Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stam- bler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associ...
work page 2025
-
[18]
Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K
Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. An evaluation dataset for intent classification and out-of-scope prediction. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),Proceedings of the 2019 Conferen...
work page 2019
-
[19]
Association for Computational Linguis- tics. doi: 10.18653/v1/D19-1131. URLhttps://aclanthology.org/D19-1131/. Dong-HoLee, AdyashaMaharana, JayPujara, XiangRen, andFrancescoBarbieri. Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270,
-
[20]
Loogle: Can long-context language models understand long contexts?arXiv preprint arXiv:2311.04939,
Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts?arXiv preprint arXiv:2311.04939,
-
[21]
Xianming Li, Julius Lipp, Aamir Shakir, Rui Huang, and Jing Li. Bmx: Entropy-weighted similarity and semantic-enhanced lexical search.arXiv preprint arXiv:2408.06643,
-
[22]
Xin Li and Dan Roth. Learning question classifiers. InCOLING 2002: The 19th Inter- national Conference on Computational Linguistics,
work page 2002
-
[23]
URLhttps://aclanthology. org/C02-1150/. Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, and Joseph E Gonzalez. Sleep-time compute: Beyond inference scaling at test-time.arXiv preprint arXiv:2504.13171,
-
[24]
Benchmarking Natural Language Understanding Services for building Conversational Agents
URLhttps: //arxiv.org/abs/1903.05566. Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huany- ong Liu, Tong Xu, and Enhong Chen. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models.ACM Transactions on In- formation Systems, 43(2):1–32,
work page internal anchor Pith review Pith/arXiv arXiv 1903
-
[25]
Evaluating Very Long-Term Conversational Memory of LLM Agents
13 Published as a conference paper at ICLR 2026 Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents.arXiv preprint arXiv:2402.17753,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[26]
Vasilije Markovic, Lazar Obradovic, Laszlo Hajdu, and Jovan Pavlovic. Optimizing the interface between knowledge graphs and llms for complex reasoning.arXiv preprint arXiv:2505.24478,
-
[27]
doi: 10.1037/0033-295X.102.3.419. URLhttps://pubmed. ncbi.nlm.nih.gov/7624455/. memodb-io and Memobase contributors. Memobase: Profile-based long-term memory for ai applications,
-
[28]
Nolima: Long-context evaluation beyond literal matching.arXiv preprint arXiv:2502.05167,
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A Rossi, Se- unghyun Yoon, and Hinrich Schütze. Nolima: Long-context evaluation beyond literal matching.arXiv preprint arXiv:2502.05167,
-
[29]
Kilt: a benchmark for knowledge intensive language tasks
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. Kilt: a benchmark for knowledge intensive language tasks. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...
work page 2021
-
[30]
Long2rag: Evaluating long-context& long-formretrieval-augmented generationwithkeypoint recall
14 Published as a conference paper at ICLR 2026 Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, and Wei Xu. Long2rag: Evaluating long-context& long-formretrieval-augmented generationwithkeypoint recall. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 4852– 4872,
work page 2026
-
[31]
Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation
Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation. InProceedings of the ACM on Web Conference 2025, pp. 2366– 2377,
work page 2025
-
[32]
Zep: A Temporal Knowledge Graph Architecture for Agent Memory
Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
T 2-ragbench: Text-and-table benchmark for evaluating retrieval-augmented generation
JanStrich, EnesKutayIsgorur, MaximilianTrescher, ChrisBiemann, andMartinSemmann. T 2-ragbench: Text-and-table benchmark for evaluating retrieval-augmented generation. arXiv preprint arXiv:2506.12071,
-
[34]
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Freshllms: Refreshing large language models with search engine augmentation
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun- Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. InFindings of the Association for Computational Lin- guistics: ACL 2024, pp. 13697–13720,
work page 2024
-
[36]
Luanbo Wan and Weizhi Ma. Storybench: A dynamic benchmark for evaluating long-term memory with multi turns.arXiv preprint arXiv:2506.13356,
-
[37]
Novelqa: A benchmark for long-range novel question answering.arXiv preprint arXiv:2403.12766, 2024a
Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guang- sheng Bao, Qian Wang, and Yue Zhang. Novelqa: A benchmark for long-range novel question answering.arXiv preprint arXiv:2403.12766, 2024a. Minzheng Wang, Longze Chen, Fu Cheng, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no docum...
-
[38]
Self- updatable large language models by integrating context into model parameters
Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O’Brien, Junda Wu, and Julian McAuley. Self- updatable large language models by integrating context into model parameters. InThe Thirteenth International Conference on Learning Representations. Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, et al. ...
-
[39]
URLhttps://openreview.net/forum?id=OcqbkROe8J. Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, Nafis Sadeq, Xiusi Chen, ZexueHe, WeiWang, GholamrezaHaffari, HengJi, andJulianJ.McAuley. Towards lifespan cognitive systems.TMLR, 2025/02. Maria Wimber, Arjen Alink, Ian Charest, Nikolaus Kriegeskorte, and Michael C. Ander- son. Retrieval induces a...
work page 2025
-
[40]
URL https://www.nature.com/articles/nn.3973
doi: 10.1038/nn.3973. URL https://www.nature.com/articles/nn.3973. Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. InThe Thir- teenth International Conference on Learning Representations,
-
[41]
Pcl: Peer- contrastive learning with diverse augmentations for unsupervised sentence embeddings
Qiyu Wu, Chongyang Tao, Tao Shen, Can Xu, Xiubo Geng, and Daxin Jiang. Pcl: Peer- contrastive learning with diverse augmentations for unsupervised sentence embeddings. arXiv preprint arXiv:2201.12093,
-
[42]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110,
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
Detectiveqa: Evaluating long-context reasoning on detective novels.arXiv preprint arXiv:2409.02465,
Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, et al. Detectiveqa: Evaluating long-context reasoning on detective novels.arXiv preprint arXiv:2409.02465,
-
[44]
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. Helmet: How to evaluate long-context language models effectively and thoroughly.arXiv preprint arXiv:2410.02694,
-
[45]
Explicit memory learning with expectation maximization
Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuan-Jing Huang. Explicit memory learning with expectation maximization. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16618–16635,
work page 2024
-
[46]
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long- context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259,
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
Tian Yu, Shaolei Zhang, and Yang Feng. Auto-rag: Autonomous retrieval-augmented gen- eration for large language models.arXiv preprint arXiv:2411.19443,
-
[48]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176,
work page internal anchor Pith review Pith/arXiv arXiv
-
[49]
MemoryBank: Enhancing Large Language Models with Long-Term Memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory.arXiv preprint arXiv:2305.10250, 2023a. 16 Published as a conference paper at ICLR 2026 Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. Mquake: Assessing knowledge editing in language models via mu...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[50]
17 Published as a conference paper at ICLR 2026 A The Use of Large Language Models (LLMs) In this paper writing process, we used an LLM to assist with content polishing—for example, identifying grammatical errors and suggesting revisions for sentences that were unclear or potentially ambiguous. Additionally, we used the LLM to generate character icons, wh...
work page 2026
-
[51]
B.1 Accurate Retrieval (AR) B.1.1 Definition of AR The task of accurately retrieving information has been extensively explored in prior work. In the domain of long-context modeling, the Needle-in-a-Haystack (NIAH) task is widely used to evaluate a model’s ability to locate the specific value based on a given key within a lengthy input. In the RAG setting,...
work page 2024
-
[52]
and use the dot product of this score with the F1 score as the final evaluation metric. B.4 Selective Forgetting (SF) B.4.1 Definition of SF In long-term interactions, agents often face evolving or conflicting information—whether about the external world (e.g., changes in political leadership) or user-specific facts (e.g., a 19 Published as a conference p...
work page 2026
-
[53]
and knowledge unlearning (Wang et al., 2024e), which focus on modifying or removing factual knowledge from language models. We define Selective Forgetting (SF) as the agent’s ability to detect and resolve contradictions between out of date knowledge and newly acquired information, ensuring the agent remains aligned with current realities and user states. ...
[54]
introduces a dual-system pipeline with a light "global-memory" model to guide retrieval and a stronger model for final answers. HippoRAG-v2 (Gutiérrez et al.,
[55]
is a temporal knowledge-graph memory platform for agents, designed to assemble and retrieve long-term conversational and business context. C.3 Agentic Memory Agents: For Agentic Memory Agents, we evaluate Self-RAG (Asai et al., 2023), MemGPT (Packer et al., 2023), and MIRIX (Wang & Chen,
[56]
Self-RAG uses an LLM as the agent to decide when/what to retrieve and to critique its own outputs
on our benchmark. Self-RAG uses an LLM as the agent to decide when/what to retrieve and to critique its own outputs. MemGPT performs hierarchical memory management, paging relevant snippets between short-term and long-term stores and using event-driven interrupts to maintain coherence and evolvability over extended interactions. MIRIX adopts a multi-ag...
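The short-term/long-term paging that MemGPT is described as performing can be illustrated with a toy two-tier store. This is an LRU-style simplification of our own, not MemGPT's actual implementation (which is prompt- and interrupt-driven rather than a Python cache):

```python
# Toy two-tier memory: a bounded "context" (short-term) store evicts the
# least recently used snippet to an unbounded archive (long-term), and
# lookups page archived snippets back into context.
from collections import OrderedDict
from typing import Optional

class TwoTierMemory:
    def __init__(self, context_size: int):
        self.context: "OrderedDict[str, str]" = OrderedDict()  # short-term
        self.archive: dict = {}                                # long-term
        self.context_size = context_size

    def write(self, key: str, snippet: str) -> None:
        self.context[key] = snippet
        self.context.move_to_end(key)
        while len(self.context) > self.context_size:
            old_key, old_val = self.context.popitem(last=False)
            self.archive[old_key] = old_val  # page out, don't lose it

    def read(self, key: str) -> Optional[str]:
        if key in self.context:
            self.context.move_to_end(key)  # mark as recently used
            return self.context[key]
        if key in self.archive:            # page back into context
            self.write(key, self.archive.pop(key))
            return self.context[key]
        return None

mem = TwoTierMemory(context_size=2)
mem.write("a", "first fact")
mem.write("b", "second fact")
mem.write("c", "third fact")   # evicts "a" to the archive
paged_in = mem.read("a")       # pages "a" back, evicting "b"
```

The interesting design question MemGPT answers, and this sketch does not, is *when* to page: event-driven interrupts rather than a fixed eviction rule.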
[57]
For all three types of tasks, RAG-based agents generally underperform compared to their respective GPT-4o-mini backbones. This observation highlights limitations inherent to the RAG approach. For instance, in TTL tasks, RAG-based methods often struggle to accurately retrieve context from memory that is closely associated with the input. ...
[58]
However, as the context length increases, the performance of these agents declines accordingly
Long-Context Agents generally achieve satisfactory performance on tasks in the AR series at relatively small context lengths (e.g., around 50K tokens). However, as the context length increases, the performance of these agents declines accordingly. In contrast, for the RAG-based agents Mem0 and Cognee, their performance is significantly lower than that o...
[59]
From the table, we find that using a smaller chunk size requires significantly more time for memory construction, especially for methods such as HippoRAG-v2, Mem0, Cognee, and MemGPT. Meanwhile, methods such as Mem0, Cognee, and MIRIX require extremely high resources when constructing the memory. E.6 GPU Memory Usage Comparison: In main experiments, we mostly ...
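Why smaller chunks inflate construction time can be seen with simple arithmetic: per-chunk indexing work (embedding calls, graph updates) scales with the number of chunks, and halving the chunk size roughly doubles that count. A sketch with illustrative numbers (the corpus size and chunk sizes are assumptions, not the paper's settings):

```python
# Back-of-envelope: number of fixed-size chunks for a corpus, with optional
# token overlap between consecutive chunks. Indexing cost is roughly
# proportional to this count.
def num_chunks(total_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Count of chunks covering total_tokens with the given size and overlap."""
    stride = chunk_size - overlap
    assert stride > 0, "overlap must be smaller than chunk_size"
    return max(1, -(-(total_tokens - overlap) // stride))  # ceiling division

corpus = 100_000  # tokens (illustrative)
small = num_chunks(corpus, chunk_size=256)   # many chunks -> many index calls
large = num_chunks(corpus, chunk_size=1024)  # ~4x fewer chunks
```

For graph-building methods like HippoRAG-v2 or Mem0, each extra chunk can also trigger extra LLM extraction calls, which is consistent with the cost blow-up the table reports.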
[60]
For the HippoRAG-v2 method, we follow the same experimental setting as in Gutiérrez et al.
F.2 Settings of the RAG Agents: For the embedding model selection in Structure-Augmented RAG Agents and Agentic Memory Agents, most approaches utilize OpenAI's embedding models, such as text-embedding-3-small. For the HippoRAG-v2 method, we follow the same experimental setting as in Gutiérrez et al. (2025), employing the NV-Embed-v2 model. We implemen...
[61]
G Task Rationale and Justification for Selective Forgetting Task: While the Selective Forgetting task may appear specialized or even synthetic at first glance, it is designed to address a fundamental, universal challenge in long-term memory systems: maintaining context efficiency and mitigating interference b...
[62]
Please memorize the following information for future questions
Estimated cost (USD) and performance per task:

Model/Architecture           | MH-Doc QA       | MCC             | Detective QA    | FC-SH
                             | Cost     Perf   | Cost     Perf   | Cost     Perf   | Cost     Perf
GPT-4o-mini                  | $0.01    43.0   | $0.008   82.0   | $0.01    63.4   | $0.01    45.0
GPT-4.1-mini                 | $0.043   66.0   | $0.011   75.6   | $0.013   56.3   | $0.027   36.0
RAG Agents (BM25 + 4o-mini)  | <$0.001  56.0   | $0.006   75.4   | $0.006   52.1   | <$0.001  48.0
...
[63]
Model/Setting            | FC-SH | FC-MH | Avg.
GPT-4.1-mini (Baseline)  | 36.0  | 5.0   | 20.5
GPT-4.1-mini (Policy A)  | 40.0  | 4.0   | 22.0 (+1.5)
GPT-4.1-mini (Policy B)  | 28.0  | 4.0   | 16.0

Table 19: Overwrite policy ablation results on Selective Forgetting tasks (Accuracy %). K.2.2 Key Insights: 1. Limited Generalization with Aggressive Updates: While Policy A slightly improves performance on ...