pith. machine review for the scientific record.

arxiv: 2507.05257 · v3 · submitted 2025-07-07 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · memory evaluation · multi-turn interactions · benchmark · retrieval · forgetting · cognitive competencies · long-context

The pith

A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that effective memory in LLM agents requires four abilities: accurate retrieval of stored information, learning at test time, grasping long-range connections, and selectively forgetting outdated or irrelevant details. Existing tests either cap context length or use static setups that ignore how agents build knowledge turn by turn in real interactions. The authors address this by creating MemoryAgentBench, which converts long-context datasets into incremental multi-turn formats while adding targeted new tasks to cover every competency. Testing simple context methods, RAG systems, and advanced agents with external memory shows that none masters all four at once. This gap points to the need for memory designs that handle accumulation and change more comprehensively.
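To make the reformatting concrete, here is a minimal sketch of how a static long-context QA item could be streamed as incremental turns. The chunk size, message wording, and field names are illustrative assumptions, not the paper's actual pipeline.

    # Minimal sketch: stream a static long-context QA example as
    # incremental multi-turn input. Chunk size and message wording
    # are assumptions for illustration, not MemoryAgentBench's code.
    def to_multi_turn(document: str, question: str, chunk_words: int = 512):
        words = document.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        turns = [{"role": "user", "content": f"Please memorize this:\n{c}"}
                 for c in chunks]
        # The question arrives only after all evidence has streamed by,
        # so the agent must have retained the relevant facts.
        turns.append({"role": "user", "content": question})
        return turns

The key property is that the agent never sees the whole document at once; memory quality, not context-window size, determines whether the final question is answerable.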

Core claim

MemoryAgentBench reformats existing long-context datasets into incremental multi-turn interactions and adds newly constructed tasks, producing the first benchmark to cover all four memory competencies. Evaluated against it, current agent architectures, from basic context use to tool-integrated external memory, consistently fail to perform well across accurate retrieval, test-time learning, long-range understanding, and selective forgetting.

What carries the argument

MemoryAgentBench, the benchmark that turns static long-context datasets into incremental multi-turn interactions to test the four memory competencies together.

If this is right

  • Agents require integrated designs that handle retrieval, updates, long connections, and forgetting at the same time rather than in isolation.
  • Benchmarks for agents should shift from single-turn long-context tests to incremental multi-turn evaluations.
  • Future memory modules need explicit mechanisms for test-time learning and selective forgetting to close the observed gaps.
  • Evaluation across diverse agent types highlights that external memory tools alone do not solve the full set of competencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agents built with better memory handling could sustain coherent performance over much longer interaction histories without repeated errors.
  • The benchmark setup could be adapted to test memory demands in multi-agent collaboration or tool-using environments.
  • Links to human memory research suggest combining neural retrieval with explicit forgetting rules might address the shortfalls more directly than scaling context alone.

Load-bearing premise

The four competencies drawn from memory science form the complete essential set for agents, and converting static datasets to multi-turn format keeps the original measurement properties intact.

What would settle it

An agent architecture that scores strongly on all four competencies within MemoryAgentBench, and that also shows stable performance in open-ended real-world multi-turn conversations, would support the results; benchmark scores that fail to correlate with performance on external memory tasks would challenge them.
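One plausible reading of "scores strongly on all four" is a minimum-over-competencies criterion rather than an average, since a high mean can hide exactly the single-competency failures the paper reports. The threshold and score values below are invented for illustration.

    # Hedged operationalization: an agent "masters" the benchmark only
    # if every per-competency score clears the bar; the 0.8 threshold
    # and the example scores are assumptions, not the paper's numbers.
    COMPETENCIES = ("accurate_retrieval", "test_time_learning",
                    "long_range_understanding", "selective_forgetting")

    def masters_all(scores: dict, threshold: float = 0.8) -> bool:
        return all(scores[c] >= threshold for c in COMPETENCIES)

    # Passes on average (0.77) but fails the min-based test: the
    # profile shape the paper describes for current agents.
    print(masters_all({"accurate_retrieval": 0.92,
                       "test_time_learning": 0.85,
                       "long_range_understanding": 0.88,
                       "selective_forgetting": 0.41}))  # False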

read the original abstract

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MemoryAgentBench, a benchmark for evaluating memory in LLM agents. Drawing on memory science, it defines four core competencies (accurate retrieval, test-time learning, long-range understanding, and selective forgetting), transforms existing long-context datasets into incremental multi-turn interactions, and evaluates a range of agents from simple context/RAG systems to those with external memory modules. The central empirical claim is that current methods fall short of mastering all four competencies.

Significance. If the dataset transformations preserve the original information structure and retrieval demands without introducing artifacts, the benchmark would fill a notable gap by providing the first systematic coverage of all four competencies in an interactive setting. The evaluation across diverse agent architectures supplies concrete evidence of current limitations and could usefully direct future work on memory mechanisms.

major comments (2)
  1. [§3] Benchmark Construction: The transformation of static long-context datasets into incremental multi-turn format is presented without explicit validation steps such as information-theoretic equivalence checks, dependency-chain preservation tests, or controlled ablations on turn structure. The assumption that the transformation is benign is load-bearing for the claim that scores reflect the intended four competencies rather than new artifacts (e.g., artificial recency biases).
  2. [§4] Experiments and Results: The reported performance shortfalls across agents lack accompanying statistical controls, confidence intervals, or ablation studies that isolate memory-specific effects from confounding factors such as varying context lengths or prompt formatting. Without these, the conclusion that agents 'fall short of mastering all four competencies' rests on descriptive comparisons whose robustness is unclear.
minor comments (2)
  1. [Abstract / §2] The abstract and §2 could more precisely state the selection criteria used when curating and transforming the source datasets.
  2. [Figures / Tables] Figure and table captions would benefit from explicit mention of the exact metrics (e.g., accuracy, F1) and number of runs underlying each reported score.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the benchmark construction and experimental analysis would benefit from additional validation and statistical controls, and we plan to incorporate these elements in the revised manuscript.

read point-by-point responses
  1. Referee: [§3] Benchmark Construction: The transformation of static long-context datasets into incremental multi-turn format is presented without explicit validation steps such as information-theoretic equivalence checks, dependency-chain preservation tests, or controlled ablations on turn structure. The assumption that the transformation is benign is load-bearing for the claim that scores reflect the intended four competencies rather than new artifacts (e.g., artificial recency biases).

    Authors: We acknowledge that the current version describes the dataset transformations and curation process but does not report the explicit validation steps suggested. In the revision we will add information-theoretic equivalence checks between original and transformed versions, dependency-chain preservation tests, and controlled ablations varying turn structure. These additions will demonstrate that the multi-turn format preserves the original retrieval demands and does not introduce artifacts such as artificial recency biases (a minimal sketch of one such check follows below). revision: yes

  2. Referee: [§4] Experiments and Results: The reported performance shortfalls across agents lack accompanying statistical controls, confidence intervals, or ablation studies that isolate memory-specific effects from confounding factors such as varying context lengths or prompt formatting. Without these, the conclusion that agents 'fall short of mastering all four competencies' rests on descriptive comparisons whose robustness is unclear.

    Authors: We agree that the experimental section would be strengthened by statistical controls. In the revision we will report confidence intervals (computed over multiple random seeds where applicable; see the bootstrap sketch below), include ablation studies that isolate memory-module effects while holding context length and prompt format fixed, and add controls for the identified confounding factors. These changes will provide a clearer basis for the claim that current agents fall short on all four competencies. revision: yes
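On the first point, the simplest version of the promised equivalence check, assuming the transformation is pure chunking, is to verify that the streamed turns concatenate back to the source document, so no evidence is dropped or reordered. The function below is a hypothetical illustration; it would not catch subtler artifacts such as recency bias, which need the promised ablations.

    # Hedged sketch of a content-preservation check for a chunking
    # transformation: the turn contents, concatenated in order, must
    # reproduce the original document after whitespace normalization.
    def chunks_preserve_document(document: str, turn_contents: list) -> bool:
        rebuilt = " ".join(" ".join(c.split()) for c in turn_contents)
        return rebuilt == " ".join(document.split())

    doc = "Alice hid the key under the mat before leaving for the station."
    assert chunks_preserve_document(doc, [
        "Alice hid the key under the mat",
        "before leaving for the station.",
    ])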
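On the second point, a percentile bootstrap over per-seed accuracies is one standard way to produce the promised confidence intervals; the seed scores here are invented for illustration.

    # Hedged sketch: 95% percentile-bootstrap CI over per-seed scores.
    # The scores are invented, not results from the paper.
    import random

    def bootstrap_ci(per_seed, n_resamples=10_000, alpha=0.05):
        means = sorted(
            sum(random.choices(per_seed, k=len(per_seed))) / len(per_seed)
            for _ in range(n_resamples)
        )
        return (means[int(alpha / 2 * n_resamples)],
                means[int((1 - alpha / 2) * n_resamples) - 1])

    print(bootstrap_ci([0.41, 0.44, 0.39, 0.43, 0.40]))  # e.g. (0.40, 0.43)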

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction from external datasets and cognitive framing

full rationale

The paper presents an empirical benchmark (MemoryAgentBench) built by curating and transforming static long-context datasets into incremental multi-turn interactions that cover four competencies drawn from the memory science literature. No equations, fitted parameters, or predictions are defined; the central claims rest on evaluation results across agent types rather than any self-referential derivation. Dataset transformation and competency selection are presented as design choices justified by external cognitive science, not by internal reduction or self-citation chains. The work is validated against external benchmarks and does not rename known results or smuggle in ansatzes through load-bearing citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the four listed competencies are the essential ones for memory agents and that multi-turn reformulations of existing datasets faithfully test incremental memory processing.

axioms (1)
  • domain assumption: Four core competencies (accurate retrieval, test-time learning, long-range understanding, selective forgetting) are essential for memory agents, based on classic theories from memory science and cognitive science.
    Explicitly stated in the abstract as the foundation for the benchmark design.

pith-pipeline@v0.9.0 · 5567 in / 1341 out tokens · 33258 ms · 2026-05-16T21:17:36.128863+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 conditional novelty 8.0

    GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

  2. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  3. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  4. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  5. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  6. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  7. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  8. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  9. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  10. When to Forget: A Memory Governance Primitive

    cs.AI 2026-04 unverdicted novelty 7.0

    Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.

  11. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, introducing the baseline ExpRAG and the proposed ReMem method, which integrates reasoning, actions, and memory updates for continual improvement.

  12. $\delta$-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  13. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  14. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  15. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  16. FileGram: Grounding Agent Personalization in File-System Behavioral Traces

    cs.CV 2026-04 unverdicted novelty 6.0

    FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.

  17. Opal: Private Memory for Personal AI

    cs.CR 2026-04 unverdicted novelty 6.0

    Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.

  18. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
