MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Elias Stengel-Eskin; Hyunji Lee; Joykirat Singh; Justin Chih-Yao Chen; Mohit Bansal; Zaid Khan

arxiv: 2605.18565 · v2 · pith:2LGE4ZKGnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

Pith reviewed 2026-05-20 10:41 UTC · model grok-4.3

keywords: memory-augmented agents, long-horizon evaluation, multi-target interference, fact updating, retrieval limits, benchmark construction, aggregated reasoning

The pith

Current memory-augmented agents achieve only 27.9 percent accuracy on average when handling updated facts that interfere across long contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MINTEval to test how agents recall and reason over information in long, evolving settings where new details can overwrite or interfere with earlier ones. It builds long interconnected contexts averaging 138.8k tokens across four domains, with up to 1.8M tokens in some cases, and creates 15.6k questions split between single-fact retrieval and multi-fact aggregation. Evaluation of seven systems, from plain long-context models to dedicated memory frameworks, reveals consistently low performance that worsens when facts are revised by later information. The central finding is that retrieval and memory construction are the primary bottlenecks rather than reasoning itself. A reader would care because real agents must operate over days or weeks of changing data without losing track of prior states.

Core claim

MINTEval consists of long, highly interconnected contexts with frequent updates that induce multi-target interference, spanning state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. The benchmark includes single-target recall and multi-target aggregation questions. Across seven evaluated systems, average accuracy reaches only 27.9 percent, with particular weakness on aggregated reasoning; performance is limited by retrieval and memory construction, and accuracy degrades as the number of intervening updates grows.

What carries the argument

MINTEval benchmark that constructs long-horizon contexts with repeated updates to induce measurable interference between target facts and evaluates both recall and aggregated reasoning.
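To make the terms concrete, here is a minimal sketch of an update stream with single-target recall, multi-target aggregation, and a count of intervening updates. The `Update` record, the key format, and the function names are illustrative assumptions, not the paper's data format or construction procedure.

```python
from typing import NamedTuple, List, Optional


class Update(NamedTuple):
    step: int    # position in the temporally ordered stream
    key: str     # hypothetical fact identifier, e.g. "mary.location"
    value: str   # value written at this step


def current_value(stream: List[Update], key: str) -> Optional[str]:
    """Single-target recall: the most recent value written for `key`."""
    latest = None
    for u in stream:
        if u.key == key:
            latest = u.value
    return latest


def value_history(stream: List[Update], key: str) -> List[str]:
    """Multi-target aggregation: every value `key` has held, in order."""
    return [u.value for u in stream if u.key == key]


def intervening_updates(stream: List[Update], key: str, target_step: int) -> int:
    """Count later writes to the same key after the target fact was stated."""
    return sum(1 for u in stream if u.key == key and u.step > target_step)


stream = [
    Update(0, "mary.location", "kitchen"),
    Update(1, "john.location", "garden"),
    Update(2, "mary.location", "hallway"),
    Update(3, "mary.location", "office"),
]

print(current_value(stream, "mary.location"))           # office
print(value_history(stream, "mary.location"))           # ['kitchen', 'hallway', 'office']
print(intervening_updates(stream, "mary.location", 0))  # 2
```

In this toy setting, a question about Mary's first location has two intervening updates to the same fact, which is the kind of condition under which the paper reports accuracy degrading.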

If this is right

Retrieval components must be redesigned to handle fact revisions without losing earlier evidence.
Memory construction processes need mechanisms that preserve access to older facts despite later changes.
Aggregated reasoning over multiple interfered pieces remains a distinct failure mode separate from simple recall.
Performance gaps appear consistently across domains, indicating limited generalization in current memory approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved handling of interference could enable agents to maintain coherent state over multi-day tasks such as ongoing software projects.
The observed degradation pattern points toward possible benefits from explicit update tracking or versioned memory stores.
Benchmark results may inform hybrid systems that combine long-context processing with selective memory refresh.
Similar interference issues likely appear in other continual-learning settings where knowledge arrives incrementally.
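The versioned-memory idea in the list above can be sketched as a small append-only store. This is an illustrative toy under the assumption that writes arrive in step order; the class and method names are hypothetical and it is not a system evaluated in the paper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class VersionedMemory:
    """Append-only store that keeps every revision of a fact."""

    def __init__(self) -> None:
        # key -> list of (step, value), assumed to be appended in step order
        self._versions: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def write(self, key: str, value: str, step: int) -> None:
        """Record a new revision instead of overwriting the old one."""
        self._versions[key].append((step, value))

    def latest(self, key: str) -> Optional[str]:
        """Current value of the fact, or None if never written."""
        versions = self._versions.get(key)
        return versions[-1][1] if versions else None

    def as_of(self, key: str, step: int) -> Optional[str]:
        """Value that was current at `step`, or None if not yet written."""
        result = None
        for s, v in self._versions.get(key, []):
            if s > step:
                break
            result = v
        return result

    def history(self, key: str) -> List[str]:
        """All values the fact has held, oldest first."""
        return [v for _, v in self._versions.get(key, [])]


mem = VersionedMemory()
mem.write("config.timeout", "30s", step=1)
mem.write("config.timeout", "60s", step=5)
mem.write("config.timeout", "45s", step=9)

print(mem.latest("config.timeout"))    # 45s
print(mem.as_of("config.timeout", 6))  # 60s
print(mem.history("config.timeout"))   # ['30s', '60s', '45s']
```

Keeping old revisions addressable is one way to answer both "what is the value now" and "what was it before the last change" without one answer erasing the other.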

Load-bearing premise

The constructed contexts and question types produce interference patterns that match those encountered by real agents rather than arising mainly from the benchmark design itself.

What would settle it

A memory system whose accuracy on multi-target questions stays above 70 percent even when the number of intervening updates reaches the maximum tested levels in the benchmark.

Figures

Figures reproduced from arXiv: 2605.18565 by Elias Stengel-Eskin, Hyunji Lee, Joykirat Singh, Justin Chih-Yao Chen, Mohit Bansal, Zaid Khan.

**Figure 1.** Left: MINTEVAL spans four realistic domains: state tracking, dialogue, GitHub commits, and Wikipedia revisions, with five question categories probing different aspects of memory behavior. Middle: The contexts are inherently dynamic and continuously evolving, naturally creating frequent destructive interference. Right: Existing memory systems show distinct failure modes: (1) full-context methods are computa…

**Figure 2.** Error due to missing evidence in memory (green) or incorrect answers despite the evidence being present (green–blue gap). Only 58.3% of cases contain the required evidence, making retrieval/memory construction the main bottleneck; answering errors add a 25.2% drop. A perfect system would reach 100%. [Adjacent plot: accuracy (0.0 to 0.8) vs. lookback distance (10 to 100), by method: Full …]

**Figure 5.** Performance vs. different chunk sizes when processing memories for the MemAgent model (CS = Chunk Size). Increasing CS generally improves performance, and Simple questions are the least sensitive to CS, since they only require recalling recent information. [Adjacent body text: … a substantial performance improvement (55.7%). In contrast, this gap becomes much smaller when retrieval or memory systems are introduced (avg. 1.7%), …]

**Figure 6.** MemAgent performance on Wiki Revisions and Git Commits across different answering …

**Figure 7.** Comparison of performance across different answering agents (Qwen3.6-35B-A3B and …

**Figure 8.** Performance on History questions in bAbI as a function of lookback distance (x-axis), comparing RAG and Full Context methods with and without temporal cues (History vs. +Date/Time). Adding timestamps as explicit markers helps recover the gap caused by interference. [Adjacent body text: C.3 Effect of Adding Temporal Cues to History Questions. To investigate whether the performance degradation with increasing lookback distance in…]

**Figure 9.** Rate of tool usage for AtomMem and Mem-α. Mem-α consistently underutilizes the delete operation across all datasets, which may partially explain why memory systems struggle in long-horizon settings with heavy interference: outdated or conflicting information accumulates over time, leading to progressively greater conflict within memory. [Adjacent plot: accuracy (30 to 80) vs. number of distractors (0, 1, 3, 5); series include Simple (OOD), Simple (…]

**Figure 10.** Performance under varying distractor types and numbers of distractors. ID distractors more strongly affect questions such as Counting and History compared to simpler queries like Simple, suggesting that tasks requiring aggregation or tracking over multiple facts are more susceptible to interference. [Adjacent plot: performance (22 to 32) vs. Top-K (1 to 75) for Qwen3-embedding-4B and Gemini-Embedding-001]

**Figure 12.** RAG performance across question types with varying numbers of retrieval documents

Original abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MINTEval gives a useful benchmark for interference in long agent contexts, but the degradation findings likely mix up interference with plain context length growth.

This paper introduces MINTEval to test memory systems on evolving, interfering information over long horizons. The core result is that seven different systems average only 27.9 percent accuracy, with bigger drops on questions that need pulling together multiple facts, and performance falling as more updates sit between the original fact and the question. The benchmark covers four domains with contexts that average 138.8k tokens and reach 1.8M, using both single-target recall and multi-target aggregation questions.

That setup is new enough to be worth looking at if you work on agent memory. The multi-domain coverage and the split between question types are the parts that actually add something concrete. The authors also run a range of baselines including plain long-context models, RAG, and memory-augmented agents, which gives a reasonable first picture of where the problems sit. Retrieval and memory construction come out as the main limits, which matches what many people already suspect but is now shown on this particular interference-heavy data.

The soft spot is the claim that accuracy degrades specifically because of revised or interfered facts. Contexts grow with each update, so the observed drop could simply reflect established long-context retrieval failures rather than interference per se. The paper does not appear to include controls that hold total length or fact count fixed while varying only the degree of cross-target revision. Without those, it is hard to separate the two effects. Details on how the contexts and questions were built are also light.

This is the sort of benchmark paper that researchers building long-horizon agents or testing memory modules will want to read. It is solid enough to deserve a serious referee who can check the construction and ask for tighter length controls. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces MINTEval, a benchmark for assessing memory-augmented agents in long-horizon settings with multi-target interference. It constructs long, evolving contexts (average 138.8k tokens, up to 1.8M) across domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits, yielding 15.6k QA pairs. Questions target single-fact recall and multi-target aggregation over updated information. Evaluation of seven systems (vanilla LLMs, RAG, memory frameworks) reports 27.9% average accuracy, with performance limited by retrieval and memory construction, and further degradation as the number of intervening updates grows.

Significance. If the benchmark successfully isolates multi-target interference effects, the work would usefully document concrete limitations of current memory mechanisms on realistic, revision-heavy tasks and could guide targeted improvements in retrieval and update handling for agent systems.

major comments (2)

[Abstract and results/analysis sections] Abstract and analysis of degradation with intervening updates: the reported accuracy drop as the number of intervening updates increases is presented as evidence that systems struggle specifically with revised or interfered facts under multi-target interference. However, contexts naturally lengthen with additional updates (averaging 138.8k tokens and reaching 1.8M), and no explicit controls are described that hold total context length or total fact count fixed while varying only the degree of cross-target revision. Without such controls, the degradation is consistent with known long-context retrieval failures rather than isolating the claimed interference mechanism.
[Benchmark construction and evaluation sections] Benchmark construction and evaluation setup: the central claims that performance is 'primarily limited by retrieval and memory construction' and that the benchmark induces realistic multi-target interference rest on the construction of contexts and questions. The manuscript provides no details on statistical methods, error bars, controls for confounds such as context length alone, or verification that question design measures interference rather than generic long-context difficulty.

minor comments (2)

[Evaluation] Clarify the exact configurations and prompting strategies used for the seven evaluated systems to allow reproduction.
[Domains and question types] Add explicit discussion of how domain generalization is measured across the four chosen domains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [Abstract and results/analysis sections] Abstract and analysis of degradation with intervening updates: the reported accuracy drop as the number of intervening updates increases is presented as evidence that systems struggle specifically with revised or interfered facts under multi-target interference. However, contexts naturally lengthen with additional updates (averaging 138.8k tokens and reaching 1.8M), and no explicit controls are described that hold total context length or total fact count fixed while varying only the degree of cross-target revision. Without such controls, the degradation is consistent with known long-context retrieval failures rather than isolating the claimed interference mechanism.

Authors: We agree that context length is a potential confound and that the current analysis does not fully isolate interference from length effects. In the revised manuscript we will add a controlled analysis that holds total context length approximately fixed (via subsampling of later updates) while varying the number of intervening updates, and we will report the resulting accuracy trends. This addition will strengthen the claim that the observed degradation reflects multi-target interference rather than length alone.

revision: yes
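One way to tabulate a length-matched comparison of the kind described in this response is to bucket results by context length and compare accuracy across update counts only within a bucket. The sketch below assumes per-question records with hypothetical field names (`context_tokens`, `num_updates`, `correct`) and an arbitrary bin width; it is not the authors' analysis code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_updates_within_length_bins(
    records: Iterable[dict], bin_width: int = 50_000
) -> Dict[Tuple[int, int], float]:
    """Return accuracy per (length_bin, num_updates) cell.

    Comparing cells that share a length_bin varies the number of
    intervening updates while holding context length roughly fixed.
    """
    # (length_bin, num_updates) -> [n_correct, n_total]
    cells = defaultdict(lambda: [0, 0])
    for r in records:
        length_bin = r["context_tokens"] // bin_width
        cell = cells[(length_bin, r["num_updates"])]
        cell[0] += int(r["correct"])
        cell[1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in cells.items()}


# Toy records, invented purely for illustration.
records = [
    {"context_tokens": 120_000, "num_updates": 1, "correct": True},
    {"context_tokens": 130_000, "num_updates": 1, "correct": True},
    {"context_tokens": 125_000, "num_updates": 5, "correct": True},
    {"context_tokens": 140_000, "num_updates": 5, "correct": False},
    {"context_tokens": 210_000, "num_updates": 5, "correct": False},
]

table = accuracy_by_updates_within_length_bins(records)
for (length_bin, num_updates), acc in sorted(table.items()):
    print(f"length bin {length_bin}, updates {num_updates}: accuracy {acc:.2f}")
```

If accuracy still falls with more updates inside a single length bin, length alone is a less likely explanation; if the trend disappears, the referee's concern stands.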
Referee: [Benchmark construction and evaluation sections] Benchmark construction and evaluation setup: the central claims that performance is 'primarily limited by retrieval and memory construction' and that the benchmark induces realistic multi-target interference rest on the construction of contexts and questions. The manuscript provides no details on statistical methods, error bars, controls for confounds such as context length alone, or verification that question design measures interference rather than generic long-context difficulty.

Authors: We accept that the manuscript would benefit from explicit statistical reporting and additional controls. In the revision we will (1) add error bars computed via bootstrapping over the 15.6k QA pairs, (2) describe the statistical methods used for all reported averages, and (3) include a new control experiment that compares performance on interfered versus non-interfered long contexts of matched length. We will also expand the benchmark-construction section to detail how question templates were designed to require distinguishing updated facts from distractors introduced by other targets.

revision: yes
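The bootstrapped error bars mentioned in this response could be computed with a percentile bootstrap over per-question correctness. The sketch below is a generic version using only the standard library; the function name, the number of resamples, and the toy data are assumptions, not the authors' procedure.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    correct: List[int], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean accuracy.

    `correct` holds one 0/1 entry per QA pair. Returns
    (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample QA pairs with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy data: 28 correct answers out of 100, invented for illustration.
outcomes = [1] * 28 + [0] * 72
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Resampling at the QA-pair level treats questions as independent. Since many questions share a context in this benchmark, resampling whole contexts instead may give more honest intervals.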

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with direct measurements

This is an empirical benchmark paper that constructs MINTEval contexts and questions across domains, runs 7 systems on 15.6k QA pairs, and reports observed accuracies (avg. 27.9%) plus degradation trends with intervening updates. No equations, fitted parameters, predictions, or derivations appear in the provided text. Results are direct measurements from system evaluations rather than quantities that reduce to self-defined inputs or self-citation chains. The central claims rest on experimental observations, which are self-contained against external benchmarks and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark design implicitly assumes that the generated contexts with updates create representative interference; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

axioms (1)

domain assumption: Selected domains and update patterns induce substantial multi-target interference representative of real-world long-horizon agent scenarios.
Invoked when describing the benchmark as capturing dynamic interactions between evolving memories.

pith-pipeline@v0.9.0 · 5867 in / 1286 out tokens · 68403 ms · 2026-05-20T10:41:11.995043+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: performance is primarily limited by retrieval and memory construction... accuracy degrading as the number of intervening updates increases

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: MINTEVAL... long-horizon contexts averaging 138.8k tokens... 86 temporally ordered updates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

2024 , eprint=

MemGPT: Towards LLMs as Operating Systems , author=. 2024 , eprint=

work page 2024

[2] [2]

2026 , eprint=

SimpleMem: Efficient Lifelong Memory for LLM Agents , author=. 2026 , eprint=

work page 2026

[3] [3]

2026 , eprint=

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs , author=. 2026 , eprint=

work page 2026

[4] [4]

2026 , eprint=

MultiSessionCollab: Learning User Preferences with Memory to Improve Long-Term Collaboration , author=. 2026 , eprint=

work page 2026

[5] [5]

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support , url=

Chen, Tiantian and Lu, Jiaqi and Shen, Ying and Zhang, Lin , year=. ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support , url=. doi:10.1145/3774904.3792143 , booktitle=

work page doi:10.1145/3774904.3792143

[6] [6]

Memory , publisher =

Chapter 8 - Interference and Inhibition in Memory Retrieval , editor =. Memory , publisher =. 1996 , isbn =. doi:https://doi.org/10.1016/B978-012102570-0/50010-0 , url =

work page doi:10.1016/b978-012102570-0/50010-0 1996

[7] [7]

2025 , eprint=

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models , author=. 2025 , eprint=

work page 2025

[8] [8]

2025 , eprint=

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models , author=. 2025 , eprint=

work page 2025

[9] [9]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[10] [10]

2025 , eprint=

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , author=. 2025 , eprint=

work page 2025

[11] [11]

2025 , eprint=

A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

work page 2025

[12] [12]

2025 , eprint=

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns , author=. 2025 , eprint=

work page 2025

[13] [13]

2026 , eprint=

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams , author=. 2026 , eprint=

work page 2026

[14] [14]

2026 , eprint=

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning , author=. 2026 , eprint=

work page 2026

[15] [15]

2025 , eprint=

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents , author=. 2025 , eprint=

work page 2025

[16] [16]

2026 , eprint=

AtomMem : Learnable Dynamic Agentic Memory with Atomic Memory Operation , author=. 2026 , eprint=

work page 2026

[17] [17]

2025 , eprint=

Mem-alpha: Learning Memory Construction via Reinforcement Learning , author=. 2025 , eprint=

work page 2025

[18] [18]

2026 , eprint=

REMem: Reasoning with Episodic Memory in Language Agent , author=. 2026 , eprint=

work page 2026

[19] [19]

2026 , eprint=

SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation , author=. 2026 , eprint=

work page 2026

[20] [20]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Evaluating very long-term conversational memory of llm agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[21] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. 2025.

[22] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. 2026.

[23] RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction. 2026.

[24] PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory. 2025.

[25] KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions. 2026.

[26] PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering. 2024.

[27] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments. 2026.

[28] How Well Do Large Language Models Truly Ground? 2024.

[29] PersonalLLM: Tailoring LLMs to Individual Preferences. arXiv preprint arXiv:2409.20296.

[30] Benton J. Underwood. Interference and Forgetting. Psychological Review.

[31] Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. 2024.

[32] Retrieval-Augmented Generation with Conflicting Evidence. 2025.

[33] CORG: Generating Answers from Complex, Interrelated Contexts. 2025.

[34] Sparse, Dense, and Attentional Representations for Text Retrieval. 2021.

[35] Generative Multi-hop Retrieval. 2022.

[36] Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey. 2026.

[37] Lifelong Learning of Large Language Model based Agents: A Roadmap. 2026.

[38] LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners. arXiv preprint arXiv:2505.11942.

[39] MemVerse: Multimodal Memory for Lifelong Learning Agents. 2025.

[40] Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1575

[41] Towards Lifelong Dialogue Agents via Timeline-based Memory Management. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[42] Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698.

[43] HorizonBench: Long-Horizon Personalization with Evolving Preferences. 2026.

[44] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.

[45] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.

[46] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[47] Gemini 3.1 Pro: A smarter model for your most complex tasks. 2026.

[48] Gemini-Embedding-001. 2025.

[49] Gemini 3.1 Flash-Lite Preview: Model Documentation. 2026.

[50] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. 2024.

[51] Transformers Remember First, Forget Last: Dual-Process Interference in LLMs. arXiv preprint arXiv:2603.00270.

[52] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.