Pith · machine review for the scientific record

arXiv:2507.07957 · v1 · submitted 2025-07-10 · 💻 cs.CL · cs.AI

Recognition: 1 Lean theorem link

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:53 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: MIRIX · multi-agent memory · LLM agents · multimodal memory · long-term memory · ScreenshotVQA · LOCOMO · memory system

The pith

MIRIX uses six specialized memory types coordinated by multiple agents to enable LLM-based agents to accurately recall long-term multimodal user data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI agents struggle with flat memory that limits personalization and reliable recall of user information over time. MIRIX introduces a modular system built around six memory types (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) managed by a multi-agent framework that handles dynamic updates and retrieval. This structure supports rich visual and textual experiences, such as long sequences of computer screenshots, making memory practical for ongoing agent use. On the ScreenshotVQA benchmark, whose sequences each comprise nearly 20,000 high-resolution screenshots, the system delivers 35% higher accuracy than a RAG baseline while cutting storage by 99.9%. It also reaches 85.4% on the LOCOMO long-conversation benchmark, establishing new performance levels for memory-augmented agents.

Core claim

MIRIX consists of six distinct, carefully structured memory types—Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault—coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale, as shown by 35% higher accuracy than the RAG baseline on ScreenshotVQA with 99.9% reduced storage and state-of-the-art 85.4% performance on LOCOMO.

What carries the argument

A multi-agent framework that dynamically controls updates and retrieval across six memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault.
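To make that machinery concrete, below is a minimal sketch of a typed-memory store with an agent-style router. The six memory-type names come from the paper; everything else (the class names, the routing placeholder, substring retrieval) is an illustrative assumption, not MIRIX's published implementation.

    # Illustrative sketch, not the paper's code: the six type names are
    # MIRIX's; the storage backend and routing heuristic are assumptions.
    from dataclasses import dataclass, field
    from enum import Enum, auto

    class MemoryType(Enum):
        CORE = auto()             # stable user-profile facts
        EPISODIC = auto()         # time-stamped events and interactions
        SEMANTIC = auto()         # distilled knowledge about the user/world
        PROCEDURAL = auto()       # how-to knowledge and workflows
        RESOURCE = auto()         # documents, screenshots, files
        KNOWLEDGE_VAULT = auto()  # sensitive verbatim records

    @dataclass
    class MemoryEntry:
        content: str
        timestamp: float
        mem_type: MemoryType

    @dataclass
    class MemoryRouter:
        """Fans writes out to typed stores and merges retrieval results."""
        stores: dict = field(default_factory=lambda: {t: [] for t in MemoryType})

        def write(self, entry: MemoryEntry) -> None:
            # In MIRIX a manager agent decides the target type(s); this
            # trivial routing is a stand-in for that LLM-driven decision.
            self.stores[entry.mem_type].append(entry)

        def retrieve(self, query: str, top_k: int = 5) -> list:
            # Real scoring would use embeddings; substring match keeps
            # the sketch self-contained.
            hits = [e for store in self.stores.values() for e in store
                    if query.lower() in e.content.lower()]
            return sorted(hits, key=lambda e: e.timestamp, reverse=True)[:top_k]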

If this is right

  • Agents maintain accurate recall across sequences of nearly 20,000 high-resolution screenshots.
  • Storage needs for memory drop by 99.9% relative to standard retrieval-augmented methods.
  • State-of-the-art results are achieved on long-form textual conversation benchmarks.
  • Agents can personalize responses using accumulated visual and textual user histories.
  • Real-time screen monitoring becomes viable for building and querying personalized memory bases locally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Continuous memory accumulation could support agents that adapt to daily user patterns without periodic retraining.
  • The modular memory design may extend naturally to additional input types such as audio streams.
  • Local secure storage emphasis opens pathways for privacy-focused deployment on personal devices.
  • Coordination patterns used here might apply to other multi-agent tasks beyond memory handling.

Load-bearing premise

The multi-agent coordination mechanism can reliably manage updates and retrieval across the six memory types without introducing retrieval errors or inconsistencies in long sequences.

What would settle it

A test showing retrieval errors rising or accuracy dropping below the RAG baseline on extended screenshot sequences or additional long-conversation data would falsify the performance claims.
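Sketched below is one way such a test could be operationalized, under assumed interfaces: memory_system, rag_baseline, and dataset.slice are hypothetical stand-ins, and exact-match scoring is a placeholder for whatever metric the benchmark actually uses.

    # Hypothetical falsification harness, not published evaluation code.
    def accuracy(system, episodes, questions) -> float:
        system.reset()
        for ep in episodes:          # ingest the screenshot or text stream
            system.observe(ep)
        correct = sum(system.answer(q.text) == q.gold for q in questions)
        return correct / len(questions)

    def falsification_test(memory_system, rag_baseline, dataset,
                           lengths=(1_000, 5_000, 10_000, 20_000)):
        # The claim fails if the memory system ever drops below the RAG
        # baseline as the ingested sequence grows toward ~20k items.
        for n in lengths:
            episodes, questions = dataset.slice(n)
            mem_acc = accuracy(memory_system, episodes, questions)
            rag_acc = accuracy(rag_baseline, episodes, questions)
            if mem_acc < rag_acc:
                return f"falsified at length {n}: {mem_acc:.3f} < {rag_acc:.3f}"
        return "claim survives at all tested lengths"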

Original abstract

Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MIRIX, a modular multi-agent memory system for LLM-based agents comprising six structured memory types (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) coordinated by a dynamic multi-agent framework for updates and retrieval. It claims to enable scalable, long-term multimodal memory and reports 35% higher accuracy than a RAG baseline with 99.9% storage reduction on the ScreenshotVQA benchmark (nearly 20k high-resolution screenshots) plus state-of-the-art 85.4% performance on the LOCOMO long-form conversation benchmark.

Significance. If the empirical claims hold after proper validation, MIRIX would represent a practical advance in memory-augmented agents by moving beyond flat retrieval to a typed, multimodal, long-horizon memory architecture. The provision of a packaged real-time screen-monitoring application is a positive step toward reproducibility and usability.

major comments (3)
  1. [Abstract / Experiments] The central performance claims (35% accuracy lift and 99.9% storage reduction on ScreenshotVQA; 85.4% on LOCOMO) are stated without reference to tables, figures, statistical significance tests, or error bars, and no ablation isolating the multi-agent coordinator from the six memory types is described. This makes it impossible to determine whether the reported gains require the coordination mechanism or could be obtained from the memory schemas alone.
  2. [Methods] The multi-agent coordination mechanism for dynamic updates and retrieval across the six memory types is presented at a high level, with no description of conflict resolution, consistency invariants, or failure modes in long sequences. The weakest assumption in the design, that coordination reliably avoids retrieval inconsistencies, therefore remains untested.
  3. [Results] No quantitative breakdown is given for how the 99.9% storage reduction is achieved (e.g., per-memory-type compression ratios, the deduplication strategy, or a comparison against a single-memory baseline with identical content). Without these details the storage claim cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract would benefit from a one-sentence definition or example for each of the six memory types to make the architecture immediately intelligible.
  2. [Methods] The paper should include a clear diagram or pseudocode for the multi-agent update/retrieval loop to clarify control flow.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving clarity and rigor. We have revised the manuscript to address each major comment by adding explicit references, new analyses, and expanded methodological details.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central performance claims (35% accuracy lift and 99.9% storage reduction on ScreenshotVQA; 85.4% on LOCOMO) are stated without reference to tables, figures, statistical significance tests, or error bars, and no ablation isolating the multi-agent coordinator from the six memory types is described. This makes it impossible to determine whether the reported gains require the coordination mechanism or could be obtained from the memory schemas alone.

    Authors: We agree that explicit cross-references and supporting analyses strengthen the presentation. In the revised manuscript, the Abstract now cites Table 1 (ScreenshotVQA results) and Table 2 (LOCOMO results). The Experiments section includes statistical significance tests, error bars, and a new ablation study comparing the full MIRIX system (with multi-agent coordinator) against a variant using only the six memory types. This ablation confirms that the coordinator contributes measurably to the observed gains beyond the memory schemas alone. revision: yes

  2. Referee: [Methods] The multi-agent coordination mechanism for dynamic updates and retrieval across the six memory types is presented at a high level, with no description of conflict resolution, consistency invariants, or failure modes in long sequences. The weakest assumption in the design, that coordination reliably avoids retrieval inconsistencies, therefore remains untested.

    Authors: We have expanded the Methods section with a detailed account of the coordination mechanism. It now covers conflict resolution via priority-based merging (factoring recency and memory type), consistency invariants (timestamp ordering and cross-type semantic checks), and a failure-mode analysis for long sequences, with mitigation via periodic reconciliation; a sketch of such a merge rule appears after this list. New experiments in the revised paper demonstrate robustness over extended interactions, directly testing the assumption. revision: yes

  3. Referee: [Results] No quantitative breakdown is given for how the 99.9% storage reduction is achieved (e.g., per-memory-type compression ratios, the deduplication strategy, or a comparison against a single-memory baseline with identical content). Without these details the storage claim cannot be evaluated.

    Authors: We have added a dedicated subsection in Results providing the requested breakdown. It reports per-type compression ratios (e.g., 99.5% for Knowledge Vault via embedding-based semantic compression), the deduplication approach (similarity thresholding with periodic pruning; see the second sketch after this list), and a direct comparison against a single flat-memory baseline holding identical content. These details substantiate the overall 99.9% reduction. revision: yes
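To illustrate the conflict-resolution scheme described in response 2, here is a minimal sketch of priority-based merging that factors recency and memory type, with the timestamp-ordering invariant enforced for same-type conflicts. The priority weights and half-life are hypothetical; the paper does not state them.

    # Hypothetical priority-based merge; weights and half-life are invented
    # for illustration and are not values from the paper.
    TYPE_PRIORITY = {
        "knowledge_vault": 3.0,  # verbatim records outrank derived ones
        "core": 2.5,
        "semantic": 2.0,
        "episodic": 1.5,
        "procedural": 1.5,
        "resource": 1.0,
    }

    def priority(entry: dict, now: float, half_life: float = 86_400.0) -> float:
        # Exponential recency decay with a one-day half-life (assumed).
        recency = 0.5 ** ((now - entry["timestamp"]) / half_life)
        return TYPE_PRIORITY[entry["mem_type"]] * recency

    def merge_conflicting(a: dict, b: dict, now: float) -> dict:
        # Invariant from the rebuttal: within a type, newer entries win
        # (timestamp ordering); across types, priority decides.
        if a["mem_type"] == b["mem_type"]:
            return a if a["timestamp"] >= b["timestamp"] else b
        return a if priority(a, now) >= priority(b, now) else b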
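And to illustrate response 3, a sketch of similarity-threshold deduplication with periodic pruning. Cosine similarity over precomputed embeddings and the 0.95 threshold are assumptions; the paper's actual thresholds and embedding model are not given here.

    # Hypothetical dedup pass; the 0.95 threshold and the assumption of
    # precomputed embeddings (entry["emb"]) are illustrative only.
    import math

    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def prune(entries: list[dict], threshold: float = 0.95) -> list[dict]:
        """Keep an entry only if it is not a near-duplicate of one already
        kept; run periodically so storage tracks distinct content rather
        than raw input volume."""
        kept: list[dict] = []
        for e in entries:
            if all(cosine(e["emb"], k["emb"]) < threshold for k in kept):
                kept.append(e)
        return kept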

Circularity Check

0 steps flagged

No circularity: engineering design with no equations, fitted parameters, or self-referential derivations.

Full rationale

The paper describes MIRIX as a modular system with six explicitly defined memory types (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) plus a multi-agent coordinator. Performance claims (35% accuracy lift on ScreenshotVQA, 85.4% on LOCOMO) are presented as outcomes of empirical evaluation against baselines, not as predictions derived from equations or parameters that reduce to the inputs by construction. No mathematical derivations, ansatzes, uniqueness theorems, or self-citations appear as load-bearing steps in the provided text. The architecture is justified by functional requirements for multimodal long-term memory rather than by any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that structured multi-agent memory coordination improves retrieval accuracy and efficiency over flat RAG without introducing new failure modes; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: LLM agents benefit from explicit separation of memory into Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault types.
    Invoked implicitly when claiming the six-type design enables better personalization and recall.

pith-pipeline@v0.9.0 · 5608 in / 1220 out tokens · 33292 ms · 2026-05-15T05:53:20.135092+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Linked passage: "MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  4. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  5. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  6. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  7. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  8. MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    MemEvoBench is the first benchmark for long-horizon memory safety in LLM agents, using QA tasks across 7 domains and 36 risks plus workflow tasks with noisy tools to measure behavioral drift from biased memory updates.

  9. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  10. Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.

  11. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  12. $\delta$-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  13. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  14. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  15. Tree-based Credit Assignment for Multi-Agent Memory System

    cs.MA 2026-05 unverdicted novelty 6.0

    TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.

  16. Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...

  17. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

    math.OC 2026-04 unverdicted novelty 6.0

    Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.

  18. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  19. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  20. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  21. Decocted Experience Improves Test-Time Inference in LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Decocted experience—extracting and organizing the essence from accumulated interactions—enables more effective context construction that improves test-time inference in LLM agents on math, web, and software tasks.

  22. PersonaVLM: Long-Term Personalized Multimodal LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.

  23. Joint Optimization of Multi-agent Memory System

    cs.MA 2026-03 unverdicted novelty 6.0

    CoMAM jointly optimizes agents in multi-agent LLM memory systems via end-to-end RL and adaptive credit assignment to improve collaboration and performance.

  24. Security Considerations for Multi-agent Systems

    cs.CR 2026-03 unverdicted novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

  25. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  26. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
