Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Deshraj Yadav; Dev Khant; Prateek Chhikara; Saket Aryan; Taranjeet Singh

arxiv: 2504.19413 · v1 · submitted 2025-04-28 · 💻 cs.CL · cs.AI

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav

Pith reviewed 2026-05-10 23:07 UTC · model grok-4.3

keywords: long-term memory, LLM agents, memory architecture, conversational AI, graph memory, RAG, scalability, LOCOMO benchmark

The pith

Mem0 dynamically extracts and consolidates key facts from conversations to give LLMs reliable long-term memory without processing full histories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mem0 as a memory architecture that pulls salient details from ongoing dialogues, stores them efficiently, and retrieves them as needed for consistent answers across sessions. A sympathetic reader would care because current LLMs struggle with extended interactions, either forgetting earlier context or incurring high costs from retaining everything. The authors evaluate it on the LOCOMO benchmark against six categories of baselines including full-context processing, RAG variants, and other memory systems. Results show higher accuracy on single-hop, temporal, multi-hop, and open-domain questions plus major reductions in latency and token use. A graph-based extension adds relational structure among stored facts for further gains.

Core claim

Mem0 is a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations. An enhanced variant uses graph-based representations to capture complex relational structures among conversational elements. On the LOCOMO benchmark it outperforms established memory systems, RAG setups, full-context processing, open-source solutions, proprietary systems, and dedicated memory platforms across single-hop, temporal, multi-hop, and open-domain questions. Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, the graph version scores about 2% higher overall than the base configuration, and Mem0 delivers 91% lower p95 latency with more than 90% lower token cost than the full-context approach.

What carries the argument

Mem0's dynamic extraction, consolidation, and retrieval pipeline for salient conversational information, together with its optional graph-based memory representation for relational structures.
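
The extract, consolidate, retrieve loop described above can be pictured with a small sketch. This is not Mem0's implementation or API: the `MemoryStore` class, the `subject | attribute | value` input format (standing in for LLM-based extraction), and the word-overlap ranking are all hypothetical simplifications chosen so the example runs without any model.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy long-term memory: maps a (subject, attribute) key to its latest value."""
    facts: dict = field(default_factory=dict)

    def extract(self, message: str) -> list:
        """Stand-in for LLM extraction: parse 'subject | attribute | value' lines."""
        triples = []
        for line in message.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3 and all(parts):
                triples.append(tuple(parts))
        return triples

    def consolidate(self, triples: list) -> None:
        """Add new facts; overwrite the stored value when the same key reappears."""
        for subject, attribute, value in triples:
            self.facts[(subject.lower(), attribute.lower())] = value

    def retrieve(self, query: str, k: int = 3) -> list:
        """Rank stored facts by how many query words appear in their key."""
        words = set(query.lower().replace("?", "").split())
        scored = []
        for (subject, attribute), value in self.facts.items():
            overlap = len(words & (set(subject.split()) | set(attribute.split())))
            if overlap:
                scored.append((overlap, f"{subject} {attribute}: {value}"))
        scored.sort(key=lambda pair: -pair[0])  # stable sort keeps insertion order on ties
        return [text for _, text in scored[:k]]


store = MemoryStore()
# Session 1: two facts are extracted and stored.
store.consolidate(store.extract("alice | home city | Paris\nalice | pet | a cat named Miso"))
# Session 2: a later message updates an existing fact instead of duplicating it.
store.consolidate(store.extract("alice | home city | Berlin"))
# Only the matching fact is returned, not the whole conversation history.
context = store.retrieve("Which city is Alice's home?")
```

The point of the design is that the prompt for a later question carries only `context`, a handful of consolidated facts, rather than every earlier turn. A real system would replace the parser with a model call and the word overlap with embedding search.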

If this is right

Outperforms all tested baselines on single-hop, temporal, multi-hop, and open-domain questions.
Delivers 26% relative gain in LLM-as-a-Judge score over OpenAI memory.
Graph memory variant adds roughly 2% overall score improvement over the base Mem0.
Reduces p95 latency by 91% and token cost by more than 90% versus full-context processing.
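
For readers unfamiliar with how such headline figures are formed, the sketch below shows the arithmetic behind a relative improvement, a relative reduction, and a p95 latency. All numbers in it are made up for illustration and are not taken from the paper; the nearest-rank percentile definition is also an assumption, since the page does not say which definition the authors use.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile: smallest sample with at least 95% of samples at or below it."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


# Hypothetical judge scores: a baseline at 50.0 and a new system at 63.0.
judge_gain = relative_improvement(baseline=50.0, new=63.0)

# Hypothetical per-query latencies in seconds for two systems.
latencies_full = [float(x) for x in range(1, 101)]
latencies_mem = [x * 0.09 for x in latencies_full]
latency_cut = relative_reduction(p95(latencies_full), p95(latencies_mem))
```

With these invented inputs, `judge_gain` is 0.26 and `latency_cut` is about 0.91. A relative gain of 26% says nothing about the absolute scores involved, which is why the referee report asks for absolute values alongside the percentages.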

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If extraction remains reliable at scale, the approach could support agents that maintain coherence across weeks of interaction rather than single sessions.
The relational graph may prove especially useful for tasks that track how facts evolve or connect over time, suggesting targeted tests on longer dependency chains.
Combining this memory layer with other agent components such as planning or tool use could further improve production deployment without proportional cost increases.
The efficiency gains open the possibility of running multiple parallel agents on the same hardware while each retains its own long-term context.
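
The suggestion that a relational graph helps track how facts evolve or connect can be made concrete with a toy structure. The sketch below is a hypothetical illustration, not the paper's graph memory: `GraphMemory`, its method names, and the integer timestamps are assumptions, and entity-relation triples are supplied by hand rather than extracted by a model.

```python
from collections import defaultdict


class GraphMemory:
    """Toy relational memory: directed, labeled, timestamped edges between entities."""

    def __init__(self):
        # edges[source][relation] -> list of (timestamp, target), oldest first
        self.edges = defaultdict(lambda: defaultdict(list))

    def add(self, source: str, relation: str, target: str, timestamp: int) -> None:
        """Record an edge and keep that relation's history ordered by time."""
        history = self.edges[source][relation]
        history.append((timestamp, target))
        history.sort()

    def current(self, source: str, relation: str):
        """Latest known target for (source, relation), or None if unknown."""
        history = self.edges.get(source, {}).get(relation, [])
        return history[-1][1] if history else None

    def as_of(self, source: str, relation: str, timestamp: int):
        """Target that held at `timestamp`, as needed for a temporal question."""
        history = self.edges.get(source, {}).get(relation, [])
        valid = [target for ts, target in history if ts <= timestamp]
        return valid[-1] if valid else None

    def hop(self, source: str, relations: list):
        """Follow a chain of relations, as needed for a multi-hop question."""
        node = source
        for relation in relations:
            node = self.current(node, relation)
            if node is None:
                return None
        return node


graph = GraphMemory()
graph.add("alice", "works_at", "Acme", timestamp=1)
graph.add("Globex", "located_in", "Berlin", timestamp=2)
graph.add("alice", "works_at", "Globex", timestamp=5)  # the fact changes over time

employer_now = graph.current("alice", "works_at")
employer_then = graph.as_of("alice", "works_at", timestamp=3)
work_city = graph.hop("alice", ["works_at", "located_in"])
```

Keeping the edge history instead of overwriting it is what lets the same structure answer both "where does Alice work now" and "where did Alice work at time 3". Longer dependency chains correspond to longer `relations` lists passed to `hop`.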

Load-bearing premise

Extracting and consolidating only the most salient facts from conversations preserves every piece of context required for correct answers to complex multi-hop and temporal questions.

What would settle it

A new evaluation set of long multi-session dialogues containing explicit temporal chains and multi-hop dependencies where full-context processing scores measurably higher than Mem0 on accuracy metrics.

Original abstract

Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an overall score around 2% higher than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% of token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Mem0, a scalable memory-centric architecture for LLMs that dynamically extracts, consolidates, and retrieves salient information from multi-session conversations, along with a graph-based variant for capturing relational structures. It evaluates both variants on the LOCOMO benchmark against six categories of baselines (memory-augmented systems, RAG variants, full-context, open-source, proprietary, and dedicated platforms), claiming consistent outperformance across single-hop, temporal, multi-hop, and open-domain questions, including a 26% relative gain in LLM-as-Judge over OpenAI, ~2% additional gain from the graph variant, 91% lower p95 latency, and >90% token cost savings versus full-context.

Significance. If the results hold after addressing the gaps below, this would represent a practical contribution to production-ready long-term memory for AI agents, with notable efficiency advantages over full-context baselines that could enable scalable deployment. The breadth of baseline comparisons across question categories is a strength, though the absence of targeted ablations and error analysis limits the ability to attribute gains specifically to the proposed extraction and graph mechanisms.

major comments (2)

[Experimental evaluation (Section 4)] Experimental evaluation (Section 4 / LOCOMO results): Aggregate scores are reported for the four question categories and LLM-as-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.
[Methodology] Methodology and implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.

minor comments (1)

[Abstract] The abstract states 'around 2% higher overall score' for the graph variant; the main text should report the exact metric, absolute values, and statistical significance for this comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical contributions of Mem0 to scalable long-term memory for AI agents. The comments highlight important areas for improving the strength of our claims and reproducibility. We address each major comment below and have revised the manuscript to incorporate additional analysis and details where feasible.

Point-by-point responses

Referee: [Experimental evaluation (Section 4)] Experimental evaluation (Section 4 / LOCOMO results): Aggregate scores are reported for the four question categories and LLM-as-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.

Authors: We agree that aggregate metrics alone make it harder to isolate the contributions of dynamic extraction, consolidation, and graph-based retrieval. In the revised manuscript we will add a dedicated error analysis subsection in Section 4 that provides per-category breakdowns (single-hop, temporal, multi-hop, open-domain) with representative success and failure examples, focusing on cases involving temporal anchors and cross-turn entities. We will also include targeted ablations: (i) Mem0 without dynamic extraction/consolidation, (ii) base Mem0 versus graph variant, and (iii) retrieval-only versus full memory pipeline. These will help attribute gains more precisely to the proposed mechanisms. A full extraction-precision audit against gold facts is not possible because LOCOMO does not provide such annotations; we will instead report precision estimates from manual inspection of a sampled subset of extracted memories and note this as a limitation. revision: partial
Referee: [Methodology] Methodology and implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.

Authors: We acknowledge that the original manuscript omitted several implementation details necessary for full reproducibility. The revised version will expand the Experimental Setup section with: (1) LOCOMO data usage and any train/test splits applied; (2) the exact extraction and consolidation prompts together with the underlying models (gpt-4o for extraction, gpt-4o-mini for retrieval); (3) the graph construction algorithm, which uses LLM-based entity-relation extraction followed by incremental graph updates; and (4) complete baseline configurations, including chunk sizes (256/512/1024 tokens) and k values (3/5/10) for all RAG variants, as well as the exact settings for the other five baseline categories. These additions will allow independent verification of the reported accuracy gains, 91% p95 latency reduction, and >90% token cost savings. revision: yes
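
The RAG baseline grid named in the response above (chunk sizes of 256/512/1024 tokens, k of 3/5/10) can be sketched as a sweep. Only those six values come from the text; the tokenization, the lexical overlap score, and the synthetic conversation are assumptions made so the example is self-contained, and none of it reflects the authors' actual baseline code.

```python
from itertools import product

CHUNK_SIZES = (256, 512, 1024)  # tokens per chunk, values listed in the rebuttal
TOP_K = (3, 5, 10)              # chunks retrieved per query, values listed in the rebuttal


def chunk(tokens: list, size: int) -> list:
    """Split a token list into consecutive, non-overlapping chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def retrieve(chunks: list, query: list, k: int) -> list:
    """Return the k chunks sharing the most distinct tokens with the query (toy lexical score)."""
    wanted = set(query)
    ranked = sorted(chunks, key=lambda c: len(wanted & set(c)), reverse=True)
    return ranked[:k]


def context_tokens(conversation: list, query: list, size: int, k: int) -> int:
    """Number of tokens sent to the model for one (chunk size, k) configuration."""
    return sum(len(c) for c in retrieve(chunk(conversation, size), query, k))


# Hypothetical 10,000-token conversation; string tokens stand in for a real tokenizer.
conversation = [f"tok{i % 500}" for i in range(10_000)]
query = ["tok7", "tok42"]

# One entry per configuration in the 3 x 3 grid.
grid = {
    (size, k): context_tokens(conversation, query, size, k)
    for size, k in product(CHUNK_SIZES, TOP_K)
}
```

On this synthetic input the smallest configuration, `(256, 3)`, sends 768 tokens, while `(1024, 10)` retrieves every chunk and sends all 10,000, which is the same cost as the full-context baseline. That spread is why the exact chunk size and k matter when token-cost savings are compared across systems.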

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain

Full rationale

The paper proposes the Mem0 architecture for dynamic memory extraction/consolidation/retrieval in LLMs and evaluates it empirically on the external LOCOMO benchmark against six categories of baselines. All reported results (26% relative LLM-as-Judge gain, 2% graph variant uplift, 91% p95 latency reduction, >90% token savings) are direct performance comparisons to independent systems rather than any first-principles derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear in the provided text; the central claims rest on aggregate benchmark scores without reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; relies on domain assumption that structured memory extraction improves coherence without full verification of extraction accuracy.

axioms (1)

domain assumption: Dynamic extraction of salient information from conversations can be done reliably enough to support multi-hop and temporal reasoning.
Central to the architecture's ability to outperform full-context and RAG baselines.

pith-pipeline@v0.9.0 · 5618 in / 1157 out tokens · 35852 ms · 2026-05-10T23:07:10.637047+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements.
IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration.
IndisputableMonolith.Foundation.DiscretenessForcing discreteness_forced unclear

Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 conditional novelty 8.0

GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
cs.CL 2026-03 unverdicted novelty 8.0

AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts
cs.IR 2026-05 unverdicted novelty 7.0

MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.
MemGym: a Long-Horizon Memory Environment for LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory
cs.CL 2026-05 unverdicted novelty 7.0

SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heter...
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 unverdicted novelty 7.0

GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
Agentic Recommender System with Hierarchical Belief-State Memory
cs.CL 2026-05 unverdicted novelty 7.0

MARS uses hierarchical memory and LLM planning to achieve 26.4% higher HR@1 on InstructRec benchmarks compared to prior methods.
Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models
cs.IR 2026-05 conditional novelty 7.0

PGR expands user queries into plausible future steps via Tree-of-Thought or chains and uses them as retrieval probes, delivering nearly 3x recall gains on the new MemoryQuest benchmark for low-similarity memory retrieval.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
cs.LG 2026-05 unverdicted novelty 7.0

EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
MEME: Multi-entity & Evolving Memory Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI 2026-05 unverdicted novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
cs.AI 2026-05 unverdicted novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
cs.CL 2026-05 accept novelty 7.0

CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
cs.AI 2026-05 unverdicted novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...
Stateful Agent Backdoor
cs.CR 2026-05 unverdicted novelty 7.0

A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
cs.MA 2026-05 unverdicted novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
cs.AI 2026-05 unverdicted novelty 7.0

MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL 2026-05 unverdicted novelty 7.0

MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
cs.CL 2026-04 unverdicted novelty 7.0

Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
Latent Preference Modeling for Cross-Session Personalized Tool Calling
cs.CL 2026-04 unverdicted novelty 7.0

Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
cs.IR 2026-04 conditional novelty 7.0

vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
The Missing Knowledge Layer in Cognitive Architectures for AI Agents
cs.AI 2026-04 conditional novelty 7.0

Cognitive architectures for AI agents require a distinct Knowledge layer with indefinite supersession persistence, separate from Memory decay, Wisdom evidence-gating, and Intelligence ephemerality.
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
cs.DB 2026-03 unverdicted novelty 7.0

GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
cs.AI 2026-03 unverdicted novelty 7.0

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams
cs.CL 2026-03 unverdicted novelty 7.0

SensorPersona uses LLMs for hierarchical reasoning on longitudinal mobile sensor streams to continually extract stable personas, showing up to 31.4% higher recall and 85.7% win rate over baselines on a 20-user dataset.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
MIRIX: Multi-Agent Memory System for LLM-Based Agents
cs.CL 2025-07 unverdicted novelty 7.0

MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
cs.CL 2025-07 unverdicted novelty 7.0

MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL 2026-05 unverdicted novelty 6.0

Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
cs.CL 2026-05 unverdicted novelty 6.0

DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API to...
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
cs.AI 2026-05 unverdicted novelty 6.0

ActiveGraph inverts traditional agent frameworks by treating the append-only event log as the primary source of truth, from which the reactive graph is projected, yielding deterministic replay, forking, and lineage tracking.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
cs.CL 2026-05 unverdicted novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
cs.CV 2026-05 unverdicted novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Memory-equipped LLM agents exhibit increasing safety violation rates as memory accumulates across independent tasks, termed temporal memory contamination, detected via a new trigger-probe protocol.
Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression
cs.LG 2026-05 unverdicted novelty 6.0

Context Codec is a commitment-level framework for verifiable LLM context compression using semantic atoms, defined metrics, and a compact rendering language.
GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction
cs.AI 2026-05 unverdicted novelty 6.0

GRID trains Qwen-based 4B models on a task-bank reward system of multi-select questions and regex targets to extract security KGs from CTI text, reporting 84.62% precision and 64.91% recall on 249 articles from five sources.
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
cs.CL 2026-05 unverdicted novelty 6.0

DimMem introduces a dimensional memory framework that structures memories as typed atomic units to improve retrieval efficiency and accuracy for long-term LLM agent tasks.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
cs.CL 2026-05 unverdicted novelty 6.0

H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
cs.LG 2026-05 unverdicted novelty 6.0

SeqMem-Eval reveals that high final accuracy in sequential LLM memory tasks often coexists with substantial forgetting and negative transfer, exposing stability-adaptability trade-offs hidden by standard aggregate metrics.
Agentic Recommender System with Hierarchical Belief-State Memory
cs.CL 2026-05 unverdicted novelty 6.0

MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
cs.AI 2026-05 unverdicted novelty 6.0

HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
cs.LG 2026-05 unverdicted novelty 6.0

Empirical evaluation of eight memory condensation strategies on 480 DiscoveryBench tasks finds no significant impact on hypothesis quality but domain-dependent differences in token efficiency.
Cognifold: Always-On Proactive Memory via Cognitive Folding
cs.AI 2026-05 unverdicted novelty 6.0

Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 144 Pith papers

[1] Carefully analyze all provided memories from both speakers

[2] Pay special attention to the timestamps to determine the answer

[3] If the question asks about a specific event or fact, look for direct evidence in the memories

[4] If the memories contain contradictory information, prioritize the most recent memory

[5] If there is a question about time references (like "last year", "two months ago", etc.), calculate the actual date based on the memory timestamp. For example, if a memory from 4 May 2022 mentions "went to India last year," then the trip occurred in 2021

[6] Always convert relative time references to specific dates, months, or years. For example, convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory timestamp. Ignore the reference while answering the question

[7] Focus only on the content of the memories from both speakers. Do not confuse character names mentioned in memories with the actual users who created those memories

[8] The answer should be less than 5-6 words.

# APPROACH (Think step by step):

[15] Ensure your final answer is specific and avoids vague time references

Memories for user {speaker_1_user_id}: {speaker_1_memories}
Memories for user {speaker_2_user_id}: {speaker_2_memories}
Question: {question}
Answer:

Prompt Template for Results Generation (Me...

[16] First, examine all memories that contain information related to the question

[17] Examine the timestamps and content of these memories carefully

[18] Look for explicit mentions of dates, times, locations, or events that answer the question

[19] If the answer requires calculation (e.g., converting relative time references), show your work

[20] Analyze the knowledge graph relations to understand the user’s knowledge context

[21] Formulate a precise, concise answer based solely on the evidence in the memories

[22] Double-check that your answer directly addresses the question asked

[23] (1:56 pm on 8 May, 2023) Caroline: Hey Mel! Good to see you! How have you been? (1:56 pm on 8 May, 2023) Melanie: Hey Caroline! Good to see you! I’m swamped with the kids & work

Ensure your final answer is specific and avoids vague time references

Memories for user {speaker_1_user_id}: {speaker_1_memories}
Relations for user {speaker_1_user_id}: {speaker_1_graph_memories}
Memories for user {speaker_2_user_id}: {speaker_2_memories}
Relations for user {speaker_2_user_id}: {speaker_2_graph_memories}
Question: {question}
Answer:

Pro...
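The prompt items on relative time references ask the model to anchor phrases such as "last year" to the timestamp of the memory that contains them. As a rough illustration of that arithmetic only, here is a minimal Python sketch. It is not part of Mem0: the function name `resolve_relative`, the small word-to-number table, and the restriction to month and year granularity are all assumptions made for this example.

```python
import calendar
import re
from datetime import date

# Assumed vocabulary of spelled-out counts (hypothetical; extend as needed).
_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}


def resolve_relative(phrase: str, memory_date: date) -> str:
    """Convert a relative time phrase to a year or 'Month Year' string,
    anchored at the timestamp of the memory that mentions it."""
    text = phrase.strip().lower()
    if text == "last year":
        return str(memory_date.year - 1)
    if text == "last month":
        text = "one month ago"

    match = re.fullmatch(r"(\d+|[a-z]+) (month|year)s? ago", text)
    if match is None:
        raise ValueError(f"unsupported time reference: {phrase!r}")

    count_token, unit = match.groups()
    if count_token.isdigit():
        count = int(count_token)
    elif count_token in _WORDS:
        count = _WORDS[count_token]
    else:
        raise ValueError(f"unknown count in time reference: {phrase!r}")

    if unit == "year":
        return str(memory_date.year - count)

    # Count months from year 0 so that subtraction wraps across years.
    months = memory_date.year * 12 + (memory_date.month - 1) - count
    year, month_index = divmod(months, 12)
    return f"{calendar.month_name[month_index + 1]} {year}"


# The example from the prompt: a memory dated 4 May 2022 saying "last year".
print(resolve_relative("last year", date(2022, 5, 4)))       # 2021
# A memory dated 8 May 2023 saying "two months ago".
print(resolve_relative("two months ago", date(2023, 5, 8)))  # March 2023
```

In the prompt itself this conversion is delegated to the LLM; a deterministic helper like this would only be useful for checking answers or pre-normalizing memories before they are stored.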
