Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-10 23:07 UTC · model grok-4.3
The pith
Mem0 dynamically extracts and consolidates key facts from conversations to give LLMs reliable long-term memory without processing full histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem0 is a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations. An enhanced variant uses graph-based representations to capture complex relational structures among conversational elements. On the LOCOMO benchmark it outperforms established memory systems, RAG setups, full-context processing, open-source solutions, proprietary systems, and dedicated memory platforms across single-hop, temporal, multi-hop, and open-domain questions. Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, the graph version scores about 2% higher overall, and both deliver 91% lower p95 latency while saving more than 90% in token cost versus full-context processing.
What carries the argument
Mem0's dynamic extraction, consolidation, and retrieval pipeline for salient conversational information, together with its optional graph-based memory representation for relational structures.
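The pipeline that carries the claim can be sketched at toy scale. Everything below is an illustrative stand-in, not Mem0's actual implementation: the real system uses LLM prompts for extraction and consolidation, while this sketch substitutes a sentence splitter and a Jaccard-overlap threshold so the extract/consolidate/retrieve control flow runs without model calls.

```python
# Illustrative extract -> consolidate -> retrieve memory pipeline.
# The extractor, overlap threshold, and scoring are hypothetical
# stand-ins for Mem0's LLM-driven components.
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    turn: int


class MemoryStore:
    def __init__(self):
        self.memories: list[Memory] = []

    def extract(self, utterance: str, turn: int) -> list[Memory]:
        # Stand-in for LLM extraction: treat each sentence as a candidate fact.
        return [Memory(s.strip(), turn)
                for s in utterance.split(".") if s.strip()]

    def consolidate(self, candidates: list[Memory]) -> None:
        # Toy version of add/update decisions: a candidate that heavily
        # overlaps an existing memory replaces it (the newer fact wins).
        for cand in candidates:
            cw = set(cand.text.lower().split())
            for i, mem in enumerate(self.memories):
                mw = set(mem.text.lower().split())
                if len(cw & mw) / max(len(cw | mw), 1) > 0.5:
                    self.memories[i] = cand
                    break
            else:
                self.memories.append(cand)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank memories by word overlap with the query; production systems
        # would use embedding similarity here.
        qw = set(query.lower().split())
        ranked = sorted(self.memories,
                        key=lambda m: len(qw & set(m.text.lower().split())),
                        reverse=True)
        return [m.text for m in ranked[:k]]
```

Feeding "Alice moved to Berlin" and later "Alice moved to Munich" through this store leaves a single updated memory rather than two contradictory ones, which is the consolidation behavior the core claim depends on.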
If this is right
- Outperforms all tested baselines on single-hop, temporal, multi-hop, and open-domain questions.
- Delivers 26% relative gain in LLM-as-a-Judge score over OpenAI memory.
- Graph memory variant adds roughly 2% overall score improvement over the base Mem0.
- Reduces p95 latency by 91% and token cost by more than 90% versus full-context processing.
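The headline figures are relative, so a small worked computation makes the arithmetic explicit. The baseline values below are hypothetical placeholders, not numbers from the paper's tables; the `p95` helper shows the nearest-rank percentile the latency claim is stated in terms of.

```python
import math

# Worked arithmetic behind the headline numbers (baseline values assumed).
openai_judge = 50.0                   # hypothetical LLM-as-a-Judge baseline
mem0_judge = openai_judge * 1.26      # 26% *relative* gain -> 63.0

full_ctx_p95 = 10.0                   # hypothetical full-context p95 latency (s)
mem0_p95 = full_ctx_p95 * (1 - 0.91)  # 91% lower -> ~0.9 s


def p95(samples):
    """Nearest-rank 95th percentile, the tail statistic the claim uses."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]
```

A relative gain multiplies the baseline, so the absolute sizes of both scores matter when comparing systems evaluated against different baselines.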
Where Pith is reading between the lines
- If extraction remains reliable at scale, the approach could support agents that maintain coherence across weeks of interaction rather than single sessions.
- The relational graph may prove especially useful for tasks that track how facts evolve or connect over time, suggesting targeted tests on longer dependency chains.
- Combining this memory layer with other agent components such as planning or tool use could further improve production deployment without proportional cost increases.
- The efficiency gains open the possibility of running multiple parallel agents on the same hardware while each retains its own long-term context.
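The second point above, that a relational graph helps when facts evolve over time, can be sketched as an incremental triple store in which a newer object supersedes the old one for the same (subject, relation) pair. The class and triples below are hypothetical; Mem0's graph variant extracts such triples with an LLM rather than receiving them hand-written.

```python
# Minimal incremental graph memory: (subject, relation) -> object edges,
# where a newer triple supersedes the old object. Names are illustrative.
class GraphMemory:
    def __init__(self):
        self.edges: dict[tuple[str, str], str] = {}

    def update(self, triples):
        for subj, rel, obj in triples:
            self.edges[(subj, rel)] = obj  # newer fact wins

    def query(self, subj, rel):
        return self.edges.get((subj, rel))
```

Because the edge key is the (subject, relation) pair, a question like "where does Alice live now?" resolves to the latest value rather than to whichever stored sentence happens to be most similar to the query.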
Load-bearing premise
Extracting and consolidating only the most salient facts from conversations preserves every piece of context required for correct answers to complex multi-hop and temporal questions.
What would settle it
A new evaluation set of long multi-session dialogues containing explicit temporal chains and multi-hop dependencies where full-context processing scores measurably higher than Mem0 on accuracy metrics.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mem0, a scalable memory-centric architecture for LLMs that dynamically extracts, consolidates, and retrieves salient information from multi-session conversations, along with a graph-based variant for capturing relational structures. It evaluates both variants on the LOCOMO benchmark against six categories of baselines (memory-augmented systems, RAG variants, full-context, open-source, proprietary, and dedicated platforms), claiming consistent outperformance across single-hop, temporal, multi-hop, and open-domain questions, including a 26% relative gain in LLM-as-Judge over OpenAI, ~2% additional gain from the graph variant, 91% lower p95 latency, and >90% token cost savings versus full-context.
Significance. If the results hold after addressing the gaps below, this would represent a practical contribution to production-ready long-term memory for AI agents, with notable efficiency advantages over full-context baselines that could enable scalable deployment. The breadth of baseline comparisons across question categories is a strength, though the absence of targeted ablations and error analysis limits the ability to attribute gains specifically to the proposed extraction and graph mechanisms.
major comments (2)
- [Experimental evaluation (Section 4 / LOCOMO results)] Aggregate scores are reported for the four question categories and the LLM-as-a-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.
- [Methodology] Implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, the graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.
minor comments (1)
- [Abstract] The abstract states 'around 2% higher overall score' for the graph variant; the main text should report the exact metric, absolute values, and statistical significance for this comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical contributions of Mem0 to scalable long-term memory for AI agents. The comments highlight important areas for improving the strength of our claims and reproducibility. We address each major comment below and have revised the manuscript to incorporate additional analysis and details where feasible.
Point-by-point responses
- Referee: [Experimental evaluation (Section 4 / LOCOMO results)] Aggregate scores are reported for the four question categories and the LLM-as-a-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.
  Authors: We agree that aggregate metrics alone make it harder to isolate the contributions of dynamic extraction, consolidation, and graph-based retrieval. In the revised manuscript we will add a dedicated error-analysis subsection in Section 4 that provides per-category breakdowns (single-hop, temporal, multi-hop, open-domain) with representative success and failure examples, focusing on cases involving temporal anchors and cross-turn entities. We will also include targeted ablations: (i) Mem0 without dynamic extraction/consolidation, (ii) base Mem0 versus the graph variant, and (iii) retrieval-only versus the full memory pipeline. These will help attribute gains more precisely to the proposed mechanisms. A full extraction-precision audit against gold facts is not possible because LOCOMO does not provide such annotations; we will instead report precision estimates from manual inspection of a sampled subset of extracted memories and note this as a limitation.
  Revision: partial
- Referee: [Methodology] Implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, the graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.
  Authors: We acknowledge that the original manuscript omitted several implementation details necessary for full reproducibility. The revised version will expand the Experimental Setup section with: (1) LOCOMO data usage and any train/test splits applied; (2) the exact extraction and consolidation prompts together with the underlying models (gpt-4o for extraction, gpt-4o-mini for retrieval); (3) the graph construction algorithm, which uses LLM-based entity-relation extraction followed by incremental graph updates; and (4) complete baseline configurations, including chunk sizes (256/512/1024 tokens) and k values (3/5/10) for all RAG variants, as well as the exact settings for the other five baseline categories. These additions will allow independent verification of the reported accuracy gains, 91% p95 latency reduction, and >90% token cost savings.
  Revision: yes
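The RAG baseline sweep the rebuttal names (chunk sizes of 256/512/1024 tokens and k values of 3/5/10) expands to a nine-point configuration grid. A minimal reproducible enumeration, with dictionary key names chosen here for illustration, could look like:

```python
from itertools import product

# Nine-point RAG baseline grid matching the rebuttal's stated sweep
# (chunk sizes in tokens, k = number of retrieved chunks).
chunk_sizes = [256, 512, 1024]
k_values = [3, 5, 10]
rag_configs = [{"chunk_size": c, "k": k}
               for c, k in product(chunk_sizes, k_values)]
```

Enumerating the grid explicitly, rather than describing it in prose, is what lets an independent reader rerun all nine RAG baselines exactly as configured.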
Circularity Check
Empirical benchmark evaluation with no derivation chain
Full rationale
The paper proposes the Mem0 architecture for dynamic memory extraction/consolidation/retrieval in LLMs and evaluates it empirically on the external LOCOMO benchmark against six categories of baselines. All reported results (26% relative LLM-as-Judge gain, 2% graph variant uplift, 91% p95 latency reduction, >90% token savings) are direct performance comparisons to independent systems rather than any first-principles derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear in the provided text; the central claims rest on aggregate benchmark scores without reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Dynamic extraction of salient information from conversations can be done reliably enough to support multi-hop and temporal reasoning.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements."
- IndisputableMonolith.Foundation.LedgerForcing.conservation_from_balance · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration."
- IndisputableMonolith.Foundation.DiscretenessForcing.discreteness_forced · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
-
Agentic Recommender System with Hierarchical Belief-State Memory
MARS uses hierarchical memory and LLM planning to achieve 26.4% higher HR@1 on InstructRec benchmarks compared to prior methods.
-
Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models
PGR expands user queries into plausible future steps via Tree-of-Thought or chains and uses them as retrieval probes, delivering nearly 3x recall gains on the new MemoryQuest benchmark for low-similarity memory retrieval.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
-
MEME: Multi-entity & Evolving Memory Evaluation
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...
-
Stateful Agent Backdoor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
-
Latent Preference Modeling for Cross-Session Personalized Tool Calling
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
-
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
-
The Missing Knowledge Layer in Cognitive Architectures for AI Agents
Cognitive architectures for AI agents require a distinct Knowledge layer with indefinite supersession persistence, separate from Memory decay, Wisdom evidence-gating, and Intelligence ephemerality.
-
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
-
SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams
SensorPersona uses LLMs for hierarchical reasoning on longitudinal mobile sensor streams to continually extract stable personas, showing up to 31.4% higher recall and 85.7% win rate over baselines on a 20-user dataset.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
MIRIX: Multi-Agent Memory System for LLM-Based Agents
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
-
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
-
Cognifold: Always-On Proactive Memory via Cognitive Folding
Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...
-
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory
Tenure replaces similarity search with a structured belief store using scope isolation and alias-weighted BM25 retrieval, achieving 1.0 precision on 72 cases where cosine similarity scores 0.12.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy uses edge detection of sensitive spans and type-aware placeholders to enable cloud-side memory management for LLM agents without exposing private data, achieving under 1.6% utility loss.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.
-
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
-
GASim: A Graph-Accelerated Hybrid Framework for Social Simulation
GASim accelerates hybrid LLM-ABM social simulations via graph-optimized memory, graph message passing, and entropy-driven agent grouping, delivering 9.94x speedup and under 20% token use while aligning with real-world trends.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall
True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
-
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...