pith · machine review for the scientific record

arxiv: 2502.12110 · v11 · submitted 2025-02-17 · 💻 cs.CL · cs.HC

Recognition: 3 Lean theorem links

A-MEM: Agentic Memory for LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:40 UTC · model grok-4.3

classification: 💻 cs.CL · cs.HC
keywords: agentic memory · LLM agents · dynamic memory organization · Zettelkasten method · memory linking · memory evolution · knowledge networks · agent memory systems

The pith

An agentic memory system lets LLM agents dynamically index, link, and evolve interconnected knowledge networks from their experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a memory system for LLM agents that creates structured notes for each new experience and automatically links them to relevant past memories while updating older entries. It draws on Zettelkasten principles of dynamic indexing and connection-making to replace the fixed storage and retrieval used in prior agent memory designs. This setup allows the memory network to grow and refine itself as new tasks arrive. A reader would care because agents that maintain adaptable, linked histories can handle longer and more varied real-world sequences without forgetting or repeating mistakes. Experiments across six foundation models report better results than existing state-of-the-art memory baselines.

Core claim

The central claim is that the agentic memory system produces an evolving, interconnected knowledge network that improves agent performance on complex tasks: for each new memory it generates a note with a contextual description, keywords, and tags, then analyzes historical memories to establish meaningful links and trigger updates to existing entries.

What carries the argument

The agentic memory process that creates structured notes and performs dynamic similarity-based linking together with evolution updates to prior memories.
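
Sketched as code, that process looks roughly like the following. This is an editorial reconstruction, not the released A-mem API: `MemoryNote`, `generate_attributes`, `evolve`, and the 0.7 cosine threshold are assumptions standing in for the paper's prompt-driven components.

```python
# Hypothetical sketch of the add-link-evolve cycle described above.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryNote:
    content: str
    context: str             # LLM-written contextual description
    keywords: list[str]      # LLM-extracted keywords
    tags: list[str]          # LLM-assigned categorical tags
    embedding: np.ndarray = None
    links: list[int] = field(default_factory=list)


def add_memory(store: list, content: str, llm, embed, sim_threshold: float = 0.7):
    """Create a structured note, link it to sufficiently similar past notes,
    and let the LLM revise those neighbors (the paper's memory evolution)."""
    context, keywords, tags = llm.generate_attributes(content)  # assumed LLM call
    vec = embed(content + " " + context)
    note = MemoryNote(content, context, keywords, tags, embedding=vec)
    for i, old in enumerate(store):
        sim = float(vec @ old.embedding) / (
            np.linalg.norm(vec) * np.linalg.norm(old.embedding))
        if sim >= sim_threshold:
            note.links.append(i)
            old.links.append(len(store))                   # bidirectional link (assumption)
            old.context, old.tags = llm.evolve(old, note)  # assumed LLM call
    store.append(note)
    return note
```

The step worth noticing is the evolution call inside the loop: because a new note can trigger the LLM to rewrite a neighbor's context and tags, every addition is also a potential edit of history, which is exactly the closed loop the referee report below flags.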

If this is right

  • Agents gain adaptability across diverse tasks because memory organization is no longer limited to fixed operations and structures.
  • Historical experiences become more usable as new memories trigger refinements to the contextual representations of older ones.
  • The memory network continuously evolves rather than remaining static, supporting longer-term task sequences.
  • Performance gains appear consistently across multiple foundation models when compared with prior state-of-the-art memory systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents using this memory approach could maintain coherence over hundreds of steps without external human intervention to correct memory errors.
  • The same linking mechanism might be applied to multi-agent settings where separate agents share and evolve a joint memory network.
  • Efficiency questions arise for very large memory collections, where the cost of repeated similarity analysis against every stored note could become a bottleneck; one standard mitigation is sketched below.
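
On that last point, the usual mitigation is to replace the all-pairs comparison with a vector index so that each insertion scores only the top-k candidates. A minimal sketch of the pattern, assuming L2-normalized embeddings and FAISS, neither of which the paper's abstract mentions:

```python
# Editorial sketch: amortize repeated similarity analysis with a vector index.
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                       # embedding dimension (assumption)
index = faiss.IndexFlatIP(d)  # exact inner product; swap in IndexHNSWFlat at scale


def add_and_link(vec: np.ndarray, k: int = 10, threshold: float = 0.7) -> list[int]:
    """Score a new note against at most k stored notes instead of all of them.
    With L2-normalized embeddings, inner product equals cosine similarity."""
    links: list[int] = []
    if index.ntotal > 0:
        sims, ids = index.search(vec[None, :].astype(np.float32),
                                 min(k, index.ntotal))
        links = [int(i) for s, i in zip(sims[0], ids[0]) if s >= threshold]
    index.add(vec[None, :].astype(np.float32))
    return links
```

With an approximate index this keeps insertion cost sublinear in store size, at the price of occasionally missing a valid link.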

Load-bearing premise

The underlying LLM must reliably produce accurate contextual descriptions, keywords, tags, and meaningful links without introducing errors or hallucinations that degrade the overall memory network.

What would settle it

Measure task performance on the six foundation models when the system is used versus when fixed memory baselines are used; if no consistent improvement appears, or if incorrect links cause measurable degradation over long sequences, the central claim does not hold.
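
Phrased as a check over per-model results, the settling experiment might look like this; `evaluate` is hypothetical shorthand for running the full benchmark suite with a given memory system, and the 5-point degradation tolerance is an arbitrary assumption:

```python
# Editorial sketch of the settling experiment described above.
def settles_claim(models, evaluate, tolerance: float = 0.05) -> bool:
    """evaluate(model, memory_system) -> list of per-step success flags (0/1)
    over one long task sequence. Returns True only if A-MEM beats the fixed
    baseline on every model and shows no late-sequence degradation."""
    for m in models:
        amem = evaluate(m, "a-mem")
        fixed = evaluate(m, "fixed-baseline")
        if sum(amem) <= sum(fixed):
            return False                     # no improvement on this model
        # Degradation probe: do accumulated incorrect links drag down late steps?
        half = max(1, len(amem) // 2)
        early = sum(amem[:half]) / half
        late = sum(amem[half:]) / max(1, len(amem) - half)
        if late < early - tolerance:
            return False
    return True
```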

Original abstract

While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes A-MEM, an agentic memory system for LLM agents inspired by Zettelkasten principles. It dynamically creates structured notes (contextual descriptions, keywords, tags) for new memories via LLM prompting, identifies links to historical memories, and enables memory evolution by updating prior entries' representations as new information integrates. This forms an evolving interconnected knowledge network. The central claim is that this yields superior performance improvements over existing SOTA baselines across experiments on six foundation models, with source code released at two GitHub repositories.

Significance. If the empirical gains are robust and the memory network remains stable, the work could meaningfully advance memory systems for LLM agents by enabling adaptive, context-aware organization beyond fixed retrieval or static graphs. The explicit release of both evaluation and system code is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.
  2. [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

minor comments (2)
  1. [Abstract] The abstract mentions 'recent attempts to incorporate graph databases' but does not cite specific prior systems; adding 1-2 concrete references would clarify the positioning.
  2. [§3] Figure captions and algorithm pseudocode (if present in §3) could more explicitly label the LLM prompting steps versus the graph-update steps to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript describing A-MEM. The comments highlight important areas for strengthening the presentation of our empirical results and the reliability of the memory operations. We address each major comment below and have revised the manuscript accordingly to improve clarity, completeness, and rigor.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.

    Authors: We agree that the abstract is high-level and does not enumerate experimental details. The Experiments section (Section 4) does describe the six foundation models, task benchmarks (agentic QA, tool-use, and multi-step reasoning tasks), SOTA baselines (including fixed-retrieval and graph-memory systems), metrics (success rate, latency, and memory efficiency), and variance controls via repeated runs with different seeds. However, we acknowledge that statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values) and more explicit baseline implementation details were not sufficiently highlighted. In the revised manuscript we will (1) update the abstract with a concise sentence on the evaluation framework and (2) add a dedicated “Experimental Setup” subsection that includes all requested elements plus significance tests. These changes directly support the central empirical claim. revision: yes

  2. Referee: [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

    Authors: We recognize this as a substantive limitation. The original submission emphasizes end-to-end task performance and does not report direct fidelity measurements on the LLM-generated notes or links. In the revised version we will insert a new subsection (under Section 3 or 4) that presents quantitative validation: human evaluation on 200 randomly sampled memories measuring (a) accuracy of contextual descriptions, (b) relevance of keywords and tags, and (c) precision/recall of generated links. We will also report an error-propagation analysis by tracking how often an erroneous update affects downstream retrieval. These additions will allow readers to assess the robustness of the closed-loop evolution process. revision: yes
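
Both responses promise quantitative checks that the current manuscript lacks. A minimal sketch of what they might reduce to, assuming per-run paired scores and a human-labeled sample of links; the function names and the 0.05 significance level are ours, not the authors':

```python
# Editorial sketch of the checks promised in the two responses above.
from scipy.stats import wilcoxon


def significant_improvement(amem_scores, baseline_scores, alpha: float = 0.05):
    """Paired Wilcoxon signed-rank test over matched runs (response 1)."""
    _, p = wilcoxon(amem_scores, baseline_scores)
    return p < alpha, p


def link_precision_recall(generated: set, gold: set) -> tuple[float, float]:
    """Precision/recall of generated links against human labels on a sampled
    subset, e.g. the 200 memories proposed in response 2."""
    tp = len(generated & gold)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```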

Circularity Check

0 steps flagged

No circularity: empirical system proposal without derivational reductions

Full rationale

The paper presents a design for an agentic memory system that generates structured notes, identifies links, and evolves prior entries via LLM prompts, explicitly following Zettelkasten principles. No equations, fitted parameters, uniqueness theorems, or mathematical derivations appear in the abstract or description. All performance claims rest on external empirical experiments across six models against SOTA baselines rather than any internal self-definition, prediction-from-fit, or self-citation chain that reduces the central result to its own inputs by construction. The system is therefore self-contained as an engineering proposal evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that LLMs can perform reliable memory organization tasks and that the Zettelkasten-inspired structure improves agent performance; no explicit free parameters are described, and the single invented entity is a system construct rather than a physical posit.

axioms (2)
  • domain assumption: LLM agents require sophisticated memory organization beyond basic storage and retrieval to handle complex tasks effectively.
    Stated in the opening of the abstract as motivation for the work.
  • domain assumption: Dynamic indexing, linking, and evolution of memories will produce an adaptive knowledge network superior to fixed-structure systems.
    Core design principle, presented as following the Zettelkasten method.
invented entities (1)
  • Agentic memory network with evolving links (no independent evidence)
    purpose: To enable continuous refinement of historical memories through new integrations.
    Introduced as the core output of the system; no independent falsifiable evidence outside the proposed implementation is provided in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1433 out tokens · 39994 ms · 2026-05-11T00:40:57.572182+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LedgerForcing conservation_from_balance · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist.

  • Foundation.LedgerForcing add_event_balanced · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Additionally, this process enables memory evolution – as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    UNCLEAR: Pith found a possible connection between this passage and the cited Recognition theorem, but it is too broad or indirect to confirm.

    Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 conditional novelty 8.0

    GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

  2. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

    cs.AI 2026-05 unverdicted novelty 8.0

    RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

  3. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  4. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  5. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  6. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  7. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  8. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  9. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  10. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  11. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  12. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  13. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  14. MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

    cs.AI 2026-05 unverdicted novelty 7.0

    MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

  15. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  16. Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...

  17. Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

  18. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  19. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  20. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

  21. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  22. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  23. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  24. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  25. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  26. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  27. AEL: Agent Evolving Learning for Open-Ended Environments

    cs.CL 2026-04 conditional novelty 7.0

    AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...

  28. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  29. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  30. vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

    cs.IR 2026-04 conditional novelty 7.0

    vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...

  31. When to Forget: A Memory Governance Primitive

    cs.AI 2026-04 unverdicted novelty 7.0

    Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.

  32. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  33. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  34. GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing

    cs.DB 2026-03 unverdicted novelty 7.0

    GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.

  35. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  36. MIRIX: Multi-Agent Memory System for LLM-Based Agents

    cs.CL 2025-07 unverdicted novelty 7.0

    MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.

  37. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  38. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  39. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  40. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  41. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  42. PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...

  43. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  44. Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory

    cs.IR 2026-05 unverdicted novelty 6.0

    Tenure replaces similarity search with a structured belief store using scope isolation and alias-weighted BM25 retrieval, achieving 1.0 precision on 72 cases where cosine similarity scores 0.12.

  45. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  46. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  47. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge detection of sensitive spans and type-aware placeholders to enable cloud-side memory management for LLM agents without exposing private data, achieving under 1.6% utility loss.

  48. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.

  49. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.

  50. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  51. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  52. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  53. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  54. MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

  55. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

    math.OC 2026-04 unverdicted novelty 6.0

    Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.

  56. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  57. Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

    cs.IR 2026-04 unverdicted novelty 6.0

    SmartVector augments embeddings with time, confidence, and relation signals plus a consolidation process, raising top-1 accuracy on versioned queries from 31% to 62% on a synthetic benchmark while cutting stale answer...

  58. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  59. To Know is to Construct: Schema-Constrained Generation for Agent Memory

    cs.CL 2026-04 unverdicted novelty 6.0

    SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms ret...

  60. HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 93 Pith papers · 7 internal anchors

  1. [1]

How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking

    Sönke Ahrens.How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Amazon, 2017. Second Edition

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. Anthropic, Mar 2024. Accessed May 2025

  3. [3]

    Claude 3.5 sonnet model card addendum

    Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2025. Accessed May 2025

  4. [4]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511, 2023

  5. [5]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  6. [6]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInternational conference on machine learning, pages 2206–2240. PMLR, 2022

  7. [7]

Mind2Web: Towards a Generalist Agent for the Web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  8. [8]

    mem0: The memory layer for ai agents

Khant Dev and Singh Taranjeet. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0, 2024

  9. [9]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

  10. [10]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

Advanced RAG Techniques: An Illustrated Overview

    I. Ilin. Advanced RAG techniques: An illustrated overview, 2023

  13. [13]

Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations

    Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations.arXiv preprint arXiv:2310.13420, 2023

  14. [14]

Active Retrieval Augmented Generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation.arXiv preprint arXiv:2305.06983, 2023

  15. [15]

Digital Zettelkasten: Principles, Methods, & Examples

    David Kadavy.Digital Zettelkasten: Principles, Methods, & Examples. Google Books, May 2021

  16. [16]

    Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents

    Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents.arXiv preprint arXiv:2406.13144, 2024

  17. [17]

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

    Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727, 2024

  18. [18]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  19. [19]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  20. [20]

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

    Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning.arXiv preprint arXiv:2310.01352, 2023

  21. [21]

AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System

    Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. Agentlite: A lightweight library for building and advancing task-oriented llm agent system.arXiv preprint arXiv:2402.15538, 2024

  22. [22]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents.arXiv preprint arXiv:2402.17753, 2024

  23. [23]

AIOS: LLM Agent Operating System

    Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Aios: Llm agent operating system.arXiv e-prints, pp. arXiv–2403, 2024

  24. [24]

    Ret-llm: Towards a general read-write memory for large language models

    Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322, 2023

  25. [25]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  26. [26]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  27. [27]

Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

  28. [28]

smolagents: a smol library to build great agentic systems

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. smolagents: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025

  29. [29]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

  30. [30]

From Commands to Prompts: LLM-Based Semantic File System for AIOS

    Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. From commands to prompts: Llm-based semantic file system for aios.arXiv preprint arXiv:2410.11843, 2024

  31. [31]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.arXiv preprint arXiv:2212.10509, 2022

  32. [32]

    Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system

    Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343, 2023

  33. [33]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

  34. [34]

Learning to Filter Context for Retrieval-Augmented Generation

    Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023

  35. [35]

LLM-Powered Autonomous Agents

    Lilian Weng. LLM-powered autonomous agents. lilianweng.github.io, Jun 2023

  36. [36]

Beyond Goldfish Memory: Long-Term Open-Domain Conversation

    J Xu. Beyond goldfish memory: Long-term open-domain conversation.arXiv preprint arXiv:2107.07567, 2021

  37. [37]

    Chain-of-note: Enhancing robustness in retrieval-augmented language models

    Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models.arXiv preprint arXiv:2311.09210, 2023

  38. [38]

Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In

    Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in.arXiv preprint arXiv:2305.17331, 2023

  39. [39]

    Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
