pith · machine review for the scientific record

arxiv: 2502.12110 · v11 · submitted 2025-02-17 · 💻 cs.CL · cs.HC

Recognition: 3 Lean theorem links

A-MEM: Agentic Memory for LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:40 UTC · model grok-4.3

classification: 💻 cs.CL · cs.HC
keywords: agentic memory · LLM agents · dynamic memory organization · Zettelkasten method · memory linking · memory evolution · knowledge networks · agent memory systems

The pith

An agentic memory system lets LLM agents dynamically index, link, and evolve interconnected knowledge networks from their experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a memory system for LLM agents that creates structured notes for each new experience and automatically links them to relevant past memories while updating older entries. It draws on Zettelkasten principles of dynamic indexing and connection-making to replace the fixed storage and retrieval used in prior agent memory designs. This setup allows the memory network to grow and refine itself as new tasks arrive. A reader would care because agents that maintain adaptable, linked histories can handle longer and more varied real-world sequences without forgetting or repeating mistakes. Experiments across six foundation models report better results than existing state-of-the-art memory baselines.

Core claim

The central claim is that the agentic memory system produces an evolving, interconnected knowledge network that improves agent performance on complex tasks: for each new memory it generates a note with a contextual description, keywords, and tags, then analyzes historical memories to establish meaningful links and trigger updates to existing entries.

What carries the argument

The agentic memory process that creates structured notes and performs dynamic similarity-based linking together with evolution updates to prior memories.
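
Sketched as code, that process looks roughly like the following. This is an editorial reconstruction, not the released A-mem API: `MemoryNote`, `generate_attributes`, `evolve`, and the 0.7 cosine threshold are assumptions standing in for the paper's prompt-driven components.

```python
# Hypothetical sketch of the add-link-evolve cycle described above.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryNote:
    content: str
    context: str             # LLM-written contextual description
    keywords: list[str]      # LLM-extracted keywords
    tags: list[str]          # LLM-assigned categorical tags
    embedding: np.ndarray = None
    links: list[int] = field(default_factory=list)


def add_memory(store: list, content: str, llm, embed, sim_threshold: float = 0.7):
    """Create a structured note, link it to sufficiently similar past notes,
    and let the LLM revise those neighbors (the paper's memory evolution)."""
    context, keywords, tags = llm.generate_attributes(content)  # assumed LLM call
    vec = embed(content + " " + context)
    note = MemoryNote(content, context, keywords, tags, embedding=vec)
    for i, old in enumerate(store):
        sim = float(vec @ old.embedding) / (
            np.linalg.norm(vec) * np.linalg.norm(old.embedding))
        if sim >= sim_threshold:
            note.links.append(i)
            old.links.append(len(store))                   # bidirectional link (assumption)
            old.context, old.tags = llm.evolve(old, note)  # assumed LLM call
    store.append(note)
    return note
```

The step worth noticing is the evolution call inside the loop: because a new note can trigger the LLM to rewrite a neighbor's context and tags, every addition is also a potential edit of history, which is exactly the closed loop the referee report below flags.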

If this is right

  • Agents gain adaptability across diverse tasks because memory organization is no longer limited to fixed operations and structures.
  • Historical experiences become more usable as new memories trigger refinements to the contextual representations of older ones.
  • The memory network continuously evolves rather than remaining static, supporting longer-term task sequences.
  • Performance gains appear consistently across multiple foundation models when compared with prior state-of-the-art memory systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents using this memory approach could maintain coherence over hundreds of steps without external human intervention to correct memory errors.
  • The same linking mechanism might be applied to multi-agent settings where separate agents share and evolve a joint memory network.
  • Efficiency questions arise for very large memory collections, where the cost of repeated similarity analysis against every stored note could become a bottleneck; one standard mitigation is sketched below.
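
On that last point, the usual mitigation is to replace the all-pairs comparison with a vector index so that each insertion scores only the top-k candidates. A minimal sketch of the pattern, assuming L2-normalized embeddings and FAISS, neither of which the paper's abstract mentions:

```python
# Editorial sketch: amortize repeated similarity analysis with a vector index.
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                       # embedding dimension (assumption)
index = faiss.IndexFlatIP(d)  # exact inner product; swap in IndexHNSWFlat at scale


def add_and_link(vec: np.ndarray, k: int = 10, threshold: float = 0.7) -> list[int]:
    """Score a new note against at most k stored notes instead of all of them.
    With L2-normalized embeddings, inner product equals cosine similarity."""
    links: list[int] = []
    if index.ntotal > 0:
        sims, ids = index.search(vec[None, :].astype(np.float32),
                                 min(k, index.ntotal))
        links = [int(i) for s, i in zip(sims[0], ids[0]) if s >= threshold]
    index.add(vec[None, :].astype(np.float32))
    return links
```

With an approximate index this keeps insertion cost sublinear in store size, at the price of occasionally missing a valid link.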

Load-bearing premise

The underlying LLM must reliably produce accurate contextual descriptions, keywords, tags, and meaningful links without introducing errors or hallucinations that degrade the overall memory network.

What would settle it

Measure task performance on the six foundation models when the system is used versus when fixed memory baselines are used; if no consistent improvement appears, or if incorrect links cause measurable degradation over long sequences, the central claim does not hold.
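
Phrased as a check over per-model results, the settling experiment might look like this; `evaluate` is hypothetical shorthand for running the full benchmark suite with a given memory system, and the 5-point degradation tolerance is an arbitrary assumption:

```python
# Editorial sketch of the settling experiment described above.
def settles_claim(models, evaluate, tolerance: float = 0.05) -> bool:
    """evaluate(model, memory_system) -> list of per-step success flags (0/1)
    over one long task sequence. Returns True only if A-MEM beats the fixed
    baseline on every model and shows no late-sequence degradation."""
    for m in models:
        amem = evaluate(m, "a-mem")
        fixed = evaluate(m, "fixed-baseline")
        if sum(amem) <= sum(fixed):
            return False                     # no improvement on this model
        # Degradation probe: do accumulated incorrect links drag down late steps?
        half = max(1, len(amem) // 2)
        early = sum(amem[:half]) / half
        late = sum(amem[half:]) / max(1, len(amem) - half)
        if late < early - tolerance:
            return False
    return True
```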

Original abstract

While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes A-MEM, an agentic memory system for LLM agents inspired by Zettelkasten principles. It dynamically creates structured notes (contextual descriptions, keywords, tags) for new memories via LLM prompting, identifies links to historical memories, and enables memory evolution by updating prior entries' representations as new information integrates. This forms an evolving interconnected knowledge network. The central claim is that this yields superior performance improvements over existing SOTA baselines across experiments on six foundation models, with source code released at two GitHub repositories.

Significance. If the empirical gains are robust and the memory network remains stable, the work could meaningfully advance memory systems for LLM agents by enabling adaptive, context-aware organization beyond fixed retrieval or static graphs. The explicit release of both evaluation and system code is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.
  2. [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

minor comments (2)
  1. [Abstract] The abstract mentions 'recent attempts to incorporate graph databases' but does not cite specific prior systems; adding 1-2 concrete references would clarify the positioning.
  2. [§3] Figure captions and algorithm pseudocode (if present in §3) could more explicitly label the LLM prompting steps versus the graph-update steps to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript describing A-MEM. The comments highlight important areas for strengthening the presentation of our empirical results and the reliability of the memory operations. We address each major comment below and have revised the manuscript accordingly to improve clarity, completeness, and rigor.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.

    Authors: We agree that the abstract is high-level and does not enumerate experimental details. The Experiments section (Section 4) does describe the six foundation models, task benchmarks (agentic QA, tool-use, and multi-step reasoning tasks), SOTA baselines (including fixed-retrieval and graph-memory systems), metrics (success rate, latency, and memory efficiency), and variance controls via repeated runs with different seeds. However, we acknowledge that statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values) and more explicit baseline implementation details were not sufficiently highlighted. In the revised manuscript we will (1) update the abstract with a concise sentence on the evaluation framework and (2) add a dedicated “Experimental Setup” subsection that includes all requested elements plus significance tests. These changes directly support the central empirical claim. revision: yes

  2. Referee: [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

    Authors: We recognize this as a substantive limitation. The original submission emphasizes end-to-end task performance and does not report direct fidelity measurements on the LLM-generated notes or links. In the revised version we will insert a new subsection (under Section 3 or 4) that presents quantitative validation: human evaluation on 200 randomly sampled memories measuring (a) accuracy of contextual descriptions, (b) relevance of keywords and tags, and (c) precision/recall of generated links. We will also report an error-propagation analysis by tracking how often an erroneous update affects downstream retrieval. These additions will allow readers to assess the robustness of the closed-loop evolution process. revision: yes
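
Both responses promise quantitative checks that the current manuscript lacks. A minimal sketch of what they might reduce to, assuming per-run paired scores and a human-labeled sample of links; the function names and the 0.05 significance level are ours, not the authors':

```python
# Editorial sketch of the checks promised in the two responses above.
from scipy.stats import wilcoxon


def significant_improvement(amem_scores, baseline_scores, alpha: float = 0.05):
    """Paired Wilcoxon signed-rank test over matched runs (response 1)."""
    _, p = wilcoxon(amem_scores, baseline_scores)
    return p < alpha, p


def link_precision_recall(generated: set, gold: set) -> tuple[float, float]:
    """Precision/recall of generated links against human labels on a sampled
    subset, e.g. the 200 memories proposed in response 2."""
    tp = len(generated & gold)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```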

Circularity Check

0 steps flagged

No circularity: empirical system proposal without derivational reductions

Full rationale

The paper presents a design for an agentic memory system that generates structured notes, identifies links, and evolves prior entries via LLM prompts, explicitly following Zettelkasten principles. No equations, fitted parameters, uniqueness theorems, or mathematical derivations appear in the abstract or description. All performance claims rest on external empirical experiments across six models against SOTA baselines rather than any internal self-definition, prediction-from-fit, or self-citation chain that reduces the central result to its own inputs by construction. The system is therefore self-contained as an engineering proposal evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that LLMs can perform reliable memory organization tasks and that the Zettelkasten-inspired structure improves agent performance; no explicit free parameters are described, and the single invented entity is a system construct rather than a physical posit.

axioms (2)
  • domain assumption: LLM agents require sophisticated memory organization beyond basic storage and retrieval to handle complex tasks effectively.
    Stated in the opening of the abstract as motivation for the work.
  • domain assumption: Dynamic indexing, linking, and evolution of memories will produce an adaptive knowledge network superior to fixed-structure systems.
    Core design principle, presented as following the Zettelkasten method.
invented entities (1)
  • Agentic memory network with evolving links (no independent evidence)
    purpose: To enable continuous refinement of historical memories through new integrations.
    Introduced as the core output of the system; no independent falsifiable evidence outside the proposed implementation is provided in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1433 out tokens · 39994 ms · 2026-05-11T00:40:57.572182+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LedgerForcing conservation_from_balance · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist.

  • Foundation.LedgerForcing add_event_balanced · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Additionally, this process enables memory evolution – as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    UNCLEAR: Pith found a possible connection between this passage and the cited Recognition theorem, but it is too broad or indirect to confirm.

    Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 conditional novelty 8.0

    GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

  2. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

    cs.AI 2026-05 unverdicted novelty 8.0

    RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

  3. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  4. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  5. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  6. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  7. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  8. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  9. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  10. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  11. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  12. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  13. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  14. MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

    cs.AI 2026-05 unverdicted novelty 7.0

    MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

  15. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  16. Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...

  17. Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

  18. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  19. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  20. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

  21. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  22. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  23. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  24. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  25. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  26. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  27. AEL: Agent Evolving Learning for Open-Ended Environments

    cs.CL 2026-04 conditional novelty 7.0

    AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...

  28. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  29. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  30. vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

    cs.IR 2026-04 conditional novelty 7.0

    vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...

  31. When to Forget: A Memory Governance Primitive

    cs.AI 2026-04 unverdicted novelty 7.0

    Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.

  32. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  33. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  34. GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing

    cs.DB 2026-03 unverdicted novelty 7.0

    GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.

  35. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  36. MIRIX: Multi-Agent Memory System for LLM-Based Agents

    cs.CL 2025-07 unverdicted novelty 7.0

    MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.

  37. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  38. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  39. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  40. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  41. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  42. PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...

  43. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  44. Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory

    cs.IR 2026-05 unverdicted novelty 6.0

    Tenure replaces similarity search with a structured belief store using scope isolation and alias-weighted BM25 retrieval, achieving 1.0 precision on 72 cases where cosine similarity scores 0.12.

  45. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  46. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  47. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge detection of sensitive spans and type-aware placeholders to enable cloud-side memory management for LLM agents without exposing private data, achieving under 1.6% utility loss.

  48. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.

  49. MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    MemPrivacy replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.

  50. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  51. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  52. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  53. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  54. MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

  55. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

    math.OC 2026-04 unverdicted novelty 6.0

    Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.

  56. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  57. Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

    cs.IR 2026-04 unverdicted novelty 6.0

    SmartVector augments embeddings with time, confidence, and relation signals plus a consolidation process, raising top-1 accuracy on versioned queries from 31% to 62% on a synthetic benchmark while cutting stale answer...

  58. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  59. To Know is to Construct: Schema-Constrained Generation for Agent Memory

    cs.CL 2026-04 unverdicted novelty 6.0

    SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms ret...

  60. HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 93 Pith papers · 7 internal anchors

  1. [1]

How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking

    Sönke Ahrens.How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Amazon, 2017. Second Edition

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. Anthropic, Mar 2024. Accessed May 2025

  3. [3]

    Claude 3.5 sonnet model card addendum

    Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2025. Accessed May 2025

  4. [4]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511, 2023

  5. [5]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  6. [6]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInternational conference on machine learning, pages 2206–2240. PMLR, 2022

  7. [7]

Mind2Web: Towards a Generalist Agent for the Web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  8. [8]

    mem0: The memory layer for ai agents

Khant Dev and Singh Taranjeet. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0, 2024

  9. [9]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

  10. [10]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

Advanced RAG Techniques: An Illustrated Overview

    I. Ilin. Advanced RAG techniques: An illustrated overview, 2023

  13. [13]

Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations

    Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations.arXiv preprint arXiv:2310.13420, 2023

  14. [14]

Active Retrieval Augmented Generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation.arXiv preprint arXiv:2305.06983, 2023

  15. [15]

Digital Zettelkasten: Principles, Methods, & Examples

    David Kadavy.Digital Zettelkasten: Principles, Methods, & Examples. Google Books, May 2021

  16. [16]

    Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents

    Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents.arXiv preprint arXiv:2406.13144, 2024

  17. [17]

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

    Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727, 2024

  18. [18]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  19. [19]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  20. [20]

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

    Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning.arXiv preprint arXiv:2310.01352, 2023

  21. [21]

AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System

    Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. Agentlite: A lightweight library for building and advancing task-oriented llm agent system.arXiv preprint arXiv:2402.15538, 2024

  22. [22]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents.arXiv preprint arXiv:2402.17753, 2024

  23. [23]

AIOS: LLM Agent Operating System

    Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Aios: Llm agent operating system.arXiv e-prints, pp. arXiv–2403, 2024

  24. [24]

    Ret-llm: Towards a general read-write memory for large language models

    Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322, 2023

  25. [25]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  26. [26]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  27. [27]

Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

  28. [28]

smolagents: a smol library to build great agentic systems

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. smolagents: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025

  29. [29]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

  30. [30]

From Commands to Prompts: LLM-Based Semantic File System for AIOS

    Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. From commands to prompts: Llm-based semantic file system for aios.arXiv preprint arXiv:2410.11843, 2024

  31. [31]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.arXiv preprint arXiv:2212.10509, 2022

  32. [32]

    Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system

    Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343, 2023

  33. [33]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

  34. [34]

Learning to Filter Context for Retrieval-Augmented Generation

    Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023

  35. [35]

LLM-Powered Autonomous Agents

    Lilian Weng. LLM-powered autonomous agents. lilianweng.github.io, Jun 2023

  36. [36]

Beyond Goldfish Memory: Long-Term Open-Domain Conversation

    J Xu. Beyond goldfish memory: Long-term open-domain conversation.arXiv preprint arXiv:2107.07567, 2021

  37. [37]

    Chain-of-note: Enhancing robustness in retrieval-augmented language models

    Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models.arXiv preprint arXiv:2311.09210, 2023

  38. [38]

Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In

    Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in.arXiv preprint arXiv:2305.17331, 2023

  39. [39]

    Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
