CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.
super hub Canonical reference
MemGPT: Towards LLMs as Operating Systems
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers i
authors
co-cited works
representative citing papers
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
SubtleMemory benchmark with 1,522 instances over 10 histories shows current memory systems are weak at fine-grained relational discrimination in long-term AI agent interactions.
eMEM is a multi-index memory architecture with tiered consolidation and ten recall tools for embodied agents, scoring 80.8 weighted mean on eMEM-Bench covering eight cognitive psychology paradigms and outperforming a flat RAG baseline on context and lure rejection tasks.
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
EvoNote lets LLM agents self-evolve by distilling prior correction feedback into reusable memory for claim analysis, evidence acquisition, and note writing, outperforming human notes on a 1.2K health post benchmark.
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.
LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
citing papers explorer
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
-
Stateful Agent Backdoor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
-
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting
Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines across models.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
-
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
-
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
-
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BEIR datasets.
-
SAGER: Self-Evolving User Policy Skills for Recommendation Agent
SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to memory accumulation.
-
IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
-
EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture
A hybrid SNN-LLM system uses learned spiking dynamics and lateral STDP propagation to trigger LLM actions without external prompts, producing the first autonomous action after 7 exchanges from a clean start.
-
When to Forget: A Memory Governance Primitive
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
-
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
-
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.
-
SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
-
MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.
-
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
-
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
-
Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion
SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet without resolution-parameter tuning.
-
E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory
E-mem uses a heterogeneous multi-agent setup for episodic context reconstruction in LLM agents, reaching over 54% F1 on LoCoMo while cutting token cost by over 70% compared to prior methods like GAM.
-
$How^{2}$: How to learn from procedural How-to questions
$How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
-
MIRIX: Multi-Agent Memory System for LLM-Based Agents
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
AutoMem: Automated Learning of Memory as a Cognitive Skill
AutoMem automates memory structure revision and proficiency training in LLMs, delivering 2x-4x performance gains on long-horizon games without altering task-action behavior.
-
ManimAgent: Self-Evolving Multimodal Agents for Visual Education
ManimAgent improves Manim animation code generation by maintaining a self-growing dual-channel episodic memory of validated successes and failures derived entirely from its own task stream.
-
MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation
MemDelta shows agent memory evaluations are confounded by LLM family and embedding model, with RAG often matching full context and self-memory underperforming basic retrieval.
-
Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering
NPM distills contrastive experiences into implicit activation steering vectors that guide LLM agent execution comparably to explicit RAG instructions, with complementary gains when combined.
-
Experience Graphs: The Data Foundation for Self-Improving Agents
Trellis treats agent experience graphs as first-class database state so that search patterns become queries, enabling crash recovery, scaling, and closed-loop training as architectural byproducts.
-
MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
MemLeak benchmark shows retained images enable 12% recovery of deleted facts in multimodal agents (reduced to 2% with content-aware deletion), with 47% of image leaks not text-recoverable.
-
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
-
Recursive Self-Evolving Agents via Held-Out Selection
RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.
-
MemRefine: LLM-Guided Compression for Long-Term Agent Memory
MemRefine uses LLM factual judgments to iteratively compress agent memory to target budgets while preserving downstream task performance better than rule-based baselines.
-
SafeGEO: Understanding Generative Engine Optimization Risks in Recommendation Agents
SafeGEO benchmark demonstrates that GEO attacks raise flawed product inclusion in recommendation sets by up to 83.2%, with partial mitigation from defensive prompting and evidence checks.
-
End-to-End Context Compression at Scale
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
-
When More Cores Hurts: The Vector Database Scaling Paradox in HPC
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
-
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Emergence World is a model-agnostic multi-agent simulation platform integrating live data, 120+ tools, persistent memory, and democratic governance, illustrated by a 15-day study showing divergent outcomes across five LLM models.
-
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
Bayesian-Agent maintains feature-conditioned categorical posteriors over skills/SOPs from verified trajectories and maps them to actions that improve benchmark scores on SOP-Bench, Lifelong AgentBench, and RealFin-Bench.
-
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows
SKILL.nb uses selective formalization and gate-conditioned execution in auditable notebooks to improve durability of agent workflows, achieving 53.7% success on WebArena-Verified with 91.7% retention across re-executions.
-
Constrained Dominant Sets for Multimodal Document Question Answering
Constrained Dominant Sets on query-augmented graphs select complementary evidence for long multimodal document QA, claiming new SOTA of 66.99 on VisDoMBench and gains of 37.1 and 4.8 points over baselines.
-
Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Engram's hybrid bi-temporal retrieval from a knowledge graph with provenance yields 83.6% accuracy on LongMemEval_S using 9.6k tokens versus 73.2% with full history.
-
TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
TokenMizer builds a knowledge graph of LLM sessions and serializes it into 78-token resume blocks that retain more task, decision, and file information than flat-text baselines at roughly half the token cost.
-
RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation
RAMPART is a registry-based memory system for LLM agents with priority-aware primitives that experimentally demonstrates position-dependent performance cliffs and benefits from block grouping and relevance gating.
-
Scaling Self-Evolving Agents via Parametric Memory
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.