CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains; it shows limited gains and finds that naive in-context learning outperforms dedicated memory systems.
MemGPT: Towards LLMs as Operating Systems
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.
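The idea of moving data between a fast and a slow memory tier can be illustrated with a toy sketch. This is not MemGPT's implementation: the class name `TieredContext`, the whitespace word count used as a stand-in for tokens, and the substring-based `recall` are all assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TieredContext:
    """Hypothetical two-tier memory: a bounded main context plus an unbounded archive."""
    budget: int                                   # max "tokens" in main context (assumed: whitespace words)
    main: deque = field(default_factory=deque)    # fast tier: what the LLM would see
    archive: list = field(default_factory=list)   # slow tier: messages evicted from main

    @staticmethod
    def cost(message: str) -> int:
        # Crude token proxy (assumption): count whitespace-separated words.
        return len(message.split())

    def used(self) -> int:
        return sum(self.cost(m) for m in self.main)

    def append(self, message: str) -> None:
        self.main.append(message)
        # Evict the oldest messages until the budget holds; always keep the newest one.
        while self.used() > self.budget and len(self.main) > 1:
            self.archive.append(self.main.popleft())

    def recall(self, query: str) -> list:
        # Case-insensitive substring search over the slow tier; does not mutate state.
        q = query.lower()
        return [m for m in self.archive if q in m.lower()]


ctx = TieredContext(budget=8)
ctx.append("my dog is named Biscuit")
ctx.append("I live in Lisbon")
ctx.append("what should I cook tonight")
print(list(ctx.main))          # only the newest message still fits the budget
print(ctx.recall("biscuit"))   # older facts remain reachable from the archive
```

A real system would replace the word count with a tokenizer and the substring match with retrieval over an external store, and would let the model itself decide when to page data in or out; the sketch only shows the eviction-and-recall shape of the design.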
representative citing papers
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.
User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.
RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.
GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.
An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.
OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
citing papers explorer
-
MAS-Lab: A Specification-Driven Validation Framework for Reliable Multi-Agent Systems
MAS-Lab proposes a specification-driven framework with Spec, MAS-OS, and Labs layers to enable intent-based validation and reliable evolution of multi-agent systems.
-
EVAF: A Test-Retest Protocol for Selective Parametric Consolidation
EVAF and test-retest protocol show selective parametric consolidation of high-valence experiences in GPT-2 and TinyLlama while preserving factual retrieval.
-
RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources
RESOURCE2SKILL converts multimodal human resources into a hierarchical Skill Wiki of executable agent skills, reporting +11.9 percentage point average gains over no-skill baselines across seven authoring domains.
-
Manufactured Confidence: How Memory Consolidation Turns Hearsay into Confident Facts
LLM memory consolidation turns casual hedged statements into confident facts that agents obey regardless of source or verification.
-
KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding
KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w
-
Selective Memory Retention for Long-Horizon LLM Agents
TraceRetain applies feature-based scoring to evict low-value entries from bounded external memory in frozen LLM agents, preserving task success under 75% synthetic distractors on ALFWorld where unbounded memory degrades.
-
Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy
OmniAct framework integrates planning, memory, and verification to enable persistent autonomy in omnimodal embodied agents, showing improved success and stable context in 40 real-world tasks.
-
When Does Overlap Help? OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents
OSU-Mem shows overlapping memory helps retrieval when evidence shares tools or entities but hurts when steps are heterogeneous, with benefits on synthetic benchmarks vanishing on mixed real ones due to query mixing.
-
OpenRath: Session-Centered Runtime State for Agent Systems
OpenRath introduces Session as a first-class, branchable runtime value that unifies fragmented state in multi-agent systems and makes fork, merge, and replay explicit operations.
-
CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents
CoreMem replaces cosine retrieval with Fisher-Rao Riemannian matching and introduces Fisher-guided discrete token distillation for syntax-aware compression, reporting +4.51 pp open-domain and +4.17 pp temporal gains on LOCOMO and LongMemEval-S while staying inside an 8 GB VRAM budget.
-
MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision
MemSlides introduces a three-part memory hierarchy (user profile, working, tool) with scoped local revision for multi-turn personalized slide generation.
-
PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents
ProjectMem implements a local event-sourced memory and judgment layer for AI coding agents that logs typed events, projects them to MCP summaries, and applies deterministic pre-action gates to avoid known failures.
-
Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory
Infini Memory proposes topic-structured documents as the core unit of LLM agent memory, with buffer staging, periodic consolidation, and iterative agentic retrieval, reaching 64.7% on MemoryAgentBench.
-
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
-
What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
Geometry-led weighting outperforms blended memory recall for spatial queries, and a DDA-based visibility predicate correctly flags occluded targets while recall remains occlusion-blind.
-
Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference
RHO is a self-supervised technique that selects challenging past tasks, re-solves them, and uses self-preference to update an agent's harness, raising SWE-Bench Pro pass rate from 59% to 78% without external labels.
-
EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents
EMBER learns to retain source-backed evidence capsules under a fixed token budget, improving F1, Retain-Recall, and Read-Recall on LongMemEval-RR over budgeted baselines.
-
Scaling Expert Feedback with Reflective Edit Propagation in Compositional Knowledge Bases
RAID is a reflective agent system that infers intent from single expert edits and propagates corrections across compositional knowledge bases through a three-step architecture.
-
From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
-
Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents
SegTreeMem organizes agent conversation history as a temporally ordered segment tree and shows improved answer quality on long-horizon benchmarks when chronological order is preserved during insertion and retrieval.
-
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
-
Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions
Simulations of 16 LLM agents in a naming game on 8 topologies show memory depth interacts with network structure to flip coordination speed and increase fragmentation in centralized networks.
-
SaliMory: Orchestrating Cognitive Memory for Conversational Agents
SALIMORY trains an LM to orchestrate cognitive memory operations via stage-wise process rewards, cutting memory failures by one-third and more than doubling good personalization rates.
-
Agent libOS: A Runtime Substrate for Capability-Controlled Self-Evolving LLM Agents
Agent libOS is a runtime substrate for capability-controlled self-evolving LLM agents that completed 27 deterministic tasks without unauthorized side effects while maintaining a 7% false-denial rate.
-
DMF: A Deterministic Memory Framework for Conversational AI Agents
DMF introduces a deterministic memory system using Survival Scores and decay laws that matches Mem0 accuracy on benchmarks while eliminating LLM token use for memory preparation.
-
On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.
-
Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents
The InKH architecture absorbs complexity into financial LLM agents, cutting latency by 83%, token cost by 82%, and stale knowledge by 97% while raising task quality by 0.108 on a 46k-episode synthetic benchmark versus baselines.
-
From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users
Introduces the personalized empathy task, the PersonaEmp dataset built from long-term interactions, and the PereGRM reward framework, which combines empathy evaluation with dynamic criteria for improved adaptation to user personas.
-
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Introduces the Parametric Memory Law, a power law for LoRA memory capacity, and MemFT, a threshold-guided optimization for better memory fidelity.
-
Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents
MMPO introduces Belief Entropy as a self-supervised signal to provide fine-grained supervision for memory policies in LLM agents, outperforming outcome-based RL on long-horizon tasks up to 1.75M tokens.
-
HTAM: Hierarchical Transition-Attended Memory for Operator Optimization
HTAM builds a Hierarchical Transition Graph to organize coarse global directions and detailed local strategies for guiding LLM-based CUDA kernel optimization, improving results on KernelBench.
-
VikingMem: A Memory Base Management System for Stateful LLM-based Applications
VikingMem implements the Memory Base paradigm via event-centric extraction and entity updates on VikingDB with temporal compression, claiming up to 30% better retrieval effectiveness on long-term memory benchmarks.
-
Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory
PPRO improves user-aware memory retrieval in conversational agents by using derived user profiles for ranking and training a query rewriter via Group Relative Policy Optimization, with reported gains on LoCoMo and LongMemEval-S benchmarks.
-
ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor
ConvMemory delivers competitive recall at far lower latency than larger rerankers for long-term conversational memory while a multi-seed ablation refutes temporal-structure exploitation as the operative mechanism.
-
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.
-
AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents
AlphaMemo equips LLM alpha-mining agents with AST-diff motif memory, residual learning, and asymmetric veto control to improve out-of-sample factor discovery on CSI 500 and S&P 500.
-
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
Proposes Governed Evolving Memory (GEM) as a state-trajectory workload for long-term AI agent memory using four operators and six correctness conditions that record-level systems cannot satisfy.
-
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
-
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.
-
CALMem: Application-Layer Dual Memory for Conversational AI
CALMem delivers virtually unbounded effective context for LLM conversations via an application-layer dual memory architecture with intra-session retrieval and token-adaptive injection.
-
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
-
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management
Existing proofs of autoregressive Transformer Turing-completeness apply to scaling families of models rather than fixed systems with context management, so they do not establish Turing-completeness for real-world LLMs.
-
Code as Agent Harness
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
-
Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
Causal Memory Intervention selects memories based on estimated causal impact on LLM answers rather than semantic similarity, with a new benchmark showing improved robustness to irrelevant or harmful memories.
-
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.
-
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
-
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
-
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
DimMem introduces typed dimensional memory units that improve accuracy to 81.43% and 78.20% on two long-term agent benchmarks while cutting token cost by 24% and enabling small models to match larger extractors.
-
TopoClaw: A Human-Centric and Topology-Aware Agent Operating System
TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware governance.
-
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.