CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing that gains are limited and that naive in-context learning outperforms dedicated memory systems.
MemGPT: Towards LLMs as Operating Systems
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.
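The idea of moving data between a fast and a slow memory tier can be illustrated with a toy sketch. The code below is not MemGPT's implementation: the class `VirtualContext`, its methods, the whitespace-based token count, and the substring-match recall are all assumptions made for illustration. It only shows the general pattern of evicting the oldest messages from a bounded in-context buffer to an external archive and paging them back in on request.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualContext:
    """Hypothetical two-tier memory: a bounded in-context queue plus an archive."""

    max_tokens: int
    main: deque = field(default_factory=deque)   # "fast" tier: what the LLM would see
    archive: list = field(default_factory=list)  # "slow" tier: external storage

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: a whitespace word count stands in for a real tokenizer.
        return len(text.split())

    def used(self) -> int:
        """Total token count currently held in the in-context tier."""
        return sum(self.count_tokens(m) for m in self.main)

    def append(self, message: str) -> None:
        """Add a message; on overflow, evict the oldest messages to the archive."""
        self.main.append(message)
        while self.used() > self.max_tokens and len(self.main) > 1:
            self.archive.append(self.main.popleft())

    def recall(self, query: str, k: int = 1) -> list:
        """Page up to k archived messages containing `query` back into context."""
        hits = [m for m in self.archive if query.lower() in m.lower()][:k]
        for m in hits:
            self.archive.remove(m)
            self.append(m)  # may evict older in-context messages in turn
        return hits


if __name__ == "__main__":
    ctx = VirtualContext(max_tokens=6)
    for msg in ["my name is Ada", "I like tea", "what is new"]:
        ctx.append(msg)
    print("in context:", list(ctx.main))
    print("archived:  ", ctx.archive)
    print("recalled:  ", ctx.recall("name"))
    print("in context:", list(ctx.main))
```

In this sketch the eviction policy is plain first-in, first-out and recall is a substring search; a real system would likely use a proper tokenizer and a retrieval index, and would let the model itself decide when to page data in or out.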
representative citing papers
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
SOLAR is a learning-augmented policy for semantic cache replacement that achieves a constant competitive ratio of 3 and 5-75% gains over FIFO on retrieval workloads.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
SubtleMemory benchmark with 1,522 instances over 10 histories shows current memory systems are weak at fine-grained relational discrimination in long-term AI agent interactions.
eMEM is a multi-index memory architecture with tiered consolidation and ten recall tools for embodied agents, scoring a weighted mean of 80.8 on eMEM-Bench, which covers eight cognitive psychology paradigms, and outperforming a flat RAG baseline on context and lure rejection tasks.
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
EvoNote lets LLM agents self-evolve by distilling prior correction feedback into reusable memory for claim analysis, evidence acquisition, and note writing, outperforming human notes on a 1.2K health post benchmark.
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.
LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
citing papers explorer
- LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
  LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.
- Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
  Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
- GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
  GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
- S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
  S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.
- EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
  EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-benchmark transfer.
- MEME: Multi-entity & Evolving Memory Evaluation
  All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
- Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion
  SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet without resolution-parameter tuning.
- MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
  MemLeak benchmark shows retained images enable 12% recovery of deleted facts in multimodal agents (reduced to 2% with content-aware deletion), with 47% of image leaks not text-recoverable.
- Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
  Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.
- SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
  SemanticZip is a pilot framework introducing LLM-mediated lossy text compression with an experimental interface evaluating six representation regimes on five diagnostic cases for semantic atom recovery and token efficiency.
- Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression
  Context Codec is a commitment-level framework for verifiable LLM context compression using semantic atoms, defined metrics, and a compact rendering language.
- Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
  Empirical evaluation of eight memory condensation strategies on 480 DiscoveryBench tasks finds no significant impact on hypothesis quality but domain-dependent differences in token efficiency.
- Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
  Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.
- EVAF: A Test-Retest Protocol for Selective Parametric Consolidation
  EVAF and test-retest protocol show selective parametric consolidation of high-valence experiences in GPT-2 and TinyLlama while preserving factual retrieval.
- On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
  PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.
- Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
  Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.
- Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
  Memini is introduced as a graph-based external memory using multi-timescale edge dynamics to enable emergent episodic sensitivity, consolidation, and selective forgetting in LLM systems.
- SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
  SCM enables LLMs to achieve perfect recall in ten-turn conversations by using sleep-like consolidation and adaptive forgetting to reduce memory noise by over 90%.
- ISM: Self-Improving Strategy Memory for Continual Mathematical Reasoning