CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.
super hub Canonical reference
MemGPT: Towards LLMs as Operating Systems
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers i
authors
co-cited works
representative citing papers
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
KBSpec maintains an evolving knowledge base combining external docs and internal verifier feedback to improve LLM generation of verifiable JML specifications, achieving 10-25% higher verification pass rates.
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.
User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.
RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.
GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.
An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.
OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
citing papers explorer
-
Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
Causal Memory Intervention selects memories based on estimated causal impact on LLM answers rather than semantic similarity, with a new benchmark showing improved robustness to irrelevant or harmful memories.
-
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.
-
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
-
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
-
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
DimMem introduces typed dimensional memory units that improve accuracy to 81.43% and 78.20% on two long-term agent benchmarks while cutting token cost by 24% and enabling small models to match larger extractors.
-
TopoClaw: A Human-Centric and Topology-Aware Agent Operating System
TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware governance.
-
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.
-
A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency
ARPM is a heterogeneous temporal memory governance framework using vector retrieval, BM25, RRF fusion, dual-temporal reranking, and evidence verification to maintain LLM persona consistency under noise, context clearing, and model handoff.
-
Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay
The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.
-
CogniFold: Always-On Proactive Memory via Cognitive Folding
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.
-
Positive Alignment: Artificial Intelligence for Human Flourishing
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
-
Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs
Grokers architecture performs bottom-up inductive comprehension over typed KGs at write time via LM agents, with three claimed formal theorems on byte-identity, accumulation monotonicity, and dual-traversal ordering, plus a deterministic synonym-caching search alternative.
-
Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
Memini is introduced as a graph-based external memory using multi-timescale edge dynamics to enable emergent episodic sensitivity, consolidation, and selective forgetting in LLM systems.
-
GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
GRAVITY adds structured relational, temporal, and thematic memory anchors to conversational LLMs at generation time, delivering 7.5-10.1% average gains in LLM-judge accuracy across five host systems on LongMemEval and LoCoMo.
-
Ghost in the Context: Policy-Carriage Integrity in LLM Agents
Protected policy placements in LLM agents maintain integrity under replay pressure on AutoGen and OpenHands traces, unlike task-local placements which show eviction or weakening.
-
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
-
A Control Architecture for Training-Free Memory Use
A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
-
Towards Self-Improving Error Diagnosis in Multi-Agent Systems
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
-
The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward
AI intelligence is limited by the lack of an architecture that carries forward understanding across sessions, and the proposed continuity layer with Decomposed Trace Convergence Memory addresses this by enabling persistent state as a system property.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
-
Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
Persistent self-modifying AI agents exhibit compositional drift from mismatches across five mutability layers, with governance difficulty rising under rapid mutation, strong coupling, weak reversibility, and low observability, as indicated by a 0.68 identity hysteresis ratio in a preliminary ratchet
-
Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning
A case-based learning framework extracts reusable knowledge from past tasks to improve LLM agents' structured performance on complex real-world tasks, outperforming standard prompting baselines especially as task complexity grows.
-
Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
OIDA is a proposed framework that represents organizational knowledge as epistemic Knowledge Objects with class-specific importance decay and signed contradictions, plus a QUESTION mechanism that surfaces modeled ignorance via inverse decay.
-
SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context
SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.
-
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
-
Knowledge Compounding: An Empirical Economic Analysis of Self-Evolving Knowledge Wikis under the Agentic ROI Framework
A four-query experiment demonstrates 84.6% token savings through knowledge compounding in self-evolving wikis compared to standard RAG, by amortizing ingestion costs and reusing synthesized knowledge over time.
-
Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
-
Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm
Contract-Coding projects ambiguous intents into formal Language Contracts as a single source of truth to enable more reliable repo-level code generation, reporting 47% functional success on the Greenfield-5 benchmark.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.
-
Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery
Prism unifies file, vector, graph, and evolutionary memory under a decision-theoretic framework with entropy-gated stratification, causal graphs, value-of-information retrieval, heartbeat consolidation, and replicator-decay dynamics, reporting 88.1 on LOCOMO and 2.8x gains on CORAL tasks.
-
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens than Mem0.
-
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
-
AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment
AI4S-SDS uses sparse MCTS and differentiable physics alignment to generate valid solvent mixtures and identifies a competitive photoresist developer formulation.
-
TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents
TiMem introduces a Temporal Memory Tree that consolidates conversational history into hierarchical persona representations, reaching 75.30% on LoCoMo and 76.88% on LongMemEval-S while cutting recalled length by 52%.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
MemOS: A Memory OS for AI System
MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
-
MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning
MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.
-
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
-
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts
AtomMem introduces atomic-fact extraction, hierarchical event structures, and an associative memory graph to build stable long-term memory for LLM agents, claiming SOTA results on the LoCoMo benchmark.
-
Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction
Proposes HACD-H framework integrating emotional adaptation, relational organization, memory and personality into a dynamical system and reports empirical patterns from a 14,700-turn dataset linking social intelligence to reduced social cognitive energy.
-
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
TAHOE builds a Hint Bank from error traces to raise Text-to-SQL pass rates on Spider 2.0-Snow from 61.95% to 79.42% for GPT-5.5 without parameter updates.
-
Making Software Meaningful
Committing to explicit meaning via a domain-grounded vocabulary of individuals, actions, facts, and concepts improves software usability, enables modular LLM code generation, and supports accountable agent behavior.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents
RIZZ is a continual adaptation framework for black-box LLM agents that uses dynamically spawned memory branches, context-aware routing, verifier-gated updates, and prompt compilation to control interference across nonstationary inputs.
-
SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow
SURGENT is a multi-agent surgical assistance system with novel memory management that outperforms baseline LLMs on case analysis, plan simulation, safety monitoring, risk assessment, and rehabilitation guidance.
-
CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories
CommitDistill is a deterministic, local-only prototype that extracts typed knowledge from git commits and evaluates retrieval performance against baselines on public repositories.
-
Multi-Paradigm Agent Interaction in Practice:A Systematic Analysis of Generator-Evaluator, ReAct Loop,and Adversarial Evaluation in the buddyMe Framework
Empirical analysis of multi-paradigm agent interactions in buddyMe framework reports that Generator-Evaluator detects omissions in 20% of complex tasks, ReAct causes 30% redundant tool calls, and adversarial discussions reach consensus in 2-3 rounds for 70% of cases.
-
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using established economic theories.
-
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
MEMTIER reports 0.382 accuracy and 0.412 F1 on the 500-question LongMemEval-S benchmark, a 33pp gain over full-context baseline using tiered memory and retrieval components on 6GB GPU hardware.