MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang · 2023 · cs.AI · arXiv 2310.08560

Canonical reference. 77% of citing Pith papers cite this work as background.

299 Pith papers citing it

Background 77% of classified citations

abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

citation-role summary

background 36 · baseline 3 · dataset 3 · method 1 · other 1

citation-polarity summary

background 34 · baseline 3 · use dataset 3 · support 2 · unclear 1 · use method 1

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

cs.CL · 2026-04-17 · unverdicted · novelty 8.0 · 2 refs

MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

cs.DB · 2026-07-01 · unverdicted · novelty 7.0

SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

HyphaeDB: A Living Knowledge Topology for Agent-First Memory

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.

LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

cs.DB · 2026-06-04 · unverdicted · novelty 7.0

TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

SubtleMemory benchmark with 1,522 instances over 10 histories shows current memory systems are weak at fine-grained relational discrimination in long-term AI agent interactions.

eMEM: A Hybrid Spatio-Temporal Memory System For Embodied Agents

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

eMEM is a multi-index memory architecture with tiered consolidation and ten recall tools for embodied agents, scoring 80.8 weighted mean on eMEM-Bench covering eight cognitive psychology paradigms and outperforming a flat RAG baseline on context and lure rejection tasks.

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.

Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

EvoNote lets LLM agents self-evolve by distilling prior correction feedback into reusable memory for claim analysis, evidence acquisition, and note writing, outperforming human notes on a 1.2K health post benchmark.

Leyline: KV Cache Directives for Agentic Inference

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

citing papers explorer

Showing 19 of 19 citing papers after filters.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
cs.LG · 2026-05-28 · unverdicted · none · ref 6 · internal anchor
LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
cs.LG · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
cs.LG · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
cs.LG · 2026-05-16 · unverdicted · partial · ref 14 · 2 links · internal anchor
S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-benchmark transfer.

MEME: Multi-entity & Evolving Memory Evaluation
cs.LG · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.

Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion
cs.LG · 2026-03-09 · unverdicted · none · ref 2 · internal anchor
SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet without resolution-parameter tuning.

MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
cs.LG · 2026-06-29 · unverdicted · none · ref 56 · internal anchor
MemLeak benchmark shows retained images enable 12% recovery of deleted facts in multimodal agents (reduced to 2% with content-aware deletion), with 47% of image leaks not text-recoverable.

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
cs.LG · 2026-06-02 · unverdicted · none · ref 61 · internal anchor
Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
cs.LG · 2026-05-23 · unverdicted · none · ref 8 · internal anchor
SemanticZip is a pilot framework introducing LLM-mediated lossy text compression with an experimental interface evaluating six representation regimes on five diagnostic cases for semantic atom recovery and token efficiency.

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression
cs.LG · 2026-05-17 · unverdicted · none · ref 14 · internal anchor
Context Codec is a commitment-level framework for verifiable LLM context compression using semantic atoms, defined metrics, and a compact rendering language.

Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
cs.LG · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
Empirical evaluation of eight memory condensation strategies on 480 DiscoveryBench tasks finds no significant impact on hypothesis quality but domain-dependent differences in token efficiency.

Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
cs.LG · 2025-11-01 · unverdicted · none · ref 10 · internal anchor
Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.

EVAF: A Test-Retest Protocol for Selective Parametric Consolidation
cs.LG · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
EVAF and test-retest protocol show selective parametric consolidation of high-valence experiences in GPT-2 and TinyLlama while preserving factual retrieval.

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
cs.LG · 2026-06-01 · unverdicted · none · ref 24 · internal anchor
PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
cs.LG · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.

Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
cs.LG · 2026-05-06 · unverdicted · none · ref 2 · internal anchor
Memini is introduced as a graph-based external memory using multi-timescale edge dynamics to enable emergent episodic sensitivity, consolidation, and selective forgetting in LLM systems.

SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
cs.LG · 2026-04-22 · unverdicted · none · ref 7 · internal anchor
SCM enables LLMs to achieve perfect recall in ten-turn conversations by using sleep-like consolidation and adaptive forgetting to reduce memory noise by over 90%.

ISM: Self-Improving Strategy Memory for Continual Mathematical Reasoning
cs.LG · 2026-06-30 · unreviewed · ref 23 · 2 links · internal anchor
