MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang · 2023 · cs.AI · arXiv 2310.08560

Canonical reference. 77% of citing Pith papers cite this work as background.

261 Pith papers citing it

abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

citation-role summary

background 36 · baseline 3 · dataset 3 · method 1 · other 1

citation-polarity summary

background 34 · baseline 3 · use dataset 3 · support 2 · unclear 1 · use method 1

representative citing papers

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

cs.CL · 2026-04-17 · unverdicted · novelty 8.0 · 2 refs

MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

HyphaeDB: A Living Knowledge Topology for Agent-First Memory

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.

LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

Leyline: KV Cache Directives for Agentic Inference

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

AuthTrace is a diagnostic benchmark that annotates fan-in gradients in single-author corpora to measure evidence recall, precision, and answer correctness across eight systems in retrieval, memory, graph, and structured-evidence paradigms.

Memory-Induced Tool-Drift in LLM Agents

cs.CR · 2026-05-24 · unverdicted · novelty 7.0

Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks

cs.IR · 2026-05-22 · unverdicted · novelty 7.0

Switching the credited target among Raw, Source, and Canonical changes nDCG on 83.4-94.0% of queries, flips system orderings, and reverses parser-density recommendations on LoCoMo and LongMemEval-S.

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts

cs.IR · 2026-05-20 · unverdicted · novelty 7.0

MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.

citing papers explorer

Showing 50 of 261 citing papers.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts cs.CR · 2026-05-09 · unverdicted · none · ref 6 · 3 links · internal anchor
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents cs.CL · 2026-04-17 · unverdicted · none · ref 1 · 2 links · internal anchor
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator physics.acc-ph · 2025-09-21 · unverdicted · none · ref 17 · internal anchor
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 3 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 18 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB: A Living Knowledge Topology for Agent-First Memory cs.AI · 2026-06-27 · unverdicted · none · ref 60 · internal anchor
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity cs.CR · 2026-06-26 · unverdicted · none · ref 8 · internal anchor
A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.
Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One cs.CL · 2026-06-24 · unverdicted · none · ref 8 · internal anchor
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
Leyline: KV Cache Directives for Agentic Inference cs.DC · 2026-05-31 · unverdicted · none · ref 38 · internal anchor
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations cs.CL · 2026-05-30 · unverdicted · none · ref 5 · internal anchor
Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis cs.LG · 2026-05-28 · unverdicted · none · ref 6 · internal anchor
LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction cs.CR · 2026-05-28 · unverdicted · none · ref 31 · internal anchor
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization cs.CL · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions cs.AI · 2026-05-26 · unverdicted · none · ref 34 · internal anchor
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
MemFail: Stress-Testing Failure Modes of LLM Memory Systems cs.AI · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents cs.AI · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.
Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study cs.CL · 2026-05-25 · unverdicted · none · ref 6 · internal anchor
EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.
AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora cs.CL · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
AuthTrace is a diagnostic benchmark that annotates fan-in gradients in single-author corpora to measure evidence recall, precision, and answer correctness across eight systems in retrieval, memory, graph, and structured-evidence paradigms.
Memory-Induced Tool-Drift in LLM Agents cs.CR · 2026-05-24 · unverdicted · none · ref 23 · internal anchor
Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 61 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents cs.LG · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks cs.IR · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
Switching the credited target among Raw, Source, and Canonical changes nDCG on 83.4-94.0% of queries, flips system orderings, and reverses parser-density recommendations on LoCoMo and LongMemEval-S.
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving cs.LG · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts cs.IR · 2026-05-20 · unverdicted · none · ref 20 · internal anchor
MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.
SocialMemBench: Are AI Memory Systems Ready for Social Group Settings? cs.CL · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
SocialMemBench provides 1,031 QA pairs from 43 synthetic social networks to show that existing AI memory frameworks perform poorly in multi-party group settings compared to full-context baselines.
EXG: Self-Evolving Agents with Experience Graphs cs.AI · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination cs.LG · 2026-05-16 · unverdicted · partial · ref 14 · 2 links · internal anchor
S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.
RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents cs.CL · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
RecMem reduces memory construction token costs by up to 87% in long-running LLM agents by consolidating memory only upon sustained recurrence of semantically similar interactions, while exceeding the accuracy of three prior systems.
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory cs.CL · 2026-05-15 · unverdicted · none · ref 26 · internal anchor
SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models cs.CV · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation cs.SE · 2026-05-14 · unverdicted · none · ref 36 · internal anchor
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-benchmark transfer.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 22 · 3 links · internal anchor
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 88 · internal anchor
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
MEME: Multi-entity & Evolving Memory Evaluation cs.LG · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments cs.RO · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory cs.AI · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning cs.AI · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents cs.CR · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory cs.AI · 2026-05-08 · unverdicted · none · ref 40 · internal anchor
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory cs.AI · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
Stateful Agent Backdoor cs.CR · 2026-05-07 · unverdicted · none · ref 26 · internal anchor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting cs.CL · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines across models.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents cs.MA · 2026-05-05 · unverdicted · none · ref 30 · internal anchor
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates cs.AI · 2026-05-04 · unverdicted · none · ref 13 · internal anchor
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing cs.AI · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking cs.CR · 2026-05-01 · unverdicted · none · ref 30 · internal anchor
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory cs.CL · 2026-05-01 · unverdicted · none · ref 142 · internal anchor
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 13 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
