super hub Canonical reference

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang · 2023 · cs.AI · arXiv 2310.08560

Canonical reference. 77% of citing Pith papers cite this work as background.

251 Pith papers citing it

Background 77% of classified citations

open full Pith review browse 251 citing papers more from Charles Packer arXiv PDF

abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 baseline 3 dataset 3 method 1 other 1

citation-polarity summary

background 34 baseline 3 use dataset 3 support 2 unclear 1 use method 1

claims ledger

abstract Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers i

authors

Charles Packer Ion Stoica Kevin Lin Sarah Wooders Shishir G. Patil Vivian Fang

co-cited works

representative citing papers

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

cs.CL · 2026-04-17 · unverdicted · novelty 8.0 · 2 refs

MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

HyphaeDB: A Living Knowledge Topology for Agent-First Memory

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.

LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

AuthTrace is a diagnostic benchmark that annotates fan-in gradients in single-author corpora to measure evidence recall, precision, and answer correctness across eight systems in retrieval, memory, graph, and structured-evidence paradigms.

Memory-Induced Tool-Drift in LLM Agents

cs.CR · 2026-05-24 · unverdicted · novelty 7.0

Biased long-term memories in LLM agents cause measurable deviations in tool parameters across 105 scenarios, seven models, and 608 real tools, persisting under standard memory architectures.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks

cs.IR · 2026-05-22 · unverdicted · novelty 7.0

Switching the credited target among Raw, Source, and Canonical changes nDCG on 83.4-94.0% of queries, flips system orderings, and reverses parser-density recommendations on LoCoMo and LongMemEval-S.

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts

cs.IR · 2026-05-20 · unverdicted · novelty 7.0

MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

SocialMemBench provides 1,031 QA pairs from 43 synthetic social networks to show that existing AI memory frameworks perform poorly in multi-party group settings compared to full-context baselines.

citing papers explorer

Showing 50 of 74 citing papers after filters.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents cs.CL · 2026-04-17 · unverdicted · none · ref 1 · 2 links · internal anchor
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One cs.CL · 2026-06-24 · unverdicted · none · ref 8 · internal anchor
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations cs.CL · 2026-05-30 · unverdicted · none · ref 5 · internal anchor
Momento benchmark reveals current agents fail at multi-session tasks mainly by misestimating user state and treating old session history as current context instead of stale data needing re-validation.
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization cs.CL · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study cs.CL · 2026-05-25 · unverdicted · none · ref 6 · internal anchor
EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.
AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora cs.CL · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
AuthTrace is a diagnostic benchmark that annotates fan-in gradients in single-author corpora to measure evidence recall, precision, and answer correctness across eight systems in retrieval, memory, graph, and structured-evidence paradigms.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 61 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
SocialMemBench: Are AI Memory Systems Ready for Social Group Settings? cs.CL · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
SocialMemBench provides 1,031 QA pairs from 43 synthetic social networks to show that existing AI memory frameworks perform poorly in multi-party group settings compared to full-context baselines.
RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents cs.CL · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
RecMem reduces memory construction token costs by up to 87% in long-running LLM agents by consolidating memory only upon sustained recurrence of semantically similar interactions, while exceeding the accuracy of three prior systems.
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory cs.CL · 2026-05-15 · unverdicted · none · ref 26 · internal anchor
SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 88 · internal anchor
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting cs.CL · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines across models.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory cs.CL · 2026-05-01 · unverdicted · none · ref 142 · internal anchor
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 13 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 29 · internal anchor
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
IE as Cache: Information Extraction Enhanced Agentic Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 16 · internal anchor
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents cs.CL · 2026-03-16 · unverdicted · none · ref 3 · internal anchor
CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.
MIRIX: Multi-Agent Memory System for LLM-Based Agents cs.CL · 2025-07-10 · unverdicted · none · ref 23 · internal anchor
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation cs.CL · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
MemDelta shows agent memory evaluations are confounded by LLM family and embedding model, with RAG often matching full context and self-memory underperforming basic retrieval.
Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering cs.CL · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
NPM distills contrastive experiences into implicit activation steering vectors that guide LLM agent execution comparably to explicit RAG instructions, with complementary gains when combined.
MemRefine: LLM-Guided Compression for Long-Term Agent Memory cs.CL · 2026-06-11 · unverdicted · none · ref 15 · internal anchor
MemRefine uses LLM factual judgments to iteratively compress agent memory to target budgets while preserving downstream task performance better than rule-based baselines.
MemPro: Agentic Memory Systems as Evolvable Programs cs.CL · 2026-05-30 · unverdicted · none · ref 33 · internal anchor
MemPro evolves the entire MCR pipeline as runnable programs via failure-guided refinement on a version tree and outperforms static baselines on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA.
Eywa: Provenance-Grounded Long-Term Memory for AI Agents cs.CL · 2026-05-29 · unverdicted · none · ref 18 · internal anchor
Eywa introduces a provenance-grounded memory system for persistent AI agents featuring evidence-first storage, typed validation, and deterministic multi-route retrieval, reporting 90.19% accuracy on LoCoMo and 88.2% on LongMemEval-S.
Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Entity-collision protocol stratifies agent-memory retrieval tests by tag and pins BM25 floor via shared entity tokens to attribute lift specifically to embedders.
PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
PatchBoard introduces schema-grounded JSON Patch state mutations with an Architect agent and validation kernel, reporting 84.6% success and lower token use on 630 ALFWorld episodes versus LangGraph and Flock baselines.
Rethinking Memory as Continuously Evolving Connectivity cs.CL · 2026-05-27 · unverdicted · none · ref 36 · internal anchor
FluxMem evolves memory as a heterogeneous graph via three refinement stages and reports consistent state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents cs.CL · 2026-05-25 · unverdicted · none · ref 29 · internal anchor
ProAct uses idle compute to anticipate user needs via dialogue history and memory, achieving 14.8% fewer turns, 11.7% less user effort, and 28.1% fewer hallucinations than reactive baselines on the new ProActEval benchmark.
EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation cs.CL · 2026-05-25 · unverdicted · none · ref 13 · internal anchor
EfficientGraph-RAG structures retrieval state with TAM, MARS and SMP, ranking first on averaged LongBench answer-quality metrics while cutting token use 3.51x on HotpotQA.
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA cs.CL · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA cs.CL · 2026-05-21 · unverdicted · none · ref 27 · internal anchor
DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate cs.CL · 2026-05-20 · unverdicted · none · ref 30 · internal anchor
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents cs.CL · 2026-05-20 · unverdicted · none · ref 22 · internal anchor
Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure cs.CL · 2026-05-15 · unverdicted · none · ref 27 · internal anchor
H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
Agentic Recommender System with Hierarchical Belief-State Memory cs.CL · 2026-05-14 · unverdicted · none · ref 18 · 2 links · internal anchor
MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents cs.CL · 2026-05-12 · unverdicted · none · ref 16 · 2 links · internal anchor
PRISM is a new inference-time retrieval system that achieves higher accuracy than baselines on long-horizon agent tasks while using an order of magnitude less context by combining hierarchical graph search, intent-based costing, compression, and adaptive routing over structured memory.
An Annotation Scheme and Classifier for Personal Facts in Dialogue cs.CL · 2026-05-11 · accept · none · ref 26 · internal anchor
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context cs.CL · 2026-05-08 · unverdicted · none · ref 9 · 2 links · internal anchor
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall cs.CL · 2026-05-06 · conditional · none · ref 10 · internal anchor
True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS and Hindsight on other long-context benchmarks.
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents cs.CL · 2026-05-01 · unverdicted · none · ref 11 · internal anchor
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
StructMem: Structured Memory for Long-Horizon Behavior in LLMs cs.CL · 2026-04-23 · unverdicted · none · ref 2 · internal anchor
StructMem is a structure-enriched hierarchical memory system that improves temporal reasoning and multi-hop QA on LoCoMo while cutting token usage, API calls, and runtime versus prior flat or graph-based memories.
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents cs.CL · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoMo10 benchmark.
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models cs.CL · 2026-04-19 · unverdicted · none · ref 4 · internal anchor
AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0) cs.CL · 2026-04-18 · unverdicted · none · ref 4 · internal anchor
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI cs.CL · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
APEX-MEM uses property graphs with temporal events, append-only storage, and an agentic retrieval system to reach 88.88% accuracy on LOCOMO QA and 86.2% on LongMemEval, outperforming prior session-aware methods.
AgentSPEX: An Agent SPecification and EXecution Language cs.CL · 2026-04-14 · unverdicted · none · ref 9 · internal anchor
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory cs.CL · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
RoMem uses a Semantic Speed Gate to assign volatility to relations and continuous phase rotation to shadow obsolete facts in complex space, delivering SOTA temporal KG completion and 2-3x gains on agentic memory benchmarks.
HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues cs.CL · 2026-04-08 · unverdicted · none · ref 36 · internal anchor
HingeMem segments dialogue memory via boundary-triggered hyperedges over four elements and applies query-adaptive retrieval, yielding ~20% relative gains and 68% lower QA token cost versus baselines on LOCOMO.
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework cs.CL · 2026-04-02 · unverdicted · none · ref 73 · internal anchor
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation cs.CL · 2026-03-31 · unverdicted · none · ref 3 · internal anchor
Oblivion is a decay-driven memory framework that decouples read and write paths in LLM agents to enable adaptive forgetting and reinforcement for better long-horizon reasoning.
PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 31 · internal anchor
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.

MemGPT: Towards LLMs as Operating Systems

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer