MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang · 2023 · cs.AI · arXiv 2310.08560

Canonical reference. 78% of citing Pith papers cite this work as background.

329 Pith papers citing it

abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.
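
The two-tier idea in the abstract (a small fast memory backed by a larger slow one, with data moved between them) can be illustrated with a toy sketch. This is not MemGPT's implementation: the class `VirtualContext`, its methods, the whitespace-based token count, and the substring search are all hypothetical stand-ins chosen for illustration, and the sketch leaves out interrupts and any LLM-driven decisions about what to move.

```python
from collections import deque


class VirtualContext:
    """Toy two-tier memory: a bounded main context plus unbounded external storage."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens  # budget of the "fast" tier (the context window)
        self.main = deque()           # messages currently inside the context window
        self.external = []            # evicted messages (the "slow" tier)

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: whitespace-separated words stand in for real tokenizer counts.
        return len(text.split())

    def used(self) -> int:
        return sum(self.count_tokens(m) for m in self.main)

    def append(self, message: str) -> None:
        """Add a message, evicting the oldest ones to external storage on overflow."""
        self.main.append(message)
        while self.used() > self.max_tokens and len(self.main) > 1:
            self.external.append(self.main.popleft())

    def page_in(self, query: str) -> list[str]:
        """Move external messages containing `query` back into the main context."""
        hits = [m for m in self.external if query.lower() in m.lower()]
        self.external = [m for m in self.external if m not in hits]
        for m in hits:
            self.append(m)  # may evict other messages to stay within budget
        return hits


ctx = VirtualContext(max_tokens=8)
ctx.append("my dog is named Biscuit")
ctx.append("the weather is nice today")
ctx.append("what should we cook tonight")
recalled = ctx.page_in("biscuit")
print(recalled, list(ctx.main), ctx.external)
```

In this sketch, eviction is first-in-first-out and recall is a plain substring match; a real system could substitute summarization, embedding search, or model-issued function calls at either step.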

citation-role summary

background 37 · baseline 3 · dataset 3 · method 1 · other 1

citation-polarity summary

background 35 · baseline 3 · use dataset 3 · support 2 · unclear 1 · use method 1

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

cs.CL · 2026-04-17 · unverdicted · novelty 8.0 · 2 refs

MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

cs.DB · 2026-07-01 · unverdicted · novelty 7.0

SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

HyphaeDB: A Living Knowledge Topology for Agent-First Memory

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.

LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

KBSpec: LLM-driven Formal Specification Generation with Evolving Domain Knowledge Base

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

KBSpec maintains an evolving knowledge base combining external docs and internal verifier feedback to improve LLM generation of verifiable JML specifications, achieving 10-25% higher verification pass rates.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

cs.LG · 2026-06-15 · accept · novelty 7.0

Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

cs.CL · 2026-06-14 · unverdicted · novelty 7.0

An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.

citing papers explorer

Showing 50 of 329 citing papers.

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents cs.AI · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
Causal Memory Intervention selects memories based on estimated causal impact on LLM answers rather than semantic similarity, with a new benchmark showing improved robustness to irrelevant or harmful memories.
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents cs.AI · 2026-05-17 · unverdicted · none · ref 22 · internal anchor
A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents cs.AI · 2026-05-17 · unverdicted · none · ref 37 · 2 links · internal anchor
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast cs.AI · 2026-05-15 · unverdicted · none · ref 12 · internal anchor
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory cs.CL · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
DimMem introduces typed dimensional memory units that improve accuracy to 81.43% and 78.20% on two long-term agent benchmarks while cutting token cost by 24% and enabling small models to match larger extractors.
TopoClaw: A Human-Centric and Topology-Aware Agent Operating System cs.HC · 2026-05-15 · unverdicted · none · ref 20 · internal anchor
TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware governance.
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search cs.CL · 2026-05-14 · unverdicted · none · ref 17 · internal anchor
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.
A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
ARPM is a heterogeneous temporal memory governance framework using vector retrieval, BM25, RRF fusion, dual-temporal reranking, and evidence verification to maintain LLM persona consistency under noise, context clearing, and model handoff.
Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay cs.AI · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.
CogniFold: Always-On Proactive Memory via Cognitive Folding cs.AI · 2026-05-13 · unverdicted · none · ref 43 · 2 links · internal anchor
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.
Positive Alignment: Artificial Intelligence for Human Flourishing cs.AI · 2026-05-11 · unverdicted · none · ref 147 · internal anchor
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs cs.AI · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
Grokers architecture performs bottom-up inductive comprehension over typed KGs at write time via LM agents, with three claimed formal theorems on byte-identity, accumulation monotonicity, and dual-traversal ordering, plus a deterministic synonym-caching search alternative.
Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics cs.LG · 2026-05-06 · unverdicted · none · ref 2 · internal anchor
Memini is introduced as a graph-based external memory using multi-timescale edge dynamics to enable emergent episodic sensitivity, consolidation, and selective forgetting in LLM systems.
GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory cs.CL · 2026-05-03 · unverdicted · none · ref 12 · internal anchor
GRAVITY adds structured relational, temporal, and thematic memory anchors to conversational LLMs at generation time, delivering 7.5-10.1% average gains in LLM-judge accuracy across five host systems on LongMemEval and LoCoMo.
Ghost in the Context: Policy-Carriage Integrity in LLM Agents cs.CR · 2026-05-02 · unverdicted · none · ref 12 · 3 links · internal anchor
Protected policy placements in LLM agents maintain integrity under replay pressure on AutoGen and OpenHands traces, unlike task-local placements which show eviction or weakening.
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval cs.CL · 2026-04-23 · unverdicted · none · ref 17 · internal anchor
EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
A Control Architecture for Training-Free Memory Use cs.AI · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems cs.MA · 2026-04-19 · unverdicted · none · ref 68 · internal anchor
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward cs.AI · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
AI intelligence is limited by the lack of an architecture that carries forward understanding across sessions, and the proposed continuity layer with Decomposed Trace Convergence Memory addresses this by enabling persistent state as a system property.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 12 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents cs.AI · 2026-04-16 · unverdicted · none · ref 19 · 2 links · internal anchor
Persistent self-modifying AI agents exhibit compositional drift from mismatches across five mutability layers, with governance difficulty rising under rapid mutation, strong coupling, weak reversibility, and low observability, as indicated by a 0.68 identity hysteresis ratio in a preliminary ratchet
Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning cs.AI · 2026-04-14 · unverdicted · none · ref 29 · internal anchor
A case-based learning framework extracts reusable knowledge from past tasks to improve LLM agents' structured performance on complex real-world tasks, outperforming standard prompting baselines especially as task complexity grows.
Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure cs.AI · 2026-04-13 · unverdicted · none · ref 29 · 2 links · internal anchor
OIDA is a proposed framework that represents organizational knowledge as epistemic Knowledge Objects with class-specific importance decay and signed contradictions, plus a QUESTION mechanism that surfaces modeled ignorance via inverse decay.
SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context cs.AI · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents cs.AI · 2026-04-13 · unverdicted · none · ref 15 · internal anchor
Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
Knowledge Compounding: An Empirical Economic Analysis of Self-Evolving Knowledge Wikis under the Agentic ROI Framework econ.EM · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
A four-query experiment demonstrates 84.6% token savings through knowledge compounding in self-evolving wikis compared to standard RAG, by amortizing ingestion costs and reusing synthesized knowledge over time.
Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure cs.SE · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm cs.SE · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
Contract-Coding projects ambiguous intents into formal Language Contracts as a single source of truth to enable more reliable repo-level code generation, reporting 47% functional success on the Greenfield-5 benchmark.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering cs.SE · 2026-04-09 · accept · none · ref 113 · internal anchor
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought cs.MA · 2026-04-09 · unverdicted · none · ref 25 · 2 links · internal anchor
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.
Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery cs.AI · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
Prism unifies file, vector, graph, and evolutionary memory under a decision-theoretic framework with entropy-gated stratification, causal graphs, value-of-information retrieval, heartbeat consolidation, and replicator-decay dynamics, reporting 88.1 on LOCOMO and 2.8x gains on CORAL tasks.
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents cs.AI · 2026-04-06 · unverdicted · none · ref 1 · internal anchor
MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93% on LongMemEval-S while using 80% fewer tokens than Mem0.
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity cs.AI · 2026-04-03 · conditional · none · ref 5 · internal anchor
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment cs.AI · 2026-03-04 · unverdicted · none · ref 10 · internal anchor
AI4S-SDS uses sparse MCTS and differentiable physics alignment to generate valid solvent mixtures and identifies a competitive photoresist developer formulation.
TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents cs.CL · 2026-01-06 · unverdicted · none · ref 4 · internal anchor
TiMem introduces a Temporal Memory Tree that consolidates conversational history into hierarchical persona representations, reaching 75.30% on LoCoMo and 76.88% on LongMemEval-S while cutting recalled length by 52%.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems cs.AI · 2025-08-10 · unverdicted · none · ref 65 · internal anchor
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
MemOS: A Memory OS for AI System cs.CL · 2025-07-04 · unverdicted · none · ref 100 · internal anchor
MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning cs.AI · 2025-05-31 · unverdicted · none · ref 39 · internal anchor
MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs cs.IR · 2025-04-22 · unverdicted · none · ref 45 · internal anchor
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts cs.CL · 2026-06-18 · unverdicted · none · ref 17 · 2 links · internal anchor
AtomMem introduces atomic-fact extraction, hierarchical event structures, and an associative memory graph to build stable long-term memory for LLM agents, claiming SOTA results on the LoCoMo benchmark.
Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction cs.AI · 2026-06-17 · unverdicted · none · ref 4 · internal anchor
Proposes HACD-H framework integrating emotional adaptation, relational organization, memory and personality into a dynamical system and reports empirical patterns from a 14,700-turn dataset linking social intelligence to reduced social cognitive energy.
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience cs.DB · 2026-06-10 · unverdicted · none · ref 8 · internal anchor
TAHOE builds a Hint Bank from error traces to raise Text-to-SQL pass rates on Spider 2.0-Snow from 61.95% to 79.42% for GPT-5.5 without parameter updates.
Making Software Meaningful cs.SE · 2026-06-09 · unverdicted · none · ref 36 · internal anchor
Committing to explicit meaning via a domain-grounded vocabulary of individuals, actions, facts, and concepts improves software usability, enables modular LLM code generation, and supports accountable agent behavior.
What makes a harness a harness: necessary and sufficient conditions for an agent harness cs.SE · 2026-06-08 · unverdicted · none · ref 34 · internal anchor
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents cs.AI · 2026-06-02 · unverdicted · none · ref 23 · internal anchor
RIZZ is a continual adaptation framework for black-box LLM agents that uses dynamically spawned memory branches, context-aware routing, verifier-gated updates, and prompt compilation to control interference across nonstationary inputs.
SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow cs.CL · 2026-05-28 · unverdicted · none · ref 39 · internal anchor
SURGENT is a multi-agent surgical assistance system with novel memory management that outperforms baseline LLMs on case analysis, plan simulation, safety monitoring, risk assessment, and rehabilitation guidance.
CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories cs.SE · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
CommitDistill is a deterministic, local-only prototype that extracts typed knowledge from git commits and evaluates retrieval performance against baselines on public repositories.
Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework cs.AI · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
Empirical analysis of multi-paradigm agent interactions in buddyMe framework reports that Generator-Evaluator detects omissions in 20% of complex tasks, ReAct causes 30% redundant tool calls, and adversarial discussions reach consensus in 2-3 rounds for 70% of cases.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics cs.AI · 2026-05-09 · unverdicted · none · ref 66 · internal anchor
The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using established economic theories.
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents cs.AI · 2026-05-05 · reject · none · ref 3 · 2 links · internal anchor
MEMTIER reports 0.382 accuracy and 0.412 F1 on the 500-question LongMemEval-S benchmark, a 33pp gain over full-context baseline using tiered memory and retrieval components on 6GB GPU hardware.
