AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
read the original abstract
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between applications and evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric settings. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any Length), a benchmark designed to evaluate long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories of arbitrary horizons paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information, and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest baselines by 11.16%. Resources are available at our project website: https://ama-bench.github.io/
This paper has not been read by Pith yet.
Forward citations
Cited by 18 Pith papers
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
-
EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
CSR with ASR enables infinite-horizon real-time LLM policies via stable KV-cache properties and background eviction, delivering 26x lower latency and SOTA recall on embodied benchmarks.
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
-
Stateless Decision Memory for Enterprise AI Agents
Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
-
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
-
Opal: Private Memory for Personal AI
Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
-
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
NeuSymMS is a hybrid neuro-symbolic memory architecture for LLM agents that extracts facts neurally, manages them with explicit lifecycle rules in a CLIPS expert system, stores them as triples in a relational database...
-
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.