AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Abhilash Shankarampeta; Boqin Yuan; Haocheng Yuan; Haozhou Xu; Jishen Zhao; Junbo Huang; Lanxiang Hu; Wentao Ni; Yuandong Tian; Yujie Zhao

arXiv: 2602.22769 · v3 · pith:ZNJ3J7DQnew · submitted 2026-02-26 · cs.AI · cs.LG

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

keywords: memory, agentic, ama-bench, applications, agent, long-horizon, primarily, ama-agent

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between applications and evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric settings. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any Length), a benchmark designed to evaluate long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories of arbitrary horizons paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information, and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest baselines by 11.16%. Resources are available at our project website: https://ama-bench.github.io/
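
The second component described in the abstract, synthetic trajectories of arbitrary horizon paired with rule-based QA, can be illustrated with a toy sketch. This is not the benchmark's actual generator: the `Step` record, the key names, and the question template are all hypothetical, chosen only to show how an answer can be derived by a rule from the trajectory itself rather than written by an annotator.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Step:
    """One hypothetical agent-environment interaction (an assumption, not the paper's schema)."""
    index: int
    action: str  # machine-generated action name
    key: str     # environment variable touched by the action
    value: int   # value written by the action


def make_trajectory(horizon: int, seed: int = 0) -> list[Step]:
    """Generate a synthetic trajectory with the requested number of steps."""
    rng = random.Random(seed)
    keys = ["door", "lamp", "valve"]  # hypothetical environment variables
    return [
        Step(i, "set", rng.choice(keys), rng.randint(0, 9))
        for i in range(horizon)
    ]


def state_at(trajectory: list[Step], step_index: int) -> dict[str, int]:
    """Replay the trajectory up to and including step_index; later writes win."""
    state: dict[str, int] = {}
    for step in trajectory[: step_index + 1]:
        state[step.key] = step.value
    return state


def make_question(trajectory: list[Step], key: str, step_index: int):
    """Build a question whose answer follows from a rule (state replay)."""
    question = f"What is the value of '{key}' after step {step_index}?"
    answer = state_at(trajectory, step_index).get(key)  # None if never set
    return question, answer


if __name__ == "__main__":
    trajectory = make_trajectory(horizon=1000, seed=42)
    print(make_question(trajectory, "door", 500))
```

Because the answer is computed by replaying the trajectory, the horizon can be made as long as needed without any additional annotation effort, which is the property the abstract attributes to the synthetic split.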

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 conditional novelty 8.0

GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
econ.EM 2026-05 accept novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

MemGym: a Long-Horizon Memory Environment for LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 unverdicted novelty 7.0

GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
cs.CL 2026-05 unverdicted novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI 2026-05 unverdicted novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 accept novelty 7.0

NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
cs.RO 2026-05 unverdicted novelty 6.0

CSR with ASR enables infinite-horizon real-time LLM policies via stable KV-cache properties and background eviction, delivering 26x lower latency and SOTA recall on embodied benchmarks.

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
cs.AI 2026-05 unverdicted novelty 6.0

In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
cs.AI 2026-05 unverdicted novelty 6.0

Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 unverdicted novelty 6.0

NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

Stateless Decision Memory for Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 6.0

Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

FileGram: Grounding Agent Personalization in File-System Behavioral Traces
cs.CV 2026-04 unverdicted novelty 6.0

FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.

Opal: Private Memory for Personal AI
cs.CR 2026-04 unverdicted novelty 6.0

Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
cs.AI 2026-05 unverdicted novelty 5.0

NeuSymMS is a hybrid neuro-symbolic memory architecture for LLM agents that extracts facts neurally, manages them with explicit lifecycle rules in a CLIPS expert system, stores them as triples in a relational database...

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
cs.AI 2026-05 unverdicted novelty 5.0

NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.