
MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

· 2025 · cs.LG · arXiv 2510.17281

12 Pith papers cite this work. Polarity classification is still indexing.


abstract

Scaling up data, parameters, and test-time computation has been the mainstream approach to improving LLM systems (LLMsys), but its upper bounds are almost reached owing to the gradual depletion of high-quality data and the marginal gains obtained from larger computational resource consumption. Inspired by the ability of humans and traditional AI systems to learn from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing its ability to learn from accumulated user feedback during service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfactory, and we hope this benchmark can pave the way for future studies on LLM memory and optimization algorithms. Website: https://memorybench.thuir.cn · Code: https://github.com/THUIR/MemoryBench · Data: https://huggingface.co/datasets/THUIR/MemoryBench · Data-Full: https://huggingface.co/datasets/THUIR/MemoryBench-Full


representative citing papers

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

EXG: Self-Evolving Agents with Experience Graphs

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

cs.CL · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

cs.AI · 2026-03-24 · unverdicted · novelty 7.0

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.

State Contamination in Memory-Augmented LLM Agents

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

cs.AI · 2026-04-11 · unverdicted · novelty 6.0

Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.

ATANT: An Evaluation Framework for AI Continuity

cs.AI · 2026-04-08 · unverdicted · novelty 6.0

ATANT defines AI continuity via seven properties and offers a 10-checkpoint, LLM-free test using 250 stories to check if systems retrieve correct facts without cross-contamination.

ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

cs.AI · 2026-04-13 · unverdicted · novelty 4.0

Existing memory benchmarks cover at most two of the seven continuity properties from ATANT v1.0, with a median of one and none covering more than two.

Skill Retrieval Augmentation for Agentic AI

cs.CL · 2026-04-27 · 2 refs

Improve Large Language Model Systems with User Logs

cs.CL · 2026-02-06

citing papers explorer

Showing 12 of 12 citing papers.

MemGym: a Long-Horizon Memory Environment for LLM Agents
cs.CL · 2026-05-20 · unverdicted · none · ref 1 · internal anchor

EXG: Self-Evolving Agents with Experience Graphs
cs.AI · 2026-05-18 · unverdicted · none · ref 1 · internal anchor

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL · 2026-05-14 · unverdicted · none · ref 12 · 2 links · internal anchor

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
cs.AI · 2026-03-24 · unverdicted · none · ref 1 · internal anchor

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
cs.CL · 2026-05-31 · unverdicted · none · ref 61 · internal anchor

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
cs.CL · 2026-05-18 · unverdicted · none · ref 1 · internal anchor

State Contamination in Memory-Augmented LLM Agents
cs.AI · 2026-05-16 · unverdicted · none · ref 1 · internal anchor

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
cs.AI · 2026-04-11 · unverdicted · none · ref 1 · internal anchor

ATANT: An Evaluation Framework for AI Continuity
cs.AI · 2026-04-08 · unverdicted · none · ref 1 · internal anchor

ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks
cs.AI · 2026-04-13 · unverdicted · none · ref 1 · internal anchor

Skill Retrieval Augmentation for Agentic AI
cs.CL · 2026-04-27 · unreviewed · ref 1 · 2 links · internal anchor

Improve Large Language Model Systems with User Logs
cs.CL · 2026-02-06 · unreviewed · ref 2 · internal anchor
