Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Hui Liu; Jun Liu; Lijiao Xu; Lingling Zhang; Muye Huang; Rongman Xu; Weidong Guo; Yifei Li; Yu Xu

arXiv: 2602.10715 · v1 · pith:VPTRJZEPnew · submitted 2026-02-11 · cs.CL · cs.AI

Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu

keywords: memory, evaluation, cognitive, framework, locomo-plus, across, benchmarks, constraints

Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
cs.AI · 2026-06 · unverdicted · novelty 7.0

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents
cs.CL · 2026-06 · unverdicted · novelty 7.0

PersonaTree is a new hierarchical memory framework for persistent LLM agents that structures evidence into persona claims via support paths and outperforms baselines on six person-understanding benchmarks.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI · 2026-05 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

A-MBER: Affective Memory Benchmark for Emotion Recognition
cs.AI · 2026-04 · unverdicted · novelty 7.0

A-MBER is a new benchmark for evaluating AI models on using interaction history to recognize and explain a user's present affective state across judgment, retrieval, and explanation tasks.

SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
cs.AI · 2026-05 · unverdicted · novelty 6.0

SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
cs.AI · 2026-06 · unverdicted · novelty 5.0

An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.