Evaluating long-horizon memory for multi-party collaborative dialogues. arXiv preprint arXiv:2602.01313, 2026
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
citing papers explorer
- H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions
  H2HMem is a multimodal memory benchmark evaluating LLM agents on recall, reasoning, and application in dyadic and multi-party human-human conversations with phenomena such as anaphora and deixis.
- MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts
  MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.
- GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
  GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
- MemSyco-Bench: Benchmarking Sycophancy in Agent Memory