HorizonBench generates 6-month conversation histories from structured mental state graphs to test AI models on tracking evolving user preferences, finding that frontier models mostly fail at belief updates and perform near or below chance.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
baseline 1polarities
baseline 1representative citing papers
A hybrid LLM-symbolic verifier maintains a dependency graph over conversation turns classified into eight formal update operations, enabling linear-time groundedness checks and precise retraction propagation with a conflict-free guarantee.
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.
MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.
citing papers explorer
-
HorizonBench: Long-Horizon Personalization with Evolving Preferences
HorizonBench generates 6-month conversation histories from structured mental state graphs to test AI models on tracking evolving user preferences, finding that frontier models mostly fail at belief updates and perform near or below chance.
-
Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations
A hybrid LLM-symbolic verifier maintains a dependency graph over conversation turns classified into eight formal update operations, enabling linear-time groundedness checks and precise retraction propagation with a conflict-free guarantee.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.
-
MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.