PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, et al · 2026 · cs.AI · arXiv 2603.23231

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

representative citing papers

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.

citing papers explorer


π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

cs.AI · 2026-05-14 · unverdicted · none · ref 20 · 2 links · internal anchor

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

cs.AI · 2026-04-22 · unverdicted · none · ref 74 · internal anchor

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
