AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

2026 · cs.CL · arXiv 2603.26680

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks (personalized information extraction, updating, retrieval, and utilization) and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.

representative citing papers

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

Introduces a personalized empathy task, the PersonaEmp dataset built from long-term interactions, and the PereGRM reward framework, which combines empathy evaluation with dynamic criteria to improve adaptation to user personas.

