AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3
The pith
AlpsBench supplies a real-dialogue benchmark to test the full lifecycle of LLM memory management for personalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AlpsBench is assembled by selecting 2,500 long-term interaction sequences from real WildChat dialogues and attaching human-verified structured memories that encode explicit and implicit personalization signals. Four tasks are specified—personalized information extraction, memory updating, retrieval, and utilization—along with evaluation protocols that cover the entire memory-management cycle. When frontier LLMs and memory-centric systems are tested, they prove unable to extract latent user traits reliably, encounter a performance ceiling on memory updates, suffer sharp retrieval accuracy loss with large distractor sets, and gain factual recall from explicit memory mechanisms without gaining more preference-aligned or emotionally resonant responses.
What carries the argument
The AlpsBench benchmark of real-dialogue sequences with human-verified structured memories, evaluated on four tasks that span extraction, updating, retrieval, and utilization of personalized information.
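The four-task lifecycle can be pictured with a minimal sketch. The class and field names below (`MemoryEntry`, `UserMemory`, `attribute`, `session_id`) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    attribute: str     # e.g. "dietary_preference" (hypothetical field names)
    value: str         # e.g. "vegetarian"
    explicit: bool     # stated outright (True) vs. inferred from behavior
    session_id: int    # dialogue session the signal came from

@dataclass
class UserMemory:
    entries: list = field(default_factory=list)

    def extract(self, new_entries):
        """Task 1 (extraction): store signals mined from a dialogue."""
        self.entries.extend(new_entries)

    def update(self, attribute, value, session_id):
        """Task 2 (updating): overwrite an attribute when a preference changes."""
        self.entries = [e for e in self.entries if e.attribute != attribute]
        self.entries.append(MemoryEntry(attribute, value, True, session_id))

    def retrieve(self, query, k=3):
        """Task 3 (retrieval): naive substring match over stored entries."""
        q = query.lower()
        hits = [e for e in self.entries
                if q in e.attribute.lower() or q in e.value.lower()]
        return hits[:k]

# Task 4 (utilization) would inject the retrieved entries into the prompt.
mem = UserMemory()
mem.extract([MemoryEntry("dietary_preference", "vegetarian", False, 1)])
mem.update("dietary_preference", "vegan", 5)
print([e.value for e in mem.retrieve("dietary")])  # ['vegan']
```

The benchmark's reported failure modes map onto these steps: extraction misses implicit signals, updating hits a ceiling, and retrieval degrades as `entries` fills with distractors.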
If this is right
- Better methods are needed to surface implicit user traits that current models overlook in natural conversations.
- Memory updating must overcome its observed ceiling to keep pace with ongoing changes in user preferences.
- Retrieval components must maintain accuracy when large numbers of irrelevant memories are present.
- Explicit memory stores raise recall rates yet leave preference alignment and emotional resonance largely unchanged.
Where Pith is reading between the lines
- Pairing memory mechanisms with separate preference-alignment objectives during training could close the gap between recall and response quality.
- Extending the benchmark to track how preferences evolve across dozens of sessions would test long-term adaptability.
- High-performing systems on these tasks could serve as stronger starting points for building assistants that feel consistently personal to individual users.
Load-bearing premise
The 2,500 WildChat sequences with their human-verified structured memories are taken to represent typical real-world personalization demands and to cover the complete memory-management cycle.
What would settle it
A controlled study in which models that score highest on all four AlpsBench tasks are deployed in live multi-week user interactions and show no measurable improvement in preference matching or user retention compared with baseline models would falsify the benchmark's claimed relevance.
Original abstract
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AlpsBench, a benchmark of 2,500 long-term interaction sequences curated from WildChat and paired with human-verified structured memories that capture explicit and implicit personalization signals. It defines four tasks (personalized information extraction, updating, retrieval, and utilization) to evaluate the full lifecycle of memory management. Benchmarking of frontier LLMs and memory-centric systems shows that models struggle to extract latent traits, memory updating hits a performance ceiling, retrieval accuracy declines sharply with large distractor pools, and explicit memory improves recall without guaranteeing more preference-aligned or emotionally resonant responses.
Significance. If the curation is shown to be representative and the evaluation protocols are fully specified and reproducible, AlpsBench would provide a valuable real-world grounded framework that addresses gaps in existing synthetic benchmarks, offering concrete empirical insights into LLM limitations for lifelong personalization that could guide targeted improvements in memory mechanisms.
major comments (3)
- [§3] §3 (Dataset Construction): The curation of the 2,500 sequences from WildChat is presented without quantitative distribution comparisons (e.g., topic histograms, dialogue length statistics, or trait coverage) to the full source corpus or explicit selection criteria; this is load-bearing for the central claims because the reported model limitations and performance ceilings are generalized from these sequences as representative of real-world personalization lifecycles including rare cases like conflicting preferences.
- [§4.2] §4.2 (Evaluation Protocols): The human verification process for structured memories and the exact metrics/protocols for the utilization task (preference alignment and emotional resonance) lack reported inter-annotator agreement or annotation guidelines; without these, it is unclear whether the finding that explicit memory does not guarantee better responses is supported by reliable ground truth.
- [§5] §5 (Benchmarking Results): The retrieval experiments report sharp accuracy declines with large distractor pools, but the manuscript does not specify the exact distractor pool sizes tested, the construction of negative examples, or direct comparisons to non-memory baselines; this detail is necessary to substantiate the claim as a general limitation rather than an artifact of the setup.
minor comments (2)
- [Abstract] Abstract: The final sentence could be strengthened by briefly noting one or two key quantitative findings rather than ending on the general aim statement.
- [Throughout] Notation: Ensure consistent terminology for 'structured memories' versus 'explicit memory mechanisms' across sections to avoid minor reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve clarity, reproducibility, and support for our claims.
Point-by-point responses
Referee: [§3] §3 (Dataset Construction): The curation of the 2,500 sequences from WildChat is presented without quantitative distribution comparisons (e.g., topic histograms, dialogue length statistics, or trait coverage) to the full source corpus or explicit selection criteria; this is load-bearing for the central claims because the reported model limitations and performance ceilings are generalized from these sequences as representative of real-world personalization lifecycles including rare cases like conflicting preferences.
Authors: We agree that quantitative comparisons are essential to substantiate representativeness. In the revised manuscript, we will add topic histograms, dialogue length distributions, and trait coverage statistics comparing the 2,500 selected sequences against the full WildChat corpus. We will also explicitly document the curation criteria, including how sequences were sampled to capture both common and rare personalization signals such as conflicting preferences. These additions will appear in an expanded Section 3. revision: yes
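One common way to quantify how closely the curated subset's topic mix tracks the full corpus is a divergence between the two topic distributions. The sketch below uses invented topic labels and counts purely for illustration; nothing here comes from the paper's data:

```python
import math
from collections import Counter

def topic_distribution(topics):
    """Normalize a list of topic labels into a probability distribution."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2): 0 = identical mixes, 1 = disjoint."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Invented topic mixes for illustration only
corpus = topic_distribution(["coding"] * 60 + ["travel"] * 25 + ["health"] * 15)
subset = topic_distribution(["coding"] * 55 + ["travel"] * 30 + ["health"] * 15)
print(round(jensen_shannon(corpus, subset), 4))  # small value -> similar mixes
```

A low divergence between subset and corpus would support the representativeness claim; the same check applies to dialogue-length and trait-coverage histograms.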
Referee: [§4.2] §4.2 (Evaluation Protocols): The human verification process for structured memories and the exact metrics/protocols for the utilization task (preference alignment and emotional resonance) lack reported inter-annotator agreement or annotation guidelines; without these, it is unclear whether the finding that explicit memory does not guarantee better responses is supported by reliable ground truth.
Authors: We will report inter-annotator agreement (Fleiss' kappa) for the structured memory verification process in the revised Section 4.2. The full annotation guidelines and detailed protocols for scoring preference alignment and emotional resonance in the utilization task will be added to the appendix. These changes will directly support the reliability of the ground-truth labels underlying our finding that explicit memory does not guarantee better responses. revision: yes
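The Fleiss' kappa the authors promise is straightforward to compute. The sketch below is a generic implementation with invented labels, not the paper's annotation data:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] is the list of labels annotators gave item i.

    Assumes every item was rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    # Observed agreement: per-item fraction of agreeing annotator pairs
    p_i = []
    for row in ratings:
        counts = Counter(row)
        agree = sum(c * (c - 1) for c in counts.values())
        p_i.append(agree / (n_raters * (n_raters - 1)))
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal category proportions
    p_j = [sum(row.count(cat) for row in ratings) / (n_items * n_raters)
           for cat in categories]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three hypothetical annotators judging whether each extracted memory is correct
labels = [["yes", "yes", "yes"],
          ["yes", "yes", "no"],
          ["no", "no", "no"],
          ["yes", "yes", "yes"]]
print(round(fleiss_kappa(labels), 3))  # 0.625
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the threshold the verification labels would need to clear for the ground truth to be considered reliable.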
Referee: [§5] §5 (Benchmarking Results): The retrieval experiments report sharp accuracy declines with large distractor pools, but the manuscript does not specify the exact distractor pool sizes tested, the construction of negative examples, or direct comparisons to non-memory baselines; this detail is necessary to substantiate the claim as a general limitation rather than an artifact of the setup.
Authors: We will revise Section 5 to specify the exact distractor pool sizes evaluated (10, 50, 100, 500, and 1000). Negative examples are constructed by sampling sequences from the dataset that share no user memories with the query. We will also add direct comparisons against non-memory baselines, including standard prompting and simple RAG without structured memory, to demonstrate that the observed retrieval declines reflect a general limitation rather than a setup-specific artifact. revision: yes
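The qualitative claim that top-1 retrieval decays as the distractor pool grows can be illustrated with a toy simulation. The Gaussian score model below is an assumption for illustration, not the paper's retriever or data: the relevant memory scores higher on average, but with enough distractors some will outscore it by chance.

```python
import random

def top1_accuracy(pool_size, trials=2000, seed=0):
    """Fraction of trials where the relevant memory outscores every distractor."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        target = rng.gauss(1.0, 0.5)  # relevant memory: higher mean score
        best_distractor = max(rng.gauss(0.0, 0.5) for _ in range(pool_size - 1))
        hits += target > best_distractor
    return hits / trials

for size in (10, 100, 1000):
    print(size, top1_accuracy(size))  # accuracy falls as the pool grows
```

This also shows why reporting the exact pool sizes matters: the decline is driven by the maximum over distractor scores, so results at small pools say little about behavior at 500 or 1000.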
Circularity Check
No circularity in empirical benchmark construction
Full rationale
This is a purely empirical benchmark paper that curates 2,500 sequences from WildChat, defines four tasks, and reports model performance metrics. No derivations, equations, predictions, or first-principles claims exist that could reduce to fitted inputs, self-citations, or ansatzes by construction. The central claims rest on direct evaluation protocols and human verification rather than any self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human-verified structured memories accurately encapsulate both explicit and implicit personalization signals from real dialogues.