pith. machine review for the scientific record

arxiv: 2603.26680 · v2 · submitted 2026-03-09 · 💻 cs.CL · cs.AI

Recognition: no theorem link

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM personalization · memory management · dialogue benchmark · preference alignment · information extraction · memory retrieval · long-term interactions

The pith

AlpsBench supplies a real-dialogue benchmark to test the full lifecycle of LLM memory management for personalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AlpsBench, built from 2,500 sequences drawn from actual human-LLM conversations and paired with human-verified structured memories that capture both stated and implied user details. It defines four tasks that track memory from initial extraction of traits through updating, retrieval amid distractions, and final use in generating responses. Benchmark runs on leading models and memory systems show consistent shortfalls in pulling out hidden preferences, a hard limit on how well memories can be kept current, sharp drops in retrieval accuracy as irrelevant items increase, and recall gains from added memory stores that do not automatically produce outputs better fitted to user tastes or tone. These results matter because they expose concrete barriers to turning LLMs into reliable long-term personal assistants that remember individuals across many exchanges.

Core claim

AlpsBench is assembled by selecting 2,500 long-term interaction sequences from real WildChat dialogues and attaching human-verified structured memories that encode explicit and implicit personalization signals. Four tasks are specified—personalized information extraction, memory updating, retrieval, and utilization—along with evaluation protocols that cover the entire memory-management cycle. When frontier LLMs and memory-centric systems are tested, they prove unable to extract latent user traits reliably, encounter a performance ceiling on memory updates, suffer sharp retrieval accuracy loss with large distractor sets, and gain factual recall from explicit memory mechanisms without gaining correspondingly better preference alignment or emotional resonance in their responses.

What carries the argument

AlpsBench benchmark of real-dialogue sequences with verified structured memories, evaluated on four tasks that span extraction, updating, retrieval, and utilization of personalized information.
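
To make the four-stage lifecycle concrete, here is a minimal runnable sketch of how such an evaluation could be wired together. The paper's actual harness, prompts, and metrics are not reproduced here; the extractor below is a toy rule standing in for an LLM call, and exact string match stands in for the benchmark's scoring.

```python
# Minimal, runnable sketch of the four-task memory lifecycle AlpsBench
# evaluates: extraction -> updating -> retrieval -> utilization.
# The "model" is a toy keyword matcher standing in for an LLM; gold
# memories and the recall metric are reduced to exact string match.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    items: dict[str, str] = field(default_factory=dict)  # key -> value

    def update(self, key: str, value: str) -> None:
        # Task 2 (updating): later facts overwrite earlier ones.
        self.items[key] = value

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Task 3 (retrieval): rank memories by token overlap with the query.
        def overlap(mem: str) -> int:
            return len(set(mem.lower().split()) & set(query.lower().split()))
        return sorted(self.items.values(), key=overlap, reverse=True)[:k]

def extract(turn: str) -> dict[str, str]:
    # Task 1 (extraction): a toy rule standing in for LLM trait extraction.
    facts, low = {}, turn.lower()
    if "i live in" in low:
        facts["home"] = low.split("i live in", 1)[1].strip(" .")
    if "i prefer" in low:
        facts["preference"] = low.split("i prefer", 1)[1].strip(" .")
    return facts

def respond(query: str, memories: list[str]) -> str:
    # Task 4 (utilization): condition the reply on retrieved memories.
    return f"(answering {query!r} given memories: {memories})"

# Tiny end-to-end run over one interaction sequence.
store = MemoryStore()
for turn in ["I live in Zurich.", "I prefer short answers.", "I live in Geneva."]:
    for key, value in extract(turn).items():
        store.update(key, value)

gold = {"home": "geneva", "preference": "short answers"}
recall = sum(store.items.get(k) == v for k, v in gold.items()) / len(gold)
print(respond("suggest a cafe", store.retrieve("suggest a cafe")))
print(f"extraction/update recall: {recall:.2f}")
```

The point is the shape of the loop: each task's output feeds the next, so a failure in extraction or updating caps what retrieval and utilization can achieve, which is why the benchmark scores the stages separately.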

If this is right

  • Better methods are needed to surface implicit user traits that current models overlook in natural conversations.
  • Memory updating must overcome its observed ceiling to keep pace with ongoing changes in user preferences.
  • Retrieval components must maintain accuracy when large numbers of irrelevant memories are present.
  • Explicit memory stores raise recall rates yet leave preference alignment and emotional resonance largely unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing memory mechanisms with separate preference-alignment objectives during training could close the gap between recall and response quality.
  • Extending the benchmark to track how preferences evolve across dozens of sessions would test long-term adaptability.
  • High-performing systems on these tasks could serve as stronger starting points for building assistants that feel consistently personal to individual users.

Load-bearing premise

The 2,500 WildChat sequences with their human-verified structured memories are taken to represent typical real-world personalization demands and to cover the complete memory-management cycle.

What would settle it

A controlled study in which models that score highest on all four AlpsBench tasks are deployed in live multi-week user interactions and show no measurable improvement in preference matching or user retention compared with baseline models would falsify the benchmark's claimed relevance.

Figures

Figures reproduced from arXiv: 2603.26680 by Chengbing Wang, Fuli Feng, Hongxun Ding, Jianfei Xiao, Kaining Liu, Wenjie Wang, Wuqiang Zheng, Xiangnan He, Xiang Yu, Xinyu Lin, Yang Zhang.

Figure 1. Comparison between synthesized and real data.
Figure 3. The evaluation tasks of AlpsBench.
Figure 4. The four-step benchmark construction pipeline.
Figure 6. Task 1: extraction recall of memory-oriented systems.
Figure 8. Task 3: retrieval performance decay across evalu…
read the original abstract

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AlpsBench, a benchmark consisting of 2,500 long-term interaction sequences curated from WildChat paired with human-verified structured memories capturing explicit and implicit personalization signals. It defines four tasks—personalized information extraction, updating, retrieval, and utilization—to evaluate the full lifecycle of memory management, and reports benchmarking results on frontier LLMs and memory-centric systems showing struggles with latent trait extraction, performance ceilings in updating, sharp retrieval declines with large distractor pools, and that explicit memory improves recall but does not guarantee better preference-aligned or emotionally resonant responses.

Significance. If the curation is shown to be representative and the evaluation protocols are fully specified and reproducible, AlpsBench would provide a valuable real-world grounded framework that addresses gaps in existing synthetic benchmarks, offering concrete empirical insights into LLM limitations for lifelong personalization that could guide targeted improvements in memory mechanisms.

major comments (3)
  1. [§3] §3 (Dataset Construction): The curation of the 2,500 sequences from WildChat is presented without quantitative distribution comparisons (e.g., topic histograms, dialogue length statistics, or trait coverage) to the full source corpus or explicit selection criteria; this is load-bearing for the central claims because the reported model limitations and performance ceilings are generalized from these sequences as representative of real-world personalization lifecycles including rare cases like conflicting preferences.
  2. [§4.2] §4.2 (Evaluation Protocols): The human verification process for structured memories and the exact metrics/protocols for the utilization task (preference alignment and emotional resonance) lack reported inter-annotator agreement or annotation guidelines; without these, it is unclear whether the finding that explicit memory does not guarantee better responses is supported by reliable ground truth.
  3. [§5] §5 (Benchmarking Results): The retrieval experiments report sharp accuracy declines with large distractor pools, but the manuscript does not specify the exact distractor pool sizes tested, the construction of negative examples, or direct comparisons to non-memory baselines; this detail is necessary to substantiate the claim as a general limitation rather than an artifact of the setup.
minor comments (2)
  1. [Abstract] Abstract: The final sentence could be strengthened by briefly noting one or two key quantitative findings rather than ending on the general aim statement.
  2. [Throughout] Notation: Ensure consistent terminology for 'structured memories' versus 'explicit memory mechanisms' across sections to avoid minor reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve clarity, reproducibility, and support for our claims.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The curation of the 2,500 sequences from WildChat is presented without quantitative distribution comparisons (e.g., topic histograms, dialogue length statistics, or trait coverage) to the full source corpus or explicit selection criteria; this is load-bearing for the central claims because the reported model limitations and performance ceilings are generalized from these sequences as representative of real-world personalization lifecycles including rare cases like conflicting preferences.

    Authors: We agree that quantitative comparisons are essential to substantiate representativeness. In the revised manuscript, we will add topic histograms, dialogue length distributions, and trait coverage statistics comparing the 2,500 selected sequences against the full WildChat corpus. We will also explicitly document the curation criteria, including how sequences were sampled to capture both common and rare personalization signals such as conflicting preferences. These additions will appear in an expanded Section 3. revision: yes

  2. Referee: [§4.2] §4.2 (Evaluation Protocols): The human verification process for structured memories and the exact metrics/protocols for the utilization task (preference alignment and emotional resonance) lack reported inter-annotator agreement or annotation guidelines; without these, it is unclear whether the finding that explicit memory does not guarantee better responses is supported by reliable ground truth.

    Authors: We will report inter-annotator agreement (Fleiss' kappa) for the structured memory verification process in the revised Section 4.2. The full annotation guidelines and detailed protocols for scoring preference alignment and emotional resonance in the utilization task will be added to the appendix. These changes will directly support the reliability of the ground-truth labels underlying our finding that explicit memory does not guarantee better responses. revision: yes
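
For readers who want the proposed statistic spelled out, a self-contained Fleiss' kappa computation is sketched below. The rating matrix is invented for illustration; the actual annotation data, label categories, and rater counts are the authors' to report.

```python
# Fleiss' kappa for fixed-size rater panels, as the authors propose to
# report for memory verification. The matrix below is hypothetical:
# rows are annotated memory entries, columns are label categories
# (e.g., correct / incorrect / unverifiable), cells count raters.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_items, n_categories); each row sums to the rater count."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement: fraction of concordant rater pairs.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical panel: 3 raters labeling 5 memory entries into 3 categories.
ratings = np.array([[3, 0, 0],
                    [2, 1, 0],
                    [0, 3, 0],
                    [1, 1, 1],
                    [3, 0, 0]])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```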

  3. Referee: [§5] §5 (Benchmarking Results): The retrieval experiments report sharp accuracy declines with large distractor pools, but the manuscript does not specify the exact distractor pool sizes tested, the construction of negative examples, or direct comparisons to non-memory baselines; this detail is necessary to substantiate the claim as a general limitation rather than an artifact of the setup.

    Authors: We will revise Section 5 to specify the exact distractor pool sizes evaluated (10, 50, 100, 500, and 1000). Negative examples are constructed by sampling sequences from the dataset that share no user memories with the query. We will also add direct comparisons against non-memory baselines, including standard prompting and simple RAG without structured memory, to demonstrate that the observed retrieval declines reflect a general limitation rather than a setup-specific artifact. revision: yes
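
A minimal sketch of what such a distractor sweep could look like, using the pool sizes named above. The bag-of-words cosine retriever and the synthetic token-string distractors are stand-ins chosen only to make the decay visible in a toy setting; the paper's negatives are real sequences sharing no user memories with the query.

```python
# Toy sweep of retrieval recall@1 against growing distractor pools, on
# the grid the rebuttal names (10, 50, 100, 500, 1000). Scoring uses a
# bag-of-words cosine similarity as a stand-in for a learned retriever.
import math
import random
from collections import Counter

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_1(query: str, target: str, distractors: list[str]) -> float:
    pool = distractors + [target]
    return float(max(pool, key=lambda m: cosine(query, m)) == target)

random.seed(0)
query = "recommend a restaurant the user would like near work"
target = "user prefers vegetarian restaurants near the office"
# Distractor vocabulary mixes query tokens with noise so larger pools
# are likelier to contain a spurious high-overlap memory.
vocab = query.split() + [f"noise{i}" for i in range(100)]

for pool_size in (10, 50, 100, 500, 1000):
    hits = [
        recall_at_1(
            query, target,
            [" ".join(random.sample(vocab, 6)) for _ in range(pool_size)],
        )
        for _ in range(50)
    ]
    print(f"pool={pool_size:4d}  recall@1={sum(hits) / len(hits):.2f}")
```

Run as-is, recall@1 falls as the pool grows, which is the qualitative pattern the benchmark reports; the paper's numbers come from its own retrievers and real distractor sequences.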

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction

full rationale

This is a purely empirical benchmark paper that curates 2,500 sequences from WildChat, defines four tasks, and reports model performance metrics. No derivations, equations, predictions, or first-principles claims exist that could reduce to fitted inputs, self-citations, or ansatzes by construction. The central claims rest on direct evaluation protocols and human verification rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new benchmark but rests on the domain assumption that human annotation reliably captures personalization signals; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Human-verified structured memories accurately encapsulate both explicit and implicit personalization signals from real dialogues.
    The quality and validity of the benchmark depend on this assumption for the 2,500 sequences.

pith-pipeline@v0.9.0 · 5554 in / 1375 out tokens · 55247 ms · 2026-05-15T14:47:44.781686+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

  1. [1] Saleh Afzoon, Usman Naseem, Amin Beheshti, and Zahra Jamali. 2024. PersoBench: Benchmarking personalized response generation in large language models. arXiv preprint arXiv:2410.03198 (2024).
  2. [2] Anthropic. 2025. Claude Sonnet 4.5 System Card. Technical Report. Anthropic. https://www-cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf [Online; accessed Feb. 12, 2026].
  3. [3] Anthropic. 2026. The Assistant Axis: Situating and Stabilizing the Character of Large Language Models. https://www.anthropic.com/research/assistant-axis. [Online; accessed Feb. 13, 2026].
  4. [4] Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. 2025. HaluMem: Evaluating hallucinations in memory systems of agents. arXiv preprint arXiv:2511.03506 (2025).
  5. [5] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413 (2025).
  6. [6] DeepSeek-AI. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. CoRR abs/2512.02556 (2025).
  7. [7] Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. 2025. LightMem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866 (2025).
  8. [8] Google DeepMind. 2025. Gemini 3 Flash Model Card. Technical Report. Google DeepMind. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf [Online; accessed Feb. 2026].
  9. [9] Jian Guan, Junfei Wu, Jia-Nan Li, Chuanqi Cheng, and Wei Wu. 2025. A Survey on Personalized Alignment - The Missing Piece for Large Language Models in Real-World Applications. In ACL (Findings) (Findings of ACL, Vol. ACL 2025). Association for Computational Linguistics, 5313–5333.
  10. [10] Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, et al. 2026. EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning. arXiv preprint arXiv:2601.02163 (2026).
  11. [11] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. 2023. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564 (2023).
  12. [13] Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. 2025. Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225 (2025).
  13. [14] Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. 2025. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688 (2025).
  14. [15] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. 2025. Artificial hivemind: The open-ended homogeneity of language models (and beyond). In NeurIPS.
  15. [16] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. 2025. Memory OS of AI Agent. In EMNLP. Association for Computational Linguistics, 25961–25970.
  16. [17] Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, and Xiangnan He. 2024. Customizing language models with instance-wise LoRA for sequential recommendation. Advances in Neural Information Processing Systems 37 (2024), 113072–113095.
  17. [18] Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, et al. 2024. Longlamp: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016 (2024).
  18. [19] Amy Armento Lee, Narayan Hegde, Nina Deliu, Emily Rosenzweig, Arun Suggala, Sriram Lakshminarasimhan, Qian He, John Hernandez, Martin Seneviratne, Rahul Singh, et al. 2025. A Personalized Exercise Assistant using Reinforcement Learning (PEARL): Results from a four-arm Randomized-controlled Trial. arXiv:2508.10060 (2025).
  19. [20] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al. 2025. MemOS: A memory OS for AI system. arXiv preprint arXiv:2507.03724 (2025).
  20. [21] Xinyu Lin, Pengyuan Liu, Wenjie Wang, Yicheng Hu, Chen Xu, Fuli Feng, Qifan Wang, and Tat-Seng Chua. 2026. Bringing Reasoning to Generative Recommendation Through the Lens of Cascaded Ranking. arXiv:2602.03692 (2026).
  21. [23] Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, and Irwin King. 2025. A Survey of Personalized Large Language Models: Progress and Future Directions. CoRR abs/2502.11528 (2025).
  22. [24] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. In ACL (1). Association for Computational Linguistics, 13851–13870.
  23. [25] Meta AI. 2025. Llama 4: Multimodal Intelligence. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. [Online; accessed Feb. 12, 2026].
  24. [26] Jiayan Nan, Wenquan Ma, Wenlong Wu, and Yize Chen. 2025. Nemori: Self-organizing agent memory inspired by cognitive science. arXiv preprint arXiv:2508.03341 (2025).
  25. [27] OpenAI. 2025. GPT-4.1. https://openai.com/index/gpt-4-1/. [Online; accessed Feb. 12, 2026].
  26. [28] OpenAI. 2025. GPT-5.2 System Card. Technical Report. OpenAI. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf [Online; accessed Feb. 2026].
  27. [29] OpenAI. 2025. The Power of Personalized AI. https://openai.com/global-affairs/the-power-of-personalized-ai/. [Online; accessed Feb. 13, 2026].
  28. [30] Samuel J. Paech. 2023. EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models. CoRR abs/2312.06281 (2023).
  29. [31] Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, and Fuli Feng. Latent Inter-User Difference Modeling for LLM Personalization. In EMNLP. Association for Computational Linguistics, 10599–10617.
  30. [33] Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, and Tat-Seng Chua. 2025. Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization. In ACL (Findings) (Findings of ACL, Vol. ACL 2025). Association for Computational Linguistics, 21258–21277.
  31. [34] Stephen E. Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 3, 4 (2009), 333–389.
  32. [35] Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M. Liu, Jinfeng Zhou, Alvionna S. Sunaryo, Tatia M. C. Lee, Rada Mihalcea, and Minlie Huang. 2024. EmoBench: Evaluating the Emotional Intelligence of Large Language Models. In ACL (1). Association for Computational Linguistics, 5986–6004.
  33. [36] Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. 2025. MemInsight: Autonomous Memory Augmentation for LLM Agents. In EMNLP. Association for Computational Linguistics, 33136–33152.
  34. [37] Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. LaMP: When Large Language Models Meet Personalization. In ACL (1). Association for Computational Linguistics, 7370–7392.
  35. [38] Alireza Salemi and Hamed Zamani. 2025. LaMP-QA: A Benchmark for Personalized Long-form Question Answering. In EMNLP. Association for Computational Linguistics, 1139–1159.
  36. [39] sentence-transformers. 2025. all-MiniLM-L6-v2 Sentence Embedding Model. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. [Online; accessed Feb. 12, 2026].
  37. [40] Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, and Fuli Feng. 2024. Large language models are learnable planners for long-term recommendation. In SIGIR. 1893–1903.
  38. [41] Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. 2024. Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning. In EMNLP. Association for Computational Linguistics, 6476–6491.
  39. [42] Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, and Wangchunshu Zhou. 2025. PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization. arXiv preprint arXiv:2506.12915 (2025).
  40. [43] Wiebke Wagner. 2010. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit - O'Reilly Media, Beijing, 2009, ISBN 978-0-596-51649-9. Lang. Resour. Evaluation 44, 4 (2010), 421–424.
  41. [44] Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2025. Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation. CoRR abs/2512.06690 (2025).
  42. [45] Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2026. Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation. (2026).
  43. [46] Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng, Yi Xie, Wenjie Wang, and Fuli Feng. 2026. PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models. arXiv preprint arXiv:2601.10532 (2026).
  44. [47] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An Open-Ended Embodied Agent with Large Language Models. Trans. Mach. Learn. Res. 2024 (2024).
  45. [48] Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuan-Jing Huang, and Zhongyu Wei. 2025. Benchmark self-evolving: A multi-agent framework for dynamic LLM evaluation. In Proceedings of the 31st International Conference on Computational Linguistics. 3310–3328.
  46. [49] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. In ICLR. OpenReview.net.
  47. [51] Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, and ...
  48. [52] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. In NeurIPS.
  49. [53] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110 (2025).
  50. [55] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...
  51. [56] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. In NeurIPS.
  52. [57] You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, and Xuejie Zhang. 2024. Personalized LoRA for Human-Centered Text Understanding. In AAAI. AAAI Press, 19588–19596.
  53. [58] Yang Zhang, Wenxin Xu, Xiaoyan Zhao, Wenjie Wang, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2025. Reinforced Latent Reasoning for LLM-based Recommendation. CoRR abs/2505.19092 (2025).
  54. [59] Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen K. Ahmed, and Yu Wang. 2025. Personalization of Large Language Models: A Survey. Trans. Mach. ...
  55. [60] Zeyu Zhang, Yang Zhang, Haoran Tan, Rui Li, and Xu Chen. 2025. Explicit vs implicit memory: Exploring multi-hop complex reasoning over personalized information. arXiv preprint arXiv:2508.13250 (2025).
  56. [61] Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, and Kaixiang Lin. Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. In ICLR. OpenReview.net.
  57. [63] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT Interaction Logs in the Wild. In ICLR. OpenReview.net.
  58. [64] Xiaoyan Zhao, Ming Yan, Yilun Qiu, Haoting Ni, Yang Zhang, Fuli Feng, Hong Cheng, and Tat-Seng Chua. 2025. SteerX: Disentangled Steering for LLM Personalization. CoRR abs/2510.22256 (2025).
  59. [66] Xiaoyan Zhao, Juntao You, Yang Zhang, Wenjie Wang, Hong Cheng, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2025. Nextquill: Causal preference modeling for enhancing LLM personalization. arXiv:2506.02368 (2025).
  60. [67] Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. 2025. PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants. arXiv preprint arXiv:2506.09902 (2025).
  61. [68] Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. 2024. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization. In ACL (Findings) (Findings of ACL, Vol. ACL 2024). Association for Computational Linguistics, 10586–10613.
  62. [69] Jiachen Zhu, Jianghao Lin, Xinyi Dai, Bo Chen, Rong Shan, Jieming Zhu, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024. Lifelong personalized low-rank adaptation of large language models for recommendation. arXiv preprint arXiv:2408.03533 (2024).
  63. [70] Thomas P. Zollo, Andrew Wei Tung Siah, Naimeng Ye, Ang Li, and Hongseok Namkoong. 2025. PersonalLLM: Tailoring LLMs to Individual Preferences. In ICLR. OpenReview.net.
    Thomas P. Zollo, Andrew Wei Tung Siah, Naimeng Ye, Ang Li, and Hongseok Namkoong. 2025. PersonalLLM: Tailoring LLMs to Individual Preferences. In ICLR. OpenReview.net. Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009