HorizonBench: Long-Horizon Personalization with Evolving Preferences
Pith reviewed 2026-05-10 05:48 UTC · model grok-4.3
The pith
Frontier models fail to track evolving user preferences because they do not update beliefs in response to new life events.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We define long-horizon personalization as the task of inferring preference changes from life events in extended dialogues. Using a data generator driven by a structured mental state graph, we build HorizonBench: 4,245 test items across 360 simulated users, each with a conversation history averaging ~4,300 turns. Across 25 frontier models, performance tops out at 52.8 percent, and over a third of errors on evolved preferences involve selecting the original, outdated value. This failure holds regardless of context length or how explicitly the change is stated, pinpointing state-tracking as the central limitation.
What carries the argument
The structured mental state graph used to generate conversations, providing explicit provenance for each preference change over simulated 6-month periods.
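The paper does not publish its generator code, but the idea of a graph with explicit provenance for every preference change can be sketched minimally. All class and method names below are illustrative assumptions, not the authors' API: each preference node carries its current value plus a full history of (timestamp, old value, new value, triggering event) tuples.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "structured mental state graph": preference nodes
# hold a current value plus provenance, and life events update them in place.

@dataclass
class PreferenceNode:
    key: str                                     # e.g. "commute_mode"
    value: str                                   # current preference value
    history: list = field(default_factory=list)  # (ts, old, new, event)

@dataclass
class MentalStateGraph:
    preferences: dict = field(default_factory=dict)

    def state(self, key):
        return self.preferences[key].value

    def apply_event(self, timestamp, event, key, new_value):
        """A life event (or initial statement) updates a preference,
        recording provenance for the change."""
        node = self.preferences.get(key)
        if node is None:
            node = self.preferences[key] = PreferenceNode(key, new_value)
            old = None
        else:
            old = node.value
            node.value = new_value
        node.history.append((timestamp, old, new_value, event))

g = MentalStateGraph()
g.apply_event("2025-01-03", "moved to suburbs", "commute_mode", "car")
g.apply_event("2025-04-18", "knee surgery", "commute_mode", "transit")
assert g.state("commute_mode") == "transit"
assert g.preferences["commute_mode"].history[1][3] == "knee surgery"
```

Because every change is logged with its triggering event, items probing "evolved" versus "original" values can be generated mechanically, which is what gives the benchmark its ground-truth provenance.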
If this is right
- Improving long-horizon personalization will require specific advances in dynamic belief updating rather than relying on larger context windows alone.
- Models that do not track state changes will give incorrect responses based on outdated preferences in ongoing user interactions.
- Evaluation benchmarks for user modeling must include ground-truth evolution to properly diagnose tracking failures.
- Architectures designed for theory-of-mind reasoning need to incorporate mechanisms for handling preference shifts explicitly.
Where Pith is reading between the lines
- Systems could maintain a separate, updatable record of user preferences outside the model's context to avoid tracking errors.
- This approach to benchmarking could be adapted to other domains involving long-term state changes, such as multi-session planning or personalized recommendations.
- Real deployments might need explicit confirmation steps when models detect potential preference updates to ensure accuracy.
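The first and third bullets above combine naturally: an external preference store where the latest timestamped write wins, with an unconfirmed flag that can trigger a confirmation step. This is a minimal sketch of that design; the class name, flag, and storage shape are assumptions, not anything from the paper.

```python
# Sketch of a separate, updatable preference record kept outside the model's
# context. Reads always return the most recent write, so a "reversion to the
# original value" error cannot occur at read time.

class PreferenceStore:
    def __init__(self):
        self._log = {}  # key -> list of (timestamp, value, confirmed)

    def write(self, key, value, ts, confirmed=False):
        self._log.setdefault(key, []).append((ts, value, confirmed))

    def read(self, key):
        """Latest write wins (ISO timestamps compare lexicographically)."""
        ts, value, confirmed = max(self._log[key])
        return value, confirmed

store = PreferenceStore()
store.write("diet", "vegetarian", "2025-01-10", confirmed=True)
store.write("diet", "pescatarian", "2025-03-22")  # detected update, unconfirmed
value, confirmed = store.read("diet")
assert value == "pescatarian" and not confirmed
```

The unconfirmed flag on the latest value is where a deployment could insert the explicit confirmation step: ask the user before acting on a preference the system only inferred.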
Load-bearing premise
The simulated dialogues generated from the mental state graph reflect the way real users naturally evolve their preferences in response to life events.
What would settle it
Running the same models on a collection of real human multi-month conversation logs with independently verified preference changes would reveal whether the observed state-tracking failures occur at similar rates.
Figures
Original abstract
User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user's originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HorizonBench, a benchmark for long-horizon personalization where user preferences evolve over 6-month simulated timelines. It constructs 4,245 test items from 360 users via a data generator based on a structured mental state graph that provides ground-truth provenance for every preference change. Experiments across 25 frontier models show the best performer at 52.8% accuracy with most at or below the 20% chance baseline; error analysis indicates that models frequently revert to originally stated preferences (>1/3 of errors on evolved items) rather than tracking updates, and this pattern holds across context lengths and expression explicitness, positioning state-tracking as the primary bottleneck.
Significance. If the synthetic generator produces representative preference dynamics, the benchmark would provide a valuable diagnostic tool for long-context modeling, memory architectures, and theory-of-mind capabilities in user modeling, with the explicit ground-truth provenance enabling precise failure-mode analysis that existing resources lack. The scale (long histories averaging 4,300 turns) and multi-model evaluation are strengths for identifying systematic limitations.
major comments (3)
- [§3] §3 (Generator and Mental State Graph): The central claim that state-tracking is the primary bottleneck rests on the assumption that the structured mental state graph produces preference evolutions comparable to real users. The manuscript provides no validation, ablation, or comparison against naturalistic data (e.g., gradual drift, conflicting signals, or unstated context), which is load-bearing because the observed reversion errors could be artifacts of the graph's explicit, discrete life-event updates rather than a general architectural limitation.
- [§5] §5 (Experiments): The error analysis reports reversion to original preferences in over one-third of evolved-item errors and claims persistence across context lengths and explicitness levels, but lacks statistical tests for the pattern's significance, per-model breakdowns, or controls for generator-specific artifacts (such as how life events are verbalized). This weakens the diagnosis that state-tracking, rather than other factors like prompt sensitivity, is the dominant issue.
- [Abstract and §4] Abstract and §4 (Benchmark Items): The 20% chance baseline is invoked without specifying the number of options per multiple-choice item, the construction of distractors, or how the baseline accounts for the distribution of preference values in the mental state graph. This detail is needed to confirm the baseline is fair and that the reported sub-chance performance is not an artifact of the evaluation setup.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly discuss the limitations of synthetic data for generalizing claims about real-user personalization.
- [Results figures] Figure captions and axis labels in the results section should clarify how context length is operationalized across models with varying window sizes.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and describe the revisions we will make to improve the manuscript.
Point-by-point responses
Referee: [§3] §3 (Generator and Mental State Graph): The central claim that state-tracking is the primary bottleneck rests on the assumption that the structured mental state graph produces preference evolutions comparable to real users. The manuscript provides no validation, ablation, or comparison against naturalistic data (e.g., gradual drift, conflicting signals, or unstated context), which is load-bearing because the observed reversion errors could be artifacts of the graph's explicit, discrete life-event updates rather than a general architectural limitation.
Authors: We agree that empirical validation against real-user data would strengthen the work. However, the absence of long-horizon conversational datasets with ground-truth provenance for preference changes is the central motivation for creating the structured mental state graph. The graph models evolution via sequences of life events that can produce both incremental drifts and conflicting signals. We will expand §3 with additional details on these modeling choices and add an explicit limitations subsection acknowledging the synthetic nature of the data and the value of future naturalistic validation. This clarifies rather than alters the diagnostic focus of the benchmark. revision: partial
Referee: [§5] §5 (Experiments): The error analysis reports reversion to original preferences in over one-third of evolved-item errors and claims persistence across context lengths and explicitness levels, but lacks statistical tests for the pattern's significance, per-model breakdowns, or controls for generator-specific artifacts (such as how life events are verbalized). This weakens the diagnosis that state-tracking, rather than other factors like prompt sensitivity, is the dominant issue.
Authors: We accept this critique and will strengthen the analysis. The revised §5 will include statistical tests (binomial and ANOVA) for the significance of the reversion pattern across conditions. We will add a per-model breakdown table in the appendix. We will also report a new ablation varying life-event verbalization styles to control for generator artifacts. These changes will provide stronger support for identifying state-tracking as the primary bottleneck. revision: yes
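The binomial test promised here is straightforward to state: under a null where an erring model picks uniformly among the four incorrect options, the original value is chosen with probability 0.25, and one tests whether the observed reversion rate (over a third) exceeds that. The sketch below uses invented counts for illustration; the paper does not report raw error counts.

```python
from math import comb

# Exact one-sided binomial test: P(X >= k) for X ~ Binomial(n, p).
# Null: an erring model selects the original value with p0 = 1/4
# (uniform over the four wrong options of a 5-way item).

def binom_sf(k, n, p):
    """Survival function P(X >= k), computed exactly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_errors = 300   # hypothetical: total errors on evolved-preference items
n_revert = 110   # hypothetical: errors selecting the original value
p_value = binom_sf(n_revert, n_errors, 0.25)
print(f"reversion rate {n_revert / n_errors:.2f}, one-sided p = {p_value:.2e}")
```

At these illustrative counts the reversion rate (~0.37) sits several standard deviations above the 0.25 null, so the pattern would be highly significant; the real question the referee raises is whether it survives per-model and per-condition breakdowns.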
Referee: [Abstract and §4] Abstract and §4 (Benchmark Items): The 20% chance baseline is invoked without specifying the number of options per multiple-choice item, the construction of distractors, or how the baseline accounts for the distribution of preference values in the mental state graph. This detail is needed to confirm the baseline is fair and that the reported sub-chance performance is not an artifact of the evaluation setup.
Authors: We apologize for the omission. Each item is a 5-option multiple-choice question: the ground-truth updated preference, the original preference, and three distractors sampled from the user's other values in the mental state graph. The 20% baseline is the uniform random selection rate over these five options. We will update both the abstract and §4 to state this explicitly, including the distractor construction procedure. revision: yes
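The item construction the authors describe can be sketched directly: five options comprising the updated preference, the original preference, and three distractors from the user's other values, with uniform guessing giving the 20% baseline. The sampling and shuffling details below go beyond what the rebuttal states and are assumptions.

```python
import random

# Sketch of the 5-option item assembly described in the rebuttal.

def build_item(updated, original, other_values, rng):
    """Return shuffled options and the index of the correct (updated) value."""
    pool = [v for v in other_values if v not in (updated, original)]
    distractors = rng.sample(pool, 3)
    options = [updated, original] + distractors
    rng.shuffle(options)
    return options, options.index(updated)

rng = random.Random(0)
options, answer_idx = build_item(
    updated="transit", original="car",
    other_values=["bicycle", "walking", "carpool", "motorbike"], rng=rng)
assert len(options) == 5 and options[answer_idx] == "transit"
# Uniform guessing over the five options yields the 20% chance baseline.
```

Always including the original value as an option is what makes the reversion error measurable; a fairness check would also verify the correct answer's position is uniformly distributed after shuffling.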
Circularity Check
No circularity: benchmark and evaluation are empirically independent
Full rationale
The paper defines long-horizon personalization, introduces an independent structured mental state graph generator to create synthetic 6-month conversations with explicit ground-truth provenance for every preference change, builds HorizonBench from it (4,245 items across 360 users), and evaluates 25 external frontier models on accuracy and error patterns such as reverting to original preferences. All reported results (e.g., best model at 52.8%, over one-third of errors on evolved items) are direct empirical measurements against this generated ground-truth rather than quantities defined by the paper's own fitted parameters, self-citations, or ansatzes. The identification of state-tracking as the primary bottleneck follows from observed model behaviors across context lengths and explicitness levels on the held-out benchmark, with no load-bearing step reducing by construction to the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: User preferences evolve in response to life events according to structured rules that can be encoded in a graph
invented entities (1)
- structured mental state graph (no independent evidence)