pith. machine review for the scientific record.

arxiv: 2604.17259 · v1 · submitted 2026-04-19 · 💻 cs.IR · cs.AI · cs.CL

Recognition: unknown

HORIZON: A Benchmark for In-the-wild User Behaviour Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:16 UTC · model grok-4.3

classification 💻 cs.IR cs.AI cs.CL
keywords user behavior modeling · benchmark · sequential recommendation · cross-domain recommendation · temporal generalization · unseen user modeling · LLM baselines

The pith

The HORIZON benchmark requires user models to generalize across domains, users, and long time periods, rather than perform single-domain next-item prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HORIZON as a benchmark that reformulates user modeling around a large cross-domain dataset drawn from Amazon Reviews, covering 54 million users and 35 million items with long-term interaction histories. Existing benchmarks stay narrow by focusing on short sessions and missing-positive prediction within one domain, which fails to match how real users behave across multiple areas over extended periods. HORIZON adds tasks and evaluation setups for temporal generalization, handling different sequence lengths, and modeling previously unseen users, using metrics aimed at overall behavior understanding. Benchmarking results with sequential architectures and LLM baselines that use long histories show clear performance gaps under these conditions. If correct, this would push development toward models that remain effective when user data spans domains and time rather than assuming repeated same-domain patterns.
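
To make the protocol shift concrete, here is a minimal sketch contrasting the conventional per-user leave-one-out split with a global temporal split of the kind HORIZON's temporal-generalization setup implies. The `(user_id, item_id, timestamp)` tuple schema and function names are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

# Each interaction is (user_id, item_id, timestamp); schema is illustrative.
def leave_one_out_split(interactions):
    """Conventional protocol: per user, hold out the chronologically last item.
    Interactions that postdate one user's test item can still appear in
    another user's training data, leaking future information."""
    by_user = defaultdict(list)
    for rec in interactions:
        by_user[rec[0]].append(rec)
    train, test = [], []
    for recs in by_user.values():
        recs.sort(key=lambda r: r[2])   # chronological order within the user
        train.extend(recs[:-1])
        test.append(recs[-1])
    return train, test

def temporal_split(interactions, cutoff):
    """Temporal protocol: a single global cutoff; everything after it is
    held out, preserving deployment-time ordering across all users."""
    train = [r for r in interactions if r[2] <= cutoff]
    test = [r for r in interactions if r[2] > cutoff]
    return train, test
```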

Core claim

HORIZON reformulates user modeling along three axes of dataset, task, and evaluation by creating a cross-domain, long-horizon version of Amazon Reviews data. It requires models to handle generalization across domains, across different users, and across time, with new setups for temporal shifts, sequence-length variation, and unseen users, plus metrics that test broad behavior understanding rather than isolated next-item accuracy. Experiments with popular sequential recommenders and LLM baselines that incorporate full histories reveal that current methods fall short of these real-world requirements.
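
For orientation, the "isolated next-item accuracy" that HORIZON moves beyond is usually scored with ranking metrics like the following; this is a generic sketch of standard Hit@K and NDCG@K, not the paper's metric definitions.

```python
import math

def hit_at_k(ranked_items, target, k=10):
    """1 if the held-out target appears among the top-k ranked items."""
    return int(target in ranked_items[:k])

def ndcg_at_k(ranked_items, target, k=10):
    """Single-target NDCG: credit discounted by the target's rank."""
    for rank, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(rank + 2)   # rank 0 scores 1.0
    return 0.0
```

Under HORIZON-style evaluation these would presumably be averaged separately over in-distribution versus unseen users and over pre- versus post-cutoff periods, rather than over one pooled test set.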

What carries the argument

The HORIZON benchmark, built on a reformulated cross-domain long-term interaction dataset from Amazon Reviews, which supports pretraining and evaluation under heterogeneous conditions.
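
Mechanically, a cross-domain reformulation of this kind amounts to merging per-domain interaction streams into one chronologically ordered, domain-tagged history per user. The sketch below is an assumption about the shape of that step, not the paper's pipeline.

```python
from collections import defaultdict

def build_cross_domain_histories(domain_streams):
    """domain_streams: {domain: [(user_id, item_id, timestamp), ...]}.
    Returns one globally time-ordered, domain-tagged sequence per user."""
    histories = defaultdict(list)
    for domain, stream in domain_streams.items():
        for user_id, item_id, ts in stream:
            histories[user_id].append((ts, domain, item_id))
    for seq in histories.values():
        seq.sort()   # interleave domains by timestamp
    return histories

# A user who buys books, then electronics, then books again becomes one
# interleaved sequence that single-domain benchmarks would fragment:
toy = {"Books": [("u1", "b9", 100), ("u1", "b2", 300)],
       "Electronics": [("u1", "e5", 200)]}
print(build_cross_domain_histories(toy)["u1"])
# [(100, 'Books', 'b9'), (200, 'Electronics', 'e5'), (300, 'Books', 'b2')]
```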

If this is right

  • Sequential recommendation models must incorporate mechanisms for cross-domain transfer and retention of long interaction histories.
  • Evaluation protocols should shift from single-domain next-item accuracy toward metrics that assess generalization to new users and time periods.
  • LLM-based user models can be directly tested for robustness when histories span multiple domains and extended durations.
  • Research priorities move toward building temporally stable and domain-agnostic user representations for deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread adoption of HORIZON could redirect benchmark design in recommendation systems away from isolated session prediction toward lifelong user modeling.
  • The benchmark setup naturally connects to problems in continual learning, where models must adapt as new domains and users appear over time.
  • One testable extension is whether pretraining on the full HORIZON corpus transfers to improved performance in live e-commerce systems with mixed product categories.

Load-bearing premise

The reformulated Amazon Reviews dataset with its cross-domain and long-term interactions accurately represents diverse real-world user behaviors without major selection or reporting biases from the original platform data.

What would settle it

A deployment study where models that score highest on HORIZON tasks show no measurable improvement in actual multi-domain user retention or satisfaction over time, or where models that underperform on HORIZON still succeed in live heterogeneous environments.

Figures

Figures reproduced from arXiv: 2604.17259 by Amit Sharma, Arnav Goel, Bhawna Paliwal, Bishal Santra, Pranjal A Chitale.

Figure 1
Figure 1: Proposed evaluation splits on the HORIZON benchmark for Task 1. (n−2) interactions form the training set. This approach has been widely deployed for evaluating user modeling architectures in recent years (Sun, 2023) but can often leak future interactions into training data, violating the temporal order of real-world scenarios. This leads to inflated performance metrics that don't reflect practical deploym… view at source ↗
Figure 2
Figure 2: Pipeline Detailing the LLM Generation, Retrieval and Evaluation Process Proposed for Tasks 2 and 3. view at source ↗
Figure 3
Figure 3: Histogram Depicting the Frequency Distribu… view at source ↗
Figure 4
Figure 4: Line Plot Depicting the Temporal Distribution… view at source ↗
Figure 5
Figure 5: Frequency Distribution of Products in the… view at source ↗
Figure 6
Figure 6: t-SNE depicting the distinct user topic distributions in the in-distribution and OOD users. view at source ↗
read the original abstract

User behavior in the real world is diverse, cross-domain, and spans long time horizons. Existing user modeling benchmarks however remain narrow, focusing mainly on short sessions and next-item prediction within a single domain. Such limitations hinder progress toward robust and generalizable user models. We present HORIZON, a new benchmark that reformulates user modeling along three axes i.e. dataset, task, and evaluation. Built from a large-scale, cross-domain reformulation of Amazon Reviews, HORIZON covers 54M users and 35M items, enabling both pretraining and realistic evaluation of models in heterogeneous environments. Unlike prior benchmarks, it challenges models to generalize across domains, users, and time, moving beyond standard missing-positive prediction in the same domain. We propose new tasks and evaluation setups that better reflect real-world deployment scenarios. These include temporal generalization, sequence-length variation, and modeling unseen users, with metrics designed to assess general user behavior understanding rather than isolated next-item prediction. We benchmark popular sequential recommendation architectures alongside LLM-based baselines that leverage long-term interaction histories. Our results highlight the gap between current methods and the demands of real-world user modeling, while establishing HORIZON as a foundation for research on temporally robust, cross-domain, and general-purpose user models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HORIZON, a benchmark for in-the-wild user behavior modeling constructed via large-scale cross-domain reformulation of the Amazon Reviews dataset (54M users, 35M items). It redefines user modeling along dataset, task, and evaluation axes, proposing tasks for temporal generalization, sequence-length variation, and unseen-user modeling that extend beyond single-domain next-item prediction. The paper benchmarks sequential recommendation architectures and LLM-based models on long-term histories, reporting performance gaps that illustrate limitations of existing methods for real-world deployment.

Significance. If the reformulated dataset and tasks validly capture diverse, cross-domain, long-horizon behaviors, HORIZON could provide a valuable new foundation for research on generalizable user models, shifting the field from narrow session-based prediction toward more realistic evaluation. The scale and explicit focus on generalization across domains/users/time are clear strengths, as is the inclusion of both traditional and LLM baselines.

major comments (2)
  1. [§3] Dataset Construction: The claim that the reformulated Amazon Reviews data enables realistic evaluation of generalization across domains, users, and time is load-bearing for the entire contribution, yet the manuscript provides no quantitative analysis or mitigation of platform-specific selection biases (e.g., reviewer self-selection, sparse self-reported interactions, or incomplete temporal coverage). This directly affects whether observed gaps reflect real-world demands or benchmark artifacts.
  2. [§5] Experiments and Results: The performance gaps reported for baselines on the new tasks lack error bars, statistical significance tests, or explicit details on data splits, preprocessing, and hyperparameter choices. Without these, it is difficult to assess whether the highlighted limitations of current methods are robust or sensitive to implementation decisions.
minor comments (2)
  1. [Abstract] The abstract states that new metrics assess 'general user behavior understanding rather than isolated next-item prediction,' but the specific metric definitions and how they differ from standard ranking metrics are not summarized early in the paper.
  2. [Figures/Tables] Figure captions and table headers could more explicitly link results to the three proposed axes (dataset/task/evaluation) to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recognition of HORIZON's potential to advance research on generalizable user models. We address each major comment below and commit to revisions that strengthen the manuscript's rigor and transparency.

read point-by-point responses
  1. Referee: [§3] Dataset Construction: The claim that the reformulated Amazon Reviews data enables realistic evaluation of generalization across domains, users, and time is load-bearing for the entire contribution, yet the manuscript provides no quantitative analysis or mitigation of platform-specific selection biases (e.g., reviewer self-selection, sparse self-reported interactions, or incomplete temporal coverage). This directly affects whether observed gaps reflect real-world demands or benchmark artifacts.

    Authors: We acknowledge that Amazon Reviews, as a self-reported dataset, inherently carries selection biases such as reviewer self-selection and variable temporal coverage. Our reformulation prioritizes cross-domain and long-horizon structures to better approximate real-world user behavior than single-domain next-item benchmarks, but we agree that explicit quantification of these biases is needed to support the generalization claims. In the revised manuscript, we will add quantitative analyses of user activity distributions, domain-specific temporal coverage, and interaction sparsity, along with a dedicated discussion of potential artifacts and their implications for the observed performance gaps. revision: yes
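
A minimal sketch of the audit the authors commit to, assuming a pandas table with user_id, item_id, domain, and timestamp columns; the schema is hypothetical, not taken from the paper or rebuttal.

```python
import pandas as pd

def audit_bias(df: pd.DataFrame) -> None:
    """Quantify the distributions the rebuttal promises to report."""
    per_user = df.groupby("user_id").size()
    print("user activity distribution:\n", per_user.describe())  # self-selection skew
    coverage = df.groupby("domain")["timestamp"].agg(["min", "max", "count"])
    print("temporal coverage per domain:\n", coverage)            # uneven time spans
    sparsity = 1 - len(df) / (df["user_id"].nunique() * df["item_id"].nunique())
    print(f"interaction-matrix sparsity: {sparsity:.6f}")
```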

  2. Referee: [§5] Experiments and Results: The performance gaps reported for baselines on the new tasks lack error bars, statistical significance tests, or explicit details on data splits, preprocessing, and hyperparameter choices. Without these, it is difficult to assess whether the highlighted limitations of current methods are robust or sensitive to implementation decisions.

    Authors: We agree that greater statistical rigor and experimental transparency are essential for validating the reported gaps. The revised manuscript will include error bars (standard deviations across multiple runs), statistical significance tests for key performance differences, and comprehensive details on data splits, preprocessing pipelines, and hyperparameter selection in the main text and an expanded appendix to ensure reproducibility and robustness assessment. revision: yes
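
For illustration, the promised error bars and significance tests could be implemented as below; the metric values are placeholders, not results from the paper, and SciPy's paired t-test stands in for whatever test the authors ultimately adopt.

```python
import numpy as np
from scipy import stats

# Per-seed NDCG@10 for two models evaluated on identical splits
# (placeholder numbers for illustration only).
ndcg_a = np.array([0.112, 0.118, 0.109, 0.115, 0.111])
ndcg_b = np.array([0.097, 0.101, 0.095, 0.099, 0.098])

print(f"model A: {ndcg_a.mean():.4f} +/- {ndcg_a.std(ddof=1):.4f}")
print(f"model B: {ndcg_b.mean():.4f} +/- {ndcg_b.std(ddof=1):.4f}")

res = stats.ttest_rel(ndcg_a, ndcg_b)   # paired: same seeds and splits
print(f"paired t-test: t={res.statistic:.2f}, p={res.pvalue:.4f}")
```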

Circularity Check

0 steps flagged

No circularity: benchmark reformulation and task definition are constructive, not self-referential

full rationale

The paper creates HORIZON by reformulating Amazon Reviews into a cross-domain, long-term user interaction dataset and defines new generalization tasks (temporal, cross-user, cross-domain). No derivations, equations, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. Baselines are run on the new splits without any self-citation chain or ansatz that bears the central claim. The reported performance gap is an empirical observation on the released benchmark, not a logical tautology. This matches the expected honest non-finding for a dataset/task paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no mathematical derivations, free parameters, or new physical entities; it rests on the domain assumption that Amazon Reviews data can proxy real-world cross-domain user behavior.

axioms (1)
  • domain assumption: Amazon product reviews can serve as a proxy for diverse, cross-domain, long-horizon user behavior.
    The entire benchmark is constructed by reformulating this existing dataset.

pith-pipeline@v0.9.0 · 5540 in / 1302 out tokens · 46939 ms · 2026-05-10T06:16:07.668424+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1] Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders. arXiv preprint arXiv:2403.03952.
  2. [2] Wu, Fangzhao; Qiao, Ying; Chen, Jiun-Hung; Wu, Chuhan; Qi, Tao; Lian, Jianxun; Liu, Danyang; Xie, Xing; Gao, Jianfeng; Wu, Winnie; Zhou, Ming. MIND: A Large-scale Dataset for News Recommendation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.331.
  3. [3] Anand, Avinash; Goel, Arnav; Hira, Medha; Buldeo, Snehal; Kumar, Jatin; Verma, Astha; Gupta, Rushali; Shah, Rajiv Ratn. Big Data and Artificial Intelligence: 11th International Conference, BDA 2023, Delhi, India, December 7–9, 2023, Proceedings. doi:10.1007/978-3-031-49601-1_4.
  4. [4] Kapuriya, Janak; Shaikh, Anwar; Goel, Arnav; Hira, Medha; Singh, Apoorv; Saraf, Jay; Sanjana; Nauriyal, Vaibhav; Anand, Avinash; Wang, Zhengkui; Shah, Rajiv Ratn. Proceedings of the 2nd International Workshop on Large Vision-Language Model Learning and Applications, 2025. doi:10.1145/3728483.376...
  5. [5] Attributing Culture-Conditioned Generations to Pretraining Corpora. arXiv preprint, 2025.
  6. [6] Advancements in Scientific Controllable Text Generation Methods. arXiv preprint, 2023.
  7. [7] Cho, Eunjoon; Myers, Seth A.; Leskovec, Jure. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011. doi:10.1145/2020408.2020579.
  8. [8] Meng, Zaiqiao; McCreadie, Richard; Macdonald, Craig; Ounis, Iadh. Proceedings of the 14th ACM Conference on Recommender Systems, 2020. doi:10.1145/3383313.3418479.
  9. [9] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  10. [10] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint, 2024.
  11. [11] Qwen3 Technical Report. arXiv preprint, 2025.
  12. [12] Sobkowicz, Antoni; Stokowiec, Wojciech.
  13. [13] The Faiss Library. arXiv preprint arXiv:2401.08281.
  14. [14] Jin, Wei; Mao, Haitao; Li, Zheng; Jiang, Haoming; Luo, Chen; Wen, Hongzhi; Han, Haoyu; Lu, Hanqing; Wang, Zhengyang; Li, Ruirui; Li, Zhen; Cheng, Monica; Goutam, Rahul; Zhang, Haiyang; Subbian, Karthik; Wang, Suhang; Sun, Yizhou; Tang, Jiliang; Yin, Bing; Tang, Xianfeng. Proceedings of the 37th International Conference on Neural Information Processing Systems.
  15. [15] Harper, F. Maxwell; Konstan, Joseph A. The MovieLens Datasets: History and Context. 2015. doi:10.1145/2827872.
  16. [16] Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. Proceedings of the 14th ACM Conference on Recommender Systems, 2020.
  17. [17] Take a Fresh Look at Recommender Systems from an Evaluation Standpoint. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023.
  18. [18] A Critical Study on Data Leakage in Recommender System Offline Evaluation. ACM Transactions on Information Systems, 2023.
  19. [19] On the Generalizability and Predictability of Recommender Systems. Advances in Neural Information Processing Systems.
  20. [20] MIND: A Large-scale Dataset for News Recommendation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
  21. [21] COSPLAY: Concept Set Guided Personalized Dialogue Generation Across Both Party Personas · Hou, Yupeng; Hu, Binbin; Zhang, Zhiqiang; Zhao, Wayne Xin. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022. doi:10.1145/3477495.3531955.
  22. [22] Session-based Recommendations with Recurrent Neural Networks. arXiv preprint, 2016.
  23. [23] Kang, Wang-Cheng; McAuley, Julian. Self-Attentive Sequential Recommendation.
  24. [24] Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation. Advances in Neural Information Processing Systems.
  25. [25] Sun, Fei; Liu, Jun; Wu, Jian; Pei, Changhua; Lin, Xiao; Ou, Wenwu; Jiang, Peng. Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19), Beijing, China, 2019. doi:10.1145/3357384.3357895.
  26. [26] Pancha, Nikil; Zhai, Andrew; Leskovec, Jure; Rosenberg, Charles. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022. doi:10.1145/3534678.3539156.
  27. [27] Betello, Filippo; Purificato, Antonio; Siciliano, Federico; Trappolini, Giovanni; Bacciu, Andrea; Tonellotto, Nicola; Silvestri, Fabrizio. A Reproducible Analysis of Sequential Recommender Systems.
  28. [28] The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.
  29. [29] A Survey on Session-based Recommender Systems. ACM Computing Surveys (CSUR), 2021.
  30. [30] Justifying Recommendations Using Distantly-Labeled Reviews and Fine-Grained Aspects. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  31. [31] USE: Dynamic User Modeling with Stateful Sequence Models. arXiv preprint arXiv:2403.13344.
  32. [32] VIKI: Systematic Cross-Platform Profile Inference of Online Users. arXiv preprint arXiv:2503.14772.
  33. [33] Towards Universal Sequence Representation Learning for Recommender Systems. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  34. [34] CROSS: Cross-Platform Recommendation for Social E-Commerce. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  35. [35] Cross-Domain Recommendation: Challenges, Progress, and Prospects. arXiv preprint arXiv:2103.01696.
  36. [36] Koren, Yehuda; Bell, Robert; Volinsky, Chris. Matrix Factorization Techniques for Recommender Systems.
  37. [37] Zhao, Wayne Xin; Mu, Shanlei; Hou, Yupeng; Lin, Zihan; Chen, Yushuo; Pan, Xingyu; Li, Kaiyuan; Lu, Yujie; Wang, Hui; Tian, Changxin; Min, Yingqian; Feng, Zhichao; Fan, Xinyan; Chen, Xu; Wang, Pengfei; Ji, Wendi; Li, Yaliang; Wang, Xiaoling; et al. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms.
  38. [38] Zhao, Wayne Xin; Hou, Yupeng; Pan, Xingyu; Yang, Chen; Zhang, Zeyu; Lin, Zihan; Zhang, Jingsen; Bian, Shuqing; Tang, Jiakai; Sun, Wenqi; Chen, Yushuo; Xu, Lanling; Zhang, Gaowei; Tian, Zhen; Tian, Changxin; Mu, Shanlei; Fan, Xinyan; Chen, Xu; et al. RecBole 2.0: Towards a More Up-to-Date Recommendation Library.
  39. [39] Collaborative Filtering vs. Content-Based Filtering: Differences and Similarities. arXiv preprint arXiv:1912.08932.
  40. [40] Neural Collaborative Filtering vs. Matrix Factorization Revisited. Proceedings of the 14th ACM Conference on Recommender Systems.
  41. [41] LlamaRec: Two-Stage Recommendation Using Large Language Models for Ranking. arXiv preprint arXiv:2311.02089, 2023.
  42. [42] Can Small Language Models Be Good Reasoners for Sequential Recommendation? Proceedings of the ACM Web Conference 2024.
  43. [43] Klenitskiy, Anton; Volodkevich, Anna; Pembek, Anton; Vasilev, Alexey. Proceedings of the 18th ACM Conference on Recommender Systems, 2024. doi:10.1145/3640457.3688195.
  44. [44] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. 2022.