pith. machine review for the scientific record.

arxiv: 2605.07125 · v1 · submitted 2026-05-08 · 💻 cs.IR · cs.AI

Recognition: no theorem link

An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords sequential recommendation · benchmark audit · graph heuristic · shortcut structures · next-item prediction · dataset properties · model evaluation

The pith

A simple untrained graph heuristic matches or beats many trained sequential recommenders on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a basic method using only the last one or two user interactions can retrieve and rank items from a transition graph by feature similarity, achieving performance comparable to or better than complex generative models on many common datasets. This holds without any sequence encoder, training, or generative objective, with notable gains on datasets like Amazon Sports and CDs. The results trace back to shortcut structures in the data, including predictable local transitions, similar features between nearby items, and weak reliance on long histories. If these structures explain the outcomes, then high benchmark scores do not reliably indicate that a model has captured genuine sequential or semantic patterns. The work therefore questions whether current evaluation practices can support strong claims about advances in recommendation modeling.

Core claim

An embarrassingly simple graph heuristic that retrieves candidates from a few-hop item-transition graph starting from the last one or two interactions and ranks them by item-feature similarity matches or outperforms many modern baselines across 10 of 14 datasets, with relative NDCG@10 gains of 38.10% and 44.18% on Amazon Review Sports and CDs. This performance reflects three shortcut structures: low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories. Model rankings shift substantially once these structures are weakened, showing that benchmark success does not always require advanced sequential, semantic, or generative capabilities.

What carries the argument

The simple graph heuristic that retrieves next-item candidates from a few-hop item-transition graph starting from the last one or two interactions and ranks them by item-feature similarity.
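The heuristic above can be sketched in a few lines. This is a minimal illustrative implementation, assuming breadth-first expansion for the few-hop retrieval and cosine similarity against the mean feature vector of the query items; the paper's exact construction may differ.

```python
from collections import defaultdict

import numpy as np


def build_transition_graph(train_sequences):
    """Directed item-transition graph built from training interactions only."""
    graph = defaultdict(set)
    for seq in train_sequences:
        for a, b in zip(seq, seq[1:]):
            graph[a].add(b)
    return graph


def recommend(graph, item_features, last_items, hops=2, k=10):
    """Expand a few hops from the last item(s), then rank candidates by
    cosine similarity to the mean feature vector of the query items."""
    frontier, candidates = set(last_items), set()
    for _ in range(hops):
        frontier = {nxt for item in frontier for nxt in graph.get(item, ())}
        candidates |= frontier
    candidates -= set(last_items)
    query = np.mean([item_features[i] for i in last_items], axis=0)
    query = query / (np.linalg.norm(query) + 1e-12)

    def score(item):
        vec = item_features[item]
        return float(vec @ query) / (np.linalg.norm(vec) + 1e-12)

    return sorted(candidates, key=score, reverse=True)[:k]
```

Nothing here is trained; the only knobs are the hop count and the feature space, which is exactly why competitive scores from such a baseline point at the data rather than the model.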

If this is right

  • Model performance rankings change substantially once datasets are characterized by their shortcut strength.
  • Weakening any one of the three shortcut structures makes the advantages of sophisticated models more visible.
  • Strong benchmark results do not necessarily demonstrate advanced sequential or generative modeling ability.
  • More careful dataset selection and property-level analysis are required to support claims about new recommendation methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New benchmarks could be built by filtering or generating data that deliberately reduces low-branching transitions and feature smoothness to better isolate true sequential modeling needs.
  • The heuristic itself could serve as a minimal baseline that future papers must beat to demonstrate meaningful progress.
  • Similar transition-graph audits might expose shortcut issues in other sequential prediction tasks outside recommendation.

Load-bearing premise

That the strong results arise from the identified shortcut structures in the data rather than from any sequential patterns captured during graph building or feature-based ranking.

What would settle it

Select or synthesize datasets that combine high branching in item transitions, low feature similarity between consecutive items, and strong dependence on long user histories, then check whether the heuristic falls behind trained sequential models on those datasets.
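Such a selection presupposes that shortcut strength can be measured per dataset. Two of the three shortcuts have simple proxies; the sketch below uses illustrative statistics (average out-degree of the transition graph for branching, mean cosine similarity of consecutive items for feature smoothness), not necessarily the paper's exact definitions.

```python
from collections import defaultdict

import numpy as np


def shortcut_diagnostics(sequences, item_features):
    """Return (branching, smoothness): average number of distinct successors
    per item, and mean cosine similarity between consecutive items."""
    successors = defaultdict(set)
    sims = []
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            successors[a].add(b)
            u, v = item_features[a], item_features[b]
            sims.append(float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    branching = float(np.mean([len(s) for s in successors.values()])) if successors else 0.0
    smoothness = float(np.mean(sims)) if sims else 0.0
    return branching, smoothness
```

Low branching and high smoothness would flag a dataset as shortcut-prone; candidate "settling" datasets are those where both probes point the other way.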

Figures

Figures reproduced from arXiv: 2605.07125 by Bingheng Li, Chun How Tan, Daochen Zha, Hanbing Wang, Haoyu Han, Huiji Gao, Hui Liu, Jiliang Tang, Li Ma, Sanjeev Katariya, Stephanie Moyerman, Xin Liu.

Figure 1
Figure 1. The proportion of surveyed sequential recommendation papers utilizing each dataset. Along with this methodological shift, evaluation practice has also become strikingly concentrated. We analyze dataset usage across 94 recent generative-recommendation papers, detailed in Appendix A.
Figure 2
Figure 2. The relative performance gap between the Full-sequence and Last-1 settings. Shortcut 3: limited dependence on long user histories. To test whether each dataset requires long-range user-history information, we compare SASRec and HSTU under two settings: the full-sequence setting and the Last-1 setting, where each prediction only uses the immediately previous item.
Figure 3
Figure 3. Prediction-level comparison between TGH and learned recommenders.
Figure 4
Figure 4. Statistics of the surveyed papers. We collect 94 papers published between 2022 and 2026 that propose or evaluate generative recommendation methods; the full list of paper titles is provided in our code repository.
Original abstract

Sequential recommendation has increasingly shifted toward generative recommenders that combine sequential patterns with semantic item information. Yet these methods are often evaluated on a small set of widely used benchmarks, raising a key question: do these benchmarks actually require the advanced modeling capabilities that modern generative recommenders claim to provide? We conduct a benchmark audit with an intentionally simple graph heuristic. Starting from only the last one or two interacted items, it retrieves candidates from a few-hop item-transition graph and ranks them by item-feature similarity. Despite using no sequence encoder, generative objective, or training, this heuristic matches or outperforms many modern baselines, with relative NDCG@10 improvements of 38.10% and 44.18% over the best competing baseline on Amazon Review Sports and CDs. We show that this behavior reflects shortcut solvability rather than an artifact of one heuristic. We identify three shortcut structures that can make next-item prediction easier than expected: low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories. These shortcuts need not appear together; even one or two strong signals can make simple local retrieval highly competitive, while weakening them makes the benefits of more sophisticated models clearer. Across 14 datasets, model rankings vary substantially with dataset properties, yet the heuristic remains competitive on 10 of them. Our findings suggest that strong performance on standard benchmarks does not always demonstrate advanced sequential, semantic, or generative modeling ability. We call for more careful dataset selection and dataset-level diagnostic analysis when using benchmarks to support claims about new recommendation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a simple, untrained graph heuristic—retrieving candidates from a few-hop item-transition graph starting from the last 1-2 user interactions and ranking by item-feature similarity—matches or exceeds many trained generative sequential recommenders on standard benchmarks. It reports relative NDCG@10 gains of 38.10% and 44.18% over the best baseline on Amazon Sports and CDs, attributes this to three dataset shortcuts (low-branching local transitions, feature-smooth transitions, limited long-history dependence), and demonstrates across 14 datasets that model rankings vary with these properties while the heuristic remains competitive on 10.

Significance. If the central empirical result holds without implementation artifacts, the work is significant as a benchmark audit: it supplies concrete, multi-dataset evidence that strong performance on widely used sequential recommendation benchmarks does not necessarily demonstrate advanced sequential, semantic, or generative modeling. The identification of measurable shortcut structures and the call for dataset-level diagnostics provide a practical framework for future evaluation.

major comments (2)
  1. [Abstract and heuristic description] The construction of the few-hop item-transition graph is not explicitly stated to use only training interactions. If validation or test transitions are included, retrieval from the last 1-2 items can directly surface held-out next-items via shared edges, turning the 'no training, no encoder' heuristic into an implicit oracle. This would explain the reported 38-44% relative gains on Amazon Sports/CDs without reference to the claimed shortcuts (low-branching transitions or feature smoothness). The abstract and method description must add an explicit guarantee and ablation confirming training-only graph construction.
  2. [Shortcut analysis and multi-dataset results] The causal link between the three identified shortcut structures and the heuristic's competitiveness is not convincingly demonstrated. While the paper shows the heuristic is competitive on 10 of 14 datasets, it does not quantify how much of the NDCG@10 margin on Sports/CDs is attributable to each shortcut versus other unstated factors in candidate retrieval or feature ranking. A controlled ablation that systematically weakens each shortcut (e.g., by increasing branching factor or reducing feature correlation) and measures the resulting drop in heuristic performance is needed to support the claim that the benchmarks are 'shortcut-solvable' rather than merely easy for this particular retrieval method.
minor comments (2)
  1. [Abstract] The abstract states 'relative NDCG@10 improvements of 38.10% and 44.18%' but does not name the exact best competing baseline or the absolute NDCG@10 values; adding these numbers would improve interpretability.
  2. [Heuristic description] Notation for the graph (e.g., definition of 'few-hop' and how item-feature similarity is computed) should be introduced with a short equation or pseudocode in the main text rather than deferred to appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important clarifications needed for the heuristic construction and the strength of evidence linking dataset shortcuts to the observed results. We address each point below and have revised the manuscript to improve clarity and support for the claims.

Point-by-point responses
  1. Referee: [Abstract and heuristic description] The construction of the few-hop item-transition graph is not explicitly stated to use only training interactions. If validation or test transitions are included, retrieval from the last 1-2 items can directly surface held-out next-items via shared edges, turning the 'no training, no encoder' heuristic into an implicit oracle. This would explain the reported 38-44% relative gains on Amazon Sports/CDs without reference to the claimed shortcuts (low-branching transitions or feature smoothness). The abstract and method description must add an explicit guarantee and ablation confirming training-only graph construction.

    Authors: We agree that explicit confirmation of training-only graph construction is essential to rule out leakage. In the original implementation, the item-transition graph is built exclusively from training-set interactions; the last 1-2 items serve only as query starting points for candidate retrieval, with no edges or transitions drawn from validation or test data. We will add a clear statement to this effect in both the abstract and the method section of the revised manuscript. In addition, we will include a new ablation that contrasts performance under the training-only graph against an (invalid) version that incorporates test transitions, thereby demonstrating that the reported gains do not rely on oracle-like access to held-out items. revision: yes

  2. Referee: [Shortcut analysis and multi-dataset results] The causal link between the three identified shortcut structures and the heuristic's competitiveness is not convincingly demonstrated. While the paper shows the heuristic is competitive on 10 of 14 datasets, it does not quantify how much of the NDCG@10 margin on Sports/CDs is attributable to each shortcut versus other unstated factors in candidate retrieval or feature ranking. A controlled ablation that systematically weakens each shortcut (e.g., by increasing branching factor or reducing feature correlation) and measures the resulting drop in heuristic performance is needed to support the claim that the benchmarks are 'shortcut-solvable' rather than merely easy for this particular retrieval method.

    Authors: We appreciate the call for stronger causal evidence. The manuscript currently demonstrates the link through systematic variation across 14 datasets: the heuristic remains competitive exactly on those datasets exhibiting one or more of the three shortcuts, while model rankings shift when the shortcuts are weaker. To strengthen this, we will add explicit quantitative measurements of shortcut strength (average local branching factor, feature-transition correlation, and history-dependence statistics) for each dataset and report their correlation with the heuristic's relative NDCG@10 margin. However, controlled interventions that artificially weaken individual shortcuts (e.g., by subsampling edges to increase branching or perturbing features) would require substantial dataset modifications that risk introducing new artifacts and may not preserve the original benchmark semantics. We therefore provide a partial revision by expanding the quantitative correlation analysis and discussion, while acknowledging that full interventional ablations lie beyond the current scope. revision: partial
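The correlation analysis the authors propose here can be prototyped directly: rank datasets by a shortcut statistic and by the heuristic's relative NDCG@10 margin, then compare the ranks. A minimal tie-free Spearman sketch (any dataset values would come from the paper's tables; none are reproduced here):

```python
import numpy as np


def spearman(x, y):
    """Spearman rank correlation via a rank transform (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry) / (np.linalg.norm(rx) * np.linalg.norm(ry))
```

A strongly negative correlation between, say, branching factor and the heuristic's margin would support the low-branching shortcut account without requiring interventional dataset edits.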

Circularity Check

0 steps flagged

No circularity: empirical audit with fixed heuristic and direct comparisons

full rationale

This is an empirical benchmark audit that applies a fixed, non-learned graph heuristic (last 1-2 items, few-hop retrieval, feature similarity ranking) to existing datasets and compares its NDCG against trained baselines. No equations, parameters, or derivations are introduced that reduce by construction to fitted inputs or self-referential definitions. Shortcut identification is performed by inspecting observable dataset statistics (branching factors, feature smoothness) after the fact, not by any self-defining construction. All reported gains are direct empirical measurements; the paper does not claim a mathematical derivation whose validity depends on its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that the described heuristic achieves competitive performance and on the interpretation that this performance arises from three data properties rather than from hidden modeling power in the heuristic itself.

axioms (1)
  • domain assumption Few-hop item-transition graphs plus feature similarity capture the dominant local patterns in the evaluated datasets
    Invoked when the heuristic is presented as sufficient to match complex models without sequence encoders.

pith-pipeline@v0.9.0 · 5617 in / 1326 out tokens · 40037 ms · 2026-05-11T00:59:17.132367+00:00 · methodology

discussion (0)

