pith. machine review for the scientific record.

arxiv: 2605.07125 · v1 · submitted 2026-05-08 · 💻 cs.IR · cs.AI

Recognition: no theorem link

An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords sequential recommendation · benchmark audit · graph heuristic · shortcut structures · next-item prediction · dataset properties · model evaluation

The pith

A simple untrained graph heuristic matches or beats many trained sequential recommenders on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a basic method using only the last one or two user interactions can retrieve and rank items from a transition graph by feature similarity, achieving performance comparable to or better than complex generative models on many common datasets. This holds without any sequence encoder, training, or generative objective, with notable gains on datasets like Amazon Sports and CDs. The results trace back to shortcut structures in the data, including predictable local transitions, similar features between nearby items, and weak reliance on long histories. If these structures explain the outcomes, then high benchmark scores do not reliably indicate that a model has captured genuine sequential or semantic patterns. The work therefore questions whether current evaluation practices can support strong claims about advances in recommendation modeling.

Core claim

An embarrassingly simple graph heuristic that retrieves candidates from a few-hop item-transition graph starting from the last one or two interactions and ranks them by item-feature similarity matches or outperforms many modern baselines across 10 of 14 datasets, with relative NDCG@10 gains of 38.10% and 44.18% on Amazon Review Sports and CDs. This performance reflects three shortcut structures: low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories. Model rankings shift substantially once these structures are weakened, showing that benchmark success does not always require advanced sequential, semantic, or generative capabilities.

What carries the argument

The simple graph heuristic that retrieves next-item candidates from a few-hop item-transition graph starting from the last one or two interactions and ranks them by item-feature similarity.
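The heuristic above can be sketched in a few lines. This is a minimal illustrative implementation, assuming breadth-first expansion for the few-hop retrieval and cosine similarity against the mean feature vector of the query items; the paper's exact construction may differ.

```python
from collections import defaultdict

import numpy as np


def build_transition_graph(train_sequences):
    """Directed item-transition graph built from training interactions only."""
    graph = defaultdict(set)
    for seq in train_sequences:
        for a, b in zip(seq, seq[1:]):
            graph[a].add(b)
    return graph


def recommend(graph, item_features, last_items, hops=2, k=10):
    """Expand a few hops from the last item(s), then rank candidates by
    cosine similarity to the mean feature vector of the query items."""
    frontier, candidates = set(last_items), set()
    for _ in range(hops):
        frontier = {nxt for item in frontier for nxt in graph.get(item, ())}
        candidates |= frontier
    candidates -= set(last_items)
    query = np.mean([item_features[i] for i in last_items], axis=0)
    query = query / (np.linalg.norm(query) + 1e-12)

    def score(item):
        vec = item_features[item]
        return float(vec @ query) / (np.linalg.norm(vec) + 1e-12)

    return sorted(candidates, key=score, reverse=True)[:k]
```

Nothing here is trained; the only knobs are the hop count and the feature space, which is exactly why competitive scores from such a baseline point at the data rather than the model.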

If this is right

  • Model performance rankings change substantially once datasets are characterized by their shortcut strength.
  • Weakening any one of the three shortcut structures makes the advantages of sophisticated models more visible.
  • Strong benchmark results do not necessarily demonstrate advanced sequential or generative modeling ability.
  • More careful dataset selection and property-level analysis are required to support claims about new recommendation methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New benchmarks could be built by filtering or generating data that deliberately reduces low-branching transitions and feature smoothness to better isolate true sequential modeling needs.
  • The heuristic itself could serve as a minimal baseline that future papers must beat to demonstrate meaningful progress.
  • Similar transition-graph audits might expose shortcut issues in other sequential prediction tasks outside recommendation.

Load-bearing premise

That the strong results arise from the identified shortcut structures in the data rather than from any sequential patterns captured during graph building or feature-based ranking.

What would settle it

Select or synthesize datasets that combine high branching in item transitions, low feature similarity between consecutive items, and strong dependence on long user histories, then check whether the heuristic falls behind trained sequential models on those datasets.
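Such a selection presupposes that shortcut strength can be measured per dataset. Two of the three shortcuts have simple proxies; the sketch below uses illustrative statistics (average out-degree of the transition graph for branching, mean cosine similarity of consecutive items for feature smoothness), not necessarily the paper's exact definitions.

```python
from collections import defaultdict

import numpy as np


def shortcut_diagnostics(sequences, item_features):
    """Return (branching, smoothness): average number of distinct successors
    per item, and mean cosine similarity between consecutive items."""
    successors = defaultdict(set)
    sims = []
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            successors[a].add(b)
            u, v = item_features[a], item_features[b]
            sims.append(float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    branching = float(np.mean([len(s) for s in successors.values()])) if successors else 0.0
    smoothness = float(np.mean(sims)) if sims else 0.0
    return branching, smoothness
```

Low branching and high smoothness would flag a dataset as shortcut-prone; candidate "settling" datasets are those where both probes point the other way.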

Figures

Figures reproduced from arXiv: 2605.07125 by Bingheng Li, Chun How Tan, Daochen Zha, Hanbing Wang, Haoyu Han, Huiji Gao, Hui Liu, Jiliang Tang, Li Ma, Sanjeev Katariya, Stephanie Moyerman, Xin Liu.

Figure 1
Figure 1. The proportion of surveyed sequential recommendation papers utilizing each dataset. Along with this methodological shift, evaluation practice has also become strikingly concentrated. We analyze dataset usage across 94 recent generative-recommendation papers, detailed in Appendix A.
Figure 2
Figure 2. The relative performance gap between the Full-sequence and Last-1 settings. Shortcut 3: limited dependence on long user histories. To test whether each dataset requires long-range user-history information, we compare SASRec and HSTU under two settings: the full-sequence setting and the Last-1 setting, where each prediction only uses the immediately previous item.
Figure 3
Figure 3. Prediction-level comparison between TGH and learned recommenders.
Figure 4
Figure 4. Statistics of the surveyed papers. We collect 94 papers published between 2022 and 2026 that propose or evaluate generative recommendation methods; the full list of paper titles is provided in our code repository.
Original abstract

Sequential recommendation has increasingly shifted toward generative recommenders that combine sequential patterns with semantic item information. Yet these methods are often evaluated on a small set of widely used benchmarks, raising a key question: do these benchmarks actually require the advanced modeling capabilities that modern generative recommenders claim to provide? We conduct a benchmark audit with an intentionally simple graph heuristic. Starting from only the last one or two interacted items, it retrieves candidates from a few-hop item-transition graph and ranks them by item-feature similarity. Despite using no sequence encoder, generative objective, or training, this heuristic matches or outperforms many modern baselines, with relative NDCG@10 improvements of 38.10% and 44.18% over the best competing baseline on Amazon Review Sports and CDs. We show that this behavior reflects shortcut solvability rather than an artifact of one heuristic. We identify three shortcut structures that can make next-item prediction easier than expected: low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories. These shortcuts need not appear together; even one or two strong signals can make simple local retrieval highly competitive, while weakening them makes the benefits of more sophisticated models clearer. Across 14 datasets, model rankings vary substantially with dataset properties, yet the heuristic remains competitive on 10 of them. Our findings suggest that strong performance on standard benchmarks does not always demonstrate advanced sequential, semantic, or generative modeling ability. We call for more careful dataset selection and dataset-level diagnostic analysis when using benchmarks to support claims about new recommendation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a simple, untrained graph heuristic—retrieving candidates from a few-hop item-transition graph starting from the last 1-2 user interactions and ranking by item-feature similarity—matches or exceeds many trained generative sequential recommenders on standard benchmarks. It reports relative NDCG@10 gains of 38.10% and 44.18% over the best baseline on Amazon Sports and CDs, attributes this to three dataset shortcuts (low-branching local transitions, feature-smooth transitions, limited long-history dependence), and demonstrates across 14 datasets that model rankings vary with these properties while the heuristic remains competitive on 10.

Significance. If the central empirical result holds without implementation artifacts, the work is significant as a benchmark audit: it supplies concrete, multi-dataset evidence that strong performance on widely used sequential recommendation benchmarks does not necessarily demonstrate advanced sequential, semantic, or generative modeling. The identification of measurable shortcut structures and the call for dataset-level diagnostics provide a practical framework for future evaluation.

major comments (2)
  1. [Abstract and heuristic description] The construction of the few-hop item-transition graph is not explicitly stated to use only training interactions. If validation or test transitions are included, retrieval from the last 1-2 items can directly surface held-out next-items via shared edges, turning the 'no training, no encoder' heuristic into an implicit oracle. This would explain the reported 38-44% relative gains on Amazon Sports/CDs without reference to the claimed shortcuts (low-branching transitions or feature smoothness). The abstract and method description must add an explicit guarantee and ablation confirming training-only graph construction.
  2. [Shortcut analysis and multi-dataset results] The causal link between the three identified shortcut structures and the heuristic's competitiveness is not convincingly demonstrated. While the paper shows the heuristic is competitive on 10 of 14 datasets, it does not quantify how much of the NDCG@10 margin on Sports/CDs is attributable to each shortcut versus other unstated factors in candidate retrieval or feature ranking. A controlled ablation that systematically weakens each shortcut (e.g., by increasing branching factor or reducing feature correlation) and measures the resulting drop in heuristic performance is needed to support the claim that the benchmarks are 'shortcut-solvable' rather than merely easy for this particular retrieval method.
minor comments (2)
  1. [Abstract] The abstract states 'relative NDCG@10 improvements of 38.10% and 44.18%' but does not name the exact best competing baseline or the absolute NDCG@10 values; adding these numbers would improve interpretability.
  2. [Heuristic description] Notation for the graph (e.g., definition of 'few-hop' and how item-feature similarity is computed) should be introduced with a short equation or pseudocode in the main text rather than deferred to appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important clarifications needed for the heuristic construction and the strength of evidence linking dataset shortcuts to the observed results. We address each point below and have revised the manuscript to improve clarity and support for the claims.

Point-by-point responses
  1. Referee: [Abstract and heuristic description] The construction of the few-hop item-transition graph is not explicitly stated to use only training interactions. If validation or test transitions are included, retrieval from the last 1-2 items can directly surface held-out next-items via shared edges, turning the 'no training, no encoder' heuristic into an implicit oracle. This would explain the reported 38-44% relative gains on Amazon Sports/CDs without reference to the claimed shortcuts (low-branching transitions or feature smoothness). The abstract and method description must add an explicit guarantee and ablation confirming training-only graph construction.

    Authors: We agree that explicit confirmation of training-only graph construction is essential to rule out leakage. In the original implementation, the item-transition graph is built exclusively from training-set interactions; the last 1-2 items serve only as query starting points for candidate retrieval, with no edges or transitions drawn from validation or test data. We will add a clear statement to this effect in both the abstract and the method section of the revised manuscript. In addition, we will include a new ablation that contrasts performance under the training-only graph against an (invalid) version that incorporates test transitions, thereby demonstrating that the reported gains do not rely on oracle-like access to held-out items. revision: yes

  2. Referee: [Shortcut analysis and multi-dataset results] The causal link between the three identified shortcut structures and the heuristic's competitiveness is not convincingly demonstrated. While the paper shows the heuristic is competitive on 10 of 14 datasets, it does not quantify how much of the NDCG@10 margin on Sports/CDs is attributable to each shortcut versus other unstated factors in candidate retrieval or feature ranking. A controlled ablation that systematically weakens each shortcut (e.g., by increasing branching factor or reducing feature correlation) and measures the resulting drop in heuristic performance is needed to support the claim that the benchmarks are 'shortcut-solvable' rather than merely easy for this particular retrieval method.

    Authors: We appreciate the call for stronger causal evidence. The manuscript currently demonstrates the link through systematic variation across 14 datasets: the heuristic remains competitive exactly on those datasets exhibiting one or more of the three shortcuts, while model rankings shift when the shortcuts are weaker. To strengthen this, we will add explicit quantitative measurements of shortcut strength (average local branching factor, feature-transition correlation, and history-dependence statistics) for each dataset and report their correlation with the heuristic's relative NDCG@10 margin. However, controlled interventions that artificially weaken individual shortcuts (e.g., by subsampling edges to increase branching or perturbing features) would require substantial dataset modifications that risk introducing new artifacts and may not preserve the original benchmark semantics. We therefore provide a partial revision by expanding the quantitative correlation analysis and discussion, while acknowledging that full interventional ablations lie beyond the current scope. revision: partial
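The correlation analysis the authors propose here can be prototyped directly: rank datasets by a shortcut statistic and by the heuristic's relative NDCG@10 margin, then compare the ranks. A minimal tie-free Spearman sketch (any dataset values would come from the paper's tables; none are reproduced here):

```python
import numpy as np


def spearman(x, y):
    """Spearman rank correlation via a rank transform (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry) / (np.linalg.norm(rx) * np.linalg.norm(ry))
```

A strongly negative correlation between, say, branching factor and the heuristic's margin would support the low-branching shortcut account without requiring interventional dataset edits.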

Circularity Check

0 steps flagged

No circularity: empirical audit with fixed heuristic and direct comparisons

full rationale

This is an empirical benchmark audit that applies a fixed, non-learned graph heuristic (last 1-2 items, few-hop retrieval, feature similarity ranking) to existing datasets and compares its NDCG against trained baselines. No equations, parameters, or derivations are introduced that reduce by construction to fitted inputs or self-referential definitions. Shortcut identification is performed by inspecting observable dataset statistics (branching factors, feature smoothness) after the fact, not by any self-defining construction. All reported gains are direct empirical measurements; the paper does not claim a mathematical derivation whose validity depends on its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that the described heuristic achieves competitive performance and on the interpretation that this performance arises from three data properties rather than from hidden modeling power in the heuristic itself.

axioms (1)
  • domain assumption Few-hop item-transition graphs plus feature similarity capture the dominant local patterns in the evaluated datasets
    Invoked when the heuristic is presented as sufficient to match complex models without sequence encoders.

pith-pipeline@v0.9.0 · 5617 in / 1326 out tokens · 40037 ms · 2026-05-11T00:59:17.132367+00:00 · methodology

discussion (0)

