pith. machine review for the scientific record. sign in

arxiv: 2605.12617 · v1 · submitted 2026-05-12 · 💻 cs.IR

Recognition: unknown

MLPs are Efficient Distilled Generative Recommenders

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:10 UTC · model grok-4.3

classification 💻 cs.IR
keywords generative recommendationsemantic IDsdistillationMLPinference accelerationautoregressive decodingrecommender systemsTransformer decoder
0
0 comments X

The pith

Distilling generative recommenders into MLPs preserves accuracy while speeding up inference by 8.74x

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative recommendation models that use Semantic IDs face high inference latency from beam search with Transformer decoders. The paper shows this is unnecessary because the hierarchical structure of SIDs makes token predictions much easier after the first one. SID-MLP distills the teacher into lightweight position-specific MLP heads that process the full user context in one step instead of repeated attention. Experiments confirm it matches the original accuracy at 8.74 times faster inference speed. The same distillation works across different model architectures and tokenizers as a plug-in accelerator.

Core claim

Standard Transformer decoders are structural overkill for Semantic ID generative recommendation because prediction difficulty drops sharply after the first token due to their hierarchy. SID-MLP captures global user context in a single operation decoupled from sequential prediction and distills the heavy autoregressive teacher into position-specific MLP heads. This eliminates dense attention overhead while preserving prefix and context dependencies, matching teacher accuracy with 8.74x faster inference and serving as a plug-and-play accelerator for various backbones.

What carries the argument

SID-MLP distillation framework that replaces the Transformer decoder with position-specific MLP heads after capturing user context once.

If this is right

  • This serves as a plug-and-play accelerator for different backbones and tokenizer settings.
  • SID-MLP++ extends the framework to replace the Transformer encoder for further latency reductions.
  • Decoder-side MLPs distillation is an effective acceleration path for structured SID recommendation.
  • Full encoder replacement offers an additional speed-accuracy trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other recommendation or prediction tasks with hierarchical token structures where early predictions are hardest.
  • Real-world systems might adopt this for lower latency in serving recommendations without retraining from scratch.
  • It raises the question of whether attention is overused in other autoregressive setups with structured outputs.

Load-bearing premise

The hierarchical nature of SIDs makes prediction difficulty drop sharply after the first token, rendering repeated attention computations highly redundant.

What would settle it

Observing that the recommendation quality of the MLP-distilled model falls short of the teacher model on held-out data or that the measured inference speedup is substantially below 8.74x.

Figures

Figures reproduced from arXiv: 2605.12617 by Clark Mingxuan Ju, Julian McAuley, Neil Shah, Yupeng Hou, Zitian Guo.

Figure 1
Figure 1. Figure 1: SID search-space collapse on Instruments. Average Valid next￾token choices and top-1 accuracy [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture. The architecture is composed of two components: SID-MLP (shaded yellow, left) and SID-MLP++ (shaded blue, right). MH Attn is multi-head attention, ei = e(ci) represents the embedding of prefix ci . Snowflakes mark frozen modules, and flames mark trainable modules. recommendation with lightweight MLP heads. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: SID-MLP robustness across settings. (a) NDCG@10 recovery is the ratio between SID-MLP and its teacher across tokenizers. (b) Peak GPU memory (bars) and throughput (lines) as batch size changes. (c) NDCG@10 recovery (bars) and throughput (lines) as beam size changes. Beam size. In [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-scale Pareto on Amazon Reviews 2018. NDCG@10 vs throughput (samples/s, log scale) across Instruments, Arts, and Games. All throughputs are end-to-end except LC-Rec 7B and LC-Rec SID-MLP, reported decode-only (prefill excluded) as optimistic upper bounds. Baselines span SASRec [6] (∼1M/5M/13M), TIGER (∼1M/5M/13M), and LC-Rec 7B (LLaMA-2, off-axis at ∼4 samples/s); students are TIGER SID-MLP and LC-Rec… view at source ↗
Figure 5
Figure 5. Figure 5: Hyperparameter and m-mode analysis. Top row: α ∈ {0, 0.3, 0.5, 0.7, 0.8, 1.0} sweep. Middle row: head-hidden width sweep. Bottom row: m-mode accuracy–throughput tradeoff. Columns correspond to Instruments, Scientific, and Games [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
read the original abstract

Generative recommendation models employing Semantic IDs (SIDs) exhibit strong potential, yet their practical deployment is bottlenecked by the high inference latency of beam-expanded autoregressive decoding. In this work, we identify that standard attention-heavy Transformer decoders represent a structural overkill for this task: the hierarchical nature of SIDs makes prediction difficulty drops sharply after the first token, rendering repeated attention computations highly redundant. Driven by this insight, we propose SID-MLP, a lightweight MLP-centric distillation framework that fundamentally simplifies the decoding paradigm for GR. Instead of executing complex, step-by-step attention mechanisms, our approach captures the global user context in a single operation, decoupled from sequential token prediction. We then distill the heavy autoregressive teacher into position-specific MLP heads, eliminating the dense attention overhead while preserving prefix and context dependencies. Extensive experiments demonstrate that SID-MLP matches the accuracy of teacher models while accelerating inference by 8.74x. Crucially, this distillation strategy can serve as a plug-and-play accelerator for different backbones and tokenizer settings. Furthermore, we introduce SID-MLP++, extending our distillation framework to replace the Transformer encoder, unlocking further latency reductions. Ultimately, our work reveals that decoder-side MLPs distillation is an effective acceleration path for structured SID recommendation, while full encoder replacement offers an additional speed--accuracy trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SID-MLP, a distillation framework that replaces autoregressive attention-based decoding in Semantic ID (SID) generative recommenders with position-specific MLPs. It claims that the hierarchical nature of SIDs causes prediction difficulty to drop sharply after the first token, rendering repeated attention redundant; the approach captures global user context in one step and distills the teacher into lightweight MLP heads. Experiments reportedly show SID-MLP matches teacher accuracy while achieving 8.74x inference speedup, with an extension (SID-MLP++) that also replaces the Transformer encoder for further gains. The method is positioned as a plug-and-play accelerator across backbones and tokenizers.

Significance. If the empirical results hold under rigorous validation, the work provides a practical acceleration technique for SID-based generative recommenders, potentially improving deployability by reducing inference latency with minimal accuracy loss. The distillation strategy's claimed generality across architectures would be a useful contribution to efficient inference methods in recommendation systems.

major comments (3)
  1. [Abstract] Abstract: The core assumption that 'the hierarchical nature of SIDs makes prediction difficulty drops sharply after the first token' is presented without any per-position quantitative validation (e.g., token-wise accuracy, loss curves, or difficulty metrics for the teacher on positions 2+). This assumption directly justifies replacing autoregressive attention with independent MLPs and is load-bearing for both the accuracy-matching and 8.74x speedup claims.
  2. [Experiments] Experimental results: The reported matching accuracy and 8.74x speedup lack essential details on datasets, baseline models, number of runs, error bars, and exact inference-time measurement protocol. These omissions prevent assessment of whether the central empirical claims are robust or reproducible.
  3. [SID-MLP++] SID-MLP++ extension: Replacing the full Transformer encoder with the distillation framework requires additional justification and ablations on how prefix/context dependencies are preserved without attention, as this change is more substantial than decoder-only replacement and directly affects the claimed further latency reductions.
minor comments (2)
  1. [Abstract] Abstract contains a grammatical error: 'makes prediction difficulty drops' should read 'makes prediction difficulty drop'.
  2. [Abstract] The 'plug-and-play' claim would be strengthened by explicit discussion of the conditions (e.g., tokenizer settings or backbone types) under which the distillation succeeds without retraining.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps in the initial submission, we will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The core assumption that 'the hierarchical nature of SIDs makes prediction difficulty drops sharply after the first token' is presented without any per-position quantitative validation (e.g., token-wise accuracy, loss curves, or difficulty metrics for the teacher on positions 2+). This assumption directly justifies replacing autoregressive attention with independent MLPs and is load-bearing for both the accuracy-matching and 8.74x speedup claims.

    Authors: We agree that explicit per-position quantitative validation strengthens the justification. The manuscript provides overall empirical support for the hierarchical property through end-to-end results, but does not include dedicated token-wise accuracy or loss curves for the teacher. In the revised version we will add these analyses (token-wise accuracy and per-position loss for the teacher on positions 2+) to directly validate the assumption. revision: yes

  2. Referee: [Experiments] Experimental results: The reported matching accuracy and 8.74x speedup lack essential details on datasets, baseline models, number of runs, error bars, and exact inference-time measurement protocol. These omissions prevent assessment of whether the central empirical claims are robust or reproducible.

    Authors: We acknowledge that the initial submission omitted several experimental details for brevity. In the revision we will expand the experimental section to specify the exact datasets, baseline models, number of runs (5), error bars (standard deviation across runs), and the inference-time protocol (single A100 GPU, batch size 1, wall-clock time averaged over 1000 inferences after warmup). revision: yes

  3. Referee: [SID-MLP++] SID-MLP++ extension: Replacing the full Transformer encoder with the distillation framework requires additional justification and ablations on how prefix/context dependencies are preserved without attention, as this change is more substantial than decoder-only replacement and directly affects the claimed further latency reductions.

    Authors: We agree that the SID-MLP++ extension requires more justification and ablations. The current manuscript describes the extension at a high level but lacks explicit analysis of dependency preservation. In the revision we will add a dedicated subsection with justification and ablations (including comparisons to partial-attention variants) demonstrating that global context capture plus position-specific distillation preserves prefix dependencies while delivering the reported latency gains. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements of distillation

full rationale

The paper proposes SID-MLP as a distillation framework motivated by an observed property of hierarchical SIDs, then validates it through experiments comparing accuracy and latency against teacher models. No derivation chain reduces a claimed prediction to fitted inputs by construction, no self-citation is load-bearing for a uniqueness theorem, and no ansatz is smuggled via prior work. The speedup and matching-accuracy results are presented as measured outcomes rather than algebraic identities or renamed fits. The central assumption about per-token difficulty is treated as an empirical premise whose consequences are tested externally, keeping the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that SID hierarchy reduces prediction difficulty after the first token, plus empirical distillation results; no free parameters or invented entities with independent evidence are detailed in the abstract.

axioms (1)
  • domain assumption The hierarchical nature of SIDs makes prediction difficulty drop sharply after the first token.
    This insight is invoked to justify replacing attention with MLPs.
invented entities (1)
  • SID-MLP no independent evidence
    purpose: Lightweight MLP-centric distillation framework for generative recommendation decoding.
    New method name and architecture introduced by the paper.

pith-pipeline@v0.9.0 · 5542 in / 1177 out tokens · 49323 ms · 2026-05-14T20:10:41.001869+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 19 canonical work pages · 5 internal anchors

  1. [1]

    Transformer memory as a differentiable search index

    Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. Transformer memory as a differentiable search index. volume 35, pages 21831–21843, 2022

  2. [2]

    Tran, Jonah Samost, Maciej Kula, Ed H

    Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. Recommender systems with generative retrieval. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

  3. [3]

    Adapting large language models by integrating collaborative semantics for recommenda- tion

    Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. Adapting large language models by integrating collaborative semantics for recommenda- tion. In40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024. IEEE, 2024

  4. [4]

    OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

    Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. Onerec: Unifying retrieve and rank with generative recommender and iterative preference alignment.arXiv preprint arXiv:2502.18965, 2025

  5. [5]

    Session-based recommendations with recurrent neural networks

    Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. 2016

  6. [6]

    Self-attentive sequential recommendation

    Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. 2018

  7. [7]

    Plum: Adapting pre-trained language models for industrial-scale generative recommendations

    Ruining He, Lukasz Heldt, Lichan Hong, Raghunandan Keshavan, Shifan Mao, Nikhil Mehta, Zhengyang Su, Alicia Tsai, Yueqi Wang, Shao-Chuan Wang, et al. Plum: Adapting pre-trained language models for industrial-scale generative recommendations. InProceedings of the ACM Web Conference 2026, pages 8093–8104, 2026

  8. [8]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. 2023

  9. [9]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. 2024

  10. [10]

    Eagle: Speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. 2025

  11. [11]

    Efficient inference for large language model-based generative recommendation

    Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. Efficient inference for large language model-based generative recommendation. 2024

  12. [12]

    Generating long semantic ids in parallel for recom- mendation

    Yupeng Hou, Jiacheng Li, Ashley Shin, Jinsung Jeon, Abhishek Santhanam, Wei Shao, Kaveh Hassani, Ning Yao, and Julian McAuley. Generating long semantic ids in parallel for recom- mendation. 2025

  13. [13]

    Nezha: A zero-sacrifice and hyperspeed decoding architecture for generative recommendations

    Yejing Wang, Shengyu Zhou, Jinyu Lu, Ziwei Liu, Langming Liu, Maolin Wang, Wenlin Zhang, Feng Li, Wenbo Su, Pengjie Wang, Jian Xu, and Xiangyu Zhao. Nezha: A zero-sacrifice and hyperspeed decoding architecture for generative recommendations. 2026

  14. [14]

    Non- autoregressive generative models for reranking recommendation

    Yuxin Ren, Qiya Yang, Yichun Wu, Wei Xu, Yalong Wang, and Zhiqiang Zhang. Non- autoregressive generative models for reranking recommendation. 2025

  15. [15]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2015

  16. [16]

    Graph-less neural networks: Teaching old mlps new tricks via distillation

    Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. Graph-less neural networks: Teaching old mlps new tricks via distillation. 2022

  17. [17]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023. 10

  18. [18]

    Filter-enhanced MLP is all you need for sequential recommendation

    Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. Filter-enhanced MLP is all you need for sequential recommendation. InWWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, 2022

  19. [19]

    Bridging language and items for retrieval and recommendation: Benchmarking llms as semantic encoders

    Yupeng Hou, Jiacheng Li, Xiangjun Fu, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation: Benchmarking llms as semantic encoders. 2026

  20. [20]

    EARN: efficient inference acceleration for llm-based generative recommendation by register tokens.CoRR, 2025

    Chaoqun Yang, Xinyu Lin, Wenjie Wang, Yongqi Li, Teng Sun, Xianjing Han, and Tat-Seng Chua. EARN: efficient inference acceleration for llm-based generative recommendation by register tokens.CoRR, 2025

  21. [21]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. 2025

  22. [22]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. 2024

  23. [23]

    Generative recommendation with semantic ids: A practitioner’s handbook

    Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, and Neil Shah. Generative recommendation with semantic ids: A practitioner’s handbook. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 6420–6425, 2025

  24. [24]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11523–11532, 2022

  25. [25]

    Purely semantic indexing for llm-based generative recommendation and retrieval.arXiv preprint arXiv:2509.16446, 2025

    Ruohan Zhang, Jiacheng Li, Julian McAuley, and Yupeng Hou. Purely semantic indexing for llm-based generative recommendation and retrieval.arXiv preprint arXiv:2509.16446, 2025

  26. [26]

    Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5)

    Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). InProceedings of the 16th ACM Conference on Recommender Systems, 2022

  27. [27]

    Petrov and Craig Macdonald

    Aleksandr V . Petrov and Craig Macdonald. Generative sequential recommendation with gptrec. arXiv preprint arXiv:2306.11114, 2023

  28. [28]

    Actions speak louder than words: Trillion- parameter sequential transducers for generative recommendations

    Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. Actions speak louder than words: Trillion- parameter sequential transducers for generative recommendations. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, 2024

  29. [29]

    Unifying generative and dense retrieval for sequential recommendation.arXiv preprint arXiv:2411.18814, 2024

    Liu Yang, Fabian Paischer, Kaveh Hassani, Jiacheng Li, Shuai Shao, Zhang Gabriel Li, Yun He, Xue Feng, Nima Noorshams, Sem Park, et al. Unifying generative and dense retrieval for sequential recommendation.arXiv preprint arXiv:2411.18814, 2024

  30. [30]

    How to index item ids for recommendation foundation models

    Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. How to index item ids for recommendation foundation models. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 195–204, 2023

  31. [31]

    Eager: Two-stream generative recommender with behavior-semantic collaboration

    Ye Wang, Jiahao Xun, Minjie Hong, Jieming Zhu, Tao Jin, Wang Lin, Haoyuan Li, Linjun Li, Yan Xia, Zhou Zhao, et al. Eager: Two-stream generative recommender with behavior-semantic collaboration. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3245–3254, 2024

  32. [32]

    Cost: Contrastive quantization based semantic tokenization for generative recommendation

    Jieming Zhu, Mengqun Jin, Qijiong Liu, Zexuan Qiu, Zhenhua Dong, and Xiu Li. Cost: Contrastive quantization based semantic tokenization for generative recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems, pages 969–974, 2024

  33. [33]

    A simple contrastive framework of item tokenization for generative recommendation.arXiv preprint arXiv:2506.16683, 2025

    Penglong Zhai, Yifang Yuan, Fanyi Di, Jie Li, Yue Liu, Chen Li, Jie Huang, Sicong Wang, Yao Xu, and Xin Li. A simple contrastive framework of item tokenization for generative recommendation.arXiv preprint arXiv:2506.16683, 2025. 11

  34. [34]

    Cofirec: Coarse-to-fine tokenization for generative recommendation.arXiv preprint arXiv:2511.22707, 2025

    Tianxin Wei, Xuying Ning, Xuxing Chen, Ruizhong Qiu, Yupeng Hou, Yan Xie, Shuang Yang, Zhigang Hua, and Jingrui He. Cofirec: Coarse-to-fine tokenization for generative recommendation.arXiv preprint arXiv:2511.22707, 2025

  35. [35]

    Bridging textual-collaborative gap through semantic codes for sequential recommendation

    Enze Liu, Bowen Zheng, Wayne Xin Zhao, and Ji-Rong Wen. Bridging textual-collaborative gap through semantic codes for sequential recommendation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 1788–1798, 2025

  36. [36]

    Actionpiece: Contextually tokenizing action sequences for generative recommendation

    Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H Chi, Julian McAuley, and Derek Zhiyuan Cheng. Actionpiece: Contextually tokenizing action sequences for generative recommendation. 2025

  37. [37]

    Pctx: Tokenizing personalized context for generative recommendation.arXiv preprint arXiv:2510.21276, 2025

    Qiyong Zhong, Jiajie Su, Yunshan Ma, Julian McAuley, and Yupeng Hou. Pctx: Tokenizing personalized context for generative recommendation.arXiv preprint arXiv:2510.21276, 2025

  38. [38]

    Multi-behavior generative recommendation

    Zihan Liu, Yupeng Hou, and Julian McAuley. Multi-behavior generative recommendation. InProceedings of the 33rd ACM international conference on information and knowledge management, pages 1575–1585, 2024

  39. [39]

    Mm- grec: Multimodal generative recommendation with transformer model.arXiv preprint arXiv:2404.16555, 2024

    Han Liu, Yinwei Wei, Xuemeng Song, Weili Guan, Yuan-Fang Li, and Liqiang Nie. Mm- grec: Multimodal generative recommendation with transformer model.arXiv preprint arXiv:2404.16555, 2024

  40. [40]

    Multimodal quantitative language for generative recommendation

    Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, and Yonghong Tian. Multimodal quantitative language for generative recommendation. 2025

  41. [41]

    Multi-aspect cross-modal quantization for generative recommendation

    Fuwei Zhang, Xiaoyu Liu, Dongbo Xi, Jishen Yin, Huan Chen, Peng Yan, Fuzhen Zhuang, and Zhao Zhang. Multi-aspect cross-modal quantization for generative recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 16271–16279, 2026

  42. [42]

    Beyond unimodal boundaries: Generative recommendation with multimodal semantics.arXiv preprint arXiv:2503.23333, 2025

    Jing Zhu, Mingxuan Ju, Yozen Liu, Danai Koutra, Neil Shah, and Tong Zhao. Beyond unimodal boundaries: Generative recommendation with multimodal semantics.arXiv preprint arXiv:2503.23333, 2025

  43. [43]

    Learnable item tokenization for generative recommendation

    Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. Learnable item tokenization for generative recommendation. 2025

  44. [44]

    Generative recommender with end-to-end learnable item tokenization

    Enze Liu, Bowen Zheng, Cheng Ling, Lantao Hu, Han Li, and Wayne Xin Zhao. Generative recommender with end-to-end learnable item tokenization. InProceedings of the 48th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, pages 729–739, 2025

  45. [45]

    Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation

    Yimeng Bai, Chang Liu, Yang Zhang, Dingxian Wang, Frank Yang, Andrew Rabinovich, Wenge Rong, and Fuli Feng. Bi-level optimization for generative recommendation: Bridging tokenization and generation.arXiv preprint arXiv:2510.21242, 2025

  46. [46]

    Pit: A dynamic personalized item tokenizer for end-to-end generative recommendation.arXiv preprint arXiv:2602.08530, 2026

    Huanjie Wang, Xinchen Luo, Honghui Bao, Zhang Zixing, Lejian Ren, Yunfan Wu, Hongwei Zhang, Liwei Guan, and Guang Chen. Pit: A dynamic personalized item tokenizer for end-to- end generative recommendation.arXiv preprint arXiv:2602.08530, 2026

  47. [47]

    Differentiable Semantic ID for Generative Recommendation

    Junchen Fu, Xuri Ge, Alexandros Karatzoglou, Ioannis Arapakis, Suzan Verberne, Joemon M Jose, and Zhaochun Ren. Differentiable semantic id for generative recommendation.arXiv preprint arXiv:2601.19711, 2026

  48. [48]

    Unigrec: Unified generative recommendation with soft identifiers for end-to-end optimization.arXiv preprint arXiv:2601.17438, 2026

    Jialei Li, Yang Zhang, Yimeng Bai, Shuai Zhu, Ziqi Xue, Xiaoyan Zhao, Dingxian Wang, Frank Yang, Andrew Rabinovich, and Xiangnan He. Unigrec: Unified generative recommendation with soft identifiers for end-to-end optimization.arXiv preprint arXiv:2601.17438, 2026

  49. [49]

    End-to-end semantic id generation for generative advertisement recommendation.arXiv preprint arXiv:2602.10445, 2026

    Jie Jiang, Xinxun Zhang, Enming Zhang, Yuling Xiong, Jun Zhang, Jingwen Wang, Huan Yu, Yuxiang Wang, Hao Wang, Xiao Yan, et al. End-to-end semantic id generation for generative advertisement recommendation.arXiv preprint arXiv:2602.10445, 2026. 12

  50. [50]

    Enhancing item tokenization for generative recommendation through self-improvement.arXiv preprint arXiv:2412.17171, 2024

    Runjin Chen, Mingxuan Ju, Ngoc Bui, Dimosthenis Antypas, Stanley Cai, Xiaopeng Wu, Leonardo Neves, Zhangyang Wang, Neil Shah, and Tong Zhao. Enhancing item tokenization for generative recommendation through self-improvement.arXiv preprint arXiv:2412.17171, 2024

  51. [51]

    Better generalization with semantic ids: A case study in ranking for recommendations

    Anima Singh, Trung Vu, Nikhil Mehta, Raghunandan Keshavan, Maheswaran Sathiamoorthy, Yilin Zheng, Lichan Hong, Lukasz Heldt, Li Wei, Devansh Tandon, et al. Better generalization with semantic ids: A case study in ranking for recommendations. InProceedings of the 18th ACM Conference on Recommender Systems, pages 1039–1044, 2024

  52. [52]

    Understanding generative recommendation with semantic ids from a model-scaling view.arXiv preprint arXiv:2509.25522, 2025

    Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, and Clark Mingxuan Ju. Understanding generative recommendation with semantic ids from a model-scaling view.arXiv preprint arXiv:2509.25522, 2025

  53. [53]

    How well does generative recommendation generalize? 2026

    Yijie Ding, Zitian Guo, Jiacheng Li, Letian Peng, Shuai Shao, Wei Shao, Xiaoqiang Luo, Luke Simon, Jingbo Shang, Julian McAuley, and Yupeng Hou. How well does generative recommendation generalize? 2026

  54. [54]

    Expressiveness limits of autoregressive semantic id generation in generative recommendation

    Yupeng Hou, Haven Kim, Clark Mingxuan Ju, Eduardo Escoto, Neil Shah, and Julian McAuley. Expressiveness limits of autoregressive semantic id generation in generative recommendation. 2026

  55. [55]

    Semantic ids for recommender systems at snapchat: Use cases, technical challenges, and design choices

    Clark Mingxuan Ju, Tong Zhao, Leonardo Neves, Liam Collins, Bhuvesh Kumar, Jiwen Ren, Lili Zhang, Wenfeng Zhuo, Vincent Zhang, Xiao Bai, Jinchao Li, Karthik Iyer, Zihao Fan, Yilun Xu, Yiwen Chen, Peicheng Yu, Manish Malik, and Neil Shah. Semantic ids for recommender systems at snapchat: Use cases, technical challenges, and design choices. 2026

  56. [56]

    Mitigating Collaborative Semantic ID Staleness in Generative Retrieval

    Vladimir Baikalov, Iskander Bagautdinov, and Sergey Muravyov. Mitigating collaborative semantic id staleness in generative retrieval.arXiv preprint arXiv:2604.13273, 2026

  57. [57]

    Sequential data augmentation for generative recommendation

    Geon Lee, Bhuvesh Kumar, Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, and Liam Collins. Sequential data augmentation for generative recommendation. InProceedings of the Nineteenth ACM International Conference on Web Search and Data Mining, pages 303–312, 2026

  58. [58]

    Inductive generative recommen- dation via retrieval-based speculation

    Yijie Ding, Jiacheng Li, Julian McAuley, and Yupeng Hou. Inductive generative recommen- dation via retrieval-based speculation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14675–14683, 2026

  59. [59]

    Order-agnostic identifier for large language model-based generative recommendation

    Xinyu Lin, Haihan Shi, Wenjie Wang, Fuli Feng, Qifan Wang, See-Kiong Ng, and Tat-Seng Chua. Order-agnostic identifier for large language model-based generative recommendation. In Proceedings of the 48th international ACM SIGIR conference on research and development in information retrieval, pages 1923–1933, 2025

  60. [60]

    Closing the performance gap in generative recommenders with collaborative tokenization and efficient modeling

    Simon Lepage, Jeremie Mary, and David Picard. Closing the performance gap in generative recommenders with collaborative tokenization and efficient modeling. 2025

  61. [61]

    Diffgrm: Diffusion-based generative recommendation model

    Zhao Liu, Yichen Zhu, Yiqing Yang, Xiao Lv, Guoping Tang, Rui Huang, Qiang Luo, Ruiming Tang, and Guorui Zhou. Diffgrm: Diffusion-based generative recommendation model. In Proceedings of the ACM Web Conference 2026, pages 5853–5864, 2026

  62. [62]

    Diffusion generative recommendation with continuous tokens

    Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, and Wenqi Fan. Diffusion generative recommendation with continuous tokens. InProceedings of the ACM Web Conference 2026, pages 7259–7270, 2026

  63. [63]

    Masked diffusion for generative recommendation

    Kulin Shah, Bhuvesh Kumar, Neil Shah, and Liam Collins. Masked diffusion for generative recommendation. 2025

  64. [64]

    Llada-rec: Discrete diffusion for parallel semantic id generation in generative recommendation

    Teng Shi, Chenglei Shen, Weijie Yu, Shen Nie, Chongxuan Li, Xiao Zhang, Ming He, Yan Han, and Jun Xu. Llada-rec: Discrete diffusion for parallel semantic id generation in generative recommendation. 2025

  65. [65]

    Fitnets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. 2015. 13

  66. [66]

    Distillation matters: Empowering sequential recommenders to match the performance of large language models

    Yu Cui, Feng Liu, Pengbo Wang, Bohao Wang, Heng Tang, Yi Wan, Jun Wang, and Jiawei Chen. Distillation matters: Empowering sequential recommenders to match the performance of large language models. In18th ACM Conference on Recommender Systems, page 507–517. ACM, October 2024

  67. [67]

    Distillation enhanced generative retrieval

    Yongqi Li, Zhen Zhang, Wenjie Wang, Liqiang Nie, Wenjie Li, and Tat-Seng Chua. Distillation enhanced generative retrieval. 2024

  68. [68]

    Chawla, Neil Shah, and Tong Zhao

    Zhichun Guo, William Shiao, Shichang Zhang, Yozen Liu, Nitesh V . Chawla, Neil Shah, and Tong Zhao. Linkless link prediction via relational distillation. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research, 2023

  69. [69]

    Fast transformer decoding: One write-head is all you need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. 2019

  70. [70]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. 2023

  71. [71]

    Hydra: Sequentially-dependent draft heads for medusa decoding

    Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding. 2024

  72. [72]

    Eagle-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test. 2025

  73. [73]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. 2022

  74. [74]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. 2024

  75. [75]

    Justifying recommendations using distantly- labeled reviews and fine-grained aspects

    Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly- labeled reviews and fine-grained aspects. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019

  76. [76]

    Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction

    Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction. 2024

  77. [77]

    Cold-starts in generative recommendation: A reproducibility study

    Zhen Zhang, Jujia Zhao, Xinyu Ma, Xin Xin, Maarten de Rijke, and Zhaochun Ren. Cold-starts in generative recommendation: A reproducibility study. 2026. 14 A Notations Table 6 summarizes the notations used throughout the paper. Table 6: Notations and explanations. Notation Explanation U,VUser and item sets;uandvindex users and items Xu = [v1, . . . , vn]Hi...