Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs
Pith reviewed 2026-05-21 23:18 UTC · model grok-4.3
The pith
A custom Triton kernel for cross-entropy loss with negative sampling trains sequential recommendation models up to twice as fast while cutting memory use by more than ten times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a GPU-efficient realization of cross-entropy loss with negative sampling, built as a custom Triton kernel, reduces peak memory by more than a factor of ten and speeds training by up to a factor of two relative to the standard PyTorch implementation. Because memory no longer grows linearly with catalog size, it becomes practical to raise both the number of negative samples per example and the overall batch size. The paper shows that jointly scaling these two quantities improves model accuracy on large-catalog datasets, whereas maximizing only one of them is less effective. The kernel is released to allow direct reproduction and further use.
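The memory argument can be made concrete with a back-of-the-envelope calculation. The sketch below counts only the bytes needed to hold the output scores, assuming float32 values and one set of negatives per sequence position. The batch size, sequence length, catalog size, and negative count are illustrative values and do not come from the paper.

```python
# Illustrative sizes only; none of these numbers come from the paper.
BYTES_PER_FLOAT32 = 4


def logit_bytes(batch_size: int, seq_len: int, num_scores: int,
                bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Bytes needed to hold one score per (sequence, position, candidate item)."""
    return batch_size * seq_len * num_scores * bytes_per_value


batch_size, seq_len = 128, 200   # hypothetical training configuration
catalog_size = 1_000_000         # hypothetical large catalog
num_negatives = 512              # hypothetical negatives per position

full = logit_bytes(batch_size, seq_len, catalog_size)
sampled = logit_bytes(batch_size, seq_len, num_negatives + 1)  # +1 for the positive

print(f"full-catalog logits: {full / 1e9:.1f} GB")
print(f"sampled logits:      {sampled / 1e6:.1f} MB")
```

Under these assumptions the full-catalog score tensor alone would not fit in 40 GB, while the sampled tensor is small. The sampled tensor still grows linearly with batch size and negative count, and gradients and intermediate buffers add to it, which is presumably the cost a fused kernel is meant to avoid.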
What carries the argument
CCE-, a fused Triton-kernel implementation of cross-entropy loss with negative sampling that avoids materializing the full logit tensor over the entire catalog.
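One way to avoid storing every logit is to score negatives in chunks and keep only a running maximum and a running sum of exponentials. The sketch below shows that idea in plain Python for a single sequence position. It is not the paper's Triton kernel, and whether CCE- uses this exact accumulation scheme is an assumption; the function names, vectors, and chunk size are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def streaming_sampled_ce(hidden: Sequence[float], pos_emb: Sequence[float],
                         neg_embs: List[Sequence[float]], chunk_size: int = 2) -> float:
    """Negative-sampled cross-entropy for one position, one chunk of negatives at a time.

    Only a running maximum and a running sum of exponentials are kept,
    so the complete list of logits is never stored.
    """
    pos_logit = dot(hidden, pos_emb)
    running_max = pos_logit
    running_sum = 1.0  # exp(pos_logit - running_max)
    for start in range(0, len(neg_embs), chunk_size):
        chunk_logits = [dot(hidden, e) for e in neg_embs[start:start + chunk_size]]
        new_max = max(running_max, max(chunk_logits))
        # Rescale the old sum to the new maximum before adding the new terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(z - new_max) for z in chunk_logits
        )
        running_max = new_max
    return running_max + math.log(running_sum) - pos_logit


def materialized_sampled_ce(hidden, pos_emb, neg_embs) -> float:
    """Reference path: build every logit first, then apply log-softmax + NLL."""
    logits = [dot(hidden, pos_emb)] + [dot(hidden, e) for e in neg_embs]
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[0]


# Hypothetical toy inputs.
hidden = [0.5, -1.0, 2.0]
pos_emb = [1.0, 0.0, 1.0]
neg_embs = [[0.0, 1.0, 0.0], [2.0, 0.5, -1.0], [-1.0, -1.0, 1.0],
            [0.3, 0.2, 0.1], [1.5, -0.5, 0.5]]

streaming = streaming_sampled_ce(hidden, pos_emb, neg_embs, chunk_size=2)
materialized = materialized_sampled_ce(hidden, pos_emb, neg_embs)
print(abs(streaming - materialized))
```

The two paths agree up to floating-point round-off, but the streaming one needs working memory proportional to the chunk size rather than to the number of negatives.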
If this is right
- More negative samples per training example can be used without exceeding typical 40 GB GPU memory limits.
- Larger batch sizes become feasible, which the paper links to higher final accuracy.
- Jointly increasing both negative-sample count and batch size outperforms increasing only one of the two.
- Training runs finish in roughly half the time, allowing more frequent retraining on changing user data.
- Models trained this way reach higher accuracy than those limited by memory to smaller negative sets or batches.
Where Pith is reading between the lines
- The same kernel technique could be adapted to other ranking or contrastive losses that currently require large memory footprints.
- Production pipelines could now handle catalogs several times larger than current practical limits while keeping training within existing GPU hardware.
- The memory headroom might allow deeper transformer stacks or longer user sequences without additional hardware.
- Similar fusion strategies could be explored for CPU or distributed training settings where memory is also the bottleneck.
Load-bearing premise
The custom Triton kernel must compute exactly the same loss values and gradients as the standard PyTorch cross-entropy function for the chosen negative-sampling strategy.
What would settle it
Run the Triton kernel and the PyTorch cross-entropy on identical batches with the same negative samples and check whether the scalar loss and per-parameter gradients differ by more than floating-point round-off; any larger difference falsifies equivalence.
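The check described above can be sketched as a small harness that feeds identical logits to a reference and a candidate implementation and records the worst loss and gradient discrepancies. Everything here is a stand-in: the candidate is a plain-Python function that reaches the same loss through a different arithmetic path, not the Triton kernel, and the tolerance of 1e-9 suits double precision only. A float32 or mixed-precision kernel would need a looser, explicitly justified threshold.

```python
import math
import random


def reference_loss_and_grad(logits):
    """Log-softmax + NLL with the positive at index 0; returns (loss, dloss/dlogits)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    loss = m + math.log(total) - logits[0]
    grad = [e / total for e in exps]
    grad[0] -= 1.0
    return loss, grad


def candidate_loss_and_grad(logits):
    """Hypothetical stand-in for the kernel under test.

    Computes the same loss as log(1 + sum_n exp(z_n - z_pos)).
    """
    pos = logits[0]
    shifted = [z - pos for z in logits[1:]]
    m = max(shifted + [0.0])
    terms = [math.exp(s - m) for s in shifted]
    total = math.exp(-m) + sum(terms)  # exp(-m) is the rescaled "1" term
    loss = m + math.log(total)
    neg_grads = [t / total for t in terms]
    return loss, [-sum(neg_grads)] + neg_grads


def check_equivalence(ref_fn, cand_fn, batches, atol=1e-9):
    """Return (worst loss diff, worst gradient diff, both within atol)."""
    worst_loss = worst_grad = 0.0
    for logits in batches:
        ref_loss, ref_grad = ref_fn(logits)
        cand_loss, cand_grad = cand_fn(logits)
        worst_loss = max(worst_loss, abs(ref_loss - cand_loss))
        worst_grad = max(worst_grad,
                         max(abs(a - b) for a, b in zip(ref_grad, cand_grad)))
    return worst_loss, worst_grad, worst_loss <= atol and worst_grad <= atol


rng = random.Random(0)
# 32 hypothetical examples, each with 1 positive and 256 negatives.
batches = [[rng.gauss(0.0, 5.0) for _ in range(257)] for _ in range(32)]
worst_loss, worst_grad, ok = check_equivalence(
    reference_loss_and_grad, candidate_loss_and_grad, batches
)
print(worst_loss, worst_grad, ok)
```

Reporting the worst-case differences alongside the pass/fail flag matters, because a reader can then judge whether a discrepancy is round-off or a change in loss semantics.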
Original abstract
Sequential recommendations (SR) with transformer-based architectures are widely adopted in real-world applications, where SR models require frequent retraining to adapt to ever-changing user preferences. However, training transformer-based SR models often encounters a high computational cost associated with scoring extensive item catalogs, often exceeding thousands of items. This occurs mainly due to the use of cross-entropy loss, where peak memory scales proportionally to catalog size, batch size, and sequence length. Recognizing this, practitioners in the field of recommendation systems typically address memory consumption by integrating the cross-entropy (CE) loss with negative sampling, thereby reducing the explicit memory demands of the final layer. However, a small number of negative samples would degrade model performance, and as we demonstrate in our work, increasing the number of negative samples and the batch size further improves the model's performance, but rapidly starts to exceed industrial GPUs' size (~40 GB). In this work, we introduce the CCE- method, which offers a GPU-efficient implementation of the CE loss with negative sampling. Our method accelerates training by up to two times while reducing memory consumption by more than 10 times. Leveraging the memory savings afforded by using CCE- for model training, it becomes feasible to enhance its accuracy on datasets with a large item catalog compared to those trained with original PyTorch-implemented loss functions. Finally, we perform an analysis of key memory-related hyperparameters and highlight the necessity of a delicate balance among these factors. We demonstrate that scaling both the number of negative samples and batch size leads to better results rather than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CCE-, a Triton-kernel implementation of cross-entropy loss with negative sampling for transformer-based sequential recommendation models. It claims up to 2x training speedup and >10x memory reduction versus standard PyTorch CE, enabling larger batch sizes and negative counts that improve accuracy on large-catalog datasets; the kernel is released to support adoption.
Significance. If the numerical equivalence of the kernel holds, the work offers a practical, deployable improvement for industrial SR training on catalogs exceeding thousands of items, where memory is the primary bottleneck. Releasing the Triton kernel and the analysis showing that balanced scaling of negatives and batch size outperforms maximizing either alone are concrete strengths that aid reproducibility and practical use.
major comments (1)
- [Section describing the kernel and loss computation] The central claims of both efficiency gains and downstream accuracy improvements from scaling batch size and negative samples rest on the assumption that the custom Triton kernel computes numerically stable and exactly equivalent results to PyTorch cross-entropy under the chosen negative-sampling regime. No side-by-side loss-value comparisons, gradient-norm checks, tolerance thresholds, or ablation confirming unchanged model metrics when swapping the kernel for the reference implementation are reported. Any deviation in the logits-to-loss path or gradient flow would mean observed accuracy gains could arise from altered loss semantics rather than from the ability to fit larger configurations.
minor comments (1)
- [Abstract] The statement that 'increasing the number of negative samples and the batch size further improves the model's performance' would benefit from a brief parenthetical reference to the specific datasets and metrics used to demonstrate this.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concern regarding numerical equivalence is well-taken and we address it directly below. We will incorporate the requested verification experiments into the revised manuscript.
- Referee: The section describing the kernel and loss computation: the central claims of both efficiency gains and downstream accuracy improvements from scaling batch size and negative samples rest on the assumption that the custom Triton kernel computes numerically stable and exactly equivalent results to PyTorch cross-entropy under the chosen negative-sampling regime. No side-by-side loss-value comparisons, gradient-norm checks, tolerance thresholds, or ablation confirming unchanged model metrics when swapping the kernel for the reference implementation are reported. Any deviation in the logits-to-loss path or gradient flow would mean observed accuracy gains could arise from altered loss semantics rather than from the ability to fit larger configurations.
Authors: We agree that explicit numerical verification strengthens the central claims. The CCE- kernel implements the identical cross-entropy loss with negative sampling formula used by PyTorch (log-softmax over the positive and sampled negatives, followed by negative log-likelihood), with all operations performed in the same floating-point precision and without any approximation or reordering that would alter semantics. In the revised manuscript we will add: (1) side-by-side loss-value tables for identical input logits showing maximum absolute differences below 1e-6; (2) gradient-norm comparisons across multiple batches confirming identical L2 norms within machine precision; (3) a controlled ablation on two datasets where models are trained to convergence with both the Triton kernel and the reference PyTorch implementation (using identical random seeds and smaller feasible batch/negative counts), reporting identical NDCG@10 and Recall@10 within statistical noise. These additions will demonstrate that accuracy gains arise solely from the ability to scale batch size and negative count, not from any change in loss or gradient semantics.
Revision: yes
Circularity Check
No circularity: implementation artifact with empirical claims
The paper describes an engineering optimization: a custom Triton kernel (CCE-) for memory-efficient negative-sampled cross-entropy loss in transformer-based sequential recommenders. All reported gains (2x speedup, >10x memory reduction, accuracy improvements from larger batch/negative counts) are presented as direct outcomes of the kernel implementation and the subsequent empirical scaling experiments. The provided text contains no mathematical derivation chain, no fitted parameters renamed as predictions, and no uniqueness theorem that rests on self-citation. The claims can be checked independently of the paper: the kernel can be inspected, executed, and compared against the PyTorch reference under the same negative-sampling regime, so any numerical deviation would be detectable without relying on the paper's own fitted values.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: negative sampling with a chosen number of negatives approximates full cross-entropy well enough to preserve model quality.
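How close the sampled loss comes to the full loss can be illustrated numerically. For fixed logits, the log-sum-exp over the positive plus a subset of negatives never exceeds the log-sum-exp over the whole catalog, so the uncorrected sampled loss is a lower bound that tightens as negatives are added. The sketch below assumes synthetic Gaussian logits, uniform sampling, and no logit correction; the paper's sampling strategy may differ, and the catalog size is hypothetical.

```python
import math
import random


def ce_loss(pos_logit, other_logits):
    """Cross-entropy of the positive against the given competing logits."""
    logits = [pos_logit] + list(other_logits)
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - pos_logit


rng = random.Random(0)
catalog_size = 10_000  # hypothetical catalog
logits = [rng.gauss(0.0, 1.0) for _ in range(catalog_size)]
pos_logit, neg_pool = logits[0], logits[1:]
rng.shuffle(neg_pool)  # prefixes of one shuffle act as nested uniform samples

full_loss = ce_loss(pos_logit, neg_pool)
for k in (16, 128, 1024, len(neg_pool)):
    sampled_loss = ce_loss(pos_logit, neg_pool[:k])
    print(f"negatives={k:>5}  sampled={sampled_loss:.4f}  full={full_loss:.4f}")
```

The gap at small negative counts shows why this assumption carries weight: a model trained with few negatives optimizes a noticeably different objective from full cross-entropy.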