arxiv: 2509.09682 · v6 · pith:PKCHSRLPnew · submitted 2025-08-13 · 💻 cs.IR

Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs

Pith reviewed 2026-05-21 23:18 UTC · model grok-4.3

classification 💻 cs.IR
keywords sequential recommendation · negative sampling · cross-entropy loss · Triton kernel · memory efficiency · transformer models · large catalogs · GPU training

The pith

A custom Triton kernel for cross-entropy loss with negative sampling trains sequential recommendation models up to twice as fast while cutting memory use by more than ten times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CCE-, an efficient implementation of cross-entropy loss with negative sampling for transformer-based sequential recommenders facing very large item catalogs. With full cross-entropy, peak memory scales with catalog size, batch size, and sequence length; negative sampling reduces that cost, but too few negatives hurt accuracy, while raising the number of negatives and the batch size together quickly exceeds the memory of a typical ~40 GB GPU. The new kernel fuses operations to avoid storing large intermediate tensors, freeing enough memory to increase both negative samples and batch size at once. This change delivers the reported speed and memory gains while preserving the original loss behavior. The authors also release the kernel so others can apply the same approach to their own large-scale training runs.
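To make the memory scaling concrete, the sketch below estimates the size of a single dense logit tensor of shape (batch, sequence, scored items). The function name and every number in it (batch size 128, sequence length 200, a one-million-item catalog, 256 negatives, 4-byte floats) are illustrative assumptions rather than values from the paper, and the estimate ignores gradients, gathered embeddings, and other activations.

```python
def logits_bytes(batch_size: int, seq_len: int, n_scored: int, bytes_per_elem: int = 4) -> int:
    """Bytes needed to hold one dense logit tensor of shape (batch, seq, n_scored)."""
    return batch_size * seq_len * n_scored * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical configuration: every position scores the whole catalog.
full = logits_bytes(batch_size=128, seq_len=200, n_scored=1_000_000)

# Same configuration, but each position scores 1 positive + 256 sampled negatives.
sampled = logits_bytes(batch_size=128, seq_len=200, n_scored=1 + 256)

print(f"full catalog logits:  {full / GIB:.2f} GiB")
print(f"sampled logits:       {sampled / GIB:.4f} GiB")
```

Under these assumed numbers the full-catalog tensor alone is far beyond a 40 GB device, which is the pressure that pushes practitioners toward negative sampling in the first place.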

Core claim

The central claim is that a GPU-efficient realization of cross-entropy loss with negative sampling, built as a custom Triton kernel, reduces peak memory by more than a factor of ten and speeds training by up to a factor of two relative to the standard PyTorch implementation. The freed memory makes it practical to raise both the number of negative samples per example and the overall batch size. The paper shows that jointly scaling these two quantities improves model accuracy on large-catalog datasets, whereas maximizing only one of them is less effective. The kernel is released to allow direct reproduction and further use.

What carries the argument

CCE-, a fused Triton-kernel implementation of cross-entropy loss with negative sampling that avoids materializing the full logit tensor over the entire catalog.
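The idea of not materializing logits can be sketched in plain Python. The example below is not the authors' Triton kernel; it is a minimal illustration, under the assumption that the fused computation relies on a running maximum and running sum (the online softmax normalizer), of how sampled cross-entropy for one position can be computed while only one tile of logits exists at a time. The names `fused_sampled_ce`, `dot`, and `tile` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fused_sampled_ce(hidden, pos_emb, neg_embs, tile=2):
    """Sampled cross-entropy for one position, holding at most `tile` negative logits."""
    pos_logit = dot(hidden, pos_emb)
    # The positive logit seeds the running normalizer: exp(pos - pos) = 1.
    run_max, run_sum = pos_logit, 1.0
    for start in range(0, len(neg_embs), tile):
        # Only this tile's logits are ever stored.
        tile_logits = [dot(hidden, e) for e in neg_embs[start:start + tile]]
        new_max = max(run_max, max(tile_logits))
        # Rescale the old sum to the new maximum, then add the tile's terms.
        run_sum = run_sum * math.exp(run_max - new_max) + sum(
            math.exp(z - new_max) for z in tile_logits
        )
        run_max = new_max
    # Loss = log-sum-exp over (positive + negatives) minus the positive logit.
    return run_max + math.log(run_sum) - pos_logit


hidden = [0.5, -1.0, 2.0]
pos = [1.0, 0.0, 0.5]
negs = [[0.2, 0.1, -0.3], [-1.0, 0.5, 0.4], [0.0, 0.0, 0.0]]
print(fused_sampled_ce(hidden, pos, negs, tile=2))
```

Only the positive logit and two scalars survive between tiles, so the working set is independent of the number of negatives. A real GPU kernel would do the same per thread block, with the tile held in on-chip memory.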

If this is right

  • More negative samples per training example can be used without exceeding typical 40 GB GPU memory limits.
  • Larger batch sizes become feasible, which the paper links to higher final accuracy.
  • Jointly increasing both negative-sample count and batch size outperforms increasing only one of the two.
  • Training runs finish in as little as half the time, allowing more frequent retraining on changing user data.
  • Models trained this way reach higher accuracy than those limited by memory to smaller negative sets or batches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same kernel technique could be adapted to other ranking or contrastive losses that currently require large memory footprints.
  • Production pipelines could now handle catalogs several times larger than current practical limits while keeping training within existing GPU hardware.
  • The memory headroom might allow deeper transformer stacks or longer user sequences without additional hardware.
  • Similar fusion strategies could be explored for CPU or distributed training settings where memory is also the bottleneck.

Load-bearing premise

The custom Triton kernel must compute exactly the same loss values and gradients as the standard PyTorch cross-entropy function for the chosen negative-sampling strategy.

What would settle it

Run the Triton kernel and the PyTorch cross-entropy on identical batches with the same negative samples and check whether the scalar loss and per-parameter gradients differ by more than floating-point round-off; any larger difference falsifies equivalence.
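The shape of such a check can be sketched without any GPU code. In the example below, `reference` materializes all logits and `candidate` uses a one-pass running max and sum as a stand-in for a fused kernel; both functions, the tolerance `TOL`, and the random test inputs are assumptions made for illustration, not the paper's test protocol. The harness compares the scalar loss and the gradient with respect to each logit on identical inputs.

```python
import math
import random


def reference(logits):
    """Loss and d(loss)/d(logits), target at index 0, all logits materialized."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    loss = m + math.log(total) - logits[0]
    grad = [e / total for e in exps]  # softmax
    grad[0] -= 1.0                    # minus one-hot target
    return loss, grad


def candidate(logits):
    """Same quantities via a running max/sum (stand-in for a fused kernel)."""
    run_max, run_sum = logits[0], 1.0
    for z in logits[1:]:
        new_max = max(run_max, z)
        run_sum = run_sum * math.exp(run_max - new_max) + math.exp(z - new_max)
        run_max = new_max
    lse = run_max + math.log(run_sum)
    grad = [math.exp(z - lse) for z in logits]
    grad[0] -= 1.0
    return lse - logits[0], grad


def max_abs_diff(logits):
    """Largest loss and gradient discrepancy between the two implementations."""
    ref_loss, ref_grad = reference(logits)
    cand_loss, cand_grad = candidate(logits)
    grad_diff = max(abs(a - b) for a, b in zip(ref_grad, cand_grad))
    return abs(ref_loss - cand_loss), grad_diff


TOL = 1e-9  # assumed tolerance for float64; a float32 kernel would need a looser one
rng = random.Random(0)
worst_loss = worst_grad = 0.0
for _ in range(100):
    logits = [rng.uniform(-10.0, 10.0) for _ in range(257)]  # 1 positive + 256 negatives
    dl, dg = max_abs_diff(logits)
    worst_loss, worst_grad = max(worst_loss, dl), max(worst_grad, dg)

print("equivalent within tolerance:", worst_loss <= TOL and worst_grad <= TOL)
```

For the actual kernel the same pattern would apply with the PyTorch loss as `reference`, the Triton kernel as `candidate`, and gradients taken with respect to the model parameters rather than the logits.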

Figures

Figures reproduced from arXiv: 2509.09682 by Alexey Vasilev, Alexey Zaytsev, Anna Volodkevich, Daniil Volkov, Darya Denisova, Dmitry Redko, Egor Shvetsov, Maxim Zhelnin, Petr Sokerin, Ruslan Izmailov, Valeriy Shevchenko.

Figure 1: Here, we compare several methods (CE, CE−, CCE, CCE−, and SCE) in terms of (1) NDCG@10, (2) training time per epoch in seconds, and (3) memory consumption in GB for the SASRec model across six datasets. We demonstrate the best performance achieved through optimized hyperparameters. We can see that CE and CE− require much more time and memory.
Figure 2: Illustration of CE, CCE and CCE− implementation. CE: materializes full logits in GPU memory, matmul and softmax are computed separately → high GPU memory usage, full 𝐶 × 𝐸 matmul, extra data transfers between HBM and SRAM lead to higher time delay. CCE: no logits materialization, fused matmul + softmax → store only positive logits and LSE vector, reduced HBM ↔ SRAM transfers, but still full 𝐶 × 𝐸 matmul. C…
Figure 3: An illustration of computing gradients for the weights of the final layer.
Figure 4: Histograms of the gradient distributions at the final layer at the beginning (left) and end (right) of the training, the Megamarket …
Figure 5: Memory scaling across three dimensions: (1) marker size ( …
Figure 6: This figure presents aggregated results from Figure 5, averaging across sequence lengths. Error bars indicate standard deviations.
Figure 7: Parameter analysis: (a) correlation structure, (b) predictive importance for recommendation metrics, …
Figure 8: Performance of SASRec trained with CCE using gradient filtering at increasing thresholds (filter_eps parameter). The NDCG@10 …
Figure 9: NDCG@10 metric and time per epoch of the SASRec models trained with …
Original abstract

Sequential recommendations (SR) with transformer-based architectures are widely adopted in real-world applications, where SR models require frequent retraining to adapt to ever-changing user preferences. However, training transformer-based SR models often encounters a high computational cost associated with scoring extensive item catalogs, often exceeding thousands of items. This occurs mainly due to the use of cross-entropy loss, where peak memory scales proportionally to catalog size, batch size, and sequence length. Recognizing this, practitioners in the field of recommendation systems typically address memory consumption by integrating the cross-entropy (CE) loss with negative sampling, thereby reducing the explicit memory demands of the final layer. However, a small number of negative samples would degrade model performance, and as we demonstrate in our work, increasing the number of negative samples and the batch size further improves the model's performance, but rapidly starts to exceed industrial GPUs' size (~40Gb). In this work, we introduce the CCE- method, which offers a GPU-efficient implementation of the CE loss with negative sampling. Our method accelerates training by up to two times while reducing memory consumption by more than 10 times. Leveraging the memory savings afforded by using CCE- for model training, it becomes feasible to enhance its accuracy on datasets with a large item catalog compared to those trained with original PyTorch-implemented loss functions. Finally, we perform an analysis of key memory-related hyperparameters and highlight the necessity of a delicate balance among these factors. We demonstrate that scaling both the number of negative samples and batch size leads to better results rather than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces CCE-, a Triton-kernel implementation of cross-entropy loss with negative sampling for transformer-based sequential recommendation models. It claims up to 2x training speedup and >10x memory reduction versus standard PyTorch CE, enabling larger batch sizes and negative counts that improve accuracy on large-catalog datasets; the kernel is released to support adoption.

Significance. If the numerical equivalence of the kernel holds, the work offers a practical, deployable improvement for industrial SR training on catalogs exceeding thousands of items, where memory is the primary bottleneck. Releasing the Triton kernel and the analysis showing that balanced scaling of negatives and batch size outperforms maximizing either alone are concrete strengths that aid reproducibility and practical use.

major comments (1)
  1. [Section describing the kernel and loss computation] The central claims of both efficiency gains and downstream accuracy improvements from scaling batch size and negative samples rest on the assumption that the custom Triton kernel computes numerically stable and exactly equivalent results to PyTorch cross-entropy under the chosen negative-sampling regime. No side-by-side loss-value comparisons, gradient-norm checks, tolerance thresholds, or ablation confirming unchanged model metrics when swapping the kernel for the reference implementation are reported. Any deviation in the logits-to-loss path or gradient flow would mean observed accuracy gains could arise from altered loss semantics rather than from the ability to fit larger configurations.
minor comments (1)
  1. [Abstract] The statement that 'increasing the number of negative samples and the batch size further improves the model's performance' would benefit from a brief parenthetical reference to the specific datasets and metrics used to demonstrate this.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern regarding numerical equivalence is well-taken and we address it directly below. We will incorporate the requested verification experiments into the revised manuscript.

Point-by-point responses
  1. Referee: The section describing the kernel and loss computation: the central claims of both efficiency gains and downstream accuracy improvements from scaling batch size and negative samples rest on the assumption that the custom Triton kernel computes numerically stable and exactly equivalent results to PyTorch cross-entropy under the chosen negative-sampling regime. No side-by-side loss-value comparisons, gradient-norm checks, tolerance thresholds, or ablation confirming unchanged model metrics when swapping the kernel for the reference implementation are reported. Any deviation in the logits-to-loss path or gradient flow would mean observed accuracy gains could arise from altered loss semantics rather than from the ability to fit larger configurations.

    Authors: We agree that explicit numerical verification strengthens the central claims. The CCE- kernel implements the identical cross-entropy loss with negative sampling formula used by PyTorch (log-softmax over the positive and sampled negatives, followed by negative log-likelihood), with all operations performed in the same floating-point precision and without any approximation or reordering that would alter semantics. In the revised manuscript we will add: (1) side-by-side loss-value tables for identical input logits showing maximum absolute differences below 1e-6; (2) gradient-norm comparisons across multiple batches confirming identical L2 norms within machine precision; (3) a controlled ablation on two datasets where models are trained to convergence with both the Triton kernel and the reference PyTorch implementation (using identical random seeds and smaller feasible batch/negative counts), reporting identical NDCG@10 and Recall@10 within statistical noise. These additions will demonstrate that accuracy gains arise solely from the ability to scale batch size and negative count, not from any change in loss or gradient semantics.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: implementation artifact with empirical claims

full rationale

The paper describes an engineering optimization: a custom Triton kernel (CCE-) for memory-efficient negative-sampled cross-entropy loss in transformer-based sequential recommenders. All reported gains (2x speedup, >10x memory reduction, accuracy improvements from larger batch/negative counts) are presented as direct outcomes of the kernel implementation and subsequent empirical scaling experiments. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. The work is self-contained against external benchmarks because the kernel can be inspected, executed, and compared to PyTorch reference on the same negative-sampling regime; any numerical deviation would be falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an implementation rather than new theoretical constructs; it relies on standard GPU programming assumptions and existing negative sampling practice.

axioms (1)
  • Domain assumption: negative sampling with a chosen number of negatives approximates full cross-entropy sufficiently for model quality.
    Invoked when claiming performance gains from increasing negatives.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 5 internal anchors

  [1] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems 31 (2018)
  [2] Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Claudio Pomo, and Azzurra Ragone. 2019. On the discriminative power of hyperparameters in cross-validation and how to choose them. In Proceedings of the 13th ACM Conference on Recommender Systems. ACM, Copenhagen, Denmark, 447–451. doi:10.1145/3298689.3347010
  [3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
  [4] Tesfaye Fenta Boka, Zhendong Niu, and Rama Bastola Neupane. 2024. A survey of sequential recommendation systems: Techniques, evaluation, and future directions. Information Systems 125 (2024), 102427
  [5] Viktoriia A Chekalina, Anna Rudenko, Gleb Mezentsev, Aleksandr Mikhalev, Alexander Panchenko, and Ivan Oseledets. 2024. SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. ACL, Miami, Florida, USA, 14929–14939
  [6] Chen Cheng, Haiqin Yang, Michael R Lyu, and Irwin King. 2013. Where you like to go next: Successive point-of-interest recommendation. In IJCAI, Vol. 13. AAAI Press, Beijing, China, 2605–2611
  [7] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359
  [8] Giulia Di Teodoro, Federico Siciliano, Nicola Tonellotto, and Fabrizio Silvestri. 2024. A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling. arXiv preprint arXiv:2411.07770 (2024)
  [9] Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, et al. 2024. Scaling New Frontiers: Insights into Large Recommendation Models. arXiv preprint arXiv:2412.00714 (2024)
  [10] Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, and Evgeny Frolov. 2025. Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. 874–883
  [11] F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2015), 1–19
  [12] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, and Siyu Zhu. 2024. Liger-Kernel: Efficient Triton kernels for LLM training. (2024). https://github.com/linkedin/Liger-Kernel
  [13] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. 2024. Liger kernel: Efficient triton kernels for llm training. arXiv preprint arXiv:2410.10989 (2024)
  [15] Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, and Jun Zhu. 2024. Accelerating transformer pre-training with 2:4 sparsity. arXiv preprint arXiv:2404.01847 (2024)
  [16] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, IEEE Computer Society, 197–206
  [17] Anton Klenitskiy and Alexey Vasilev. 2023. Turning dross into gold loss: is bert4rec really better than sasrec?. In Proceedings of the 17th ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, 1120–1125
  [18] Anton Klenitskiy, Anna Volodkevich, Anton Pembek, and Alexey Vasilev. 2024. Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations. In Proceedings of the 18th ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, 1067–1072
  [19] Amanda Krause, Adrian North, and Lauren Hewitt. 2014. Music selection behaviors in everyday listening. Journal of Broadcasting & Electronic Media 58, 2 (2014), 306–323
  [20] Conglong Li, Minjia Zhang, and Yuxiong He. 2022. The stability-efficiency dilemma: Investigating sequence length warmup for training GPT models. Advances in Neural Information Processing Systems 35 (2022), 26736–26750
  [21] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. Association for Computing Machinery, New York, NY, USA, 43–52
  [22] Gleb Mezentsev, Danil Gusak, Ivan Oseledets, and Evgeny Frolov. 2024. Scalable cross-entropy loss for sequential recommendations with large item catalogs. In Proceedings of the 18th ACM Conference on Recommender Systems. 475–485
  [23] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)
  [24] Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867 (2018)
  [25] Arushi Prakash, Dimitrios Bermperidis, and Srivas Chennu. 2024. Evaluating Performance and Bias of Negative Sampling in Large-Scale Sequential Recommendation Models. arXiv preprint arXiv:2410.17276 (2024)
  [26] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. 2017. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory. PMLR, 1723–1742
  [27] Valeriy Shevchenko, Nikita Belousov, Alexey Vasilev, Vladimir Zholobov, Artyom Sosedka, Natalia Semenova, Anna Volodkevich, Andrey Savchenko, and Alexey Zaytsev. 2024. From variability to stability: Advancing RecSys benchmarking practices. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5701–5712
  [28] Egor Shvetsov, Dmitry Osin, Alexey Zaytsev, Ivan Koryakovskiy, Valentin Buchnev, Ilya Trofimov, and Evgeny Burnaev. 2024. QuantNAS for super resolution: searching for efficient quantization-friendly architectures against quantization noise. IEEE Access (2024)
  [29] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning. 1015–1022
  [30] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450
  [31] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. 2 (2024), 10–19. https://arxiv.org/abs/2403.08295
  [32] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19
  [33] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  [34] Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, Paolo Cremonesi, et al. 2015. 30Music Listening and Playlists Dataset. RecSys Posters 75 (2015)
  [35] Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, and Anton Klenitskiy. 2024. RePlay: a Recommendation Framework for Experimentation and Production Use. In Proceedings of the 18th ACM Conference on Recommender Systems. 1191–1194
  [36] Chenxu Wang, Aodian Liu, and Tao Qin. 2024. Learning-to-rank debias with popularity-weighted negative sampling and popularity regularization. Neurocomputing 587 (2024), 127681
  [37] Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, and Philipp Krähenbühl. 2024. Cut your losses in large-vocabulary language models. arXiv preprint arXiv:2411.09009 (2024)
  [38] Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. 2023. Stable and low-precision training for large-scale vision-language models. Advances in Neural Information Processing Systems 36 (2023), 10271–10298
  [39] Songpei Xu, Shijia Wang, Da Guo, Xianwen Guo, Qiang Xiao, Fangjian Li, and Chuanjiang Luo. 2025. An Efficient Large Recommendation Model: Towards a Resource-Optimal Scaling Law. arXiv preprint arXiv:2502.09888 (2025)
  [40] Gaowei Zhang, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Scaling law of large sequential recommendation models. In Proceedings of the 18th ACM Conference on Recommender Systems. 444–453
  [41] Wayne Xin Zhao, Zihan Lin, Zhichao Feng, Pengfei Wang, and Ji-Rong Wen. 2022. A revisiting study of appropriate offline evaluation for top-N recommendation algorithms. ACM Transactions on Information Systems 41, 2 (2022), 1–41
  [42] Pablo Zivic, Hernan Vazquez, and Jorge Sánchez. 2024. Scaling Sequential Recommendation Models with Transformers. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1567–1577