pith. machine review for the scientific record. sign in

arxiv: 2604.06956 · v1 · submitted 2026-04-08 · 💻 cs.DC · cs.LG

Recognition: 2 theorem links

· Lean Theorem

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords recommendation systemsdistributed trainingpipeliningembedding tablesAll2All communicationlarge-scale clustersscaling efficiencyGPU and NPU training
0
0 comments X

The pith

Nested pipelining trains trillion-parameter recommendation models synchronously on 1,536 accelerators with up to 3.06x speedup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large-scale recommendation training can overcome embedding lookup and communication bottlenecks by applying pipelining at two nested levels. Dual-buffer pipelining builds a five-stage inter-batch pipeline that avoids any embedding staleness through careful synchronization. Frozen-window pipelining then overlaps All2All communication with dense computation inside each batch by clustering samples around stable embedding keys. This combination runs on production GPU and NPU clusters while delivering the same training semantics as fully synchronous methods. If the approach holds, organizations can scale model size and cluster size together without the usual sharp rise in data-movement overhead.

Core claim

NestPipe shows that two hierarchical sparse-parallelism opportunities can be exploited together through nested pipelining. Dual-Buffer Pipelining (DBP) creates a staleness-free five-stage pipeline that hides lookup latency. Frozen-Window Pipelining (FWP) uses the embedding-freezing phenomenon to overlap All2All traffic with dense computation via stream scheduling and key-centric clustering. On production clusters of 1,536 workers the system reaches 3.06x speedup and 94.07 percent scaling efficiency without accuracy loss.

What carries the argument

Nested pipelining that combines Dual-Buffer Pipelining for inter-batch lookup hiding with Frozen-Window Pipelining for intra-batch communication-computation overlap, while preserving exact synchronous semantics.

If this is right

  • Training throughput for trillion-parameter models rises without forcing a choice between speed and consistency.
  • Clusters of 1,500+ accelerators can maintain scaling efficiency above 90 percent for recommendation workloads.
  • Both GPU and NPU hardware benefit from the same dual-level pipeline structure.
  • Production systems no longer need to relax synchronization to hide data-movement costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same freezing insight could be applied to other sparse-data workloads such as graph neural networks or large language models with dynamic vocabularies.
  • If freezing patterns prove architecture-dependent, an adaptive window size could further improve overlap on varied model designs.
  • Reducing effective communication volume this way may change the economic trade-off between adding more accelerators versus investing in faster interconnects.

Load-bearing premise

The embedding freezing phenomenon stays stable and predictable enough in real workloads that overlapping communication never introduces hidden inconsistencies or accuracy loss.

What would settle it

Measure final model accuracy on a production workload whose embedding access patterns shift rapidly; any statistically significant deviation from a baseline synchronous run would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.06956 by Baopeng Yuan, Hua Du, Huichao Chai, Jiaxing Wang, Ke Zhang, Qiang Peng, Tianxing Sun, Xinyu Liu, Xuemiao Li, Yikui Cao, Yongxiang Feng, Zhaolong Xing, Zhen Chen, Zhida Jiang, Zhixin Wu.

Figure 1
Figure 1. Figure 1: Hybrid decentralized architecture for large-scale rec [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Impact of cluster scale on sparse lookup and commu [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of NestPipe. operation but a multi-stage data-movement. Before a batch can enter effective model computation, we must complete CPU￾side data preprocessing, distributed key routing, embedding re￾trieval, and H2D transfers. We analyze the dependency chains in decentralized training architecture and identify distinct hardware resources (e.g., CPU for preprocessing, network for communication, HBM for … view at source ↗
Figure 4
Figure 4. Figure 4: Dual-buffer synchronization in DBP strategy. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Implementation of FWP strategy through coordinated communication and computation stream scheduling. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The training loss and accuracy curve of different methods. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 9
Figure 9. Figure 9: Impact of micro-batch size on step latency and exposed [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 8
Figure 8. Figure 8: Resource utilization ratio of different methods for [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Step latency breakdown for varying embedding dimensions, dense layers, and sequence lengths. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
read the original abstract

Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. NestPipe presents a decentralized embedding training framework for trillion-parameter recommendation models on clusters up to 1,536 accelerators. It uses nested pipelining: Dual-Buffer Pipelining (DBP) creates a five-stage staleness-free inter-batch pipeline via dual-buffer synchronization to mitigate lookup latency, while Frozen-Window Pipelining (FWP) exploits the embedding freezing phenomenon with key-centric sample clustering and coordinated stream scheduling to overlap All2All communication with dense computation at the intra-batch level. The paper reports up to 3.06× speedup and 94.07% scaling efficiency on production GPU/NPU clusters while claiming to preserve exact synchronous training semantics.

Significance. If the synchronous-semantics claim holds, the work would be significant for scaling recommendation training beyond current bottlenecks in embedding lookup and communication. The use of real production workloads and large-scale clusters (1,536 workers) is a positive aspect, as is the focus on maintaining training consistency rather than trading it for throughput. However, the absence of accuracy validation limits the assessed impact.

major comments (2)
  1. [§5] §5 (Evaluation): The reported results focus exclusively on throughput (up to 3.06×) and scaling efficiency (94.07%) but provide no accuracy, AUC, loss-curve, or model-equivalence metrics versus a non-pipelined synchronous baseline. This is load-bearing for the central claim because FWP's 'staleness-free' guarantee and preservation of synchronous semantics rest on the unvalidated assumption that embedding freezing is stable and complete enough to avoid altering gradient flow or final model quality.
  2. [§3.2] §3.2 (FWP description): The mechanism by which key-centric sample clustering and coordinated stream scheduling ensure identical computation order and gradient updates (no hidden inconsistencies) is described at a high level but lacks a formal argument, invariant, or small-scale equivalence proof. Without this, the claim that FWP overlaps communication without introducing staleness cannot be assessed as sound.
minor comments (2)
  1. [Figure 3] Figure 3 (or equivalent pipeline diagram): The five-stage DBP pipeline and FWP window overlap would benefit from explicit timing annotations showing buffer synchronization points to improve clarity of the 'staleness-free' property.
  2. [§1] The abstract and §1 use 'parameter-free' or similar phrasing for certain ratios; verify that no hidden workload-dependent parameters are introduced in the FWP clustering heuristic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the importance of preserving synchronous semantics at scale. We address each major concern below with proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5] §5 (Evaluation): The reported results focus exclusively on throughput (up to 3.06×) and scaling efficiency (94.07%) but provide no accuracy, AUC, loss-curve, or model-equivalence metrics versus a non-pipelined synchronous baseline. This is load-bearing for the central claim because FWP's 'staleness-free' guarantee and preservation of synchronous semantics rest on the unvalidated assumption that embedding freezing is stable and complete enough to avoid altering gradient flow or final model quality.

    Authors: We agree that direct empirical validation of model quality is essential to substantiate the synchronous-semantics claim. DBP is designed to be staleness-free by construction via dual-buffer synchronization, while FWP exploits the embedding-freezing phenomenon (where selected embeddings remain unchanged within a window) to ensure that overlapped All2All communication does not alter computation order, gradient flow, or updates. Nevertheless, we acknowledge the value of explicit metrics. In the revised manuscript we will add small-scale equivalence experiments on representative production workloads, reporting AUC, loss curves, and final model quality comparisons against a non-pipelined synchronous baseline to confirm that NestPipe produces identical results. revision: yes

  2. Referee: [§3.2] §3.2 (FWP description): The mechanism by which key-centric sample clustering and coordinated stream scheduling ensure identical computation order and gradient updates (no hidden inconsistencies) is described at a high level but lacks a formal argument, invariant, or small-scale equivalence proof. Without this, the claim that FWP overlaps communication without introducing staleness cannot be assessed as sound.

    Authors: We appreciate this observation. Key-centric sample clustering groups samples sharing the same embedding keys into contiguous windows, and coordinated stream scheduling overlaps All2All communication only for frozen embeddings whose values do not participate in the current dense computation or gradient update. This preserves the exact computation order and gradient flow of the original synchronous schedule. To address the request for rigor, the revised §3.2 will include an explicit invariant stating that the dataflow graph and gradient updates remain identical under freezing, together with a small-scale equivalence argument (including a toy-model proof sketch) demonstrating that no hidden inconsistencies arise. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on experimental speedups and scaling measurements, not self-referential derivations or fitted predictions

full rationale

The paper describes NestPipe as a systems framework using Dual-Buffer Pipelining (DBP) and Frozen-Window Pipelining (FWP) to address embedding lookup and All2All bottlenecks while claiming to preserve synchronous semantics. The load-bearing assertions are empirical: up to 3.06x speedup and 94.07% scaling efficiency on 1,536-worker clusters. No mathematical derivation chain, first-principles equations, or parameter-fitting steps appear in the provided text that reduce by construction to the inputs (e.g., no 'prediction' of speedup derived from a model whose parameters were fitted to the same speedup data). The embedding freezing phenomenon is presented as an observed property of production workloads that the FWP design exploits via key-centric clustering and stream scheduling; it is not defined circularly in terms of the resulting performance. Self-citations, if present in the full manuscript, are not load-bearing for the core claims, which are validated by direct cluster measurements rather than uniqueness theorems or ansatzes imported from prior author work. This is a standard engineering paper whose contributions are the implementation and measured outcomes, not a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5541 in / 1135 out tokens · 38546 ms · 2026-05-10T17:55:15.170718+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 44 canonical work pages

  1. [1]

    ISBN 9798400705052

    G. Zhang, Y . Hou, H. Lu, Y . Chen, W. X. Zhao, and J. Wen, “Scaling law of large sequential recommendation models,” inProceedings of the 18th ACM Conference on Recommender Systems, RecSys 2024, Bari, Italy, October 14-18, 2024. ACM, 2024, pp. 444–453. [Online]. Available: https://doi.org/10.1145/3640457.3688129

  2. [2]

    MTGR: industrial-scale generative recommendation framework in meituan,

    R. Han, B. Yin, S. Chen, H. Jiang, F. Jiang, X. Li, C. Ma, M. Huang, X. Li, C. Jing, Y . Han, M. Zhou, L. Yu, C. Liu, and W. Lin, “MTGR: industrial-scale generative recommendation framework in meituan,” in Proceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM 2025, Seoul, Republic of Korea, November 10-14, 2025...

  3. [3]

    MTGR: industrial-scale generative recommendation framework in meituan,

    S. Xu, S. Wang, D. Guo, X. Guo, Q. Xiao, B. Huang, G. Wu, and C. Luo, “Climber: Toward efficient scaling laws for large recommendation models,” inProceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM 2025, Seoul, Republic of Korea, November 10-14, 2025. ACM, 2025, pp. 6193–6200. [Online]. Available: https://doi...

  4. [4]

    Bending the scaling law curve in large-scale recommendation systems,

    Q. Ding, K. Course, L. Ma, J. Sun, R. Liu, Z. Zhu, C. Yin, W. Li, D. Li, Y . Shi, X. Cao, Z. Yang, H. Li, X. Liu, B. Xue, H. Li, R. Jian, D. S. He, J. Qian, M. Ma, Q. Zhang, and R. Li, “Bending the scaling law curve in large-scale recommendation systems,”CoRR, vol. abs/2602.16986,

  5. [5]
  6. [6]

    The evolution of embedding table optimization and multi-epoch training in pinterest ads conversion,

    A. Qiu, S. Barhate, H. W. Lui, R. Su, R. R. M ¨uller, K. Li, L. Leng, H. Sun, S. Ehsani, and Z. Liu, “The evolution of embedding table optimization and multi-epoch training in pinterest ads conversion,”CoRR, vol. abs/2505.05605, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.05605

  7. [7]

    Adaembed: Adaptive embedding for large-scale recommendation models,

    F. Lai, W. Zhang, R. Liu, W. Tsai, X. Wei, Y . Hu, S. Devkota, J. Huang, J. Park, X. Liu, Z. Chen, E. Wen, P. Rivera, J. You, C. J. Chen, and M. Chowdhury, “Adaembed: Adaptive embedding for large-scale recommendation models,” in17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023, R. Geambasu a...

  8. [8]

    Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372, 2026

    X. Cheng, W. Zeng, D. Dai, Q. Chen, B. Wang, Z. Xie, K. Huang, X. Yu, Z. Hao, Y . Li, H. Zhang, H. Zhang, D. Zhao, and W. Liang, “Conditional memory via scalable lookup: A new axis of sparsity for large language models,”CoRR, vol. abs/2601.07372, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.07372

  9. [9]

    Two-dimensional sparse parallelism for large scale deep learning recommendation model training.arXiv preprint arXiv:2508.03854, 2025

    X. Zhang, Q. Zhu, L. Xu, Z. Huda, W. Zhou, J. Fang, D. van der Staay, Y . Hu, J. Nie, J. Yang, and C. Yang, “Two-dimensional sparse parallelism for large scale deep learning recommendation model training,”CoRR, vol. abs/2508.03854, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.03854

  10. [10]

    Research on model parallelism and data parallelism optimization methods in large language model-based recommendation systems,

    H. Yang, Y . Tian, Z. Yang, Z. Wang, C. Zhou, and D. Li, “Research on model parallelism and data parallelism optimization methods in large language model-based recommendation systems,”CoRR, vol. abs/2506.17551, 2025. [Online]. Available: https://doi.org/10.48550/ arXiv.2506.17551

  11. [11]

    InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023)

    D. H. Kurniawan, R. Wang, K. S. Zulkifli, F. A. Wiranata, J. Bent, Y . Vigfusson, and H. S. Gunawi, “Evstore: Storage and caching capabilities for scaling embedding tables in deep recommendation systems,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, ...

  12. [12]

    Distributed hierarchical GPU parameter server for massive scale deep learning ads systems,

    W. Zhao, D. Xie, R. Jia, Y . Qian, R. Ding, M. Sun, and P. Li, “Distributed hierarchical GPU parameter server for massive scale deep learning ads systems,” inProceedings of the Third Conference on Machine Learning and Systems, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020

  13. [13]

    Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters,

    X. Lian, B. Yuan, X. Zhu, Y . Wang, Y . He, H. Wu, L. Sun, H. Lyu, C. Liu, X. Dong, Y . Liao, M. Luo, C. Zhang, J. Xie, H. Li, L. Chen, R. Huang, J. Lin, C. Shu, X. Qiu, Z. Liu, D. Kong, L. Yuan, H. Yu, S. Yang, C. Zhang, and J. Liu, “Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters,” inKDD ’22: The 28t...

  14. [14]

    Efficient and scalable huge embedding model training via distributed cache management,

    X. Miao, H. Zhang, Y . Shi, X. Nie, Z. Yang, Y . Tao, J. Jiang, and B. Cui, “Efficient and scalable huge embedding model training via distributed cache management,”VLDB J., vol. 34, no. 3, p. 27, 2025. [Online]. Available: https://doi.org/10.1007/s00778-025-00908-w

  15. [15]

    GBA: A tuning-free approach to switch between synchronous and asynchronous training for recommen- dation models,

    W. Su, Y . Zhang, Y . Cai, K. Ren, P. Wang, H. Yi, Y . Song, J. Chen, H. Deng, J. Xu, L. Qu, and B. Zheng, “GBA: A tuning-free approach to switch between synchronous and asynchronous training for recommen- dation models,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, ...

  16. [16]

    InProceedings of the 27th ACM SIGKDD Confer- ence on Knowledge Discovery & Data Mining(Virtual Event, Singapore)(KDD ’21)

    Y . Huang, X. Wei, X. Wang, J. Yang, B. Su, S. Bharuka, D. Choudhary, Z. Jiang, H. Zheng, and J. Langman, “Hierarchical training: Scaling deep recommendation models on large CPU clusters,” inKDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021. ACM, 2021, pp. 3050–3058. [Online]. Avai...

  17. [17]

    CAFE+: towards compact, adaptive, and fast embedding for large-scale online recommendation models,

    Z. Liu, H. Zhang, B. Chen, Z. Jiang, Y . Zhao, Y . Tao, T. Yang, and B. Cui, “CAFE+: towards compact, adaptive, and fast embedding for large-scale online recommendation models,”ACM Trans. Inf. Syst., vol. 43, no. 3, pp. 61:1–61:42, 2025. [Online]. Available: https://doi.org/10.1145/3713072

  18. [18]

    Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

    H. Feng, B. Zhang, F. Ye, M. Si, C. Chu, J. Tian, C. Yin, S. Deng, Y . Hao, P. Balaji, T. Geng, and D. Tao, “Accelerating communication in deep learning recommendation model training with dual-level adaptive lossy compression,” inProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2024, Atlanta...

  19. [19]

    Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

    W. Wang, Y . Xia, D. Yang, X. Zhou, and D. Cheng, “Accelerating distributed DLRM training with optimized TT decomposition and micro-batching,” inProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2024, Atlanta, GA, USA, November 17-22, 2024. IEEE, 2024. [Online]. Available: https://doi.org/10....

  20. [20]

    Generative recommendation models: Progress and directions,

    Y . Hou, A. Zhang, L. Sheng, Z. Yang, X. Wang, T. Chua, and J. J. McAuley, “Generative recommendation models: Progress and directions,” inCompanion Proceedings of the ACM on Web Conference 2025, WWW 2025, Sydney, NSW, Australia, 28 April 2025 - 2 May 2025. ACM, 2025, pp. 13–16. [Online]. Available: https://doi.org/10.1145/3701716.3715856

  21. [21]

    Slimpipe: Memory-thrifty and efficient pipeline parallelism for long-context LLM training,

    Z. Li, Y . Liu, W. Zhang, T. Yuan, B. Chen, and C. Song, “Slimpipe: Memory-thrifty and efficient pipeline parallelism for long-context LLM training,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2025, St. Louis, MO, USA, November 16-21, 2025. ACM, 2025, pp. 1409–

  22. [22]

    Available: https://doi.org/10.1145/3712285.3759855

    [Online]. Available: https://doi.org/10.1145/3712285.3759855

  23. [23]

    Pre-train and search: Efficient embedding table sharding with pre-trained neural cost models,

    D. Zha, L. Feng, L. Luo, B. Bhushanam, Z. Liu, Y . Hu, J. Nie, Y . Huang, Y . Tian, A. Kejariwal, and X. Hu, “Pre-train and search: Efficient embedding table sharding with pre-trained neural cost models,” inProceedings of the Sixth Conference on Machine Learning and Systems, MLSys 2023, Miami, FL, USA, June 4-8, 2023. mlsys.org, 2023

  24. [24]

    Embedding optimization for training large-scale deep learning recommendation systems with embark,

    S. Liu, N. Zheng, H. Kang, X. Simmons, J. Zhang, M. Langer, W. Zhu, M. Lee, and Z. Wang, “Embedding optimization for training large-scale deep learning recommendation systems with embark,” inProceedings of the 18th ACM Conference on Recommender Systems, RecSys 2024, Bari, Italy, October 14-18, 2024. ACM, 2024, pp. 622–632. [Online]. Available: https://doi...

  25. [25]

    Autoshard: Automated embedding table sharding for recommender systems,

    D. Zha, L. Feng, B. Bhushanam, D. Choudhary, J. Nie, Y . Tian, J. Chae, Y . Ma, A. Kejariwal, and X. Hu, “Autoshard: Automated embedding table sharding for recommender systems,” inKDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022. ACM, 2022, pp. 4461–

  26. [26]

    Available: https://doi.org/10.1145/3534678.3539034

    [Online]. Available: https://doi.org/10.1145/3534678.3539034

  27. [27]

    OPER: optimality-guided embedding table parallelization for large-scale recommendation model,

    Z. Wang, Y . Wang, B. Feng, G. Huang, D. Mudigere, B. Muthiah, A. Li, and Y . Ding, “OPER: optimality-guided embedding table parallelization for large-scale recommendation model,” inProceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024. USENIX Association, 2024, pp. 667–682

  28. [28]

    A., Gao, L., Ivchenko, D., Basant, A., Hu, Y., Yang, J., Ardestani, E

    D. Mudigere, Y . Hao, J. Huang, Z. Jia, A. Tulloch, S. Sridharan, X. Liu, M. Ozdal, J. Nie, J. Park, L. Luo, J. A. Yang, L. Gao, D. Ivchenko, A. Basant, Y . Hu, J. Yang, E. K. Ardestani, X. Wang, R. Komuravelli, C. Chu, S. Yilmaz, H. Li, J. Qian, Z. Feng, Y . Ma, J. Yang, E. Wen, H. Li, L. Yang, C. Sun, W. Zhao, D. Melts, K. Dhulipala, K. R. Kishore, T. G...

  29. [29]

    Accelerating neural recommendation training with embedding schedul- ing,

    C. Zeng, X. Liao, X. Cheng, H. Tian, X. Wan, H. Wang, and K. Chen, “Accelerating neural recommendation training with embedding schedul- ing,” in21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024. USENIX Association, 2024

  30. [30]

    Mixed-precision embeddings for large-scale recommendation models,

    S. Li, Z. Hu, F. Lyu, X. Tang, H. Wang, S. Xu, W. Luo, Y . Li, X. Liu, X. He, and R. Li, “Mixed-precision embeddings for large-scale recommendation models,”CoRR, vol. abs/2409.20305, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.20305

  31. [31]

    Fusedrec: Fused embedding communication for distributed recommendation training on gpus,

    X. Huang, F. Li, R. Hu, J. Zhang, Y . Peng, Y . Zhou, F. Chen, and X. Zhang, “Fusedrec: Fused embedding communication for distributed recommendation training on gpus,” inFortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artific...

  32. [32]

    DQRM: deep quantized recommendation models,

    Y . Zhou, Z. Dong, E. Chan, D. Kalamkar, D. Marculescu, and K. Keutzer, “DQRM: deep quantized recommendation models,”CoRR, vol. abs/2410.20046, 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2410.20046

  33. [33]

    Neutrino Production via $e^-e^+$ Collision at $Z$-boson Peak

    Z. Wang, Y . Wang, B. Feng, D. Mudigere, B. Muthiah, and Y . Ding, “El-rec: Efficient large-scale recommendation model training via tensor-train embedding table,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, November 13-18, 2022. IEEE, 2022, pp. 70:1–70:14. [Online]. Available: https:/...

  34. [34]

    Disaggregated multi-tower: Topology-aware modeling technique for efficient large scale recommendation,

    L. Luo, B. Zhang, M. Tsang, Y . Ma, C. Chu, Y . Chen, S. Li, Y . Hao, Y . Zhao, G. Lakshminarayanan, E. Wen, J. Park, D. Mudigere, and M. Naumov, “Disaggregated multi-tower: Topology-aware modeling technique for efficient large scale recommendation,” inProceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, C...

  35. [35]

    Actions speak louder than words: Trillion- parameter sequential transducers for generative recommendations,

    J. Zhai, L. Liao, X. Liu, Y . Wang, R. Li, X. Cao, L. Gao, Z. Gong, F. Gu, J. He, Y . Lu, and Y . Shi, “Actions speak louder than words: Trillion- parameter sequential transducers for generative recommendations,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, ser. Proceedings of Machine Learning ...

  36. [36]

    Merlin hugectr: Gpu-accelerated recommender system training and inference,

    Z. Wang, Y . Wei, M. Lee, M. Langer, F. Yu, J. Liu, S. Liu, D. G. Abel, X. Guo, J. Dong, J. Shi, and K. Li, “Merlin hugectr: Gpu-accelerated recommender system training and inference,” inRecSys ’22: Sixteenth ACM Conference on Recommender Systems, Seattle, WA, USA, September 18 - 23, 2022. ACM, 2022, pp. 534–537. [Online]. Available: https://doi.org/10.11...

  37. [37]

    Recis: Sparse to dense, A unified training framework for recommendation models,

    H. Zong, Q. Zeng, Z. Zhou, Z. Han, Z. Yan, M. Liu, H. Sun, J. Liu, Y . Hu, Q. Wang, Y . Xian, W. Guo, H. Xiang, Z. Zeng, X. Sheng, B. Yan, N. Hu, Y . Huang, J. Lian, Z. Xu, Y . Zhang, J. Huang, S. Yang, H. Yi, J. Wang, P. Wang, Z. Han, J. Wu, D. Ou, J. Xu, H. Tang, Y . Jiang, B. Zheng, and L. Qu, “Recis: Sparse to dense, A unified training framework for r...

  38. [38]
  39. [39]

    Unified and near-optimal multi-gpu cache for embedding-based deep learning,

    X. Song, R. Chen, H. Song, Y . Zhang, and H. Chen, “Unified and near-optimal multi-gpu cache for embedding-based deep learning,” ACM Trans. Comput. Syst., vol. 44, no. 1, pp. 3:1–3:32, 2026. [Online]. Available: https://doi.org/10.1145/3767725

  40. [40]

    In2022 IEEE 38th International Conference on Data Engineering (ICDE)

    Y . Zhang, L. Chen, S. Yang, M. Yuan, H. Yi, J. Zhang, J. Wang, J. Dong, Y . Xu, Y . Song, Y . Li, D. Zhang, W. Lin, L. Qu, and B. Zheng, “PICASSO: unleashing the potential of gpu-centric training for wide-and-deep recommender systems,” in38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022. IEEE, 2022,...

  41. [41]

    Designing cloud servers for lower carbon,

    M. Adnan, Y . E. Maboud, D. Mahajan, and P. J. Nair, “Heterogeneous acceleration pipeline for recommendation system training,” in51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. IEEE, 2024, pp. 1063–1079. [Online]. Available: https://doi.org/10.1109/ ISCA59077.2024.00081

  42. [42]

    Training personalized recommendation systems from (GPU) scratch: look forward not backwards,

    Y . Kwon and M. Rhu, “Training personalized recommendation systems from (GPU) scratch: look forward not backwards,” inISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022. ACM, 2022, pp. 860–873. [Online]. Available: https://doi.org/10.1145/3470496.3527386

  43. [43]

    Hypereca: Distributed het- erogeneous in-memory embedding database for training recommender models,

    J. He, S. Chen, K. Huang, and J. Zhai, “Hypereca: Distributed het- erogeneous in-memory embedding database for training recommender models,” inProceedings of the 2025 USENIX Annual Technical Confer- ence, USENIX ATC 2025, Boston, MA, USA, July 7-9, 2025. USENIX Association, 2025, pp. 1071–1087

  44. [44]

    Adaptis: Reducing pipeline bubbles with adaptive pipeline parallelism on heterogeneous models,

    J. Guo, T. Ma, W. Gao, P. Sun, J. Li, X. Chen, Y . Jin, and D. Lin, “Adaptis: Reducing pipeline bubbles with adaptive pipeline parallelism on heterogeneous models,”CoRR, vol. abs/2509.23722, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.23722

  45. [45]

    Bagpipe: Accelerating deep recommendation model training,

    S. Agarwal, C. Yan, Z. Zhang, and S. Venkataraman, “Bagpipe: Accelerating deep recommendation model training,” inProceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023. ACM, 2023, pp. 348–363. [Online]. Available: https://doi.org/10.1145/3600006.3613142

  46. [46]

    Mitigating

    H. Jung, S. Shin, and N. Lee, “Mitigating staleness in asynchronous pipeline parallelism via basis rotation,”CoRR, vol. abs/2602.03515,

  47. [47]

    Mitigating

    [Online]. Available: https://doi.org/10.48550/arXiv.2602.03515

  48. [48]

    Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters,

    T. Agarwal and M. Becchi, “Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters,” inInternational Conference on Parallel Architectures and Compilation, PACT ’14, Edmonton, AB, Canada, August 24-27, 2014. ACM, 2014, pp. 505–506. [Online]. Available: https://doi.org/10.1145/2628071.2671423

  49. [49]

    DCMA: accelerating parallel DMA transfers with a multi-port direct cached memory access in a massive-parallel vector processor,

    G. B. Thieu, S. Gesper, and G. Pay ´a-Vay´a, “DCMA: accelerating parallel DMA transfers with a multi-port direct cached memory access in a massive-parallel vector processor,”ACM Trans. Archit. Code Optim., vol. 22, no. 2, pp. 72:1–72:25, 2025. [Online]. Available: https://doi.org/10.1145/3730582

  50. [50]

    Td-pipe: Temporally-disaggregated pipeline parallelism architecture for high- throughput LLM inference,

    H. Zhang, T. Wei, Z. Zheng, J. Du, Z. Chen, and Y . Lu, “Td-pipe: Temporally-disaggregated pipeline parallelism architecture for high- throughput LLM inference,” inProceedings of the 54th International Conference on Parallel Processing, ICPP 2025, San Diego, CA, USA, September 8-11, 2025. ACM, 2025, pp. 689–698. [Online]. Available: https://doi.org/10.114...

  51. [51]

    Revisiting parameter server in LLM post-training,

    X. Wan, P. Qi, G. Huang, C. Ruan, M. Lin, and J. Li, “Revisiting parameter server in LLM post-training,”CoRR, vol. abs/2601.19362,

  52. [52]

    Revisiting parameter server in LLM post-training,

    [Online]. Available: https://doi.org/10.48550/arXiv.2601.19362

  53. [53]

    COMET: fine-grained computation-communication overlapping for mixture-of- experts,

    S. Zhang, N. Zheng, H. Lin, Z. Jiang, W. Bao, C. Jiang, Q. Hou, W. Cui, S. Zheng, L. Chang, Q. Chen, and X. Liu, “COMET: fine-grained computation-communication overlapping for mixture-of- experts,” inProceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. OpenReview.net/mlsys.org, 2025

  54. [54]

    Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos,

    C. Gao, S. Li, Y . Zhang, J. Chen, B. Li, W. Lei, P. Jiang, and X. He, “Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos,” inProceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022. ACM, 2022, pp. 3953–3957. [Online]. Available: https://doi.org/10...

  55. [55]

    Fuxi-α: Scaling recommendation model with feature interaction enhanced transformer,

    Y . Ye, W. Guo, J. Y . Chin, H. Wang, H. Zhu, X. Lin, Y . Ye, Y . Liu, R. Tang, D. Lian, and E. Chen, “Fuxi-α: Scaling recommendation model with feature interaction enhanced transformer,” inCompanion Proceedings of the ACM on Web Conference 2025, WWW 2025, Sydney, NSW, Australia, 28 April 2025 - 2 May 2025. ACM, 2025, pp. 557–566. [Online]. Available: htt...

  56. [56]

    Torchrec: a pytorch domain library for recommendation systems,

    D. Ivchenko, D. V . D. Staay, C. Taylor, X. Liu, W. Feng, R. Kindi, A. Sudarshan, and S. Sefati, “Torchrec: a pytorch domain library for recommendation systems,” inRecSys ’22: Sixteenth ACM Conference on Recommender Systems, Seattle, WA, USA, September 18 - 23, 2022. ACM, 2022, pp. 482–483. [Online]. Available: https://doi.org/10.1145/3523227.3547387