Recognition: 2 theorem links
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3
The pith
Nested pipelining trains trillion-parameter recommendation models synchronously on 1,536 accelerators with up to 3.06x speedup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NestPipe shows that two hierarchical sparse-parallelism opportunities can be exploited together through nested pipelining. Dual-Buffer Pipelining (DBP) creates a staleness-free five-stage pipeline that hides embedding-lookup latency. Frozen-Window Pipelining (FWP) uses the embedding-freezing phenomenon to overlap All2All traffic with dense computation via stream scheduling and key-centric clustering. On production clusters of 1,536 workers the system reaches up to a 3.06x speedup and 94.07 percent scaling efficiency without accuracy loss.
What carries the argument
Nested pipelining that combines Dual-Buffer Pipelining for inter-batch lookup hiding with Frozen-Window Pipelining for intra-batch communication-computation overlap, while preserving exact synchronous semantics.
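The inter-batch half of this design can be sketched in a few lines. The following is a minimal, hypothetical illustration of dual-buffer prefetching (the names `lookup`, `train_step`, and `dual_buffer_train` are stand-ins, not the paper's API): while batch i trains on one buffer, the embedding lookup for batch i+1 fills the other, so lookup latency overlaps dense computation without any embedding becoming stale.

```python
from concurrent.futures import ThreadPoolExecutor

def lookup(batch):
    # stand-in for a distributed embedding-table lookup
    return [k * 0.1 for k in batch]

def train_step(embeddings):
    # stand-in for the dense forward/backward pass on gathered embeddings
    return sum(embeddings)

def dual_buffer_train(batches):
    """Overlap each batch's training with the next batch's lookup."""
    losses = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_buf = pool.submit(lookup, batches[0])   # fill the first buffer
        for i in range(len(batches)):
            cur_buf = next_buf.result()              # wait for this batch's lookup
            if i + 1 < len(batches):
                next_buf = pool.submit(lookup, batches[i + 1])  # prefetch next buffer
            losses.append(train_step(cur_buf))       # runs while prefetch proceeds
    return losses
```

Because each step consumes only fully materialized lookups, the schedule stays staleness-free; the paper's five-stage pipeline generalizes this two-stage idea.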
If this is right
- Training throughput for trillion-parameter models rises without forcing a choice between speed and consistency.
- Clusters of 1,500+ accelerators can maintain scaling efficiency above 90 percent for recommendation workloads.
- Both GPU and NPU hardware benefit from the same dual-level pipeline structure.
- Production systems no longer need to relax synchronization to hide data-movement costs.
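For the 90-percent claim above, scaling efficiency as commonly defined for weak scaling (an assumption; the paper's exact definition may differ) is throughput on N workers relative to N times single-worker throughput:

```python
def scaling_efficiency(throughput_n, throughput_1, n_workers):
    """Fraction of ideal linear scaling actually achieved."""
    return throughput_n / (n_workers * throughput_1)
```

Under this reading, the reported 94.07 percent on 1,536 workers means aggregate throughput is about 1,445x a single worker's, versus the 1,536x ideal.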
Where Pith is reading between the lines
- The same freezing insight could be applied to other sparse-data workloads such as graph neural networks or large language models with dynamic vocabularies.
- If freezing patterns prove architecture-dependent, an adaptive window size could further improve overlap on varied model designs.
- Reducing effective communication volume this way may change the economic trade-off between adding more accelerators versus investing in faster interconnects.
Load-bearing premise
The embedding freezing phenomenon stays stable and predictable enough in real workloads that overlapping communication never introduces hidden inconsistencies or accuracy loss.
What would settle it
Measure final model accuracy on a production workload whose embedding access patterns shift rapidly; any statistically significant deviation from a baseline synchronous run would falsify the claim.
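One hedged way to operationalize "statistically significant deviation" is a permutation test over per-seed final metrics (e.g. AUC) from baseline and NestPipe-style runs; this is an illustrative sketch, not a procedure from the paper, and uses only placeholder data:

```python
import random

def permutation_p_value(baseline, candidate, trials=10_000, seed=0):
    """p-value for the absolute difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(sum(candidate) / len(candidate) - sum(baseline) / len(baseline))
    pooled = baseline + candidate
    n = len(baseline)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / trials
```

A small p-value on a rapidly shifting workload would falsify the synchronous-equivalence claim; a large one is consistent with it.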
Original abstract
Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. NestPipe presents a decentralized embedding training framework for trillion-parameter recommendation models on clusters up to 1,536 accelerators. It uses nested pipelining: Dual-Buffer Pipelining (DBP) creates a five-stage staleness-free inter-batch pipeline via dual-buffer synchronization to mitigate lookup latency, while Frozen-Window Pipelining (FWP) exploits the embedding freezing phenomenon with key-centric sample clustering and coordinated stream scheduling to overlap All2All communication with dense computation at the intra-batch level. The paper reports up to 3.06× speedup and 94.07% scaling efficiency on production GPU/NPU clusters while claiming to preserve exact synchronous training semantics.
Significance. If the synchronous-semantics claim holds, the work would be significant for scaling recommendation training past current embedding-lookup and communication bottlenecks. The use of real production workloads and large clusters (1,536 workers) is a strength, as is the decision to preserve training consistency rather than trade it for throughput. However, the absence of accuracy validation limits the assessable impact.
major comments (2)
- [§5] §5 (Evaluation): The reported results focus exclusively on throughput (up to 3.06×) and scaling efficiency (94.07%) but provide no accuracy, AUC, loss-curve, or model-equivalence metrics versus a non-pipelined synchronous baseline. This is load-bearing for the central claim because FWP's 'staleness-free' guarantee and preservation of synchronous semantics rest on the unvalidated assumption that embedding freezing is stable and complete enough to avoid altering gradient flow or final model quality.
- [§3.2] §3.2 (FWP description): The mechanism by which key-centric sample clustering and coordinated stream scheduling ensure identical computation order and gradient updates (no hidden inconsistencies) is described at a high level but lacks a formal argument, invariant, or small-scale equivalence proof. Without this, the claim that FWP overlaps communication without introducing staleness cannot be assessed as sound.
minor comments (2)
- [Figure 3] Figure 3 (or equivalent pipeline diagram): The five-stage DBP pipeline and FWP window overlap would benefit from explicit timing annotations showing buffer synchronization points to improve clarity of the 'staleness-free' property.
- [§1] The abstract and §1 use 'parameter-free' or similar phrasing for certain ratios; verify that no hidden workload-dependent parameters are introduced in the FWP clustering heuristic.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the importance of preserving synchronous semantics at scale. We address each major concern below with proposed revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [§5] (Evaluation), recap of major comment 1: throughput-only results (up to 3.06× speedup, 94.07% scaling efficiency) with no accuracy, AUC, loss-curve, or model-equivalence comparison against a non-pipelined synchronous baseline, on which the 'staleness-free' claim rests.
Authors: We agree that direct empirical validation of model quality is essential to substantiate the synchronous-semantics claim. DBP is designed to be staleness-free by construction via dual-buffer synchronization, while FWP exploits the embedding-freezing phenomenon (where selected embeddings remain unchanged within a window) to ensure that overlapped All2All communication does not alter computation order, gradient flow, or updates. Nevertheless, we acknowledge the value of explicit metrics. In the revised manuscript we will add small-scale equivalence experiments on representative production workloads, reporting AUC, loss curves, and final model quality comparisons against a non-pipelined synchronous baseline to confirm that NestPipe produces identical results. revision: yes
-
Referee: [§3.2] (FWP description), recap of major comment 2: the claimed equivalence of key-centric clustering plus coordinated stream scheduling to the synchronous schedule lacks a formal argument, invariant, or small-scale equivalence proof.
Authors: We appreciate this observation. Key-centric sample clustering groups samples sharing the same embedding keys into contiguous windows, and coordinated stream scheduling overlaps All2All communication only for frozen embeddings whose values do not participate in the current dense computation or gradient update. This preserves the exact computation order and gradient flow of the original synchronous schedule. To address the request for rigor, the revised §3.2 will include an explicit invariant stating that the dataflow graph and gradient updates remain identical under freezing, together with a small-scale equivalence argument (including a toy-model proof sketch) demonstrating that no hidden inconsistencies arise. revision: yes
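The invariant the revised §3.2 promises can be stated as a runnable check. A minimal sketch under stated assumptions (`check_frozen_invariant` and the table layout are hypothetical, not from the paper): within a window, keys marked frozen must not receive gradient updates, so communicating their values early cannot change the synchronous result.

```python
def check_frozen_invariant(table_before, table_after, frozen_keys):
    """True iff every frozen key's embedding is bit-identical across the window."""
    return all(table_before[k] == table_after[k] for k in frozen_keys)
```

Running this check on small-scale traces before and after each frozen window would provide the requested empirical evidence that overlap introduces no hidden updates.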
Circularity Check
No circularity: claims rest on experimental speedups and scaling measurements, not self-referential derivations or fitted predictions
full rationale
The paper describes NestPipe as a systems framework using Dual-Buffer Pipelining (DBP) and Frozen-Window Pipelining (FWP) to address embedding lookup and All2All bottlenecks while claiming to preserve synchronous semantics. The load-bearing assertions are empirical: up to 3.06x speedup and 94.07% scaling efficiency on 1,536-worker clusters. No mathematical derivation chain, first-principles equations, or parameter-fitting steps appear in the provided text that reduce by construction to the inputs (e.g., no 'prediction' of speedup derived from a model whose parameters were fitted to the same speedup data). The embedding freezing phenomenon is presented as an observed property of production workloads that the FWP design exploits via key-centric clustering and stream scheduling; it is not defined circularly in terms of the resulting performance. Self-citations, if present in the full manuscript, are not load-bearing for the core claims, which are validated by direct cluster measurements rather than uniqueness theorems or ansatzes imported from prior author work. This is a standard engineering paper whose contributions are the implementation and measured outcomes, not a closed-form derivation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining... Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline... Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery theorems (tag: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
We also provide theoretical consistency analysis for NestPipe... Proposition 1 (Consistency of DBP)... Proposition 2 (Consistency of FWP)... Corollary 1 (Consistency of NestPipe)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.