MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
Pith reviewed 2026-05-22 21:09 UTC · model grok-4.3
The pith
MegaScale-Data disaggregates preprocessing into role-specific actors and applies multi-level auto-partitioning to scale dataloaders across multiple data sources for large foundation model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MegaScale-Data is a distributed data-loading architecture built on three mechanisms: disaggregated preprocessing via role-specific actors, which eliminates redundant data access across sources and parallelism dimensions; a centralized declarative data plane, which orchestrates multisource mixing; and multi-level auto-partitioning, which balances heterogeneous preprocessing costs. The authors report up to 4.5x higher end-to-end training throughput and up to 13.5x lower CPU memory usage.
What carries the argument
The multi-level auto-partitioning and scaling mechanism for source loaders, which estimates and balances preprocessing costs across heterogeneous data sources while preserving multisource scalability.
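To make the balancing idea concrete, here is a minimal sketch of cost-aware assignment of sources to loader workers. It uses a greedy longest-processing-time heuristic as a stand-in; it is not the paper's multi-level mechanism, and the function name `partition_by_cost`, the source names, and the per-source cost values are all hypothetical.

```python
import heapq


def partition_by_cost(source_costs, num_workers):
    """Greedy heuristic: place the costliest source first, always on the
    currently least-loaded worker. Returns (assignment, loads)."""
    # Min-heap of (current_load, worker_index); ties break on worker index.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Sort by descending cost, then by name so the result is deterministic.
    ordered = sorted(source_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))

    loads = {
        w: sum(source_costs[n] for n in names)
        for w, names in assignment.items()
    }
    return assignment, loads


# Hypothetical estimated preprocessing cost per sample, in arbitrary units.
costs = {"video": 8.0, "image": 4.0, "code": 2.0, "web_text": 1.0, "books": 1.0}
assignment, loads = partition_by_cost(costs, num_workers=2)
print(assignment)
print(loads)
```

With these made-up costs, the single expensive source ends up alone on one worker while the four cheaper sources share the other. The quality of any such split depends on how well the cost estimates match real preprocessing time, which is the premise the review flags as load-bearing.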
If this is right
- End-to-end training throughput rises by up to 4.5 times when data sources differ in preprocessing cost.
- CPU memory footprint of the dataloader drops by up to 13.5 times by removing replicated file-access state.
- Dynamic mixing policies such as curriculum learning or long-short context become practical without extra redundancy.
- Hybrid parallelism configurations avoid duplicated access and memory overhead across data sources.
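The memory claim rests on removing per-loader copies of file-access state. The arithmetic below is a rough sketch under stated assumptions: every number (loader count, source count, state size, loaders per source) is hypothetical, the helper names are invented for illustration, and the result is not meant to reproduce the reported 13.5x figure.

```python
def replicated_state_bytes(num_loaders, num_sources, state_bytes_per_source):
    """Every parallel loader keeps file-access state for every source."""
    return num_loaders * num_sources * state_bytes_per_source


def disaggregated_state_bytes(num_sources, state_bytes_per_source,
                              loaders_per_source=1):
    """Only the loaders dedicated to a source keep that source's state."""
    return num_sources * loaders_per_source * state_bytes_per_source


MIB = 2**20
# Hypothetical setup: 64 parallel loaders, 100 sources, 50 MiB state each.
replicated = replicated_state_bytes(64, 100, 50 * MIB)
shared = disaggregated_state_bytes(100, 50 * MIB, loaders_per_source=2)
reduction = replicated / shared

print(f"replicated:    {replicated / 2**30:.1f} GiB")
print(f"disaggregated: {shared / 2**30:.1f} GiB")
print(f"reduction:     {reduction:.1f}x")
```

In this toy model the saving is simply the ratio of parallel loaders to loaders per source, so it grows with the degree of parallelism. A real deployment adds buffers and actor overhead that the model ignores.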
Where Pith is reading between the lines
- The same disaggregation pattern could reduce state duplication in other training components that currently replicate per-rank metadata.
- Production clusters might adopt the design to support more frequent data-source changes without re-tuning partitions manually.
- The approach invites direct measurement of how estimation error in cost prediction scales with the number of distinct sources.
Load-bearing premise
The auto-partitioning can accurately predict and equalize preprocessing costs across sources without adding coordination overhead that cancels the gains.
What would settle it
Measure achieved throughput and memory on a workload whose sources have deliberately mismatched preprocessing times; if the speedup falls below the claimed factor while overhead rises, the balancing claim is falsified.
Original abstract
Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFMs training, with three key innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also contribute our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to: (1) 4.5x end-to-end training throughput improvement, and (2) 13.5x reduction in CPU memory usage.
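The first challenge in the abstract follows from attention cost growing with the square of sequence length. The worked example below is a simplified sketch: it assumes cost is proportional to the sum of squared sequence lengths and ignores all linear terms, and the sequence lengths are hypothetical.

```python
def attention_cost(seq_lengths):
    """Relative attention cost, modeled as the sum of squared lengths."""
    return sum(n * n for n in seq_lengths)


# Two data-parallel ranks that receive the same number of tokens.
rank_a = [8192]        # one long sequence
rank_b = [1024] * 8    # eight short sequences

tokens_a, tokens_b = sum(rank_a), sum(rank_b)
cost_a, cost_b = attention_cost(rank_a), attention_cost(rank_b)
ratio = cost_a / cost_b

print(tokens_a, tokens_b)  # equal token counts
print(ratio)               # relative attention cost of rank A vs rank B
```

Both ranks hold 8192 tokens, yet under this model rank A does eight times the attention work of rank B. Balancing by token count alone therefore leaves ranks unevenly loaded when sources differ in sequence length.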
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MegaScale-Data, a distributed data loading architecture for multisource large foundation model training. It addresses two challenges: workload imbalance among dataloaders caused by non-uniform sample distribution, and excessive memory consumption from replicated per-dataset file access states. The key innovations are disaggregated data preprocessing using role-specific actors, a centralized declarative data plane for multisource orchestration, and a multi-level auto-partitioning and scaling mechanism for heterogeneous preprocessing costs. The system is reported to achieve up to 4.5x end-to-end training throughput improvement and 13.5x reduction in CPU memory usage, with additional contributions on deployment and fault tolerance.
Significance. If the results hold, the paper makes a significant contribution to the field of distributed systems for machine learning by providing practical solutions to scalability issues in dataloaders for multisource data. The performance improvements could lead to more efficient training of large models, and the industrial experience adds value. The work is grounded in real deployment challenges and offers mechanisms that could be adopted in production environments.
Major comments (2)
- [Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.
- [Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which is necessary to substantiate the 13.5x memory reduction and throughput gains.
Minor comments (1)
- The abstract could more clearly distinguish between the contributions of each of the three innovations to the reported performance numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to incorporate additional evidence and clarifications as outlined.
Point-by-point responses
- Referee: [Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.
  Authors: We agree that the abstract's claim would be strengthened by explicit validation of the cost model. The current manuscript describes the multi-level auto-partitioning mechanism but does not include direct comparisons of estimated versus actual preprocessing times or dedicated ablations on partitioning quality in the presented results. In the revision, we will add these elements (specifically, cost model validation data and an ablation on partitioning impact) to the experimental evaluation section, with a brief reference added to the abstract to support the 4.5x throughput claim.
  Revision: yes
- Referee: [Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which is necessary to substantiate the 13.5x memory reduction and throughput gains.
  Authors: We concur that the abstract would benefit from explicit linkages to the methodology to better substantiate the reported gains. The full manuscript contains the experimental methodology, but the abstract does not reference baseline configurations, specific multisource workload definitions, or statistical measures such as error bars. We will revise the abstract to include concise references to these aspects (e.g., baseline setups, workload mixes, and error bar details from repeated runs) while directing readers to the relevant evaluation sections for full details.
  Revision: yes
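The cost-model validation requested by the referee could take a shape like the following minimal sketch, which compares estimated and measured preprocessing times per source. The helper name `mean_abs_pct_error`, the source names, and all timing values are hypothetical; none of them come from the paper.

```python
def mean_abs_pct_error(estimated, measured):
    """Mean absolute percentage error of estimates against measurements."""
    if estimated.keys() != measured.keys():
        raise ValueError("estimated and measured must cover the same sources")
    errors = [
        abs(estimated[name] - measured[name]) / measured[name]
        for name in measured
    ]
    return sum(errors) / len(errors)


# Hypothetical per-sample preprocessing times in milliseconds.
estimated = {"text": 1.0, "image": 4.0, "video": 9.0}
measured = {"text": 1.25, "image": 4.0, "video": 8.0}

mape = mean_abs_pct_error(estimated, measured)
print(f"mean absolute percentage error: {mape:.1%}")
```

Reporting an error figure of this kind per source, alongside an ablation that deliberately degrades the estimates, would show how sensitive the throughput gain is to estimation quality.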
Circularity Check
Empirical systems paper with measured performance claims; no circular derivation steps
Full rationale
This paper presents a distributed dataloader system with three architectural innovations and reports end-to-end measured improvements (4.5x throughput, 13.5x memory reduction). No equations, fitted parameters, or self-referential definitions appear in the provided text. The multi-level auto-partitioning is presented as an implemented mechanism whose effectiveness is asserted via deployment results rather than derived by construction from its own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The derivation chain is self-contained against external benchmarks (real training workloads), qualifying for the default non-circularity outcome.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Preprocessing costs across data sources are heterogeneous and can be estimated and partitioned at multiple levels without prohibitive coordination cost.
Forward citations
Cited by 3 Pith papers
- BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
  BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.
- BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
  Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.
- MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
  MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.