arxiv: 2504.09844 · v4 · submitted 2025-04-14 · 💻 cs.DC · cs.AI

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Pith reviewed 2026-05-22 21:09 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords distributed dataloader · multisource training · large foundation models · data parallelism · preprocessing scaling · memory optimization · workload balancing

The pith

MegaScale-Data disaggregates preprocessing into role-specific actors and applies multi-level auto-partitioning to scale dataloaders across multiple data sources for large foundation model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard data-parallel loaders create workload imbalance because attention's quadratic cost varies with sample distribution, and they duplicate file-access state for each source across every rank. MegaScale-Data separates preprocessing into Source Loaders and Data Constructors, routes orchestration through a central declarative plane, and uses multi-level auto-partitioning to match heterogeneous costs. The design removes redundant memory copies and supports dynamic mixing such as curriculum learning. If the mechanism works, training runs finish with substantially less idle time and far lower per-rank memory footprint when data comes from many sources.
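To make the imbalance concrete, here is a minimal sketch, assuming each sample's attention cost scales as the square of its sequence length and ignoring all linear terms. The function names, the round-robin assignment, and the sample lengths are illustrative and not taken from the paper.

```python
def rank_costs(seq_lens, num_ranks):
    """Assign samples round-robin to ranks; cost per sample is assumed ~ length**2."""
    costs = [0] * num_ranks
    for i, n in enumerate(seq_lens):
        costs[i % num_ranks] += n * n
    return costs


def imbalance(costs):
    """Ratio of the slowest rank to the mean rank; 1.0 means perfectly balanced."""
    return max(costs) / (sum(costs) / len(costs))


# Hypothetical batch: mostly short samples plus one long-context sample.
seq_lens = [512, 512, 512, 8192, 512, 512, 512, 512]
costs = rank_costs(seq_lens, num_ranks=4)
ratio = imbalance(costs)
```

With these made-up lengths a single long sample puts nearly all of the work on one rank, so the ratio comes out close to the rank count and the other three ranks would wait for most of each step.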

Core claim

MegaScale-Data is a distributed data-loading architecture that uses disaggregated preprocessing via role-specific actors to eliminate source and parallelism redundancy, a centralized declarative data plane to orchestrate multisource mixing, and a multi-level auto-partitioning mechanism to balance heterogeneous preprocessing costs; the resulting system reports up to 4.5x end-to-end training throughput improvement and 13.5x reduction in CPU memory usage.
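One way to picture a declarative mixing plan is as data that a central component interprets at load time. The sketch below is only an assumption about what such a plan could look like: the `PLAN` layout, the source names, and the linear curriculum schedule are hypothetical and are not the paper's interface.

```python
import random

# Hypothetical declarative plan: each source declares start and end mixing weights.
PLAN = {
    "short_text": {"start": 0.8, "end": 0.3},
    "long_context": {"start": 0.1, "end": 0.5},
    "image_text": {"start": 0.1, "end": 0.2},
}


def weights_at(plan, step, total_steps):
    """Linearly interpolate each source's weight, then normalize to sum to 1."""
    t = min(max(step / total_steps, 0.0), 1.0)
    raw = {name: w["start"] + t * (w["end"] - w["start"]) for name, w in plan.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}


def pick_source(plan, step, total_steps, rng):
    """Sample one source name according to the weights in force at this step."""
    weights = weights_at(plan, step, total_steps)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
early = weights_at(PLAN, step=0, total_steps=1000)
late = weights_at(PLAN, step=1000, total_steps=1000)
sample = pick_source(PLAN, step=500, total_steps=1000, rng=rng)
```

Because the schedule lives in one place, changing the curriculum means editing the plan rather than touching per-rank loader state, which is the property the core claim attributes to a centralized data plane.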

What carries the argument

The multi-level auto-partitioning and scaling mechanism for source loaders, which estimates and balances preprocessing costs across heterogeneous data sources while preserving multisource scalability.
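As a rough illustration of cost-aware scaling, the sketch below splits a fixed pool of loader workers across sources in proportion to an estimated per-step demand. This is plain proportional allocation with largest-remainder rounding at a single level, swapped in for illustration; it is not the paper's multi-level mechanism, and the demand figures are hypothetical.

```python
def allocate_workers(demand, total_workers):
    """Split total_workers across sources in proportion to demand.

    demand maps source name -> estimated CPU seconds needed per training step.
    Every source gets at least one worker; leftovers go to the largest remainders.
    """
    if total_workers < len(demand):
        raise ValueError("need at least one worker per source")
    spare = total_workers - len(demand)
    total = sum(demand.values())
    exact = {s: spare * d / total for s, d in demand.items()}
    alloc = {s: 1 + int(exact[s]) for s in demand}
    leftover = total_workers - sum(alloc.values())
    by_remainder = sorted(demand, key=lambda s: exact[s] - int(exact[s]), reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


# Hypothetical per-step demand: video decoding dominates, plain text is cheap.
demand = {"text": 1.0, "image": 3.0, "video": 12.0}
alloc = allocate_workers(demand, total_workers=16)
```

The quality of such an allocation depends entirely on how good the demand estimates are, which is why the cost estimate is the part of the argument that carries the weight.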

If this is right

  • End-to-end training throughput rises by up to 4.5 times when data sources differ in preprocessing cost.
  • CPU memory footprint of the dataloader drops by up to 13.5 times by removing replicated file-access state.
  • Dynamic mixing policies such as curriculum learning or long-short context become practical without extra redundancy.
  • Hybrid parallelism configurations avoid duplicated access and memory overhead across data sources.
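The memory bullet follows from simple counting. The sketch below compares the two layouts under the assumption that file-access state is one fixed-size handle per source per holder; every number is hypothetical and chosen only to show how the totals scale.

```python
def replicated_state_bytes(num_sources, num_ranks, workers_per_rank, bytes_per_handle):
    """Every loader worker on every rank keeps its own handle for every source."""
    return num_sources * num_ranks * workers_per_rank * bytes_per_handle


def disaggregated_state_bytes(loaders_per_source, bytes_per_handle):
    """Only the dedicated loader actors for a source hold that source's handle."""
    return sum(loaders_per_source.values()) * bytes_per_handle


# All numbers below are hypothetical and chosen only to show the scaling.
replicated = replicated_state_bytes(
    num_sources=100, num_ranks=64, workers_per_rank=4, bytes_per_handle=50 * 2**20
)
loaders = {f"source_{i}": 2 for i in range(100)}
disaggregated = disaggregated_state_bytes(loaders, bytes_per_handle=50 * 2**20)
reduction = replicated / disaggregated
```

This toy ratio is not the paper's 13.5x figure. A measured reduction also includes memory that disaggregation does not remove, such as sample buffers, so it should be expected to be smaller than a pure handle count suggests.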

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disaggregation pattern could reduce state duplication in other training components that currently replicate per-rank metadata.
  • Production clusters might adopt the design to support more frequent data-source changes without re-tuning partitions manually.
  • The approach invites direct measurement of how estimation error in cost prediction scales with the number of distinct sources.

Load-bearing premise

The auto-partitioning can accurately predict and equalize preprocessing costs across sources without adding coordination overhead that cancels the gains.

What would settle it

Measure achieved throughput and memory on a workload whose sources have deliberately mismatched preprocessing times; if the speedup falls below the claimed factor while overhead rises, the balancing claim is falsified.
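Before running such an experiment, a back-of-envelope model can indicate what speedup to expect. The sketch below assumes the mixed stream runs at the pace of its slowest source and that workers scale linearly; the costs, shares, and worker counts are hypothetical, and the model deliberately omits coordination overhead, which a real measurement must capture.

```python
def pipeline_throughput(cost_per_sample, share, workers):
    """Samples per second for a mixed batch stream.

    cost_per_sample: source -> CPU seconds to preprocess one sample.
    share:           source -> fraction of each batch drawn from that source.
    workers:         source -> number of loader workers assigned to it.
    The stream runs at the pace of the slowest source relative to its share.
    """
    return min(workers[s] / (cost_per_sample[s] * share[s]) for s in cost_per_sample)


# Hypothetical sources with deliberately mismatched preprocessing times.
cost = {"text": 0.001, "image": 0.02, "video": 0.5}
share = {"text": 0.6, "image": 0.3, "video": 0.1}

# Both layouts use 24 workers in total.
uniform = {"text": 8, "image": 8, "video": 8}
balanced = {"text": 1, "image": 3, "video": 20}

speedup = pipeline_throughput(cost, share, balanced) / pipeline_throughput(cost, share, uniform)
```

If the measured speedup on a real cluster falls well short of what a model like this predicts, the gap is a first estimate of the overhead the load-bearing premise worries about.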

Original abstract

Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFMs training, with three key innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also contribute our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to: (1) 4.5x end-to-end training throughput improvement, and (2) 13.5x reduction in CPU memory usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce MegaScale-Data, a distributed data loading architecture for multisource large foundation model training. It addresses two challenges: workload imbalance among dataloaders due to non-uniform sample distribution and excessive memory consumption from replicated per-dataset file access states. The key innovations are disaggregated data preprocessing using role-specific actors, a centralized declarative data plane for multisource orchestration, and a multi-level auto-partitioning and scaling mechanism for heterogeneous preprocessing costs. The system is reported to achieve up to 4.5x end-to-end training throughput improvement and 13.5x reduction in CPU memory usage, with additional contributions on deployment and fault tolerance.

Significance. If the results hold, the paper makes a significant contribution to the field of distributed systems for machine learning by providing practical solutions to scalability issues in dataloaders for multisource data. The performance improvements could lead to more efficient training of large models, and the industrial experience adds value. The work is grounded in real deployment challenges and offers mechanisms that could be adopted in production environments.

major comments (2)
  1. [Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.
  2. [Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which is necessary to substantiate the 13.5x memory reduction and throughput gains.
minor comments (1)
  1. The abstract could more clearly distinguish between the contributions of each of the three innovations to the reported performance numbers.
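The cost model validation requested in the first major comment could start from a simple error metric over per-source estimates. The sketch below uses mean absolute percentage error as one reasonable choice; the metric, the function name, and the timing values are assumptions for illustration and do not come from the paper.

```python
def mean_abs_pct_error(estimated, actual):
    """Mean absolute percentage error between estimated and measured times."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(e - a) / a for e, a in zip(estimated, actual)) / len(actual)


# Hypothetical per-source preprocessing times in milliseconds per sample.
estimated_ms = [1.0, 20.0, 450.0]
actual_ms = [1.25, 16.0, 500.0]
mape = mean_abs_pct_error(estimated_ms, actual_ms)
```

Reporting this error alongside throughput for several partitioning qualities would show how sensitive the headline speedup is to estimation mistakes.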

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to incorporate additional evidence and clarifications as outlined.

Point-by-point responses
  1. Referee: [Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.

    Authors: We agree that the abstract's claim would be strengthened by explicit validation of the cost model. The current manuscript describes the multi-level auto-partitioning mechanism but does not include direct comparisons of estimated versus actual preprocessing times or dedicated ablations on partitioning quality in the presented results. In the revision, we will add these elements (specifically, cost model validation data and an ablation on partitioning impact) to the experimental evaluation section, with a brief reference added to the abstract to support the 4.5x throughput claim.
    Revision: yes

  2. Referee: [Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which is necessary to substantiate the 13.5x memory reduction and throughput gains.

    Authors: We concur that the abstract would benefit from explicit links to the methodology to better substantiate the reported gains. The full manuscript contains the experimental methodology, but the abstract does not reference baseline configurations, specific multisource workload definitions, or statistical measures such as error bars. We will revise the abstract to include concise references to these aspects (e.g., baseline setups, workload mixes, and error bar details from repeated runs) while directing readers to the relevant evaluation sections for full details.
    Revision: yes

Circularity Check

0 steps flagged

Empirical systems paper with measured performance claims; no circular derivation steps

Full rationale

This paper presents a distributed dataloader system with three architectural innovations and reports end-to-end measured improvements (4.5x throughput, 13.5x memory reduction). No equations, fitted parameters, or self-referential definitions appear in the provided text. The multi-level auto-partitioning is presented as an implemented mechanism whose effectiveness is asserted via deployment results rather than derived by construction from its own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The derivation chain is self-contained against external benchmarks (real training workloads), qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard distributed-systems assumptions about data parallelism and the ability to measure preprocessing costs at runtime; no new physical constants or fitted global parameters are introduced.

axioms (1)
  • Domain assumption: Preprocessing costs across data sources are heterogeneous and can be estimated and partitioned at multiple levels without prohibitive coordination cost.
    Invoked to justify the multi-level auto-partitioning mechanism for source loaders.

pith-pipeline@v0.9.0 · 5854 in / 1253 out tokens · 33978 ms · 2026-05-22T21:09:23.757012+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training

    cs.DC 2026-05 unverdicted novelty 7.0

    BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.

  2. BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training

    cs.DC 2026-05 unverdicted novelty 6.0

    Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.

  3. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 2 Pith papers · 10 internal anchors

  1. [1]

    Amazon S3 (simple storage service), 2025

    Amazon Web Services. Amazon S3 (simple storage service), 2025. URLhttps://docs.aws.amazon.com/zh_cn/ emr/latest/ReleaseGuide/emr-hbase-s3.html. Accessed: 2025-03-22

  2. [2]

    Hadoop distributed file system (hdfs), 2025

    Apache Software Foundation. Hadoop distributed file system (hdfs), 2025. URLhttps://docs.aws.amazon.com/ zh_cn/emr/latest/ReleaseGuide/emr-encryption-tdehdfs.html. Accessed: 2025-03-22

  3. [3]

    Apache parquet documentation: File format configurations, 2025

    Apache Software Foundation. Apache parquet documentation: File format configurations, 2025. URLhttps: //parquet.apache.org/docs/file-format/configurations/. Accessed: 2025-03-22

  4. [4]

    Key-frame extraction techniques: A review

    Milan K Asha Paul, Jeyaraman Kavitha, and P Arockia Jansi Rani. Key-frame extraction techniques: A review. Recent Patents on Computer Science, 11(1):3–16, 2018

  5. [5]

    Thekkath

    Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiří Šimša, and Chandramohan A. Thekkath. tf.data service: A case for disaggregating ml input data processing. InProceedings of the 2023 ACM Symposium on Cloud Computing, SoCC ’23, page 358–375, New York, NY, USA, 2023. Association for Computing Machinery

  6. [6]

    Pathways: Asynchronous distributed dataflow for ml

    Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent Shafey, Chandu Thekkath, and Yonghui Wu. Pathways: Asynchronous distributed dataflow for ml. In D. Marculescu, Y. Chi, and C. Wu, editors,Proceedings of Machine Le...

  7. [7]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954

  8. [8]

    Boettcher and S

    S. Boettcher and S. Mertens. Analysis of the karmarkar-karp differencing algorithm.The European Physical Journal B, 65(1):131–140, August 2008. ISSN 1434-6036. doi: 10.1140/epjb/e2008-00320-9. URLhttp://dx.doi. org/10.1140/epjb/e2008-00320-9

  9. [9]

    Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

  10. [10]

    Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré

    Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models, 2023. URLhttps: //arxiv.org/abs/2307.14430

  11. [11]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation, 2023. URLhttps://arxiv.org/abs/2306.15595

  12. [12]

    Evans and contributors

    Clark C. Evans and contributors. Pillow library. https://pillow.readthedocs.io/en/stable/, 2024. Python Imaging Library (PIL) Fork

  13. [13]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvancesin Neural Information Processing Systems (NeurIPS), Red Hook, NY, USA, 2022. Curran Associates Inc

  14. [14]

    Large scale distributed deep networks

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. Large scale distributed deep networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors,Advancesin Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012

  15. [15]

    Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Grit- senko, Mario Lucic, and Neil Houlsby. Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution. In A. Oh, T. Naumann...

  16. [16]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 21

  17. [17]

    Mycroft: Tracing dependencies in collective communication towards reliable llm training

    Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, et al. Mycroft: Tracing dependencies in collective communication towards reliable llm training. arXiv preprint arXiv:2509.03018, 2025

  18. [18]

    Evolution of aegis: Fault diagnosis for AI model training service in production

    Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao W...

  19. [19]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

  20. [20]

    The llama 3 herd of models, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models, 2024

  21. [21]

    Check-N-Run: a checkpointing system for training deep learning recommendation models

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. Check-N-Run: a checkpointing system for training deep learning recommendation models. In19th USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 22), pages 929–943, Renton, WA, April ...

  22. [22]

    Pytorchvideo: A deep learning library for video understanding

    Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, and Christoph Feichtenhofer. Pytorchvideo: A deep learning library for video understanding. InProceedings of the 29th ACM International Conference on Multi...

  23. [23]

    ISBN 9781450386517

    Association for Computing Machinery. ISBN 9781450386517. doi: 10.1145/3474085.3478329. URL https://doi.org/10.1145/3474085.3478329

  24. [24]

    Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation, 2024

    Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, and Minlan Yu. Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation, 2024. URLhttps://arxiv.org/abs/2408.03505

  25. [25]

    Common crawl.https://commoncrawl.org, 2014

    Common Crawl Foundation. Common crawl.https://commoncrawl.org, 2014

  26. [26]

    A comparison on scalability for batch big data processing on apache spark and apache flink.Big Data Analytics, 2:1–11, 2017

    Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, and Francisco Herrera. A comparison on scalability for batch big data processing on apache spark and apache flink.Big Data Analytics, 2:1–11, 2017

  27. [27]

    Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025

    Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025. URL https://arxiv.org/abs/2502.21231

  28. [28]

    G. Graefe. Volcano: An extensible and parallel query evaluation system.IEEE Trans.on Knowl. and Data Eng., 6(1):120–135, February 1994. doi: 10.1109/69.273032

  29. [29]

    Thekkath, and Ana Klimovic

    Dan Graur, Damien Aymon, Dan Kluser, Tanguy Albrici, Chandramohan A. Thekkath, and Ana Klimovic. Cachew: Machine learning input data processing as a service. In2022 USENIX Annual Technical Conference (USENIX ATC22), pages 689–706, Carlsbad, CA, July 2022. USENIX Association

  30. [30]

    Thekkath, and Ana Klimovic

    Dan Graur, Oto Mraz, Muyu Li, Sepehr Pourghannad, Chandramohan A. Thekkath, and Ana Klimovic. Pecan: Cost-Efficient ML data preprocessing with automatic transformation ordering and hybrid placement. In2024 USENIX Annual TechnicalConference (USENIX ATC24), pages 649–665, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-41-0. URLhttps://www...

  31. [31]

    Characterization of large language model development in the datacenter, 2024

    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. Characterization of large language model development in the datacenter, 2024. URLhttps://arxiv.org/abs/2403.07648

  32. [32]

    Characterization of large language model development in the datacenter

    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. Characterization of large language model development in the datacenter. In21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 709–729, Santa Clara, CA, April 2024. U...

  33. [33]

    Distmm: accelerating distributed mul- timodal model training

    Jun Huang, Zhen Zhang, Shuai Zheng, Feng Qin, and Yida Wang. Distmm: accelerating distributed mul- timodal model training. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI’24, USA, 2024. USENIX Association

  34. [34]

    Le, Yonghui Wu, and Zhifeng Chen.GPipe: efficienttraining of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen.GPipe: efficienttraining of giant neural networks using pipeline parallelism. Curran Associates Inc., Red Hook, NY, USA, 2019

  35. [35]

    System optimizations for enabling training of extreme long sequence transformer models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. InProceedings of the 43rd ACMSymposium on Principles of Distributed Computing, PODC ’24, page 121–130, New York, NY, USA, 202...

  36. [36]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

  37. [37]

    Zico Kolter

    Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, and J. Zico Kolter. Adaptive data optimization: Dynamic sample selection with scaling laws, 2024. URLhttps://arxiv.org/abs/2410.11820

  38. [38]

    MegaScale: Scaling large language model training to more than 10,000 GPUs

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  39. [39]

    In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, USA,

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al.{MegaScale}: Scaling large language model training to more than 10,000{GPUs}. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, USA,

  40. [40]

    Kuaishou

    Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. URL https: //arxiv.org/abs/2205.05198

  41. [41]

    Kosec, S

    Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance, 2022. URLhttps: //arxiv.org/abs/2107.02027

  42. [42]

    Sidecar containers, 2024

    Kubernetes. Sidecar containers, 2024. URL https://kubernetes.io/docs/concepts/workloads/pods/ sidecar-containers/. Kubernetes Documentation v1.29

  43. [43]

    The stability-efficiency dilemma: Investigating sequence length warmup for training gpt models, 2022

    Conglong Li, Minjia Zhang, and Yuxiong He. The stability-efficiency dilemma: Investigating sequence length warmup for training gpt models, 2022. URLhttps://arxiv.org/abs/2108.06084

  44. [44]

    Pytorch distributed: experiences on accelerating data parallel training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, August 2020. ISSN 2150-8097

  45. [45]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Experiences on accelerating data parallel training, 2020. URLhttps://arxiv.org/abs/2006.15704

  46. [46]

    Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024

    Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024. URLhttps://arxiv.org/abs/2401.02669

  47. [47]

    Ring attention with blockwise transformers for near-infinite context, 2023

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. 23

  48. [48]

    The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025

    Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, and Stephanie Wang. The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025. URLhttps://arxiv.org/abs/2501. 12407

  49. [49]

    The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025

    Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-06

  50. [50]

    CheckFreq: Frequent, Fine-Grained DNN checkpointing

    Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. CheckFreq: Frequent, Fine-Grained DNN checkpointing. In19th USENIX Conference on File and Storage Technologies(FAST21), pages 203–216. USENIX Association, February 2021. ISBN 978-1-939133-20-5. URL https://www.usenix.org/conference/fast21/ presentation/mohan

  51. [51]

    Jordan, and Ion Stoica

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 561–577, USA, 2018. US...

  52. [52]

    Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk

    Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. tf.data: A machine learning data processing framework, 2021. URLhttps://arxiv.org/abs/2101.12127

  53. [53]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, New York, NY, USA,

  54. [54]

    Association for Computing Machinery

  55. [55]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the International Conference for High Performance Computing, Networking, St...

  56. [56]

    torch.utils.data — PyTorch 2.4 documentation, 2024

    PyTorch contributors. torch.utils.data — PyTorch 2.4 documentation, 2024. URLhttps://pytorch.org/docs/ stable/data.html. Accessed: [Insert access date]

  57. [57]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery

  58. [58]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

  59. [59]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  60. [60]

    Horovod: fast and easy distributed deep learning in TensorFlow

    Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow, 2018. URL https://arxiv.org/abs/1802.05799

  61. [61]

    Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

  62. [62]

    Kafka: The modern platform for data management and analysis in big data domain

    Rishika Shree, Tanupriya Choudhury, Subhash Chand Gupta, and Praveen Kumar. Kafka: The modern platform for data management and analysis in big data domain. In 2017 2nd International Conference on Telecommunication and Networks (TEL-NET), pages 1–5, 2017. doi: 10.1109/TEL-NET.2017.8343593

  63. [63]

    Curriculum learning: A survey, 2022

    Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey, 2022. URL https://arxiv.org/abs/2101.10382

  64. [64]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805

  65. [65]

    Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline

    Taegeon Um, Byungsoo Oh, Byeongchan Seo, Minhyeok Kweun, Goeun Kim, and Woo-Yeon Lee. Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline. Proc. VLDB Endow., 16(5):1086–1099, January 2023

  66. [66]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

  67. [67]

    Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, New York, NY, USA, 2024. Association for Computing Machinery

  68. [68]

    Bytecheckpoint: A unified checkpointing system for llm development, 2024

    Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Bytecheckpoint: A unified checkpointing system for llm development, 2024. URL https://arxiv.org/abs/2407.20143

  69. [69]

    Robust llm training infrastructure at bytedance, 2025

    Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zheru...

  70. [70]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. URL https://arxiv.org/abs/2409.12191

  72. [72]

    Gemini: Fast failure recovery in distributed training with in-memory checkpoints

    Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 364–381, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi: 10.1...

  73. [73]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024. URL https://arxiv.org/abs/2410.13848

  74. [74]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

  75. [75]

    Data mixing laws: Optimizing data mixtures by predicting language modeling performance, 2024

    Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance, 2024. URL https://arxiv.org/abs/2403.16952

  76. [76]

    An empirical evaluation of columnar storage formats

    Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. An empirical evaluation of columnar storage formats. Proc. VLDB Endow., 17(2):148–161, October 2023. doi: 10.14778/3626292.3626298

  77. [77]

    Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models

    Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models. In Proceedings of the ACM SIGCOMM 2025 Conference, SIGCOMM ’25, pages 24–38, New York, NY, USA, 2025. Association for Computing Machinery

  78. [78]

    Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product

    Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. Understanding data storage and ingestion for large-scale deep recommendation model training: industrial p...

  79. [79]

    cedar: Optimized and unified machine learning input data pipelines

    Mark Zhao, Emanuel Adamiak, and Christos Kozyrakis. cedar: Optimized and unified machine learning input data pipelines. Proc. VLDB Endow., 18(2):488–502, 2024

  80. [80]

    Pytorch fsdp: Experiences on scaling fully sharded data parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860, 2023

Showing first 80 references.