pith. machine review for the scientific record.

arxiv: 2504.09844 · v4 · submitted 2025-04-14 · 💻 cs.DC · cs.AI

Recognition: unknown

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Authors on Pith: no claims yet
classification: 💻 cs.DC · cs.AI
keywords: data, training, multisource, access, loaders, megascale-data, memory, source
Original abstract

Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of the training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, because the attention operator has quadratic computational complexity, a non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This replication also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access and memory overhead under hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFM training, with three key innovations: (1) disaggregated data preprocessing via role-specific actors (Source Loaders and Data Constructors) to eliminate redundant data access across sources and parallelism dimensions and to ensure multisource scalability; (2) a centralized, declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning; (3) a multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also share our designs and operational experience with deployment and fault tolerance. MegaScale-Data achieves up to a 4.5x improvement in end-to-end training throughput and a 13.5x reduction in CPU memory usage.
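The first challenge named in the abstract, attention's quadratic cost turning a skewed sample-length distribution into per-rank workload imbalance, can be made concrete with a small sketch. The snippet below is illustrative only and is not MegaScale-Data's actual mechanism: it models per-sample compute as L^2 and compares a naive round-robin split across data-parallel ranks with a greedy longest-first balancer. All function names and sample lengths are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): why sample assignment
# across data-parallel ranks matters when attention cost grows quadratically
# with sequence length. All names and numbers here are made up.
import random

def attention_cost(seq_len: int) -> int:
    """Proxy for per-sample compute: attention is O(L^2) in sequence length."""
    return seq_len * seq_len

def round_robin(samples, num_ranks):
    """Naive assignment: rank i takes every num_ranks-th sample."""
    return [samples[i::num_ranks] for i in range(num_ranks)]

def greedy_balance(samples, num_ranks):
    """Longest-first greedy packing: give each sample to the least-loaded rank."""
    ranks = [[] for _ in range(num_ranks)]
    loads = [0] * num_ranks
    for s in sorted(samples, key=attention_cost, reverse=True):
        i = loads.index(min(loads))
        ranks[i].append(s)
        loads[i] += attention_cost(s)
    return ranks

def imbalance(assignment):
    """Max-to-mean per-rank cost ratio; 1.0 means perfectly balanced."""
    loads = [sum(attention_cost(s) for s in rank) for rank in assignment]
    return max(loads) / (sum(loads) / len(loads))

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical mixed-source batch: mostly short samples plus a long-context tail.
    samples = [random.randint(256, 2048) for _ in range(960)] + \
              [random.randint(16_384, 131_072) for _ in range(64)]
    for name, assign in [("round-robin", round_robin), ("greedy", greedy_balance)]:
        print(f"{name:11s} imbalance = {imbalance(assign(samples, 8)):.2f}")
```

On such a mixed batch, the greedy packing keeps the max-to-mean load ratio near 1 while round-robin can leave some ranks with several times the average work; per the abstract, MegaScale-Data tackles this class of imbalance at load time through its centralized data plane and auto-partitioning rather than with per-rank heuristics like this one.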

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

    cs.DC · 2026-05 · unverdicted · novelty 6.0

    Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.

  2. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC · 2026-05 · unverdicted · novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.