MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Borui Wan; Chuan Wu; Haibin Lin; Jialing He; Jianyu Jiang; Junda Feng; Juntao Zhao; Kaihua Jiang; Lei Zuo; Qi Lu

arxiv: 2504.09844 · v4 · submitted 2025-04-14 · 💻 cs.DC · cs.AI

Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Chuan Wu

Pith reviewed 2026-05-22 21:09 UTC · model grok-4.3

keywords: distributed dataloader, multisource training, large foundation models, data parallelism, preprocessing scaling, memory optimization, workload balancing

The pith

MegaScale-Data disaggregates preprocessing into role-specific actors and applies multi-level auto-partitioning to scale dataloaders across multiple data sources for large foundation model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard data-parallel loaders create workload imbalance because attention's quadratic cost varies with sample distribution, and they duplicate file-access state for each source across every rank. MegaScale-Data separates preprocessing into Source Loaders and Data Constructors, routes orchestration through a central declarative plane, and uses multi-level auto-partitioning to match heterogeneous costs. The design removes redundant memory copies and supports dynamic mixing such as curriculum learning. If the mechanism works, training runs finish with substantially less idle time and far lower per-rank memory footprint when data comes from many sources.
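
To make the imbalance concrete, here is a minimal sketch. It assumes attention cost can be approximated by the sum of squared sequence lengths on a rank, ignoring constants and the linear terms of the model. The function names and the sample lengths are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def attention_cost(seq_lens):
    """Proxy for attention work on one rank: sum of squared sequence lengths."""
    return sum(n * n for n in seq_lens)


def imbalance_ratio(ranks):
    """Max per-rank cost divided by mean per-rank cost; 1.0 means balanced."""
    costs = [attention_cost(r) for r in ranks]
    return max(costs) / (sum(costs) / len(costs))


# Two data-parallel ranks holding the same number of tokens (8192 each),
# but with different sample shapes (illustrative values only).
rank_a = [8192]        # one long document
rank_b = [1024] * 8    # eight short documents

cost_a = attention_cost(rank_a)
cost_b = attention_cost(rank_b)
print(cost_a // cost_b)                    # rank_a does 8x the attention work
print(imbalance_ratio([rank_a, rank_b]))   # > 1.0, so rank_b waits on rank_a
```

Equal token counts per rank therefore do not imply equal compute, which is why the sample distribution across ranks matters.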

Core claim

MegaScale-Data is a distributed data-loading architecture that uses disaggregated preprocessing via role-specific actors to eliminate source and parallelism redundancy, a centralized declarative data plane to orchestrate multisource mixing, and a multi-level auto-partitioning mechanism to balance heterogeneous preprocessing costs; the resulting system reports up to 4.5x end-to-end training throughput improvement and 13.5x reduction in CPU memory usage.

What carries the argument

The multi-level auto-partitioning and scaling mechanism for source loaders, which estimates and balances preprocessing costs across heterogeneous data sources while preserving multisource scalability.
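
The text above does not say how the partitioning is computed, so the following is only a sketch of one plausible building block: splitting a fixed worker budget across sources in proportion to an estimated load, with largest-remainder rounding. The function `allocate_workers`, the source names, and the cost figures are all assumptions made for illustration, not the paper's algorithm.

```python
def allocate_workers(source_costs, total_workers):
    """Split a worker budget across sources in proportion to estimated load.

    source_costs maps a source name to an estimated load (for example,
    seconds of preprocessing needed per training step). Every source is
    given at least one worker; the rest are shared proportionally.
    """
    if total_workers < len(source_costs):
        raise ValueError("need at least one worker per source")
    total = sum(source_costs.values())
    spare = total_workers - len(source_costs)
    # Ideal (fractional) share of the spare workers for each source.
    shares = {s: spare * c / total for s, c in source_costs.items()}
    alloc = {s: 1 + int(shares[s]) for s in source_costs}
    # Hand out what is left by largest fractional remainder.
    leftover = total_workers - sum(alloc.values())
    by_remainder = sorted(
        source_costs, key=lambda s: shares[s] - int(shares[s]), reverse=True
    )
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


# Hypothetical estimated loads for three sources.
example = allocate_workers({"text": 1.0, "image": 3.0, "video": 12.0}, 19)
print(example)
```

A real system would also need to refresh the estimates at runtime and bound how often workers are moved, which this sketch leaves out.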

If this is right

End-to-end training throughput rises by up to 4.5 times when data sources differ in preprocessing cost.
CPU memory footprint of the dataloader drops by up to 13.5 times by removing replicated file-access state.
Dynamic mixing policies such as curriculum learning or long-short context become practical without extra redundancy.
Hybrid parallelism configurations avoid duplicated access and memory overhead across data sources.
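
The memory argument can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a simplified model in which file-access state has a fixed size per source; the process counts and sizes are hypothetical and do not reproduce the paper's measurements.

```python
MIB = 1024 * 1024


def replicated_state_bytes(num_loader_procs, num_sources, state_bytes_per_source):
    """Every loader process keeps file-access state for every source."""
    return num_loader_procs * num_sources * state_bytes_per_source


def disaggregated_state_bytes(num_sources, loaders_per_source, state_bytes_per_source):
    """Only the loaders dedicated to a source keep that source's state."""
    return num_sources * loaders_per_source * state_bytes_per_source


# Hypothetical setup: 64 loader processes, 100 sources, 50 MiB of state each,
# and 2 dedicated loaders per source after disaggregation.
before = replicated_state_bytes(64, 100, 50 * MIB)
after = disaggregated_state_bytes(100, 2, 50 * MIB)
print(before // MIB, after // MIB, before // after)
```

Under this model the reduction factor is simply the number of replicating processes divided by the number of dedicated loaders per source, so the benefit grows with the degree of data parallelism.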

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disaggregation pattern could reduce state duplication in other training components that currently replicate per-rank metadata.
Production clusters might adopt the design to support more frequent data-source changes without re-tuning partitions manually.
The approach invites direct measurement of how estimation error in cost prediction scales with the number of distinct sources.

Load-bearing premise

The auto-partitioning can accurately predict and equalize preprocessing costs across sources without adding coordination overhead that cancels the gains.

What would settle it

Measure achieved throughput and memory on a workload whose sources have deliberately mismatched preprocessing times; if the speedup falls below the claimed factor while overhead rises, the balancing claim is falsified.
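
One way to quantify such a test is to record per-rank data-preparation time for each step and derive the idle fraction and the speedup. The sketch below assumes synchronous steps that wait for the slowest rank; the helper names and the timings are hypothetical.

```python
def idle_fraction(per_rank_seconds):
    """Share of aggregate rank-time spent waiting for the slowest rank."""
    step = max(per_rank_seconds)
    busy = sum(per_rank_seconds)
    return 1.0 - busy / (step * len(per_rank_seconds))


def speedup(baseline_step_seconds, candidate_step_seconds):
    """Ratio of baseline step time to candidate step time."""
    return baseline_step_seconds / candidate_step_seconds


# Hypothetical timings: one rank draws an expensive source in the baseline,
# while the candidate spreads the same total work evenly.
unbalanced = [1.0, 1.0, 1.0, 4.0]
balanced = [1.75, 1.75, 1.75, 1.75]

print(idle_fraction(unbalanced))                 # 0.5625
print(idle_fraction(balanced))                   # 0.0
print(speedup(max(unbalanced), max(balanced)))   # about 2.29
```

Reporting the idle fraction alongside the speedup separates gains from better balancing from gains that come from other parts of the system.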

Original abstract

Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. When preparing data for LFM training that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFMs training, with three key innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for load-time multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also contribute our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to: (1) 4.5x end-to-end training throughput improvement, and (2) 13.5x reduction in CPU memory usage.
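
To illustrate what a declarative mixing rule for curriculum learning might look like, here is a minimal sketch. The schedule format, the source names, and the functions `weights_at` and `pick_source` are hypothetical; the abstract does not describe the actual interface of the data plane.

```python
import random

# Hypothetical declarative spec: (training step, sampling weight per source).
# Every breakpoint is assumed to list the same set of sources.
MIX_SCHEDULE = [
    (0, {"short_text": 0.9, "long_text": 0.1}),
    (10000, {"short_text": 0.5, "long_text": 0.5}),
]


def weights_at(step, schedule):
    """Linearly interpolate source weights between schedule breakpoints."""
    if step <= schedule[0][0]:
        return dict(schedule[0][1])
    for (s0, w0), (s1, w1) in zip(schedule, schedule[1:]):
        if s0 <= step <= s1:
            t = (step - s0) / (s1 - s0)
            return {k: w0[k] + t * (w1[k] - w0[k]) for k in w0}
    # Past the last breakpoint: hold the final mix.
    return dict(schedule[-1][1])


def pick_source(step, schedule, rng):
    """Sample which source the next example is drawn from at this step."""
    w = weights_at(step, schedule)
    names = list(w)
    return rng.choices(names, weights=[w[n] for n in names], k=1)[0]


rng = random.Random(0)
print(weights_at(5000, MIX_SCHEDULE))
print(pick_source(5000, MIX_SCHEDULE, rng))
```

Keeping the schedule as data rather than code is what lets a central component change the mix at load time without touching each loader.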

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MegaScale-Data gives a concrete disaggregated dataloader design for multisource training that claims big throughput and memory wins, but the auto-partitioning evidence is thin.

The paper's core contribution is a distributed dataloader built around three pieces: role-specific Source Loaders and Data Constructors that split preprocessing to cut redundant I/O and memory, a centralized declarative plane that handles mixing rules like curriculum or multimodality at load time, and multi-level auto-partitioning meant to balance heterogeneous preprocessing costs across data-parallel ranks. Those ideas directly address the two problems called out in the abstract (imbalance from non-uniform samples and duplicated per-source state), and the authors note operational experience with fault tolerance, which is useful for anyone running at this scale. The reported 4.5x end-to-end throughput and 13.5x CPU memory reduction are the headline numbers, and if they hold they would matter for production multisource runs. The architecture itself looks like a step beyond the frameworks whose limits are described.

The main gap is validation. The abstract states the performance numbers but gives no experimental details on baselines, workload definitions, or controls. More critically, there is no profiling data or error metrics showing that the cost estimation in the auto-partitioning stays accurate when sources differ in I/O, decoding, or tokenization cost. If estimation error is large, the gains could depend on hand tuning rather than the automated mechanism.

This is a practical systems paper aimed at teams building or tuning large-scale training stacks. Readers who need to mix heterogeneous data sources at high throughput will find the design patterns worth looking at. The work is coherent enough on its own terms to deserve referee time, even though the experiments will need more scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce MegaScale-Data, a distributed data loading architecture for multisource large foundation model training. It addresses two challenges: workload imbalance among dataloaders due to non-uniform sample distribution and excessive memory consumption from replicated per-dataset file access states. The key innovations are disaggregated data preprocessing using role-specific actors, a centralized declarative data plane for multisource orchestration, and a multi-level auto-partitioning and scaling mechanism for heterogeneous preprocessing costs. The system is reported to achieve up to 4.5x end-to-end training throughput improvement and 13.5x reduction in CPU memory usage, with additional contributions on deployment and fault tolerance.

Significance. If the results hold, the paper makes a significant contribution to the field of distributed systems for machine learning by providing practical solutions to scalability issues in dataloaders for multisource data. The performance improvements could lead to more efficient training of large models, and the industrial experience adds value. The work is grounded in real deployment challenges and offers mechanisms that could be adopted in production environments.

major comments (2)

[Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.
[Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which are necessary to substantiate the 13.5x memory reduction and throughput gains.
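
The kind of cost model validation requested in the first comment could be reported with a simple error metric. The sketch below uses mean absolute percentage error between estimated and measured per-source preprocessing times; the metric choice and the numbers are assumptions for illustration, not results from the paper.

```python
def mean_absolute_percentage_error(estimated, measured):
    """Average of |estimate - measurement| / measurement over all sources."""
    if len(estimated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)


# Hypothetical per-source preprocessing times in seconds per sample.
estimated = [1.0, 2.0, 4.0]
measured = [1.0, 2.5, 5.0]

mape = mean_absolute_percentage_error(estimated, measured)
print(f"{mape:.1%}")  # 13.3%
```

Reporting this per source, rather than only in aggregate, would show whether the estimator degrades on the most expensive sources, where errors hurt balancing the most.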

minor comments (1)

The abstract could more clearly distinguish between the contributions of each of the three innovations to the reported performance numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to incorporate additional evidence and clarifications as outlined.

Referee: [Abstract] The claim that the multi-level auto-partitioning mechanism enables the 4.5x throughput improvement by accurately estimating and balancing heterogeneous preprocessing costs lacks supporting evidence in the form of cost model validation, such as comparisons between estimated and actual preprocessing times or ablations showing the impact of partitioning quality. This is load-bearing for the central empirical claim.

Authors: We agree that the abstract's claim would be strengthened by explicit validation of the cost model. The current manuscript describes the multi-level auto-partitioning mechanism but does not include direct comparisons of estimated versus actual preprocessing times or dedicated ablations on partitioning quality in the presented results. In the revision, we will add these elements (specifically, cost model validation data and an ablation on partitioning impact) to the experimental evaluation section, with a brief reference added to the abstract to support the 4.5x throughput claim.
Revision: yes

Referee: [Abstract] The experimental results are presented without reference to the full methodology, including baseline configurations, specific workload definitions for multisource mixes, or statistical measures like error bars, which are necessary to substantiate the 13.5x memory reduction and throughput gains.

Authors: We concur that the abstract would benefit from explicit linkages to the methodology to better substantiate the reported gains. The full manuscript contains the experimental methodology, but the abstract does not reference baseline configurations, specific multisource workload definitions, or statistical measures such as error bars. We will revise the abstract to include concise references to these aspects (e.g., baseline setups, workload mixes, and error bar details from repeated runs) while directing readers to the relevant evaluation sections for full details.
Revision: yes
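
Error bars from repeated runs, as mentioned in the response, can be derived with the standard library. The sketch below computes the mean, sample standard deviation, and standard error of a throughput metric; the run values are hypothetical placeholders, not measurements from the paper.

```python
import math
import statistics


def summarize_runs(throughputs):
    """Return (mean, sample standard deviation, standard error) of repeated runs."""
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)  # sample stdev, needs >= 2 runs
    stderr = stdev / math.sqrt(len(throughputs))
    return mean, stdev, stderr


# Hypothetical throughput from five repeated runs (arbitrary units).
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
mean, stdev, stderr = summarize_runs(runs)
print(f"{mean:.1f} +/- {stderr:.2f} (n={len(runs)})")
```

Stating the number of runs next to the error bar lets readers judge how much weight a reported ratio such as a speedup can bear.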

Circularity Check

0 steps flagged

Empirical systems paper with measured performance claims; no circular derivation steps

This paper presents a distributed dataloader system with three architectural innovations and reports end-to-end measured improvements (4.5x throughput, 13.5x memory reduction). No equations, fitted parameters, or self-referential definitions appear in the provided text. The multi-level auto-partitioning is presented as an implemented mechanism whose effectiveness is asserted via deployment results rather than derived by construction from its own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The derivation chain is self-contained against external benchmarks (real training workloads), qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard distributed-systems assumptions about data parallelism and the ability to measure preprocessing costs at runtime; no new physical constants or fitted global parameters are introduced.

axioms (1)

Domain assumption: Preprocessing costs across data sources are heterogeneous and can be estimated and partitioned at multiple levels without prohibitive coordination cost. Invoked to justify the multi-level auto-partitioning mechanism for source loaders.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
cs.DC 2026-05 unverdicted novelty 7.0

BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.

BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
cs.DC 2026-05 unverdicted novelty 6.0

Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
cs.DC 2026-05 unverdicted novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1]

Amazon S3 (simple storage service), 2025

Amazon Web Services. Amazon S3 (simple storage service), 2025. URLhttps://docs.aws.amazon.com/zh_cn/ emr/latest/ReleaseGuide/emr-hbase-s3.html. Accessed: 2025-03-22

work page 2025
[2]

Hadoop distributed file system (hdfs), 2025

Apache Software Foundation. Hadoop distributed file system (hdfs), 2025. URLhttps://docs.aws.amazon.com/ zh_cn/emr/latest/ReleaseGuide/emr-encryption-tdehdfs.html. Accessed: 2025-03-22

work page 2025
[3]

Apache parquet documentation: File format configurations, 2025

Apache Software Foundation. Apache parquet documentation: File format configurations, 2025. URLhttps: //parquet.apache.org/docs/file-format/configurations/. Accessed: 2025-03-22

work page 2025
[4]

Key-frame extraction techniques: A review

Milan K Asha Paul, Jeyaraman Kavitha, and P Arockia Jansi Rani. Key-frame extraction techniques: A review. Recent Patents on Computer Science, 11(1):3–16, 2018

work page 2018
[5]

Thekkath

Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiří Šimša, and Chandramohan A. Thekkath. tf.data service: A case for disaggregating ml input data processing. InProceedings of the 2023 ACM Symposium on Cloud Computing, SoCC ’23, page 358–375, New York, NY, USA, 2023. Association for Computing Machinery

work page 2023
[6]

Pathways: Asynchronous distributed dataflow for ml

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent Shafey, Chandu Thekkath, and Yonghui Wu. Pathways: Asynchronous distributed dataflow for ml. In D. Marculescu, Y. Chi, and C. Wu, editors,Proceedings of Machine Le...

work page
[7]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Boettcher and S

S. Boettcher and S. Mertens. Analysis of the karmarkar-karp differencing algorithm.The European Physical Journal B, 65(1):131–140, August 2008. ISSN 1434-6036. doi: 10.1140/epjb/e2008-00320-9. URLhttp://dx.doi. org/10.1140/epjb/e2008-00320-9

work page doi:10.1140/epjb/e2008-00320-9 2008
[9]

Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

work page 2022
[10]

Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré

Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models, 2023. URLhttps: //arxiv.org/abs/2307.14430

work page arXiv 2023
[11]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation, 2023. URLhttps://arxiv.org/abs/2306.15595

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Evans and contributors

Clark C. Evans and contributors. Pillow library. https://pillow.readthedocs.io/en/stable/, 2024. Python Imaging Library (PIL) Fork

work page 2024
[13]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvancesin Neural Information Processing Systems (NeurIPS), Red Hook, NY, USA, 2022. Curran Associates Inc

work page 2022
[14]

Large scale distributed deep networks

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. Large scale distributed deep networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors,Advancesin Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012

work page 2012
[15]

Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Grit- senko, Mario Lucic, and Neil Houlsby. Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution. In A. Oh, T. Naumann...

work page 2023
[16]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 21

work page 2009
[17]

Mycroft: Tracing dependencies in collective communication towards reliable llm training

Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, et al. Mycroft: Tracing dependencies in collective communication towards reliable llm training. arXiv preprint arXiv:2509.03018, 2025

work page arXiv 2025
[18]

Evolution of aegis: Fault diagnosis for AI model training service in production

Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao W...

work page 2025
[19]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

work page 2021
[20]

The llama 3 herd of models, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models, 2024

work page 2024
[21]

Check-N-Run: a checkpointing system for training deep learning recommendation models

Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. Check-N-Run: a checkpointing system for training deep learning recommendation models. In19th USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 22), pages 929–943, Renton, WA, April ...

work page 2022
[22]

Pytorchvideo: A deep learning library for video understanding

Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, and Christoph Feichtenhofer. Pytorchvideo: A deep learning library for video understanding. InProceedings of the 29th ACM International Conference on Multi...

work page
[23]

ISBN 9781450386517

Association for Computing Machinery. ISBN 9781450386517. doi: 10.1145/3474085.3478329. URL https://doi.org/10.1145/3474085.3478329

work page doi:10.1145/3474085.3478329
[24]

Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation, 2024

Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, and Minlan Yu. Optimus: Accelerating large-scale multi-modal llm training by bubble exploitation, 2024. URLhttps://arxiv.org/abs/2408.03505

work page arXiv 2024
[25]

Common crawl.https://commoncrawl.org, 2014

Common Crawl Foundation. Common crawl.https://commoncrawl.org, 2014

work page 2014
[26]

A comparison on scalability for batch big data processing on apache spark and apache flink.Big Data Analytics, 2:1–11, 2017

Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, and Francisco Herrera. A comparison on scalability for batch big data processing on apache spark and apache flink.Big Data Analytics, 2:1–11, 2017

work page 2017
[27]

Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025

Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025. URL https://arxiv.org/abs/2502.21231

work page arXiv 2025
[28]

G. Graefe. Volcano: An extensible and parallel query evaluation system.IEEE Trans.on Knowl. and Data Eng., 6(1):120–135, February 1994. doi: 10.1109/69.273032

work page doi:10.1109/69.273032 1994
[29]

Thekkath, and Ana Klimovic

Dan Graur, Damien Aymon, Dan Kluser, Tanguy Albrici, Chandramohan A. Thekkath, and Ana Klimovic. Cachew: Machine learning input data processing as a service. In2022 USENIX Annual Technical Conference (USENIX ATC22), pages 689–706, Carlsbad, CA, July 2022. USENIX Association

work page 2022
[30]

Thekkath, and Ana Klimovic

Dan Graur, Oto Mraz, Muyu Li, Sepehr Pourghannad, Chandramohan A. Thekkath, and Ana Klimovic. Pecan: Cost-Efficient ML data preprocessing with automatic transformation ordering and hybrid placement. In2024 USENIX Annual TechnicalConference (USENIX ATC24), pages 649–665, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-41-0. URLhttps://www...

work page 2024
[31]

Characterization of large language model development in the datacenter, 2024

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. Characterization of large language model development in the datacenter, 2024. URLhttps://arxiv.org/abs/2403.07648

work page arXiv 2024
[32]

Characterization of large language model development in the datacenter

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. Characterization of large language model development in the datacenter. In21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 709–729, Santa Clara, CA, April 2024. U...

work page 2024
[33]

Distmm: accelerating distributed mul- timodal model training

Jun Huang, Zhen Zhang, Shuai Zheng, Feng Qin, and Yida Wang. Distmm: accelerating distributed mul- timodal model training. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI’24, USA, 2024. USENIX Association

work page 2024
[34]

Le, Yonghui Wu, and Zhifeng Chen.GPipe: efficienttraining of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen.GPipe: efficienttraining of giant neural networks using pipeline parallelism. Curran Associates Inc., Red Hook, NY, USA, 2019

work page 2019
[35]

System optimizations for enabling training of extreme long sequence transformer models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. InProceedings of the 43rd ACMSymposium on Principles of Distributed Computing, PODC ’24, page 121–130, New York, NY, USA, 202...

work page 2024
[36]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Zico Kolter

Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, and J. Zico Kolter. Adaptive data optimization: Dynamic sample selection with scaling laws, 2024. URLhttps://arxiv.org/abs/2410.11820

work page arXiv 2024
[38]

MegaScale: Scaling large language model training to more than 10,000 GPUs

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

work page 2024
[39]

In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, USA,

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al.{MegaScale}: Scaling large language model training to more than 10,000{GPUs}. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, USA,

work page
[40]

Kuaishou

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. URL https: //arxiv.org/abs/2205.05198

work page arXiv 2022
[41]

Kosec, S

Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance, 2022. URLhttps: //arxiv.org/abs/2107.02027

work page arXiv 2022
[42]

Sidecar containers, 2024

Kubernetes. Sidecar containers, 2024. URL https://kubernetes.io/docs/concepts/workloads/pods/ sidecar-containers/. Kubernetes Documentation v1.29

work page 2024
[43]

The stability-efficiency dilemma: Investigating sequence length warmup for training gpt models, 2022

Conglong Li, Minjia Zhang, and Yuxiong He. The stability-efficiency dilemma: Investigating sequence length warmup for training gpt models, 2022. URLhttps://arxiv.org/abs/2108.06084

work page arXiv 2022
[44]

Pytorch distributed: experiences on accelerating data parallel training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, August 2020. ISSN 2150-8097

work page 2020
[45]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Experiences on accelerating data parallel training, 2020. URLhttps://arxiv.org/abs/2006.15704

work page internal anchor Pith review Pith/arXiv arXiv 2020
[46]

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024

Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024. URLhttps://arxiv.org/abs/2401.02669

work page arXiv 2024
[47]

Ring attention with blockwise transformers for near-infinite context, 2023

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. 23

work page 2023
[48]

The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025

Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, and Stephanie Wang. The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025. URLhttps://arxiv.org/abs/2501. 12407

work page 2025
[49]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-06

work page 2025
[50]

CheckFreq: Frequent, Fine-Grained DNN checkpointing

Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. CheckFreq: Frequent, Fine-Grained DNN checkpointing. In19th USENIX Conference on File and Storage Technologies(FAST21), pages 203–216. USENIX Association, February 2021. ISBN 978-1-939133-20-5. URL https://www.usenix.org/conference/fast21/ presentation/mohan

[51]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 561–577, USA, 2018. US...

[52]

Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. tf.data: A machine learning data processing framework, 2021. URL https://arxiv.org/abs/2101.12127

[53]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, New York, NY, USA,

[54]

Association for Computing Machinery

[55]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, St...

[56]

PyTorch contributors. torch.utils.data — PyTorch 2.4 documentation, 2024. URL https://pytorch.org/docs/stable/data.html. Accessed: [Insert access date]

[57]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery

[58]

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

[59]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

[60]

Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow, 2018. URL https://arxiv.org/abs/1802.05799

[61]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

[62]

Rishika Shree, Tanupriya Choudhury, Subhash Chand Gupta, and Praveen Kumar. Kafka: The modern platform for data management and analysis in big data domain. In 2017 2nd International Conference on Telecommunication and Networks (TEL-NET), pages 1–5, 2017. doi: 10.1109/TEL-NET.2017.8343593

[63]

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey, 2022. URL https://arxiv.org/abs/2101.10382

[64]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805

[65]

Taegeon Um, Byungsoo Oh, Byeongchan Seo, Minhyeok Kweun, Goeun Kim, and Woo-Yeon Lee. Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline. Proc. VLDB Endow., 16(5):1086–1099, January 2023

[66]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[67]

Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, New York, NY, USA, 2024. Association for Computing Machinery

[68]

Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Bytecheckpoint: A unified checkpointing system for llm development, 2024. URL https://arxiv.org/abs/2407.20143

[69]

Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zheru...

[70]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,

[71]

URL https://arxiv.org/abs/2409.12191

[72]

Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 364–381, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi: 10.1...

[73]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024. URL https://arxiv.org/abs/2410.13848

[74]

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

[75]

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance, 2024. URL https://arxiv.org/abs/2403.16952

[76]

Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. An empirical evaluation of columnar storage formats. Proc. VLDB Endow., 17(2):148–161, October 2023. doi: 10.14778/3626292.3626298

[77]

Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models. In Proceedings of the ACM SIGCOMM 2025 Conference, SIGCOMM ’25, page 24–38, New York, NY, USA, 2025. Association for Computing Machinery

[78]

Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. Understanding data storage and ingestion for large-scale deep recommendation model training: industrial p...

[79]

Mark Zhao, Emanuel Adamiak, and Christos Kozyrakis. cedar: Optimized and unified machine learning input data pipelines. Proc. VLDB Endow., 18(2):488–502, 2024.

[80]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860, 2023


[45] [45]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Experiences on accelerating data parallel training, 2020. URLhttps://arxiv.org/abs/2006.15704

work page internal anchor Pith review Pith/arXiv arXiv 2020

[46] [46]

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024

Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024. URLhttps://arxiv.org/abs/2401.02669

work page arXiv 2024

[47] [47]

Ring attention with blockwise transformers for near-infinite context, 2023

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. 23

work page 2023

[48] [48]

The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025

Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, and Stephanie Wang. The streaming batch model for efficient and fault-tolerant heterogeneous execution, 2025. URLhttps://arxiv.org/abs/2501. 12407

work page 2025

[49] [49]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, April 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-06

work page 2025

[50] [50]

CheckFreq: Frequent, Fine-Grained DNN checkpointing

Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. CheckFreq: Frequent, Fine-Grained DNN checkpointing. In19th USENIX Conference on File and Storage Technologies(FAST21), pages 203–216. USENIX Association, February 2021. ISBN 978-1-939133-20-5. URL https://www.usenix.org/conference/fast21/ presentation/mohan

work page 2021

[51] [51]

Jordan, and Ion Stoica

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: a distributed framework for emerging ai applications. InProceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 561–577, USA, 2018. US...

work page 2018

[52] [52]

Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk

Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. tf.data: A machine learning data processing framework, 2021. URLhttps://arxiv.org/abs/2101.12127

work page arXiv 2021

[53] [53]

Pipedream: Generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, New York, NY, USA,

work page

[54] [54]

Association for Computing Machinery

work page

[55] [55]

Efficient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. InProceedings of the International Conference for High Performance Computing, Networking, St...

work page 2021

[56] [56]

torch.utils.data — PyTorch 2.4 documentation, 2024

PyTorch contributors. torch.utils.data — PyTorch 2.4 documentation, 2024. URLhttps://pytorch.org/docs/ stable/data.html. Accessed: [Insert access date]

work page 2024

[57] [57]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery

work page 2020

[58] [58]

Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

work page 2021

[59] [59]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017

[60] [60]

Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow, 2018. URLhttps://arxiv.org/abs/1802.05799

work page internal anchor Pith review Pith/arXiv arXiv 2018

[61] [61]

Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

work page 2020

[62] [62]

Kafka: The modern platform for data management and analysis in big data domain

Rishika Shree, Tanupriya Choudhury, Subhash Chand Gupta, and Praveen Kumar. Kafka: The modern platform for data management and analysis in big data domain. In2017 2nd International Conference on Telecommunication and Networks (TEL-NET), pages 1–5, 2017. doi: 10.1109/TEL-NET.2017.8343593

work page doi:10.1109/tel-net.2017.8343593 2017

[63] [63]

Curriculum learning: A survey, 2022

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey, 2022. URL https://arxiv.org/abs/2101.10382

work page arXiv 2022

[64] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805

[65] Taegeon Um, Byungsoo Oh, Byeongchan Seo, Minhyeok Kweun, Goeun Kim, and Woo-Yeon Lee. Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline. Proc. VLDB Endow., 16(5):1086–1099, Jan 2023

[66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[67] Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 195–210, New York, NY, USA, 2024. Association for Computing Machinery

[68] Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Bytecheckpoint: A unified checkpointing system for llm development, 2024. URL https://arxiv.org/abs/2407.20143

[69] Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zheru...

[70] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191

[72] Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 364–381, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi: 10.1145/3600006.3613145

[73] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024. URL https://arxiv.org/abs/2410.13848

[74] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding...

[75] Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance, 2024. URL https://arxiv.org/abs/2403.16952

[76] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. An empirical evaluation of columnar storage formats. Proc. VLDB Endow., 17(2):148–161, October 2023. doi: 10.14778/3626292.3626298

[77] Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models. In Proceedings of the ACM SIGCOMM 2025 Conference, SIGCOMM ’25, pages 24–38, New York, NY, USA, 2025. Association for Computing Machinery

[78] Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product...

[79] Mark Zhao, Emanuel Adamiak, and Christos Kozyrakis. cedar: Optimized and unified machine learning input data pipelines. Proc. VLDB Endow., 18(2):488–502, 2024

[80] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860, 2023