
arxiv: 2605.09994 · v2 · pith:NHFUK5FQnew · submitted 2026-05-11 · 💻 cs.DC · cs.LG

BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training

Pith reviewed 2026-05-19 17:51 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords object store · data plane · foundation model training · distributed training · consistency · fault tolerance · batch processing · large language models

The pith

BatchWeave builds a consistent object-store-native data plane that delivers atomic all-rank batch visibility and exactly-once recovery for distributed foundation model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large foundation model training has turned the data pipeline into a dynamic component that must evolve with the training process itself. Colocated dataloaders lack any failure isolation while message-queue systems use record and offset abstractions that cannot express the batch semantics required by distributed training. BatchWeave addresses both limits by placing coordination directly in the object store through versioned manifests and conditional writes. It defines the Transactional Global Batch to guarantee atomic visibility across ranks, globally ordered steps, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. On 64-GPU multimodal pre-training and supervised fine-tuning workloads it exceeds colocated dataloader throughput while adding isolation and beats Apache Kafka on ingestion throughput and consumer read latency.

Core claim

BatchWeave uses versioned manifests and conditional object writes to coordinate batch publication, recovery, and lifecycle management in an object-store-native data plane. It introduces the Transactional Global Batch (TGB), which builds on versioned-manifest ACID storage semantics and adds training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Recovery and retention are realized directly in the storage layer by durably persisting producer state through the commit protocol and tying reclamation to distributed checkpoint state. Its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication.
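The coordination pattern named here, publishing a batch by conditionally replacing a versioned manifest, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's protocol: `InMemoryStore`, `put_if_version`, `publish_step`, and `read_shard` are hypothetical names, and an in-memory dictionary stands in for an object store that supports conditional writes. A step becomes visible to every rank at once because all of its shards enter the manifest in a single successful conditional write.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            current_version, _ = self._objects.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._objects[key] = (current_version + 1, value)
            return True


def publish_step(store, shards_by_rank, world_size, max_retries=10):
    """Append one global batch to the manifest; return its step index."""
    if set(shards_by_rank) != set(range(world_size)):
        raise ValueError("a global batch needs exactly one shard per rank")
    for _ in range(max_retries):
        version, manifest = store.get("manifest")
        steps = list(manifest or [])
        new_manifest = steps + [dict(shards_by_rank)]
        # The conditional write succeeds for one writer per manifest version,
        # which yields a single global step order.
        if store.put_if_version("manifest", version, new_manifest):
            return len(steps)
    raise RuntimeError("too many conflicting commits")


def read_shard(store, step, rank):
    """Return the shard for (step, rank), or None if the step is not committed."""
    _, manifest = store.get("manifest")
    if manifest is None or step >= len(manifest):
        return None
    return manifest[step][rank]
```

In this sketch a consumer on any rank either sees a step with all of its shards or does not see the step at all; a writer that loses the race re-reads the manifest and retries.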

What carries the argument

The Transactional Global Batch (TGB), which extends versioned-manifest ACID storage semantics with training-specific consistency guarantees such as atomic all-rank batch visibility, globally ordered steps, and checkpoint-aligned lifecycle management.

If this is right

  • Training frameworks can separate data loading from compute nodes while preserving consistency and fault isolation.
  • Object stores can replace message queues for batch ingestion without sacrificing speed or introducing coordination bottlenecks.
  • Data retention policies can be tied directly to training checkpoints, reducing unnecessary storage during long runs.
  • Exactly-once recovery becomes a storage-layer property, simplifying fault handling in large-scale pre-training jobs.
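The checkpoint-tied retention idea in the list above can be illustrated with a small sketch. It assumes, as a simplification not taken from the paper, that each rank reports the last step covered by its durable checkpoint and that a restart resumes from the oldest such step; `reclaimable_steps` is a hypothetical helper.

```python
def reclaimable_steps(committed_steps, checkpointed_step_by_rank):
    """Return stored steps that every rank has covered with a durable checkpoint.

    committed_steps: iterable of step indices currently held in storage.
    checkpointed_step_by_rank: rank -> last step included in that rank's checkpoint.
    """
    if not checkpointed_step_by_rank:
        return []  # no checkpoint yet, so nothing is safe to delete
    # A restart resumes from the oldest checkpoint across ranks (assumption),
    # so only steps at or below that watermark can be reclaimed.
    safe_watermark = min(checkpointed_step_by_rank.values())
    return sorted(s for s in committed_steps if s <= safe_watermark)
```

For example, with ten stored steps and checkpoints at steps 6 and 4 on two ranks, `reclaimable_steps(range(10), {0: 6, 1: 4})` returns `[0, 1, 2, 3, 4]`: steps 5 and later stay available for replay.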

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could generalize to other distributed systems that need batch-level consistency guarantees over commodity object storage.
  • Cluster operators might simplify architectures by moving more coordination logic into the storage service itself.
  • Smaller-scale experiments on 8- or 16-GPU setups could test whether the throughput and latency advantages hold before full 64-GPU deployment.

Load-bearing premise

Object stores can deliver the versioned-manifest ACID semantics and conditional-write performance needed for atomic all-rank batch visibility and checkpoint-aligned lifecycle management without introducing latency or throughput penalties that would erase the reported gains over colocated and Kafka baselines.

What would settle it

A head-to-head run of the 64-GPU multimodal pre-training workload in which BatchWeave fails to exceed colocated dataloader throughput or Apache Kafka ingestion throughput while still providing full failure isolation and lower consumer read latency.

Figures

Figures reproduced from arXiv: 2605.09994 by Bingyi Jing, Jiaxing Zhang, Jingyi Xi, Junjie Zhang, Songxin Zhang, Ting Sun, Xiao Yan, Zejian Xie, Zhuoyang Song, Zunyao Mao.

Figure 1: Training-time preprocessing inflates data volume
Figure 2: Three architectural patterns for training dataflow.
Figure 3: Two structural limitations of message queues for ...
Figure 4: Lakestream architecture. Producers write TGBs di...
Figure 5: Per-step latency (ms, log) and end-to-end throughput (step/s) across three workloads. Kafka failed on the Qwen3-VL ...
Figure 6: Producer ingestion throughput versus producer count across three payload sizes. Lakestream is the only system ...
Figure 7: DAC ablation with 32 producers over 5 hours. DAC ...
Figure 9: Checkpoint-driven storage reclamation over a 1,010- ...
Figure 10: Consumer throughput, P95 latency, and read am...
Original abstract

Modern Large Foundation Model (LFM) training has transformed the data pipeline from a static ingestion layer into a dynamic component that must co-evolve with the training process. Existing systems are ill-equipped: colocated dataloaders offer no failure isolation, while message queue-based disaggregated dataloaders operate on a record/offset abstraction that cannot express the batch-level semantics required by distributed training. We present BatchWeave, an object-store-native training data plane for distributed LFM training. BatchWeave uses versioned manifests and conditional object writes to coordinate batch publication, recovery, and lifecycle management. First, it introduces the Transactional Global Batch (TGB), which builds on versioned-manifest ACID storage semantics and extends them with training-specific consistency, including atomic all-rank batch visibility, a globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery. Second, it realizes recovery and retention directly in the storage layer, by durably persisting producer state through the commit protocol and tying reclamation to distributed checkpoint state. Third, its Decentralized Adaptive Commit (DAC) algorithm sustains stable ingestion throughput as the manifest grows, without any inter-producer communication. Evaluations on large-scale multimodal pre-training and SFT workloads using 64 GPUs show that BatchWeave outperforms colocated dataloader throughput while providing full failure isolation, outperforms Apache Kafka in ingestion throughput, and achieves lower consumer read latency than Kafka.
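The abstract's second point, persisting producer state through the commit protocol, is what makes exactly-once recovery possible: if a batch and the producer's cursor are recorded in the same atomic commit, a restarted producer can resume without duplicating or skipping data. The following is a minimal sketch of that idea under assumptions; `ManifestLog`, `producer_cursor`, and `run_producer` are hypothetical names, and the single-process log is a stand-in for a manifest updated by conditional writes.

```python
class ManifestLog:
    """Hypothetical manifest: committed batches plus per-producer progress."""

    def __init__(self):
        self.version = 0
        self.batches = []          # globally ordered committed batches
        self.producer_cursor = {}  # producer_id -> next source offset to read

    def commit(self, expected_version, producer_id, batch, next_offset):
        """Record the batch and the producer's new cursor in one step."""
        if expected_version != self.version:
            return False  # another commit won; caller re-reads and retries
        self.batches.append(batch)
        self.producer_cursor[producer_id] = next_offset
        self.version += 1
        return True


def run_producer(log, producer_id, source, batch_size, crash_after=None):
    """Produce batches from `source`, resuming from the committed cursor."""
    offset = log.producer_cursor.get(producer_id, 0)
    produced = 0
    while offset < len(source):
        batch = source[offset:offset + batch_size]
        next_offset = offset + len(batch)
        while not log.commit(log.version, producer_id, batch, next_offset):
            pass  # retry on conflict (does not occur in this single-threaded demo)
        offset = next_offset
        produced += 1
        if crash_after is not None and produced == crash_after:
            return  # simulate a crash; only committed state survives


log = ManifestLog()
source = list(range(10))
run_producer(log, "p0", source, batch_size=4, crash_after=1)  # crash after one commit
run_producer(log, "p0", source, batch_size=4)                 # restart and resume
```

After the restart, `log.batches` holds each source item exactly once, because the resumed producer starts from the cursor stored by its last successful commit rather than from local state.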

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents BatchWeave as an object-store-native data plane for large foundation model training. It leverages versioned manifests and conditional object writes to implement the Transactional Global Batch (TGB) construct, which provides atomic all-rank batch visibility, globally ordered step sequences, checkpoint-aligned lifecycle management, and exactly-once recovery. The Decentralized Adaptive Commit (DAC) algorithm is introduced to maintain stable ingestion throughput as manifests grow without inter-producer communication. Evaluations on multimodal pre-training and SFT workloads using 64 GPUs claim that BatchWeave exceeds colocated dataloader throughput with full failure isolation, surpasses Apache Kafka in ingestion throughput, and delivers lower consumer read latency than Kafka.

Significance. Should the performance and consistency claims be substantiated, this work would represent a meaningful advance in disaggregated data planes for distributed ML training. By shifting coordination to object store primitives, it offers a path to better failure isolation and scalability compared to traditional colocated or queue-based approaches. The integration of training-specific semantics like checkpoint alignment directly into the storage layer is a promising direction that could influence future system designs in the field.

major comments (3)
  1. [§5 (Evaluation)] The performance results for the 64-GPU multimodal and SFT workloads are presented without error bars, detailed workload specifications, or exclusion criteria. This omission makes it difficult to assess whether the reported gains in throughput and latency are statistically robust or influenced by specific experimental conditions.
  2. [§3.1 (TGB)] The assumption that object stores can deliver versioned-manifest ACID semantics and conditional-write performance at the rates required for training batches without introducing penalties that offset the gains is load-bearing for the central claim. The manuscript compares overall system performance to baselines but does not provide isolated measurements of manifest operation overheads or conditional write latencies under sustained load.
  3. [§4 (DAC)] The DAC algorithm is claimed to sustain stable throughput without inter-producer communication, but the scaling behavior with manifest size and the impact of object store list operation consistency models are not sufficiently analyzed to confirm it avoids the eventual consistency issues common in object stores.
minor comments (2)
  1. [Abstract] The abstract mentions 'large-scale' workloads but could benefit from specifying the exact model sizes or dataset characteristics for context.
  2. [Throughout] Ensure consistent use of acronyms like TGB and DAC upon first introduction in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their detailed and insightful comments, which have helped us improve the clarity and rigor of our evaluation and analysis sections. We address each major comment below, indicating the revisions we have made or will make in the next version of the manuscript.

  1. Referee: [§5 (Evaluation)] The performance results for the 64-GPU multimodal and SFT workloads are presented without error bars, detailed workload specifications, or exclusion criteria. This omission makes it difficult to assess whether the reported gains in throughput and latency are statistically robust or influenced by specific experimental conditions.

    Authors: We agree with this observation and have revised the manuscript to include error bars on all throughput and latency figures in Section 5, representing the standard deviation across at least five independent runs for each data point. We have also expanded the workload specifications to include precise details on dataset composition, batch sizes, sequence lengths, and the exact hardware configuration for the 64-GPU setup. Additionally, we now describe the exclusion criteria, which were limited to runs affected by transient hardware faults unrelated to the data plane. These changes should allow readers to better evaluate the statistical robustness of our results. revision: yes

  2. Referee: [§3.1 (TGB)] The assumption that object stores can deliver versioned-manifest ACID semantics and conditional-write performance at the rates required for training batches without introducing penalties that offset the gains is load-bearing for the central claim. The manuscript compares overall system performance to baselines but does not provide isolated measurements of manifest operation overheads or conditional write latencies under sustained load.

    Authors: This is a valid point regarding the need for more granular performance data. While the end-to-end evaluations demonstrate that any overheads are outweighed by the benefits, we have added isolated microbenchmark results in a new subsection of §3.1. These measure the latency and throughput of versioned manifest operations and conditional writes under sustained load matching our training batch rates. The results confirm that these primitives introduce negligible overhead compared to the overall system gains, supporting the central claim without offsetting penalties. revision: yes

  3. Referee: [§4 (DAC)] The DAC algorithm is claimed to sustain stable throughput without inter-producer communication, but the scaling behavior with manifest size and the impact of object store list operation consistency models are not sufficiently analyzed to confirm it avoids the eventual consistency issues common in object stores.

    Authors: We acknowledge that a more detailed analysis of scaling and consistency would strengthen the presentation. In the revised manuscript, we have extended Section 4 with additional experiments plotting ingestion throughput against increasing manifest sizes up to the maximum observed in our workloads. We also include a discussion of the object store's consistency model (strong consistency for list operations in the evaluated setup) and how the DAC algorithm's design—relying on conditional writes rather than list operations for critical paths—mitigates potential eventual consistency issues. This analysis confirms stable throughput without inter-producer communication. revision: yes
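The error bars promised in the first response (standard deviation across at least five independent runs) amount to a small computation per data point. A minimal sketch, using placeholder numbers that are not results from the paper and a hypothetical helper `summarize_runs`:

```python
from statistics import mean, stdev


def summarize_runs(samples):
    """Return (mean, sample standard deviation) for repeated measurements."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(samples), stdev(samples)


# Placeholder throughput values (step/s) for five hypothetical runs.
runs = [2.0, 2.2, 1.9, 2.1, 2.3]
avg, spread = summarize_runs(runs)
print(f"{avg:.2f} ± {spread:.2f} step/s")
```

Reporting the spread next to the mean lets a reader judge whether a throughput gap between two systems exceeds run-to-run variation.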

Circularity Check

0 steps flagged

No significant circularity detected; design and claims are self-contained


The paper introduces architectural components (TGB extending versioned-manifest ACID semantics, DAC algorithm for commit without inter-producer communication) and supports performance claims via direct empirical evaluation against external baselines (colocated dataloaders, Apache Kafka) on 64-GPU workloads. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described derivation. The central claims rest on implementation details and comparative measurements rather than reducing to self-defined inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The design rests on the domain assumption that object stores can provide the required consistency primitives at scale; two new entities are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption: Object stores can support versioned manifests and conditional object writes with ACID properties suitable for training coordination.
    Central to the coordination of batch publication, recovery, and lifecycle management described in the abstract.
invented entities (2)
  • Transactional Global Batch (TGB) · no independent evidence
    purpose: Extend versioned-manifest ACID semantics with atomic all-rank batch visibility, globally ordered step sequence, checkpoint-aligned lifecycle management, and end-to-end exactly-once recovery.
    New abstraction introduced to meet training-specific consistency requirements.
  • Decentralized Adaptive Commit (DAC) · no independent evidence
    purpose: Sustain stable ingestion throughput as the manifest grows without any inter-producer communication.
    New algorithm proposed to address manifest growth scalability.



