DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

Boxun Li; Chunyang Zhu; Daning Cheng; Guohao Dai; Hao Lin; Junhao Hu; Quanlu Zhang; Yuanqing Wang; Yuchen Zhang; Yunquan Zhang

arxiv: 2605.18815 · v1 · pith:X7AHNLGFnew · submitted 2026-05-12 · 💻 cs.LG · cs.DC

Yuanqing Wang, Yuchen Zhang, Hao Lin, Junhao Hu, Chunyang Zhu, Quanlu Zhang, Boxun Li, Guohao Dai, Zhi Yang, Daning Cheng, Yunquan Zhang, Yu Wang

Pith reviewed 2026-05-20 23:03 UTC · model grok-4.3

keywords: elastic training · parallelism reconfiguration · LLM training · virtual parameter space · distributed systems · MoE models · online switching

The pith

DynaTrain uses a Virtual Parameter Space to reconfigure LLM training parallelism in seconds without checkpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DynaTrain to handle the dynamic nature of LLM training, where the optimal parallelism layout changes with resource fluctuations, RLHF phase shifts, and cluster elasticity. It argues that complex parallelism transitions can be reduced to geometric intersections in a unified virtual coordinate space. The reported result is reconfiguration in under 2s for a 70B dense model and 4.36s for a 235B MoE model. A sympathetic reader would care because current systems rely on slow checkpointing that disrupts long training runs. The authors report that the system preserves model correctness while outperforming prior checkpoint-based and elastic systems by up to three orders of magnitude.

Core claim

DynaTrain presents a Virtual Parameter Space abstraction that maps any distributed training state for arbitrary multi-dimensional parallelism into deterministic coordinates, enabling transition via geometric intersection calculations rather than full state saves and restores.

What carries the argument

The Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space.
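The geometric-intersection idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single 2D logical tensor split into an even grid of blocks, and the names `Box`, `layout`, and `transfer_plan` are hypothetical. Each layout is a deterministic rank-to-box mapping, and the transfer plan is simply every non-empty overlap between a source box and a destination box.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Half-open rectangle [r0, r1) x [c0, c1) in the logical coordinate space."""
    r0: int
    r1: int
    c0: int
    c1: int

    def intersect(self, other: "Box") -> Optional["Box"]:
        # Overlap of two axis-aligned boxes, or None if they do not overlap.
        r0, r1 = max(self.r0, other.r0), min(self.r1, other.r1)
        c0, c1 = max(self.c0, other.c0), min(self.c1, other.c1)
        return Box(r0, r1, c0, c1) if r0 < r1 and c0 < c1 else None


def layout(rows: int, cols: int, row_parts: int, col_parts: int) -> Dict[int, Box]:
    """Deterministic rank -> box mapping for an even grid partition (assumed scheme)."""
    assert rows % row_parts == 0 and cols % col_parts == 0
    rh, cw = rows // row_parts, cols // col_parts
    return {
        i * col_parts + j: Box(i * rh, (i + 1) * rh, j * cw, (j + 1) * cw)
        for i in range(row_parts)
        for j in range(col_parts)
    }


def transfer_plan(src: Dict[int, Box], dst: Dict[int, Box]) -> List[Tuple[int, int, Box]]:
    """List of (src_rank, dst_rank, region) pieces that rebuild every destination box."""
    plan = []
    for d_rank, d_box in sorted(dst.items()):
        for s_rank, s_box in sorted(src.items()):
            piece = d_box.intersect(s_box)
            if piece is not None:
                plan.append((s_rank, d_rank, piece))
    return plan


# Hypothetical example: an 8x8 tensor moves from a 4-way column split to a 2x2 grid.
src = layout(8, 8, row_parts=1, col_parts=4)
dst = layout(8, 8, row_parts=2, col_parts=2)
plan = transfer_plan(src, dst)
```

In this toy case the plan has eight pieces that together cover all 64 elements exactly once, and the pieces where source and destination rank coincide need no network transfer at all.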

If this is right

Training can continue uninterrupted during resource reallocation in elastic clusters.
A 70B dense model can switch configurations in under 2s, and a 235B MoE model in 4.36s.
Outperforms checkpoint-based and elastic systems by up to three orders of magnitude in reconfiguration time.
Correctness is maintained through rank-local transfers under memory-aware schedules.
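The abstract describes the transfer schedule only as memory-aware and deadlock-free, so the following is an illustrative stand-in rather than the paper's scheduler. It assumes a greedy round-based policy: oversized transfers are split into chunks no larger than a staging budget, and within a round each rank sends at most once and receives at most once. The function name `schedule_rounds` and the `(src, dst, nbytes)` tuple shape are hypothetical.

```python
from typing import List, Tuple

Transfer = Tuple[int, int, int]  # (src_rank, dst_rank, nbytes)


def schedule_rounds(transfers: List[Transfer], budget: int) -> List[List[Transfer]]:
    """Greedily pack transfers into rounds under a per-rank staging budget."""
    assert budget > 0
    # Split any transfer larger than the budget into budget-sized chunks.
    pending: List[Transfer] = []
    for s, d, n in transfers:
        while n > budget:
            pending.append((s, d, budget))
            n -= budget
        if n > 0:
            pending.append((s, d, n))
    # Place the largest chunks first (stable sort keeps input order on ties).
    pending.sort(key=lambda t: -t[2])

    rounds: List[List[Transfer]] = []
    while pending:
        senders, receivers = set(), set()
        this_round: List[Transfer] = []
        rest: List[Transfer] = []
        for t in pending:
            s, d, _ = t
            if s in senders or d in receivers:
                rest.append(t)  # rank already busy this round, defer
            else:
                senders.add(s)
                receivers.add(d)
                this_round.append(t)
        rounds.append(this_round)
        pending = rest
    return rounds


# Hypothetical workload on three ranks with an 8-byte staging budget.
rounds = schedule_rounds([(0, 1, 6), (0, 2, 4), (1, 2, 4), (2, 0, 10)], budget=8)
```

Because every round pairs each sender with a distinct receiver and no rank receives more than `budget` bytes at once, the staging memory per rank is bounded. Whether DynaTrain uses rounds at all is an assumption of this sketch.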

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such fast switching could enable real-time adaptation to varying cluster loads without manual intervention.
Future systems might integrate this with automatic parallelism search for ongoing optimization.
This approach may extend to other distributed computing domains beyond LLMs where state reconfiguration is costly.

Load-bearing premise

The Virtual Parameter Space correctly represents all possible parallelism states as deterministic mappings without causing state corruption or correctness issues in transitions.

What would settle it

A test where after a parallelism switch using DynaTrain the model produces different outputs or loses accuracy compared to a checkpoint-based switch on the same model.
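A scaled-down version of such a test can be sketched as follows. It is only an analogy under stated assumptions: a flat parameter vector, even contiguous sharding, and a reference produced by slicing the full vector (standing in for a checkpoint save and reload). The helper names `intervals`, `shard`, and `reshard` are hypothetical.

```python
from typing import List, Tuple


def intervals(n: int, parts: int) -> List[Tuple[int, int]]:
    """Even half-open split of [0, n) into `parts` contiguous intervals."""
    assert n % parts == 0
    step = n // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]


def shard(full: List[float], parts: int) -> List[List[float]]:
    """Reference path: cut the full vector directly into `parts` shards."""
    return [full[a:b] for a, b in intervals(len(full), parts)]


def reshard(shards: List[List[float]], n: int, new_parts: int) -> List[List[float]]:
    """Online path: build each target shard only from overlapping source pieces."""
    src_iv = intervals(n, len(shards))
    out = []
    for d0, d1 in intervals(n, new_parts):
        buf = [0.0] * (d1 - d0)
        for (s0, s1), data in zip(src_iv, shards):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:
                # Copy the overlap from source-local to destination-local offsets.
                buf[lo - d0:hi - d0] = data[lo - s0:hi - s0]
        out.append(buf)
    return out


full = [i * 0.5 for i in range(12)]
online = reshard(shard(full, 4), n=12, new_parts=3)  # 4 shards -> 3 shards
reference = shard(full, 3)                           # stand-in for save-then-reload
match = online == reference                          # exact equality, no tolerance
```

The comparison is exact rather than tolerance-based because resharding only moves values and should not change them; a real test would also cover optimizer states and compare loss curves after the switch.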

Figures

Figures reproduced from arXiv: 2605.18815 by Boxun Li, Chunyang Zhu, Daning Cheng, Guohao Dai, Hao Lin, Junhao Hu, Quanlu Zhang, Yuanqing Wang, Yuchen Zhang, Yunquan Zhang, Yu Wang, Zhi Yang.

**Figure 2.** DynaTrain architecture overview

**Figure 3.** VPS mapping from a shared global view to a rank

**Figure 4.** State routing plan derived from VPS for rank 3

**Figure 6.** State routing plan for sharded optimizer states.

**Figure 7.** Single-dimension reconfiguration vs. Tenplex.

**Figure 8.** Single-dimension reconfiguration vs. MCP.

**Figure 9.** Job migration time compared with Tenplex.

**Figure 10.** Convergence under elastic resharding. From §6.5 (Correctness Validation via Convergence): the authors validate the numerical correctness of DynaTrain's resharding mechanism by comparing its loss convergence trajectories against a static baseline (without reconfiguration) training a LLaMA-2 13B model on 32 GPUs.

Original abstract

Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub-second, online reconfiguration across arbitrary multi-dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transition into manageable geometric intersections. On top of VPS, a state routing-and-transition layer executes rank-local transfers under a memory-aware, deadlock-free schedule, and an Elastic Device Manager overlaps new-world construction with ongoing training to mask topology-change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state-of-the-art checkpoint-based and elastic systems by up to three orders of magnitude while preserving correctness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DynaTrain gets parallelism switches down to seconds on big models by mapping states to a virtual coordinate space instead of full restarts. The core idea is the Virtual Parameter Space that turns any multi-dimensional config into a deterministic mapping, so transitions reduce to geometric intersections plus rank-local transfers. They schedule those transfers to avoid deadlocks and overlap new topology setup with ongoing training via the Elastic Device Manager. That combination is what lets them report under 2s for a 70B dense model and 4.36s for a 235B MoE while claiming correctness holds.

The speed numbers stand out against checkpoint-based baselines by orders of magnitude if the measurements are clean. The design looks practical for anyone who has to resize training jobs mid-run because of resource changes or RLHF stages.

The MoE case is the softer spot. Expert routing creates non-uniform parameter ownership, and it is not obvious that pure geometric intersections automatically handle the expert-to-rank remapping without a window of inconsistency. The abstract states correctness is preserved, but the paper would be stronger with explicit walk-throughs or checks on how routed parameters move during the transition.

Overall the work targets system builders who need elastic LLM training rather than theorists. Readers working on distributed frameworks or production clusters will see the most direct value from the scheduling and overlap techniques. The concrete timings and the stated problem make it worth a serious referee even if some edge cases around MoE need extra scrutiny. I would send it for review.

Referee Report

1 major / 2 minor

Summary. The paper presents DynaTrain, a distributed training system for sub-second online reconfiguration of multi-dimensional parallelism in LLM training. It introduces a Virtual Parameter Space (VPS) abstraction that unifies all distributed states under a single logical coordinate space, reducing transitions to geometric intersections. A state routing-and-transition layer performs rank-local transfers under a memory-aware deadlock-free schedule, while an Elastic Device Manager overlaps new-world construction with ongoing training. Experiments on dense and MoE models up to 235B parameters report reconfiguration of a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming checkpoint-based and elastic baselines by up to three orders of magnitude while preserving correctness.

Significance. If the VPS correctly maps and transitions all states including non-uniform expert routing in MoE models, the work offers a substantial advance for elastic LLM training by enabling rapid, low-overhead adaptation to resource changes. The reported speedups are large and the geometric-intersection approach is a distinctive technical contribution that could influence future system designs for dynamic parallelism.

major comments (1)

[§4.2] VPS transition procedure: The description of geometric intersections for state transfer does not explicitly address how expert-to-rank mappings and routing tables are updated for MoE models during the transition window; because expert parameters are routed rather than uniformly sharded, a coordinate-space intersection alone may not guarantee deterministic ownership transfer without additional routing-state reconciliation logic. This is load-bearing for the central correctness claim on 235B MoE models.

minor comments (2)

[Table 1] The baseline comparison columns would be clearer if they explicitly listed the parallelism dimensions (TP, PP, EP, DP) used for each system rather than only aggregate times.
[§3.1] The definition of the VPS coordinate mapping could include a small worked example for a 2D (TP, PP) to 3D (TP, PP, EP) transition to illustrate the intersection calculation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of the VPS abstraction for elastic LLM training. We address the single major comment below.

Referee: [§4.2] VPS transition procedure: The description of geometric intersections for state transfer does not explicitly address how expert-to-rank mappings and routing tables are updated for MoE models during the transition window; because expert parameters are routed rather than uniformly sharded, a coordinate-space intersection alone may not guarantee deterministic ownership transfer without additional routing-state reconciliation logic. This is load-bearing for the central correctness claim on 235B MoE models.

Authors: We agree that §4.2 would benefit from an explicit description of the MoE case. Under the VPS, each expert is assigned a unique coordinate tuple consisting of its expert index together with the standard sharding dimensions; the geometric intersection therefore operates over this augmented space and directly yields the set of expert shards that must move between ranks. The new expert-to-rank mapping is obtained by applying the target parallelism configuration's deterministic VPS-to-physical mapping function to the intersected coordinates. Routing tables are reconciled locally at each rank by (i) installing the newly received expert shards and (ii) atomically rewriting the expert-assignment table at the conclusion of the transition window, using the same memory-aware schedule already described for dense parameters. This logic is already exercised by the 235B MoE experiments, but we will add a short dedicated paragraph and accompanying pseudocode to §4.2 to make the reconciliation steps explicit.

Revision: yes
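The reconciliation steps in this simulated response can be sketched in a few lines. This is a hedged illustration, not the pseudocode the response promises: it keeps only the expert-index dimension of the coordinate tuple, assumes contiguous expert-parallel placement, and the names `expert_owner`, `expert_moves`, and `Router` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def expert_owner(num_experts: int, ep: int) -> Dict[int, int]:
    """Deterministic expert -> rank map for contiguous expert parallelism (assumed)."""
    assert num_experts % ep == 0
    per_rank = num_experts // ep
    return {e: e // per_rank for e in range(num_experts)}


def expert_moves(old: Dict[int, int], new: Dict[int, int]) -> List[Tuple[int, int, int]]:
    """(expert, src_rank, dst_rank) for every expert whose owner changes."""
    return [(e, old[e], new[e]) for e in sorted(old) if old[e] != new[e]]


class Router:
    """Holds the expert-assignment table consulted when dispatching tokens."""

    def __init__(self, table: Dict[int, int]) -> None:
        self.table = dict(table)

    def commit(self, new_table: Dict[int, int],
               moves: List[Tuple[int, int, int]], received: Set[int]) -> None:
        # Refuse to switch until every moved expert has been installed.
        missing = [e for e, _, _ in moves if e not in received]
        if missing:
            raise RuntimeError(f"experts not yet transferred: {missing}")
        # Single rebinding: readers see either the old table or the new one.
        self.table = dict(new_table)


# Hypothetical transition: 4 experts go from EP=2 to EP=4.
old = expert_owner(4, ep=2)
new = expert_owner(4, ep=4)
moves = expert_moves(old, new)
router = Router(old)
router.commit(new, moves, received={1, 2, 3})
```

Deferring the table swap until all moved experts have arrived is what closes the window of inconsistency the referee asks about, at least in this simplified model. A real system would also have to coordinate the swap across ranks at a step boundary.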

Circularity Check

0 steps flagged

No circularity in DynaTrain derivation: VPS and transitions rest on system design and measurements

The paper introduces the Virtual Parameter Space (VPS) as a novel abstraction that maps parallelism configurations to deterministic coordinate spaces and reduces transitions to geometric intersections. All performance claims (reconfiguration in under 2s for a 70B dense model and 4.36s for a 235B MoE model, speedups of up to three orders of magnitude) are grounded in the described implementation (state routing layer, Elastic Device Manager) and empirical results rather than any equation, fitted parameter, or self-referential definition. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the provided text that would collapse the central claims back to their inputs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only; VPS is the primary invented abstraction. No explicit free parameters or standard axioms are detailed.

invented entities (1)

Virtual Parameter Space (VPS) · no independent evidence
Purpose: Unifies all distributed training states under one logical coordinate space for deterministic mapping of parallelism configurations.
Core abstraction introduced to collapse complex transitions into geometric intersections.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transition into manageable geometric intersections."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "the planner maps parameters... by comparing the logical coordinate intersections between the source (VPS-1) and destination (VPS-2) layouts."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1] Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, and Nipun Kwatra. Varuna: Scalable, low-cost training of massive deep learning models. In Seventeenth European Conference on Computer Systems (EuroSys ’22), pages 472–487. Association for Computing Machinery, 2022.
[2] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025.
[3] Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. Enabling parallelism hot switching for efficient training of large language models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP ’24, pages 178–194, New York, NY, USA, 2024. Association for Comp...
[4] Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. Oobleck: Resilient distributed training of large models using pipeline templates. In ACM SIGOPS 29th Symposium of Operating Systems and Principles (SOSP ’23), 2023.
[5] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J... MegaScale: Scaling large language model training to more than 10,000 GPUs. 2024.
[6] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
[7] Wagenlander Marcel, Li Guo, Zhao Bo, Mai Luo, and Pietzuch Peter. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024.
[8] Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 2021.
[9] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of SC, pages 1–16, 2020.
[10] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of SIGKDD, pages 3505–3506, 2020.
[11] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[12] John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. Bamboo: Making preemptible instances resilient for affordable training of large DNNs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 497–513, Boston, MA, April 2023. USENIX Association.
[13] Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, and Chuan Wu. ByteCheckpoint: A unified checkpointing system for large foundation model development. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 559–578, Philadelphia, PA, Apri...
[14] Wikimedia Foundation. Wikimedia downloads. https://dumps.wikimedia.org, 2024.
[15] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, Carlsbad, CA, Octo...
[16] Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. Antman: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 533–548. USENIX Association, 2020.
[17] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
[18] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, and Xin Jin. Optimizing RLHF training for large language models with stage fusion. In Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI ’25, USA, 2025. USENIX Association.
