Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

Chen Tian; Guanghuan Fang; Haoran Xia; Hengxi Xu; Jian He; Jingyi Zhang; Junhe Lu; Peng Jiang; Qianyu Jiang; Rong Gu

arxiv: 2508.21613 · v4 · submitted 2025-08-29 · 💻 cs.DC


Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian


Pith reviewed 2026-05-18 20:36 UTC · model grok-4.3

classification 💻 cs.DC

keywords fault tolerance · distributed training · large language models · recovery strategies · performance modeling · adaptive systems · cluster computing · real-time selection


The pith

Chameleon selects optimal recovery strategies in real time to keep distributed LLM training within 11% of failure-free performance after faults.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Chameleon as a system that picks the best way to recover from faults during distributed training of large language models. Current backup-free methods each come with drawbacks like constant extra work, slow restarts, or slower running after recovery. Chameleon builds a single model to estimate how different recovery options will perform, then quickly searches for the fastest one and applies communication tweaks to make it efficient. On a 32-card cluster it keeps the speed drop small, preserves how well the model learns, and runs faster on average than earlier approaches.

Core claim

Chameleon achieves adaptive fault tolerance by combining a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations to select and apply recovery strategies in real time. On a 32-card cluster, post-recovery training stays within 11.00% of failure-free performance while preserving model convergence and efficient memory usage, and average throughput is up to 1.229x and 1.355x higher than Oobleck and Recycle, respectively.

What carries the argument

Unified performance model that estimates and compares recovery options in real time to choose the fastest feasible strategy without large added cost.
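
A minimal sketch of what such a selection step could look like, assuming each candidate strategy is summarized by a one-off recovery cost and an estimated post-recovery step time. The `Candidate` type, the strategy numbers, the `remaining_steps` horizon, and the linear cost formula are all illustrative assumptions, not values or equations taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one recovery option (not the paper's model)."""
    name: str
    recovery_s: float      # one-off time to reconfigure after the fault
    step_s: float          # estimated step time once training resumes
    feasible: bool = True  # e.g. False if the plan would not fit in memory


def estimated_total(c: Candidate, remaining_steps: int) -> float:
    # Assumed cost: pay the recovery once, then the new step time per step.
    return c.recovery_s + remaining_steps * c.step_s


def select_strategy(candidates, remaining_steps):
    """Pick the feasible candidate with the lowest estimated total time."""
    feasible = [c for c in candidates if c.feasible]
    if not feasible:
        raise ValueError("no feasible recovery strategy")
    return min(feasible, key=lambda c: estimated_total(c, remaining_steps))


# Made-up numbers, chosen only to show that the best choice can flip.
candidates = [
    Candidate("data_rerouting", recovery_s=5.0, step_s=1.30),
    Candidate("dynamic_parallelism", recovery_s=120.0, step_s=1.10),
    Candidate("redundant_computation", recovery_s=1.0, step_s=1.50, feasible=False),
]

short_run = select_strategy(candidates, remaining_steps=100)
long_run = select_strategy(candidates, remaining_steps=10_000)
print(short_run.name, long_run.name)
```

With these invented numbers, a cheap-to-apply but slower option wins over a short horizon, while a costly reconfiguration pays off over a long one. That trade-off is the kind a real-time selector would have to weigh.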

If this is right

Recovery can switch between strategies like redundant computation or data rerouting based on the specific failure to minimize time lost.
Training continues with little extra memory pressure and without harming final model accuracy.
Overall cluster throughput improves compared with fixed recovery methods used by prior systems.
Frequent faults no longer force long pauses or major slowdowns in large-scale runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same real-time estimation approach might help other distributed workloads that face random interruptions, such as scientific simulations on clusters.
If the model generalizes across hardware, it could let operators run training jobs with less manual setup for fault handling.
Testing the method on clusters larger than 32 cards or with different model architectures would show whether the gains scale.

Load-bearing premise

The performance estimates and search process can pick the right recovery option quickly and accurately enough that they do not create meaningful extra slowdown or wrong choices.

What would settle it

Measuring a slowdown larger than 11% after recovery or lower average throughput than Oobleck and Recycle on the same 32-card cluster with comparable faults would disprove the performance results.
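
The check described above amounts to two ratios. The sketch below assumes the performance gap is defined as relative throughput loss; the paper may define it differently (for example via step time). The sample throughput figures are invented.

```python
def performance_gap(failure_free_tput: float, post_recovery_tput: float) -> float:
    """Relative throughput loss after recovery, as a fraction (assumed definition)."""
    if failure_free_tput <= 0:
        raise ValueError("failure-free throughput must be positive")
    return (failure_free_tput - post_recovery_tput) / failure_free_tput


def speedup(system_tput: float, baseline_tput: float) -> float:
    """Average-throughput ratio of one system over a baseline."""
    if baseline_tput <= 0:
        raise ValueError("baseline throughput must be positive")
    return system_tput / baseline_tput


def gap_claim_holds(failure_free_tput, post_recovery_tput, max_gap=0.11):
    # The 11% bound is the figure quoted in the abstract.
    return performance_gap(failure_free_tput, post_recovery_tput) <= max_gap


# Invented measurements in samples per second.
print(gap_claim_holds(100.0, 90.0))   # 10% loss
print(gap_claim_holds(100.0, 85.0))   # 15% loss
print(speedup(120.0, 100.0) > 1.0)
```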

Figures

Figures reproduced from arXiv: 2508.21613 by Chen Tian, Guanghuan Fang, Haoran Xia, Hengxi Xu, Jian He, Jingyi Zhang, Junhe Lu, Peng Jiang, Qianyu Jiang, Rong Gu, Xinjing Huang, Yongjin Cai, Yuhang Zhou, Zhibin Wang, Zhiheng Hu.

**Figure 1.** The overall workflow of Odyssey. • How do we minimize the step time after recovery t_{step, S_{i+1}}? (§IV-B) The t_{step, S_{i+1}} consists of both pipeline computation time and synchronization communication time. While the former can be estimated by the estimator, the latter, especially under asymmetric parallelism, can be optimized as a graph coloring problem. • How do we estimate the step time after recovery t_{step,…

**Figure 3.** Optimization of weight transfer. … the two, resulting in varying amounts of weight data that need to be transferred. Assuming the number of remaining nodes is N, we can construct an N × N cost matrix Cost based on the different layer distributions, where Cost[i][j] represents the cost for node i to migrate to the j-th node under the new plan. For example, if the first node in DP2 corresponds to the first nod…

**Figure 4.** The asymmetric DP gradient update communication.

**Figure 5.** Execution of different pipeline scenarios.

**Figure 6.** Real-world training results of Odyssey. (Axes: Time (hours) vs. Number of NPUs; series: NPU, NPU Average.)

**Figure 10.** Impact of weight transfer optimization. (Axes: Batch Size (16, 32, 64) vs. Time (s); series: Original, W/O Optimization, With Optimization, Communication; bar labels as extracted: 1.69, 2.36, 2.09, 2.98, 4.08, 3.45, 5.02, 6.31, 5.98.)

**Figure 12.** Memory analysis of Odyssey. (Extracted plot labels: Steps vs. Loss; series: Baseline, Ours, Loss=0.1.)
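
The Figure 3 excerpt describes an N × N cost matrix where Cost[i][j] is the cost for node i to take the j-th slot under the new plan. Choosing one slot per node at minimum total cost is an assignment problem. The sketch below solves a tiny instance by brute force so that it stays dependency-free; it is not the paper's procedure, and the cost values are made up. The paper's reference list includes the Hungarian method, which is the usual polynomial-time alternative at realistic scale.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, mapping) minimizing sum(cost[i][mapping[i]]).

    Brute force over all permutations: fine for a handful of nodes,
    O(N!) in general, so only suitable as an illustration.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("cost matrix must be square")
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical costs: cost[i][j] = weight data node i must fetch to fill slot j.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, mapping = best_assignment(cost)
print(total, mapping)  # prints: 5 [1, 0, 2]
```

Here node 0 takes slot 1, node 1 takes slot 0, and node 2 stays in slot 2, for a total transfer cost of 5. Keeping every node in its old slot would cost 6.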

read the original abstract

Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon maintains a performance gap of within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Chameleon combines a unified model with real-time search to pick recovery strategies on failure, and the 32-card results look decent on paper, but the lack of isolated checks on model error and search cost leaves the gains hard to trust fully.


The main point is that this system tries to handle faults in distributed training by switching between recovery options like redundant compute or rerouting based on a quick performance estimate rather than locking into one static method. If the estimates hold, it keeps the slowdown after a failure to about 11% on their 32-card runs and beats the baselines on throughput. That addresses a real pain point in long LLM jobs where failures happen often and fixed approaches waste time or resources.

What is new is the real-time selection loop that pulls together a single performance model, a fast plan search, and some communication tweaks to decide on the spot. Earlier work tended to pick one technique ahead of time and live with its overhead, so the adaptive angle is a step forward. The experiments also show they kept model convergence and memory use in check, which is useful to see.

The soft spot is exactly the one the stress test flags: there is no clear breakdown of how accurate the performance predictions were against actual runs or how much the search itself added to latency. Without those numbers, it is difficult to know whether the reported speedups would survive when the model is off or the search takes longer on bigger setups. The 32-card scale is also modest, so extrapolation to hundreds of cards is still open.

This paper is for systems people who build or tune large training clusters and want ideas on making recovery less painful. A reader working on fault tolerance or scheduling would find the mechanism worth looking at even if they end up testing the model themselves. I would send it for peer review because the problem is practical, the approach is distinct from prior static methods, and referees can push for the missing measurements on overhead and accuracy.

Referee Report

2 major / 2 minor

Summary. The paper proposes Chameleon, an adaptive fault-tolerant system for distributed LLM training that selects optimal recovery strategies in real time upon failures. It relies on a unified performance model, expedient execution plan search, accurate performance estimation, and communication optimizations to avoid the penalties of existing backup-free methods such as redundant computation or dynamic parallelism. Experiments on a 32-card cluster report that Chameleon maintains post-recovery performance within 11% of failure-free training while preserving convergence and memory usage, and achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.

Significance. If the performance model predictions prove accurate and search overhead remains negligible, this approach could meaningfully advance fault tolerance for large-scale distributed training by enabling low-penalty, adaptive recovery without fixed overheads or lengthy reconfigurations. The concrete throughput comparisons to prior systems provide a useful baseline for evaluating such adaptive mechanisms in practice.

major comments (2)

[Experiments (abstract and §5)] The central claim that the unified performance model plus expedient search selects optimal strategies without eroding gains rests on unvalidated accuracy and overhead. The 32-card experiments report the 11% gap and 1.229x/1.355x throughput improvements but do not isolate or bound estimation errors or plan-search latency against ground-truth measurements, leaving open whether these factors were measured or could have affected the results relative to Oobleck and Recycle.
[§5] Experimental setup details are insufficient to support the quantitative claims. The abstract and results lack information on model sizes, failure injection methodology, number of runs for statistical significance, or potential confounding factors such as network variability, which directly impacts verification of the reported performance gap and throughput advantages.

minor comments (2)

[§4] Notation for the unified performance model components could be clarified with a summary table or diagram early in the paper to aid readers in following the real-time selection logic.
[Abstract] The abstract would benefit from briefly stating the fault models considered (e.g., node failures, link failures) to set expectations for the recovery strategies evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We have carefully addressed each major comment below and will incorporate revisions to strengthen the experimental validation and reproducibility of the results.


Referee: [Experiments (abstract and §5)] The central claim that the unified performance model plus expedient search selects optimal strategies without eroding gains rests on unvalidated accuracy and overhead. The 32-card experiments report the 11% gap and 1.229x/1.355x throughput improvements but do not isolate or bound estimation errors or plan-search latency against ground-truth measurements, leaving open whether these factors were measured or could have affected the results relative to Oobleck and Recycle.

Authors: We acknowledge the value of isolating the performance model's estimation accuracy and plan-search latency to more rigorously validate that these components do not erode the reported gains. The original experiments emphasize end-to-end throughput and the 11% performance gap to demonstrate practical benefits under realistic conditions. To directly address this point, we will add a dedicated analysis in the revised §5 that compares model-predicted execution times against measured ground-truth values across multiple recovery scenarios and reports the measured latency of the expedient search procedure. This will explicitly bound estimation errors and search overhead relative to the throughput improvements over Oobleck and Recycle.

revision: yes

Referee: [§5] Experimental setup details are insufficient to support the quantitative claims. The abstract and results lack information on model sizes, failure injection methodology, number of runs for statistical significance, or potential confounding factors such as network variability, which directly impacts verification of the reported performance gap and throughput advantages.

Authors: We agree that expanded experimental setup details will improve clarity and reproducibility. While §5 of the manuscript contains core configuration information, we will revise it to explicitly specify the model sizes and architectures evaluated, provide a precise description of the failure injection methodology (including timing and types of faults), report the number of independent runs performed for statistical significance, and discuss potential confounding factors such as network variability with the controls applied during measurements. Corresponding updates will be made to the abstract for consistency where appropriate.

revision: yes
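
The comparison of predicted against measured execution times that the first response promises could be summarized as follows. This is a sketch under assumptions: the error metric (absolute relative error against the measured value) and the sample timings are illustrative choices, not taken from the paper or the rebuttal.

```python
def relative_errors(predicted, measured):
    """Absolute relative error of each prediction against its measurement."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured times must be positive")
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def summarize(predicted, measured):
    """Mean and worst-case relative error across recovery scenarios."""
    errs = relative_errors(predicted, measured)
    if not errs:
        raise ValueError("need at least one scenario")
    return {"mean": sum(errs) / len(errs), "max": max(errs)}


# Invented step times in seconds for three recovery scenarios.
predicted = [1.0, 2.0, 4.0]
measured = [1.25, 2.0, 3.2]
summary = summarize(predicted, measured)
print({k: round(v, 3) for k, v in summary.items()})
```

Reporting the worst case alongside the mean matters here because a single badly mispredicted scenario is enough to make the selector pick the wrong strategy.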

Circularity Check

0 steps flagged

No circularity: empirical system validated by external comparisons


The paper describes an adaptive fault-tolerance system whose core claims rest on a unified performance model plus real-time search and optimizations, with results measured directly against failure-free baselines and prior systems (Oobleck, Recycle) on a 32-card cluster. No equations or derivations reduce a reported quantity (e.g., the 11% gap or 1.229×/1.355× throughput) to a fitted parameter or self-citation by construction; the model is presented as an engineering tool whose accuracy is assessed via end-to-end experiments rather than assumed. Self-citations, if present, are not load-bearing for the central performance claims. This is a standard empirical systems paper whose derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Since only the abstract is available, specific free parameters, axioms, or invented entities are not detailed in the provided information. The approach relies on standard assumptions in distributed systems such as fault occurrence and performance predictability.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery / Peano structure · unclear

Algorithm 1: Search for the Best Execution Plan (integer partition, distribute batch, split layers, time estimator)
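
Only the four step names of Algorithm 1 are visible here (integer partition, distribute batch, split layers, time estimator). The sketch below is one hypothetical way those steps could fit together, not the paper's algorithm. It partitions the surviving nodes into pipelines, shares micro-batches in proportion to pipeline size, splits layers evenly across stages, and scores each plan with a GPipe-style estimate. The memory limit, the time formula, and all numbers are assumptions.

```python
def integer_partitions(n, max_part=None):
    """Yield partitions of n as non-increasing tuples, e.g. 3 -> (3,), (2,1), (1,1,1)."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


def distribute_batch(num_microbatches, pipeline_sizes):
    """Share micro-batches roughly in proportion to pipeline size."""
    total = sum(pipeline_sizes)
    shares = [num_microbatches * s // total for s in pipeline_sizes]
    leftover = num_microbatches - sum(shares)
    for i in range(leftover):  # remainder goes to the largest pipelines first
        shares[i % len(shares)] += 1
    return shares


def split_layers(num_layers, num_stages):
    """Split layers across stages as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    return [base + (1 if i < extra else 0) for i in range(num_stages)]


def estimate_pipeline_time(microbatches, stage_layers, t_layer):
    """Assumed GPipe-style estimate: (m + p - 1) slots of the slowest stage."""
    if microbatches == 0:
        return 0.0
    return (microbatches + len(stage_layers) - 1) * max(stage_layers) * t_layer


def search_best_plan(num_nodes, num_layers, num_microbatches, t_layer, max_layers_per_node):
    """Return (step_time, pipeline_sizes, microbatch_shares) of the fastest feasible plan."""
    best = None
    for sizes in integer_partitions(num_nodes):
        layouts = [split_layers(num_layers, p) for p in sizes]
        # Hypothetical feasibility rule: no empty stage, no stage over the memory limit.
        if any(min(l) == 0 or max(l) > max_layers_per_node for l in layouts):
            continue
        shares = distribute_batch(num_microbatches, sizes)
        # Pipelines synchronize each step, so the slowest one sets the step time.
        step = max(estimate_pipeline_time(m, l, t_layer) for m, l in zip(shares, layouts))
        if best is None or step < best[0]:
            best = (step, sizes, shares)
    return best


# Five surviving nodes, 8 layers, 8 micro-batches, at most 4 layers per node.
best = search_best_plan(5, 8, 8, t_layer=1.0, max_layers_per_node=4)
print(best)  # prints: (21.0, (3, 2), [5, 3])
```

In this toy setting the winner is an asymmetric plan: one three-stage pipeline and one two-stage pipeline beat a single five-stage pipeline (24.0), because the shorter pipelines spend fewer slots filling and draining.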

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 9 internal anchors

[1] A. Grattafiori, A. Dubey, A. Jauhri et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[3] Z. Wang, Z. Jia, S. Zheng, Z. Zhang et al., “Gemini: Fast failure recovery in distributed training with in-memory checkpoints,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 364–381. [Online]. Available: https://doi.org/10.1145/3600006.3613145
[4] A. Eisenman, K. K. Matam, S. Ingram et al., “Check-N-Run: a checkpointing system for training deep learning recommendation models,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA: USENIX Association, Apr. 2022, pp. 929–943. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/eisenman
[5] Z. Jiang, H. Lin, Y. Zhong, Q. Huang et al., “Megascale: Scaling large language model training to more than 10,000 GPUs,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15627
[6] L. Xie, J. Zhai, B. Wu, Y. Wang et al., “Elan: Towards generic and efficient elastic training for deep learning,” in 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020, pp. 78–88.
[7] J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang, R. Netravali, and G. H. Xu, “Bamboo: Making preemptible instances resilient for affordable training of large DNNs,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 497–513.
[8] I. Jang, Z. Yang, Z. Zhang, X. Jin, and M. Chowdhury, “Oobleck: Resilient distributed training of large models using pipeline templates,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 382–395.
[9] S. Gandhi, M. Zhao, A. Skiadopoulos, and C. Kozyrakis, “Recycle: Resilient training of large DNNs using pipeline adaptation,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 211–228.
[10] M. Wagenländer, G. Li, B. Zhao, L. Mai, and P. Pietzuch, “Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 195–210.
[11] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, and Y. Hu, “Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 789–801.
[12] B. Wróblewski, G. Gottardo, and A. Zouzias, “Parallel scan on Ascend AI accelerators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.15112
[13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis et al., “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” 2018. [Online]. Available: https://arxiv.org/abs/1706.02677
[14] A. Sergeev and M. D. Balso, “Horovod: fast and easy distributed deep learning in TensorFlow,” 2018. [Online]. Available: https://arxiv.org/abs/1802.05799
[15] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet training in minutes,” 2018. [Online]. Available: https://arxiv.org/abs/1709.05011
[16] S. Smith, M. Patwary, B. Norick, P. LeGresley et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” 2022. [Online]. Available: https://arxiv.org/abs/2201.11990
[17] Y. Huang, Y. Cheng, A. Bapna et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” 2019. [Online]. Available: https://arxiv.org/abs/1811.06965
[18] T. Kim, H. Kim, G.-I. Yu, and B.-G. Chun, “BPipe: Memory-balanced pipeline parallelism for training large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, ...
[19] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” 2020. [Online]. Available: https://arxiv.org/abs/1910.02054
[20] W. Cai, J. Jiang, L. Qin et al., “Shortcut-connected expert parallelism for accelerating mixture-of-experts,” 2025. [Online]. Available: https://arxiv.org/abs/2404.05019
[21] Y. Qian, F. Li, X. Ji et al., “EPS-MoE: Expert pipeline scheduler for cost-efficient MoE inference,” 2025. [Online]. Available: https://arxiv.org/abs/2410.12247
[22] D. Liu, Z. Yan, X. Yao, T. Liu et al., “MoE parallel folding: Heterogeneous parallelism mappings for efficient large-scale MoE model training with Megatron Core,” 2025. [Online]. Available: https://arxiv.org/abs/2504.14960
[23] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 1, Jan. 2022.
[24] C. Hwang, W. Cui, Y. Xiong et al., “Tutel: Adaptive mixture-of-experts at scale,” 2023. [Online]. Available: https://arxiv.org/abs/2206.03382
[25] C. Jin, Z. Jiang, Z. Bai et al., “Megascale-MoE: Large-scale communication-efficient training of mixture-of-experts models in production,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11432
[26] W. Li, X. Liu, Y. Li et al., “Understanding communication characteristics of distributed training,” in Proceedings of the 8th Asia-Pacific Workshop on Networking, ser. APNet ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 1–8. [Online]. Available: https://doi.org/10.1145/3663408.3663409
[27] D. Moolchandani, J. Kundu et al., “Amped: An analytical model for performance in distributed training of transformers,” in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 306–315.
[28] V. Korthikanti, J. Casper, S. Lym, L. McAfee et al., “Reducing activation recomputation in large transformer models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.05198
[29] S. Li and T. Hoefler, “Chimera: efficiently training large-scale neural networks with bidirectional pipelines,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21. ACM, Nov. 2021, p. 1–14. [Online]. Available: http://dx.doi.org/10.1145/3458817.3476145
[30] S. Li, Y. Zhao, R. Varma, O. Salpekar et al., “PyTorch distributed: experiences on accelerating data parallel training,” Proc. VLDB Endow., vol. 13, no. 12, p. 3005–3018, Aug. 2020. [Online]. Available: https://doi.org/10.14778/3415478.3415530
[31] S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: scalable, low-cost training of massive deep learning models,” in Proceedings of the Seventeenth European Conference on Computer Systems, 2022, pp. 472–487.
[32] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, “Failures in large scale systems: Long-term measurement, analysis, and implications,” in SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[33] Q. Weng, W. Xiao, Y. Yu, W. Wang et al., “MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA: USENIX Association, Apr. 2022, pp. 945–960. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/weng
[34] Y. Deng, X. Shi, Z. Jiang, X. Zhang et al., “Minder: Faulty machine detection for large-scale distributed model training,” in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). Philadelphia, PA: USENIX Association, Apr. 2025, pp. 505–521. [Online]. Available: https://www.usenix.org/conference/nsdi25/presentation/deng
[35] L. Zheng, Z. Li, H. Zhang, Y. Zhuang et al., “Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 559–578. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
[36] H. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistic Quarterly, vol. 2, 05 2012.
[37] H. Touvron, L. Martin, K. Stone, P. Albert et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.09288

[1] [1]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhriet al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

URLhttps://doi.org/10.1145/3600006.3613145

Z. Wang, Z. Jia, S. Zheng, Z. Zhanget al., “Gemini: Fast failure recovery in distributed training with in-memory checkpoints,” inProceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 364–381. [Online]. Available: https://doi.org/10.1145/3600006.3613145

work page doi:10.1145/3600006.3613145 2023

[4] [4]

Check- N-Run: a checkpointing system for training deep learning recommendation models,

A. Eisenman, K. K. Matam, S. Ingramet al., “Check- N-Run: a checkpointing system for training deep learning recommendation models,” in19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, W A: USENIX Association, Apr. 2022, pp. 929–943. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/eisenman

work page 2022

[5] [5]

Megascale: Scaling large language model training to more than 10,000 gpus,

Z. Jiang, H. Lin, Y . Zhong, Q. Huanget al., “Megascale: Scaling large language model training to more than 10,000 gpus,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15627

work page arXiv 2024

[6] [6]

Elan: Towards generic and efficient elastic training for deep learning,

L. Xie, J. Zhai, B. Wu, Y . Wanget al., “Elan: Towards generic and efficient elastic training for deep learning,” in2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020, pp. 78–88

work page 2020

[7] [7]

Bamboo: Making preemptible instances resilient for affordable training of large{DNNs},

J. Thorpe, P. Zhao, J. Eyolfson, Y . Qiao, Z. Jia, M. Zhang, R. Netravali, and G. H. Xu, “Bamboo: Making preemptible instances resilient for affordable training of large{DNNs},” in20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 497–513

work page 2023

[8] [8]

Oobleck: Resilient distributed training of large models using pipeline templates,

I. Jang, Z. Yang, Z. Zhang, X. Jin, and M. Chowdhury, “Oobleck: Resilient distributed training of large models using pipeline templates,” inProceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 382–395

work page 2023

[9] [9]

Recycle: Resilient training of large dnns using pipeline adaptation,

S. Gandhi, M. Zhao, A. Skiadopoulos, and C. Kozyrakis, “Recycle: Resilient training of large dnns using pipeline adaptation,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 211–228

work page 2024

[10] [10]

Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections,

M. Wagenländer, G. Li, B. Zhao, L. Mai, and P. Pietzuch, “Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 195–210

work page 2024

[11] [11]

Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper,

H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, and Y. Hu, “Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 789–801

work page 2021

[12] [12]

Parallel scan on ascend ai accelerators,

B. Wróblewski, G. Gottardo, and A. Zouzias, “Parallel scan on ascend ai accelerators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.15112

work page arXiv 2025

[13] [13]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis et al., “Accurate, large minibatch sgd: Training imagenet in 1 hour,” 2018. [Online]. Available: https://arxiv.org/abs/1706.02677

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Horovod: fast and easy distributed deep learning in TensorFlow

A. Sergeev and M. D. Balso, “Horovod: fast and easy distributed deep learning in tensorflow,” 2018. [Online]. Available: https://arxiv.org/abs/1802.05799

work page internal anchor Pith review Pith/arXiv arXiv 2018

[15] [15]

ImageNet Training in Minutes

Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “Imagenet training in minutes,” 2018. [Online]. Available: https://arxiv.org/abs/1709.05011

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

S. Smith, M. Patwary, B. Norick, P. LeGresley et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” 2022. [Online]. Available: https://arxiv.org/abs/2201.11990

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Y. Huang, Y. Cheng, A. Bapna et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” 2019. [Online]. Available: https://arxiv.org/abs/1811.06965

work page internal anchor Pith review Pith/arXiv arXiv 2019

[18] [18]

BPipe: Memory-balanced pipeline parallelism for training large language models,

T. Kim, H. Kim, G.-I. Yu, and B.-G. Chun, “BPipe: Memory-balanced pipeline parallelism for training large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, ...

work page 2023

[19] [19]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” 2020. [Online]. Available: https://arxiv.org/abs/1910.02054

work page internal anchor Pith review Pith/arXiv arXiv 2020

[20] [20]

Shortcut-connected expert parallelism for accelerating mixture-of-experts,

W. Cai, J. Jiang, L. Qin et al., “Shortcut-connected expert parallelism for accelerating mixture-of-experts,” 2025. [Online]. Available: https://arxiv.org/abs/2404.05019

work page arXiv 2025

[21] [21]

Eps-moe: Expert pipeline scheduler for cost-efficient moe inference,

Y. Qian, F. Li, X. Ji et al., “Eps-moe: Expert pipeline scheduler for cost-efficient moe inference,” 2025. [Online]. Available: https://arxiv.org/abs/2410.12247

work page arXiv 2025

[22] [22]

Moe parallel folding: Heterogeneous parallelism mappings for efficient large-scale moe model training with megatron core,

D. Liu, Z. Yan, X. Yao, T. Liu et al., “Moe parallel folding: Heterogeneous parallelism mappings for efficient large-scale moe model training with megatron core,” 2025. [Online]. Available: https://arxiv.org/abs/2504.14960

work page arXiv 2025

[23] [23]

Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 1, Jan. 2022

work page 2022

[24] [24]

Tutel: Adaptive mixture-of-experts at scale,

C. Hwang, W. Cui, Y. Xiong et al., “Tutel: Adaptive mixture-of-experts at scale,” 2023. [Online]. Available: https://arxiv.org/abs/2206.03382

work page arXiv 2023

[25] [25]

Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production,

C. Jin, Z. Jiang, Z. Bai et al., “Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11432

work page arXiv 2025

[26] [26]

Understanding communication characteristics of distributed training,

W. Li, X. Liu, Y. Li et al., “Understanding communication characteristics of distributed training,” in Proceedings of the 8th Asia-Pacific Workshop on Networking, ser. APNet ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3663408.3663409

work page doi:10.1145/3663408.3663409 2024

[27] [27]

Amped: An analytical model for performance in distributed training of transformers,

D. Moolchandani, J. Kundu et al., “Amped: An analytical model for performance in distributed training of transformers,” in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 306–315

work page 2023

[28] [28]

Reducing activation recomputation in large transformer models,

V. Korthikanti, J. Casper, S. Lym, L. McAfee et al., “Reducing activation recomputation in large transformer models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.05198

work page arXiv 2022

[29] [29]

Chimera: efficiently training large-scale neural networks with bidirectional pipelines,

S. Li and T. Hoefler, “Chimera: efficiently training large-scale neural networks with bidirectional pipelines,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21. ACM, Nov. 2021, pp. 1–14. [Online]. Available: http://dx.doi.org/10.1145/3458817.3476145

work page doi:10.1145/3458817.3476145 2021

[30] [30]

Pytorch distributed: experiences on accelerating data parallel training,

S. Li, Y. Zhao, R. Varma, O. Salpekar et al., “Pytorch distributed: experiences on accelerating data parallel training,” Proc. VLDB Endow., vol. 13, no. 12, pp. 3005–3018, Aug. 2020. [Online]. Available: https://doi.org/10.14778/3415478.3415530

work page doi:10.14778/3415478.3415530 2020

[31] [31]

Varuna: scalable, low-cost training of massive deep learning models,

S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: scalable, low-cost training of massive deep learning models,” in Proceedings of the Seventeenth European Conference on Computer Systems, 2022, pp. 472–487

work page 2022

[32] [32]

Failures in large scale systems: Long-term measurement, analysis, and implications,

S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, “Failures in large scale systems: Long-term measurement, analysis, and implications,” in SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12

work page 2017

[33] [33]

MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters,

Q. Weng, W. Xiao, Y. Yu, W. Wang et al., “MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA: USENIX Association, Apr. 2022, pp. 945–960. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/weng

work page 2022

[34] [34]

Minder: Faulty machine detection for large-scale distributed model training,

Y. Deng, X. Shi, Z. Jiang, X. Zhang et al., “Minder: Faulty machine detection for large-scale distributed model training,” in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). Philadelphia, PA: USENIX Association, Apr. 2025, pp. 505–521. [Online]. Available: https://www.usenix.org/conference/nsdi25/presentation/deng

work page 2025

[35] [35]

Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning,

L. Zheng, Z. Li, H. Zhang, Y. Zhuang et al., “Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 559–578. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin

work page 2022

[36] [36]

The hungarian method for the assignment problem,

H. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, 05 2012

work page 2012

[37] [37]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023