Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
Pith reviewed 2026-05-18 20:36 UTC · model grok-4.3
The pith
Chameleon selects optimal recovery strategies in real time to keep distributed LLM training within 11% of failure-free performance after faults.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chameleon achieves adaptive fault tolerance by combining a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations, which together select and apply the best recovery strategy in real time. On a 32-card cluster, post-recovery training stays within 11.00% of failure-free performance while preserving model convergence and efficient memory usage, and average throughput is up to 1.229x and 1.355x higher than Oobleck and Recycle, respectively.
What carries the argument
Unified performance model that estimates and compares recovery options in real time to choose the fastest feasible strategy without large added cost.
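The selection step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `RecoveryOption` fields, the cost formula (a one-off reconfiguration time plus a per-iteration time over a planning horizon), the memory feasibility check, and all numbers are hypothetical and are not taken from the paper's performance model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryOption:
    """Hypothetical description of one candidate recovery strategy."""
    name: str
    reconfig_s: float   # assumed one-off time to switch to this strategy
    iter_s: float       # assumed estimated time per iteration afterwards
    peak_mem_gb: float  # assumed estimated peak memory per device


def estimated_cost(opt: RecoveryOption, horizon_iters: int) -> float:
    """Total estimated time over the planning horizon (illustrative formula)."""
    return opt.reconfig_s + opt.iter_s * horizon_iters


def select_strategy(
    options: Sequence[RecoveryOption],
    mem_limit_gb: float,
    horizon_iters: int,
) -> Optional[RecoveryOption]:
    """Return the cheapest option that fits in memory, or None if none fits."""
    feasible = [o for o in options if o.peak_mem_gb <= mem_limit_gb]
    if not feasible:
        return None
    return min(feasible, key=lambda o: estimated_cost(o, horizon_iters))


# Made-up numbers, chosen only to show how the choice can flip.
options = [
    RecoveryOption("redundant_computation", reconfig_s=0.0, iter_s=1.30, peak_mem_gb=58.0),
    RecoveryOption("dynamic_parallelism", reconfig_s=120.0, iter_s=1.10, peak_mem_gb=60.0),
    RecoveryOption("data_rerouting", reconfig_s=5.0, iter_s=1.20, peak_mem_gb=70.0),
]

if __name__ == "__main__":
    for horizon in (100, 1000):
        choice = select_strategy(options, mem_limit_gb=64.0, horizon_iters=horizon)
        print(horizon, choice.name if choice else None)
```

With these invented numbers, a short horizon favors the option with no reconfiguration cost, while a long horizon favors the option with the lowest per-iteration time. That trade-off is the reason a fixed strategy can lose to an adaptive one.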
If this is right
- Recovery can switch between strategies like redundant computation or data rerouting based on the specific failure to minimize time lost.
- Training continues with little extra memory pressure and without harming final model accuracy.
- Overall cluster throughput improves compared with fixed recovery methods used by prior systems.
- Frequent faults no longer force long pauses or major slowdowns in large-scale runs.
Where Pith is reading between the lines
- The same real-time estimation approach might help other distributed workloads that face random interruptions, such as scientific simulations on clusters.
- If the model generalizes across hardware, it could let operators run training jobs with less manual setup for fault handling.
- Testing the method on clusters larger than 32 cards or with different model architectures would show whether the gains scale.
Load-bearing premise
The performance estimates and search process can pick the right recovery option quickly and accurately enough that they do not create meaningful extra slowdown or wrong choices.
What would settle it
Measuring a slowdown larger than 11% after recovery or lower average throughput than Oobleck and Recycle on the same 32-card cluster with comparable faults would disprove the performance results.
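The check described above reduces to simple arithmetic, sketched below. The sketch assumes that "performance gap" means the relative throughput loss against failure-free training; the paper may define it differently. The function names and the 0.11 default are illustrative, not the authors' code.

```python
from typing import Sequence


def performance_gap(failure_free_tput: float, post_recovery_tput: float) -> float:
    """Relative throughput loss after recovery (assumed definition of the gap)."""
    if failure_free_tput <= 0:
        raise ValueError("failure-free throughput must be positive")
    return (failure_free_tput - post_recovery_tput) / failure_free_tput


def speedup(system_tput: float, baseline_tput: float) -> float:
    """Ratio of average throughputs, e.g. Chameleon over a baseline system."""
    if baseline_tput <= 0:
        raise ValueError("baseline throughput must be positive")
    return system_tput / baseline_tput


def claim_is_contradicted(
    failure_free_tput: float,
    post_recovery_tput: float,
    system_avg_tput: float,
    baseline_avg_tputs: Sequence[float],
    max_gap: float = 0.11,
) -> bool:
    """True if the gap exceeds the bound or any baseline is faster on average."""
    gap_too_large = performance_gap(failure_free_tput, post_recovery_tput) > max_gap
    slower = any(speedup(system_avg_tput, b) < 1.0 for b in baseline_avg_tputs)
    return gap_too_large or slower
```

All throughputs must be measured on the same cluster under comparable fault traces for the comparison to mean anything.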
Original abstract
Training large language models faces frequent interruptions due to various faults, demanding robust fault-tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon maintains a performance gap of within 11.00% between post-recovery and failure-free training, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Chameleon, an adaptive fault-tolerant system for distributed LLM training that selects optimal recovery strategies in real time upon failures. It relies on a unified performance model, expedient execution plan search, accurate performance estimation, and communication optimizations to avoid the penalties of existing backup-free methods such as redundant computation or dynamic parallelism. Experiments on a 32-card cluster report that Chameleon maintains post-recovery performance within 11% of failure-free training while preserving convergence and memory usage, and achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
Significance. If the performance model predictions prove accurate and search overhead remains negligible, this approach could meaningfully advance fault tolerance for large-scale distributed training by enabling low-penalty, adaptive recovery without fixed overheads or lengthy reconfigurations. The concrete throughput comparisons to prior systems provide a useful baseline for evaluating such adaptive mechanisms in practice.
major comments (2)
- [Experiments (abstract and §5)] The central claim that the unified performance model plus expedient search selects optimal strategies without eroding gains rests on unvalidated accuracy and overhead. The 32-card experiments report the 11% gap and 1.229x/1.355x throughput improvements but do not isolate or bound estimation errors or plan-search latency against ground-truth measurements, leaving open whether these factors were measured or could have affected the results relative to Oobleck and Recycle.
- [§5] Experimental setup details are insufficient to support the quantitative claims. The abstract and results lack information on model sizes, failure injection methodology, number of runs for statistical significance, or potential confounding factors such as network variability, which directly impacts verification of the reported performance gap and throughput advantages.
minor comments (2)
- [§4] Notation for the unified performance model components could be clarified with a summary table or diagram early in the paper to aid readers in following the real-time selection logic.
- [Abstract] The abstract would benefit from briefly stating the fault models considered (e.g., node failures, link failures) to set expectations for the recovery strategies evaluated.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. We have carefully addressed each major comment below and will incorporate revisions to strengthen the experimental validation and reproducibility of the results.
Point-by-point responses
- Referee: [Experiments (abstract and §5)] The central claim that the unified performance model plus expedient search selects optimal strategies without eroding gains rests on unvalidated accuracy and overhead. The 32-card experiments report the 11% gap and 1.229x/1.355x throughput improvements but do not isolate or bound estimation errors or plan-search latency against ground-truth measurements, leaving open whether these factors were measured or could have affected the results relative to Oobleck and Recycle.
  Authors: We acknowledge the value of isolating the performance model's estimation accuracy and plan-search latency to more rigorously validate that these components do not erode the reported gains. The original experiments emphasize end-to-end throughput and the 11% performance gap to demonstrate practical benefits under realistic conditions. To directly address this point, we will add a dedicated analysis in the revised §5 that compares model-predicted execution times against measured ground-truth values across multiple recovery scenarios and reports the measured latency of the expedient search procedure. This will explicitly bound estimation errors and search overhead relative to the throughput improvements over Oobleck and Recycle.
  Revision: yes
- Referee: [§5] Experimental setup details are insufficient to support the quantitative claims. The abstract and results lack information on model sizes, failure injection methodology, number of runs for statistical significance, or potential confounding factors such as network variability, which directly impacts verification of the reported performance gap and throughput advantages.
  Authors: We agree that expanded experimental setup details will improve clarity and reproducibility. While §5 of the manuscript contains core configuration information, we will revise it to explicitly specify the model sizes and architectures evaluated, provide a precise description of the failure injection methodology (including timing and types of faults), report the number of independent runs performed for statistical significance, and discuss potential confounding factors such as network variability with the controls applied during measurements. Corresponding updates will be made to the abstract for consistency where appropriate.
  Revision: yes
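The validation promised in these responses could be reported with a few summary statistics, sketched below under stated assumptions. The metric choices (relative error per scenario, search latency as a fraction of total recovery time, sample standard deviation across runs) and all function names are hypothetical; the rebuttal does not say which metrics the authors will use.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def relative_errors(predicted_s: Sequence[float], measured_s: Sequence[float]) -> List[float]:
    """Per-scenario |predicted - measured| / measured."""
    if len(predicted_s) != len(measured_s):
        raise ValueError("need one prediction per measurement")
    if any(m <= 0 for m in measured_s):
        raise ValueError("measured times must be positive")
    return [abs(p - m) / m for p, m in zip(predicted_s, measured_s)]


def summarize_errors(predicted_s: Sequence[float], measured_s: Sequence[float]) -> Dict[str, float]:
    """Mean and worst-case relative estimation error."""
    errs = relative_errors(predicted_s, measured_s)
    return {"mean": statistics.mean(errs), "max": max(errs)}


def search_overhead_fraction(search_latency_s: float, recovery_total_s: float) -> float:
    """Share of total recovery time spent searching for a plan."""
    if recovery_total_s <= 0:
        raise ValueError("total recovery time must be positive")
    return search_latency_s / recovery_total_s


def mean_and_stdev(samples: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs (needs >= 2 runs)."""
    return statistics.mean(samples), statistics.stdev(samples)
```

Reporting the worst-case error alongside the mean matters here, because a single badly mispredicted scenario is enough to make the selector pick the wrong strategy.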
Circularity Check
No circularity: empirical system validated by external comparisons
The paper describes an adaptive fault-tolerance system whose core claims rest on a unified performance model plus real-time search and optimizations, with results measured directly against failure-free baselines and prior systems (Oobleck, Recycle) on a 32-card cluster. No equations or derivations reduce a reported quantity (e.g., the 11% gap or 1.229×/1.355× throughput) to a fitted parameter or self-citation by construction; the model is presented as an engineering tool whose accuracy is assessed via end-to-end experiments rather than assumed. Self-citations, if present, are not load-bearing for the central performance claims. This is a standard empirical systems paper whose derivation chain is self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery / Peano structure (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Algorithm 1: Search for the Best Execution Plan (integer partition, distribute batch, split layers, time estimator)
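The four components named for Algorithm 1 suggest a search of roughly the following shape. This is a guess at how they might compose, not the paper's algorithm: the page gives only the component names. The sketch assumes that each part of an integer partition of the surviving devices is the stage count of one pipeline replica, that layers are split as evenly as possible, that microbatches are assigned greedily, and that a fill-and-drain pipeline takes `(microbatches + stages - 1)` bottleneck-stage steps. All function names and the `min_stages` memory proxy are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def integer_partitions(n: int, max_part: Optional[int] = None, min_part: int = 1) -> Iterator[List[int]]:
    """Yield partitions of n as non-increasing lists with parts in [min_part, max_part]."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield []
        return
    for part in range(max_part, min_part - 1, -1):
        for rest in integer_partitions(n - part, part, min_part):
            yield [part] + rest


def split_layers(num_layers: int, stages: int) -> List[int]:
    """Split layers across stages as evenly as possible."""
    base, extra = divmod(num_layers, stages)
    return [base + 1] * extra + [base] * (stages - extra)


def estimate_pipeline_time(layers_per_stage: List[int], microbatches: int, t_layer: float) -> float:
    """Toy estimator: fill-and-drain schedule paced by the slowest stage."""
    if microbatches == 0:
        return 0.0
    stages = len(layers_per_stage)
    return (microbatches + stages - 1) * max(layers_per_stage) * t_layer


def distribute_batch(num_microbatches: int, splits: List[List[int]], t_layer: float) -> List[int]:
    """Greedily give each microbatch to the pipeline that would finish earliest."""
    counts = [0] * len(splits)
    for _ in range(num_microbatches):
        target = min(
            range(len(splits)),
            key=lambda i: estimate_pipeline_time(splits[i], counts[i] + 1, t_layer),
        )
        counts[target] += 1
    return counts


def search_best_plan(
    num_devices: int,
    num_layers: int,
    num_microbatches: int,
    t_layer: float = 1.0,
    min_stages: int = 1,  # hypothetical stand-in for a memory constraint
) -> Optional[Tuple[float, List[int], List[int]]]:
    """Return (estimated time, stages per pipeline, microbatches per pipeline)."""
    best = None
    max_stages = min(num_devices, num_layers)
    for parts in integer_partitions(num_devices, max_stages, min_stages):
        splits = [split_layers(num_layers, s) for s in parts]
        batches = distribute_batch(num_microbatches, splits, t_layer)
        t = max(estimate_pipeline_time(s, b, t_layer) for s, b in zip(splits, batches))
        if best is None or t < best[0]:
            best = (t, parts, batches)
    return best
```

For example, `search_best_plan(4, 8, 8, min_stages=2)` considers one 4-stage pipeline and two 2-stage pipelines, and `search_best_plan(3, 8, 8, min_stages=2)` models the same job after losing one device. The number of integer partitions grows quickly with device count, so a real-time search at scale would need pruning or caching beyond this sketch.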
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Grattafiori, A. Dubey, A. Jauhri et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
- [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
- [3] Z. Wang, Z. Jia, S. Zheng, Z. Zhang et al., “Gemini: Fast failure recovery in distributed training with in-memory checkpoints,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 364–381. [Online]. Available: https://doi.org/10.1145/3600006.3613145
- [4] A. Eisenman, K. K. Matam, S. Ingram et al., “Check-N-Run: a checkpointing system for training deep learning recommendation models,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA: USENIX Association, Apr. 2022, pp. 929–943. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/eisenman
- [5] Z. Jiang, H. Lin, Y. Zhong, Q. Huang et al., “Megascale: Scaling large language model training to more than 10,000 gpus,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15627
- [6] L. Xie, J. Zhai, B. Wu, Y. Wang et al., “Elan: Towards generic and efficient elastic training for deep learning,” in 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020, pp. 78–88
- [7] J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang, R. Netravali, and G. H. Xu, “Bamboo: Making preemptible instances resilient for affordable training of large DNNs,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 497–513
- [8] I. Jang, Z. Yang, Z. Zhang, X. Jin, and M. Chowdhury, “Oobleck: Resilient distributed training of large models using pipeline templates,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 382–395
- [9] S. Gandhi, M. Zhao, A. Skiadopoulos, and C. Kozyrakis, “Recycle: Resilient training of large dnns using pipeline adaptation,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 211–228
- [10] M. Wagenländer, G. Li, B. Zhao, L. Mai, and P. Pietzuch, “Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 195–210
- [11] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, and Y. Hu, “Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 789–801
- [12] B. Wróblewski, G. Gottardo, and A. Zouzias, “Parallel scan on ascend ai accelerators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.15112
- [13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis et al., “Accurate, large minibatch sgd: Training imagenet in 1 hour,” 2018. [Online]. Available: https://arxiv.org/abs/1706.02677
- [14] A. Sergeev and M. D. Balso, “Horovod: fast and easy distributed deep learning in tensorflow,” 2018. [Online]. Available: https://arxiv.org/abs/1802.05799
- [15] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “Imagenet training in minutes,” 2018. [Online]. Available: https://arxiv.org/abs/1709.05011
- [16] S. Smith, M. Patwary, B. Norick, P. LeGresley et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” 2022. [Online]. Available: https://arxiv.org/abs/2201.11990
- [17] Y. Huang, Y. Cheng, A. Bapna et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” 2019. [Online]. Available: https://arxiv.org/abs/1811.06965
- [18] T. Kim, H. Kim, G.-I. Yu, and B.-G. Chun, “BPipe: Memory-balanced pipeline parallelism for training large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, ...
- [19] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” 2020. [Online]. Available: https://arxiv.org/abs/1910.02054
- [20] W. Cai, J. Jiang, L. Qin et al., “Shortcut-connected expert parallelism for accelerating mixture-of-experts,” 2025. [Online]. Available: https://arxiv.org/abs/2404.05019
- [21] Y. Qian, F. Li, X. Ji et al., “Eps-moe: Expert pipeline scheduler for cost-efficient moe inference,” 2025. [Online]. Available: https://arxiv.org/abs/2410.12247
- [22] D. Liu, Z. Yan, X. Yao, T. Liu et al., “Moe parallel folding: Heterogeneous parallelism mappings for efficient large-scale moe model training with megatron core,” 2025. [Online]. Available: https://arxiv.org/abs/2504.14960
- [23] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 1, Jan. 2022
- [24] C. Hwang, W. Cui, Y. Xiong et al., “Tutel: Adaptive mixture-of-experts at scale,” 2023. [Online]. Available: https://arxiv.org/abs/2206.03382
- [25] C. Jin, Z. Jiang, Z. Bai et al., “Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11432
- [26] W. Li, X. Liu, Y. Li et al., “Understanding communication characteristics of distributed training,” in Proceedings of the 8th Asia-Pacific Workshop on Networking, ser. APNet ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 1–8. [Online]. Available: https://doi.org/10.1145/3663408.3663409
- [27] D. Moolchandani, J. Kundu et al., “Amped: An analytical model for performance in distributed training of transformers,” in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 306–315
- [28] V. Korthikanti, J. Casper, S. Lym, L. McAfee et al., “Reducing activation recomputation in large transformer models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.05198
- [29] S. Li and T. Hoefler, “Chimera: efficiently training large-scale neural networks with bidirectional pipelines,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21. ACM, Nov. 2021, p. 1–14. [Online]. Available: http://dx.doi.org/10.1145/3458817.3476145
- [30] S. Li, Y. Zhao, R. Varma, O. Salpekar et al., “Pytorch distributed: experiences on accelerating data parallel training,” Proc. VLDB Endow., vol. 13, no. 12, p. 3005–3018, Aug. 2020. [Online]. Available: https://doi.org/10.14778/3415478.3415530
- [31] S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: scalable, low-cost training of massive deep learning models,” in Proceedings of the Seventeenth European Conference on Computer Systems, 2022, pp. 472–487
- [32] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, “Failures in large scale systems: Long-term measurement, analysis, and implications,” in SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12
- [33] Q. Weng, W. Xiao, Y. Yu, W. Wang et al., “MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). Renton, WA: USENIX Association, Apr. 2022, pp. 945–960. [Online]. Available: https://www.usenix.org/conference/nsdi22/presentation/weng
- [34] Y. Deng, X. Shi, Z. Jiang, X. Zhang et al., “Minder: Faulty machine detection for large-scale distributed model training,” in 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). Philadelphia, PA: USENIX Association, Apr. 2025, pp. 505–521. [Online]. Available: https://www.usenix.org/conference/nsdi25/presentation/deng
- [35] L. Zheng, Z. Li, H. Zhang, Y. Zhuang et al., “Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 559–578. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
- [36] H. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, 05 2012
- [37] H. Touvron, L. Martin, K. Stone, P. Albert et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.09288