ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
Le, Yonghui Wu, and Zhifeng Chen
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.DC 2years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
citing papers explorer
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
-
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.