MegaScale: Scaling large language model training to more than 10,000 GPUs

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud

cs.SE · 2025-06-02 · unverdicted · novelty 5.0

TSGuard builds domain knowledge bases offline from historical incidents and applies online multi-agent structured reasoning to diagnose AI workload failures, delivering 19.8% higher accuracy and 63.4% lower verification time than baselines on Azure production data.

citing papers explorer

Showing 1 of 1 citing paper.

TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud cs.SE · 2025-06-02 · unverdicted · none · ref 25
