MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
4 Pith papers cite this work.
Citing papers
- Exploiting Multicast for Accelerating Collective Communication
  MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
- HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
  HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-oriented frameworks.
- The Carrier Pigeon Internet Protocol: An Algorithmic (and Lighthearted) Perspective
  Minimizing pigeons for 2-hop and multihop demands is NP-hard, but a polynomial-time demand-aggregation algorithm achieves a 2-approximation.
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
  CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.