SHIFT: Exploring the boundary of RDMA network fault tolerance. arXiv preprint arXiv:2512.11094, 2025.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Don't Let a Few Network Failures Slow the Entire AllReduce
OptCC is a pipelined AllReduce algorithm that completes within 2-6% of fault-free NCCL performance under up to 50% bandwidth loss by approaching a new lower bound showing O(1/p) unavoidable overhead for p GPUs.