Guard combines online performance monitoring with offline node qualification to detect stragglers and fail-slow behavior in large-scale training, reporting up to 1.7x higher mean FLOPs utilization and a reduction in step-time variance from 20% to 1%.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.DC (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training