Topology-aware GPU scheduling for learning workloads in cloud environments

Marcelo Amaral, Jordà Polo, David Carrera, Seetharami Seelam, Malgorzata Steinder · 2017

1 Pith paper cites this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10 · unverdicted · novelty 5.0

Production-scale empirical study of a 63-node 504-GPU cluster reports multi-signal failure detection needs, low checkpoint bandwidth utilization, heavy-tailed node exclusions, and 2.7x higher success for auto-retry chains.

citing papers explorer

Showing 1 of 1 citing paper.

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10 · unverdicted · none · ref 15
