Oobleck: Resilient distributed training of large models using pipeline templates

Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury · 2023

1 Pith paper cites this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10 · unverdicted · novelty 5.0

Production-scale empirical study of a 63-node 504-GPU cluster reports multi-signal failure detection needs, low checkpoint bandwidth utilization, heavy-tailed node exclusions, and 2.7x higher success for auto-retry chains.

citing papers explorer

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10 · unverdicted · none · ref 42
