Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

2019 · arXiv 1901.05758

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.

citing papers explorer

Showing 1 of 1 citing paper.

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

cs.LG · 2026-07-02 · unverdicted · none · ref 30
