Understanding stragglers in large model training using what-if analysis

Lin, J · 2025 · arXiv 2505.05713

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

cs.DC · 2026-03-26 · unverdicted · novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

cs.DC · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10

citing papers explorer

Showing 3 of 3 citing papers.

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving cs.DC · 2026-03-26 · unverdicted · none · ref 14
GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism cs.DC · 2026-05-07 · unverdicted · none · ref 30 · 2 links
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs cs.DC · 2026-05-10 · unreviewed · ref 9

Understanding stragglers in large model training using what-if analysis

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer