Training LLMs with Fault Tolerant HSDP on 100,000 GPUs
6 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
citing papers explorer
- GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
  GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
- DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
  DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.
- ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
  ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.
- Decoupled DiLoCo for Resilient Distributed Pre-training
  Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.
- TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
  TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.
- Libra: Efficient Resource Management for Agentic RL Post-Training
  Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.