Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

2026 · arXiv 2602.00277

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

cs.DC · 2026-03-26 · unverdicted · novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.

Decoupled DiLoCo for Resilient Distributed Pre-training

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

cs.DC · 2026-05-18 · unverdicted · novelty 5.0

TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.

Libra: Efficient Resource Management for Agentic RL Post-Training

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.

citing papers explorer

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
cs.DC · 2026-03-26 · unverdicted · none · ref 16
DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
cs.LG · 2026-07-02 · unverdicted · none · ref 47
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
cs.DC · 2026-05-11 · unverdicted · none · ref 29 · 2 links
Decoupled DiLoCo for Resilient Distributed Pre-training
cs.CL · 2026-04-23 · unverdicted · none · ref 22
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
cs.DC · 2026-05-18 · unverdicted · none · ref 37
Libra: Efficient Resource Management for Agentic RL Post-Training
cs.LG · 2026-06-02 · unverdicted · none · ref 40
