Title resolution pending

2026 · arXiv 2602.00277

4 Pith papers cite this work. Polarity classification is still indexing.


Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

cs.DC · 2026-03-26 · unverdicted · novelty 7.0

GhostServe applies erasure coding to the KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency by up to 2.7× and recovery latency by 2.1× versus prior methods.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.

Decoupled DiLoCo for Resilient Distributed Pre-training

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

cs.DC · 2026-05-18 · unverdicted · novelty 5.0

TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.

citing papers explorer

Showing 4 of 4 citing papers.

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving · cs.DC · 2026-03-26 · unverdicted · none · ref 16
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload · cs.DC · 2026-05-11 · unverdicted · none · ref 29 · 2 links
Decoupled DiLoCo for Resilient Distributed Pre-training · cs.CL · 2026-04-23 · unverdicted · none · ref 22
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training · cs.DC · 2026-05-18 · unverdicted · none · ref 37
