TRANSOM: An efficient fault-tolerant system for training LLMs

Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, Yuheng Chen, Shigang Li · 2023 · arXiv 2310.10046

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

cs.SE · 2026-05-06 · unverdicted · novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

cs.DC · 2026-05-18 · unverdicted · novelty 5.0

TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

Production-scale empirical study of a 63-node 504-GPU cluster reports multi-signal failure detection needs, low checkpoint bandwidth utilization, heavy-tailed node exclusions, and 2.7x higher success for auto-retry chains.

citing papers explorer

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning · cs.SE · 2026-05-06 · unverdicted · none · ref 8
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training · cs.DC · 2026-05-18 · unverdicted · none · ref 44
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs · cs.DC · 2026-05-10 · unverdicted · none · ref 45 · 2 links
