Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
TRANSOM: An efficient fault-tolerant system for training LLMs
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Roles: background (2)
Polarities: background (2)
Representative citing papers:
DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.
TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.
A production-scale empirical study of a 63-node, 504-GPU cluster reports the need for multi-signal failure detection, low checkpoint bandwidth utilization, heavy-tailed node exclusions, and 2.7x higher success for auto-retry chains.