TRANSOM: An efficient fault-tolerant system for training LLMs
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
years: 2026 (3)
representative citing papers
TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.
citing papers explorer
- Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
  Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
- TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
  TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.
- From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs