Stacking as accelerated gradient descent. arXiv preprint arXiv:2403.04978
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1)
Citation-polarity summary: background (1)
Representative citing papers
CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to
citing papers explorer
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- All is Not Lost: LLM Recovery without Checkpoints
  CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to