CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Bor-Yiing Su; Brandon Lucia; Carole-Jean Wu; Caroline Trippel; Isabel Gao; Jiyan Yang; Kiwan Maeng; Mark C. Jeffrey; Mike Rabbat; Shivam Bharuka

arxiv: 2011.02999 · v1 · pith:GA2G4BRInew · submitted 2020-11-05 · 💻 cs.LG · cs.DC

Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

classification 💻 cs.LG cs.DC

keywords recovery training partial recommendation accuracy model overhead analysis

The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads. To the best of our knowledge, the paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce the training time and maintain the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2-8.5% to 0.53-0.68% compared to full recovery, on a configuration emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme when training a state-of-the-art recommendation model on Criteo's Ads CTR dataset. Our preliminary results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading the accuracy.
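
Point (3) of the abstract, giving priority to updates of more frequently accessed parameters, can be illustrated with a minimal sketch. This is not the CPR implementation: the function names `select_rows_to_save` and `partial_checkpoint`, the dictionary-based embedding table, the access log, and the fixed row `budget` are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List


def select_rows_to_save(access_log: Iterable[int], budget: int) -> List[int]:
    """Return up to `budget` row ids, most frequently accessed first.

    Hypothetical helper: `access_log` is assumed to be the sequence of
    embedding-row ids touched since the last checkpoint.
    """
    counts = Counter(access_log)
    # Sort by descending access count, then ascending row id for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [row for row, _ in ranked[:budget]]


def partial_checkpoint(
    table: Dict[int, List[float]],
    access_log: Iterable[int],
    budget: int,
) -> Dict[int, List[float]]:
    """Copy only the prioritized rows of `table` into a checkpoint dict."""
    rows = select_rows_to_save(access_log, budget)
    # Copy each row so later training updates do not mutate the checkpoint.
    return {row: list(table[row]) for row in rows if row in table}


if __name__ == "__main__":
    # Toy embedding table with 6 rows of dimension 2 (assumed values).
    table = {i: [float(i), float(i) * 0.5] for i in range(6)}
    log = [3, 1, 3, 5, 3, 1, 0]
    print(partial_checkpoint(table, log, budget=2))
```

In this toy run, rows 3 and 1 are accessed most often, so only those two rows are copied into the checkpoint. How the budget and the saving interval would be chosen in practice is outside what this sketch covers.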

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
cs.LG 2026-07 unverdicted novelty 6.0

DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.