3 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (2)
polarities: background (2)
representative citing papers
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
ScheduleFree+ scales schedule-free learning to LLMs with fixes for large batches and models, outperforming Warmup-Stable-Decay schedules by up to 31% at 1000 tokens per parameter.
citing papers explorer
- ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
  ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.
- Predicting Large Model Test Losses with a Noisy Quadratic System
  A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
- ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
  ScheduleFree+ scales schedule-free learning to LLMs with fixes for large batches and models, outperforming Warmup-Stable-Decay schedules by up to 31% at 1000 tokens per parameter.