SCAPE enables 90-99% sparse gradient communication in sharded Adam-style LLM training by deriving masks from first-moment statistics, achieving up to 43.3% faster pre-training on Llama-500M with no loss in validation loss or downstream accuracy.
H2o-danube3 technical report
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
citing papers explorer
-
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.