Pith Number

pith:LG76FO4F

pith:2019:LG76FO4FXIRQAKPLNJWYADOHLW
not attested · not anchored · not stored · refs resolved

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

ZeRO partitions optimizer states, gradients, and parameters across data-parallel devices to remove memory redundancy in parallel training.

arxiv:1910.02054 v3 · 2019-10-04 · cs.LG · cs.DC · stat.ML

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{LG76FO4FXIRQAKPLNJWYADOHLW}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis demonstrates ZeRO has the potential to scale beyond 1 Trillion parameters using today's hardware.

C2 · weakest assumption

Partitioning optimizer states and gradients is assumed not to introduce new communication bottlenecks or synchronization overheads that scale worse than linearly when moving to thousands of devices.

C3 · one-line summary

ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
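
The memory-redundancy claim in C1 can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the per-device model-state accounting usually attributed to the paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and K = 12 bytes for fp32 optimizer states. The function name, the stage labels, and the example sizes (7.5B parameters, 64 devices) are illustrative choices, not part of this record, and activations, buffers, and fragmentation are ignored.

```python
def model_state_bytes(psi, n_devices, partition="none", k=12):
    """Approximate per-device bytes for model states under mixed-precision Adam.

    psi        -- number of model parameters
    n_devices  -- data-parallel degree
    partition  -- "none", "os", "os+g", or "os+g+p" (hypothetical labels for
                  optimizer states, + gradients, + parameters)
    k          -- bytes of optimizer state per parameter (assumed 12 for Adam)
    """
    params, grads, opt = 2 * psi, 2 * psi, k * psi
    if partition == "none":      # plain data parallelism: everything replicated
        return params + grads + opt
    if partition == "os":        # optimizer states split across devices
        return params + grads + opt / n_devices
    if partition == "os+g":      # gradients split as well
        return params + (grads + opt) / n_devices
    if partition == "os+g+p":    # parameters split as well
        return (params + grads + opt) / n_devices
    raise ValueError(f"unknown partition scheme: {partition!r}")


if __name__ == "__main__":
    psi, n_devices = 7.5e9, 64   # illustrative sizes, not taken from this page
    for scheme in ("none", "os", "os+g", "os+g+p"):
        gb = model_state_bytes(psi, n_devices, scheme) / 1e9
        print(f"{scheme:>7}: {gb:6.1f} GB per device")
```

Under these assumptions only the fully partitioned scheme shrinks in proportion to the device count, which is the sense in which model size can scale with the number of devices. Whether the extra communication this requires stays cheap at scale is the concern raised in C2.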

References

26 extracted · 26 resolved · 9 Pith anchors

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2018 · arXiv:1810.04805
[2] Language models are unsupervised multitask learners 2019
[3] Megatron-lm: Training multi-billion parameter language models using model parallelism 2019
[4] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text tran 2019
[5] Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré 2018 · arXiv:1811.02084

Formal links

2 machine-checked theorem links

Cited by

31 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.364346Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

59bfe2bb85ba230029eb6a6d800dc75da176779950b8cf7ce12fe03970dfb98d

Aliases

arxiv: 1910.02054 · arxiv_version: 1910.02054v3 · doi: 10.48550/arxiv.1910.02054 · pith_short_12: LG76FO4FXIRQ · pith_short_16: LG76FO4FXIRQAKPL · pith_short_8: LG76FO4F
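
The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and the `pith_short_*` aliases are prefixes of it. This is an observation drawn from the values shown here, not a documented specification, so the sketch below should be read as a consistency check rather than the official derivation. The function name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "59bfe2bb85ba230029eb6a6d800dc75da176779950b8cf7ce12fe03970dfb98d"
PITH_NUMBER = "LG76FO4FXIRQAKPLNJWYADOHLW"


def pith_from_hash(hex_digest, length=26):
    """Base32-encode a hex SHA-256 digest and keep the first `length` characters.

    Assumption: this mirrors how the identifier is built; the page does not say so.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}

print(derived)             # LG76FO4FXIRQAKPLNJWYADOHLW
print(derived == PITH_NUMBER)  # True
print(aliases)
```

If the relationship holds in general, any mirror holding the canonical hash could recompute the identifier and its short aliases without contacting the registry.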
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/LG76FO4FXIRQAKPLNJWYADOHLW \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 59bfe2bb85ba230029eb6a6d800dc75da176779950b8cf7ce12fe03970dfb98d
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "769410855d6e6defbf18a87865b61cd2c4373b74c87a93f622ec300280dd1a77",
    "cross_cats_sorted": [
      "cs.DC",
      "stat.ML"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2019-10-04T17:29:39Z",
    "title_canon_sha256": "5c51bb8d9d15dc00904edb477c9632c6ae88312b10fbfa1a9d71978551cf7643"
  },
  "schema_version": "1.0",
  "source": {
    "id": "1910.02054",
    "kind": "arxiv",
    "version": 3
  }
}
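
The shell one-liner under "Verify this Pith Number yourself" can also be run offline against the record printed above. The following is a minimal sketch that applies the same canonicalization as that one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping) to a hand-copied dictionary. The helper name `canonical_sha256` is hypothetical, and the result should be compared with the canonical hash stated on this page rather than assumed to match, since a transcription slip in the copied record would change the digest.

```python
import hashlib
import json

# Canonical record as shown on this page, copied by hand.
record = {
    "metadata": {
        "abstract_canon_sha256": "769410855d6e6defbf18a87865b61cd2c4373b74c87a93f622ec300280dd1a77",
        "cross_cats_sorted": ["cs.DC", "stat.ML"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2019-10-04T17:29:39Z",
        "title_canon_sha256": "5c51bb8d9d15dc00904edb477c9632c6ae88312b10fbfa1a9d71978551cf7643",
    },
    "schema_version": "1.0",
    "source": {"id": "1910.02054", "kind": "arxiv", "version": 3},
}

EXPECTED = "59bfe2bb85ba230029eb6a6d800dc75da176779950b8cf7ce12fe03970dfb98d"


def canonical_sha256(obj):
    """Hash the compact, key-sorted JSON serialization of `obj` with SHA-256."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches page:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields were typed or received, which is what lets independent mirrors arrive at the same value.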