{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2020:WVMNY27ORQRCGY3BXFUU6EKCWV","short_pith_number":"pith:WVMNY27O","schema_version":"1.0","canonical_sha256":"b558dc6bee8c22236361b9694f1142b57431b7e931c3b1b02fcbf29e0e94bffe","source":{"kind":"arxiv","id":"2006.15704","version":1},"attestation_state":"computed","paper":{"title":"PyTorch Distributed: Experiences on Accelerating Data Parallel Training","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.","cross_cats":["cs.LG"],"primary_cat":"cs.DC","authors_text":"Adam Paszke, Brian Vaughan, Jeff Smith, Omkar Salpekar, Pieter Noordhuis, Pritam Damania, Rohan Varma, Shen Li, Soumith Chintala, Teng Li, Yanli Zhao","submitted_at":"2020-06-28T20:39:45Z","abstract_excerpt":"This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism re"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2006.15704","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.DC","submitted_at":"2020-06-28T20:39:45Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"56fb8a04d11c1bb768f2d45e28e499a3d58de29e597e84838d5d5da921881e1f","abstract_canon_sha256":"bbd501c503bf854713f3294e6d4239118a5d4d86df4399aa29fe93a2e5f475e5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.320230Z","signature_b64":"EPV+XAqSb2rsZLlMQRGh6MacQKwG8+0/nZlf8q2Uf3gRHtyxH+C2hY23pAeDgZ38dlrgEhSgK0tXsqIsmP89Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b558dc6bee8c22236361b9694f1142b57431b7e931c3b1b02fcbf29e0e94bffe","last_reissued_at":"2026-05-17T23:38:13.319742Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.319742Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"PyTorch Distributed: Experiences on Accelerating Data Parallel Training","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.","cross_cats":["cs.LG"],"primary_cat":"cs.DC","authors_text":"Adam Paszke, Brian Vaughan, Jeff Smith, Omkar Salpekar, Pieter Noordhuis, Pritam Damania, Rohan Varma, Shen Li, Soumith Chintala, Teng Li, Yanli Zhao","submitted_at":"2020-06-28T20:39:45Z","abstract_excerpt":"This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism re"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that typical deep learning models have enough computation per layer to effectively overlap with gradient communication and that the underlying network fabric supports low-latency all-reduce operations at the tested scale.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PyTorch distributed data parallel attains near-linear scalability on 256 GPUs through gradient bucketing, computation-communication overlap, and selective synchronization skipping.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bac7834c700c915c21489719e51680e26098d3fe8029a6aa988ff9bc19ea4d6c"},"source":{"id":"2006.15704","kind":"arxiv","version":1},"verdict":{"id":"ed5f0378-f648-4cb2-98c6-4432d51cbae2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T19:09:51.262341Z","strongest_claim":"Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.","one_line_summary":"PyTorch distributed data parallel attains near-linear scalability on 256 GPUs through gradient bucketing, computation-communication overlap, and selective synchronization skipping.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that typical deep learning models have enough computation per layer to effectively overlap with gradient communication and that the underlying network fabric supports low-latency all-reduce operations at the tested scale.","pith_extraction_headline":"PyTorch's distributed data parallel module achieves near-linear scaling to 256 GPUs by overlapping computation with communication."},"references":{"count":48,"sample":[{"doi":"","year":2006,"title":"PyTorch Distributed: Experiences on Accelerating Data Parallel Training","work_id":"353279b8-3b33-45fd-9b64-41e5bd1708b9","ref_index":1,"cited_arxiv_id":"2006.15704","is_internal_anchor":true},{"doi":"","year":null,"title":"Then, we explain and justify the idea of data parallelism and describe communication primitives","work_id":"63ffbee3-2352-45e0-b30f-65e8698633b0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"During distributed training, each pro- cess has its own local model replica and local optimizer","work_id":"e7e10062-1089-40c0-8004-fc2537bb126a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"This section focus on the current status as of PyTorch v1.5.0","work_id":"96d96112-f4f7-4336-ac4a-53e4bab42d7f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"In the exclusive cluster, the GPUs are located on 4 servers, connected using Mellanox MT27700 ConnectX-4 100GB/s NIC","work_id":"3ebc3ab6-6d37-40b7-b04f-ebff8257f8ed","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":48,"snapshot_sha256":"9d526b9adeb1a0aaff8e34edee60eb08c01dbd8db2679c2d899465733c01ebd9","internal_anchors":6},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2006.15704","created_at":"2026-05-17T23:38:13.319826+00:00"},{"alias_kind":"arxiv_version","alias_value":"2006.15704v1","created_at":"2026-05-17T23:38:13.319826+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2006.15704","created_at":"2026-05-17T23:38:13.319826+00:00"},{"alias_kind":"pith_short_12","alias_value":"WVMNY27ORQRC","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"WVMNY27ORQRCGY3B","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"WVMNY27O","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":16,"sample":[{"citing_arxiv_id":"2406.08334","citing_title":"ProTrain: Efficient LLM Training via Memory-Aware Techniques","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2410.21316","citing_title":"Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2410.15155","citing_title":"On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09844","citing_title":"MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21603","citing_title":"DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22428","citing_title":"Exploiting Multicast for Accelerating Collective Communication","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20866","citing_title":"LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18174","citing_title":"Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18404","citing_title":"JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19169","citing_title":"Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18710","citing_title":"Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18404","citing_title":"JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02960","citing_title":"MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2511.09861","citing_title":"Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14579","citing_title":"Gradient-descent methods for scalable quantum detector tomography","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2006.15704","citing_title":"PyTorch Distributed: Experiences on Accelerating Data Parallel Training","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22437","citing_title":"veScale-FSDP: Flexible and High-Performance FSDP at Scale","ref_index":12,"is_internal_anchor":false},{"citing_arxiv_id":"2605.13434","citing_title":"Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity","ref_index":177,"is_internal_anchor":false},{"citing_arxiv_id":"2303.00915","citing_title":"BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs","ref_index":65,"is_internal_anchor":false},{"citing_arxiv_id":"2605.11111","citing_title":"ShardTensor: Domain Parallelism for Scientific Machine Learning","ref_index":52,"is_internal_anchor":false},{"citing_arxiv_id":"2605.08871","citing_title":"Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction","ref_index":56,"is_internal_anchor":false},{"citing_arxiv_id":"2605.09623","citing_title":"Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum","ref_index":12,"is_internal_anchor":false},{"citing_arxiv_id":"2605.10501","citing_title":"Accelerating Compound LLM Training Workloads with Maestro","ref_index":6,"is_internal_anchor":false},{"citing_arxiv_id":"2304.11277","citing_title":"PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel","ref_index":14,"is_internal_anchor":false},{"citing_arxiv_id":"2605.08962","citing_title":"MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production","ref_index":31,"is_internal_anchor":false}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV","json":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV.json","graph_json":"https://pith.science/api/pith-number/WVMNY27ORQRCGY3BXFUU6EKCWV/graph.json","events_json":"https://pith.science/api/pith-number/WVMNY27ORQRCGY3BXFUU6EKCWV/events.json","paper":"https://pith.science/paper/WVMNY27O"},"agent_actions":{"view_html":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV","download_json":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV.json","view_paper":"https://pith.science/paper/WVMNY27O","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2006.15704&json=true","fetch_graph":"https://pith.science/api/pith-number/WVMNY27ORQRCGY3BXFUU6EKCWV/graph.json","fetch_events":"https://pith.science/api/pith-number/WVMNY27ORQRCGY3BXFUU6EKCWV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV/action/storage_attestation","attest_author":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV/action/author_attestation","sign_citation":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV/action/citation_signature","submit_replication":"https://pith.science/pith/WVMNY27ORQRCGY3BXFUU6EKCWV/action/replication_record"}},"created_at":"2026-05-17T23:38:13.319826+00:00","updated_at":"2026-05-17T23:38:13.319826+00:00"}