{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2018:3PESEQ43EVWKC6AUZMT6SUTXWB","short_pith_number":"pith:3PESEQ43","schema_version":"1.0","canonical_sha256":"dbc922439b256ca17814cb27e95277b057dca2f72edbadaf25fcd2f0c8b596db","source":{"kind":"arxiv","id":"1811.02084","version":1},"attestation_state":"computed","paper":{"title":"Mesh-TensorFlow: Deep Learning for Supercomputers","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.DC","stat.ML"],"primary_cat":"cs.LG","authors_text":"Ashish Vaswani, Blake Hechtman, Cliff Young, Dustin Tran, HyoukJoong Lee, Mingsheng Hong, Niki Parmar, Noam Shazeer, Penporn Koanantakool, Peter Hawkins, Ryan Sepassi, Youlong Cheng","submitted_at":"2018-11-05T23:25:02Z","abstract_excerpt":"Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, pa"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"1811.02084","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2018-11-05T23:25:02Z","cross_cats_sorted":["cs.DC","stat.ML"],"title_canon_sha256":"11a3aa1674bcf71f0b2b94ad6a399c147cfa941b0eb3c9d41885a123c3dcb541","abstract_canon_sha256":"e9ffc5853e135df3a9900b14d3ddd78fe513191f384e5fe271a9c059b80b1337"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T00:01:25.083563Z","signature_b64":"ir110NbrTnXeDTSRma43qPyzp7lIfiTOB9kcYqbnW/1PvSpuAWSf8/naX/4d7ZjEDVGiHEDFrwjr4rhZ45saDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"dbc922439b256ca17814cb27e95277b057dca2f72edbadaf25fcd2f0c8b596db","last_reissued_at":"2026-05-18T00:01:25.083097Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T00:01:25.083097Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Mesh-TensorFlow: Deep Learning for Supercomputers","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.DC","stat.ML"],"primary_cat":"cs.LG","authors_text":"Ashish Vaswani, Blake Hechtman, Cliff Young, Dustin Tran, HyoukJoong Lee, Mingsheng Hong, Niki Parmar, Noam Shazeer, Penporn Koanantakool, Peter Hawkins, Ryan Sepassi, Youlong Cheng","submitted_at":"2018-11-05T23:25:02Z","abstract_excerpt":"Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, pa"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"1811.02084","kind":"arxiv","version":1},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"1811.02084","created_at":"2026-05-18T00:01:25.083171+00:00"},{"alias_kind":"arxiv_version","alias_value":"1811.02084v1","created_at":"2026-05-18T00:01:25.083171+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.1811.02084","created_at":"2026-05-18T00:01:25.083171+00:00"},{"alias_kind":"pith_short_12","alias_value":"3PESEQ43EVWK","created_at":"2026-05-18T12:32:02.567920+00:00"},{"alias_kind":"pith_short_16","alias_value":"3PESEQ43EVWKC6AU","created_at":"2026-05-18T12:32:02.567920+00:00"},{"alias_kind":"pith_short_8","alias_value":"3PESEQ43","created_at":"2026-05-18T12:32:02.567920+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":7,"internal_anchor_count":2,"sample":[{"citing_arxiv_id":"2102.01293","citing_title":"Scaling Laws for Transfer","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"1910.02054","citing_title":"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2001.04451","citing_title":"Reformer: The Efficient Transformer","ref_index":18,"is_internal_anchor":false},{"citing_arxiv_id":"2112.00861","citing_title":"A General Language Assistant as a Laboratory for Alignment","ref_index":81,"is_internal_anchor":false},{"citing_arxiv_id":"2207.05221","citing_title":"Language Models (Mostly) Know What They Know","ref_index":139,"is_internal_anchor":false},{"citing_arxiv_id":"2604.21428","citing_title":"Decoupled DiLoCo for Resilient Distributed Pre-training","ref_index":24,"is_internal_anchor":false},{"citing_arxiv_id":"2001.08361","citing_title":"Scaling Laws for Neural Language Models","ref_index":11,"is_internal_anchor":false}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB","json":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB.json","graph_json":"https://pith.science/api/pith-number/3PESEQ43EVWKC6AUZMT6SUTXWB/graph.json","events_json":"https://pith.science/api/pith-number/3PESEQ43EVWKC6AUZMT6SUTXWB/events.json","paper":"https://pith.science/paper/3PESEQ43"},"agent_actions":{"view_html":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB","download_json":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB.json","view_paper":"https://pith.science/paper/3PESEQ43","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=1811.02084&json=true","fetch_graph":"https://pith.science/api/pith-number/3PESEQ43EVWKC6AUZMT6SUTXWB/graph.json","fetch_events":"https://pith.science/api/pith-number/3PESEQ43EVWKC6AUZMT6SUTXWB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB/action/storage_attestation","attest_author":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB/action/author_attestation","sign_citation":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB/action/citation_signature","submit_replication":"https://pith.science/pith/3PESEQ43EVWKC6AUZMT6SUTXWB/action/replication_record"}},"created_at":"2026-05-18T00:01:25.083171+00:00","updated_at":"2026-05-18T00:01:25.083171+00:00"}