arxiv: 2105.04663 · v2 · submitted 2021-05-10 · 💻 cs.DC · cs.LG

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, Zhifeng Chen

Pith reviewed 2026-05-18 12:31 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords parallelization · machine learning · compiler · tensor partitioning · distributed training · scalability · TPU

The pith

GSPMD infers full operator partitioning from a few tensor distribution annotations so single-device ML programs scale automatically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GSPMD is a compiler that takes ML programs written for one device and turns them into distributed versions across many cores. Users supply only limited hints on how tensors should be split, and the system works out the right partitioning for every operation in the graph. The same annotation style supports different parallelism styles and mixes of them. The approach reaches 50 to 62 percent compute utilization on clusters as large as 2048 TPUv3 cores for models containing up to one trillion parameters.

Core claim

GSPMD supplies a simple yet general representation of tensor partitioning that lets the compiler automatically determine the distribution strategy for every operator once a user has annotated only a small number of tensors.

What carries the argument

Automatic inference of per-operator partitioning from limited user annotations on tensor distributions.
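
To make the idea of inferring operator partitioning from a few annotations concrete, here is a minimal sketch in plain Python. It is a toy forward-only pass over a tiny graph and does not attempt to reproduce the paper's actual propagation rules. The op kinds, tensor names, and the mesh axis names `data` and `model` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A sharding names, per tensor dimension, the mesh axis it is split over
# (None means that dimension is replicated).
Sharding = Tuple[Optional[str], ...]
# Each op is (kind, input tensor names, output tensor name).
Op = Tuple[str, List[str], str]


def propagate(ops: List[Op], annotations: Dict[str, Sharding]) -> Dict[str, Sharding]:
    """Toy forward pass: fill in shardings the user did not annotate."""
    known = dict(annotations)
    for kind, inputs, output in ops:
        if output in known:
            continue  # an explicit user annotation always wins
        if kind == "elementwise":
            # Reuse the sharding of the first input whose sharding is known.
            source = next((known[name] for name in inputs if name in known), None)
            if source is not None:
                known[output] = source
        elif kind == "matmul":
            lhs, rhs = inputs
            # [m, k] @ [k, n] -> [m, n]: rows follow lhs, columns follow rhs.
            rows = known[lhs][0] if lhs in known else None
            cols = known[rhs][1] if rhs in known else None
            if cols == rows:
                cols = None  # one mesh axis cannot split two dimensions
            known[output] = (rows, cols)
        else:
            raise ValueError(f"no propagation rule for op kind {kind!r}")
    return known


# Hypothetical two-layer MLP; only x and w1 carry user annotations.
ops = [
    ("matmul", ["x", "w1"], "h"),
    ("elementwise", ["h"], "a"),
    ("matmul", ["a", "w2"], "y"),
]
result = propagate(ops, {"x": ("data", None), "w1": (None, "model")})
```

With two annotated tensors out of six, the sketch assigns a sharding to every intermediate: `h` and `a` end up split over both axes, while `y` keeps only the `data` split because `w2` was left unannotated.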

If this is right

Single-device ML code can be scaled to thousands of cores with only a handful of added annotations.
The same mechanism expresses data, model, and pipeline parallelism or any combination of them.
Models with up to one trillion parameters become practical to train, at a reported 50 to 62 percent compute utilization on up to 2048 TPUv3 cores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Engineering effort for writing distributed training code could drop sharply if the annotation style proves robust across many model families.
The technique may allow rapid experimentation with new parallelism mixtures that would otherwise require extensive manual rewrites.

Load-bearing premise

A small set of user annotations on how tensors are distributed is sufficient for the compiler to choose correct and efficient partitioning for every operator without large overhead or wrong results.
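
The "correct results" half of this premise can be illustrated with a small numerical check. The sketch below, in plain Python, splits a matrix multiplication along its contracting dimension, computes one partial product per simulated device, and sums the partials to emulate an all-reduce. The function names and the choice of partitioned dimension are assumptions for illustration; nothing here is taken from the GSPMD implementation.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference single-device matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sharded_matmul(a: Matrix, b: Matrix, num_shards: int) -> Matrix:
    """Split the contracting dimension across shards, then sum the partials."""
    k = len(b)
    if k % num_shards != 0:
        raise ValueError("contracting dimension must divide evenly in this sketch")
    step = k // num_shards
    partials = []
    for shard in range(num_shards):
        lo, hi = shard * step, (shard + 1) * step
        a_shard = [row[lo:hi] for row in a]   # columns lo..hi of a
        b_shard = b[lo:hi]                    # rows lo..hi of b
        partials.append(matmul(a_shard, b_shard))
    rows, cols = len(a), len(b[0])
    # Emulated all-reduce: elementwise sum of every shard's partial result.
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
full = matmul(a, b)
split = sharded_matmul(a, b, num_shards=2)
```

Integer inputs keep the comparison exact. With floating-point data the partitioned sum may differ from the single-device result in the last bits, because the order of additions changes.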

What would settle it

An ML graph where the compiler-chosen partitioning produces incorrect numerical results or utilization below 40 percent on a 1024-core TPU run while a manually tuned version exceeds 60 percent.
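
The criterion above can be written down as a small predicate. The sketch below is one possible reading, under the assumption that incorrect numerical results refute the claim on their own, while the utilization condition requires both runs to be on 1024 cores. The `Run` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cores: int
    utilization: float        # fraction of peak compute, in [0, 1]
    numerically_correct: bool


def claim_refuted(compiler: Run, manual: Run) -> bool:
    """Return True if the pair of runs meets the stated refutation criterion."""
    if not compiler.numerically_correct:
        return True  # wrong results refute the claim regardless of speed
    same_scale = compiler.cores == manual.cores == 1024
    return same_scale and compiler.utilization < 0.40 and manual.utilization > 0.60
```

The thresholds 0.40 and 0.60 and the 1024-core scale come from the sentence above; any measurements fed to the predicate would have to come from real runs.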

Original abstract

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator based on limited user annotations, making it convenient to scale existing single-device programs. It solves several technical challenges for production usage, allowing GSPMD to achieve 50% to 62% compute utilization on up to 2048 Cloud TPUv3 cores for models with up to one trillion parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents GSPMD, an automatic compiler-based parallelization system for ML computation graphs. Users write single-device programs and supply limited annotations specifying tensor distributions; GSPMD then infers a partitioning for every operator. The partitioning representation is designed to be simple yet general enough to express mixed data, model, and pipeline parallelism across a wide range of models, and the system is reported to deliver 50–62 % compute utilization on up to 2048 Cloud TPUv3 cores for models containing up to one trillion parameters.

Significance. If the central claims hold, GSPMD would materially lower the barrier to scaling existing single-device ML programs to very large clusters. The combination of minimal user annotations with automatic inference across arbitrary graphs, together with the reported utilization numbers on 2048-core TPUv3 runs, would constitute a practical contribution to production-scale training of trillion-parameter models.

major comments (2)

[Abstract] Abstract: the performance claims of 50 %–62 % compute utilization on up to 2048 TPUv3 cores are presented without any description of measurement methodology, chosen baselines, statistical error, or hardware/software configuration details. Because these numbers are the primary empirical support for the scalability claim, their absence leaves the central result only partially substantiated.
[Partitioning Inference (inferred from §3–4)] The manuscript asserts that a small set of tensor-distribution annotations suffices for the compiler to correctly infer partitions for every operator in arbitrary ML graphs. The description of the propagation rules does not address completeness for operators involving dynamic control flow, non-standard reduction axes, or user-defined kernels; if any of these cases fall outside the supported rules, the system would either require more annotations than advertised or silently produce incorrect parallelizations.

minor comments (1)

[Notation and API] The notation used for tensor-distribution annotations should be defined more explicitly (e.g., with a small grammar or example table) so that readers can reproduce the exact user input required.
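
As an illustration of what such an explicit definition could look like, the sketch below models an annotation as a device mesh plus a per-dimension mapping onto mesh axes, and computes the shape of each device's shard. The `ShardingSpec` class, its fields, and the axis names are hypothetical and are not the paper's notation. Requiring even divisibility is a simplification made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ShardingSpec:
    mesh: Dict[str, int]                 # mesh axis name -> number of devices
    dims: Tuple[Optional[str], ...]      # per tensor dim: mesh axis or None

    def shard_shape(self, shape: Tuple[int, ...]) -> Tuple[int, ...]:
        """Shape of the piece of the tensor held by a single device."""
        if len(shape) != len(self.dims):
            raise ValueError("one mapping entry is required per tensor dimension")
        out = []
        for size, axis in zip(shape, self.dims):
            parts = 1 if axis is None else self.mesh[axis]
            if size % parts != 0:
                raise ValueError(f"dimension of size {size} does not split {parts} ways")
            out.append(size // parts)
        return tuple(out)


mesh = {"x": 4, "y": 2}          # a hypothetical 4 x 2 mesh of 8 devices
shape = (1024, 512)
replicated = ShardingSpec(mesh, (None, None)).shard_shape(shape)
row_split = ShardingSpec(mesh, ("x", None)).shard_shape(shape)
both_split = ShardingSpec(mesh, ("x", "y")).shard_shape(shape)
```

For the (1024, 512) tensor, the three mappings give per-device shapes of (1024, 512), (256, 512), and (256, 256). This is the kind of row an example table in the paper could contain.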

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below in a point-by-point manner and indicate the revisions made to the next version of the paper.

Referee: [Abstract] Abstract: the performance claims of 50 %–62 % compute utilization on up to 2048 TPUv3 cores are presented without any description of measurement methodology, chosen baselines, statistical error, or hardware/software configuration details. Because these numbers are the primary empirical support for the scalability claim, their absence leaves the central result only partially substantiated.

Authors: We agree that the abstract would be strengthened by including a brief reference to the evaluation setup. In the revised manuscript we have added one sentence to the abstract noting the hardware (Cloud TPUv3), the measurement of compute utilization via the standard TPU profiler, and that full experimental details appear in Section 5. The core utilization numbers themselves remain unchanged.

revision: yes

Referee: [Partitioning Inference (inferred from §3–4)] The manuscript asserts that a small set of tensor-distribution annotations suffices for the compiler to correctly infer partitions for every operator in arbitrary ML graphs. The description of the propagation rules does not address completeness for operators involving dynamic control flow, non-standard reduction axes, or user-defined kernels; if any of these cases fall outside the supported rules, the system would either require more annotations than advertised or silently produce incorrect parallelizations.

Authors: GSPMD targets the static dataflow graphs that dominate large-scale ML training. Dynamic control flow is handled by treating the enclosing higher-level constructs (e.g., loops in JAX or TensorFlow) as separate sub-graphs that receive explicit annotations when automatic inference is insufficient. Non-standard reductions and user-defined kernels fall back to user-specified sharding when the built-in propagation rules do not apply. We have added a clarifying paragraph in Section 3 and a limitations subsection in Section 6 that explicitly state these scope assumptions and the annotation fallback mechanism.

revision: partial
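
The fallback described in this response can be sketched as a small resolver: use an explicit annotation if one exists, otherwise apply a built-in rule, and fail loudly when neither is available rather than guessing. The rule table, op names, and exception type below are hypothetical and serve only to illustrate the pattern.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sharding = Tuple[Optional[str], ...]

# Hypothetical built-in rules: these ops keep the sharding of their first input.
RULES: Dict[str, Callable[[List[Sharding]], Sharding]] = {
    "add": lambda inputs: inputs[0],
    "relu": lambda inputs: inputs[0],
}


class MissingAnnotation(Exception):
    """Raised when an op has neither a built-in rule nor a user annotation."""


def resolve(op: str, inputs: List[Sharding],
            user_annotation: Optional[Sharding] = None) -> Sharding:
    if user_annotation is not None:
        return user_annotation           # explicit annotation takes priority
    rule = RULES.get(op)
    if rule is None:
        # Refuse to guess: a silent default could give a wrong parallelization.
        raise MissingAnnotation(f"op {op!r} needs an explicit sharding annotation")
    return rule(inputs)
```

Raising an error for unsupported ops is one way to rule out the silent-misparallelization outcome the referee raises. The cost is that the user must supply more annotations than the headline claim suggests.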

Circularity Check

0 steps flagged

GSPMD presents a rule-based compiler for tensor partitioning with no self-referential derivations or fitted predictions.

The paper describes a practical compiler system in which users supply a small number of tensor-distribution annotations and the system propagates partitioning decisions across operators via explicit inference rules. No equations, uniqueness theorems, or performance predictions are shown to reduce by construction to fitted parameters, self-citations, or renamed inputs. Reported utilization numbers (50–62 % on up to 2048 Cloud TPUv3 cores) are empirical measurements on concrete models, not quantities defined by the partitioning rules themselves. The central claim therefore remains independent of the inputs it consumes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about static computation graphs in ML frameworks and the sufficiency of user annotations for partitioning decisions.

axioms (1)

domain assumption: ML computation graphs are static and can be analyzed to infer partitioning from tensor annotations.
Invoked when describing how GSPMD infers operator partitioning from limited user hints.

pith-pipeline@v0.9.0 · 5722 in / 1205 out tokens · 56326 ms · 2026-05-18T12:31:42.752308+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
cs.LG 2025-12 conditional novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
cs.DC 2026-04 unverdicted novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
veScale-FSDP: Flexible and High-Performance FSDP at Scale
cs.DC 2026-02 unverdicted novelty 6.0

veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
cs.LG 2025-12 unverdicted novelty 6.0

BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
Cambrian-S: Towards Spatial Supersensing in Video
cs.CV 2025-11 unverdicted novelty 6.0

Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise o...
HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
cs.DC 2025-09 unverdicted novelty 6.0

HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-orie...
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
cs.DC 2023-04 unverdicted novelty 6.0

PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
LaMDA: Language Models for Dialog Applications
cs.CL 2022-01 unverdicted novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
cs.DC 2026-04 unverdicted novelty 5.0

Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
