pith. sign in

Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, and Christo- pher R´e

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it
abstract

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh .

fields

cs.LG 4 cs.CL 3

representative citing papers

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Decoupled DiLoCo for Resilient Distributed Pre-training

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

Showing 7 of 7 citing papers.

  • Scaling Laws for Neural Language Models cs.LG · 2020-01-23 · unverdicted · none · ref 11

    Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

  • Reformer: The Efficient Transformer cs.LG · 2020-01-13 · accept · none · ref 18

    Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models cs.LG · 2019-10-04 · accept · none · ref 5 · internal anchor

    ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.

  • Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 52 · internal anchor

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  • Decoupled DiLoCo for Resilient Distributed Pre-training cs.CL · 2026-04-23 · unverdicted · none · ref 24

    Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 139

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 81

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.