GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3
The pith
GShard enables scaling of sparsely-gated mixture-of-experts models beyond 600 billion parameters through automatic sharding and minimal code changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GShard is a module of lightweight annotation APIs and an extension to the XLA compiler that provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. Using GShard, the authors scaled a multilingual neural machine translation Transformer with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters. The model trained efficiently on 2048 TPU v3 accelerators in four days and delivered far superior quality for translation from 100 languages to English compared with prior art.
What carries the argument
GShard module consisting of lightweight annotation APIs and an XLA compiler extension that automates sharding for conditional computation patterns such as Sparsely-Gated Mixture-of-Experts.
If this is right
- Models that activate only a small subset of parameters per input can be trained at scales previously limited by manual sharding effort.
- Training runs for models exceeding 600 billion parameters become feasible on accelerator clusters within days rather than weeks or months.
- Multilingual neural machine translation quality improves measurably when the number of experts and total parameters increase under the same training budget.
- Existing Transformer code bases can adopt conditional computation and model parallelism with only localized annotation changes.
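The first consequence above rests on sparse activation: a gate scores every expert, but only the top few run for a given token. The following is a minimal sketch of top-2 gating in plain Python, not GShard's implementation. The names `top_k_gate` and `moe_layer`, the toy experts, and the choice to renormalize the selected weights are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (an illustrative assumption).
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, gate_logits, experts, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    out = [0.0] * len(token)
    for idx, weight in top_k_gate(gate_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Hypothetical toy setup: 4 experts, each scaling the token by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.0, 2.0, -1.0, 1.0]

selected = top_k_gate(gate_logits)
output = moe_layer([1.0, 1.0], gate_logits, experts)
active_fraction = len(selected) / len(experts)  # share of experts used per token
```

With 4 experts and k=2, half of the expert parameters are touched per token; with many more experts and the same k, that fraction shrinks while total parameter count grows.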
Where Pith is reading between the lines
- The same annotation-plus-compiler pattern could be applied to other sparse architectures in vision or speech models without requiring new hardware primitives.
- Widespread adoption might shift research focus from hand-tuned parallelism to higher-level decisions about which computations should be conditional.
- If the overhead remains low, future work could explore even larger numbers of experts or dynamic routing across modalities while keeping code readable.
Load-bearing premise
The automatic sharding and conditional computation can be realized with minimal model-code changes and without introducing correctness or performance problems that would invalidate the reported quality gains or training efficiency.
What would settle it
Re-implementing the 600-billion-parameter multilingual translation model with GShard, running it on 2048 TPU v3 accelerators, and checking whether training completes in roughly four days while matching or exceeding the claimed BLEU improvements over prior models.
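The scale of that replication can be made concrete with back-of-envelope arithmetic. The sketch below uses only the figures quoted above (2048 accelerators, four days, 600 billion parameters). The even split of parameters across accelerators and the 4 bytes per parameter are simplifying assumptions, not claims about how GShard actually lays out weights.

```python
ACCELERATORS = 2048        # from the claim under test
TRAINING_DAYS = 4          # from the claim under test
PARAMETERS = 600e9         # lower bound quoted for the model
BYTES_PER_PARAM = 4        # assumption: 32-bit weights

# Total compute budget expressed in accelerator-time.
accelerator_days = ACCELERATORS * TRAINING_DAYS
accelerator_years = accelerator_days / 365

# Assumption: parameters are spread evenly over all accelerators.
params_per_accelerator = PARAMETERS / ACCELERATORS
gigabytes_per_accelerator = params_per_accelerator * BYTES_PER_PARAM / 1e9

print(f"{accelerator_days} accelerator-days (~{accelerator_years:.1f} accelerator-years)")
print(f"{params_per_accelerator:,.0f} parameters per accelerator "
      f"(~{gigabytes_per_accelerator:.2f} GB of weights)")
```

Under these assumptions a faithful rerun costs about 8192 accelerator-days, which is why a smaller-scale check is the more realistic first step.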
Original abstract
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GShard, a module of lightweight annotation APIs plus an XLA compiler extension that lets users express a wide range of parallel computation patterns (including conditional computation) with minimal changes to existing model code. It demonstrates the approach by scaling a multilingual NMT Transformer that uses a Sparsely-Gated Mixture-of-Experts layer to more than 600 billion parameters, training the model on 2048 TPU v3 chips in four days and reporting substantially better translation quality from 100 languages into English than prior systems.
Significance. If the empirical claims are reproducible and the sharding semantics are preserved, the work is significant because it shows a practical route to training giant conditional-computation models at the 600 B+ scale with only modest code changes. The combination of automatic sharding and MoE routing could lower the barrier to experimenting with models whose size would otherwise be limited by manual partitioning effort.
major comments (2)
- [Abstract and §4] Abstract and §4 (MoE scaling results): the headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.
- [§3 and §4] §3 (GShard API and XLA extension) and §4 (MoE dispatch/combine): the paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.
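To make the second comment concrete, here is a minimal sketch of capacity-limited dispatch and combine in plain Python. It is not the paper's code. The capacity formula, the top-1 routing, the function names, and the zero fill for dropped tokens are assumptions chosen for brevity. Dropped tokens are one place where an audit might look for differences between sharded and unsharded runs.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Per-expert token budget (assumed formula): ceil(factor * tokens / experts)."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity):
    """Place token ids into per-expert buffers; tokens over capacity are dropped.

    assignments[t] is the expert chosen for token t (top-1 for brevity).
    """
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_id)
        else:
            dropped.append(token_id)
    return buffers, dropped

def combine(buffers, expert_outputs, num_tokens, fill=0.0):
    """Scatter expert outputs back to token order; dropped tokens get `fill`."""
    result = [fill] * num_tokens
    for expert, token_ids in enumerate(buffers):
        for slot, token_id in enumerate(token_ids):
            result[token_id] = expert_outputs[expert][slot]
    return result

# Hypothetical routing decisions for 8 tokens over 4 experts.
assignments = [0, 0, 0, 1, 0, 1, 2, 0]
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.0)
buffers, dropped = dispatch(assignments, 4, capacity)

# Stand-in expert computation: encode (expert, token) so the scatter is visible.
expert_outputs = [[float(100 * (e + 1) + t) for t in ids]
                  for e, ids in enumerate(buffers)]
combined = combine(buffers, expert_outputs, len(assignments))
```

In this toy run expert 0 is oversubscribed, so three of its five tokens are dropped and receive the fill value after the combine step.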
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a short table or bullet list of the exact API annotations introduced (@gshard, mesh, etc.) so readers can immediately see the claimed 'minimal code change' surface.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving the presentation of results and verification of implementation correctness. We address each major comment below.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (MoE scaling results): the headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.
Authors: We agree that the absence of explicit quantitative metrics in the abstract and Section 4 makes it difficult to evaluate the quality claims. In the revised manuscript we have added specific BLEU scores for the 600B model, direct baseline comparisons against prior systems, the exact number of languages evaluated, and notes on evaluation methodology to make the improvements verifiable.
Revision: yes
-
Referee: [§3 and §4] §3 (GShard API and XLA extension) and §4 (MoE dispatch/combine): the paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.
Authors: We acknowledge that the manuscript does not include a machine-checked equivalence proof or a full-scale numerical audit at 600B parameters. A formal machine-checked argument for the XLA extensions is outside the scope of the paper. The GShard annotations are designed to produce an identical computation graph to the unsharded version, with sharding applied as a transparent compiler transformation that preserves dataflow and gradients by construction. In the revision we have added a side-by-side numerical audit on a smaller-scale model (showing forward and backward passes match within floating-point tolerance) in the appendix to provide concrete verification evidence.
Revision: partial
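A small-scale audit of the kind described in this response could look like the sketch below. It is an illustration under stated assumptions, not the authors' audit: a single matrix-vector product stands in for a layer, "devices" are simulated by column slices, and the final sum stands in for an all-reduce. All function names are hypothetical.

```python
import math
import random

def matvec(weights, x):
    """Unsharded forward reference: y[i] = sum_j weights[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def grad_x(weights, upstream):
    """Unsharded backward reference: dL/dx[j] = sum_i weights[i][j] * upstream[i]."""
    return [sum(row[j] * g for row, g in zip(weights, upstream))
            for j in range(len(weights[0]))]

def column_slices(n, num_shards):
    """Contiguous slices that split n columns across simulated devices."""
    step = math.ceil(n / num_shards)
    return [slice(start, start + step) for start in range(0, n, step)]

def sharded_matvec(weights, x, num_shards):
    """Each shard computes a partial product; summing partials mimics an all-reduce."""
    partials = [matvec([row[s] for row in weights], x[s])
                for s in column_slices(len(x), num_shards)]
    return [sum(vals) for vals in zip(*partials)]

def sharded_grad_x(weights, upstream, num_shards):
    """Each shard owns its slice of dL/dx; results are concatenated."""
    out = []
    for s in column_slices(len(weights[0]), num_shards):
        out.extend(grad_x([row[s] for row in weights], upstream))
    return out

def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(64)] for _ in range(8)]
x = [random.uniform(-1, 1) for _ in range(64)]
upstream = [1.0] * 8  # gradient of L = sum(y) with respect to y

forward_diff = max_abs_diff(matvec(weights, x), sharded_matvec(weights, x, 4))
backward_diff = max_abs_diff(grad_x(weights, upstream),
                             sharded_grad_x(weights, upstream, 4))
```

The forward comparison needs a tolerance rather than exact equality because sharding changes the order of floating-point additions. Passing such a check on a small model does not by itself establish equivalence at the 600B scale.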
Circularity Check
No circularity: empirical systems demonstration without derivation or fitted predictions
Full rationale
The paper presents GShard as a set of lightweight annotation APIs plus an XLA compiler extension that enables automatic sharding for conditional computation patterns such as sparsely-gated MoE. Its core claim is an end-to-end empirical result: a 600B-parameter multilingual Transformer was trained on 2048 TPU v3 chips in four days and produced superior BLEU scores. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. The result is externally falsifiable by re-implementation and re-training rather than being forced by any self-definition, self-citation chain, or renaming of prior results. This is a standard non-circular engineering paper whose validity rests on implementation correctness and experimental reproducibility.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3 (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
-
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform...
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
-
Path-Constrained Mixture-of-Experts
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
-
Combining pre-trained models via localized model averaging
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
-
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
DySHARP accelerates MoE expert parallelism via dynamic multimem addressing and token-centric kernel fusion to cut redundant traffic and deliver up to 1.79x speedup over prior in-switch solutions.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
WiFo-MiSAC: A Wireless Foundation Model for Multimodal Sensing and Communication Integration via Synesthesia of Machines (SoM)
WiFo-MiSAC is a task-agnostic foundation model that unifies multimodal wireless signals via tokenization and self-supervised learning with SS-DMoE to achieve strong few-shot performance on beam prediction and channel ...
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...
-
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
-
PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybri...
-
M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
M-IDoL learns modality-specific and diverse representations by maximizing inter-modality entropy and minimizing intra-modality uncertainty through information decomposition in MoE subspaces.
-
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
-
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
-
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.
-
Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework
A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Reference graph
Works this paper leans on
-
[1]
On the Optimization of Deep Networks: Im- plicit Acceleration by Overparameterization, June 2018
Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018
-
[2]
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018
work page Pith review arXiv 2018
-
[3]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[4]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[5]
Exploring the limits of weakly supervised pretraining
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018
work page 2018
-
[7]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[8]
Identity mappings in deep residual networks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016
work page 2016
-
[9]
Nas-fpn: Learning scalable feature pyramid architecture for object detection
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019
work page 2019
-
[10]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017
work page 2017
-
[11]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019
work page 2019
-
[12]
Language Models are Few-Shot Learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[13]
Unsupervised cross-lingual representation learning at scale, 2019
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale, 2019
work page 2019
-
[14]
Massively multilingual neural machine translation in the wild: Findings and challenges, 2019
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019
work page 2019
-
[15]
Gpipe: Efficient training of giant neural networks using pipeline parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hy- oukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32, pages 103–112, 2019. 26
work page 2019
-
[16]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[17]
Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks, 2017
-
[18]
Deep learning scaling is predictable, empirically
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017
-
[19]
Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, Feb 2019
-
[20]
Scaling description of generalization with number of parameters in deep learning
Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, Feb 2020
-
[21]
Tensorflow: a system for large-scale machine learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016
-
[23]
Mesh-tensorflow: Deep learning for supercomputers
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018
-
[24]
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018
-
[25]
Conditional computation in neural networks for faster models
Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models, 2015
-
[26]
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. ArXiv, abs/1910.10073, 2020
-
[27]
Controlling computation versus quality for neural sequence models
Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Controlling computation versus quality for neural sequence models, 2020
-
[28]
XLA: Optimizing Compiler for TensorFlow
XLA: Optimizing Compiler for TensorFlow. https://www.tensorflow.org/xla, 2019. Online; accessed 1 June 2020
-
[29]
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010
-
[30]
Die grundlage der allgemeinen relativitätstheorie
Albert Einstein. Die grundlage der allgemeinen relativitätstheorie. In Das Relativitätsprinzip, pages 81–124. Springer, 1923
-
[31]
Lingvo: a modular and scalable framework for sequence-to-sequence modeling
Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia Xu Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019
-
[32]
Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs
Youlong Cheng, HyoukJoong Lee, and Tamas Berghammer. Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs. https://cloud.google.com/blog/products/ai-machine-learning/train-ml-models-on-large-images-and-3d-volumes-with-spatial-partitioning-on-cloud-tpus. Online; accessed 12 June 2020
-
[34]
ONNX: Open Neural Network Exchange
ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019. Online; accessed 1 June 2020
-
[35]
Relay: a new ir for machine learning frameworks
Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: a new ir for machine learning frameworks. Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages - MAPL 2018, 2018
-
[36]
Glow: Graph lowering compiler techniques for neural networks
Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, and Man Wang. Glow: Graph lowering compiler techniques for neural networks, 2018
-
[37]
MPI: A Message-Passing Interface Standard
MPI Forum. MPI: A Message-Passing Interface Standard. Version 2.2, September 4th 2009. available at: http://www.mpi-forum.org (Dec. 2009)
-
[38]
BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy
Minsik Cho, Ulrich Finkler, and David Kung. BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019
-
[39]
A Cellular Computer to Implement the Kalman Filter Algorithm
Lynn Elliot Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, USA, 1969. AAI7010025
-
[40]
Multi-way, multilingual neural machine translation with a shared attention mechanism
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016
-
[41]
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, Dec 2017
-
[42]
Massively multilingual neural machine translation
Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. CoRR, abs/1903.00089, 2019
-
[43]
Exploring massively multilingual, massive neural machine translation. https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html. Accessed: 2020-06-05
-
[44]
Recent advances in google translate
Recent advances in google translate. https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. Accessed: 2020-06-05
-
[45]
Transfer of training: A review and directions for future research
Timothy T Baldwin and J Kevin Ford. Transfer of training: A review and directions for future research. Personnel psychology, 41(1):63–105, 1988
-
[46]
Estimating or propagating gradients through stochastic neurons for conditional computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013
-
[47]
Low-rank approximations for conditional feedforward computation in deep neural networks
Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks, 2013
-
[48]
Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, page 1101–1109, USA, 2010. Association for Computational Linguistics
-
[49]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002
-
[50]
Training deeper neural machine translation models with transparent attention
Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
-
[51]
Language modeling with deep transformers
Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Language modeling with deep transformers. Interspeech 2019, Sep 2019
-
[52]
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
-
[53]
David R. So, Chen Liang, and Quoc V. Le. The evolved transformer, 2019
-
[54]
Using bfloat16 with TensorFlow models
Using bfloat16 with TensorFlow models. https://cloud.google.com/tpu/docs/bfloat16, 2020. Online; accessed 12 June 2020
-
[55]
Wide and deep learning for recommender systems
Heng-Tze Cheng, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah, Levent Koc, Jeremiah Harmsen, et al. Wide and deep learning for recommender systems. Proceedings of the 1st Workshop on Deep Learning for Recommender Systems - DLRS 2016, 2016
-
[56]
Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks, 2018
-
[57]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019
-
[58]
ImageNet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012
-
[59]
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015
-
[60]
Sequence to sequence learning with neural networks
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014
-
[61]
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014
-
[62]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016
-
[63]
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012
-
[64]
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE, 2016
-
[65]
State-of-the-art speech recognition with sequence-to-sequence models
Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018
-
[66]
WaveNet: A Generative Model for Raw Audio
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016
-
[67]
Natural tts synthesis by conditioning wavenet on mel spectrogram predictions
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018
-
[68]
Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017
-
[69]
Exploring generalization in deep learning
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning, 2017
-
[70]
Special-purpose digital hardware for neural networks: An architectural survey
Paolo Ienne, Thierry Cornu, and Gary Kuhn. Special-purpose digital hardware for neural networks: An architectural survey. Journal of VLSI signal processing systems for signal, image and video technology, 13(1):5–25, 1996
-
[71]
Large-scale deep unsupervised learning using graphics processors
Rajat Raina, Anand Madhavan, and Andrew Y Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning, pages 873–880, 2009
-
[72]
Deep, big, simple neural nets for handwritten digit recognition
Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010
-
[73]
In-datacenter performance analysis of a tensor processing unit
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017
-
[74]
2019 recent trends in GPU price per FLOPS
2019 recent trends in GPU price per FLOPS. https://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/. Accessed: 2020-06-05
-
[75]
Summarizing cpu and gpu design trends with product data
Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli. Summarizing cpu and gpu design trends with product data. arXiv preprint arXiv:1911.11313, 2019
-
[76]
Large scale distributed deep networks
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012
-
[77]
Theano: new features and speed improvements
Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012
-
[78]
Automatic differentiation in pytorch
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017
-
[79]
Scalable parallel programming with cuda
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, 2008
-
[80]
JAX: composable transformations of Python+NumPy programs
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs. 2018
-
[81]
Compiling machine learning programs via high-level tracing
Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. In Machine Learning and Systems (MLSys), 2018
-
[82]
Beyond Data and Model Parallelism for Deep Neural Networks
Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019