arXiv: 2006.16668 · v1 · submitted 2020-06-30 · 💻 cs.CL · cs.LG · stat.ML

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML

keywords: model scaling · mixture of experts · automatic sharding · conditional computation · neural machine translation · Transformer · multilingual models · model parallelism

The pith

GShard enables scaling of sparsely-gated mixture-of-experts models beyond 600 billion parameters through automatic sharding and minimal code changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GShard as a set of lightweight annotation APIs plus an XLA compiler extension that lets developers express parallel computation patterns without rewriting large parts of their models. It demonstrates this by scaling a multilingual neural machine translation Transformer that uses Sparsely-Gated Mixture-of-Experts layers to more than 600 billion parameters. The authors trained the resulting model on 2048 TPU v3 accelerators in four days and report substantially better translation quality from 100 languages into English than earlier systems. A reader would care because scaling model size has repeatedly improved performance on language tasks, yet the practical barriers of manual parallelization and compute cost have limited how far that scaling can go. If the approach holds, it removes much of the engineering friction that currently stands between researchers and giant conditional-computation models.

Core claim

GShard is a module of lightweight annotation APIs and an extension to the XLA compiler that provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. Using GShard, the authors scaled a multilingual neural machine translation Transformer with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters. The model trained efficiently on 2048 TPU v3 accelerators in four days and delivered far superior quality for translation from 100 languages to English compared with prior art.

What carries the argument

GShard module consisting of lightweight annotation APIs and an XLA compiler extension that automates sharding for conditional computation patterns such as Sparsely-Gated Mixture-of-Experts.

If this is right

Models that activate only a small subset of parameters per input can be trained at scales previously limited by manual sharding effort.
Training runs for models exceeding 600 billion parameters become feasible on accelerator clusters within days rather than weeks or months.
Multilingual neural machine translation quality improves measurably when the number of experts and total parameters increase under the same training budget.
Existing Transformer code bases can adopt conditional computation and model parallelism with only localized annotation changes.
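
The first implication, that only a small subset of parameters is active per input, can be illustrated with a minimal sketch of top-k gating. This is plain Python written for this page, not the paper's code: the function names, the toy scalar "experts", and the choice to renormalize the selected gate weights so they sum to 1 are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    """Pick the k most probable experts and renormalize their weights.

    Renormalizing over the selected experts is an assumption of this sketch.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum over the selected experts only; the rest are never called."""
    used = []
    y = 0.0
    for idx, weight in top_k_gate(gate_logits, k):
        used.append(idx)
        y += weight * experts[idx](x)
    return y, used

# Four hypothetical experts: scalar functions standing in for feed-forward blocks.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Experts 1 and 3 tie for the highest gate logit, so they share the token equally.
y, used = moe_forward(1.0, [0.1, 2.0, -1.0, 2.0], experts)
print(used, y)
```

Only two of the four experts run for this token, so the per-token compute depends on k rather than on the total number of experts. That is the property that lets total parameter count grow without a matching growth in per-token cost.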

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation-plus-compiler pattern could be applied to other sparse architectures in vision or speech models without requiring new hardware primitives.
Widespread adoption might shift research focus from hand-tuned parallelism to higher-level decisions about which computations should be conditional.
If the overhead remains low, future work could explore even larger numbers of experts or dynamic routing across modalities while keeping code readable.

Load-bearing premise

The automatic sharding and conditional computation can be realized with minimal model-code changes and without introducing correctness or performance problems that would invalidate the reported quality gains or training efficiency.

What would settle it

Re-implementing the 600-billion-parameter multilingual translation model with GShard, running it on 2048 TPU v3 accelerators, and checking whether training completes in roughly four days while matching or exceeding the claimed BLEU improvements over prior models.

Original abstract

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GShard, a module of lightweight annotation APIs plus an XLA compiler extension that lets users express a wide range of parallel computation patterns (including conditional computation) with minimal changes to existing model code. It demonstrates the approach by scaling a multilingual NMT Transformer that uses a Sparsely-Gated Mixture-of-Experts layer to more than 600 billion parameters, training the model on 2048 TPU v3 chips in four days and reporting substantially better translation quality from 100 languages into English than prior systems.

Significance. If the empirical claims are reproducible and the sharding semantics are preserved, the work is significant because it shows a practical route to training giant conditional-computation models at the 600 B+ scale with only modest code changes. The combination of automatic sharding and MoE routing could lower the barrier to experimenting with models whose size would otherwise be limited by manual partitioning effort.

major comments (2)

[Abstract and §4 (MoE scaling results)] The headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.
[§3 (GShard API and XLA extension) and §4 (MoE dispatch/combine)] The paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.
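
For readers unfamiliar with the capacity-factor dispatch named in the second comment, the following is a minimal sketch of the general idea, assuming top-1 routing for brevity. The capacity formula (capacity factor times the average number of tokens per expert, rounded up), the function names, and the handling of overflow tokens by simply recording them are assumptions of this sketch, not details taken from the text under review.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Assumed formula: capacity_factor times the average tokens per expert."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity):
    """Place each token in its expert's buffer until that buffer is full.

    assignments[t] is the expert chosen for token t. Tokens that arrive after
    their expert is full are recorded as overflow instead of being processed.
    """
    buffers = [[] for _ in range(num_experts)]
    overflow = []
    for token, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token)
        else:
            overflow.append(token)
    return buffers, overflow

# Eight tokens, four experts, deliberately skewed toward expert 0.
assignments = [0, 0, 0, 1, 0, 1, 2, 0]
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.0)
buffers, overflow = dispatch(assignments, num_experts=4, capacity=capacity)
print(capacity, buffers, overflow)
```

Fixed-size buffers are what make the computation statically shaped and therefore shardable, but they also mean the result depends on which tokens overflow. That dependence is one reason a sharded-versus-unsharded comparison is a meaningful check.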

minor comments (1)

[Abstract] The abstract and introduction would benefit from a short table or bullet list of the exact API annotations introduced (@gshard, mesh, etc.) so readers can immediately see the claimed 'minimal code change' surface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving the presentation of results and verification of implementation correctness. We address each major comment below.

Referee: [Abstract and §4 (MoE scaling results)] The headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.

Authors: We agree that the absence of explicit quantitative metrics in the abstract and Section 4 makes it difficult to evaluate the quality claims. In the revised manuscript we have added specific BLEU scores for the 600B model, direct baseline comparisons against prior systems, the exact number of languages evaluated, and notes on evaluation methodology to make the improvements verifiable.
revision: yes

Referee: [§3 (GShard API and XLA extension) and §4 (MoE dispatch/combine)] The paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.

Authors: We acknowledge that the manuscript does not include a machine-checked equivalence proof or a full-scale numerical audit at 600B parameters. A formal machine-checked argument for the XLA extensions is outside the scope of the paper. The GShard annotations are designed to produce an identical computation graph to the unsharded version, with sharding applied as a transparent compiler transformation that preserves dataflow and gradients by construction. In the revision we have added a side-by-side numerical audit on a smaller-scale model (showing forward and backward passes match within floating-point tolerance) in the appendix to provide concrete verification evidence.
revision: partial
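
The kind of side-by-side audit discussed in this exchange can be sketched on a toy scale. The code below is a hypothetical illustration in plain Python, not the authors' audit: experts are single scalar weights, the "all-to-all" is simulated by filtering tokens per device, the loss is taken to be the sum of outputs, and all names are invented for this sketch.

```python
import math

def unsharded(tokens, assignments, weights):
    """Reference pass: every token is handled in one place by its expert."""
    outputs = [weights[e] * x for x, e in zip(tokens, assignments)]
    grads = [0.0] * len(weights)  # d(sum of outputs) / d(weights[e])
    for x, e in zip(tokens, assignments):
        grads[e] += x
    return outputs, grads

def sharded(tokens, assignments, weights, num_devices):
    """Experts are split across devices (expert e lives on device e % num_devices).

    Each device processes only the tokens routed to its local experts, and the
    results are gathered back into the original token order.
    """
    outputs = [None] * len(tokens)
    grads = [0.0] * len(weights)
    for device in range(num_devices):
        for t, (x, e) in enumerate(zip(tokens, assignments)):
            if e % num_devices == device:
                outputs[t] = weights[e] * x
                grads[e] += x
    return outputs, grads

def allclose(a, b, tol=1e-9):
    """Element-wise comparison within a floating-point tolerance."""
    return len(a) == len(b) and all(
        math.isclose(p, q, rel_tol=tol, abs_tol=tol) for p, q in zip(a, b)
    )

tokens = [0.5, -1.25, 2.0, 3.5, -0.75, 1.0]
assignments = [0, 3, 1, 2, 3, 0]
weights = [1.5, -2.0, 0.25, 4.0]

ref_out, ref_grad = unsharded(tokens, assignments, weights)
shd_out, shd_grad = sharded(tokens, assignments, weights, num_devices=2)
print(allclose(ref_out, shd_out), allclose(ref_grad, shd_grad))
```

A check like this only covers the configurations it is run on. A match on a small model is evidence of correctness, not a proof that the same holds at 2048 devices, which is why the response is marked as a partial revision.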

Circularity Check

0 steps flagged

No circularity: empirical systems demonstration without derivation or fitted predictions

The paper presents GShard as a set of lightweight annotation APIs plus an XLA compiler extension that enables automatic sharding for conditional computation patterns such as sparsely-gated MoE. Its core claim is an end-to-end empirical result: a 600B-parameter multilingual Transformer was trained on 2048 TPU v3 chips in four days and produced superior BLEU scores. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. The result is externally falsifiable by re-implementation and re-training rather than being forced by any self-definition, self-citation chain, or renaming of prior results. This is a standard non-circular engineering paper whose validity rests on implementation correctness and experimental reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and engineering paper; no mathematical free parameters, domain axioms, or new invented entities are introduced.

pith-pipeline@v0.9.0 · 5498 in / 1088 out tokens · 55391 ms · 2026-05-11T02:22:08.183651+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3 unclear

We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR 2026-05 conditional novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL 2020-12 conditional novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
cs.CV 2026-05 unverdicted novelty 7.0

BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
cs.LG 2026-05 unverdicted novelty 7.0

Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform...
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
cs.DC 2026-05 unverdicted novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
cs.LG 2026-04 unverdicted novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
cs.DC 2026-04 unverdicted novelty 7.0

FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 7.0

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Depth Adaptive Efficient Visual Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 7.0

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
cs.LG 2026-04 unverdicted novelty 7.0

A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
Path-Constrained Mixture-of-Experts
cs.LG 2026-03 unverdicted novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
cs.AI 2026-05 conditional novelty 6.0

BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
Combining pre-trained models via localized model averaging
stat.ME 2026-05 unverdicted novelty 6.0

Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
cs.LG 2026-05 unverdicted novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
cs.LG 2026-05 unverdicted novelty 6.0

DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
cs.CL 2026-05 unverdicted novelty 6.0

XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Hierarchical Mixture-of-Experts with Two-Stage Optimization
cs.LG 2026-05 unverdicted novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
cs.AR 2026-05 unverdicted novelty 6.0

MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
cs.AR 2026-05 unverdicted novelty 6.0

DySHARP accelerates MoE expert parallelism via dynamic multimem addressing and token-centric kernel fusion to cut redundant traffic and deliver up to 1.79x speedup over prior in-switch solutions.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
cs.AI 2026-05 unverdicted novelty 6.0

MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
cs.CL 2026-05 unverdicted novelty 6.0

EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
cs.CV 2026-04 unverdicted novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
cs.CL 2026-04 unverdicted novelty 6.0

X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
cs.LG 2026-04 unverdicted novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
WiFo-MiSAC: A Wireless Foundation Model for Multimodal Sensing and Communication Integration via Synesthesia of Machines (SoM)
eess.SP 2026-04 unverdicted novelty 6.0

WiFo-MiSAC is a task-agnostic foundation model that unifies multimodal wireless signals via tokenization and self-supervised learning with SS-DMoE to achieve strong few-shot performance on beam prediction and channel ...
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
cs.AR 2026-04 conditional novelty 6.0

DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
cs.LG 2024-10 unverdicted novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 5.0

ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
cs.DC 2026-04 unverdicted novelty 5.0

FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
cs.LG 2026-04 unverdicted novelty 5.0

Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
cs.LG 2026-04 accept novelty 5.0

PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybri...
M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model
cs.CV 2026-04 unverdicted novelty 5.0

M-IDoL learns modality-specific and diverse representations by maximizing inter-modality entropy and minimizing intra-modality uncertainty through information decomposition in MoE subspaces.
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
cs.CL 2026-04 unverdicted novelty 5.0

JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
gpt-oss-120b & gpt-oss-20b Model Card
cs.CL 2025-08 unverdicted novelty 5.0

OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Kimi K2: Open Agentic Intelligence
cs.LG 2025-07 unverdicted novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 4.0

ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
cs.AI 2026-05 unverdicted novelty 4.0

AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
cs.DC 2026-05 accept novelty 4.0

LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
cs.AI 2026-04 unverdicted novelty 4.0

LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
cs.RO 2026-04 unverdicted novelty 4.0

Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.
Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework
cs.LG 2026-04 unverdicted novelty 4.0

A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL 2026-05 unverdicted novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 58 Pith papers · 9 internal anchors

[1] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018

[2] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

[3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[5] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016

[9] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019

work page 2019
[10]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017

work page 2017
[11]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

work page 2019
[12]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[13]

Unsupervised cross-lingual representation learning at scale, 2019

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale, 2019

work page 2019
[14]

Massively multilingual neural machine translation in the wild: Findings and challenges, 2019

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019

work page 2019
[15]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32, pages 103–112, 2019

work page 2019
[16]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

High-dimensional dynamics of generalization error in neural networks

Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks, 2017

work page 2017
[18]

Deep learning scaling is predictable, empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017

work page 2017
[19]

Beyond human-level accuracy

Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, Feb 2019

work page 2019
[20]

Scaling description of generalization with number of parameters in deep learning

Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, Feb 2020

work page 2020
[21]

Tensorflow: a system for large-scale machine learning

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016

work page 2016
[23]

Mesh-tensorflow: Deep learning for supercomputers

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018

work page 2018
[24]

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018

work page Pith review arXiv 2018
[25]

Conditional computation in neural networks for faster models, 2015

Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models, 2015

work page 2015
[26]

Depth-adaptive transformer

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. ArXiv, abs/1910.10073, 2020

work page arXiv 1910
[27]

Controlling computation versus quality for neural sequence models, 2020

Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Controlling computation versus quality for neural sequence models, 2020

work page 2020
[28]

XLA: Optimizing Compiler for TensorFlow

XLA: Optimizing Compiler for TensorFlow. https://www.tensorflow.org/xla, 2019. Online; accessed 1 June 2020

work page 2019
[29]

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010

work page 2010
[30]

Die grundlage der allgemeinen relativitätstheorie

Albert Einstein. Die grundlage der allgemeinen relativitätstheorie. In Das Relativitätsprinzip, pages 81–124. Springer, 1923

work page 1923
[31]

Lingvo: a modular and scalable framework for sequence-to-sequence modeling

Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia Xu Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019

work page arXiv 1902
[32]

Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs

Youlong Cheng, HyoukJoong Lee, and Tamas Berghammer. Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs. https://cloud.google.com/blog/products/ai-machine-learning/train-ml-models-on-large-images-and-3d-volumes-with-spatial-partitioning-on-cloud-tpus. Online; accessed 12 June 2020

work page 2020
[34]

ONNX: Open Neural Network Exchange

ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019. Online; accessed 1 June 2020

work page 2019
[35]

Relay: a new ir for machine learning frameworks

Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: a new ir for machine learning frameworks. Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages - MAPL 2018, 2018

work page 2018
[36]

Glow: Graph lowering compiler techniques for neural networks, 2018

Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, and Man Wang. Glow: Graph lowering compiler techniques for neural networks, 2018

work page 2018
[37]

MPI: A Message-Passing Interface Standard

MPI Forum. MPI: A Message-Passing Interface Standard. Version 2.2, September 4th 2009. available at: http://www.mpi-forum.org (Dec. 2009)

work page 2009
[38]

BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy

Minsik Cho, Ulrich Finkler, and David Kung. BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019

work page 2019
[39]

A Cellular Computer to Implement the Kalman Filter Algorithm

Lynn Elliot Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, USA, 1969. AAI7010025

work page 1969
[40]

Multi-way, multilingual neural machine translation with a shared attention mechanism

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016

work page 2016
[41]

Google’s multilingual neural machine translation system: Enabling zero-shot translation

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, Dec 2017

work page 2017
[42]

Massively multilingual neural machine translation

Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. CoRR, abs/1903.00089, 2019

work page arXiv 1903
[43]

Exploring massively multilingual, massive neural machine translation

Exploring massively multilingual, massive neural machine translation. https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html. Accessed: 2020-06-05

work page 2019
[44]

Recent advances in google translate

Recent advances in google translate. https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. Accessed: 2020-06-05

work page 2020
[45]

Transfer of training: A review and directions for future research

Timothy T Baldwin and J Kevin Ford. Transfer of training: A review and directions for future research. Personnel psychology, 41(1):63–105, 1988

work page 1988
[46]

Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

work page 2013
[47]

Low-rank approximations for conditional feedforward computation in deep neural networks, 2013

Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks, 2013

work page 2013
[48]

Large scale parallel document mining for machine translation

Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, page 1101–1109, USA, 2010. Association for Computational Linguistics

work page 2010
[49]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002

work page 2002
[50]

Training deeper neural machine translation models with transparent attention

Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

work page 2018
[51]

Language modeling with deep transformers

Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Language modeling with deep transformers. Interspeech 2019, Sep 2019

work page 2019
[52]

Learning deep transformer models for machine translation

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

work page 2019
[53]

The evolved transformer

David R. So, Chen Liang, and Quoc V. Le. The evolved transformer, 2019

work page 2019
[54]

Using bfloat16 with TensorFlow models

Using bfloat16 with TensorFlow models. https://cloud.google.com/tpu/docs/bfloat16, 2020. Online; accessed 12 June 2020

work page 2020
[55]

Wide and deep learning for recommender systems

Heng-Tze Cheng, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah, Levent Koc, Jeremiah Harmsen, et al. Wide and deep learning for recommender systems. Proceedings of the 1st Workshop on Deep Learning for Recommender Systems - DLRS 2016, 2016

work page 2016
[56]

An analytic theory of generalization dynamics and transfer learning in deep linear networks

Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks, 2018

work page 2018
[57]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[58]

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012

work page 2012
[59]

Going deeper with convolutions

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

work page 2015
[60]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014

work page 2014
[61]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review arXiv 2014
[62]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

work page internal anchor Pith review arXiv 2016
[63]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012

work page 2012
[64]

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE, 2016

work page 2016
[65]

State-of-the-art speech recognition with sequence-to-sequence models

Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018

work page 2018
[66]

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016

work page internal anchor Pith review arXiv 2016
[67]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018

work page 2018
[68]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017

work page 2017
[69]

Exploring generalization in deep learning, 2017

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning, 2017

work page 2017
[70]

Special-purpose digital hardware for neural networks: An architectural survey

Paolo Ienne, Thierry Cornu, and Gary Kuhn. Special-purpose digital hardware for neural networks: An architectural survey. Journal of VLSI signal processing systems for signal, image and video technology, 13(1):5–25, 1996

work page 1996
[71]

Large-scale deep unsupervised learning using graphics processors

Rajat Raina, Anand Madhavan, and Andrew Y Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning, pages 873–880, 2009

work page 2009
[72]

Deep, big, simple neural nets for handwritten digit recognition

Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010

work page 2010
[73]

In-datacenter performance analysis of a tensor processing unit

Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017

work page 2017
[74]

2019 recent trends in GPU price per FLOPS

2019 recent trends in GPU price per FLOPS. https://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/. Accessed: 2020-06-05

work page 2019
[75]

Summarizing cpu and gpu design trends with product data

Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli. Summarizing cpu and gpu design trends with product data. arXiv preprint arXiv:1911.11313, 2019

work page arXiv 1911
[76]

Large scale distributed deep networks

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012

work page 2012
[77]

Theano: new features and speed improvements

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012

work page arXiv 2012
[78]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

work page 2017
[79]

Scalable parallel programming with cuda

John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, 2008

work page 2008
[80]

JAX: composable transformations of Python+NumPy programs

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs. 2018

work page 2018
[81]

Compiling machine learning programs via high-level tracing

Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. In Machine Learning and Systems (MLSys), 2018

work page 2018
[82]

Beyond Data and Model Parallelism for Deep Neural Networks

Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019

work page 2019

Showing first 80 references.