GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen · 2021

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.

Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.

citing papers explorer

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts · cs.CL · 2025-09-26 · unverdicted · none · ref 22
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models · cs.LG · 2026-05-15 · unverdicted · none · ref 82
