EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.
Gshard: Scaling giant models with conditional computation and automatic sharding
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.
citing papers explorer
-
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.
-
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models
DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.