Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean · 2017

3 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

cs.RO · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.

Jamba: A Hybrid Transformer-Mamba Language Model

cs.CL · 2024-03-28 · conditional · novelty 7.0

Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.

DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DeRegiME uses a sparse variational GP with a nonstationary regime-mixing kernel to decompose forecasts into mean, residual regimes, and noise for improved probabilistic forecasting under distribution shift.

citing papers explorer

Showing 3 of 3 citing papers.

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

cs.RO · 2026-04-27 · unverdicted · none · ref 52 · 2 links

Jamba: A Hybrid Transformer-Mamba Language Model

cs.CL · 2024-03-28 · conditional · none · ref 46

DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift

cs.LG · 2026-05-19 · unverdicted · none · ref 86
