Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
representative citing papers
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
DeRegiME uses a sparse variational GP with nonstationary regime-mixing kernel to decompose forecasts into mean, residual regimes, and noise for improved probabilistic forecasting under distribution shift.
citing papers explorer
- Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
  ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
- Jamba: A Hybrid Transformer-Mamba Language Model
  Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
- DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift
  DeRegiME uses a sparse variational GP with nonstationary regime-mixing kernel to decompose forecasts into mean, residual regimes, and noise for improved probabilistic forecasting under distribution shift.