hub

Learning factored representations in a deep mixture of experts

David Eigen, Marc’Aurelio Ranzato, Ilya Sutskever · 2013 · cs.LG · arXiv 1312.4314

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

open full Pith review browse 13 citing papers arXiv PDF

abstract

Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that the different combinations are in use when the model is applied to a dataset of speech monophones. These demonstrate effective use of all expert combinations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

Mean-field limit from general mixtures of experts to quantum neural networks

math-ph · 2025-01-24 · unverdicted · novelty 7.0

Proves mean-field limit and propagation of chaos for gradient-flow trained mixtures of experts with explicit rate depending only on expert count, applied to quantum neural networks.

RouterBench: A Benchmark for Multi-LLM Routing System

cs.LG · 2024-03-18 · unverdicted · novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

SynCB adds a dynamic routing module and joint training to a hybrid concept-plus-neural architecture, reporting up to 3.9 pp higher accuracy than a full neural baseline and up to 6.43 pp better intervention responsiveness than prior hybrids across five datasets.

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

cs.LG · 2025-02-08 · unverdicted · novelty 6.0

TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

cs.LG · 2024-01-02 · unverdicted · novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.

Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

Patch-wise sparse MoE layers in CNNs for semantic segmentation yield architecture-dependent gains up to 3.9 mIoU on Cityscapes and BDD100K with low overhead, but show strong design sensitivity.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

cs.CL · 2022-02-17 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

MoASE++ combines activation sparsity experts with domain-adaptive on-policy distillation to achieve state-of-the-art continual test-time adaptation on image classification and segmentation benchmarks.

citing papers explorer

Showing 13 of 13 citing papers.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer cs.LG · 2017-01-23 · accept · none · ref 16
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization cs.LG · 2026-05-13 · unverdicted · none · ref 15 · internal anchor
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
Mean-field limit from general mixtures of experts to quantum neural networks math-ph · 2025-01-24 · unverdicted · none · ref 14 · internal anchor
Proves mean-field limit and propagation of chaos for gradient-flow trained mixtures of experts with explicit rate depending only on expert count, applied to quantum neural networks.
RouterBench: A Benchmark for Multi-LLM Routing System cs.LG · 2024-03-18 · unverdicted · none · ref 28 · internal anchor
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity cs.LG · 2021-01-11 · accept · none · ref 9
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches cs.CV · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
SynCB adds a dynamic routing module and joint training to a hybrid concept-plus-neural architecture, reporting up to 3.9 pp higher accuracy than a full neural baseline and up to 6.43 pp better intervention responsiveness than prior hybrids across five datasets.
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data cs.LG · 2025-02-08 · unverdicted · none · ref 139 · internal anchor
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models cs.CV · 2024-01-29 · conditional · none · ref 9 · internal anchor
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models cs.LG · 2024-01-02 · unverdicted · none · ref 7 · internal anchor
SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.
Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation cs.CV · 2026-04-15 · unverdicted · none · ref 8
Patch-wise sparse MoE layers in CNNs for semantic segmentation yield architecture-dependent gains up to 3.9 mIoU on Cityscapes and BDD100K with low overhead, but show strong design sensitivity.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 145
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation cs.CV · 2026-05-18 · unverdicted · none · ref 49 · internal anchor
MoASE++ combines activation sparsity experts with domain-adaptive on-policy distillation to achieve state-of-the-art continual test-time adaptation on image classification and segmentation benchmarks.

Learning factored representations in a deep mixture of experts

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer