Neural Computation , volume=

Hierarchical Mixtures of Experts, the EM Algorithm , author= · 1994 · DOI 10.1162/neco.1994.6.2.181

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open at publisher browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.

Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration

stat.ML · 2026-05-11 · unverdicted · novelty 6.0

A new MoE training method integrates expert-level losses and partial online updates to improve forecasting accuracy and efficiency over standard statistical and neural models.

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

cs.CL · 2024-01-11 · unverdicted · novelty 5.0

DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

cs.LG · 2026-05-04

citing papers explorer

Showing 5 of 5 citing papers.

Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning cs.LG · 2026-04-07 · unverdicted · none · ref 38
Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration stat.ML · 2026-05-11 · unverdicted · none · ref 18
A new MoE training method integrates expert-level losses and partial online updates to improve forecasting accuracy and efficiency over standard statistical and neural models.
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing cs.CV · 2026-05-15 · unverdicted · none · ref 2
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 27
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
Soft-to-Hard Routing in Sparse Mixture-of-Experts Models cs.LG · 2026-05-04 · unreviewed · ref 7

Neural Computation , volume=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer