LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
arXiv preprint arXiv:2402.01739
17 Pith papers cite this work.
representative citing papers
Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.
Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
MATE is a multi-modal MoE trajectory policy using a cosine router and stochastic noise to improve expert balance, reporting 4.75% higher average success rate than prior methods on LIBERO under data scarcity.
AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.
R2E-IG combines residual refined experts with instance-level gating and mixed-distribution training using dynamic weight adaptation to improve generalization of DRL solvers for vehicle routing problems.
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
Multimodal MoE models exhibit 'Seeing but Not Thinking' due to routing distraction where visual inputs fail to activate reasoning experts; a targeted intervention improves results by up to 3.17% across models and benchmarks.
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
citing papers explorer
- Hierarchical Mixture-of-Experts with Two-Stage Optimization