arXiv preprint arXiv:2402.01739

OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models · 2024 · arXiv 2402.01739

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

cs.CL · 2026-03-16 · conditional · novelty 7.0

Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.

Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

MATE is a multi-modal MoE trajectory policy using a cosine router and stochastic noise to improve expert balance, reporting 4.75% higher average success rate than prior methods on LIBERO under data scarcity.

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

R2E-IG combines residual refined experts with instance-level gating and mixed-distribution training using dynamic weight adaptation to improve generalization of DRL solvers for vehicle routing problems.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

cs.CV · 2026-04-09 · conditional · novelty 6.0

Multimodal MoE models exhibit 'Seeing but Not Thinking' due to routing distraction where visual inputs fail to activate reasoning experts; a targeted intervention improves results by up to 3.17% across models and benchmarks.

Token-Level LLM Collaboration via FusionRoute

cs.AI · 2026-01-08 · unverdicted · novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

cs.LG · 2025-03-07 · conditional · novelty 6.0

Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

cs.LG · 2026-04-19 · 2 refs

