Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
arXiv preprint arXiv:2506.17368
5 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
Roles: background (1)
Polarities: background (1)
Citing papers
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
  Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
  RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
- Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
  Routing sensitivity in MoE models is necessary but insufficient for stereotype control because bias and knowledge remain entangled within expert groups and preference shifts do not transfer to generated text.
- Expert-Aware Refusal Steering
  Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
- Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs
  Safety enforcement in aligned MoE LLMs is localized to specific experts and can be altered independently of the model's topic-driven routing patterns via a new red-teaming method called RASET.