arXiv preprint arXiv:2506.17368

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-Critical Expert Identification · 2025 · arXiv 2506.17368

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models

cs.CL · 2026-03-28 · unverdicted · novelty 7.0

Routing sensitivity in MoE models is necessary but insufficient for stereotype control because bias and knowledge remain entangled within expert groups and preference shifts do not transfer to generated text.

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Safety enforcement in aligned MoE LLMs is localized to specific experts and can be altered independently of the model's topic-driven routing patterns via a new red-teaming method called RASET.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs · cs.CR · 2026-05-06 · unverdicted · none · ref 20
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs · cs.LG · 2026-05-01 · unverdicted · none · ref 26
Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models · cs.CL · 2026-03-28 · unverdicted · none · ref 6
Expert-Aware Refusal Steering · cs.CL · 2026-06-02 · unverdicted · none · ref 2
Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs · cs.CL · 2026-05-28 · unverdicted · none · ref 3
