RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
Pith reviewed 2026-05-09 19:19 UTC · model grok-4.3
The pith
Mixture-of-Experts LLMs can be jailbroken by optimizing inputs to suppress safety-critical experts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time the optimized suffix is appended to a malicious prompt, requiring only input access.
What carries the argument
Response-driven expert localization via activation contrasting, paired with a routing-aware optimization objective for adversarial suffixes that directly targets expert selection patterns
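The localization step can be illustrated with a minimal sketch (hypothetical data and function names; the paper's actual statistic may differ): contrast per-expert routing frequencies between safe-refusal and harmful-completion traces, then rank experts by the difference.

```python
import numpy as np

def localize_experts(refusal_routes, harmful_routes, n_experts, top_k=4):
    """Contrast per-expert selection frequencies between safe-refusal and
    harmful-completion traces. Experts routed far more often under refusals
    are flagged safety-critical; the inverse set as harmful-leaning."""
    def freq(routes):
        counts = np.bincount(np.concatenate(routes), minlength=n_experts)
        return counts / counts.sum()
    delta = freq(refusal_routes) - freq(harmful_routes)  # positive: safety-leaning
    safety = np.argsort(delta)[-top_k:][::-1]   # highest refusal-vs-harm gap
    harmful = np.argsort(delta)[:top_k]         # lowest (most negative) gap
    return safety, harmful

# Toy traces: expert indices chosen by the router per token (hypothetical data).
refusals = [np.array([0, 0, 1, 2]), np.array([0, 1, 1, 3])]
harmfuls = [np.array([5, 6, 6, 7]), np.array([5, 5, 7, 4])]
safety_ids, harmful_ids = localize_experts(refusals, harmfuls, n_experts=8, top_k=2)
```

This frequency contrast is only one plausible instantiation; the paper may contrast activation magnitudes or router logits rather than discrete selections.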
If this is right
- Achieves 69.3% average attack success rate across seven MoE LLMs, 3.2 times higher than prior optimization-based attacks
- Zero-shot transfer raises average ASR from 27.7% to 61.2% across five sibling MoE variants
- Generalizes to three MoE-based VLMs, lifting average ASR from 2.47% to 38.7%
- Exposes that sparse expert routing creates vulnerabilities not addressed by output-level alignment
Where Pith is reading between the lines
- Safety training may need to distribute refusal behavior across many more experts instead of concentrating it
- Routing decisions themselves could become a target for monitoring or regularization in deployed MoE systems
- Similar localization-plus-routing attacks may apply to other sparse or modular neural architectures
- Scaling strategies that rely on expert sparsity introduce new attack surfaces absent from dense models
Load-bearing premise
Safety behavior is concentrated in a small subset of experts that can be reliably localized by contrasting activations under safe refusals versus harmful completions and then suppressed via input optimization without privileged access
What would settle it
An experiment in which the localized safety experts are forced to activate on harmful prompts yet refusal rates remain unchanged would falsify the localization premise
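That falsification test could be sketched as follows, assuming white-box access to router logits; `force_experts` and the toy logits are illustrative, not the paper's protocol.

```python
import numpy as np

def force_experts(router_logits, forced, bias=1e4):
    """Bias router logits so the localized safety experts always enter
    the top-k active set, regardless of the (possibly adversarial) input."""
    logits = router_logits.copy()
    logits[..., forced] += bias
    return logits

def top_k_experts(logits, k=2):
    return np.argsort(logits, axis=-1)[..., -k:]

# Hypothetical router logits for 3 tokens over 8 experts.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8))
forced = [0, 1]  # experts localized as safety-critical
active = top_k_experts(force_experts(logits, forced), k=2)
# Every token now routes through the forced safety experts. If refusal rates
# on harmful prompts still do not change, the localization premise fails.
```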
original abstract
Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited to MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming prior optimization-based attacks by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RouteHijack, a routing-aware jailbreak for Mixture-of-Experts LLMs. It first localizes safety-critical experts via response-driven activation contrast between safe refusals and harmful completions, then optimizes input suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and avoids early refusal. At inference the suffix is appended to malicious prompts using only input access. Across seven MoE LLMs the method reports 69.3% average ASR (3.2× prior optimization-based attacks), zero-shot transfer to five sibling variants (27.7% → 61.2%), and generalization to three MoE VLMs (2.47% → 38.7%).
Significance. If the empirical results hold, the work is significant because it isolates a structural vulnerability in sparse MoE routing: safety behavior is concentrated in a small expert subset that can be targeted by input optimization after localization. The concrete ASR numbers, cross-model transfer, and VLM generalization provide falsifiable evidence that output-level alignment is insufficient for these architectures and motivate routing-aware defenses.
major comments (2)
- [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.
- [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.
minor comments (2)
- [§3.3] Notation for the routing-aware loss (Eq. 3 or equivalent) should explicitly state whether router logits are used directly or approximated, and whether the objective remains differentiable with respect to the input suffix.
- [§1] The abstract and §1 state that prior optimization attacks are 'fundamentally limited to MoE models due to the non-differentiable routing mechanism'; a one-sentence clarification of how RouteHijack circumvents this (via the pre-computed expert mask) would help readers.
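On the differentiability question raised in the first minor comment, a common resolution is a softmax surrogate over router logits. This sketch (illustrative, not the paper's Eq. 3) shows why such an objective avoids the non-differentiable top-k selection entirely.

```python
import numpy as np

def routing_loss(router_logits, safety_ids, harmful_ids):
    """Surrogate routing-aware objective sketch: the softmax over router
    logits is differentiable w.r.t. the inputs that produce the logits,
    so the non-differentiable top-k selection never enters the loss.
    Minimizing it suppresses safety experts and promotes harmful ones."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs[..., safety_ids].sum() - probs[..., harmful_ids].sum()

# Uniform router: equal mass on both expert sets, so the loss starts at 0.
logits = np.zeros((1, 8))
base = routing_loss(logits, [0, 1], [5, 6])
logits[0, [5, 6]] += 2.0  # a suffix that shifts router mass toward harmful experts
shifted = routing_loss(logits, [0, 1], [5, 6])  # strictly lower than base
```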
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below and will revise the manuscript to incorporate clarifications and additional details where appropriate.
point-by-point responses
-
Referee: [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.
Authors: We agree that the expert-localization procedure requires white-box access to compute per-expert activations under contrasting response conditions. The manuscript already states that the optimized suffix is applied at inference using only input access, but we acknowledge the need for clearer separation between phases. In the revision we will explicitly qualify in §3, the abstract, and the introduction that expert localization and suffix optimization are performed in a white-box setting, while the deployed attack requires only black-box input access. This will distinguish RouteHijack from model-intervention baselines that need ongoing privileged access during inference. revision: yes
-
Referee: [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.
Authors: We concur that variance estimates and fuller experimental details would strengthen verifiability. In the revised version we will augment Table 1 and the transfer tables with per-model standard deviations computed across multiple independent runs, state the exact number of prompts drawn from each benchmark, and supply additional implementation details for the baselines. Where space allows we will also report basic statistical comparisons (e.g., paired tests) to quantify the significance of the observed improvements. revision: yes
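The kind of robustness check the referee requests can be sketched with a one-sided paired bootstrap over per-model ASR differences; the numbers below are hypothetical gains, not the paper's data.

```python
import random
import statistics

def paired_bootstrap(deltas, n_boot=10_000, seed=0):
    """One-sided paired bootstrap: resample per-model ASR differences
    (attack minus baseline), returning the mean improvement and the
    fraction of resamples whose mean is <= 0 (an approximate p-value)."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(deltas, k=len(deltas)))
             for _ in range(n_boot)]
    p = sum(m <= 0 for m in means) / n_boot
    return statistics.fmean(deltas), p

# Hypothetical per-model ASR gains (percentage points) over the baseline.
gains = [41.0, 52.5, 38.0, 47.0, 55.0, 33.5, 44.0]
mean_gain, p_value = paired_bootstrap(gains)
```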
Circularity Check
No circularity: purely empirical attack with independent evaluation
full rationale
The paper describes a two-stage empirical procedure—response-driven expert localization via activation contrast followed by routing-aware suffix optimization—then reports measured attack success rates on held-out prompts across seven MoE LLMs, five sibling variants, and three VLMs. No equations, fitted parameters, or self-citations are invoked as load-bearing derivations; the reported ASRs (69.3% average, 3.2× improvement, zero-shot transfer gains) are direct experimental outcomes rather than quantities that reduce to the construction process by definition. The evaluation protocol is externally replicable and does not rely on any self-referential renaming or uniqueness theorem.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert activations under contrasting prompts reliably indicate functional specialization for safety versus harm.