pith · machine review for the scientific record

arXiv:2605.02946 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI

Recognition: unknown

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Mixture-of-Experts · jailbreak attack · LLM safety · routing mechanism · adversarial suffix · expert localization · MoE vulnerability

The pith

Mixture-of-Experts LLMs can be jailbroken by optimizing inputs to suppress safety-critical experts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that safety alignment in MoE LLMs concentrates in a small number of experts, creating a route for input-based attacks that steer routing decisions away from those experts. RouteHijack first identifies the safety experts by comparing model activations on safe refusal responses versus harmful completions. It then optimizes an adversarial suffix using a routing-aware loss that discourages safety experts, encourages harmful ones, and blocks early refusals. This requires only black-box input access and works because the non-differentiable router can still be influenced through the prompt. The approach matters as MoE designs grow common for scaling capacity, yet current defenses focus on output behavior rather than internal routing.

Core claim

RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time the optimized suffix is appended to a malicious prompt, requiring only input access.
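
A minimal sketch of what this localization step could look like in practice, assuming white-box access during attack construction to record which experts the router selects at each layer. The shapes, function names, and toy data below are illustrative assumptions, not the authors' code.

    # Sketch of response-driven expert localization via activation contrast.
    # Assumes recorded routing decisions: (num_prompts, num_layers, num_experts),
    # 1 where the router selected that expert for at least one token.
    import numpy as np

    def expert_activation_freq(routing_records):
        # Per-(layer, expert) selection frequency across a prompt set.
        return routing_records.mean(axis=0)

    def safety_differential(refusal_records, harmful_records):
        # Contrast expert usage under safe refusals vs. harmful completions.
        # Positive scores mark candidate safety experts; negative, harmful ones.
        return (expert_activation_freq(refusal_records)
                - expert_activation_freq(harmful_records))

    def top_k_experts(diff, k, safety=True):
        # (layer, expert) indices of the k most safety- or harm-associated experts.
        order = np.argsort(diff.ravel())
        picked = order[-k:] if safety else order[:k]
        return [np.unravel_index(i, diff.shape) for i in picked]

    # Toy usage: 100 prompts, 24 MoE layers, 64 experts per layer.
    rng = np.random.default_rng(0)
    refusals = rng.random((100, 24, 64)) < 0.25
    harmful = rng.random((100, 24, 64)) < 0.25
    print(top_k_experts(safety_differential(refusals, harmful), k=5))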

What carries the argument

Response-driven expert localization via activation contrasting, paired with a routing-aware optimization objective for adversarial suffixes that directly targets expert selection patterns
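
The objective combines three terms; here is a hedged PyTorch sketch of how they could be wired together, assuming a hook that exposes per-layer router logits and using soft gate probabilities as a differentiable surrogate for the hard top-k router. Weights, masks, and shapes are assumptions, not the paper's equation.

    # Hedged sketch of a routing-aware suffix objective (not the paper's code).
    # safety_mask / harmful_mask come from a localization step as sketched above.
    import torch
    import torch.nn.functional as F

    def routing_aware_loss(router_logits, safety_mask, harmful_mask,
                           refusal_logprob, alpha=1.0, beta=1.0, gamma=1.0):
        # router_logits: (num_layers, seq_len, num_experts) for the current input.
        # refusal_logprob: scalar log-prob of a refusal prefix such as "I'm sorry".
        # Lower loss is better for the attacker: it suppresses routing mass on
        # safety experts, promotes harmful experts, and penalizes early refusal.
        gate = F.softmax(router_logits, dim=-1)   # soft routing probabilities
        p_safety = (gate * safety_mask[:, None, :]).sum(-1).mean()
        p_harm = (gate * harmful_mask[:, None, :]).sum(-1).mean()
        return alpha * p_safety - beta * p_harm + gamma * refusal_logprob

    # Toy shapes only; a real attack would search discrete suffix tokens
    # (GCG-style) guided by gradients of this loss at the embedding layer.
    L, T, E = 24, 16, 64
    safety_mask = torch.zeros(L, E, dtype=torch.bool); safety_mask[:, [3, 7]] = True
    harmful_mask = torch.zeros(L, E, dtype=torch.bool); harmful_mask[:, [11]] = True
    print(routing_aware_loss(torch.randn(L, T, E), safety_mask, harmful_mask,
                             refusal_logprob=torch.tensor(-2.5)))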

If this is right

  • Achieves a 69.3% average attack success rate across seven MoE LLMs, 3.2× the rate of prior optimization-based attacks
  • Zero-shot transfer raises average ASR from 27.7% to 61.2% across five sibling MoE variants
  • Generalizes to three MoE-based VLMs, lifting average ASR from 2.47% to 38.7%
  • Exposes that sparse expert routing creates vulnerabilities not addressed by output-level alignment

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety training may need to distribute refusal behavior across many more experts instead of concentrating it
  • Routing decisions themselves could become a target for monitoring or regularization in deployed MoE systems (a monitoring sketch follows this list)
  • Similar localization-plus-routing attacks may apply to other sparse or modular neural architectures
  • Scaling strategies that rely on expert sparsity introduce new attack surfaces absent from dense models
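
On the monitoring bullet above: one concrete shape such a defense could take is a per-layer comparison of a request's expert-usage histogram against a baseline profile built from benign traffic. This is an editorial sketch under assumed shapes, not a defense evaluated in the paper.

    # Sketch of a routing-distribution monitor (editorial extension, not the paper).
    import numpy as np

    def usage_histogram(selected_experts, num_experts):
        # selected_experts: (num_tokens,) expert ids chosen at one layer.
        counts = np.bincount(selected_experts, minlength=num_experts) + 1e-6
        return counts / counts.sum()   # smoothed usage distribution

    def routing_anomaly_score(request_usage, baseline_usage):
        # Mean per-layer KL(request || baseline); large values suggest routing
        # that has been steered away from its typical profile.
        kls = [(p * np.log(p / q)).sum() for p, q in zip(request_usage, baseline_usage)]
        return float(np.mean(kls))

    # Toy check: traffic concentrated on a few unusual experts scores high.
    rng = np.random.default_rng(1)
    baseline = [usage_histogram(rng.integers(0, 64, 512), 64) for _ in range(24)]
    benign = [usage_histogram(rng.integers(0, 64, 128), 64) for _ in range(24)]
    steered = [usage_histogram(rng.integers(0, 4, 128), 64) for _ in range(24)]
    print(routing_anomaly_score(benign, baseline), routing_anomaly_score(steered, baseline))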

Load-bearing premise

Safety behavior is concentrated in a small subset of experts that can be reliably localized by contrasting activations under safe refusals versus harmful completions and then suppressed via input optimization without privileged access

What would settle it

An experiment in which the localized safety experts are forced to activate on harmful prompts, yet refusal rates remain unchanged, would falsify the localization premise; a sketch of such a router intervention follows.
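
Operationally, assuming a hook point where router logits can be edited before top-k selection (the hook and the crude refusal detector below are hypothetical simplifications), the test could look like:

    # Sketch of the falsification test: force the localized safety experts to be
    # selected on harmful prompts and check whether refusal rates move.
    import torch

    def force_experts(router_logits, forced, boost=1e4):
        # router_logits: (seq_len, num_experts) at one MoE layer. Adding a large
        # constant guarantees the forced experts survive hard top-k selection.
        out = router_logits.clone()
        out[:, forced] += boost
        return out

    def refusal_rate(responses):
        # Crude keyword detector; real evaluations use stronger judges.
        markers = ("i'm sorry", "i cannot", "i can't", "as an ai")
        return sum(any(m in r.lower() for m in markers) for r in responses) / len(responses)

    logits = torch.randn(16, 64)
    assert force_experts(logits, forced=[3, 7])[:, [3, 7]].min() > logits.max()
    # If refusal_rate(outputs with forced safety experts) stays at the baseline
    # on harmful prompts, the localization premise fails.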

Figures

Figures reproduced from arXiv:2605.02946 by Joseph Gardiner, Lichao Wu, Sana Belguith, Zhiyuan Xu.

Figure 1. An overview of the RouteHijack framework.
Figure 2. Safety differential heatmap for DeepSeek. X- and …
Figure 3. The impact of adversarial suffix length (…
Figure 4. Activation heatmaps of the safety differential …
read the original abstract

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming prior optimization-based attacks by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RouteHijack, a routing-aware jailbreak for Mixture-of-Experts LLMs. It first localizes safety-critical experts via response-driven activation contrast between safe refusals and harmful completions, then optimizes input suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and avoids early refusal. At inference the suffix is appended to malicious prompts using only input access. Across seven MoE LLMs the method reports 69.3% average ASR (3.2× prior optimization-based attacks), zero-shot transfer to five sibling variants (27.7% → 61.2%), and generalization to three MoE VLMs (2.47% → 38.7%).

Significance. If the empirical results hold, the work is significant because it isolates a structural vulnerability in sparse MoE routing: safety behavior is concentrated in a small expert subset that can be targeted by input optimization after localization. The concrete ASR numbers, cross-model transfer, and VLM generalization provide falsifiable evidence that output-level alignment is insufficient for these architectures and motivate routing-aware defenses.

major comments (2)
  1. [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.
  2. [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.
minor comments (2)
  1. [§3.3] Notation for the routing-aware loss (Eq. 3 or equivalent) should explicitly state whether router logits are used directly or approximated, and whether the objective remains differentiable with respect to the input suffix (see the sketch after this list).
  2. [§1] The abstract and §1 state that prior optimization attacks are 'fundamentally limited on MoE models due to the non-differentiable routing mechanism'; a one-sentence clarification of how RouteHijack circumvents this (via the pre-computed expert mask) would help readers.
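
On minor comment 1, the standard resolution, sketched below as an assumption rather than a reading of the paper's Eq. 3, is to optimize the soft gate probabilities, which are differentiable in the suffix embeddings even though the deployed top-k router is not:

    # Sketch: soft router gates as a differentiable surrogate for hard top-k.
    import torch
    import torch.nn.functional as F

    def soft_gate(hidden, router_weight):
        # hidden: (seq_len, d_model); router_weight: (num_experts, d_model).
        # Softmax over router logits is differentiable end-to-end, so gradients
        # reach the input embeddings, whereas argmax/top-k would block them.
        return F.softmax(hidden @ router_weight.T, dim=-1)

    seq, d, e = 8, 32, 16
    emb = torch.randn(seq, d, requires_grad=True)            # suffix embeddings
    loss = soft_gate(emb, torch.randn(e, d))[:, :2].mean()   # mass on two experts
    loss.backward()
    print(emb.grad.shape)   # gradients flow: torch.Size([8, 32])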

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and will revise the manuscript to incorporate clarifications and additional details where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.

    Authors: We agree that the expert-localization procedure requires white-box access to compute per-expert activations under contrasting response conditions. The manuscript already states that the optimized suffix is applied at inference using only input access, but we acknowledge the need for clearer separation between phases. In the revision we will explicitly qualify in §3, the abstract, and the introduction that expert localization and suffix optimization are performed in a white-box setting, while the deployed attack requires only black-box input access. This will distinguish RouteHijack from model-intervention baselines that need ongoing privileged access during inference. revision: yes

  2. Referee: [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.

    Authors: We concur that variance estimates and fuller experimental details would strengthen verifiability. In the revised version we will augment Table 1 and the transfer tables with per-model standard deviations computed across multiple independent runs, state the exact number of prompts drawn from each benchmark, and supply additional implementation details for the baselines. Where space allows we will also report basic statistical comparisons (e.g., paired tests) to quantify the significance of the observed improvements. revision: yes
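
For the promised paired comparisons, per-prompt attack outcomes are binary and paired across methods, so a McNemar-style exact test on the discordant pairs is one suitable choice. The sketch below is a suggestion with toy outcomes at the reported ASRs, not the authors' analysis.

    # Sketch of a paired significance test for ASR gains (a suggestion only).
    import numpy as np
    from scipy.stats import binomtest

    def mcnemar_exact(success_a, success_b):
        # success_a/b: (num_prompts,) booleans for two attacks on the same prompts.
        # Tests whether the discordant counts depart from 50/50; assumes at
        # least one discordant pair exists.
        b = int(np.sum(success_a & ~success_b))   # A succeeds where B fails
        c = int(np.sum(~success_a & success_b))   # B succeeds where A fails
        return binomtest(b, b + c, 0.5).pvalue

    rng = np.random.default_rng(0)
    ours = rng.random(200) < 0.69        # toy per-prompt outcomes
    baseline = rng.random(200) < 0.22
    print(mcnemar_exact(ours, baseline))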

Circularity Check

0 steps flagged

No circularity: purely empirical attack with independent evaluation

full rationale

The paper describes a two-stage empirical procedure (response-driven expert localization via activation contrast followed by routing-aware suffix optimization), then reports measured attack success rates on held-out prompts across seven MoE LLMs, five sibling variants, and three VLMs. No equations, fitted parameters, or self-citations are invoked as load-bearing derivations; the reported ASRs (69.3% average, 3.2× improvement, zero-shot transfer gains) are direct experimental outcomes rather than quantities that reduce to the construction process by definition. The evaluation protocol is externally replicable and does not rely on any self-referential renaming or uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or new theoretical entities are introduced; the paper relies on standard assumptions of gradient-based optimization and activation-based expert attribution common in the adversarial ML literature.

axioms (1)
  • domain assumption: Expert activations under contrasting prompts reliably indicate functional specialization for safety versus harm.
    Invoked in the expert localization step described in the abstract.

pith-pipeline@v0.9.0 · 5625 in / 1251 out tokens · 26874 ms · 2026-05-09T19:19:11.609817+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 45 canonical work pages · 18 internal anchors

  1. [1]

    0xk1h0. 2023. ChatGPT_DAN: Jailbreak prompts for ChatGPT. https://github.com/0xk1h0/ChatGPT_DAN

  2. [2]

    Argilla. 2024. notux-8x7b-v1. https://huggingface.co/argilla/notux-8x7b-v1

  3. [3]

    Ankit Bisht, Lareina Yee, Roger Roberts, Brittany Presten, and Katherine Ottenbreit. 2025. Open Source Technology in the Age of AI. https://www.mckinsey.com/capabilities/quantumblack/our-insights/open-source-technology-in-the-age-of-ai

  4. [4]

    Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. 2024. An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247 (2024)

  5. [5]

    Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. 2025. Understanding (un)reliability of steering vectors in language models. arXiv preprint arXiv:2505.22637 (2025)

  6. [6]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. Authorea Preprints (2024)

  8. [8]

    Marmik Chaudhari, Jeremi Nuer, and Rome Thorstenson. 2025. Sparsity and Superposition in Mixture of Experts. arXiv:2510.23671 [cs.LG] https://arxiv.org/abs/2510.23671

  9. [9]

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509 (2025)

  10. [10]

    Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems 35 (2022), 23049–23062

  11. [11]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)

  12. [12]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, 177–190

  13. [13]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL] https://arxiv.org/abs/2401.06066

  14. [14]

    Marah Abdin et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] https://arxiv.org/abs/2404.14219

  15. [15]

    Yehui Tang et al. 2025. Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity. arXiv:2505.21411 [cs.CL] https://arxiv.org/abs/2505.21411

  16. [16]

    Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, and Nanyun Peng. 2025. Steering MoE LLMs via expert (de)activation. arXiv preprint arXiv:2509.09660 (2025)

  17. [17]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  18. [18]

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. FigStep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959

  19. [19]

    Hector He. 2024. Qwen1.5-MOE-sft-nemotron-code. https://huggingface.co/HectorHe/Qwen1.5-MOE-sft-nemotron-code

  20. [20]

    IBM. 2024. IBM Study: More Companies Turning to Open-Source AI Tools to Unlock ROI. https://newsroom.ibm.com/2024-12-19-IBM-Study-More-Companies-Turning-to-Open-Source-AI-Tools-to-Unlock-ROI

  21. [21]

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87

  22. [22]

    Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems 36 (2023), 24678–24704

  23. [23]

    Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. 2026. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs. arXiv preprint arXiv:2602.08621 (2026)

  24. [24]

    Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169 (2023)

  25. [25]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  26. [26]

    Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, and Jianqiang Li. 2025. SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification. arXiv:2506.17368 [cs.LG] https://arxiv.org/abs/2506.17368

  27. [27]

    Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, and Yu Hong. 2025. Exploiting the index gradients for optimization-based jailbreaking on large language models. In Proceedings of the 31st International Conference on Computational Linguistics. 4535–4547

  28. [28]

    Zeyi Liao and Huan Sun. 2024. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. arXiv preprint arXiv:2404.07921 (2024)

  29. [29]

    Runqi Lin, Bo Han, Fengwang Li, and Tongling Liu. 2025. Understanding and enhancing the transferability of jailbreaking attacks. arXiv preprint arXiv:2502.03052 (2025)

  30. [30]

    Jack Lindsey. 2026. Emergent introspective awareness in large language models. arXiv preprint arXiv:2601.01828 (2026)

  31. [31]

    Jona te Lintelo, Lichao Wu, and Stjepan Picek. 2026. Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing. arXiv preprint arXiv:2602.08741 (2026)

  32. [32]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  33. [33]

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. 2025. A survey of attacks on large vision–language models: Resources, advances, and future trends. IEEE Transactions on Neural Networks and Learning Systems (2025)

  34. [34]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)

  35. [35]

    AI @ Meta Llama Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  36. [36]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)

  37. [37]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2381–2391

  38. [38]

    Mistral AI. 2023. Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts/

  39. [39]

    Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. 2024. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309 (2024)

  40. [40]

    OpenAI. 2026. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/

  41. [41]

    OpenRouter. 2026. About OpenRouter. https://openrouter.ai/about

  42. [42]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744

  43. [43]

    Maziyar Panahi. 2024. Qwen1.5-MoE-A2.7B-Wikihow. https://huggingface.co/MaziyarPanahi/Qwen1.5-MoE-A2.7B-Wikihow

  44. [44]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277 (2023)

  45. [45]

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. 2024. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946 (2024)

  46. [46]

    Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477 (2025)

  47. [47]

    Qwen Team. 2024. Qwen-MoE: Scaling Open Large Language Models with Mixture-of-Experts. https://qwen.ai/blog?id=qwen-moe

  48. [48]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  49. [49]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9

  50. [50]

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15504–15522

  51. [51]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd Schema Challenge at scale. Commun. ACM 64, 9 (2021), 99–106

  52. [52]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  53. [53]

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

  54. [54]

    Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. 2024. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549 (2024)

  55. [55]

    Nicholas Sofroniew, Isaac Kauvar, William Saunders, R. Chen, Tom Henighan, S. Hydrie, Craig Citro, Adam Pearce, Jeremy Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelly Rivoire, K. Fish, Chris Olah, and Jack Lindsey. 2026. Emotion Concepts and their Function in a Large Language Model. https://transformer-circuits.pub/2026/emotions/index.html

  56. [56]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. 2024. A StrongREJECT for empty jailbreaks. Advances in Neural Information Processing Systems 37 (2024), 125416–125440

  57. [57]

    Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, and Peikang Hu. 2025. The Resurgence of GCG Adversarial Attacks on Large Language Models. arXiv preprint arXiv:2509.00391 (2025)

  58. [58]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. 2025. Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025)

  59. [59]

    Tencent Hunyuan Team. 2024. Hunyuan-A13B Technical Report. https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf. Technical report

  60. [60]

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940

  61. [61]

    Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. 2025. Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models. arXiv:2507.17702 [cs.CL] https://arxiv.org/abs/2507.17702

  63. [63]

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2024. Steering Language Models With Activation Engineering. arXiv:2308.10248 [cs.CL] https://arxiv.org/abs/2308.10248

  64. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  65. [65]

    Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In International Conference on Machine Learning. PMLR, 35413–35425

  66. [66]

    Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, and Daoyuan Wu. 2025. BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts. arXiv:2504.18598 [cs.CR] https://arxiv.org/abs/2504.18598

  67. [67]

    Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, and Cihang Xie. 2024. AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation. arXiv preprint arXiv:2410.09040 (2024)

  69. [69]

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625–641

  70. [70]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems 36 (2023), 80079–80110

  71. [71]

    Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Stjepan Picek, and Ahmad-Reza Sadeghi. 2025. GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs. arXiv preprint arXiv:2512.21008 (2025)

  72. [72]

    Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Maximilian Thang, Stjepan Picek, and Ahmad-Reza Sadeghi. 2025. NeuroStrike: Neuron-Level Attacks on Aligned LLMs. arXiv preprint arXiv:2509.11864 (2025)

  73. [73]

    Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, and Tingwen Liu. 2025. Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction. arXiv preprint arXiv:2509.15202 (2025)

  74. [74]

    Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner, and Sana Belguith. 2025. Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models. arXiv preprint arXiv:2511.17194 (2025)

  75. [75]

    Zhiyuan Xu, Joseph Gardiner, and Sana Belguith. 2025. The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models. arXiv preprint arXiv:2502.01225 (2025)

  76. [76]

    Zhiyuan Xu, Joseph Gardiner, and Sana Belguith. 2025. Reasoning That Leaks, Fine-Tuning That Amplifies: Exposing the Hidden Threats of Chain-of-Thought Models. In 21st ACM ASIA Conference on Computer and Communications Security. Association for Computing Machinery

  77. [77]

    An Yang et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  78. [78]

    Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, and Philip Torr. 2025. Mixture of Experts Made Intrinsically Interpretable. arXiv:2503.07639 [cs.LG] https://arxiv.org/abs/2503.07639

  79. [79]

    Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 (2024)

  80. [80]

    Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. 2025. Qwen3Guard Technical Report. arXiv preprint arXiv:2510.14276 (2025)

Showing first 80 references.