pith · machine review for the scientific record

arXiv:2605.02946 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI

Recognition: unknown

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Mixture-of-Experts · jailbreak attack · LLM safety · routing mechanism · adversarial suffix · expert localization · MoE vulnerability

The pith

Mixture-of-Experts LLMs can be jailbroken by optimizing inputs to suppress safety-critical experts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that safety alignment in MoE LLMs concentrates in a small number of experts, creating a route for input-based attacks that steer routing decisions away from those experts. RouteHijack first identifies the safety experts by comparing model activations on safe refusal responses versus harmful completions. It then optimizes an adversarial suffix using a routing-aware loss that discourages safety experts, encourages harmful ones, and blocks early refusals. This requires only black-box input access and works because the non-differentiable router can still be influenced through the prompt. The approach matters as MoE designs grow common for scaling capacity, yet current defenses focus on output behavior rather than internal routing.

Core claim

RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time the optimized suffix is appended to a malicious prompt, requiring only input access.
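
A minimal sketch of what this localization step could look like in practice, assuming white-box access during attack construction to record which experts the router selects at each layer. The shapes, function names, and toy data below are illustrative assumptions, not the authors' code.

    # Sketch of response-driven expert localization via activation contrast.
    # Assumes recorded routing decisions: (num_prompts, num_layers, num_experts),
    # 1 where the router selected that expert for at least one token.
    import numpy as np

    def expert_activation_freq(routing_records):
        # Per-(layer, expert) selection frequency across a prompt set.
        return routing_records.mean(axis=0)

    def safety_differential(refusal_records, harmful_records):
        # Contrast expert usage under safe refusals vs. harmful completions.
        # Positive scores mark candidate safety experts; negative, harmful ones.
        return (expert_activation_freq(refusal_records)
                - expert_activation_freq(harmful_records))

    def top_k_experts(diff, k, safety=True):
        # (layer, expert) indices of the k most safety- or harm-associated experts.
        order = np.argsort(diff.ravel())
        picked = order[-k:] if safety else order[:k]
        return [np.unravel_index(i, diff.shape) for i in picked]

    # Toy usage: 100 prompts, 24 MoE layers, 64 experts per layer.
    rng = np.random.default_rng(0)
    refusals = rng.random((100, 24, 64)) < 0.25
    harmful = rng.random((100, 24, 64)) < 0.25
    print(top_k_experts(safety_differential(refusals, harmful), k=5))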

What carries the argument

Response-driven expert localization via activation contrasting, paired with a routing-aware optimization objective for adversarial suffixes that directly targets expert selection patterns
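
The objective combines three terms; here is a hedged PyTorch sketch of how they could be wired together, assuming a hook that exposes per-layer router logits and using soft gate probabilities as a differentiable surrogate for the hard top-k router. Weights, masks, and shapes are assumptions, not the paper's equation.

    # Hedged sketch of a routing-aware suffix objective (not the paper's code).
    # safety_mask / harmful_mask come from a localization step as sketched above.
    import torch
    import torch.nn.functional as F

    def routing_aware_loss(router_logits, safety_mask, harmful_mask,
                           refusal_logprob, alpha=1.0, beta=1.0, gamma=1.0):
        # router_logits: (num_layers, seq_len, num_experts) for the current input.
        # refusal_logprob: scalar log-prob of a refusal prefix such as "I'm sorry".
        # Lower loss is better for the attacker: it suppresses routing mass on
        # safety experts, promotes harmful experts, and penalizes early refusal.
        gate = F.softmax(router_logits, dim=-1)   # soft routing probabilities
        p_safety = (gate * safety_mask[:, None, :]).sum(-1).mean()
        p_harm = (gate * harmful_mask[:, None, :]).sum(-1).mean()
        return alpha * p_safety - beta * p_harm + gamma * refusal_logprob

    # Toy shapes only; a real attack would search discrete suffix tokens
    # (GCG-style) guided by gradients of this loss at the embedding layer.
    L, T, E = 24, 16, 64
    safety_mask = torch.zeros(L, E, dtype=torch.bool); safety_mask[:, [3, 7]] = True
    harmful_mask = torch.zeros(L, E, dtype=torch.bool); harmful_mask[:, [11]] = True
    print(routing_aware_loss(torch.randn(L, T, E), safety_mask, harmful_mask,
                             refusal_logprob=torch.tensor(-2.5)))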

If this is right

  • Achieves a 69.3% average attack success rate across seven MoE LLMs, 3.2× the rate of prior optimization-based attacks
  • Zero-shot transfer raises average ASR from 27.7% to 61.2% across five sibling MoE variants
  • Generalizes to three MoE-based VLMs, lifting average ASR from 2.47% to 38.7%
  • Exposes that sparse expert routing creates vulnerabilities not addressed by output-level alignment

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety training may need to distribute refusal behavior across many more experts instead of concentrating it
  • Routing decisions themselves could become a target for monitoring or regularization in deployed MoE systems (a monitoring sketch follows this list)
  • Similar localization-plus-routing attacks may apply to other sparse or modular neural architectures
  • Scaling strategies that rely on expert sparsity introduce new attack surfaces absent from dense models
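
On the monitoring bullet above: one concrete shape such a defense could take is a per-layer comparison of a request's expert-usage histogram against a baseline profile built from benign traffic. This is an editorial sketch under assumed shapes, not a defense evaluated in the paper.

    # Sketch of a routing-distribution monitor (editorial extension, not the paper).
    import numpy as np

    def usage_histogram(selected_experts, num_experts):
        # selected_experts: (num_tokens,) expert ids chosen at one layer.
        counts = np.bincount(selected_experts, minlength=num_experts) + 1e-6
        return counts / counts.sum()   # smoothed usage distribution

    def routing_anomaly_score(request_usage, baseline_usage):
        # Mean per-layer KL(request || baseline); large values suggest routing
        # that has been steered away from its typical profile.
        kls = [(p * np.log(p / q)).sum() for p, q in zip(request_usage, baseline_usage)]
        return float(np.mean(kls))

    # Toy check: traffic concentrated on a few unusual experts scores high.
    rng = np.random.default_rng(1)
    baseline = [usage_histogram(rng.integers(0, 64, 512), 64) for _ in range(24)]
    benign = [usage_histogram(rng.integers(0, 64, 128), 64) for _ in range(24)]
    steered = [usage_histogram(rng.integers(0, 4, 128), 64) for _ in range(24)]
    print(routing_anomaly_score(benign, baseline), routing_anomaly_score(steered, baseline))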

Load-bearing premise

Safety behavior is concentrated in a small subset of experts that can be reliably localized by contrasting activations under safe refusals versus harmful completions and then suppressed via input optimization without privileged access

What would settle it

An experiment in which the localized safety experts are forced to activate on harmful prompts, yet refusal rates remain unchanged, would falsify the localization premise; a sketch of such a router intervention follows.
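
Operationally, assuming a hook point where router logits can be edited before top-k selection (the hook and the crude refusal detector below are hypothetical simplifications), the test could look like:

    # Sketch of the falsification test: force the localized safety experts to be
    # selected on harmful prompts and check whether refusal rates move.
    import torch

    def force_experts(router_logits, forced, boost=1e4):
        # router_logits: (seq_len, num_experts) at one MoE layer. Adding a large
        # constant guarantees the forced experts survive hard top-k selection.
        out = router_logits.clone()
        out[:, forced] += boost
        return out

    def refusal_rate(responses):
        # Crude keyword detector; real evaluations use stronger judges.
        markers = ("i'm sorry", "i cannot", "i can't", "as an ai")
        return sum(any(m in r.lower() for m in markers) for r in responses) / len(responses)

    logits = torch.randn(16, 64)
    assert force_experts(logits, forced=[3, 7])[:, [3, 7]].min() > logits.max()
    # If refusal_rate(outputs with forced safety experts) stays at the baseline
    # on harmful prompts, the localization premise fails.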

Figures

Figures reproduced from arXiv:2605.02946 by Joseph Gardiner, Lichao Wu, Sana Belguith, Zhiyuan Xu.

Figure 1. An overview of the RouteHijack framework.
Figure 2. Safety differential heatmap for DeepSeek. X- and …
Figure 3. The impact of adversarial suffix length (…
Figure 4. Activation heatmaps of the safety differential …
read the original abstract

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming prior optimization-based attacks by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RouteHijack, a routing-aware jailbreak for Mixture-of-Experts LLMs. It first localizes safety-critical experts via response-driven activation contrast between safe refusals and harmful completions, then optimizes input suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and avoids early refusal. At inference the suffix is appended to malicious prompts using only input access. Across seven MoE LLMs the method reports 69.3% average ASR (3.2× prior optimization-based attacks), zero-shot transfer to five sibling variants (27.7% → 61.2%), and generalization to three MoE VLMs (2.47% → 38.7%).

Significance. If the empirical results hold, the work is significant because it isolates a structural vulnerability in sparse MoE routing: safety behavior is concentrated in a small expert subset that can be targeted by input optimization after localization. The concrete ASR numbers, cross-model transfer, and VLM generalization provide falsifiable evidence that output-level alignment is insufficient for these architectures and motivate routing-aware defenses.

major comments (2)
  1. [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.
  2. [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.
minor comments (2)
  1. [§3.3] Notation for the routing-aware loss (Eq. 3 or equivalent) should explicitly state whether router logits are used directly or approximated, and whether the objective remains differentiable with respect to the input suffix (see the sketch after this list).
  2. [§1] The abstract and §1 state that prior optimization attacks are 'fundamentally limited on MoE models due to the non-differentiable routing mechanism'; a one-sentence clarification of how RouteHijack circumvents this (via the pre-computed expert mask) would help readers.
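
On minor comment 1, the standard resolution, sketched below as an assumption rather than a reading of the paper's Eq. 3, is to optimize the soft gate probabilities, which are differentiable in the suffix embeddings even though the deployed top-k router is not:

    # Sketch: soft router gates as a differentiable surrogate for hard top-k.
    import torch
    import torch.nn.functional as F

    def soft_gate(hidden, router_weight):
        # hidden: (seq_len, d_model); router_weight: (num_experts, d_model).
        # Softmax over router logits is differentiable end-to-end, so gradients
        # reach the input embeddings, whereas argmax/top-k would block them.
        return F.softmax(hidden @ router_weight.T, dim=-1)

    seq, d, e = 8, 32, 16
    emb = torch.randn(seq, d, requires_grad=True)            # suffix embeddings
    loss = soft_gate(emb, torch.randn(e, d))[:, :2].mean()   # mass on two experts
    loss.backward()
    print(emb.grad.shape)   # gradients flow: torch.Size([8, 32])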

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and will revise the manuscript to incorporate clarifications and additional details where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Method), expert-localization paragraph: the procedure contrasts per-expert activations under safe refusals versus harmful completions. This step requires direct inspection of hidden states or router logits, which constitutes privileged access during attack construction. The paper contrasts RouteHijack against 'model intervention methods' that need privileged access, yet the localization phase uses exactly that access; the claim that the attack 'overcomes the non-differentiable routing limitation without privileged access' therefore needs explicit qualification that construction is white-box while inference is input-only.

    Authors: We agree that the expert-localization procedure requires white-box access to compute per-expert activations under contrasting response conditions. The manuscript already states that the optimized suffix is applied at inference using only input access, but we acknowledge the need for clearer separation between phases. In the revision we will explicitly qualify in §3, the abstract, and the introduction that expert localization and suffix optimization are performed in a white-box setting, while the deployed attack requires only black-box input access. This will distinguish RouteHijack from model-intervention baselines that need ongoing privileged access during inference. revision: yes

  2. Referee: [§4.2] §4.2 (Results), Table 1 and transfer tables: the 69.3% average ASR and 3.2× improvement are reported without per-model standard deviations, exact prompt counts, or statistical tests. Because the central claim rests on these quantitative gains and on zero-shot transfer, the absence of variance estimates and baseline implementation details makes it impossible to verify that the reported margins are robust rather than sensitive to prompt selection or random seeds.

    Authors: We concur that variance estimates and fuller experimental details would strengthen verifiability. In the revised version we will augment Table 1 and the transfer tables with per-model standard deviations computed across multiple independent runs, state the exact number of prompts drawn from each benchmark, and supply additional implementation details for the baselines. Where space allows we will also report basic statistical comparisons (e.g., paired tests) to quantify the significance of the observed improvements. revision: yes
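
For the promised paired comparisons, per-prompt attack outcomes are binary and paired across methods, so a McNemar-style exact test on the discordant pairs is one suitable choice. The sketch below is a suggestion with toy outcomes at the reported ASRs, not the authors' analysis.

    # Sketch of a paired significance test for ASR gains (a suggestion only).
    import numpy as np
    from scipy.stats import binomtest

    def mcnemar_exact(success_a, success_b):
        # success_a/b: (num_prompts,) booleans for two attacks on the same prompts.
        # Tests whether the discordant counts depart from 50/50; assumes at
        # least one discordant pair exists.
        b = int(np.sum(success_a & ~success_b))   # A succeeds where B fails
        c = int(np.sum(~success_a & success_b))   # B succeeds where A fails
        return binomtest(b, b + c, 0.5).pvalue

    rng = np.random.default_rng(0)
    ours = rng.random(200) < 0.69        # toy per-prompt outcomes
    baseline = rng.random(200) < 0.22
    print(mcnemar_exact(ours, baseline))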

Circularity Check

0 steps flagged

No circularity: purely empirical attack with independent evaluation

full rationale

The paper describes a two-stage empirical procedure (response-driven expert localization via activation contrast followed by routing-aware suffix optimization), then reports measured attack success rates on held-out prompts across seven MoE LLMs, five sibling variants, and three VLMs. No equations, fitted parameters, or self-citations are invoked as load-bearing derivations; the reported ASRs (69.3% average, 3.2× improvement, zero-shot transfer gains) are direct experimental outcomes rather than quantities that reduce to the construction process by definition. The evaluation protocol is externally replicable and does not rely on any self-referential renaming or uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or new theoretical entities are introduced; the paper relies on standard assumptions of gradient-based optimization and activation-based expert attribution common in the adversarial ML literature.

axioms (1)
  • domain assumption: Expert activations under contrasting prompts reliably indicate functional specialization for safety versus harm.
    Invoked in the expert localization step described in the abstract.

pith-pipeline@v0.9.0 · 5625 in / 1251 out tokens · 26874 ms · 2026-05-09T19:19:11.609817+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 45 canonical work pages · 18 internal anchors

  1. [1]

    0xk1h0. 2023. ChatGPT_DAN: Jailbreak prompts for ChatGPT. https://github.com/0xk1h0/ChatGPT_DAN

  2. [2]

    Argilla. 2024. notux-8x7b-v1. https://huggingface.co/argilla/notux-8x7b-v1

  3. [3]

    Ankit Bisht, Lareina Yee, Roger Roberts, Brittany Presten, and Katherine Ottenbreit. 2025. Open Source Technology in the Age of AI. https://www.mckinsey.com/capabilities/quantumblack/our-insights/open-source-technology-in-the-age-of-ai

  4. [4]

    Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al. 2024. An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247 (2024)

  5. [5]

    Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. 2025. Understanding (un)reliability of steering vectors in language models. arXiv preprint arXiv:2505.22637 (2025)

  6. [6]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. Authorea Preprints (2024)

  8. [8]

    Marmik Chaudhari, Jeremi Nuer, and Rome Thorstenson. 2025. Sparsity and Superposition in Mixture of Experts. arXiv:2510.23671 [cs.LG] https://arxiv.org/abs/2510.23671

  9. [9]

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509 (2025)

  10. [10]

    Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems 35 (2022), 23049–23062

  11. [11]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)

  12. [12]

    Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, 177–190

  13. [13]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL] https://arxiv.org/abs/2401.06066

  14. [14]

    Marah Abdin et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] https://arxiv.org/abs/2404.14219

  15. [15]

    Yehui Tang et al. 2025. Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity. arXiv:2505.21411 [cs.CL] https://arxiv.org/abs/2505.21411

  16. [16]

    Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, and Nanyun Peng. 2025. Steering MoE LLMs via expert (de)activation. arXiv preprint arXiv:2509.09660 (2025)

  17. [17]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  18. [18]

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. FigStep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959

  19. [19]

    Hector He. 2024. Qwen1.5-MOE-sft-nemotron-code. https://huggingface.co/HectorHe/Qwen1.5-MOE-sft-nemotron-code

  20. [20]

    IBM. 2024. IBM Study: More Companies Turning to Open-Source AI Tools to Unlock ROI. https://newsroom.ibm.com/2024-12-19-IBM-Study-More-Companies-Turning-to-Open-Source-AI-Tools-to-Unlock-ROI

  21. [21]

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87

  22. [22]

    Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems 36 (2023), 24678–24704

  23. [23]

    Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. 2026. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs. arXiv preprint arXiv:2602.08621 (2026)

  24. [24]

    Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169 (2023)

  25. [25]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  26. [26]

    Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, and Jianqiang Li. 2025. SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification. arXiv:2506.17368 [cs.LG] https://arxiv.org/abs/2506.17368

  27. [27]

    Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, and Yu Hong. 2025. Exploiting the index gradients for optimization-based jailbreaking on large language models. In Proceedings of the 31st International Conference on Computational Linguistics. 4535–4547

  28. [28]

    Zeyi Liao and Huan Sun. 2024. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. arXiv preprint arXiv:2404.07921 (2024)

  29. [29]

    Runqi Lin, Bo Han, Fengwang Li, and Tongling Liu. 2025. Understanding and enhancing the transferability of jailbreaking attacks. arXiv preprint arXiv:2502.03052 (2025)

  30. [30]

    Jack Lindsey. 2026. Emergent introspective awareness in large language models. arXiv preprint arXiv:2601.01828 (2026)

  31. [31]

    Jona te Lintelo, Lichao Wu, and Stjepan Picek. 2026. Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing. arXiv preprint arXiv:2602.08741 (2026)

  32. [32]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  33. [33]

    Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. 2025. A survey of attacks on large vision–language models: Resources, advances, and future trends. IEEE Transactions on Neural Networks and Learning Systems (2025)

  34. [34]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)

  35. [35]

    AI @ Meta Llama Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  36. [36]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)

  37. [37]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2381–2391

  38. [38]

    Mistral AI. 2023. Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts/

  39. [39]

    Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. 2024. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309 (2024)

  40. [40]

    OpenAI. 2026. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/

  41. [41]

    OpenRouter. 2026. About OpenRouter. https://openrouter.ai/about

  42. [42]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744

  43. [43]

    Maziyar Panahi. 2024. Qwen1.5-MoE-A2.7B-Wikihow. https://huggingface.co/MaziyarPanahi/Qwen1.5-MoE-A2.7B-Wikihow

  44. [44]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277 (2023)

  45. [45]

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. 2024. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946 (2024)

  46. [46]

    Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477 (2025)

  47. [47]

    Qwen Team. 2024. Qwen-MoE: Scaling Open Large Language Models with Mixture-of-Experts. https://qwen.ai/blog?id=qwen-moe

  48. [48]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  49. [49]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9

  50. [50]

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15504–15522

  51. [51]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd Schema Challenge at scale. Commun. ACM 64, 9 (2021), 99–106

  52. [52]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  53. [53]

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

  54. [54]

    Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. 2024. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549 (2024)

  55. [55]

    Nicholas Sofroniew, Isaac Kauvar, William Saunders, R. Chen, Tom Henighan, S. Hydrie, Craig Citro, Adam Pearce, Jeremy Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelly Rivoire, K. Fish, Chris Olah, and Jack Lindsey. 2026. Emotion Concepts and their Function in a Large Language Model. https://transformer-circuits.pub/2026/emotions/index.html

  56. [56]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. 2024. A StrongREJECT for empty jailbreaks. Advances in Neural Information Processing Systems 37 (2024), 125416–125440

  57. [57]

    Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, and Peikang Hu. 2025. The Resurgence of GCG Adversarial Attacks on Large Language Models. arXiv preprint arXiv:2509.00391 (2025)

  58. [58]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. 2025. Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025)

  59. [59]

    Tencent Hunyuan Team. 2024. Hunyuan-A13B Technical Report. https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf. Technical report

  60. [60]

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940

  61. [61]

    Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. 2025. Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models. arXiv:2507.17702 [cs.CL] https://arxiv.org/abs/2507.17702

  63. [63]

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2024. Steering Language Models With Activation Engineering. arXiv:2308.10248 [cs.CL] https://arxiv.org/abs/2308.10248

  64. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  65. [65]

    Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In International Conference on Machine Learning. PMLR, 35413–35425

  66. [66]

    Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, and Daoyuan Wu. 2025. BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts. arXiv:2504.18598 [cs.CR] https://arxiv.org/abs/2504.18598

  67. [67]

    Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, and Cihang Xie. 2024. AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation. arXiv preprint arXiv:2410.09040 (2024)

  69. [69]

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625–641

  70. [70]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems 36 (2023), 80079–80110

  71. [71]

    Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Stjepan Picek, and Ahmad-Reza Sadeghi. 2025. GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs. arXiv preprint arXiv:2512.21008 (2025)

  72. [72]

    Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Maximilian Thang, Stjepan Picek, and Ahmad-Reza Sadeghi. 2025. NeuroStrike: Neuron-Level Attacks on Aligned LLMs. arXiv preprint arXiv:2509.11864 (2025)

  73. [73]

    Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, and Tingwen Liu. 2025. Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction. arXiv preprint arXiv:2509.15202 (2025)

  74. [74]

    Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner, and Sana Belguith. 2025. Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models. arXiv preprint arXiv:2511.17194 (2025)

  75. [75]

    Zhiyuan Xu, Joseph Gardiner, and Sana Belguith. 2025. The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models. arXiv preprint arXiv:2502.01225 (2025)

  76. [76]

    Zhiyuan Xu, Joseph Gardiner, and Sana Belguith. 2025. Reasoning That Leaks, Fine-Tuning That Amplifies: Exposing the Hidden Threats of Chain-of-Thought Models. In 21st ACM ASIA Conference on Computer and Communications Security. Association for Computing Machinery

  77. [77]

    An Yang et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  78. [78]

    Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, and Philip Torr. 2025. Mixture of Experts Made Intrinsically Interpretable. arXiv:2503.07639 [cs.LG] https://arxiv.org/abs/2503.07639

  79. [79]

    Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 (2024)

  80. [80]

    Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. 2025. Qwen3Guard Technical Report. arXiv preprint arXiv:2510.14276 (2025)

Showing first 80 references.