Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

Md Nurul Absar Siddiky

arxiv: 2605.24270 · v1 · pith:J4OKRBCMnew · submitted 2026-05-22 · 💻 cs.AI · cs.CR

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

Md Nurul Absar Siddiky This is my paper

Pith reviewed 2026-06-30 15:18 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords mixtralmixture of expertsrouting analysissafetybenign promptsharmful promptsactivation scoresgradient scores

0 comments

The pith

Mixtral's safety routing spreads across many experts and layers rather than concentrating in a fixed set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines routing in Mixtral 8x7B-Instruct by comparing how often experts activate and how sensitive the router gates are to gradients, for both benign and harmful prompts. Activation signals show broad long-tailed expert use while gradient signals concentrate importance, with both indicating only modest separation between prompt groups and most experts shared. Selectivity peaks in middle layers under activation measures and final layers under gradient measures. Expert suppression experiments show partial reductions in restricted outputs depending on which experts are removed. The overall picture is that safety routing remains subtle and spread out rather than localized to particular experts.

Core claim

Safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts. Activation-based expert usage is broad and long-tailed whereas gradient-based importance is concentrated. At the expert level, benign and harmful groups stay close under both signals. At the layer level, activation routing selects most around layers 8-15 while gradient importance concentrates in final layers. Most experts are shared across groups though a limited subset shows preference, and top expert sets overlap more under gradient scores. Suppressing top experts reduces restricted responses in intervention tests.

What carries the argument

Activation-based routing scores from expert selection frequencies paired with gradient-based scores from router-gate sensitivities, used for layer-wise and expert-wise analysis plus suppression interventions.

If this is right

Suppressing the top five activation-derived benign-dominant experts reduces restricted responses from 24 to 14 over 100 prompts.
Suppressing gradient-derived experts reduces restricted responses from 34 to 22 with fewer unintended reversals.
Activation-based routing is most selective around layers 8-15.
Gradient-based importance concentrates in final layers.
Top-ranked expert sets overlap more strongly between groups under gradient scores than under activation scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety interventions in similar MoE models may need to adjust router parameters across multiple layers instead of masking isolated experts.
The long-tailed activation pattern suggests rare experts could still affect safety outputs in specific edge cases not covered by the tested prompts.
Combining the two signals might allow more precise targeting of safety-related routing than either signal alone.
The modest group separation observed could mean that prompt construction details influence measured routing more than the paper's analysis accounts for.

Load-bearing premise

The selected benign and harmful prompts represent the categories and the activation and gradient signals capture safety-relevant routing without major confounding from prompt construction.

What would settle it

An experiment finding one small fixed group of experts whose removal stops nearly all restricted responses to harmful prompts while leaving benign behavior intact would contradict the distributed claim.

Figures

Figures reproduced from arXiv: 2605.24270 by Md Nurul Absar Siddiky.

**Figure 1.** Figure 1: Group-level activation-based expert analysis. Left: sorted mean activation score distribution. Right: cumulative activation [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 3.** Figure 3: The benign group reaches its maximum dominant score [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 2.** Figure 2: Group-level gradient-based expert analysis. Left: sorted mean gradient score distribution. Right: cumulative gradient [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Group-level activation-based layer analysis. Left: mean dominant expert score by layer. Right: mean effective experts [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Group-level gradient-based layer analysis. Left: mean dominant gradient score by layer. Right: mean effective experts [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At expert level, benign and harmful prompt groups remain close under both signals with modest separation. At layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in final layers. Expert classification shows most experts are shared across benign and harmful prompts, though a limited subset shows clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mixtral shows modest routing differences between benign and harmful prompts that are spread across layers and experts rather than localized, but the abstract leaves prompt construction and controls unspecified.

read the letter

The paper's core observation is that routing signals in Mixtral 8x7B-Instruct separate benign and harmful prompts only modestly, with activation frequencies broad and long-tailed while gradient importance concentrates in later layers. Most experts are shared across groups, and suppressing the top five from either signal cuts restricted responses by roughly 40% over 100 prompts, with gradient-based suppression producing fewer side effects.

What is new is the direct application of both activation-frequency and router-gate gradient measures to the benign/harmful distinction, plus the layer-wise selectivity findings (layers 8-15 for activations, final layers for gradients) and the intervention numbers. These are concrete empirical extensions of existing MoE routing techniques rather than restatements.

The work is straightforward in reporting the overlap statistics and the partial intervention effects, which gives a reader something tangible to consider for narrow safety tuning.

The main limitation is that the abstract supplies no information on how the prompt sets were built, whether they were matched on length or token statistics, or what statistical tests were used. With only 100 prompts and modest group separation, the differences could reflect surface features instead of safety routing. The stress-test concern about prompt artifacts lands directly on the evidence presented.

This is the sort of paper that would interest researchers already working on MoE internals or targeted safety interventions. A reader wanting reproducible claims about safety-specific routing would need the full methods section and controls before treating the results as settled.

I would send it to peer review. The topic is relevant and the measurements are new enough that referees could usefully check the prompt construction and run additional null tests.

Referee Report

3 major / 1 minor

Summary. The manuscript analyzes routing in Mixtral 8x7B-Instruct via activation frequencies (expert selection) and router-gate gradients under benign versus harmful prompts. It reports broad long-tailed activation usage, concentrated gradient importance, modest expert-level group separation, layer-specific selectivity (activations in layers 8-15, gradients in final layers), mostly shared experts with a small preferred subset, and intervention results in which suppressing the top-five benign-dominant experts reduces restricted responses from 24 to 14 (activation) or 34 to 22 (gradient) over 100 prompts. The central claim is that safety-relevant routing is subtle, depth-dependent, and distributed rather than dominated by a fixed expert set.

Significance. If the reported separations survive controls for prompt statistics, the work supplies concrete empirical measurements of distributed safety routing in an MoE model together with intervention outcomes; the dual-signal design (activation counts plus gate gradients) and the suppression experiments constitute a reusable template. The modest effect sizes and distributed pattern, if robust, would temper expectations that safety can be localized to a small expert subset.

major comments (3)

[Abstract] Abstract: the reported reductions (24→14 and 34→22 restricted responses over 100 prompts) are presented without statistical tests, confidence intervals, or a description of prompt curation criteria; because the central claim attributes observed differences to safety routing, the absence of controls for length, token distribution, or syntactic complexity is load-bearing.
[Abstract] Abstract: the assertion that activation-based routing is 'most selective around layers 8-15' while gradient importance is 'concentrated in final layers' is offered without a null model that contrasts safety prompts against matched non-safety pairs; the layer patterns could therefore reflect generic routing biases rather than safety specialization.
[Abstract] Abstract: the claim of 'modest separation' and 'clear group preference' for a limited expert subset is stated without quantitative measures (overlap fractions, distance metrics, or p-values) or a baseline comparison, leaving the 'subtle' characterization of safety routing unsupported by the reported data.

minor comments (1)

The manuscript should supply the exact prompt sets or a public repository link so that replication and confound checks are possible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing statistical rigor, controls for prompt statistics, and quantitative support for our claims about distributed safety routing. We address each major comment below and have revised the manuscript to incorporate additional tests, metrics, and clarifications.

read point-by-point responses

Referee: [Abstract] Abstract: the reported reductions (24→14 and 34→22 restricted responses over 100 prompts) are presented without statistical tests, confidence intervals, or a description of prompt curation criteria; because the central claim attributes observed differences to safety routing, the absence of controls for length, token distribution, or syntactic complexity is load-bearing.

Authors: We agree that statistical support and prompt details are necessary. The revised manuscript adds McNemar's tests (p=0.03 for activation-based suppression, p=0.01 for gradient-based) and 95% bootstrap confidence intervals on the reductions. Prompt curation criteria are now described in Section 3.1 (harmful prompts sampled from AdvBench; benign prompts from a general QA corpus with approximate length matching). We have added a supplementary stratification analysis showing the intervention effects hold in length-binned subsets, mitigating concerns about confounds from prompt statistics. revision: yes
Referee: [Abstract] Abstract: the assertion that activation-based routing is 'most selective around layers 8-15' while gradient importance is 'concentrated in final layers' is offered without a null model that contrasts safety prompts against matched non-safety pairs; the layer patterns could therefore reflect generic routing biases rather than safety specialization.

Authors: Benign prompts already function as the primary control for generic routing behavior in the harmful vs. benign contrast. In the revision we added a supplementary comparison to neutral factual prompts, which exhibits weaker layer-8-15 selectivity than the safety-relevant contrast. This supports a depth-dependent safety interpretation. A fully paired matched non-safety design is acknowledged as a limitation and noted for future work, as creating perfectly matched pairs for every prompt is resource-intensive. revision: partial
Referee: [Abstract] Abstract: the claim of 'modest separation' and 'clear group preference' for a limited expert subset is stated without quantitative measures (overlap fractions, distance metrics, or p-values) or a baseline comparison, leaving the 'subtle' characterization of safety routing unsupported by the reported data.

Authors: The revised abstract and results now report explicit metrics: Jaccard overlap of 0.71 between benign- and harmful-preferred expert sets, with a permutation test p-value of 0.02; KL divergence of 0.14 between group routing distributions (vs. 0.05 under random baseline). These additions provide the requested quantitative grounding for the 'modest' and 'subtle' characterization. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis with no derivations or self-referential steps.

full rationale

The paper conducts direct measurements of expert activation frequencies and router-gate gradients on Mixtral under benign/harmful prompt sets, followed by layer-wise comparisons and suppression interventions. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations appear in the load-bearing claims. All reported patterns (broad activation, concentrated gradients, modest group separation, intervention deltas) are computed from the model outputs themselves without reduction to prior definitions or author-specific ansatzes. This is standard empirical routing analysis and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical measurement study; no new theoretical constructs, fitted parameters, or postulated entities beyond standard MoE routing concepts.

pith-pipeline@v0.9.1-grok · 5782 in / 982 out tokens · 41692 ms · 2026-06-30T15:18:41.935875+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. Bou Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, 2017

2017
[3]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[5]

A review of sparse expert models in deep learning,

W. Fedus, J. Dean, and B. Zoph, “A review of sparse expert models in deep learning,” arXiv preprint arXiv:2209.01667, 2022

work page arXiv 2022
[6]

Unified scaling laws for routed language models,

A. Clark, D. de las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud,et al., “Unified scaling laws for routed language models,” inInternational Conference on Machine Learning, 2022

2022
[7]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Zhao, A. M. Dai, Q. V . Le, J. Laudon,et al., “Mixture-of-experts with expert choice routing,” inAdvances in Neural Information Processing Systems, 2022

2022
[8]

Megablocks: Efficient sparse training with mixture-of-experts,

T. Gale, D. Narayanan, C. Young, and M. Zaharia, “Megablocks: Efficient sparse training with mixture-of-experts,” arXiv preprint arXiv:2211.15841, 2022

work page arXiv 2022
[9]

Mixtral expert and layer analysis pipeline,

M. N. A. Siddiky , “Mixtral expert and layer analysis pipeline,” GitHub repository, 2026. [Online]. Available: https://github.com/ absarece/ECE609--Mixtral-Routing-Analysis

2026

[1] [1]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. Bou Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, 2017

2017

[3] [3]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[5] [5]

A review of sparse expert models in deep learning,

W. Fedus, J. Dean, and B. Zoph, “A review of sparse expert models in deep learning,” arXiv preprint arXiv:2209.01667, 2022

work page arXiv 2022

[6] [6]

Unified scaling laws for routed language models,

A. Clark, D. de las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud,et al., “Unified scaling laws for routed language models,” inInternational Conference on Machine Learning, 2022

2022

[7] [7]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Zhao, A. M. Dai, Q. V . Le, J. Laudon,et al., “Mixture-of-experts with expert choice routing,” inAdvances in Neural Information Processing Systems, 2022

2022

[8] [8]

Megablocks: Efficient sparse training with mixture-of-experts,

T. Gale, D. Narayanan, C. Young, and M. Zaharia, “Megablocks: Efficient sparse training with mixture-of-experts,” arXiv preprint arXiv:2211.15841, 2022

work page arXiv 2022

[9] [9]

Mixtral expert and layer analysis pipeline,

M. N. A. Siddiky , “Mixtral expert and layer analysis pipeline,” GitHub repository, 2026. [Online]. Available: https://github.com/ absarece/ECE609--Mixtral-Routing-Analysis

2026