RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
Pith reviewed 2026-05-16 07:52 UTC · model grok-4.3
The pith
RASA aligns safety in mixture-of-experts models by selectively repairing experts activated during jailbreaks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RASA identifies experts disproportionately activated by successful jailbreaks as Safety-Critical Experts, selectively fine-tunes only these experts under fixed routing to directly repair them, and enforces routing consistency with safety-aligned contexts to block bypasses.
What carries the argument
Identification and selective fine-tuning of Safety-Critical Experts under fixed routing to prevent routing-based safety bypasses
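The identification step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the per-token routing-log format, and the defaults `k_sd=2.0` and `top_k=5` are assumptions. The defaults mirror the threshold and selection rule that the simulated rebuttal further down says a revision would state.

```python
from statistics import mean, stdev


def activation_rates(routing_logs, num_experts):
    """Fraction of tokens that were routed to each expert.

    routing_logs: one entry per token, each a list of selected expert indices
    (hypothetical log format, assumed for this sketch).
    """
    counts = [0] * num_experts
    for selected in routing_logs:
        for expert in selected:
            counts[expert] += 1
    total_tokens = len(routing_logs) or 1  # avoid division by zero on empty logs
    return [c / total_tokens for c in counts]


def safety_critical_experts(jailbreak_logs, benign_logs, num_experts,
                            k_sd=2.0, top_k=5):
    """Flag experts whose jailbreak-minus-benign activation rate is an outlier."""
    jailbreak = activation_rates(jailbreak_logs, num_experts)
    benign = activation_rates(benign_logs, num_experts)
    disparity = [j - b for j, b in zip(jailbreak, benign)]
    # Assumed rule: disparity above mean + k_sd sample standard deviations.
    threshold = mean(disparity) + k_sd * stdev(disparity)
    flagged = [e for e, d in enumerate(disparity) if d > threshold]
    # Keep at most top_k experts, largest disparity first.
    flagged.sort(key=lambda e: disparity[e], reverse=True)
    return flagged[:top_k]
```

In a selective fine-tuning setup, the returned indices would be the only experts left trainable while the router and all other experts stay frozen. Varying `k_sd` gives the kind of sensitivity ablation the referee report asks for.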
If this is right
- Achieves near-perfect robustness across diverse jailbreak attacks on two MoE architectures
- Exhibits strong generalization to new attacks not seen during alignment
- Reduces over-refusal rates substantially compared to other methods
- Maintains performance on general benchmarks including MMLU, GSM8K, and TruthfulQA
Where Pith is reading between the lines
- Safety issues in MoE models may be concentrated in a small set of experts rather than spread across the model
- This method could be adapted to other sparse activation architectures beyond the ones tested
- It implies that full-parameter updates are inefficient for safety in routed models and may even be counterproductive
Load-bearing premise
That the primary safety failures stem from specific experts activated during jailbreaks, and that fixing only those without altering routing will not create new bypass routes or reduce other model abilities
What would settle it
A new jailbreak attack that succeeds after RASA by routing around the repaired experts, or a drop in accuracy on MMLU or GSM8K after applying the method
Original abstract
Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
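The abstract does not spell out how routing consistency is measured. One plausible way to quantify it, shown here purely as an assumption, is the KL divergence between the router's expert distribution on a safety-aligned context and its distribution on another prompt. The names `softmax` and `routing_kl` and the use of raw router logits as input are hypothetical; this is not the paper's loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_kl(reference_logits, candidate_logits, eps=1e-12):
    """KL(p_ref || p_cand) between two routing distributions over experts.

    reference_logits: router logits on a safety-aligned context (assumed input).
    candidate_logits: router logits on the prompt being checked (assumed input).
    eps guards against log(0) for near-zero probabilities.
    """
    p = softmax(reference_logits)
    q = softmax(candidate_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

A value near zero means the prompt is routed like the safety-aligned reference. A large value indicates that the router is sending tokens to a different set of experts, which is the bypass pattern the method is meant to prevent.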
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RASA, a routing-aware expert-level safety alignment framework for Mixture-of-Experts (MoE) models. It identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only those experts while holding routing fixed, and then enforces routing consistency with safety-aligned contexts. Empirical evaluation across two MoE architectures and multiple jailbreak attacks reports near-perfect robustness, cross-attack generalization, reduced over-refusal, and preserved performance on MMLU, GSM8K, and TruthfulQA, arguing that targeted expert repair is preferable to global fine-tuning.
Significance. If the central empirical claims hold under additional verification, the work offers a practical, architecture-preserving alternative to full-parameter safety fine-tuning for MoE models. It directly addresses routing-induced safety failures and provides evidence that selective repair can maintain general capabilities while improving robustness, which is relevant for deploying large sparse models safely.
major comments (3)
- [§4.2] (Expert Identification): The procedure for identifying Safety-Critical Experts via activation disparities is described only at a high level. No explicit threshold, statistical criterion, or ablation on sensitivity to this choice is provided, making it impossible to assess whether the reported robustness holds under reasonable variations in identification.
- [§5.1] (Attack Evaluation): The manuscript reports strong robustness and cross-attack generalization but contains no experiments testing post-RASA models against adaptive attacks that specifically target the now-fixed routing or the unrepaired experts. Without such checks, the claim that fixed routing plus consistency enforcement eliminates bypasses remains unverified.
- [§5.3] (Statistical Reporting): Quantitative results are presented without error bars, p-values, or details on the number of runs. The abstract's claim of 'near-perfect robustness' therefore cannot be assessed for statistical reliability or effect size relative to baselines.
minor comments (2)
- [§3.3] The notation for the routing consistency loss is introduced without an explicit equation number or a comparison to standard MoE routing objectives; adding both would aid reproducibility.
- [Figure 3] Figure 3 caption should explicitly state the number of experts per model and the exact attack set used for the heatmaps.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below and will revise the manuscript accordingly to improve clarity, rigor, and verification of the claims.
- Referee: [§4.2] (Expert Identification): The procedure for identifying Safety-Critical Experts via activation disparities is described only at a high level. No explicit threshold, statistical criterion, or ablation on sensitivity to this choice is provided, making it impossible to assess whether the reported robustness holds under reasonable variations in identification.
  Authors: We agree the identification procedure is underspecified. In the revision we will explicitly define the disparity metric, state the exact threshold (experts whose jailbreak-vs-benign activation disparity exceeds the mean by 2 standard deviations) and the selection rule (top-5 experts), and add an ablation varying the threshold by ±1 SD to show that robustness remains stable within reasonable ranges.
  Revision: yes
- Referee: [§5.1] (Attack Evaluation): The manuscript reports strong robustness and cross-attack generalization but contains no experiments testing post-RASA models against adaptive attacks that specifically target the now-fixed routing or the unrepaired experts. Without such checks, the claim that fixed routing plus consistency enforcement eliminates bypasses remains unverified.
  Authors: We acknowledge this gap. The revised manuscript will include new experiments with adaptive attacks explicitly designed to exploit the fixed routing or target unrepaired experts (e.g., by optimizing prompts to force routing to low-safety experts). We will show that the routing-consistency loss prevents successful bypasses, thereby verifying the mechanism.
  Revision: yes
- Referee: [§5.3] (Statistical Reporting): Quantitative results are presented without error bars, p-values, or details on the number of runs. The abstract's claim of 'near-perfect robustness' therefore cannot be assessed for statistical reliability or effect size relative to baselines.
  Authors: We will strengthen the statistical reporting. The revision will report results over 5 independent runs with different random seeds, include error bars (standard deviation), and add p-values and effect sizes for the key robustness comparisons against baselines to support the 'near-perfect' claim.
  Revision: yes
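The statistical reporting promised in the last response can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `summarize` and `cohens_d` are hypothetical helper names, the inputs are per-seed scores such as attack success rates, and Cohen's d with a pooled standard deviation is one common effect-size choice that the source does not itself specify.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over independent runs.

    scores: one metric value per random seed; needs at least two runs.
    """
    return mean(scores), stdev(scores)


def cohens_d(a, b):
    """Cohen's d between two groups of runs, using the pooled sample SD.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

Calling `summarize` on the five per-seed values of a metric gives the number to report and its error bar, and `cohens_d(baseline_runs, method_runs)` gives a scale-free effect size. A p-value would additionally need a significance test, which this sketch leaves out.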
Circularity Check
No circularity: empirical procedure with independent experimental validation
The paper presents RASA as a practical, empirical alignment method that identifies safety-critical experts via activation analysis on jailbreak data, performs selective fine-tuning under fixed routing, and enforces consistency post-training. Robustness claims are supported by direct evaluation on diverse held-out attacks and capability benchmarks (MMLU, GSM8K, TruthfulQA) rather than any closed-form derivation, fitted parameter that encodes the target metric, or self-referential definition. No equations appear in the provided text, and the central results do not reduce to inputs by construction; the method's success is treated as an observed outcome of the procedure, not a tautological restatement of its design choices.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: bi-level optimization formulation $\min_{\Phi} \mathbb{E}_{B}\left[ L_{\text{router}}\left(\Phi;\ \arg\min_{\Theta} L_{\text{SCE-FT}}(\Theta, \Phi)\right) \right]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.