arxiv: 2602.04448 · v2 · submitted 2026-02-04 · 💻 cs.LG · cs.AI · cs.CR

RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Pith reviewed 2026-05-16 07:52 UTC · model grok-4.3

keywords mixture-of-experts · safety alignment · jailbreak attacks · routing mechanisms · selective fine-tuning · over-refusal · capability preservation

The pith

RASA safety-aligns mixture-of-experts models by selectively repairing the experts that successful jailbreaks activate

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard full-parameter safety fine-tuning of mixture-of-experts models can lower attack success rates through routing or expert-dominance effects rather than by repairing the experts responsible for unsafe outputs. RASA counters this by finding the experts that are disproportionately activated during successful jailbreaks and fine-tuning only those experts with routing held fixed. It then enforces routing consistency with safety-aligned contexts so that later inputs cannot route around the repair. This targeted approach is meant to fix the problem at its source rather than relying on broad parameter changes that might create new issues. If it works as described, it offers a way to align these sparse models without changing their architecture or degrading general performance.

Core claim

RASA identifies experts disproportionately activated by successful jailbreaks as Safety-Critical Experts, selectively fine-tunes only these experts under fixed routing to directly repair them, and enforces routing consistency with safety-aligned contexts to block bypasses.
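
The identification step can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes disparity is the difference in per-expert activation rate between jailbreak and benign prompts, and the default cutoff (2 standard deviations above the mean disparity) and cap (five experts) mirror the values floated in the simulated rebuttal on this page. The function names and the trace format are hypothetical.

```python
from statistics import mean, pstdev


def activation_rate(routing_traces, num_experts):
    """Fraction of routing slots assigned to each expert.

    routing_traces: list of prompts; each prompt is a list of tokens;
    each token is the list of expert ids its router selected (top-k).
    This trace format is an assumption made for the sketch.
    """
    counts = [0] * num_experts
    total = 0
    for prompt in routing_traces:
        for token_experts in prompt:
            for expert in token_experts:
                counts[expert] += 1
                total += 1
    return [c / total if total else 0.0 for c in counts]


def safety_critical_experts(jailbreak_traces, benign_traces, num_experts,
                            z=2.0, top_k=5):
    """Return experts whose jailbreak-vs-benign disparity is an outlier."""
    jb = activation_rate(jailbreak_traces, num_experts)
    bn = activation_rate(benign_traces, num_experts)
    # Assumed disparity metric: difference in activation rate.
    disparity = [j - b for j, b in zip(jb, bn)]
    mu, sigma = mean(disparity), pstdev(disparity)
    flagged = [e for e in range(num_experts) if disparity[e] > mu + z * sigma]
    # Keep at most top_k experts, largest disparity first.
    flagged.sort(key=lambda e: disparity[e], reverse=True)
    return flagged[:top_k]
```

With few experts per layer a 2 SD cutoff is hard to exceed, so the choice of `z` interacts with model size; that sensitivity is what the referee's request for an ablation is about.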

What carries the argument

Identification and selective fine-tuning of Safety-Critical Experts under fixed routing to prevent routing-based safety bypasses
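
"Selective fine-tuning under fixed routing" amounts to deciding which parameters receive gradient updates. A minimal sketch, assuming a hypothetical parameter naming scheme (`layers.<L>.experts.<E>.…` for expert weights, anything else for routers, attention, and embeddings); real checkpoints name parameters differently, and nothing here is taken from the paper's code.

```python
import re

# Hypothetical naming scheme, e.g. "layers.3.experts.5.w1".
EXPERT_PARAM = re.compile(r"^layers\.(\d+)\.experts\.(\d+)\.")


def trainable_mask(param_names, critical):
    """Map each parameter name to True if it should be updated.

    critical: set of (layer, expert) pairs identified as safety-critical.
    Routers and all other parameters map to False, which keeps the
    routing function fixed during the repair step.
    """
    mask = {}
    for name in param_names:
        m = EXPERT_PARAM.match(name)
        mask[name] = bool(m) and (int(m.group(1)), int(m.group(2))) in critical
    return mask
```

In a PyTorch-style framework such a mask would typically be applied by setting each parameter's `requires_grad` flag before building the optimizer.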

If this is right

  • Achieves near-perfect robustness across diverse jailbreak attacks on two MoE architectures
  • Exhibits strong generalization to new attacks not seen during alignment
  • Reduces over-refusal rates substantially compared to other methods
  • Maintains performance on general benchmarks including MMLU, GSM8K, and TruthfulQA
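
The robustness and over-refusal bullets rest on two rates that this summary names but does not define. A minimal sketch, assuming the conventional definitions: attack success rate as the share of attack prompts that elicit a harmful response, and over-refusal rate as the share of benign prompts that are refused. The function names are illustrative, and how each response is judged is outside the sketch.

```python
def rate(flags):
    """Share of True values in a list of booleans; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def safety_report(attack_succeeded, benign_refused):
    """Summarize the two rates from per-prompt boolean judgments.

    attack_succeeded: one bool per attack prompt (True = jailbreak worked).
    benign_refused:   one bool per benign prompt (True = model refused).
    """
    return {
        "attack_success_rate": rate(attack_succeeded),
        "over_refusal_rate": rate(benign_refused),
    }
```

Reporting both rates together matters because a model can drive the first to zero simply by refusing everything, which the second would expose.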

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety issues in MoE models may be concentrated in a small set of experts rather than spread across the model
  • This method could be adapted to other sparse activation architectures beyond the ones tested
  • It implies that full-parameter updates are inefficient for safety in routed models and may even be counterproductive

Load-bearing premise

That the primary safety failures stem from specific experts activated during jailbreaks, and that fixing only those without altering routing will not create new bypass routes or reduce other model abilities

What would settle it

A new jailbreak attack that succeeds after RASA by routing around the repaired experts, or a drop in accuracy on MMLU or GSM8K after applying the method

Original abstract

Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
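
The abstract does not state the form of the routing-consistency objective. One plausible instantiation, offered purely as an assumption, penalizes the KL divergence between the router's distribution on an input and its distribution on a safety-aligned reference context, averaged over tokens. The function names and the per-token logit layout below are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's router logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def routing_consistency_penalty(router_logits, reference_logits):
    """Mean per-token KL(reference || current) over expert distributions.

    router_logits, reference_logits: non-empty lists of equal length,
    one list of per-expert logits per token. The KL form is an assumed
    stand-in, not the loss defined in the paper.
    """
    total = 0.0
    for cur, ref in zip(router_logits, reference_logits):
        log_p = log_softmax(ref)  # reference (safety-aligned) routing
        log_q = log_softmax(cur)  # routing on the current input
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(router_logits)
```

The penalty is zero when routing matches the reference and grows as the current input pushes probability mass toward other experts, which is the behavior a bypass-blocking term would need.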

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RASA, a routing-aware expert-level safety alignment framework for Mixture-of-Experts (MoE) models. It identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only those experts while holding routing fixed, and then enforces routing consistency with safety-aligned contexts. Empirical evaluation across two MoE architectures and multiple jailbreak attacks reports near-perfect robustness, cross-attack generalization, reduced over-refusal, and preserved performance on MMLU, GSM8K, and TruthfulQA, arguing that targeted expert repair is preferable to global fine-tuning.

Significance. If the central empirical claims hold under additional verification, the work offers a practical, architecture-preserving alternative to full-parameter safety fine-tuning for MoE models. It directly addresses routing-induced safety failures and provides evidence that selective repair can maintain general capabilities while improving robustness, which is relevant for deploying large sparse models safely.

major comments (3)
  1. [§4.2] Expert Identification: The procedure for identifying Safety-Critical Experts via activation disparities is described only at a high level; no explicit threshold, statistical criterion, or sensitivity ablation is provided, making it impossible to assess whether the reported robustness holds under reasonable variations in the identification criterion.
  2. [§5.1] Attack Evaluation: The manuscript reports strong robustness and cross-attack generalization but contains no experiments testing post-RASA models against adaptive attacks that specifically target the now-fixed routing or the unrepaired experts; without such checks the claim that fixed routing plus consistency enforcement eliminates bypasses remains unverified.
  3. [§5.3] Statistical Reporting: Quantitative results are presented without error bars, p-values, or details on the number of runs; the abstract's claim of 'near-perfect robustness' therefore cannot be assessed for statistical reliability or effect size relative to baselines.
minor comments (2)
  1. [§3.3] The notation for the routing consistency loss is introduced without an explicit equation number or comparison to standard MoE routing objectives, which would aid reproducibility.
  2. [Figure 3] The caption should explicitly state the number of experts per model and the exact attack set used for the heatmaps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and will revise the manuscript accordingly to improve clarity, rigor, and verification of the claims.

  1. Referee: [§4.2] Expert Identification: The procedure for identifying Safety-Critical Experts via activation disparities is described only at a high level; no explicit threshold, statistical criterion, or sensitivity ablation is provided, making it impossible to assess whether the reported robustness holds under reasonable variations in the identification criterion.

    Authors: We agree the identification procedure is underspecified. In the revision we will explicitly define the disparity metric, state the exact threshold (experts exceeding mean activation disparity by 2 standard deviations across jailbreak vs. benign prompts) and selection rule (top-5 experts), and add an ablation varying the threshold by ±1 SD to show that robustness remains stable within reasonable ranges. revision: yes

  2. Referee: [§5.1] Attack Evaluation: The manuscript reports strong robustness and cross-attack generalization but contains no experiments testing post-RASA models against adaptive attacks that specifically target the now-fixed routing or the unrepaired experts; without such checks the claim that fixed routing plus consistency enforcement eliminates bypasses remains unverified.

    Authors: We acknowledge this gap. The revised manuscript will include new experiments that attempt adaptive attacks explicitly designed to exploit the fixed routing or target unrepaired experts (e.g., by optimizing prompts to force routing to low-safety experts). We will show that the routing-consistency loss prevents successful bypasses, thereby verifying the mechanism. revision: yes

  3. Referee: [§5.3] Statistical Reporting: Quantitative results are presented without error bars, p-values, or details on the number of runs; the abstract's claim of 'near-perfect robustness' therefore cannot be assessed for statistical reliability or effect size relative to baselines.

    Authors: We will strengthen the statistical reporting. The revision will report results over 5 independent runs with different random seeds, include error bars (standard deviation), specify the exact number of runs, and add p-values and effect sizes for the key robustness comparisons against baselines to support the 'near-perfect' claim. revision: yes
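
The reporting promised in the third response can be sketched with the standard library. This is a minimal illustration under stated assumptions: each list holds one metric value per independent run (for example, attack success rate per seed), the effect size is Cohen's d with a pooled standard deviation, and the function names are hypothetical. It omits p-values, which would need a test such as Welch's t-test from a statistics package.

```python
import math
from statistics import mean, stdev


def summarize(runs):
    """Mean and sample standard deviation across independent runs."""
    return mean(runs), stdev(runs)


def cohens_d(a, b):
    """Cohen's d between two sets of runs, using the pooled sample variance.

    Raises ZeroDivisionError if both sets have zero variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

With only five runs per condition the standard deviation itself is a noisy estimate, so error bars of this kind indicate spread rather than settle significance.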

Circularity Check

0 steps flagged

No circularity: empirical procedure with independent experimental validation

The paper presents RASA as a practical, empirical alignment method that identifies safety-critical experts via activation analysis on jailbreak data, performs selective fine-tuning under fixed routing, and enforces consistency post-training. Robustness claims are supported by direct evaluation on diverse held-out attacks and capability benchmarks (MMLU, GSM8K, TruthfulQA) rather than any closed-form derivation, fitted parameter that encodes the target metric, or self-referential definition. No equations appear in the provided text, and the central results do not reduce to inputs by construction; the method's success is treated as an observed outcome of the procedure, not a tautological restatement of its design choices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce new mathematical axioms or free parameters; it relies on standard assumptions about MoE routing behavior and the existence of identifiable safety-critical experts.
