Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Pith reviewed 2026-05-21 07:40 UTC · model grok-4.3
The pith
Learned route selectivity is the main bottleneck in selective refusal editing of frozen transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Residual Paving separates route selectivity (whether to intervene) from residual-edit capacity (what edit to apply). An early-layer router predicts a scalar gate and an expert mixture; when the gate is active, prompt-conditioned bottleneck residual experts produce later-layer updates without altering the frozen backbone. Replacing the learned gate with held-out edit/keep labels while keeping the editor fixed improves the keep-side diagnostic score on every reported backbone. On Gemma-3-4B-IT the learned system drops edit refusal from 88.6% to 4.0% while retaining 95.5% benign and 87.3% harmful distribution preservation, outperforming one-direction steering baselines on edit success.
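The routed update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the down-project / ReLU / up-project form of each expert, the use of the hidden state as the expert input, and all names (`bottleneck_expert`, `routed_residual_update`) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def bottleneck_expert(h: Vector, down: Matrix, up: Matrix) -> Vector:
    """Assumed expert shape: project down, apply ReLU, project back up."""
    z = [max(0.0, a) for a in matvec(down, h)]
    return matvec(up, z)


def routed_residual_update(
    h: Vector,
    gate: float,
    mixture: Sequence[float],
    experts: Sequence[Tuple[Matrix, Matrix]],
) -> Vector:
    """Compute h' = h + gate * sum_k mixture[k] * expert_k(h).

    The backbone is untouched: only the residual stream value h is shifted.
    """
    delta = [0.0] * len(h)
    for weight, (down, up) in zip(mixture, experts):
        out = bottleneck_expert(h, down, up)
        delta = [d + weight * o for d, o in zip(delta, out)]
    return [x + gate * d for x, d in zip(h, delta)]


# Hypothetical toy values: one expert with a width-1 bottleneck.
h = [1.0, -2.0]
experts = [([[1.0, 0.0]], [[2.0], [3.0]])]
edited = routed_residual_update(h, gate=1.0, mixture=[1.0], experts=experts)
kept = routed_residual_update(h, gate=0.0, mixture=[1.0], experts=experts)
```

With the gate at zero the residual stream passes through unchanged, which is the property that lets the gate and the editor be evaluated separately.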
What carries the argument
The oracle-routing diagnostic, which substitutes held-out edit/keep labels for only the learned scalar gate while the residual editor and backbone remain fixed.
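A minimal sketch of that substitution follows, assuming a hard-thresholded learned gate and a scalar stand-in for the hidden state. The field names (`router_score`, `label`, `hidden`), the 0.5 threshold, and the constant `toy_editor` are hypothetical; the only point illustrated is that the editor is identical in both runs and only the gate source changes.

```python
from typing import Callable, Dict, List


def gate_values(examples: List[Dict], mode: str, threshold: float = 0.5) -> List[float]:
    """Return one scalar gate per example for the chosen routing mode."""
    if mode == "learned":
        # Assumed: the learned router emits a score that is thresholded.
        return [1.0 if ex["router_score"] >= threshold else 0.0 for ex in examples]
    if mode == "oracle":
        # Oracle: the held-out edit/keep label replaces the learned gate.
        return [1.0 if ex["label"] == "edit" else 0.0 for ex in examples]
    raise ValueError(f"unknown mode: {mode}")


def run_editor(
    examples: List[Dict], gates: List[float], editor: Callable[[float], float]
) -> List[float]:
    """Apply the same fixed editor under the given gates."""
    return [ex["hidden"] + g * editor(ex["hidden"]) for ex, g in zip(examples, gates)]


def toy_editor(hidden: float) -> float:
    """Stand-in for the fixed residual editor (hypothetical constant update)."""
    return 10.0


examples = [
    {"hidden": 1.0, "router_score": 0.9, "label": "edit"},
    {"hidden": 2.0, "router_score": 0.7, "label": "keep"},  # learned gate fires off-target
    {"hidden": 3.0, "router_score": 0.1, "label": "keep"},
]

learned_out = run_editor(examples, gate_values(examples, "learned"), toy_editor)
oracle_out = run_editor(examples, gate_values(examples, "oracle"), toy_editor)
```

In this toy case the two runs differ only on the second example, a keep prompt that the learned gate activates on; that difference is the kind of keep-side error the diagnostic is meant to expose.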
If this is right
- Methods that improve routing accuracy alone should close most of the remaining gap without changing the residual experts.
- The persistent off-target harmful-keep degradation even under oracle routing indicates that residual-edit capacity still needs refinement.
- Trajectory analysis on two backbones shows edits move toward edit-target continuations rather than generic refusal suppression.
- The same routed-residual split can be applied to other selective-control tasks on frozen models.
Where Pith is reading between the lines
- Routing improvements developed for mixture-of-experts models could be directly reused to strengthen selective editing.
- Auxiliary losses that supervise the gate more explicitly might reduce the observed routing bottleneck.
- The decomposition suggests testing whether multi-turn or chained refusal edits would expose additional routing limits.
- If routing remains the dominant issue, hybrid systems that combine learned gates with lightweight external verifiers could be explored.
Load-bearing premise
The oracle-routing test isolates route selectivity without side effects from label quality or distribution shift.
What would settle it
An experiment that applies the same oracle labels but observes no keep-side improvement on a new backbone or with noisier labels would show the diagnostic does not cleanly isolate routing.
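One way to set up the noisier-label variant of this test is to corrupt the oracle labels at a controlled rate before using them as gates. The sketch below only builds the corrupted labels; the function name `flip_labels`, the "edit"/"keep" strings, and the symmetric flip model are assumptions, and the keep-side evaluation itself is left to whatever harness the experiment uses.

```python
import random
from typing import List


def flip_labels(labels: List[str], flip_rate: float, seed: int = 0) -> List[str]:
    """Flip each edit/keep label independently with probability flip_rate."""
    if not 0.0 <= flip_rate <= 1.0:
        raise ValueError("flip_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded for repeatable sweeps
    noisy = []
    for label in labels:
        if rng.random() < flip_rate:
            noisy.append("keep" if label == "edit" else "edit")
        else:
            noisy.append(label)
    return noisy


clean = ["edit", "keep", "keep", "edit"]
# Hypothetical sweep: each entry would be fed to the oracle gate in turn.
sweep = {rate: flip_labels(clean, rate, seed=0) for rate in (0.0, 0.1, 0.3, 1.0)}
```

If the keep-side gain vanished at small flip rates, that would suggest the diagnostic depends heavily on label quality rather than cleanly isolating routing.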
Original abstract
We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.
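The headline figures in the abstract are easier to compare as percentage-point changes. The short calculation below uses only numbers quoted in the abstract; it assumes the steering controls are measured against the same 88.6% frozen-base edit refusal rate, which the "same-protocol" wording suggests but does not state outright. The helper name `pp_change` is illustrative.

```python
def pp_change(before: float, after: float) -> float:
    """Percentage-point change from before to after, rounded to one decimal."""
    return round(after - before, 1)


BASE_EDIT_REFUSAL = 88.6  # frozen-base edit refusal (%), from the abstract

edit_refusal_after = {
    "Residual Paving (learned)": 4.0,
    "Edit-target ActAdd": 86.8,
    "DIM-style refusal steering": 78.9,
}

# Change in edit refusal for each method (negative means fewer refusals).
edit_refusal_delta = {
    name: pp_change(BASE_EDIT_REFUSAL, value)
    for name, value in edit_refusal_after.items()
}

# Off-target cost: harmful refusal under the learned system vs. the frozen base.
harmful_refusal_delta = pp_change(81.6, 65.3)

for name, delta in edit_refusal_delta.items():
    print(f"{name}: {delta:+.1f} pp edit refusal")
print(f"Harmful-keep refusal: {harmful_refusal_delta:+.1f} pp vs. frozen base")
```

Read this way, the learned system removes 84.6 pp of edit refusal at a cost of 16.3 pp of harmful-keep refusal, while the two steering controls move edit refusal by 1.8 pp and 9.7 pp.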
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Residual Paving, a routed residual editing approach for selective refusal editing in frozen instruction-tuned transformers. It decomposes the task into route selectivity (early-layer router predicting scalar gate and expert mixture) and residual-edit capacity (prompt-conditioned bottleneck residual experts applied later). An oracle-routing diagnostic replaces only the learned scalar gate with held-out edit/keep labels while keeping the residual editor and backbone fixed. On Gemma-3-4B-IT, learned Residual Paving achieves 4.0% edit refusal (vs. 88.6% baseline) with high preservation rates; across six backbones, oracle routing yields consistent keep-side gains (median +12.9 pp), supporting the claim that learned route selectivity is the primary bottleneck. Comparisons to one-direction steering baselines and trajectory diagnostics are also reported.
Significance. If the oracle diagnostic isolates route selectivity without substantial confounding from joint training of prompt-conditioned experts, the work offers a useful diagnostic lens on selective editing failures and highlights routing as a key lever for improving edit success while preserving behavior outside the edit set. The cross-backbone consistency and explicit separation of routing from editing capacity are strengths; the method also supplies a concrete, externally grounded test (held-out labels) rather than relying solely on fitted quantities.
major comments (2)
- [§3.3 (Oracle Routing Diagnostic)] §3.3 (Oracle Routing Diagnostic) and associated results: The diagnostic replaces only the scalar gate while fixing the residual experts. Because the experts are explicitly prompt-conditioned and trained jointly under the learned gate/mixture distribution, substituting the gate at inference can change the effective selection statistics and conditioning seen by the experts. This raises the possibility that part of the reported median +12.9 pp keep-side gain reflects reduced train-test mismatch for the experts rather than a pure demonstration that route selectivity is the dominant bottleneck. An ablation or quantitative argument addressing this confound is needed to support the central interpretation.
- [Cross-backbone results table] Table reporting cross-backbone oracle results (period-5 or equivalent row): while oracle gains are shown on every row, the paper does not report whether the residual experts were retrained or re-evaluated under the oracle distribution; if they remain fixed to the learned distribution, the isolation of routing as the sole bottleneck is incomplete.
minor comments (3)
- [Method section] The abstract and method description omit the precise loss terms used to train the router (scalar gate + mixture) versus the residual experts; including the mathematical objective (e.g., Eq. for combined routing and editing loss) would improve reproducibility.
- [Experimental setup] Data-split details for the primary Gemma-3-4B-IT held-out set and how edit/keep labels are generated without leakage into router training are not fully specified; clarifying this would strengthen the claim of external grounding for the oracle.
- [Trajectory diagnostics] Figure or table captions for trajectory diagnostics should explicitly state the two backbones used and the metric for 'directed movement toward edit-target continuations'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address the major comments point by point below, clarifying the oracle diagnostic procedure and committing to revisions that strengthen the presentation of our results.
- Referee: [§3.3 (Oracle Routing Diagnostic)] §3.3 (Oracle Routing Diagnostic) and associated results: The diagnostic replaces only the scalar gate while fixing the residual experts. Because the experts are explicitly prompt-conditioned and trained jointly under the learned gate/mixture distribution, substituting the gate at inference can change the effective selection statistics and conditioning seen by the experts. This raises the possibility that part of the reported median +12.9 pp keep-side gain reflects reduced train-test mismatch for the experts rather than a pure demonstration that route selectivity is the dominant bottleneck. An ablation or quantitative argument addressing this confound is needed to support the central interpretation.
  Authors: We appreciate the referee's identification of this potential distributional effect. The residual experts are prompt-conditioned, meaning they take the full prompt as input to compute the bottleneck residual update at the later layers. The scalar gate solely determines whether this update is applied to the residual stream. Although training is joint, the experts are optimized to produce effective edits for prompts on which the router activates. In the oracle setting, we intervene precisely on the edit prompts using the same fixed expert parameters. To address the concern, we will include in the revision a quantitative argument comparing the distribution of prompts under learned versus oracle gating, along with a discussion of why the prompt-conditioning mitigates the train-test mismatch. This will support our interpretation that route selectivity remains the primary bottleneck.
  Revision: yes
- Referee: [Cross-backbone results table] Table reporting cross-backbone oracle results (period-5 or equivalent row): while oracle gains are shown on every row, the paper does not report whether the residual experts were retrained or re-evaluated under the oracle distribution; if they remain fixed to the learned distribution, the isolation of routing as the sole bottleneck is incomplete.
  Authors: As described in the manuscript, the oracle-routing diagnostic 'replaces only the learned scalar gate with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed.' The residual experts are therefore not retrained under the oracle distribution; they are evaluated using the parameters learned under the original routing. This fixed-expert design is central to isolating the contribution of route selectivity. We will revise the relevant table caption and the text in §3.3 to make this explicit and remove any potential ambiguity regarding the experimental protocol.
  Revision: yes
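The quantitative comparison of learned versus oracle gating promised in the first response could start from a simple disagreement count. The sketch below is one possible form, not the authors' analysis: it assumes binary gates and "edit"/"keep" labels, and the function name `routing_errors` and its output keys are hypothetical.

```python
from typing import Dict, List


def routing_errors(gates: List[float], labels: List[str]) -> Dict[str, float]:
    """Summarize where binary learned gates disagree with edit/keep labels."""
    if len(gates) != len(labels):
        raise ValueError("gates and labels must have the same length")
    keep_gates = [g for g, lab in zip(gates, labels) if lab == "keep"]
    edit_gates = [g for g, lab in zip(gates, labels) if lab == "edit"]
    # Fraction of keep prompts on which the editor is wrongly switched on.
    false_activation = sum(keep_gates) / len(keep_gates) if keep_gates else 0.0
    # Fraction of edit prompts on which the editor is wrongly left off.
    missed_edit = (
        sum(1.0 - g for g in edit_gates) / len(edit_gates) if edit_gates else 0.0
    )
    agree = sum(1 for g, lab in zip(gates, labels) if (g == 1.0) == (lab == "edit"))
    return {
        "false_activation_rate": false_activation,
        "missed_edit_rate": missed_edit,
        "agreement": agree / len(labels) if labels else 0.0,
    }


# Hypothetical toy inputs, not results from the paper.
stats = routing_errors([1.0, 0.0, 1.0, 0.0], ["edit", "edit", "keep", "keep"])
```

A high false-activation rate would mean the experts see many keep prompts under learned gating that they never see under oracle gating, which bears directly on the train-test mismatch the referee raises.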
Circularity Check
No significant circularity; oracle diagnostic uses independent held-out labels
The paper's derivation introduces Residual Paving by separating scalar gating from prompt-conditioned residual experts, then supports the claim that learned route selectivity is the main bottleneck via an oracle diagnostic. This diagnostic replaces only the learned scalar gate with held-out edit/keep labels while fixing the residual editor and backbone. The held-out labels are external to the fitted router parameters, and the experts remain unchanged, providing grounding independent of the model's learned distribution. No equations or steps reduce by construction to inputs (no self-definitional loops, no fitted parameters renamed as predictions). No load-bearing self-citations or uniqueness theorems from prior author work appear in the derivation. The empirical gains across backbones are presented as direct measurements rather than tautological renamings. The chain is self-contained against the stated diagnostic protocol.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: held-out edit/keep labels from the test split accurately represent the target behavior for oracle routing.
invented entities (1)
- Residual Paving router and bottleneck residual experts (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.