Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

Bryce Hinkley; Peyman Najafirad

arxiv: 2605.20262 · v1 · pith:NYUNBAJEnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 07:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords selective refusal editing · residual editing · route selectivity · oracle routing · frozen transformers · instruction-tuned models · bottleneck diagnosis · routed residual updates

The pith

Learned route selectivity is the main bottleneck in selective refusal editing of frozen transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies selective refusal editing as a control task: suppress refusal on chosen edit prompts while leaving benign behavior and harmful refusals unchanged elsewhere. It introduces Residual Paving, which splits the problem into an early-layer router that decides whether to intervene and later-layer residual experts that supply the actual edit. An oracle-routing test replaces only the learned gate with held-out edit/keep labels while leaving the rest of the system untouched. Across six backbones this oracle version raises the keep-side diagnostic score on every reported row, with a median gain of 12.9 percentage points, which supports the interpretation that learned routing, rather than the edit mechanism itself, is the main observed bottleneck.

Core claim

Residual Paving separates route selectivity from residual-edit capacity by using an early-layer router to predict a scalar gate and expert mixture; when the gate is active, prompt-conditioned bottleneck residual experts produce later-layer updates without altering the frozen backbone. Replacing the learned gate with held-out edit/keep labels while keeping the editor fixed improves the keep-side diagnostic score on every tested backbone. On Gemma-3-4B-IT the learned system drops edit refusal from 88.6% to 4.0% with 95.5% benign and 87.3% harmful distribution preservation, outperforming same-protocol one-direction steering baselines, which leave edit refusal at 86.8% (Edit-target ActAdd) and 78.9% (DIM-style refusal steering).
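
The routed residual update described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear router, the ReLU bottleneck, the `prompt_summary` vector, and every function name here are assumptions made for the example, since the page does not give the exact parametrization.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def router(early_state, gate_weights, mixture_weights):
    """Hypothetical linear router: early residual state -> (scalar gate, expert mixture)."""
    gate = sigmoid(sum(w * h for w, h in zip(gate_weights, early_state)))
    mixture = softmax(matvec(mixture_weights, early_state))
    return gate, mixture

def bottleneck_expert(prompt_summary, down, up):
    """Hypothetical bottleneck expert: project down, apply ReLU, project back up."""
    hidden = [max(0.0, z) for z in matvec(down, prompt_summary)]
    return matvec(up, hidden)

def paved_residual(late_state, prompt_summary, gate, mixture, experts):
    """Return late_state + gate * sum_k mixture[k] * expert_k(prompt_summary).

    The backbone is never modified; only the later-layer residual state is shifted.
    """
    update = [0.0] * len(late_state)
    for weight, (down, up) in zip(mixture, experts):
        delta = bottleneck_expert(prompt_summary, down, up)
        update = [u + weight * d for u, d in zip(update, delta)]
    return [h + gate * u for h, u in zip(late_state, update)]
```

With a gate of 0 the later-layer state passes through unchanged, which is the property the keep-side metrics depend on; with a gate of 1 the full mixture-weighted expert update is added.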

What carries the argument

The oracle-routing diagnostic, which substitutes held-out edit/keep labels for only the learned scalar gate while the residual editor and backbone remain fixed.
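
A small sketch may make the substitution concrete. It assumes three gate policies (soft sigmoid, hard threshold, oracle label) and a precomputed `edit_update` per prompt; the 0.5 threshold, the dictionary keys, and the function names are hypothetical choices for illustration, not details taken from the paper.

```python
import math

def soft_gate(logit):
    """Soft routing: the sigmoid of the router logit."""
    return 1.0 / (1.0 + math.exp(-logit))

def hard_gate(logit, threshold=0.5):
    """Hard routing: fire fully when the soft gate clears an assumed threshold."""
    return 1.0 if soft_gate(logit) >= threshold else 0.0

def oracle_gate(is_edit_prompt):
    """Oracle routing: use the held-out edit/keep label in place of the learned gate."""
    return 1.0 if is_edit_prompt else 0.0

def apply_edit(late_state, edit_update, gate):
    return [h + gate * u for h, u in zip(late_state, edit_update)]

def run_diagnostic(examples, policy):
    """Swap only the gate policy; the edit updates (the editor) stay fixed."""
    outputs = []
    for ex in examples:
        if policy == "oracle":
            gate = oracle_gate(ex["is_edit"])
        elif policy == "hard":
            gate = hard_gate(ex["logit"])
        elif policy == "soft":
            gate = soft_gate(ex["logit"])
        else:
            raise ValueError(f"unknown policy: {policy}")
        outputs.append(apply_edit(ex["late_state"], ex["edit_update"], gate))
    return outputs
```

In this toy setup, a keep prompt whose router logit is positive gets edited under the hard policy but is left untouched under the oracle policy, which is the kind of keep-side difference the diagnostic is meant to expose.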

If this is right

Methods that improve routing accuracy alone should close most of the remaining gap without changing the residual experts.
The persistent off-target harmful-keep degradation even under oracle routing indicates that residual-edit capacity still needs refinement.
Trajectory analysis on two backbones shows edits move toward edit-target continuations rather than generic refusal suppression.
The same routed-residual split can be applied to other selective-control tasks on frozen models.
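
For the trajectory point in the list above, one plausible way to quantify "movement toward" a reference is the cosine between the residual change and the direction to that reference. The page does not state which metric the paper uses, so cosine similarity, the reference states, and all names below are assumptions for illustration only.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare against
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def movement_alignment(base_state, edited_state, edit_target_state, generic_state):
    """Compare the edit-induced change with two hypothetical reference directions."""
    delta = _sub(edited_state, base_state)
    return {
        "toward_edit_target": cosine(delta, _sub(edit_target_state, base_state)),
        "toward_generic_suppression": cosine(delta, _sub(generic_state, base_state)),
    }
```

Under this sketch, a directed edit would show a higher `toward_edit_target` value than `toward_generic_suppression` value.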

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Routing improvements developed for mixture-of-experts models could be directly reused to strengthen selective editing.
Auxiliary losses that supervise the gate more explicitly might reduce the observed routing bottleneck.
The decomposition suggests testing whether multi-turn or chained refusal edits would expose additional routing limits.
If routing remains the dominant issue, hybrid systems that combine learned gates with lightweight external verifiers could be explored.

Load-bearing premise

The oracle-routing test isolates route selectivity without side effects from label quality or distribution shift.

What would settle it

An experiment that applies the same oracle labels but observes no keep-side improvement on a new backbone or with noisier labels would show the diagnostic does not cleanly isolate routing.
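
The check described here reduces to a per-row comparison plus a median. The sketch below shows that bookkeeping; the function name and the row format are hypothetical, and any scores passed to it in examples are made-up placeholders, not results from the paper.

```python
from statistics import median

def summarize_oracle_gains(rows):
    """rows: list of (learned_keep_score, oracle_keep_score) pairs, in percent.

    Returns per-row gains in percentage points, whether every row improved,
    and the median gain.
    """
    gains = [oracle - learned for learned, oracle in rows]
    return {
        "gains_pp": gains,
        "all_improved": all(g > 0 for g in gains),
        "median_gain_pp": median(gains),
    }

# Placeholder scores only (NOT from the paper), to show the call shape.
placeholder_rows = [(60.0, 70.0), (50.0, 55.0), (80.0, 100.0)]
summary = summarize_oracle_gains(placeholder_rows)
```

A new backbone or a noisier label set that yields `all_improved == False` would be the kind of outcome this section says could undercut the diagnostic.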

Figures

Figures reproduced from arXiv: 2605.20262 by Bryce Hinkley, Peyman Najafirad.

**Figure 2.** Operational view of Residual Paving. Early residual states (…

**Figure 3.** Gate policies used for learned and diagnostic routing. Soft routing applies the sigmoid gate, …

**Figure 4.** The residual edit moves more toward the edit-target trajectory than toward a refusal-like …

Original abstract

We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.
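
The headline comparisons in the abstract are simple differences in percentage points. The short calculation below uses only figures quoted in the abstract; it assumes the 88.6% base edit-refusal rate also applies to the same-protocol steering controls, and the variable names are illustrative.

```python
# All numbers are percentages quoted in the abstract (Gemma-3-4B-IT held-out split).
BASE_EDIT_REFUSAL = 88.6

edit_refusal_after = {
    "Residual Paving (learned)": 4.0,
    "Edit-target ActAdd": 86.8,
    "DIM-style refusal steering": 78.9,
}

# Reduction in edit refusal, in percentage points, relative to the frozen base.
reduction_pp = {
    name: round(BASE_EDIT_REFUSAL - value, 1)
    for name, value in edit_refusal_after.items()
}

# Off-target cost: harmful refusal under the edit vs. the frozen-base rate.
harmful_keep_gap_pp = round(81.6 - 65.3, 1)
```

This gives reductions of 84.6 pp for learned Residual Paving against 1.8 pp and 9.7 pp for the two steering controls, and a remaining harmful-keep gap of 16.3 pp.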

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Routing looks like the main limit on selective refusal edits here, but joint training of the prompt-conditioned experts weakens how cleanly the oracle isolates that bottleneck.

The main takeaway is that this work makes a reasonable case that learned route selectivity is limiting how well we can do selective refusal edits in LLMs, with their oracle routing test showing solid gains on the preservation side across six different backbones.

They separate the problem into route selectivity and residual-edit capacity using Residual Paving. An early router decides the gate and which experts to mix, then prompt-conditioned residual experts in later layers make the actual changes to the activations. This setup allows the oracle diagnostic where they swap the learned gate for held-out true labels but leave the experts and backbone alone. That change improves the keep-side score every time, median +12.9 points. On their main Gemma model they cut edit refusal way down while mostly keeping benign and harmful behaviors intact, and they outperform simple steering approaches that don't move the needle much on edits.

The paper does a good job with the cross-model results and pointing out the remaining off-target issue where harmful refusals aren't fully preserved. The trajectory diagnostics add a bit more color on what the edits are actually doing.

The concern that stands out is whether the oracle really pins down routing as the bottleneck. Because the experts are prompt-conditioned and trained jointly with the router, using oracle gates at test time changes the distribution of inputs the experts receive. Some of the observed improvement might come from the experts seeing more familiar conditions rather than from routing being the only weak link. The fact that harmful-keep performance still lags the base model supports that there's more to fix.

This kind of paper is useful for researchers working on targeted behavioral edits and AI safety controls. Readers who care about activation patching or steering methods will see value in the decomposition even if the full technique isn't adopted. It has enough structure and results to merit a serious referee. The central interpretation is plausible from the numbers given, though the methods will need scrutiny on training details and data handling. I would recommend putting it through peer review.

Referee Report

2 major / 3 minor

Summary. The paper introduces Residual Paving, a routed residual editing approach for selective refusal editing in frozen instruction-tuned transformers. It decomposes the task into route selectivity (early-layer router predicting scalar gate and expert mixture) and residual-edit capacity (prompt-conditioned bottleneck residual experts applied later). An oracle-routing diagnostic replaces only the learned scalar gate with held-out edit/keep labels while keeping the residual editor and backbone fixed. On Gemma-3-4B-IT, learned Residual Paving achieves 4.0% edit refusal (vs. 88.6% baseline) with high preservation rates; across six backbones, oracle routing yields consistent keep-side gains (median +12.9 pp), supporting the claim that learned route selectivity is the primary bottleneck. Comparisons to one-direction steering baselines and trajectory diagnostics are also reported.

Significance. If the oracle diagnostic isolates route selectivity without substantial confounding from joint training of prompt-conditioned experts, the work offers a useful diagnostic lens on selective editing failures and highlights routing as a key lever for improving edit success while preserving behavior outside the edit set. The cross-backbone consistency and explicit separation of routing from editing capacity are strengths; the method also supplies a concrete, externally grounded test (held-out labels) rather than relying solely on fitted quantities.

major comments (2)

[§3.3 (Oracle Routing Diagnostic)] §3.3 (Oracle Routing Diagnostic) and associated results: The diagnostic replaces only the scalar gate while fixing the residual experts. Because the experts are explicitly prompt-conditioned and trained jointly under the learned gate/mixture distribution, substituting the gate at inference can change the effective selection statistics and conditioning seen by the experts. This raises the possibility that part of the reported median +12.9 pp keep-side gain reflects reduced train-test mismatch for the experts rather than a pure demonstration that route selectivity is the dominant bottleneck. An ablation or quantitative argument addressing this confound is needed to support the central interpretation.
[Cross-backbone results table] Table reporting cross-backbone oracle results (period-5 or equivalent row): while oracle gains are shown on every row, the paper does not report whether the residual experts were retrained or re-evaluated under the oracle distribution; if they remain fixed to the learned distribution, the isolation of routing as the sole bottleneck is incomplete.

minor comments (3)

[Method section] The abstract and method description omit the precise loss terms used to train the router (scalar gate + mixture) versus the residual experts; including the mathematical objective (e.g., Eq. for combined routing and editing loss) would improve reproducibility.
[Experimental setup] Data-split details for the primary Gemma-3-4B-IT held-out set and how edit/keep labels are generated without leakage into router training are not fully specified; clarifying this would strengthen the claim of external grounding for the oracle.
[Trajectory diagnostics] Figure or table captions for trajectory diagnostics should explicitly state the two backbones used and the metric for 'directed movement toward edit-target continuations'.
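
The leakage concern in the experimental-setup comment can be checked mechanically once the splits are available. The sketch below is one simple way to do it, assuming prompts are plain strings and that case and whitespace differences should not count as distinct prompts; the function names are hypothetical.

```python
def normalize(prompt):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(prompt.lower().split())

def find_leaked_prompts(router_train_prompts, held_out_prompts):
    """Return normalized held-out prompts that also appear in router training data."""
    train = {normalize(p) for p in router_train_prompts}
    held_out = {normalize(p) for p in held_out_prompts}
    return sorted(held_out & train)
```

An empty result would support the claim that the oracle labels are external to router training, at least at the level of exact prompt overlap; it would not detect paraphrases.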

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address the major comments point by point below, clarifying the oracle diagnostic procedure and committing to revisions that strengthen the presentation of our results.

Referee: [§3.3 (Oracle Routing Diagnostic)] §3.3 (Oracle Routing Diagnostic) and associated results: The diagnostic replaces only the scalar gate while fixing the residual experts. Because the experts are explicitly prompt-conditioned and trained jointly under the learned gate/mixture distribution, substituting the gate at inference can change the effective selection statistics and conditioning seen by the experts. This raises the possibility that part of the reported median +12.9 pp keep-side gain reflects reduced train-test mismatch for the experts rather than a pure demonstration that route selectivity is the dominant bottleneck. An ablation or quantitative argument addressing this confound is needed to support the central interpretation.

Authors: We appreciate the referee's identification of this potential distributional effect. The residual experts are prompt-conditioned, meaning they take the full prompt as input to compute the bottleneck residual update at the later layers. The scalar gate solely determines whether this update is applied to the residual stream. Although training is joint, the experts are optimized to produce effective edits for prompts on which the router activates. In the oracle setting, we intervene precisely on the edit prompts using the same fixed expert parameters. To address the concern, we will include in the revision a quantitative argument comparing the distribution of prompts under learned versus oracle gating, along with a discussion of why the prompt-conditioning mitigates the train-test mismatch. This will support our interpretation that route selectivity remains the primary bottleneck. revision: yes
Referee: [Cross-backbone results table] Table reporting cross-backbone oracle results (period-5 or equivalent row): while oracle gains are shown on every row, the paper does not report whether the residual experts were retrained or re-evaluated under the oracle distribution; if they remain fixed to the learned distribution, the isolation of routing as the sole bottleneck is incomplete.

Authors: As described in the manuscript, the oracle-routing diagnostic 'replaces only the learned scalar gate with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed.' The residual experts are therefore not retrained under the oracle distribution; they are evaluated using the parameters learned under the original routing. This fixed-expert design is central to isolating the contribution of route selectivity. We will revise the relevant table caption and the text in §3.3 to make this explicit and remove any potential ambiguity regarding the experimental protocol. revision: yes
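
One ingredient of the distributional comparison discussed in the first response could be the rate at which the learned gate and the oracle label disagree, since those are the prompts on which the fixed experts are applied under one policy but not the other. The sketch below is an assumption about how such a comparison might be set up, not the authors' analysis; the 0.5 threshold and all names are illustrative.

```python
def gate_disagreement(learned_gates, oracle_labels, threshold=0.5):
    """learned_gates: floats in [0, 1]; oracle_labels: bools (True = edit prompt).

    Returns the share of keep prompts on which the learned gate fires and the
    share of edit prompts on which it stays off.
    """
    if len(learned_gates) != len(oracle_labels):
        raise ValueError("gates and labels must have the same length")
    pairs = list(zip(learned_gates, oracle_labels))
    false_fires = sum(1 for g, is_edit in pairs if g >= threshold and not is_edit)
    missed_edits = sum(1 for g, is_edit in pairs if g < threshold and is_edit)
    n_keep = sum(1 for _, is_edit in pairs if not is_edit)
    n_edit = len(pairs) - n_keep
    return {
        "false_activation_rate": false_fires / n_keep if n_keep else 0.0,
        "missed_edit_rate": missed_edits / n_edit if n_edit else 0.0,
    }
```

A high false-activation rate would point to keep prompts being edited under learned routing; these rates alone do not separate a routing effect from the train-test mismatch the referee raises, so they would complement rather than replace an ablation.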

Circularity Check

0 steps flagged

No significant circularity; oracle diagnostic uses independent held-out labels

The paper's derivation introduces Residual Paving by separating scalar gating from prompt-conditioned residual experts, then supports the claim that learned route selectivity is the main bottleneck via an oracle diagnostic. This diagnostic replaces only the learned scalar gate with held-out edit/keep labels while fixing the residual editor and backbone. The held-out labels are external to the fitted router parameters, and the experts remain unchanged, providing grounding independent of the model's learned distribution. No equations or steps reduce by construction to inputs (no self-definitional loops, no fitted parameters renamed as predictions). No load-bearing self-citations or uniqueness theorems from prior author work appear in the derivation. The empirical gains across backbones are presented as direct measurements rather than tautological renamings. The chain is self-contained against the stated diagnostic protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the validity of the oracle diagnostic and standard assumptions that held-out labels reflect desired behavior and that residual updates can be isolated without backbone interference.

axioms (1)

domain assumption: Held-out edit/keep labels from the test split accurately represent the target behavior for oracle routing.
Invoked when the oracle replaces the learned gate to diagnose routing as the bottleneck.

invented entities (1)

Residual Paving router and bottleneck residual experts (no independent evidence)
purpose: To separate route selectivity from edit application in frozen transformers.
Newly introduced technique whose performance is evaluated in the reported experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 10 internal anchors

[1]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author =. arXiv preprint arXiv:1701.06538 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author =. arXiv preprint arXiv:2101.03961 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[3]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

R. arXiv preprint arXiv:2308.01263 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Steering Language Models With Activation Engineering

Steering Language Models With Activation Engineering , author =. arXiv preprint arXiv:2308.10248 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation Engineering: A Top-Down Approach to AI Transparency , author =. arXiv preprint arXiv:2310.01405 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Steering Llama 2 via Contrastive Activation Addition

Steering Llama 2 via Contrastive Activation Addition , author =. arXiv preprint arXiv:2312.06681 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[7]

2024 , url =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , journal =. 2024 , url =

work page 2024
[8]

2024 , url =

Li, Lijun and Dong, Bowen and Wang, Ruohui and Hu, Xuhao and Zuo, Wangmeng and Lin, Dahua and Qiao, Yu and Shao, Jing , journal =. 2024 , url =

work page 2024
[9]

Souly, Alexandra and Lu, Qingyuan and Bowen, Dillon and Trinh, Tu and Hsieh, Elvis and Pandey, Sana and Abbeel, Pieter and Svegliato, Justin and Emmons, Scott and Watkins, Olivia and Toyer, Sam , journal =. A. 2024 , url =

work page 2024
[10]

Advances in Neural Information Processing Systems , year =

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author =. Advances in Neural Information Processing Systems , year =. doi:10.52202/079017-1493 , url =

work page doi:10.52202/079017-1493
[11]

Advances in Neural Information Processing Systems, Datasets and Benchmarks Track , year =

The Art of Saying No: Contextual Noncompliance in Language Models , author =. Advances in Neural Information Processing Systems, Datasets and Benchmarks Track , year =

work page
[12]

Refusal in Language Models Is Mediated by a Single Direction

Refusal in Language Models Is Mediated by a Single Direction , author =. arXiv preprint arXiv:2406.11717 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Programming Refusal with Conditional Activation Steering , author =. arXiv preprint arXiv:2409.05907 , year =

work page arXiv
[14]

Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877,

Improving Instruction-Following in Language Models through Activation Steering , author =. arXiv preprint arXiv:2410.12877 , year =

work page arXiv
[15]

Advances in Neural Information Processing Systems , year =

Steering When Necessary: Flexible Steering Large Language Models with Backtracking , author =. Advances in Neural Information Processing Systems , year =

work page
[16]

Advances in Neural Information Processing Systems , year =

Angular Steering: Behavior Control via Rotation in Activation Space , author =. Advances in Neural Information Processing Systems , year =

work page
[17]

International Conference on Learning Representations , year =

Activation Steering with a Feedback Controller , author =. International Conference on Learning Representations , year =

work page
[18]

Gemma 3 Technical Report

Gemma 3 Technical Report , author =. arXiv preprint arXiv:2503.19786 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Steering

Fayyaz, Mohsen and Modarressi, Ali and Deilamsalehy, Hanieh and Dernoncourt, Franck and Rossi, Ryan and Bui, Trung and Sch. Steering. arXiv preprint arXiv:2509.09660 , year =

work page arXiv
[20]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. arXiv preprint arXiv:2307.15043 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Jailbreaking Black Box Large Language Models in Twenty Queries

Jailbreaking Black Box Large Language Models in Twenty Queries , author =. arXiv preprint arXiv:2310.08419 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[22]

and Tramer, Florian and Hassani, Hamed and Wong, Eric , journal =

Chao, Patrick and Debenedetti, Edoardo and Robey, Alexander and Andriushchenko, Maksym and Croce, Francesco and Sehwag, Vikash and Dobriban, Edgar and Flammarion, Nicolas and Pappas, George J. and Tramer, Florian and Hassani, Hamed and Wong, Eric , journal =. 2024 , url =

work page 2024
[23]

2024 , url =

Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui , journal =. 2024 , url =

work page 2024
[24]

2024 , url =

Xie, Tinghao and Qi, Xiangyu and Zeng, Yi and Huang, Yangsibo and Sehwag, Udari Madhushani and Huang, Kaixuan and He, Luxi and Wei, Boyi and Li, Dacheng and Sheng, Ying and Jia, Ruoxi and Li, Bo and Li, Kai and Chen, Danqi and Henderson, Peter and Mittal, Prateek , journal =. 2024 , url =

work page 2024
[25]

Semantics-Adaptive Activation Intervention for

Wang, Weixuan and Yang, Jingyuan and Peng, Wei , booktitle =. Semantics-Adaptive Activation Intervention for. 2025 , url =

work page 2025
[26]

Refusal in

Marshall, Thomas and Scherlis, Adam and Belrose, Nora , journal =. Refusal in. 2024 , url =

work page 2024
[27]

Zhihao Xu, Ruixuan Huang, Changyu Chen, and Xiting Wang

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence , author =. arXiv preprint arXiv:2502.17420 , year =

work page arXiv
[28]

, journal =

Zhang, Zhehao and Xu, Weijie and Wu, Fanyou and Reddy, Chandan K. , journal =. 2025 , url =

work page 2025
[29]

2025 , url =

Zhao, Jiachen and Huang, Jing and Wu, Zhengxuan and Bau, David and Shi, Weiyan , journal =. 2025 , url =

work page 2025
[30]

Advances in Neural Information Processing Systems , year =

Refusal Direction is Universal Across Safety-Aligned Languages , author =. Advances in Neural Information Processing Systems , year =

work page
[31]

Workshop on Actionable Interpretability at ICML , year =

Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations , author =. Workshop on Actionable Interpretability at ICML , year =

work page
[32]

Workshop on Actionable Interpretability at ICML , year =

Steering Language Model Refusal with Sparse Autoencoders , author =. Workshop on Actionable Interpretability at ICML , year =

work page
[33]

2024 , url =

Liu, Xiaogeng and Xu, Nan and Chen, Muhao and Xiao, Chaowei , booktitle =. 2024 , url =

work page 2024
[34]

and Liu, Yang and Song, Dawn and Wang, Chenguang , journal =

Siu, Vincent and Crispino, Nicholas and Henry, Nathan W. and Liu, Yang and Song, Dawn and Wang, Chenguang , journal =. 2025 , url =

work page 2025
[35]

Refusal Steering: Fine-grained Control over

Garc. Refusal Steering: Fine-grained Control over. arXiv preprint arXiv:2512.16602 , year =

work page arXiv
[36]

and Crispino, Nicholas and Liu, Yang and Song, Dawn and Wang, Chenguang , booktitle =

Siu, Vincent and Henry, Nathan W. and Crispino, Nicholas and Liu, Yang and Song, Dawn and Wang, Chenguang , booktitle =. 2026 , url =

work page 2026
[37]

arXiv preprint arXiv:2603.05773 , year =

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models , author =. arXiv preprint arXiv:2603.05773 , year =

work page arXiv
[38]

Advances in Neural Information Processing Systems , year =

Locating and Editing Factual Associations in GPT , author =. Advances in Neural Information Processing Systems , year =

work page
[39]

International Conference on Learning Representations , year =

Mass-Editing Memory in a Transformer , author =. International Conference on Learning Representations , year =

work page
[40]

International Conference on Learning Representations , year =

Editing Models with Task Arithmetic , author =. International Conference on Learning Representations , year =

work page
[41]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

work page

[1] [1]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author =. arXiv preprint arXiv:1701.06538 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author =. arXiv preprint arXiv:2101.03961 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

R. arXiv preprint arXiv:2308.01263 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Steering Language Models With Activation Engineering

Steering Language Models With Activation Engineering , author =. arXiv preprint arXiv:2308.10248 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation Engineering: A Top-Down Approach to AI Transparency , author =. arXiv preprint arXiv:2310.01405 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Steering Llama 2 via Contrastive Activation Addition

Steering Llama 2 via Contrastive Activation Addition , author =. arXiv preprint arXiv:2312.06681 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

2024 , url =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , journal =. 2024 , url =

work page 2024

[8] [8]

2024 , url =

Li, Lijun and Dong, Bowen and Wang, Ruohui and Hu, Xuhao and Zuo, Wangmeng and Lin, Dahua and Qiao, Yu and Shao, Jing , journal =. 2024 , url =

work page 2024

[9] [9]

Souly, Alexandra and Lu, Qingyuan and Bowen, Dillon and Trinh, Tu and Hsieh, Elvis and Pandey, Sana and Abbeel, Pieter and Svegliato, Justin and Emmons, Scott and Watkins, Olivia and Toyer, Sam , journal =. A. 2024 , url =

work page 2024

[10] [10]

Advances in Neural Information Processing Systems , year =

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author =. Advances in Neural Information Processing Systems , year =. doi:10.52202/079017-1493 , url =

work page doi:10.52202/079017-1493

[11] [11]

Advances in Neural Information Processing Systems, Datasets and Benchmarks Track , year =

The Art of Saying No: Contextual Noncompliance in Language Models , author =. Advances in Neural Information Processing Systems, Datasets and Benchmarks Track , year =

work page

[12] [12]

Refusal in Language Models Is Mediated by a Single Direction

Refusal in Language Models Is Mediated by a Single Direction , author =. arXiv preprint arXiv:2406.11717 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Programming Refusal with Conditional Activation Steering , author =. arXiv preprint arXiv:2409.05907 , year =

work page arXiv

[14] [14]

Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877,

Improving Instruction-Following in Language Models through Activation Steering , author =. arXiv preprint arXiv:2410.12877 , year =

work page arXiv

[15] [15]

Advances in Neural Information Processing Systems , year =

Steering When Necessary: Flexible Steering Large Language Models with Backtracking , author =. Advances in Neural Information Processing Systems , year =

work page

[16] [16]

Advances in Neural Information Processing Systems , year =

Angular Steering: Behavior Control via Rotation in Activation Space , author =. Advances in Neural Information Processing Systems , year =

work page

[17] [17]

International Conference on Learning Representations , year =

Activation Steering with a Feedback Controller , author =. International Conference on Learning Representations , year =

work page

[18] [18]

Gemma 3 Technical Report

Gemma 3 Technical Report , author =. arXiv preprint arXiv:2503.19786 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[19] Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, and Sch. Steering. arXiv preprint arXiv:2509.09660.

[20] Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043.

[21] Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv preprint arXiv:2310.08419.

[22] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. 2024.

[23] Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. 2024.

[24] Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. 2024.

[25] Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-Adaptive Activation Intervention for. 2025.

[26] Thomas Marshall, Adam Scherlis, and Nora Belrose. Refusal in. 2024.

[27] Zhihao Xu, Ruixuan Huang, Changyu Chen, and Xiting Wang

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence. arXiv preprint arXiv:2502.17420.

[28] Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. 2025.

[29] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. 2025.

[30] Refusal Direction is Universal Across Safety-Aligned Languages. Advances in Neural Information Processing Systems.

[31] Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations. Workshop on Actionable Interpretability at ICML.

[32] Steering Language Model Refusal with Sparse Autoencoders. Workshop on Actionable Interpretability at ICML.

[33] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024.

[34] Vincent Siu, Nicholas Crispino, Nathan W. Henry, Yang Liu, Dawn Song, and Chenguang Wang. 2025.

[35] Garc. Refusal Steering: Fine-grained Control over. arXiv preprint arXiv:2512.16602.

[36] Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, and Chenguang Wang. 2026.

[37] Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models. arXiv preprint arXiv:2603.05773.

[38] Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems.

[39] Mass-Editing Memory in a Transformer. International Conference on Learning Representations.

[40] Editing Models with Task Arithmetic. International Conference on Learning Representations.

[41] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.