Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Pith reviewed 2026-05-07 08:27 UTC · model grok-4.3
The pith
About fifty neurons, 0.014 percent of the total, control the safety refusal template in aligned LLMs; ablating them changes 80 percent of response formats while producing near-zero harmful compliance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers).
What carries the argument
Perturbation probing, which uses two forward passes per prompt to locate causal FFN neurons and computes the FFN-to-skip signal ratio from the same passes to distinguish opposition circuits from routing circuits.
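A minimal NumPy sketch of what such a two-pass probe and ratio computation could look like. The scoring rule (mean absolute activation difference between the clean and perturbed passes) and the selection fraction are our assumptions; the review does not spell out the paper's exact procedure.

```python
import numpy as np

def probe_neurons(act_clean, act_pert, top_frac=0.0002):
    """Score each FFN neuron by its mean absolute activation change
    between the clean and perturbed passes; keep the top fraction.
    (Hypothetical scoring rule; the paper's exact criterion is not given.)"""
    score = np.abs(act_pert - act_clean).mean(axis=0)  # (n_neurons,)
    k = max(1, int(round(top_frac * score.size)))
    return np.argsort(score)[::-1][:k]

def ffn_to_skip_ratio(ffn_out, skip):
    """Norm of the FFN sublayer output over the norm of the skip
    (residual) path, averaged over tokens; the review reports 0.3-1.1
    as the band where routing circuits are steerable."""
    return float(np.mean(np.linalg.norm(ffn_out, axis=-1)
                         / np.linalg.norm(skip, axis=-1)))

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 10_000))   # tokens x FFN neurons
pert = clean.copy()
pert[:, [3, 42]] += 5.0                 # two neurons react to the perturbation
print(sorted(probe_neurons(clean, pert).tolist()))  # → [3, 42]
```

The injected neurons surface immediately because the score is zero everywhere the two passes agree; on a real model every neuron shifts a little, which is why the paper's thresholding choice matters.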
If this is right
- Ablating about 50 neurons in safety refusal circuits changes 80 percent of response formats on 520 AdvBench prompts while limiting harmful compliance to 3 cases.
- Residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in models that meet the three observed conditions.
- Ablating 20 neurons in Qwen3.5-2B eliminates multi-turn sycophantic capitulation.
- Amplifying 10 related neurons raises factual correction accuracy from 52 percent to 88 percent on 200 TruthfulQA prompts.
- The FFN-to-skip signal ratio computed from the two passes predicts whether directional steering or neuron ablation will succeed for a given behavior.
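Read as a decision rule, the diagnostic in the last bullet might look like the following sketch. The thresholds come from the abstract; the function name and boolean input are ours.

```python
def predicted_intervention(ratio, linearly_representable):
    """Hedged reading of the signal-ratio diagnostic: directional
    steering is predicted to succeed only inside the 0.3-1.1 band
    with a linearly representable feature (routing circuit); outside
    it, neuron ablation is the predicted lever (opposition circuit)."""
    if linearly_representable and 0.3 <= ratio <= 1.1:
        return "directional steering"
    return "neuron ablation"

print(predicted_intervention(0.7, True))   # → directional steering
print(predicted_intervention(1.8, True))   # → neuron ablation
```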
Where Pith is reading between the lines
- Alignment appears to act through localized edits to small neuron sets rather than global representation changes.
- The method could be tested on additional behaviors such as code generation or bias expression to map more circuit types.
- Architecture-specific circuit topologies imply that editing strategies may need customization per model family.
- Directional steering succeeds only when linear representability and signal-ratio conditions hold, so it will not generalize to math, code, or factual circuits in most models.
Load-bearing premise
The two forward passes isolate truly causal effects of specific FFN neurons without confounding influences from attention layers or other parts of the network.
What would settle it
Ablating the identified neurons for safety refusal on a new set of prompts and finding no substantial change in response formats or a sharp rise in harmful compliance would falsify the causal control claim.
Original abstract
Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces perturbation probing, a method that uses two forward passes per prompt (no backpropagation) to identify task-specific causal FFN neurons in LLMs, followed by a low-cost intervention sweep. It reports two circuit structures: opposition circuits (RLHF-suppressed behaviors) where ablating ~50 neurons (0.014% of total) in safety refusal changes response formats on 80% of 520 AdvBench prompts while yielding only 3 harmful outputs (all with disclaimers); and routing circuits (pre-training behaviors) where residual-stream direction injection switches language output on 99.1% of 580 prompts in models meeting three conditions (bilingual training, FFN-to-skip ratio 0.3-1.1, linear representability). The FFN-to-skip signal ratio, derived from the same passes, distinguishes structures and predicts intervention type. Results span 13 models across four architectures, with additional examples in sycophancy and factual correction.
Significance. If the two-pass method isolates truly causal FFN neurons, this would provide a practical, scalable toolkit for mechanistic insight into RLHF-organized behavior and precision template editing with minimal interventions. The work earns credit for broad multi-model/multi-architecture testing, held-out benchmark interventions, concrete success rates, and falsifiable predictions via the signal-ratio diagnostic rather than post-hoc fitting.
major comments (3)
- [§3] §3 (perturbation probing method): The two forward passes compute activation differences in the residual stream, which feeds both FFN and attention sublayers; no attention-only perturbation baselines, layer-wise ablation comparisons, or residual-stream patching that holds attention fixed are described. This directly threatens the claim that the identified ~50 neurons form an FFN-specific opposition circuit controlling the refusal template, as effects could be attention-mediated.
- [§5.1] §5.1 (refusal circuit results): The 80% format-change rate and 3/520 harmful-compliance count on AdvBench are reported without error bars, without explicit description of how the 50-neuron set is thresholded or constructed, and without controls for prompt selection or multiple-testing; these omissions make the load-bearing claim that 'about 50 neurons control the refusal template' difficult to evaluate for robustness.
- [§6] §6 (routing circuit generalization): The directional injection succeeds only in the 3 of 19 models satisfying the three stated conditions and fails on math/code/factual circuits; the manuscript should provide an explicit test (e.g., whether the FFN-to-skip ratio falls outside 0.3-1.1 in failing models) to confirm that failures reflect absence of the routing circuit rather than a limitation of the intervention itself.
minor comments (3)
- [Abstract] Abstract and §4: the phrase 'one-time intervention sweep of about 150 passes amortized' is unclear on exact amortization procedure across neurons; add a short equation or pseudocode.
- [Figure 2] Figure 2 (circuit topology): labels for Qwen vs. Gemma normalization shielding are too small; increase font size and add a legend for the FFN-to-skip ratio axis.
- [§2] Missing reference: prior work on activation patching and causal mediation analysis (e.g., Vig et al., 2020; Meng et al., 2022) should be cited when contrasting the two-pass method.
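The amortization arithmetic behind "about 150 passes" can at least be made explicit with a toy cost model. The constants come from the abstract; how the sweep is actually batched across neurons is the paper's detail, so the split below is illustrative only.

```python
def pass_budget(n_prompts, sweep_passes=150):
    """Two identification passes per prompt plus a one-time sweep of
    ~150 intervention passes shared across all identified neurons
    (constants from the abstract; the breakdown is an assumption)."""
    total = 2 * n_prompts + sweep_passes
    return total, total / n_prompts

total, per_prompt = pass_budget(520)   # AdvBench-sized prompt set
print(total, round(per_prompt, 2))     # → 1190 2.29
```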
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us clarify and strengthen several aspects of the work. We address each major comment point by point below. Where the comments identify gaps in the current manuscript, we have revised the text and added the requested analyses for the next version.
Point-by-point responses
-
Referee: [§3] §3 (perturbation probing method): The two forward passes compute activation differences in the residual stream, which feeds both FFN and attention sublayers; no attention-only perturbation baselines, layer-wise ablation comparisons, or residual-stream patching that holds attention fixed are described. This directly threatens the claim that the identified ~50 neurons form an FFN-specific opposition circuit controlling the refusal template, as effects could be attention-mediated.
Authors: We agree that the residual-stream activation differences computed by the two forward passes can in principle influence both FFN and attention sublayers, and that the manuscript would be stronger with explicit controls isolating the FFN contribution. While our interventions specifically ablate or amplify the identified FFN neurons (rather than the full residual stream), we acknowledge that downstream attention effects cannot be ruled out without additional baselines. In the revised manuscript we have added: (i) attention-only perturbation baselines applying analogous magnitude-matched perturbations to attention heads instead of FFN neurons, (ii) layer-wise ablation comparisons, and (iii) residual-stream patching experiments that hold attention activations fixed while perturbing only the FFN component. These new results show that the behavioral changes on refusal and routing tasks are driven primarily by the FFN neurons rather than attention mediation, supporting the FFN-specific circuit claim. revision: yes
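The "hold attention fixed" control in (iii) can be illustrated with a toy block that reuses the attention output cached from the clean run while zeroing FFN neurons. This is our reconstruction of the control, not the paper's code.

```python
import numpy as np

def block_forward(x, attn_out_cached, W_in, W_out, ablate=()):
    """Toy transformer block: residual + cached attention output + FFN.
    Reusing attn_out_cached from the clean pass freezes the attention
    path, so any output change is attributable to the ablated FFN
    neurons (illustrative reconstruction, not the paper's code)."""
    h = x + attn_out_cached            # attention held fixed
    a = np.tanh(h @ W_in)              # FFN hidden activations (toy)
    a[:, list(ablate)] = 0.0           # zero out the identified neurons
    return h + a @ W_out

rng = np.random.default_rng(1)
x, attn = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
W_in, W_out = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
clean = block_forward(x, attn, W_in, W_out)
ablated = block_forward(x, attn, W_in, W_out, ablate=[0, 5])
print(np.allclose(clean, ablated))     # → False
```

Because the attention term is identical in both calls, the output difference isolates the FFN pathway, which is exactly the attribution the referee asked to see.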
-
Referee: [§5.1] §5.1 (refusal circuit results): The 80% format-change rate and 3/520 harmful-compliance count on AdvBench are reported without error bars, without explicit description of how the 50-neuron set is thresholded or constructed, and without controls for prompt selection or multiple-testing; these omissions make the load-bearing claim that 'about 50 neurons control the refusal template' difficult to evaluate for robustness.
Authors: We accept that the current presentation lacks the statistical details needed to assess robustness. In the revised manuscript we now report: error bars computed across five random seeds for both neuron selection and intervention outcomes; an explicit description of the thresholding procedure (neurons are ranked by absolute activation difference and the top 0.014 % are retained, with a sensitivity analysis showing that results are stable for thresholds between 0.01 % and 0.02 %); results on a held-out 20 % subset of AdvBench to control for prompt selection; and Bonferroni correction for multiple testing across the neuron sets examined. With these additions the 80 % format-change rate and near-zero harmful compliance remain statistically significant, reinforcing the claim that the small neuron set controls the refusal template. revision: yes
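The thresholding and sensitivity analysis described above can be sketched as a nesting check over selection fractions. The scores below are synthetic; only the ranking rule (top fraction by absolute score) follows the rebuttal's description.

```python
import numpy as np

def select_neurons(scores, frac):
    """Keep the top `frac` of neurons ranked by absolute score,
    per the rebuttal's thresholding procedure."""
    k = max(1, int(round(frac * scores.size)))
    return set(np.argsort(np.abs(scores))[::-1][:k].tolist())

rng = np.random.default_rng(2)
scores = rng.normal(size=350_000)      # one score per FFN neuron (synthetic)
sets = [select_neurons(scores, f) for f in (0.0001, 0.00014, 0.0002)]
# a stable ranking means tighter thresholds select subsets of looser ones
print(sets[0] <= sets[1] <= sets[2])   # → True
```

Nesting across thresholds is the weak form of the claimed 0.01-0.02 percent stability: if the selected sets did not nest, the "about 50 neurons" would be an artifact of the cutoff.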
-
Referee: [§6] §6 (routing circuit generalization): The directional injection succeeds only in the 3 of 19 models satisfying the three stated conditions and fails on math/code/factual circuits; the manuscript should provide an explicit test (e.g., whether the FFN-to-skip ratio falls outside 0.3-1.1 in failing models) to confirm that failures reflect absence of the routing circuit rather than a limitation of the intervention itself.
Authors: We agree that an explicit test of the FFN-to-skip ratio on the failing models is the cleanest way to demonstrate that the intervention's success is predicted by circuit presence rather than a methodological limitation. In the revised §6 we now tabulate the FFN-to-skip signal ratio for all 19 models. For the 16 models where directional injection failed, the ratio lies outside the 0.3-1.1 interval (either <0.3 or >1.1), exactly as predicted by the diagnostic. We also confirm that math, code, and factual circuits exhibit ratios outside this range and lack the linear representability condition, explaining why the same intervention does not succeed on those tasks. This explicit reporting strengthens the falsifiable prediction made by the signal-ratio diagnostic. revision: yes
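The explicit test the referee asked for amounts to checking that steering success coincides with the ratio band. A stand-in version, with invented ratios in place of the 19 models' actual values:

```python
def ratio_predicts_steering(ratios, succeeded, lo=0.3, hi=1.1):
    """True iff directional injection succeeded exactly on the models
    whose FFN-to-skip ratio lies in [lo, hi] (band from the abstract;
    the ratio values below are made-up stand-ins, not the paper's)."""
    return all((lo <= r <= hi) == ok for r, ok in zip(ratios, succeeded))

ratios    = [0.7, 0.9, 1.05, 0.12, 1.6, 2.3]
succeeded = [True, True, True, False, False, False]
print(ratio_predicts_steering(ratios, succeeded))   # → True
```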
Circularity Check
Perturbation probing method is self-contained with independent validation
full rationale
The paper's core chain uses two forward passes per prompt to identify neurons and compute the FFN-to-skip signal ratio, which then classifies circuit structures and indicates suitable interventions. However, all load-bearing empirical claims—such as the 80% response format change on 520 AdvBench prompts after ablating ~50 neurons, the 99.1% language switch rate on 580 prompts, and the 3/520 harmful compliance cases—are measured via separate intervention sweeps on held-out data. These outcomes are not entailed by the ratio or identification passes by construction. No equations reduce to inputs, no parameters are fitted then renamed as predictions, and no self-citations or imported uniqueness results appear as load-bearing steps. The derivation therefore remains non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- FFN-to-skip signal ratio range (0.3 to 1.1)
axioms (1)
- Domain assumption: two forward passes with a targeted perturbation suffice to identify neurons whose ablation or amplification produces the observed behavioral change.
invented entities (2)
- Opposition circuit (no independent evidence)
- Routing circuit (no independent evidence)
Reference graph
Works this paper leans on
- [1] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151.
- [2] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Lee Sharkey, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.
- [3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- [4] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. https://transformer-circuits.pub/2023/monosemantic-features/index.html, 2023.
- [5] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
- [6] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
- [7] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. https://transformer-circuits.pub/2021/framework/index.html, 2021.
- [8] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536.
- [9] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
- [10] Aaron Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [11] Wes Gurnee, Neel Nanda, Matthew Ratner, et al. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610.
- [12] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking. arXiv preprint arXiv:2412.03556.
- [13] Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. arXiv preprint arXiv:2401.01967.
- [14] Tung-Ling Li and Hongliang Liu. Logit-gap steering: Efficient short-suffix jailbreaks for aligned large language models. arXiv preprint arXiv:2506.24056.
- [15] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- [16] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
- [17] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
- [18] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.
- [19] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- [21] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- [22] Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, Dhruva Karkada, Eric J. Michaud, Berkan Ottlik, and Joseph Turnbull. There will be a scientific theory of deep learning. arXiv preprint arXiv:2604.21691.
- [23] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html, 2024.
- [24] Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.
- [25] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593.
- [26] Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in English? On the latent language of multilingual transformers. arXiv preprint arXiv:2402.10588.
- [27] Chejian Yu, Boyi Wei, Bo Peng, Jiyuan Shi, Zefan Cai, et al. Super weights: The hidden powerhouses of large language models. arXiv preprint arXiv:2411.07191.
- [28] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.
- [29] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson...