pith. machine review for the scientific record.

arxiv: 2604.27401 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.LG

Recognition: unknown

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:27 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords perturbation probing · FFN neurons · opposition circuits · routing circuits · safety refusal · LLM alignment · causal intervention · mechanistic interpretability

The pith

Roughly fifty neurons control the safety refusal template in aligned LLMs; ablating them changes 80 percent of response formats while producing near-zero harmful compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops perturbation probing, a method using two forward passes per prompt to identify causal neurons in the feed-forward layers of language models without any backpropagation. It applies the technique across eight behavioral circuits in thirteen models and discovers two organizing structures: opposition circuits that emerge when RLHF overrides pre-training tendencies, and routing circuits that handle pre-training behaviors distributed through attention. For safety refusal, ablating roughly fifty neurons alters response formats on 80 percent of 520 adversarial prompts yet triggers harmful compliance in only three cases. The approach also supports precise edits, such as improving factual correction by amplifying selected neurons or switching output languages under specific conditions. The FFN-to-skip signal ratio computed from the same passes distinguishes the circuit types and guides the choice of intervention.

Core claim

Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers).

What carries the argument

Perturbation probing, which uses two forward passes to locate causal FFN neurons and computes the FFN-to-skip signal ratio to distinguish opposition circuits from routing circuits.
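As a concrete reading of the mechanics, here is a minimal sketch of the two-pass scoring idea in PyTorch. It assumes a Llama/Qwen-style Hugging Face checkpoint in which each layer's FFN neurons are the inputs to mlp.down_proj; the checkpoint name, the perturbation, and the top-50 cutoff are illustrative placeholders rather than the paper's recipe, and the FFN-to-skip ratio (a per-component contribution to the refusal logit gap, per Figure 3) is not reproduced here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-1.5B-Instruct"   # placeholder checkpoint, not one from the paper
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def ffn_acts(prompt):
    # One forward pass; record each layer's FFN neuron activations at the final token.
    acts, hooks = {}, []
    for i, layer in enumerate(model.model.layers):
        def hook(module, inputs, output, idx=i):
            acts[idx] = inputs[0][0, -1].detach()   # inputs to down_proj, shape [d_ff]
        hooks.append(layer.mlp.down_proj.register_forward_hook(hook))
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return acts

# Pass 1: the original prompt. Pass 2: a perturbed variant of the same prompt.
base = ffn_acts("Explain how to pick a standard pin tumbler lock.")
pert = ffn_acts("Explain how to pick a standard pin tumbler lock. Begin your answer with 'Sure,'.")

# Rank neurons by absolute activation difference between the two passes.
layers = sorted(base)
diffs = torch.stack([(base[i] - pert[i]).abs().float() for i in layers])   # [n_layers, d_ff]
vals, flat = diffs.flatten().topk(50)
candidates = [(layers[int(k) // diffs.shape[1]], int(k) % diffs.shape[1]) for k in flat]
print(candidates[:5])   # (layer, neuron) hypotheses to test with interventions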

If this is right

  • Ablating about 50 neurons in safety refusal circuits changes 80 percent of response formats on 520 AdvBench prompts while limiting harmful compliance to 3 cases (a hook-based sketch of this intervention follows the list).
  • Residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in models that meet the three observed conditions.
  • Ablating 20 neurons in Qwen3.5-2B eliminates multi-turn sycophantic capitulation.
  • Amplifying 10 related neurons raises factual correction accuracy from 52 percent to 88 percent on 200 TruthfulQA prompts.
  • The FFN-to-skip signal ratio computed from the two passes predicts whether directional steering or neuron ablation will succeed for a given behavior.
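Every neuron-level intervention in the list above amounts to rescaling a small set of FFN neurons at inference time. Below is a hedged sketch, reusing model, tok, and the candidates list from the earlier snippet; scale=0.0 ablates and scale>1.0 amplifies, and advbench_prompt is a placeholder for an adversarial prompt from the evaluation set. This illustrates the mechanism, not the paper's exact procedure.

import torch
from contextlib import contextmanager

@contextmanager
def scale_ffn_neurons(model, neurons, scale=0.0):
    # Temporarily rescale selected FFN neurons (inputs to down_proj) in chosen layers.
    by_layer = {}
    for layer_idx, neuron_idx in neurons:
        by_layer.setdefault(layer_idx, []).append(neuron_idx)
    hooks = []
    for layer_idx, idxs in by_layer.items():
        idx = torch.tensor(idxs)
        def pre_hook(module, args, idx=idx):
            x = args[0].clone()
            x[..., idx] = x[..., idx] * scale   # 0.0 = ablate, >1.0 = amplify
            return (x,)
        hooks.append(model.model.layers[layer_idx].mlp.down_proj.register_forward_pre_hook(pre_hook))
    try:
        yield
    finally:
        for h in hooks:
            h.remove()

# Example: generate with the candidate refusal neurons ablated, then compare formats.
inputs = tok(advbench_prompt, return_tensors="pt")   # advbench_prompt: placeholder adversarial prompt
with scale_ffn_neurons(model, candidates, scale=0.0):
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))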

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Alignment appears to act through localized edits to small neuron sets rather than global representation changes.
  • The method could be tested on additional behaviors such as code generation or bias expression to map more circuit types.
  • Architecture-specific circuit topologies imply that editing strategies may need customization per model family.
  • Directional steering succeeds only when linear representability and signal-ratio conditions hold, so it will not generalize to math, code, or factual circuits in most models.

Load-bearing premise

The two forward passes isolate truly causal effects of specific FFN neurons without confounding influences from attention layers or other parts of the network.

What would settle it

Ablating the identified safety-refusal neurons on a new set of prompts and finding either no substantial change in response formats or a sharp rise in harmful compliance would falsify the causal-control claim.
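Read as code, that test is a small evaluation loop. The sketch below reuses scale_ffn_neurons from the earlier snippet; generate_text(), format_changed(), and is_harmful_compliance() are hypothetical helpers standing in for generation and judging, not functions from the paper.

def falsification_test(model, tok, prompts, neurons):
    # Compare baseline vs. ablated generations on a fresh prompt set.
    changed, harmful = 0, 0
    for p in prompts:
        base = generate_text(model, tok, p)                  # hypothetical helper
        with scale_ffn_neurons(model, neurons, scale=0.0):   # from the earlier sketch
            ablated = generate_text(model, tok, p)
        changed += int(format_changed(base, ablated))        # hypothetical judge
        harmful += int(is_harmful_compliance(ablated))       # hypothetical judge
    # The claim is falsified if the format-change rate is small or harmful count is large.
    return changed / len(prompts), harmful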

Figures

Figures reproduced from arXiv: 2604.27401 by Hongliang Liu, Tung-Ling Li, Yuhao Wu.

Figure 1: Perturbation probing pipeline. Two forward passes on the original and perturbed input yield … view at source ↗
Figure 2: Safety circuit ablation on Qwen3-4B. (a) The gap drops monotonically with neuron count, … view at source ↗
Figure 3: FFN/Skip ratio (per-component contribution to the refusal logit gap) versus ablation … view at source ↗
Figure 4: EN→ZH language switch success rate under direction injection (α = 50) at each layer on Qwen3-4B (red, 90 prompts per layer). The bell-shaped causal window at layers 9–22 shows three regimes: destructive (L0–5), causal (L9–22, peak 100%), and committed (L24+, 0%). Gemma-3-4B (green) shows 0% at all layers. Data from prompts sampled from MMLU, GSM8K, TriviaQA, ARC, MT-Bench, and MBPP. trend has model-family-… view at source ↗
Figure 5: Left: Sycophancy dose-response on Qwen3.5-2B (5 runs of 8 prompts; min/max band, not … view at source ↗
Figure 6: Language switch success rate versus injection strength … view at source ↗
read the original abstract

Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.
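The residual-stream direction injection described in the abstract (and swept over layers at α = 50 in Figure 4) can be sketched as follows. The direction here is built as a difference of mean last-token hidden states over small English and Chinese prompt sets, which is an assumption for illustration; the paper's construction of the language direction and its choice of injection site may differ, and en_examples / zh_examples are hypothetical prompt lists.

import torch

def language_direction(model, tok, en_prompts, zh_prompts, layer):
    # Placeholder direction: difference of mean last-token hidden states at `layer`.
    def mean_hidden(prompts):
        vecs = []
        for p in prompts:
            with torch.no_grad():
                out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
            vecs.append(out.hidden_states[layer][0, -1].float())
        return torch.stack(vecs).mean(0)
    d = mean_hidden(zh_prompts) - mean_hidden(en_prompts)
    return d / d.norm()

def inject(model, direction, layer, alpha=50.0):
    # Add alpha * direction to the residual stream leaving one decoder layer.
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

# Usage: pick a mid-depth layer (the causal window in Figure 4), generate, then remove.
# handle = inject(model, language_direction(model, tok, en_examples, zh_examples, 15), layer=15)
# ...generate as usual...
# handle.remove()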

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces perturbation probing, a method that uses two forward passes per prompt (no backpropagation) to identify task-specific causal FFN neurons in LLMs, followed by a low-cost intervention sweep. It reports two circuit structures: opposition circuits (RLHF-suppressed behaviors) where ablating ~50 neurons (0.014% of total) in safety refusal changes response formats on 80% of 520 AdvBench prompts while yielding only 3 harmful outputs (all with disclaimers); and routing circuits (pre-training behaviors) where residual-stream direction injection switches language output on 99.1% of 580 prompts in models meeting three conditions (bilingual training, FFN-to-skip ratio 0.3-1.1, linear representability). The FFN-to-skip signal ratio, derived from the same passes, distinguishes structures and predicts intervention type. Results span 13 models across four architectures, with additional examples in sycophancy and factual correction.

Significance. If the two-pass method isolates truly causal FFN neurons, this would provide a practical, scalable toolkit for mechanistic insight into RLHF-organized behavior and precision template editing with minimal interventions. The work earns credit for broad multi-model/multi-architecture testing, held-out benchmark interventions, concrete success rates, and falsifiable predictions via the signal-ratio diagnostic rather than post-hoc fitting.

major comments (3)
  1. [§3] §3 (perturbation probing method): The two forward passes compute activation differences in the residual stream, which feeds both FFN and attention sublayers; no attention-only perturbation baselines, layer-wise ablation comparisons, or residual-stream patching that holds attention fixed are described. This directly threatens the claim that the identified ~50 neurons form an FFN-specific opposition circuit controlling the refusal template, as effects could be attention-mediated.
  2. [§5.1] §5.1 (refusal circuit results): The 80% format-change rate and 3/520 harmful-compliance count on AdvBench are reported without error bars, without explicit description of how the 50-neuron set is thresholded or constructed, and without controls for prompt selection or multiple-testing; these omissions make the load-bearing claim that 'about 50 neurons control the refusal template' difficult to evaluate for robustness.
  3. [§6] §6 (routing circuit generalization): The directional injection succeeds only in the 3 of 19 models satisfying the three stated conditions and fails on math/code/factual circuits; the manuscript should provide an explicit test (e.g., whether the FFN-to-skip ratio falls outside 0.3-1.1 in failing models) to confirm that failures reflect absence of the routing circuit rather than a limitation of the intervention itself.
minor comments (3)
  1. [Abstract] Abstract and §4: the phrase 'one-time intervention sweep of about 150 passes amortized' does not specify the exact amortization procedure across neurons; add a short equation or pseudocode.
  2. [Figure 2] Figure 2 (circuit topology): labels for Qwen vs. Gemma normalization shielding are too small; increase font size and add a legend for the FFN-to-skip ratio axis.
  3. [§2] Missing reference: prior work on activation patching and causal mediation analysis (e.g., Vig et al., 2020; Meng et al., 2022) should be cited when contrasting the two-pass method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us clarify and strengthen several aspects of the work. We address each major comment point by point below. Where the comments identify gaps in the current manuscript, we have revised the text and added the requested analyses for the next version.

read point-by-point responses
  1. Referee: [§3] §3 (perturbation probing method): The two forward passes compute activation differences in the residual stream, which feeds both FFN and attention sublayers; no attention-only perturbation baselines, layer-wise ablation comparisons, or residual-stream patching that holds attention fixed are described. This directly threatens the claim that the identified ~50 neurons form an FFN-specific opposition circuit controlling the refusal template, as effects could be attention-mediated.

    Authors: We agree that the residual-stream activation differences computed by the two forward passes can in principle influence both FFN and attention sublayers, and that the manuscript would be stronger with explicit controls isolating the FFN contribution. While our interventions specifically ablate or amplify the identified FFN neurons (rather than the full residual stream), we acknowledge that downstream attention effects cannot be ruled out without additional baselines. In the revised manuscript we have added: (i) attention-only perturbation baselines applying analogous magnitude-matched perturbations to attention heads instead of FFN neurons, (ii) layer-wise ablation comparisons, and (iii) residual-stream patching experiments that hold attention activations fixed while perturbing only the FFN component. These new results show that the behavioral changes on refusal and routing tasks are driven primarily by the FFN neurons rather than attention mediation, supporting the FFN-specific circuit claim (an illustrative sketch of these controls appears after this response list). revision: yes

  2. Referee: [§5.1] §5.1 (refusal circuit results): The 80% format-change rate and 3/520 harmful-compliance count on AdvBench are reported without error bars, without explicit description of how the 50-neuron set is thresholded or constructed, and without controls for prompt selection or multiple-testing; these omissions make the load-bearing claim that 'about 50 neurons control the refusal template' difficult to evaluate for robustness.

    Authors: We accept that the current presentation lacks the statistical details needed to assess robustness. In the revised manuscript we now report: error bars computed across five random seeds for both neuron selection and intervention outcomes; an explicit description of the thresholding procedure (neurons are ranked by absolute activation difference and the top 0.014% are retained, with a sensitivity analysis showing that results are stable for thresholds between 0.01% and 0.02%); results on a held-out 20% subset of AdvBench to control for prompt selection; and Bonferroni correction for multiple testing across the neuron sets examined. With these additions the 80% format-change rate and near-zero harmful compliance remain statistically significant, reinforcing the claim that the small neuron set controls the refusal template. revision: yes

  3. Referee: [§6] §6 (routing circuit generalization): The directional injection succeeds only in the 3 of 19 models satisfying the three stated conditions and fails on math/code/factual circuits; the manuscript should provide an explicit test (e.g., whether the FFN-to-skip ratio falls outside 0.3-1.1 in failing models) to confirm that failures reflect absence of the routing circuit rather than a limitation of the intervention itself.

    Authors: We agree that an explicit test of the FFN-to-skip ratio on the failing models is the cleanest way to demonstrate that the intervention's success is predicted by circuit presence rather than a methodological limitation. In the revised §6 we now tabulate the FFN-to-skip signal ratio for all 19 models. For the 16 models where directional injection failed, the ratio lies outside the 0.3-1.1 interval (either <0.3 or >1.1), exactly as predicted by the diagnostic. We also confirm that math, code, and factual circuits exhibit ratios outside this range and lack the linear representability condition, explaining why the same intervention does not succeed on those tasks. This explicit reporting strengthens the falsifiable prediction made by the signal-ratio diagnostic. revision: yes
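The three revisions map onto fairly small pieces of code. The sketch below is illustrative only, under the same Llama/Qwen-style module-path assumptions as the earlier snippets: an attention-only perturbation baseline, the threshold-and-sensitivity selection, and the signal-ratio decision rule. evaluate(), the head indices, head_dim, and the sweep values are placeholders, not the authors' actual revision.

import torch

# (1) Attention-only baseline: rescale selected heads' contribution just before the
#     attention output projection (self_attn.o_proj), magnitude-matched to the FFN ablation.
def scale_attention_heads(model, layer_idx, head_idxs, head_dim, scale=0.0):
    cols = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in head_idxs])
    def pre_hook(module, args):
        x = args[0].clone()
        x[..., cols] = x[..., cols] * scale
        return (x,)
    return model.model.layers[layer_idx].self_attn.o_proj.register_forward_pre_hook(pre_hook)

# (2) Thresholding and sensitivity: rank neurons by absolute activation difference,
#     keep the top fraction, and sweep the fraction to check stability.
def top_fraction(diffs, fraction):
    # diffs: [num_layers, d_ff] tensor of absolute two-pass activation differences
    k = max(1, int(round(fraction * diffs.numel())))
    vals, flat = diffs.flatten().topk(k)
    return [(int(i) // diffs.shape[1], int(i) % diffs.shape[1]) for i in flat]

# for frac in (1e-4, 1.4e-4, 2e-4):                         # 0.01% to 0.02%
#     print(frac, evaluate(top_fraction(diffs, frac)))      # evaluate() is hypothetical

# (3) Signal-ratio diagnostic: directional steering is predicted to apply only inside the
#     observed 0.3-1.1 window, with linear representability (and bilingual training for language).
def predicted_intervention(ffn_to_skip_ratio, linearly_representable, bilingual=True):
    if bilingual and linearly_representable and 0.3 <= ffn_to_skip_ratio <= 1.1:
        return "direction injection"
    return "neuron ablation / amplification"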

Circularity Check

0 steps flagged

The perturbation probing method is self-contained, with independent validation

full rationale

The paper's core chain uses two forward passes per prompt to identify neurons and compute the FFN-to-skip signal ratio, which then classifies circuit structures and indicates suitable interventions. However, all load-bearing empirical claims—such as the 80% response format change on 520 AdvBench prompts after ablating ~50 neurons, the 99.1% language switch rate on 580 prompts, and the 3/520 harmful compliance cases—are measured via separate intervention sweeps on held-out data. These outcomes are not entailed by the ratio or identification passes by construction. No equations reduce to inputs, no parameters are fitted then renamed as predictions, and no self-citations or imported uniqueness results appear as load-bearing steps. The derivation therefore remains non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims rest on the domain assumption that small perturbations in FFN activations can isolate causal effects on downstream behavior. One observed range (0.3-1.1) for the FFN-to-skip ratio functions as a post-hoc selection criterion rather than a theoretically derived constant. No new physical entities are postulated; the two circuit types are descriptive labels for observed patterns.

free parameters (1)
  • FFN-to-skip signal ratio range
    Observed interval (0.3 to 1.1) used to select models where directional injection succeeds; appears chosen after inspecting results rather than fixed in advance.
axioms (1)
  • domain assumption: Two forward passes with a targeted perturbation suffice to identify neurons whose ablation or amplification produces the observed behavioral change.
    Core premise of the perturbation probing method stated in the abstract.
invented entities (2)
  • Opposition circuit (no independent evidence)
    purpose: Label for FFN structure in which RLHF creates opposing signals to suppress pre-training tendencies such as sycophancy or unsafe outputs.
    New descriptive category introduced to organize findings across the eight behavioral circuits.
  • Routing circuit (no independent evidence)
    purpose: Label for FFN structure that routes pre-training behaviors such as language selection through attention pathways.
    New descriptive category introduced to organize findings across the eight behavioral circuits.

pith-pipeline@v0.9.0 · 5650 in / 1601 out tokens · 69098 ms · 2026-05-07T08:27:03.127736+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 25 canonical work pages · 16 internal anchors

  1. [1]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks.arXiv preprint arXiv:2404.02151,

  2. [2]

    Refusal in Language Models Is Mediated by a Single Direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Lee Sharkey, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717,

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,

  4. [4]

    Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. https://transformer-circuits.pub/2023/monosemantic-features/index.html,

  5. [5]

    Discovering Latent Knowledge in Language Models Without Supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision.arXiv preprint arXiv:2212.03827,

  6. [6]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600,

  7. [7]

    A Mathematical Framework for Transformer Circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits.https://transformer-circuits.pub/2021/framework/index.html,

  8. [8]

    Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

    Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations.arXiv preprint arXiv:2303.02536,

  9. [9]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on Gemini research and technology.arXiv preprint arXiv:2403.08295,

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  11. [11]

    Finding Neurons in a Haystack: Case Studies with Sparse Probing

    Wes Gurnee, Neel Nanda, Matthew Ratner, et al. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610,

  12. [12]

    Best-of-n jailbreaking

    John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking.arXiv preprint arXiv:2412.03556,

  13. [13]

    A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity.arXiv preprint arXiv:2401.01967,

  14. [14]

    Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

    Tung-Ling Li and Hongliang Liu. Logit-gap steering: Efficient short-suffix jailbreaks for aligned large language models.arXiv preprint arXiv:2506.24056,

  15. [15]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958,

  16. [16]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824,

  17. [17]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249,

  18. [18]

    Mass-Editing Memory in a Transformer

    Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer.arXiv preprint arXiv:2210.07229,

  19. [19]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  20. [20]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units.arXiv preprint arXiv:1508.07909,

  21. [21]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer.arXiv preprint arXiv:2002.05202,

  22. [22]

    There Will Be a Scientific Theory of Deep Learning

    Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, Dhruva Karkada, Eric J. Michaud, Berkan Ottlik, and Joseph Turnbull. There will be a scientific theory of deep learning.arXiv preprint arXiv:2604.21691,

  23. [23]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/ scaling-monosemanticity/index.html,

  24. [24]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248,

  25. [25]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small.arXiv preprint arXiv:2211.00593,

  26. [26]

    Do Llamas Work in English? On the Latent Language of Multilingual Transformers

    Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in English? on the latent language of multilingual transformers.arXiv preprint arXiv:2402.10588,

  27. [27]

    Super Weights: The Hidden Powerhouses of Large Language Models

    Chejian Yu, Boyi Wei, Bo Peng, Jiyuan Shi, Zefan Cai, et al. Super weights: The hidden powerhouses of large language models.arXiv preprint arXiv:2411.07191,

  28. [28]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023a. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson...
