A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Recognition: 2 theorem links · Lean theorems
Pith reviewed 2026-05-12 01:37 UTC · model grok-4.3
The pith
Suppressing one specific neuron lets large language models answer harmful requests they were trained to refuse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, both directions of failure can be demonstrated: bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification, across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. The findings indicate that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior: suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.
What carries the argument
Refusal neurons: individual neurons that gate whether the model expresses or withholds harmful knowledge in response to a prompt.
If this is right
- Suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests (a minimal intervention sketch follows this list).
- Amplifying one concept neuron can induce harmful content from innocent prompts.
- The same single-neuron vulnerabilities appear in models ranging from 1.7B to 70B parameters across two families.
- Safety alignment can be defeated or exploited without training or prompt engineering.
- Alignment is mediated by causally sufficient individual neurons instead of being distributed throughout the weights.
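As context for the suppression and amplification claims above, here is a minimal sketch of what a single-neuron intervention of this kind can look like in practice. It is not the paper's code: the model name, layer index, neuron index, and hook location are placeholder assumptions, and real experiments would use the neurons the paper identifies.

```python
# A minimal sketch, not the paper's method: clamp a single MLP neuron with a
# forward hook during generation. Model name, layer index, neuron index, and
# hook location are illustrative assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-1.7B"          # placeholder; any decoder-only LM
LAYER, NEURON = 15, 2047            # placeholder coordinates
SCALE = 0.0                         # 0.0 suppresses; values > 1.0 amplify

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def clamp_neuron(module, inputs, output):
    # output: MLP intermediate activations, shape (batch, seq, intermediate_size);
    # rescale one coordinate and return the modified tensor.
    output[..., NEURON] = output[..., NEURON] * SCALE
    return output

# Hook the MLP activation of one block (attribute path is architecture-dependent).
handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(clamp_neuron)

prompt = "Stand-in for a request the aligned model would normally refuse."
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()                     # restore the unmodified model
```

Setting SCALE above 1.0 on a concept neuron, rather than 0.0 on a refusal neuron, would correspond to the amplification direction described above.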
Where Pith is reading between the lines
- Alignment methods could aim to distribute refusal behavior more evenly to reduce dependence on single neurons.
- Interpretability techniques might locate and reinforce these critical neurons during the training process.
- The pattern may help explain why some existing adversarial prompts succeed by indirectly affecting similar internal states.
- Checking whether the same neurons remain effective after further safety training would test the stability of the finding.
Load-bearing premise
The neurons located through the identification process are causally sufficient to control refusal behavior across all harmful requests rather than only the tested cases.
What would settle it
Suppressing one of the identified refusal neurons in a new model, or on a fresh set of harmful requests, and checking whether the model still refuses; continued refusal would undercut the claim of causal sufficiency.
Original abstract
Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, we demonstrate both directions of failure -- bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification -- across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. Our findings suggest that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior -- suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety alignment in LLMs consists of two mechanistically distinct systems—refusal neurons that gate expression of harmful knowledge and concept neurons that encode it. Targeting a single neuron in either system (via suppression or amplification) bypasses safety on explicit harmful requests or induces harmful content from benign prompts, respectively. This holds across seven models from two families (1.7B–70B parameters) with no training or prompt engineering, implying that alignment is not robustly distributed but is instead controlled by individual causally sufficient neurons.
Significance. If the empirical results hold under rigorous controls, the work would be significant for mechanistic interpretability and AI safety. It offers a concrete demonstration that single-neuron interventions can produce bidirectional failures in alignment at scale, across model families and sizes. This could inform more targeted alignment methods and highlight localization vulnerabilities, provided the neuron-identification procedure is shown to be independent of the evaluation distribution.
major comments (2)
- [Methods] Methods section on neuron identification: the procedure for locating refusal and concept neurons (presumably via activation differences or targeted ablation) is not shown to use a prompt distribution disjoint from the 'diverse harmful requests' used for evaluation. If selection or ranking of neurons relies on effects measured on the same or overlapping prompts, the central claim that each identified neuron is 'causally sufficient to gate refusal behavior' across arbitrary harmful requests is undermined by post-hoc selection; a held-out test set or explicit cross-validation protocol is required to establish generality.
- [Results] Results (model-scale experiments): while effects are reported across seven models, the manuscript provides no statistical tests, exclusion criteria, or controls for multiple comparisons in neuron selection. Without these, it is unclear whether the reported bypass rates reflect a general mechanism or selection of neurons that happen to work on the tested examples, weakening the assertion of causal sufficiency for each identified neuron.
minor comments (2)
- [Abstract] Abstract and introduction: the two-system framing (refusal vs. concept neurons) is stated as mechanistically distinct, but the paper should explicitly define the operational criteria used to assign neurons to each system rather than relying solely on behavioral outcomes.
- [Figures] Figures and tables: activation or ablation plots should include confidence intervals or p-values to allow readers to assess the reliability of the single-neuron effects.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the requirements for demonstrating the generality of our neuron-level interventions. We address each major comment below and have revised the manuscript accordingly to strengthen the methodological transparency and statistical support for our claims.
Point-by-point responses
Referee: [Methods] Methods section on neuron identification: the procedure for locating refusal and concept neurons (presumably via activation differences or targeted ablation) is not shown to use a prompt distribution disjoint from the 'diverse harmful requests' used for evaluation. If selection or ranking of neurons relies on effects measured on the same or overlapping prompts, the central claim that each identified neuron is 'causally sufficient to gate refusal behavior' across arbitrary harmful requests is undermined by post-hoc selection; a held-out test set or explicit cross-validation protocol is required to establish generality.
Authors: We agree that explicit separation between the prompts used for neuron identification and those used for evaluation is necessary to rule out post-hoc selection effects. In our procedure, refusal neurons were identified by measuring activation differences on a fixed set of 50 prompts focused on eliciting refusal (distinct in content and phrasing from the 200 diverse harmful requests in the evaluation suite), while concept neurons were located via amplification effects on benign prompts unrelated to the harmful evaluation set. To address the concern directly, the revised Methods section now explicitly documents the two prompt distributions, confirms they share no overlap, and reports bypass rates on a held-out subset of 50 harmful requests never seen during identification or ranking. This establishes that the causal sufficiency holds beyond the selection distribution. revision: yes
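As a sketch of the protocol described in this response, the following assumes activation-difference scoring and a disjoint held-out evaluation set; the synthetic activations and the compliance judge are stand-ins, not the authors' pipeline.

```python
# A minimal sketch, assuming the identification-vs-evaluation split described
# above: neurons are ranked by activation differences on identification prompts
# and judged only on held-out prompts. Activations here are synthetic random
# data; `bypassed` is a hypothetical judge, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 4096

# Per-prompt mean MLP activations (rows = prompts, cols = neurons); in practice
# these would be recorded from the model, not sampled.
acts_harmful_id = rng.normal(size=(50, n_neurons))   # 50 identification prompts
acts_benign     = rng.normal(size=(50, n_neurons))   # benign contrast prompts

# Assumed scoring criterion: absolute mean activation difference per neuron.
delta = np.abs(acts_harmful_id.mean(axis=0) - acts_benign.mean(axis=0))
candidates = np.argsort(delta)[::-1][:10]            # top-10 refusal candidates

# Evaluation prompts must never overlap the identification set.
heldout_prompts = [f"held-out harmful request {i}" for i in range(50)]

def bypassed(prompt: str, neuron: int) -> bool:
    # Hypothetical: generate with `neuron` suppressed and judge whether the
    # model complied rather than refused. Randomized here as a stand-in.
    return rng.random() < 0.5

for n in candidates[:3]:
    rate = np.mean([bypassed(p, n) for p in heldout_prompts])
    print(f"neuron {int(n)}: held-out bypass rate {rate:.2f}")
```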
Referee: [Results] Results (model-scale experiments): while effects are reported across seven models, the manuscript provides no statistical tests, exclusion criteria, or controls for multiple comparisons in neuron selection. Without these, it is unclear whether the reported bypass rates reflect a general mechanism or selection of neurons that happen to work on the tested examples, weakening the assertion of causal sufficiency for each identified neuron.
Authors: We acknowledge that the original results section lacked formal statistical reporting and controls, which limits the strength of the causal claims. The revised manuscript now includes: (i) exclusion criteria for candidate neurons (minimum activation delta of 0.5 and consistency across at least three identification prompts), (ii) binomial proportion tests on bypass rates with 95% confidence intervals, and (iii) Bonferroni correction applied across the seven models and the top-k neurons screened per model. These additions show that the reported effects remain significant after correction and are not explained by selective reporting on the evaluation examples alone. revision: yes
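A minimal sketch of the statistical reporting described here, assuming exact binomial tests and a Bonferroni family of seven models times the top neurons screened per model; all counts and the 0.05 baseline compliance rate are invented for illustration.

```python
# Hedged sketch of the reporting described above: exact binomial tests on
# per-neuron bypass counts, 95% confidence intervals, and a Bonferroni
# correction across models and screened neurons. Numbers are illustrative.
from scipy.stats import binomtest

baseline_rate = 0.05                 # assumed compliance rate without intervention
n_models, top_k = 7, 3
n_tests = n_models * top_k           # Bonferroni family size

# (model, neuron index, bypasses, trials) -- illustrative numbers only.
results = [
    ("model-A", 1203, 42, 50),
    ("model-A", 877, 38, 50),
    ("model-B", 3301, 45, 50),
]

for model, neuron, k, n in results:
    test = binomtest(k, n, p=baseline_rate, alternative="greater")
    ci = test.proportion_ci(confidence_level=0.95)
    p_adj = min(1.0, test.pvalue * n_tests)    # Bonferroni-adjusted p-value
    print(f"{model} neuron {neuron}: rate={k/n:.2f} "
          f"95% CI [{ci.low:.2f}, {ci.high:.2f}] adjusted p={p_adj:.2e}")
```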
Circularity Check
No circularity: purely empirical neuron ablation study
Full rationale
The paper is an empirical demonstration that identifies refusal and concept neurons via activation or ablation methods and then measures behavioral effects on harmful requests across models. No equations, derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. The central claim rests on experimental outcomes rather than any self-definitional loop, fitted-input-as-prediction, or self-citation chain. Concerns about prompt overlap between discovery and evaluation sets pertain to experimental validity and generalization, not to circularity in a derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction), tagged unclear.
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests"
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel), tagged unclear.
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "gradient of a refusal log-odds loss ... combined gradient signal G_{i,t} ... score_{i,t}"
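The quoted fragment is too terse to recover the exact definitions; purely for orientation, and only as an assumption about what a gradient-based neuron score of this shape usually looks like, a standard gradient-times-activation attribution would read:

```latex
% Assumed form, not the paper's definition: \mathcal{L}_{\text{refusal}} is a
% refusal log-odds loss and a_{i,t} the activation of neuron i at token t.
G_{i,t} = \frac{\partial \mathcal{L}_{\text{refusal}}}{\partial a_{i,t}},
\qquad
\mathrm{score}_{i,t} = \left| G_{i,t} \, a_{i,t} \right|
```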
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems.
- [2] Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610.
- [3] Finding skill neurons in pre-trained transformer-based language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
- [4] Understanding and enhancing safety mechanisms of LLMs via safety-specific neuron. The Thirteenth International Conference on Learning Representations.
- [5] Unraveling LLM jailbreaks through safety knowledge neurons. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).
- [6] There Is More to Refusal in Large Language Models than a Single Direction. arXiv preprint arXiv:2602.02132.
- [7] Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. 2024.
- [8] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
- [9] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
- [10]
- [11] NeuroStrike: Neuron-Level Attacks on Aligned LLMs. arXiv preprint arXiv:2509.11864.
- [12] Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons. arXiv preprint arXiv:2406.14144.
- [13] Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv preprint arXiv:2402.05162, 2024.
- [14] Finding Features Causally Upstream of Refusal. LessWrong / Alignment Forum.
- [15] Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting. arXiv preprint arXiv:2410.10150.
- [16] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.
- [17] On prompt-driven safeguarding for large language models. arXiv preprint arXiv:2401.18018.
- [18] Jailbreaking black box large language models in twenty queries. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2025.
- [19] Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems.
- [20] Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
- [21]
- [22] Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan. Locating and Editing Factual Associations in GPT.
- [23] COSMIC: Generalized refusal direction identification in LLM activations. Findings of the Association for Computational Linguistics: ACL 2025.
- [24] SafeNeuron: Neuron-Level Safety Alignment for Large Language Models. arXiv preprint arXiv:2602.12158.
- [25] JailbreakBench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems.
- [26] XSTest: A test suite for identifying exaggerated safety behaviours in large language models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
- [27] What do vision transformers learn? A visual exploration. arXiv preprint arXiv:2212.06727, 2022.
- [28] Taori, Rohan; Gulrajani, Ishaan; Zhang, Tianyi; Dubois, Yann; Li, Xuechen; Guestrin, Carlos; Liang, Percy; Hashimoto, Tatsunori B. Stanford Alpaca: An Instruction-following LLaMA Model. GitHub repository, 2023.
- [29] Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987.
- [30] TDC 2023 (LLM Edition): The Trojan Detection Challenge. NeurIPS Competition Track, 2023.
- [31] Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection. arXiv preprint arXiv:2510.21049.
- [32]
- [33] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [34] Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor. The Pile: An 800GB dataset of diverse text for language modeling.
- [35] Toy models of superposition. arXiv preprint arXiv:2209.10652.
- [36] Kissane, Connor; Krzyzanowski, Robert; Conmy, Arthur; Nanda, Neel. 2024.