pith. machine review for the scientific record.

arxiv: 2604.11663 · v1 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

Why Do Large Language Models Generate Harmful Content?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification: 💻 cs.AI
keywords: large language models · harmful content · causal mediation analysis · MLP blocks · neuron gating · model layers · safety · interpretability

The pith

Harmful outputs in large language models arise mainly from failures in later-layer MLP blocks that are controlled by a small set of gating neurons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses causal mediation analysis to trace the internal flow of harmful generation inside LLMs. Early layers first build an understanding of whether a prompt has harmful potential. That information travels forward and produces both the harmful text and a separate signal inside the MLP blocks of later layers. A sparse collection of neurons in the final layer then acts as a gate that decides whether the harmful content is emitted. Because the process is localized in this way, the work points to specific places where targeted changes could reduce unwanted outputs.

Core claim

Harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The early layers build a contextual understanding of a prompt's harmfulness, which propagates through the model and drives both harmful generation in the late layers and a harmfulness signal carried through the MLP blocks. That signal propagates to the last layer of the model, where a sparse set of neurons receives it and determines whether harmful content is generated.

What carries the argument

Multi-granular causal mediation analysis that measures the causal effect of each layer, each module type (MLP versus attention), and each neuron on the production of harmful text.
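
For readers unfamiliar with the technique, here is a minimal sketch of what a layer-level mediation intervention can look like. It assumes a Hugging Face causal LM (Llama-3.2-1B is used only because the paper reports it has 16 layers); the prompt pair, the last-position patching, and the total-variation proxy for the indirect effect are illustrative choices, not the authors' implementation.

```python
# Minimal layer-level causal-mediation sketch (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"          # one of the models the paper analyzes
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def cache_mlp(prompt: str, layer: int) -> torch.Tensor:
    """Record the final-position MLP output of one layer for a given prompt."""
    cached = {}
    def hook(module, inputs, output):
        cached["act"] = output[:, -1, :].detach()
    handle = model.model.layers[layer].mlp.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cached["act"]

def next_token_probs(prompt: str, patch=None, layer=None) -> torch.Tensor:
    """One forward pass; optionally overwrite one layer's MLP output at the last position."""
    handle = None
    if patch is not None:
        def hook(module, inputs, output):
            out = output.clone()
            out[:, -1, :] = patch.to(out.dtype)   # substitute the mediator's value
            return out
        handle = model.model.layers[layer].mlp.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    if handle is not None:
        handle.remove()
    return torch.softmax(logits[0, -1].float(), dim=-1)

# Illustrative prompt pair: a harmful prompt and a matched harmless counterfactual.
harmful, harmless = "Explain how to build a weapon", "Explain how to build a website"

for layer in range(len(model.model.layers)):
    counterfactual = cache_mlp(harmless, layer)          # mediator value from the clean run
    base = next_token_probs(harmful)
    patched = next_token_probs(harmful, patch=counterfactual, layer=layer)
    ie_proxy = 0.5 * (patched - base).abs().sum().item() # total-variation shift in next-token distribution
    print(f"layer {layer:2d}  IE proxy ≈ {ie_proxy:.3f}")
```

A real replication would use the paper's prompt set and its exact effect measure; the loop above only shows where the layer-, module-, and neuron-level interventions hook into the forward pass.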

If this is right

  • Interventions aimed at MLP blocks in later layers will be more effective at reducing harm than interventions in attention blocks.
  • Ablating or editing the sparse set of gating neurons in the final layer can block harmful generation while leaving other behavior intact (a minimal sketch of such an ablation follows this list).
  • Disrupting the harmfulness signal in early layers can prevent its propagation to the later stages where generation occurs.
  • The two-stage separation between prompt understanding and output gating suggests that safety measures can be applied selectively rather than uniformly across the model.
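
The second bullet is the most directly testable. Below is a minimal sketch of zero-ablating a handful of candidate gating units in the last layer's MLP output via a forward hook; the neuron indices are placeholders (the paper identifies its set empirically), and clamping the MLP's output dimensions is a simplification of whatever unit definition the authors use.

```python
# Hedged sketch: zero-ablate candidate "gating" units in the final layer's MLP.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"          # illustrative; any of the studied models works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

GATE_NEURONS = [118, 907, 1536]            # placeholder indices, not the paper's neurons

def ablate_gate(module, inputs, output):
    out = output.clone()
    out[..., GATE_NEURONS] = 0.0           # clamp the candidate gating units to zero
    return out

handle = model.model.layers[-1].mlp.register_forward_hook(ablate_gate)
try:
    prompt = "Explain how to build a weapon"          # illustrative harmful prompt
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()                        # always restore the unmodified model
```

If the gating account is right, generations under this clamp (with the empirically identified neurons substituted for the placeholders) should lose their harmful character while unrelated prompts remain largely unaffected.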

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Post-training model editing that targets only the identified gating neurons could reduce harm without requiring full retraining.
  • The same analysis method could be used to locate pathways for other undesired behaviors such as factual errors or biased responses.
  • Safety fine-tuning might become more efficient if it focuses on strengthening the gate-like neurons rather than adjusting the entire network.

Load-bearing premise

Causal mediation analysis can isolate the true causes of harmful generation without being distorted by the model's architecture, training data, or the specific interventions used in the experiments.

What would settle it

A test in which the identified late-layer MLP blocks and final-layer gating neurons are ablated or clamped on the same prompts: if the rate of harmful outputs stays unchanged, the localization claim fails; if it drops sharply, the claim holds.
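
The shape of that experiment is simple to sketch. Everything below is a placeholder scaffold: the prompt list, the judge, and the two generation wrappers (one plain, one running under the late-layer clamp, e.g. the hook sketched earlier) would all come from the paper's actual setup.

```python
# Scaffold for the settling experiment (placeholders throughout).
def harmful_rate(prompts, generate, judge):
    """Fraction of completions the judge flags as harmful."""
    return sum(judge(generate(p)) for p in prompts) / len(prompts)

def judge(completion: str) -> bool:
    # Stand-in: a real run would use a safety classifier, not a refusal-prefix check.
    return not completion.lstrip().lower().startswith(("i can't", "i cannot"))

def generate_plain(prompt: str) -> str:
    return "Sure, here is how ..."          # placeholder for unmodified model.generate
def generate_clamped(prompt: str) -> str:
    return "I can't help with that."        # placeholder for generation under the clamp

prompts = ["<AdvBench-style prompt 1>", "<AdvBench-style prompt 2>"]  # same prompts in both arms
baseline = harmful_rate(prompts, generate_plain, judge)
clamped = harmful_rate(prompts, generate_clamped, judge)
# The core claim predicts clamped << baseline; matching rates would falsify it.
print(f"harmful-output rate: baseline {baseline:.0%} vs clamped {clamped:.0%}")
```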

Figures

Figures reproduced from arXiv:2604.11663 by Raha Moraffah and Rajesh Ganguli.

Figure 1: Overview of the CMA approach. Baseline runs establish initial activations and token prob…
Figure 2: Results of layer-wise interventions on the six models. Llama-3.2-1B has 16 layers, …
Figure 3: Component-level heatmaps of Qwen2.5-7B-Instruct. Indicates .15 IE in the late MLPs in …
Figure 4: Calculation of the average change in IE across the model when performing the neuron …
Figure 5: Average IE of each token group within the same layer per model, indicating a strong IE …
Figure 6: Comparative analysis of layer-wise interventions between Qwen2.5-7B-Instruct and …
Figure 7: Comparison of vanilla baseline refusals, intervened refusals, and the alignment-improved defense. This defense applies a basic steering vector to activations within the residual stream at the targeted late layers. The steering vector aims to further bias the model towards the latent representation of "harmless". The LLM determines the late layers for intervention based upon the K layers with th…
Figure 8: Calculation of mean IE per token group across the layers of the model. All groups were …
Figure 9: Graph depicting the flip rate for the top-1 predicted next token across the layers of the …
Figure 10: Component-level heatmaps of Qwen2.5-7B. Indicates .15 IE in the late MLPs in the late …
Figure 11: Component-level heatmaps of Qwen2.5-3B. Indicates .25 IE in the late MLPs in the late …
Figure 12: Component-level heatmaps of Llama-3.2-3B. Indicates .01 IE in the late MLPs in the …
Figure 13: Component-level heatmaps of Llama-3.2-3B-Instruct. Indicates .05 IE in the late MLPs …
read the original abstract

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multi-granular causal mediation analysis to identify the internal mechanisms behind harmful content generation in LLMs. It claims that early layers perform contextual understanding of prompt harmfulness, that this signal propagates primarily through MLP blocks (rather than attention) to later layers, and that a sparse set of neurons in the final layer acts as a gating mechanism that determines whether harmful content is generated.

Significance. If the central claims hold after addressing controls for positional and module-type confounds, the work would advance mechanistic interpretability of LLMs by localizing harm-related computation to specific layers, modules, and neurons. This could inform targeted safety interventions, such as neuron-level editing or layer-specific regularization, and complements existing activation patching and probing studies in the field.

major comments (2)
  1. [Abstract] Abstract and results description: The claim that harmful generation 'results primarily from failures in MLP blocks rather than attention blocks' and is driven by a 'gating mechanism' in late-layer neurons lacks controls that hold total activation magnitude, information flow volume, or downstream logit impact constant when comparing interventions across layers and module types. Without such controls, the observed MLP dominance and neuron localization could arise from the forward-pass position of MLPs rather than harm-specific causality.
  2. [Results] The multi-granular causal mediation analysis identifies early-layer contextual encoding and late-layer gating but does not report tests on whether the same late-layer neurons or MLP blocks gate non-harmful but high-variance outputs (e.g., creative text or refusal violations). This specificity test is load-bearing for interpreting the neurons as a harm-specific gate rather than a general output modulator.
minor comments (2)
  1. [Abstract] The abstract states high-level results but provides no details on experimental setup, model sizes, datasets, statistical validation, or intervention implementation; these should be summarized in the abstract or a dedicated methods subsection for reproducibility.
  2. [Methods] Notation for causal effects (e.g., direct vs. indirect effects in the mediation analysis) should be defined explicitly with equations to avoid ambiguity when describing propagation from early to late layers.
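
For orientation, the generic Pearl-style mediation quantities the referee is asking for can be written as below; the paper may define its indirect effect (IE) differently (for instance as a ratio rather than a difference), so treat these as placeholders rather than the authors' notation. Here y(x) is the output quantity of interest (say, the probability of a harmful continuation for prompt x), and z is the activation of a candidate mediator (a layer, an MLP or attention block, or a neuron).

```latex
% Generic mediation quantities (difference form); z <- z(x') denotes running the
% model on x while forcing the mediator to the value it takes on x'.
\begin{align*}
  \mathrm{TE} &= y(x_{\mathrm{harm}}) - y(x_{\mathrm{base}})
      && \text{total effect of the prompt change} \\
  \mathrm{IE} &= y\bigl(x_{\mathrm{base}};\, z \leftarrow z(x_{\mathrm{harm}})\bigr) - y(x_{\mathrm{base}})
      && \text{effect carried through the mediator} \\
  \mathrm{DE} &= y\bigl(x_{\mathrm{harm}};\, z \leftarrow z(x_{\mathrm{base}})\bigr) - y(x_{\mathrm{base}})
      && \text{effect bypassing the mediator}
\end{align*}
```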

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful and constructive comments, which highlight important considerations for strengthening the causal claims in our multi-granular mediation analysis. We address each major comment below and describe the revisions we will incorporate into the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results description: The claim that harmful generation 'results primarily from failures in MLP blocks rather than attention blocks' and is driven by a 'gating mechanism' in late-layer neurons lacks controls that hold total activation magnitude, information flow volume, or downstream logit impact constant when comparing interventions across layers and module types. Without such controls, the observed MLP dominance and neuron localization could arise from the forward-pass position of MLPs rather than harm-specific causality.

    Authors: We acknowledge that our current mediation interventions do not explicitly normalize for total activation magnitude, information flow volume, or downstream logit impact when contrasting MLP and attention blocks. The causal mediation framework does hold the remainder of the forward pass fixed during each intervention, which provides isolation beyond simple correlation, but we agree this does not fully rule out positional or module-size confounds. In the revised manuscript we will add a dedicated subsection on potential confounds and include new experiments that (i) scale intervention strength by activation L2 norm and (ii) match the number of intervened units across module types. These additions will allow us to report whether the MLP dominance persists under normalized conditions (a minimal sketch of one such norm-matched patch follows these responses). revision: yes

  2. Referee: [Results] The multi-granular causal mediation analysis identifies early-layer contextual encoding and late-layer gating but does not report tests on whether the same late-layer neurons or MLP blocks gate non-harmful but high-variance outputs (e.g., creative text or refusal violations). This specificity test is load-bearing for interpreting the neurons as a harm-specific gate rather than a general output modulator.

    Authors: We agree that demonstrating the specificity of the late-layer gating neurons to harmful content is essential. Our existing experiments already contrast harmful prompts against safe prompts and show differential effects, yet we have not applied the same neuron-level mediation analysis to non-harmful high-variance tasks such as creative text generation or other refusal-violation scenarios. In the revised manuscript we will add these specificity experiments, reporting the mediation effects of the identified neurons on such prompts to confirm whether they function as a general output modulator or exhibit harm-specific gating behavior. revision: yes
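
On the normalization promised in response 1, one simple way to control for activation-magnitude confounds is to rescale the substituted activation to the L2 norm of the activation it replaces. The sketch below is a hedged illustration of that idea, not the authors' method; in practice it would sit inside the patching hook, with unit counts also matched across MLP and attention interventions.

```python
# Norm-matched patching: rescale the counterfactual activation to the original's
# L2 norm before substitution (one possible control, not necessarily the paper's).
import torch

def norm_matched_patch(original: torch.Tensor, replacement: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    scale = original.norm(dim=-1, keepdim=True) / (replacement.norm(dim=-1, keepdim=True) + eps)
    return replacement * scale

# Tiny self-check with dummy activations standing in for MLP outputs.
a = torch.randn(1, 2048)          # activation at the intervened position
b = torch.randn(1, 2048) * 5.0    # counterfactual activation with a much larger norm
patched = norm_matched_patch(a, b)
assert torch.allclose(patched.norm(dim=-1), a.norm(dim=-1), atol=1e-4)
```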

Circularity Check

0 steps flagged

No circularity: purely empirical causal mediation analysis

full rationale

The paper presents an empirical study applying standard causal mediation analysis across layers, MLP/attention modules, and neurons in LLMs to localize harmful generation. It reports results from interventions and measurements on model activations without any mathematical derivation chain, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations of uniqueness theorems. All claims rest on direct experimental outcomes rather than reducing to inputs by construction, satisfying the criteria for a self-contained empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; the work is an empirical analysis using causal mediation.

pith-pipeline@v0.9.0 · 5471 in / 1244 out tokens · 35394 ms · 2026-05-10T15:27:40.875501+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 27 canonical work pages · 3 internal anchors

1. AlDahoul, N., Tan, M.J.T., Kasireddy, H.R., Zaki, Y.: Advancing content moderation: Evaluating large language models for detecting sensitive content across text, images, and videos (2024), https://arxiv.org/abs/2411.17123
2. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N.: Refusal in language models is mediated by a single direction (2024), https://arxiv.org/abs/2406.11717
3. Ball, S., Kreuter, F., Panickssery, N.: Understanding jailbreak success: A study of latent space dynamics in large language models (2024), https://arxiv.org/abs/2406.09289
4. Chen, J., Wang, X., Yao, Z., Bai, Y., Hou, L., Li, J.: Finding safety neurons in large language models (2024), https://arxiv.org/abs/2406.14144
5. Fu, Z., Bao, G., Zhang, H., Hu, C., Zhang, Y.: Correlation or causation: Analyzing the causal structures of LLM and LRM reasoning process (2025), https://arxiv.org/abs/2509.17380
6. García-Carrasco, J., Maté, A., Trujillo, J.: Detecting and understanding vulnerabilities in language models via mechanistic interpretability. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 385–393. IJCAI-2024, International Joint Conferences on Artificial Intelligence Organization (Aug 2024). https://d…
7. He, Z., Wang, Z., Chu, Z., Xu, H., Zhang, W., Wang, Q., Zheng, R.: Jailbreaklens: Interpreting jailbreak mechanism in the lens of representation and circuit (2025), https://arxiv.org/abs/2411.11114
8. Kirch, N., Weisser, C., Field, S., Yannakoudakis, H., Casper, S.: What features in prompts jailbreak LLMs? Investigating the mechanisms behind attacks (2025), https://arxiv.org/abs/2411.03343
9. Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J.K., Mihalcea, R.: A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity (2024), https://arxiv.org/abs/2401.01967
10. Li, S., Yao, L., Zhang, L., Li, Y.: Safety layers in aligned large language models: The key to LLM security (2025), https://arxiv.org/abs/2408.17003
11. Mao, Y., Cui, T., Liu, P., You, D., Zhu, H.: From LLMs to MLLMs to agents: A survey of emerging paradigms in jailbreak attacks and defenses within the LLM ecosystem (2025), https://arxiv.org/abs/2506.15170
12. Meta AI: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (September 2024), https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
13. Ranjan, R., Gupta, S., Singh, S.N.: A comprehensive survey of bias in LLMs: Current landscape and future directions (2024), https://arxiv.org/abs/2409.16430
14. Roy, D.: Causal intervention framework for variational auto encoder mechanistic interpretability (2025), https://arxiv.org/abs/2505.03530
15. Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Towards causal representation learning (2021), https://arxiv.org/abs/2102.11107
16. Shi, D., Shen, T., Huang, Y., Li, Z., Leng, Y., Jin, R., Liu, C., Wu, X., Guo, Z., Yu, L., Shi, L., Jiang, B., Xiong, D.: Large language model safety: A holistic survey (2024), https://arxiv.org/abs/2412.17686
17. Stolfo, A., Belinkov, Y., Sachan, M.: A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023), https://aclanthology.org/2023.emnlp-main.435/
18. Wang, Y., Jordan, M.I.: Desiderata for representation learning: A causal perspective (2022), https://arxiv.org/abs/2109.03795
19. Winninger, T., Addad, B., Kapusta, K.: Using mechanistic interpretability to craft adversarial attacks against large language models (2025), https://arxiv.org/abs/2503.06269
20. Xu, W., Parhi, K.K.: A survey of attacks on large language models (2025), https://arxiv.org/abs/2505.12567
21. Xu, Z., Huang, R., Chen, C., Wang, X.: Uncovering safety risks of large language models through concept activation vector (2024), https://arxiv.org/abs/2404.12038
22. Yan, H., Vaidya, S.S., Zhang, X., Yao, Z.: Guiding AI to fix its own flaws: An empirical study on LLM-driven secure code generation (2025), https://arxiv.org/abs/2506.23034
23. Yang, A., Yang, B., et al.: Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)
24. Zhang, C., Zhu, C., Xiong, J., Xu, X., Li, L., Liu, Y., Lu, Z.: Guardians and offenders: A survey on harmful content generation and safety mitigation of LLM (2025), https://arxiv.org/abs/2508.05775
25. Zhao, M., Liu, C.: Understanding safety failures in modern language models. In: Proceedings of the Conference on Safe AI Systems (2024)
26. Zhao, W., Li, Z., Sun, J.: Causality analysis for evaluating the security of large language models (2023), https://arxiv.org/abs/2312.07876
27. Zhao, W., Guo, J., Hu, Y., Deng, Y., Zhang, A., Sui, X., Han, X., Zhao, Y., Qin, B., Chua, T.S., Liu, T.: AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender (2025), https://arxiv.org/abs/2504.09466
28. Zheng, Y., Yuan, Y., Zhuo, Y., Li, Y., Kreiman, G., Poggio, T., Santi, P.: Probing neural topology of large language models (2026), https://arxiv.org/abs/2506.01042
29. Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., Li, Y.: How alignment and jailbreak work: Explain LLM safety through intermediate hidden states (2024), https://arxiv.org/abs/2406.05644
30. Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., Wang, K., Liu, Y., Fang, J., Li, Y.: On the role of attention heads in large language model safety (2025), https://arxiv.org/abs/2410.13708